Git statistics

This post covers generating statistics from a Git repository. All of the examples were run against the Git project's own repository.

Total repository commits

Counting the total number of commits in a repository is just a case of using git log and wc:

$ git log --all --format=oneline | wc -l
46680

Alternatively, you can use git rev-list:

$ git rev-list --all | wc -l
46680

Total contributors

git log can also be used to list contributors:

$ git log --all --format='%aN' | sort -u
A Large Angry SCM
Aaron Crane
Aaron Schrab
...

At the time of writing there are 1437 contributors:

$ git log --all --format='%aN' | sort -u | wc -l
1437

Top committers

Working out the top committers is also relatively straightforward. You can use git log with a format string:

$ git log --all --format='%aN' | sort | uniq -c | sort -nr | head -n 5
  18794 Junio C Hamano
   2341 Jeff King
   1404 Shawn O. Pearce
   1112 Linus Torvalds
   1008 Nguyễn Thái Ngọc Duy

Alternatively, git shortlog can be used:

$ git shortlog --all -sn | head -n 5
 18794  Junio C Hamano
  2341  Jeff King
  1404  Shawn O. Pearce
  1112  Linus Torvalds
  1008  Nguyễn Thái Ngọc Duy

Note: Both of the examples above read the .mailmap file to cope with alternative names and/or email addresses. Using %an instead of %aN ignores .mailmap and produces slightly different results:

$ git log --all --format='%an' | sort | uniq -c | sort -nr | head -n 5
  18790 Junio C Hamano
   2341 Jeff King
   1334 Shawn O. Pearce
   1112 Linus Torvalds
    993 Nguyễn Thái Ngọc Duy
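
The sort | uniq -c | sort -nr | head pipeline recurs throughout this post, so it can be convenient to wrap it in a small shell function. The following is a minimal sketch; the name top_n is hypothetical and is not part of Git:

```shell
# top_n: hypothetical helper, not a Git command.
# Reads one name per line on stdin and prints the N most frequent
# names (default 5), each prefixed by its count.
top_n() {
    sort | uniq -c | sort -rn | head -n "${1:-5}"
}
```

With that defined, the top committers can be listed with: git log --all --format='%aN' | top_n 5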

Top committers on a file

A very similar command can be used to calculate commit totals for a single file:

$ git log --all --format='%aN' README.md | sort | uniq -c | sort -nr | head
      6 Matthieu Moy
      1 Benjamin Dopplinger

Top committers this year

The --since option can be used to limit commits to a time period:

$ git log --all --format='%aN' --since='2016-01-01' | sort \
| uniq -c | sort -nr | head -n5
   1492 Junio C Hamano
    339 Jeff King
    183 Johannes Schindelin
    166 Nguyễn Thái Ngọc Duy
    107 Vasco Almeida
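
The date above is hard-coded, so the command needs editing every year. One way around that, sketched here with a hypothetical helper called year_start, is to build the date from the system clock:

```shell
# year_start: hypothetical helper that prints January 1st of the
# current year in the YYYY-MM-DD form passed to --since above.
year_start() {
    printf '%s-01-01\n' "$(date +%Y)"
}
```

It can then be substituted into the command: git log --all --format='%aN' --since="$(year_start)"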

Commits over time

It's often interesting to know how active a codebase is. The following command shows total commits by year:

$ git log --all --format='%cd' --date='format:%Y' | sort | uniq -c \
| awk 'BEGIN{print "year","commits"}{print $2, " ", $1}'
year commits
2005   3215
2006   4601
2007   5496
2008   4120
2009   3835
2010   3883
2011   3521
2012   3782
2013   4319
2014   3103
2015   3289
2016   3516

A similar command shows the hour of the day at which most commits are made, with each hour reported in the time zone recorded in the commit:

$ git log --all --format='%cd' --date='format:%H' | sort | uniq -c \
| awk 'BEGIN{print "hour","commits"}{print $2, " ", $1}'
hour commits
00   1954
01   1292
02   780
03   415
04   177
05   67
06   108
07   340
08   878
09   1942
10   3284
11   4247
12   3983
13   3865
14   4319
15   3783
16   2573
17   1807
18   1389
19   1186
20   1146
21   2096
22   2577
23   2472
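
The year and hour commands differ only in the date format and the column label, so the counting step can be factored out. This is a rough sketch, assuming the values contain no whitespace; the function name tabulate is made up for this post:

```shell
# tabulate: hypothetical helper. Reads one value per line on stdin and
# prints a header followed by "value count" rows, sorted by value.
# $1 is the label for the first column (defaults to "value").
# Assumes values contain no whitespace, as with %Y or %H dates.
tabulate() {
    sort | uniq -c |
        awk -v label="${1:-value}" 'BEGIN { print label, "commits" } { print $2, $1 }'
}
```

For example, commits per month could be listed with: git log --all --format='%cd' --date='format:%Y-%m' | tabulate month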

File level statistics

Looking at commits is fairly straightforward. However, it's often more interesting to look at file-based statistics, and the git blame command is a great tool for this.

Lines per author

The following command will find the top five authors, based on the number of lines attributed to them in the HEAD revision of the repository:

$ git ls-tree -r --name-only HEAD \
| xargs -d "\n" -n 1 git blame --line-porcelain \
| sed -n 's/^author //p' | sort | uniq -c | sort -rn \
| head -n 5
 118793 Junio C Hamano
  36583 Jeff King
  27594 Jiang Xin
  23174 Shawn O. Pearce
  22671 Nguyễn Thái Ngọc Duy

Note: At the time of writing there are 840583 lines split across 3000 files in the Git repository. As a result, the command above took just over 15 minutes to run.

Lines per author for a file

Looking at attribution for a single file is slightly easier:

$ git blame --line-porcelain README.md \
| sed -n 's/^author //p' | sort | uniq -c | sort -rn
     32 Matthieu Moy
      9 Nicolas Pitre
      9 Junio C Hamano
      5 Benjamin Dopplinger
      4 Christian Couder
      2 Stefano Lattarini
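
The sed expression works because --line-porcelain repeats the commit headers for every line, and the trailing space in "^author " matches only the name header, not author-mail, author-time or author-tz. File content lines are prefixed with a tab, so they never match either. A small sketch of the counting step on its own, with a hypothetical function name:

```shell
# blame_authors: hypothetical helper. Reads `git blame --line-porcelain`
# output on stdin and prints the number of lines attributed to each author.
# The pattern "^author " (with the trailing space) keeps only the name
# header and skips author-mail, author-time and author-tz.
blame_authors() {
    sed -n 's/^author //p' | sort | uniq -c | sort -rn
}
```

It could be used as: git blame --line-porcelain README.md | blame_authors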

Changes by author

It's also possible to rank authors by the number of lines they've changed, counting both added and deleted lines. The following command does this:

$ git log --all --format='%aN' | sort -u | xargs -d "\n" -I {} \
bash -c 'echo "$(git log --format='' --author="{}" --numstat | awk "{total += (\$1 + \$2)}END{print total}") {}"' \
| sort -rn | head -n 5
366281 Junio C Hamano
154231 Linus Torvalds
145731 Jiang Xin
78303 Peter Krefting
78085 Shawn O. Pearce

Note: This is another slow command; it took about twelve minutes to run.
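
Much of that time goes on running git log once per author. A possible alternative, sketched below, is to run git log a single time and let awk keep a running total per author. The author: marker and the function name sum_changes are assumptions made for this sketch, and because it relies on %aN rather than --author pattern matching, its totals need not match the figures above exactly:

```shell
# sum_changes: hypothetical helper. Reads the output of
#   git log --format='author:%aN' --numstat
# on stdin and prints "total author" pairs, largest total first.
# The "author:" prefix is a marker chosen for this sketch.
sum_changes() {
    awk -F'\t' '
        # Header line: remember whose commit the following stats belong to.
        /^author:/ { author = substr($0, 8); next }
        # numstat line: added<TAB>deleted<TAB>path. Binary files show "-"
        # in both count columns and are skipped.
        NF == 3 && $1 ~ /^[0-9]+$/ && $2 ~ /^[0-9]+$/ { total[author] += $1 + $2 }
        END { for (a in total) print total[a], a }
    ' | sort -rn
}
```

It could be used as: git log --all --format='author:%aN' --numstat | sum_changes | head -n 5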

A word of warning

It's nice how easy it is to pull statistics from Git. However, it's important to remember that lines of code and commit counts are often very poor measures of quality.

As always with statistics, this quote is relevant:

There are three kinds of lies: lies, damned lies, and statistics.