typedspace.com: tidbits on software

tools for data mining git repositories


Stream of consciousness on software I've found useful for mining git repositories.


The tool generates a treemap using d3.js, depicting files that have not changed in a long time as red segments. I enjoyed how simple it was to run. For a larger project, I needed to use Firefox to render the chart.
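
To get a feel for the data behind such a chart, here is a minimal sketch using plain git. It is not how the tool itself works; `stale_files` is a hypothetical helper I made up for illustration. It prints the number of days since each tracked file was last touched by a commit, stalest first.

```shell
#!/usr/bin/env sh
# Hypothetical helper, not part of any tool mentioned in this post.
# Prints "<days since last commit> <path>" for every tracked file, stalest first.
stale_files() {
	repo="$1"
	now="$(date +%s)"
	git -C "$repo" ls-files | while IFS= read -r file; do
		# %ct is the committer date of the newest commit touching the file,
		# expressed as a Unix timestamp
		last="$(git -C "$repo" log -1 --format=%ct -- "$file")"
		[ -n "$last" ] || continue
		echo "$(( (now - last) / 86400 )) $file"
	done | sort -rn
}
```

Something like `stale_files ~/foss/zola | head -n 20` would then list the twenty stalest files. It runs one `git log` per file, so expect it to be slow on large repositories.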

Git of Theseus

Erik Bernhardsson wrote a nice utility called Git of Theseus to explore the half-life of code. It uses git history to summarize change over time in a compelling visualization. I have little Python experience, so I found it trickier to get running. I ran it using docker run and by cobbling together shell commands.

The Git of Theseus source is here.

Hercules

The hercules utility replicates git-of-theseus-analyze and a host of other analyses. As both projects attest, hercules is much faster on older git repositories, those with 5, 10, or more years of history.

I still needed to run the analysis in a docker container despite hercules being a single binary. I ended up with:

#!/usr/bin/env sh
set -x
project_over_the_years() {
	cd /tmp || exit 1
	# name the output files after the repository directory
	name="$(basename "$1")"
	output="$name.pb"
	# extract data from a git repository into a binary format
	# pb is short for protocol buffer
	hercules \
		--pb \
		--first-parent \
		--burndown \
		--blacklisted-prefixes "package-lock.json,yarn.lock" \
		"$1" > "$output"
	# generate the visualization
	docker run --rm -i -v /tmp:/io srcd/hercules labours -f pb -m burndown-project -i "/io/$output" -o "/io/${name}_over_the_years.png"
}

project_over_the_years ~/foss/zola

You can download single hercules binaries here. I felt overwhelmed by the numerous command options.