Am I fooling around? - Or the story of Git, part one

Welcome to the world of trees, branches, cherry-picking, orphans, detached heads, and many more interesting things to explore.

Well, this is not going to be a story from your childhood, although it has everything a childhood story would have, or some of the things at least. And if you were the sort of child who learned version control systems while growing up (not judging), well, I hope you like how I wrote down my understanding of it.

This is the story, or rather the series of stories, about some internals of Git, e.g. how Git's data structures work, and some of the things I found interesting to describe and explore. The motivation came from a Git internals workshop I recently took and from the curiosity to know Git better, and to explain it better, to myself and to everyone who might want to read.

The topics we're going to cover in the first part are blobs, trees and commits - what they are, how they interact with each other, what they contain, etc. I'm going to assume that you know the basics of Git - how to initialize or clone a repo, create a branch, add files to staging, commit those files, push to a remote repository, etc. For the basic stuff, check out the basics of Git in the docs.

Let us start with blobs.

What do you think of when you hear "blob"? To me, it always sounded like a drop of some thick fluid, and, well, the Merriam-Webster dictionary defines it in a similar sense - a small drop or lump of something viscid or thick.[1]

In the case of Git, it is a bit different. A blob is a Git object type which stores the content of the files you are staging (adding) to the Git repository. The important thing is that this content is hashed with the SHA-1 algorithm, and it is saved in the .git/objects directory inside a directory with a two-character name, something similar to below:

.git/objects/
├── 73
│   └── 709ba6866a30a566a38ca40aa81d5f0928bce0

What are all those digits? Well, together they are the hash of the content. The directory name is the first two characters of the hash, and the name of the file is the rest of the hash, but more on that later. If you look up this hash value, Git will show you something like this:

$ git cat-file -p 73709ba6866a30a566a38ca40aa81d5f0928bce0
Testing
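The mapping from a hash to its place under .git/objects can be sketched in a few lines of Python. This is only an illustration; the function name loose_object_path is made up for this example and is not part of Git.

```python
from pathlib import PurePosixPath


def loose_object_path(object_id: str, git_dir: str = ".git") -> PurePosixPath:
    """Return where an object with this 40-character hash would be stored."""
    if len(object_id) != 40:
        raise ValueError("expected a full 40-character SHA-1 hex digest")
    # The first two hex characters name the directory, the other 38 the file.
    return PurePosixPath(git_dir) / "objects" / object_id[:2] / object_id[2:]


print(loose_object_path("73709ba6866a30a566a38ca40aa81d5f0928bce0"))
# .git/objects/73/709ba6866a30a566a38ca40aa81d5f0928bce0
```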

So, that is a blob. It is an object which represents the content of a file stored in Git. What would happen if we edited this file, added another line, and staged (added) it again? Well, Git would create another blob with the complete new content, not the difference between the old and the new file. It's all about the content! This way of storing files is called a content-addressable filesystem, or CAF. The content itself dictates the name (the hash) under which it is stored. This also means that if you create another file with the same content and stage it, no new blob will be created under the .git/objects directory. Why? Well, with blobs it is all about the content!
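To make the "same content, same blob" idea concrete, here is a toy content-addressable store in Python. It is a sketch under assumptions: ToyObjectStore is a hypothetical in-memory stand-in for .git/objects, and the byte strings are example file contents I picked, since the exact bytes of the files above are not shown.

```python
import hashlib


def blob_id(content: bytes) -> str:
    # Git hashes a small header ("blob <size>" plus a null byte) together
    # with the content, not the bare content alone.
    header = b"blob " + str(len(content)).encode() + b"\0"
    return hashlib.sha1(header + content).hexdigest()


class ToyObjectStore:
    """Hypothetical in-memory stand-in for the .git/objects directory."""

    def __init__(self):
        self.objects = {}  # maps hash -> content

    def add(self, content: bytes) -> str:
        key = blob_id(content)
        self.objects[key] = content  # same content -> same key, nothing new
        return key


store = ToyObjectStore()
first = store.add(b"Testing\n")         # one file
second = store.add(b"Testing\n")        # another file with identical content
edited = store.add(b"Testing\nMore\n")  # the first file after an edit

print(first == second)     # True
print(edited == first)     # False
print(len(store.objects))  # 2
```

Two files were "staged" with the same content, yet the store holds only one object for them; the edited version gets a completely new one.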

What happens next? How can we know about the file names, permissions, location, etc.? This is all recorded when we commit. The important step that happens as part of every commit, however, is the one that the write-tree command performs on its own. What does this command do? It writes the current state of the staging area (the index), not of the working directory, into another Git object called a tree. This object is somewhat similar to UNIX directory entries - it contains one or more entries, each of which is the hash of a blob or of a subtree (subdirectory) with its associated mode (permissions), type (blob or tree), and filename. It will look somewhat similar to this:

$ git write-tree
8894cd99d735c5f89d8c1affbb744f074f47bf79

$ git cat-file -p 8894cd99d735c5f89d8c1affbb744f074f47bf79
100644 blob 73709ba6866a30a566a38ca40aa81d5f0928bce0    readme.md
040000 tree 3c92a605431c9538952ae053957ffd4a0ce6590f    temp

So we first have the mode (permissions), next there is the type, followed by the hash and the name of the file. The important thing to keep in mind is that a tree is an object that contains data about only one directory level. If we want to see what is under the temp directory, we can do so by cat-file-ing its hash:

$ git cat-file -p 3c92a605431c9538952ae053957ffd4a0ce6590f
100644 blob 73709ba6866a30a566a38ca40aa81d5f0928bce0    tst
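The one-level idea becomes clearer when you build a tree object by hand. The Python sketch below serializes the entries from the two listings above in the raw form Git stores (mode, space, name, null byte, then the hash as 20 raw bytes) and hashes the result. The helper names object_id and tree_body are made up for this example.

```python
import hashlib


def object_id(kind: bytes, body: bytes) -> str:
    # Every object is hashed as "<type> <size>", a null byte, then the body.
    header = kind + b" " + str(len(body)).encode() + b"\0"
    return hashlib.sha1(header + body).hexdigest()


def tree_body(entries) -> bytes:
    """entries: (mode, name, hex_hash) tuples for ONE directory level."""

    def sort_key(entry):
        mode, name, _ = entry
        # Git orders subdirectories as if their name ended with a slash.
        return (name + "/" if mode.lstrip("0") == "40000" else name).encode()

    body = b""
    for mode, name, hex_hash in sorted(entries, key=sort_key):
        # Stored modes carry no leading zero: cat-file prints 040000,
        # the object itself holds 40000.
        body += mode.lstrip("0").encode() + b" " + name.encode() + b"\0"
        body += bytes.fromhex(hex_hash)
    return body


blob = "73709ba6866a30a566a38ca40aa81d5f0928bce0"
temp = tree_body([("100644", "tst", blob)])
temp_id = object_id(b"tree", temp)
# The root tree only references temp by hash; it knows nothing about tst.
root = tree_body([("100644", "readme.md", blob), ("040000", "temp", temp_id)])

print(temp_id)
print(object_id(b"tree", root))
```

You can compare the two printed values with the tree hashes in the listings above. Notice that the hash of temp has to be computed first, because the root tree stores it as one of its entries.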

And those are the trees. A bit more complex than a regular tree with branches and fruit, you might think. Well, to tell you the truth, it took me a while to understand them, especially this one-level thing.

Okay, we can now move on safely to the commit part. What is a commit? It is a way of recording the state of the project with all of the files and changes that you decided to store in Git. What will it do? It will create a snapshot of everything you have staged. This snapshot will be yet another hashed object, stored in the .git/objects directory, and it will have the information about what was captured (the top-level tree), why the snapshot was created (the commit message), who created it (author and committer) and when it was created. It will look somewhat similar to this:

# Creating the commit
$ git commit -m "My first commit"
[main (root-commit) 2d83752] My first commit
 2 files changed, 2 insertions(+)
 create mode 100644 readme.md
 create mode 100644 temp/tst

# Showing the content of the commit hash
$ git cat-file -p 2d83752
tree 8894cd99d735c5f89d8c1affbb744f074f47bf79
author Test <test@example.com> 1644511932 +0000
committer Test <test@example.com> 1644511932 +0000

My first commit
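Since a commit is just a small piece of text, you can rebuild one by hand. The Python sketch below assembles the same lines that cat-file printed and hashes them the way Git hashes any object. The helper name commit_body is made up for this example, and I am assuming the stored message ends with a single newline.

```python
import hashlib


def commit_body(tree: str, author: str, committer: str, message: str) -> bytes:
    # Header lines, one empty line, then the message. A root commit like
    # this one has no other header lines.
    lines = ["tree " + tree, "author " + author, "committer " + committer, "", message]
    return ("\n".join(lines) + "\n").encode()


who = "Test <test@example.com> 1644511932 +0000"
body = commit_body("8894cd99d735c5f89d8c1affbb744f074f47bf79", who, who, "My first commit")

# Same scheme as for blobs and trees: "<type> <size>", null byte, body.
header = b"commit " + str(len(body)).encode() + b"\0"

print(body.decode(), end="")
print(hashlib.sha1(header + body).hexdigest())
```

The timestamp is part of the hashed text, which is why committing the very same tree a second later gives a different commit hash.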

So this is basically what a commit is. Behind every commit there are at least two more hashed objects besides the blobs - the commit object itself and the top-level tree it points to (plus one tree per subdirectory, like temp here) - and all of them are stored in the .git/objects directory.

All of the objects mentioned above - blobs, trees and commits - are immutable. Once they are created, they cannot be changed. But how does Git store those objects?

When Git wants to save an object, it creates a header. That header starts by identifying the type of the object it wants to save (blob, tree or commit). To that first part of the header, Git adds a space, followed by the size of the content in bytes, and a final null byte. Then it concatenates the header and the original content and calculates the SHA-1 checksum of that new content. Git then compresses that new content with zlib and writes the compressed content to disk, in a subdirectory with the first two characters of the SHA-1 value as its name and the remaining 38 characters as the filename within that directory.[2] And that's basically it. Pretty neat, isn't it?
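The steps above can be sketched in Python as follows. This is a minimal illustration and not Git's actual code: write_loose_object and read_loose_object are names I made up, and the objects are written to a throwaway temporary directory rather than to a real .git/objects.

```python
import hashlib
import tempfile
import zlib
from pathlib import Path


def write_loose_object(objects_dir: Path, kind: str, content: bytes) -> str:
    header = f"{kind} {len(content)}".encode() + b"\0"  # type, space, size, null
    store = header + content
    object_id = hashlib.sha1(store).hexdigest()  # hash of header + content
    path = objects_dir / object_id[:2] / object_id[2:]
    path.parent.mkdir(parents=True, exist_ok=True)
    if not path.exists():  # objects are immutable, so never rewrite one
        path.write_bytes(zlib.compress(store))
    return object_id


def read_loose_object(objects_dir: Path, object_id: str):
    path = objects_dir / object_id[:2] / object_id[2:]
    raw = zlib.decompress(path.read_bytes())
    header, _, content = raw.partition(b"\0")  # split at the first null byte
    kind, _size = header.decode().split(" ")
    return kind, content


with tempfile.TemporaryDirectory() as tmp:
    objects = Path(tmp) / "objects"  # stand-in for .git/objects
    oid = write_loose_object(objects, "blob", b"test content\n")
    print(oid)                              # d670460b4b4aece5915caf5c68d12f560a9fe3e4
    print(read_loose_object(objects, oid))  # ('blob', b'test content\n')
```

Reading an object back is the same procedure in reverse: decompress, split at the first null byte, and what remains after the header is the original content.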

When you come to think of it, a lot of things happen under that directory that we are not aware of. Don't be scared by all of the hashes and everything (don't fear the SHA-1[3]). Once you get the gist of it, it really is easy to understand. And when you have git cat-file -p <hashed-value> by your side, anything is possible!

This is it for now. If you want to add your comment to this, or provide feedback, feel free to contact me, else I hope you enjoyed it and see you next Friday with the next, hopefully interesting, article.

Footnotes


  1. https://www.merriam-webster.com/dictionary/blob
  2. https://git-scm.com/book/en/v2/Git-Internals-Git-Objects
  3. https://www.youtube.com/watch?v=P6jD966jzlk