Git rid of it: The case for removing sensitive data from Git

Have you ever wondered how to remove sensitive data from Git history? Look no further: this is the post for you!

It's a normal afternoon. You start to slowly wrap up your things at work. You decide to go through the changes made to the code base you're working on by checking out its history. Suddenly, lo and behold, you find a file in the repository that shouldn't be there! You find a password present!

Now, before going any further, let me answer the following - who the hell would decide to spend their afternoon at work going through Git history? Or any afternoon? Who the hell looks at Git history? Well, you might not, but other people do. If not now, then for sure later, when debugging a problem, exploring, researching... There are numerous occasions when one might go through Git history. Trust me.

What shouldn't you do?

Let's continue with the situation. You found the committed password. What do you do? First, don't panic.

Next, you should just go and revert the commit, correct? Well, you might, but that will not remove the password completely. A revert only adds a new commit that turns all your + into - and vice versa: for every file or line of code committed, git revert does the opposite. The original commit, password included, stays in the history.

But that is what we wanted, right? Yes, we wanted to remove the file from the repository, sure. But if we return to the beginning of the post - what about the Git history?

To completely remove the file, you will need to delete the file from history.
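
To see why a revert is not enough, here is a minimal sketch in a throwaway repository. The file name secrets.txt, the fake password, and the commit messages are made up for the illustration; the point is only that git show can still read the file from the reverted commit.

```shell
#!/bin/sh
# Throwaway demo: a reverted commit is still reachable in the history.
repo=$(mktemp -d)
cd "$repo" || exit 1
git init -q
git config user.email "demo@example.com"
git config user.name "Demo"

echo "hello" > README
git add README
git commit -q -m "Initial commit"

# Commit a (fake) secret by mistake
echo "password=hunter2" > secrets.txt
git add secrets.txt
git commit -q -m "Add config"
leak=$(git rev-parse HEAD)

# Revert it: the file disappears from the working tree...
git revert --no-edit "$leak" >/dev/null
if [ ! -e secrets.txt ]; then
    echo "secrets.txt is gone from the working tree"
fi

# ...but the old commit still holds the content
recovered=$(git show "$leak:secrets.txt")
echo "recovered from history: $recovered"
```

Anyone with access to the repository can do the same with the commit hash, which is why the file has to be removed from the history itself.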

What should you do?

Deleting a file from Git history might sound complex and scary - to delete it, you will need to rewrite history. How the hell does one do that without breaking anything?

There are three ways to do that. Spoiler alert - two of them are safer approaches.

  1. By using git-filter-repo.
  2. By using BFG Repo-Cleaner.
  3. By using native git filter-branch.

Now, before diving more deeply into how to use each approach - let's first discuss the safety part. I've written above that two of the approaches to do this are safe. But which two? Definitely, the third option is good because it's native, right? Well, no, not actually.

If you follow the link in the third option, you will see a warning in the official Git documentation that git filter-branch is not a recommended way to rewrite history. The warning points to alternative history-filtering tools such as git-filter-repo, and BFG Repo-Cleaner is another popular option. These two are the approaches we are going to dissect further, for brevity and for safety.

Using git-filter-repo

  1. To start using it, we need to install it first. I tried following the [installation guide](https://github.com/newren/git-filter-repo/blob/main/INSTALL.md). I first opted for installation through the package manager. But on my RHEL9-based system I wasn't able to find it in the package repository, so I turned to the simple installation. The steps I used to install it are described below.

    # Download the raw file to the /usr/local/bin/ directory (this and the chmod below may need root privileges)
    $ curl -o /usr/local/bin/git-filter-repo https://raw.githubusercontent.com/newren/git-filter-repo/main/git-filter-repo
    
    # Add executable rights to the file
    $ chmod +x /usr/local/bin/git-filter-repo
    
    # Test out the installation
    $ git-filter-repo --version
    ae71fad1d03f
    
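  2. Go into your repository working directory and run the following command. Use a fresh clone: git-filter-repo refuses to rewrite a repository that does not look freshly cloned unless you pass --force.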

    # Change to your repository path
    $ cd $YOUR_REPOSITORY
    $ git filter-repo --invert-paths --path COMPLETE_PATH_TO_YOUR_FILE
    ...
    Completely finished after 0.83 seconds.
    

    The above command does the following:

    • forces Git to process, but not check out, the entire history of every branch and tag;
    • removes the specified file, as well as any empty commits generated as a result;
    • removes some configurations, such as the remote URL, stored in the .git/config file;
      • you may want to back up this file in advance for restoration later!
    • overwrites your existing tags.
  3. Add the file to .gitignore so it does not get committed again by accident.

    $ echo "FILE-WITH-SENSITIVE-DATA" >> .gitignore
    $ git add .gitignore
    $ git commit -m "Add FILE-WITH-SENSITIVE-DATA to .gitignore"
    
  4. Check if sensitive data is present in some other file. If yes, repeat steps 2 and 3. Make sure the history is updated correctly (without the sensitive data in it).

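  5. When we finish with all of this, we need to force-push our changes to the remote, so the sensitive data is removed from the remote Git history as well. Since git-filter-repo removed the remote configuration in step 2, add the remote back first (for example with git remote add origin followed by your repository URL).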

    $ git push origin --force --all
    ...
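
Step 4 asks you to make sure the history is clean before force-pushing. One way to check is sketched below; path_in_history and string_in_history are hypothetical helper names, and the throwaway repository with secrets.txt only exists to show what they report. Both rely on git log --all, so they look at every branch and tag, not only the current one.

```shell
#!/bin/sh
# Hypothetical helper: succeeds if any commit on any ref touches the given path.
path_in_history() {
    [ -n "$(git log --all --format=%H -- "$1")" ]
}

# Hypothetical helper: succeeds if any commit on any ref adds or removes the given string.
string_in_history() {
    [ -n "$(git log --all --format=%H -S"$1")" ]
}

# Throwaway repository for the demonstration
repo=$(mktemp -d)
cd "$repo" || exit 1
git init -q
git config user.email "demo@example.com"
git config user.name "Demo"

echo "password=hunter2" > secrets.txt
git add secrets.txt
git commit -q -m "Add config"

# Deleting the file in a new commit does not clean the history
git rm -q secrets.txt
git commit -q -m "Remove config"

if path_in_history secrets.txt; then
    echo "secrets.txt is still in the history"
fi
if string_in_history "hunter2"; then
    echo "the password is still in the history"
fi
if ! path_in_history never-committed.txt; then
    echo "never-committed.txt has no history"
fi
```

After a successful git filter-repo run, both checks should find nothing for the removed file. Also note that git push --force --all pushes branches only; rewritten tags need a separate git push origin --force --tags.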
    

Using BFG Repo-Cleaner

  1. To use BFG Repo-Cleaner, we first need to download its JAR file from this link.

  2. Next, for simplicity, we can add an alias to our ~/.bashrc: alias bfg='java -jar /location/of/bfg.jar'

  3. Next, if we want to remove the file with sensitive data from the history and leave the latest commit untouched, we run the following command.

    $ bfg --delete-files FILE-WITH-SENSITIVE-DATA
    
  4. If you want to replace all text listed in passwords.txt wherever it can be found in your repository's history, run the command below.

    $ bfg --replace-text passwords.txt
    
  5. After we have removed all the sensitive data, we once again need to force-push to rewrite the remote history.

    $ git push --force
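
The file passed to --replace-text in step 4 is a plain list of expressions, one per line. As far as the BFG documentation describes it, a bare line is replaced with ***REMOVED*** and a ==> separator lets you pick the replacement; treat the exact syntax below as an assumption and check it against the BFG README. The sample secrets are made up. The BFG documentation also suggests expiring the reflog and running garbage collection after the rewrite and before the push, so the old objects are actually dropped locally. That part uses plain Git commands and is shown here on a throwaway repository.

```shell
#!/bin/sh
# Sample expressions file (made-up secrets; syntax assumed from the BFG README)
rules=$(mktemp)
cat > "$rules" <<'EOF'
hunter2
old-api-key==>REDACTED
EOF

# With BFG installed, the rewrite itself would be:
#   bfg --replace-text "$rules"

# Throwaway repository standing in for the rewritten one
repo=$(mktemp -d)
cd "$repo" || exit 1
git init -q
git config user.email "demo@example.com"
git config user.name "Demo"
echo "hello" > README
git add README
git commit -q -m "Initial commit"

# Drop unreachable objects left behind by a history rewrite
git reflog expire --expire=now --all
git gc --quiet --prune=now --aggressive

echo "cleanup finished in $repo"
```

In a real run, the two cleanup commands go between the bfg call and the force push.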
    

As for the third option: long story short, git filter-branch is quite complex to use. You need to know what you are doing to use it properly without messing something up.

Another thing to take into account is its performance. git filter-branch is a lot slower than the tools described above.

Takeaways

Removing passwords, API keys, or any other sensitive data from Git history is possible. Some time ago, this meant using git filter-branch, which was quite complex: it was sluggish and you really needed to know your stuff.

Now the tooling at hand makes it easier. Here are some things to keep in mind when doing this.

  • Don't beat yourself up if you committed sensitive files in the first place; it happens to all of us. To err is human.
  • Work out which approach suits you best (git-filter-repo or BFG Repo-Cleaner).
  • Go through the tool's documentation and carefully follow the instructions there.
  • Inform your team about the changes you are about to make.
  • Go ahead and remove the sensitive files from the repository.

More Information