Easy way to back up a git repository (in a pipeline)

From previous posts, you probably know how paranoid I am about having multiple copies of my data in an open format, and how concerned I am about vendor lock-in.

Many of us use hosted Git platforms such as GitHub and GitLab. Although they seem very open source friendly, we know how companies are going down the enshittification route these days. It doesn’t mean that GitHub/GitLab are going down that route now, but you never know. What do they say? Better safe than sorry.

Recently, a customer asked me to prepare a disaster recovery copy of their data. It was a reminder of the importance of being proactive about disaster recovery, a practice that can save us from data loss and its consequences. And it rang a bell: copying their (and my) GitLab repositories would be a good idea.

In the 4th episode of my Hermit Project, I learnt that git can create an archive file (aka “bundle”) that holds the entire repo. That was conceived to transfer git objects without needing a server. I experimented with it on my Hermit and discovered that a file bundle can also be specified as a git remote url so you can clone or pull/push from that bundle. Isn’t it handy for a backup?
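A minimal round trip shows the idea (repository and path names here are just illustrative):

```shell
# A throwaway repository to play with
git init -q /tmp/myrepo
git -C /tmp/myrepo -c user.email=me@example.com -c user.name=me \
    commit -q --allow-empty -m "first commit"

# Pack every ref into a single file...
git -C /tmp/myrepo bundle create /tmp/myrepo.bundle --all

# ...and use that file as a remote: clone straight from it
git clone -q /tmp/myrepo.bundle /tmp/myrepo-clone

# Pulling from the bundle works too
git -C /tmp/myrepo-clone pull -q /tmp/myrepo.bundle
```

No server involved at any point: the `.bundle` file is the transport.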

But how can I automate that? And mostly, when? A good moment to take a backup is when a given project has been deployed in production. That usually happens through a CI/CD pipeline when a (version) tag is applied to the project. So, I created a script to run under a pipeline when the project was “promoted” to production and a tag was created.

Both my customers and my public projects are on GitLab. The lines below belong to a .gitlab-ci.yml file: the script is run after the build/tag/deploy pipeline:

stages:
  - git-backup

git-backup:
  stage: git-backup
  image: registry.gitlab.com/group/devops/buildimage:latest
  only:
    - tags
  variables:
    GIT_STRATEGY: "clone"
    GIT_DEPTH: "0"
  script:
    - sh git_backup.sh

And the actual script:

#!/usr/bin/env bash
## git_backup.sh -> Backing up the git repository in the git bundle format

if [ -z "${CI_PROJECT_NAME+x}" ]; then
   echo "This script is not running in a pipeline!"
   exit 1
fi

# We run backups only for prod/tagged releases
echo "Backing up git"

if [ -n "${CI_COMMIT_TAG+x}" ]; then

   # Download all the branches
   for branch in $(git branch -a | grep remotes | grep -v HEAD); do
      echo "Tracking branch $branch ..."
      git checkout -b "${branch#remotes/origin/}" --track "$branch"
   done

   # Get the default branch and check it out as the last step,
   # so the bundle defaults to the master/main branch.
   # If both exist, take the first match; xargs trims the string
   default_branch=$(git branch -a | grep remotes | grep -v HEAD | grep -E "master|main" | head -n 1 | xargs)
   git checkout "${default_branch#remotes/origin/}"

   # Create the bundle
   git bundle create "/tmp/$CI_PROJECT_NAME.bundle" --all

   # Copy the bundle to the bucket
   aws s3 cp "/tmp/$CI_PROJECT_NAME.bundle" "s3://$GIT_BACKUP_BUCKET/$CI_PROJECT_NAME.bundle"
fi

All the branches are checked out as local branches before the bundle is created. Alas, even git bundle abides by the fact that git clone does not automatically create local branches for every remote branch, so the loop above has to do it explicitly.

The bundle is copied to an AWS S3 bucket, and that bucket is subsequently copied to a machine through rclone. There I use ZFS snapshots to keep an encrypted archive of the customer’s data.

It’s worth noting that the process is flexible, allowing for the bundle to be copied via scp or any other means that best suit your needs.
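Whatever the transport, it pays to rehearse the restore. A cheap drill, with a local /tmp path standing in for the bucket (all names are illustrative):

```shell
# Pretend /tmp/bucket is the backup destination
mkdir -p /tmp/bucket /tmp/restore
cd /tmp/restore

# Simulate a backed-up project
git init -q myproject-src
git -C myproject-src -c user.email=ci@example.com -c user.name=ci \
    commit -q --allow-empty -m "initial commit"
git -C myproject-src bundle create /tmp/bucket/myproject.bundle --all

# The restore itself: verify the archive, then clone from it
git bundle verify /tmp/bucket/myproject.bundle
git clone -q /tmp/bucket/myproject.bundle myproject
```

`git bundle verify` is the useful bit: it confirms the archive is complete and applicable before you need it in anger.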

Edit of 2024-06-20

The first version of the script was taken from my customer’s pipeline, but I made the mistake of not testing it as a standalone script and in a separate pipeline. It wasn’t exactly working as intended, so I made the modifications and tested it accordingly.

This script contains fixes/contributions by parvXtl and winterschon