Implement Git strategy #23

@alecthomas

Git Caching Proxy — Design Doc

Overview

Two serving strategies from a single mirror:

  1. Protocol proxy — intercept git requests, serve from a local mirror via git http-backend. Fast-path with ls-remote diff check; only fetch upstream when refs diverge.
  2. Snapshot distribution — periodically produce self-contained tar.zst archives (full checkout + .git history) for fast bootstrapping. Clients untar and are ready immediately, skipping git's expensive checkout computation.

Mirror (shared upstream)

git clone --mirror <upstream> /srv/git/repo.git

Config

# Protocol
git config protocol.version 2
git config uploadpack.allowFilter true
git config uploadpack.allowReachableSHA1InWant true

# Bitmaps — biggest win for upload-pack
git config repack.writeBitmaps true
git config pack.useBitmaps true
git config pack.useBitmapBoundaryTraversal true

# Commit graph (no --changed-paths; Bloom filters don't help upload-pack)
git config core.commitGraph true
git config gc.writeCommitGraph true
git config fetch.writeCommitGraph true

# Multi-pack-index (avoids full repack on every fetch)
git config core.multiPackIndex true

# Never unpack loose — keep fetched objects as packs
git config transfer.unpackLimit 1
git config fetch.unpackLimit 1

# Disable auto GC — maintenance is explicit
git config gc.auto 0

# Pack performance
git config pack.threads 0
git config pack.deltaCacheSize 512m
git config pack.windowMemory 1g
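
One way to catch configuration drift on the mirror is to compare the live values against the list above. This is a minimal sketch rather than part of the design itself; the names `EXPECTED_CONFIG` and `check_mirror_config` are made up for illustration.

```shell
#!/bin/sh
# Expected mirror settings, one "key value" pair per line (copied from the Config list).
EXPECTED_CONFIG='protocol.version 2
uploadpack.allowFilter true
uploadpack.allowReachableSHA1InWant true
repack.writeBitmaps true
pack.useBitmaps true
pack.useBitmapBoundaryTraversal true
core.commitGraph true
gc.writeCommitGraph true
fetch.writeCommitGraph true
core.multiPackIndex true
transfer.unpackLimit 1
fetch.unpackLimit 1
gc.auto 0
pack.threads 0
pack.deltaCacheSize 512m
pack.windowMemory 1g'

# check_mirror_config <repo>: print every key whose value differs; return 1 on any drift.
# (Hypothetical helper, not an existing git command.)
check_mirror_config() {
  cmc_repo=$1
  cmc_status=0
  while read -r cmc_key cmc_want; do
    # `git config --get` exits non-zero for unset keys; treat that as an empty value.
    cmc_got=$(git -C "$cmc_repo" config --get "$cmc_key" || true)
    if [ "$cmc_got" != "$cmc_want" ]; then
      echo "drift: $cmc_key is '$cmc_got', expected '$cmc_want'"
      cmc_status=1
    fi
  done <<EOF
$EXPECTED_CONFIG
EOF
  return "$cmc_status"
}
```

Calling `check_mirror_config /srv/git/repo.git` from a health check would then flag a mirror that was recreated without its settings.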

Maintenance

Use git maintenance for routine tasks — it handles incremental repacks, commit-graph writes, loose object packing, and ref compaction with sensible scheduling:

git maintenance register
git maintenance start
git config maintenance.strategy incremental

This sets up systemd timers or cron entries automatically. Besides an hourly prefetch, the incremental strategy runs:

  • commit-graph: incremental split graph writes.
  • incremental-repack: consolidates small packs through the multi-pack-index (expire unreferenced packs, then repack small ones in batches). Avoids expensive full repacks.
  • loose-objects: packs loose objects and removes those already stored in a pack.
  • pack-refs: packs loose refs into packed-refs.
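
The four tasks listed above can also be run on demand, for example right after a large fetch, instead of waiting for the scheduler. A minimal sketch; the helper name `run_incremental_tasks` is hypothetical, while `git maintenance run --task=<task>` is the stock command.

```shell
#!/bin/sh
# run_incremental_tasks <repo>: run the four listed tasks once, in this order.
run_incremental_tasks() {
  git -C "$1" maintenance run \
    --task=commit-graph \
    --task=incremental-repack \
    --task=loose-objects \
    --task=pack-refs
}
```

For the mirror described here that would be `run_incremental_tasks /srv/git/repo.git`.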

Keep a separate cron job for a periodic full repack (daily/weekly, during low-traffic windows) — git maintenance deliberately avoids these, but a single optimally-deltified pack is the best state for upload-pack serving:

# Full repack — schedule during low traffic
git repack -adb --write-midx  # -b already implies --write-bitmap-index

Fetching

git fetch --prune --prune-tags

Locking considerations

  • upload-pack (serving clients) — read-only, no locks. Safe to run concurrently.
  • git fetch — briefly locks packed-refs. The proxy server serializes fetches with its own internal lock.
  • commit-graph write, multi-pack-index write — atomic file renames. Safe anytime.
  • Full repack -adb — deletes old packs and swaps in new ones. In-flight upload-pack processes are safe (open fds survive unlink on Linux), but new readers during the swap window could fail. The multi-pack-index mitigates this via atomic midx updates. Schedule full repacks during low-traffic windows.
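
The fetch serialization mentioned above lives inside the proxy server, but the same idea can be sketched for ad hoc scripts with `flock(1)` from util-linux. This is an illustration under assumptions: the helper name `fetch_serialized` and the lock file name `proxy-fetch.lock` are invented here, and the real proxy may lock quite differently.

```shell
#!/bin/sh
# fetch_serialized <repo>: run the mirror fetch while holding an exclusive lock,
# so concurrent callers queue up instead of fetching in parallel.
# The function body is a subshell, so the lock is released when it exits.
fetch_serialized() (
  repo=$1
  # Open file descriptor 9 on a lock file inside the bare repo (illustrative path).
  exec 9>"$repo/proxy-fetch.lock"
  # Block until no other process holds the lock.
  flock -x 9 || exit 1
  git -C "$repo" fetch --prune --prune-tags
)
```

Readers served by upload-pack never touch this lock, which matches the point above that serving is read-only.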

Strategy 1: Protocol Proxy

Before proxying a client request, check if upstream has new refs:

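# ls-remote prints "<oid><TAB><ref>", show-ref prints "<oid> <ref>": normalize the separator.
# --refs drops HEAD and peeled tag entries (^{}), which show-ref does not list.
UPSTREAM=$(git ls-remote --refs <upstream> | tr '\t' ' ' | sort)
LOCAL=$(git show-ref | sort)
if [ "$UPSTREAM" != "$LOCAL" ]; then git fetch --prune --prune-tags; fi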

ls-remote only exchanges the ref advertisement — cheap when nothing changed. Then serve via git http-backend against the mirror.
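
Serving through git http-backend can be smoke-tested without a web server, because the backend is a plain CGI program driven by environment variables. The sketch below invokes it the way a CGI host would for the initial ref advertisement; the helper name `probe_info_refs` is hypothetical, and the directory layout is assumed to be the one used for the mirror above.

```shell
#!/bin/sh
# probe_info_refs <project-root> <repo-name>: emulate
# "GET /<repo-name>/info/refs?service=git-upload-pack" against git http-backend.
# The function body is a subshell, so the exports do not leak into the caller.
probe_info_refs() (
  export GIT_PROJECT_ROOT="$1"     # directory that contains the mirror
  export GIT_HTTP_EXPORT_ALL=1     # serve repos without a git-daemon-export-ok file
  export REQUEST_METHOD=GET
  export PATH_INFO="/$2/info/refs"
  export QUERY_STRING="service=git-upload-pack"
  git http-backend
)
```

Running `probe_info_refs /srv/git repo.git | head` should print CGI response headers followed by the start of the ref advertisement; an error status in the headers usually points at a wrong path or a missing export permission.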


Strategy 2: Snapshot Distribution

Setup

Local clone from mirror — git hardlinks objects by default, so no disk duplication:

git clone /srv/git/repo.git /srv/snapshots/repo-full

Updating the snapshot clone

Over time a long-lived clone's object store drifts from the mirror's — the mirror repacks and deletes old packs, while the clone retains stale hardlinked pack files alongside new fetch packs.

Recommended: re-clone before each snapshot. Delete and re-clone from the mirror. Cheap because it's a local hardlink clone — essentially just cp -al on the object store plus checkout. Guarantees the snapshot always has a clean, compact object store matching the mirror's repacked state.

Snapshot cycle

rm -rf /srv/snapshots/repo-full
git clone /srv/git/repo.git /srv/snapshots/repo-full
cd /srv/snapshots/repo-full

REV=$(git rev-parse --short HEAD)
tar -cf - . | zstd -T0 -3 -o "/srv/snapshots/out/repo-${REV}.tar.zst"

The hardlinked objects point at files outside the archived tree, so tar stores them as regular files with their full content and the archive is fully self-contained.
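
If clients can download while a snapshot is still being written, it may help to publish the archive atomically. The following is a sketch of one possible approach, not something the design above prescribes: write to a temporary name, check the compressed stream with `zstd -t`, then rename into place. The helper name `publish_snapshot` and the temporary file naming are assumptions.

```shell
#!/bin/sh
# publish_snapshot <clone-dir> <out-dir> <name>: archive <clone-dir> into
# <out-dir>/<name>.tar.zst, exposing the file only once it is complete.
publish_snapshot() {
  ps_src=$1 ps_out=$2 ps_name=$3
  ps_tmp="$ps_out/.$ps_name.tar.zst.tmp"
  # Note: the pipeline's exit status is zstd's; a tar failure is not detected here.
  tar -C "$ps_src" -cf - . | zstd -T0 -3 -q -f -o "$ps_tmp" || return 1
  # Verify the compressed stream before clients can see it.
  zstd -q -t "$ps_tmp" || { rm -f "$ps_tmp"; return 1; }
  # A rename within one directory is atomic, so readers never see a partial file.
  mv "$ps_tmp" "$ps_out/$ps_name.tar.zst"
}
```

In the cycle above this would replace the direct `tar | zstd` line, for example `publish_snapshot /srv/snapshots/repo-full /srv/snapshots/out "repo-${REV}"`.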

Client usage

mkdir repo && cd repo  # the archive has no top-level directory
zstd -dc ../repo-abc123.tar.zst | tar xf -
git remote set-url origin <proxy-or-upstream>
git pull  # catch up to latest if snapshot is slightly stale
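
For CI images or onboarding scripts, the client steps can be wrapped in a single function. A minimal sketch; `bootstrap_from_snapshot` is a hypothetical name, and `--ff-only` is an added assumption that a freshly unpacked snapshot carries no local commits.

```shell
#!/bin/sh
# bootstrap_from_snapshot <archive> <dest-dir> <remote-url>
bootstrap_from_snapshot() {
  bs_archive=$1 bs_dest=$2 bs_remote=$3
  mkdir -p "$bs_dest" || return 1
  # Unpack the working tree and the .git directory in one pass.
  zstd -dc "$bs_archive" | tar -xf - -C "$bs_dest" || return 1
  # The snapshot's origin still points at the server-side mirror path.
  git -C "$bs_dest" remote set-url origin "$bs_remote" || return 1
  # Fast-forward only: refuse to merge if the histories have diverged.
  git -C "$bs_dest" pull --ff-only
}
```

Usage would look like `bootstrap_from_snapshot repo-abc123.tar.zst ./repo <proxy-or-upstream>`.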

Key decisions

| Decision | Rationale |
| --- | --- |
| --mirror for object store | Single source of truth; bare repo with all refs |
| git maintenance for routine tasks | Handles incremental repack, commit-graph, loose objects, pack-refs with sensible scheduling |
| Separate full repack cron | git maintenance avoids full repacks; a single optimal pack is best for upload-pack serving |
| Local clone for snapshots | Hardlinks objects from mirror; no disk duplication, self-contained from the start |
| Re-clone before snapshot | Clean object store matching mirror's repacked state; cheap because local clone just hardlinks |
| No --changed-paths on commit graph | Bloom filters are expensive to build and only help git log -- <path>, not upload-pack |
| --split commit graph | Incremental layers; only processes new commits per fetch |
| ls-remote diff check | Avoids unnecessary fetches when refs haven't changed |
| zstd -T0 -3 | Good compression/speed tradeoff; -T0 uses all cores |
| Full repack in low-traffic windows | Pack swap can briefly affect new readers; mitigated by multi-pack-index |
