Siddharth Shrimali
CS undergrad · GSoC 2026 @ Git
This blog documents my Google Summer of Code 2026 journey working on Git, mentored by Christian Couder (who works at GitLab) and Siddharth Asthana.
16th May 2026 · Introduction
#01 - Introduction!
Hi,
My name is Siddharth, and I'm from India.
I am 20 years old and currently in my second year of Computer Science engineering.
I love low-level programming and building systems from scratch. It helps me understand systems better. Most of my programming is in C, and I have recently started learning Go. I also have experience with Java.
In my free time, I play chess with a strategy so unpredictable that even I don't know what my next move is, and it's a DoS attack on my opponent, hoping that they resign :)
I have come to love contributing to open source. It helped me understand how to read large codebases and communicate better with communities. Though the process is very time-consuming, and the results are not always immediately visible, I think it is absolutely worth it.
A lot of my knowledge about Linux and open source came from my college's tech club, and I will always be grateful to it for introducing me to these domains!
So I recently started contributing to open source around December 2025, first by fixing documentation and eventually contributing to Git. Somewhere along the way, I decided to apply for GSoC 2026 with the Git organization...
and well... I got selected!
I wanted to keep this blog short and simple, mainly to introduce myself to the community. This is also my first time writing a blog, so I am still learning. I don't expect my blogs to be perfect, but I do want to be consistent and eventually learn how to write a perfect blog ;)
The next blog will cover my GSoC selection experience and a brief introduction to my project, assuming I don't spend more time thinking about the title than writing the blog.
20th May 2026 · GSoC Journey
#02 - GSoC Selection & Project Introduction
Welcome back! And yes, good news, I actually managed to pick a title for this post!
Why Git?
Honestly, I found Git internals fascinating. To emphasize that, it was actually the very first statement I said to all of the mentors of Git during our first meeting! Since I was already using Git, I thought, why not give it a try? I went through the setup, figured out the requirements, and started sending my patches to the mailing list. The Git mailing list looks incredibly intimidating at first (it definitely did to me!), but eventually, you get the hang of it (mostly). ;)
To be honest, when I first decided to contribute, I didn't even know how to properly generate or send a patch via the command line. The email-based workflow felt completely new to me, and it is likely to feel that way to anyone else starting out.
Thanks to my friend, Bhargav, who was already well-versed in the CLI patch workflow from his experience contributing to the Linux kernel, I didn't have to figure it out entirely alone. He assisted me and showed me how to format and send patches correctly. And I shouldn't forget to mention, he was the one who told and convinced me to try contributing to Git.
When I started, I didn't actually pick a GSoC project right away, mostly because I didn't know much about any of them yet. I focused entirely on writing patches. My goal was to keep contributing to Git regardless of whether I got into programs like GSoC or Outreachy.
My very first patch was unforgettable. Because I was completely new to how the open-source community worked, I didn't know who anyone was. When someone replied to my first patch with detailed feedback, I went to check who this reviewer was... only to realize it was Junio Hamano, the core maintainer of Git himself! I was incredibly happy and completely shocked at the same time. I immediately started double-checking my logic, desperate to make sure I hadn't submitted something incredibly dumb to someone who basically built the tool the entire world uses daily! :)
The open-source community is genuinely helpful. They teach and guide you along the way, just like Junio guided me. I learned something new from every single patch review that Junio or other contributors sent back. Every critique was a chance to improve and submit a V2, V3, or maybe Vn (where n is a number I'm too proud to admit). It might look like a tiring task to send a new version every time you change something in the code, but it keeps the mailing list systematic and organized. That structure helps new contributors like me, and potentially future contributors like you, if you are reading this blog. :)
The Ultimate Shoutout
If you are ever confused at any point about anything, please reach out to the open-source community. I personally reached out to previous years' contributors (people who worked with Git through GSoC, Outreachy, or even independently outside of any program) to ask them how the mailing list works, how the community operates, and how to create a solid proposal. I sincerely thank all of them for helping me get over those initial hurdles.
The Project: Improving Disk Space Recovery for Partial Clones
This summer, I'll be working on "Improve Disk Space Recovery for Partial Clones" at Git under the mentorship of Christian Couder who works at GitLab and Siddharth Asthana. Both of them have been fantastic past contributors and are now guiding me as mentors.
To understand the project, you have to understand how massive repositories handle scale. When you clone a giant repository, Git allows you to use a Partial Clone. This lets you apply a filter so you don't have to download every single blob (Binary Large Object) unless you actually need it. Instead, blobs are downloaded on-demand from "promisor remotes." In short, promisor remotes are exactly what they sound like: remotes that have promised to deliver specific blobs when requested. It is as simple as that. :)
The Problem
It's a lifesaver for disk space, and you might not fully realize it until you actually start using it. As you check out different branches, or grep through history, Git dynamically downloads those missing blobs onto your local machine.
Over time, your local storage fills right back up.
Currently, there is no safe, built-in way to delete these locally stored blobs when you no longer need them, short of deleting the entire repository and cloning it all over again. Commands like git gc or git prune don't safely drop them because Git treats them as a permanent part of the local history.
The Goal
My project aims to build a mechanism (potentially via a new command or by expanding an existing tool like git maintenance) that allows developers to safely prune large, unneeded local blobs if they are already backed up securely on a promisor remote. This means you can aggressively reclaim your disk space while maintaining the ability to transparently re-fetch those objects whenever you need them down the road.
Sounds amazing, right? It did to me as well, and that is exactly what made me select this project, since I personally love working with low-level memory and storage management (no regrets). ;)
Next, I'll be diving into the community bonding phase. Stay tuned!
24th May 2026 · GSoC Timeline
#03 - Community Bonding & Core Internals
Welcome back!
I have officially hit the Community Bonding phase. To be completely honest, my college semester exams ate up almost an entire month with some delays, all thanks to my college :). So I couldn't get as much done as I initially wanted. :)
However, whenever I got some time, I cracked open the codebase. I spent my time reading promisor-remote.c and digging into how Git packs files, wrapping my head around several core concepts: loose objects vs. packfiles, delta compression engines, promisor remotes, and Protocol V2 wire constraints. (more about them in detail later in the blog)
I reached out to my mentors, Christian and Siddharth, who were incredibly helpful, telling me to focus entirely on my exams until the coding period kicked off. During this downtime, the main priorities were setting up this blog layout, establishing our Slack channels, and preparing a dedicated repository for clean branch reviews before hitting the Git mailing list. I personally chose GitLab, since I have never used it before and wanted to give it a try, and of course it is open-source! (You guys can also check it out.)
Beyond emails, I also initiated an informal sync-up with a fellow GSoC contributor, Jayatheerth. Since he's been contributing to Git for a year, he gave me some golden advice: stop trying to read the massive Git codebase line-by-line (which I was absolutely doing :) ) and take a more strategic approach. Soon after, we had our first official global kick-off meeting. It was fascinating to see four contributors and our mentors logging in from completely different time zones across the globe, all coming together to build and improve Git.
During the peak exam grind, I used my short breaks to draft my previous two logs introducing myself and breaking down the partial clone problem. Definitely must-read blogs (because ofc, I wrote them :) ).
Diving Into the Deep End: What I Learned
Here is a beginner-friendly breakdown of the technical components I spent my study breaks analyzing.
1. Loose Objects vs. Packfiles
When you run git add, Git acts as a Content-Addressable Storage filesystem.
- Loose Objects: Git wraps your staged file in a metadata header (blob <size>\0<content>). This bundle is hashed via SHA-1 to generate a unique 40-character hexadecimal ID, compressed instantly using zlib, and written directly to disk under .git/objects/XX/. This ensures fast staging, but creates extreme file fragmentation.
- Packfiles & Indexes: To save space, optimization routines like git gc or git repack compress loose fragments into a single binary archive called a Packfile (.pack). Adjacent to it, Git generates a look-up Index File (.idx) listing all sorted object hashes paired with their exact byte-offsets inside the .pack archive, enabling microsecond lookups via binary search.
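Here is a minimal sketch of the loose-object steps described above, written in Python rather than Git's C and assuming a SHA-1 repository. The function names are made up for illustration; only the header framing, the SHA-1 hash, zlib, and the .git/objects/XX/ layout come from the description.

```python
import hashlib
import zlib


def blob_header(content: bytes) -> bytes:
    """Build the 'blob <size>\\0' header that precedes the file content."""
    return b"blob " + str(len(content)).encode("ascii") + b"\0"


def blob_object_id(content: bytes) -> str:
    """Return the 40-character SHA-1 ID of a blob with this content."""
    return hashlib.sha1(blob_header(content) + content).hexdigest()


def loose_object_bytes(content: bytes) -> bytes:
    """Return the zlib-compressed bytes a loose object file would hold."""
    return zlib.compress(blob_header(content) + content)


def loose_object_path(object_id: str) -> str:
    """First two hex digits name the directory, the rest name the file."""
    return f".git/objects/{object_id[:2]}/{object_id[2:]}"


if __name__ == "__main__":
    oid = blob_object_id(b"hello\n")
    print(oid)
    print(loose_object_path(oid))
```

The printed ID can be compared against what git hash-object reports for a file holding the same bytes.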
2. Delta Compression Mechanics (LibXDiff)
To drastically reduce repository size, the packing engine uses delta compression to store only the differences between files:
- The Window: Git sorts objects by file type, path name, and size so historical versions of a file land next to each other. It then slides a Window Size (default: 10 entries) down this list, evaluating differences only within that narrow block to avoid choking your CPU.
- The Chain: When matching sequences are found, Git replaces the duplicate content with bite-sized modification instructions (Copy Offset/Length or Insert Bytes).
- The Traversal Rule: Git encodes these chains backwards, storing the newest file version completely whole as a raw Base Object, while turning older historical iterations into dependent deltas. It caps this sequence at a maximum Chain Depth (default: 50 links) to keep reconstruction fast.
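To make the copy/insert idea concrete, here is a simplified sketch in Python. It models instructions as plain tuples, which is an assumption for readability and not Git's binary delta encoding. The depth check in resolve_chain only illustrates why a cap keeps reconstruction bounded; in Git the limit is applied when packs are written.

```python
from typing import List, Tuple, Union

Copy = Tuple[str, int, int]   # ("copy", offset, length)
Insert = Tuple[str, bytes]    # ("insert", data)
Instruction = Union[Copy, Insert]


def apply_delta(base: bytes, instructions: List[Instruction]) -> bytes:
    """Rebuild a target object from a base object plus delta instructions."""
    out = bytearray()
    for ins in instructions:
        if ins[0] == "copy":
            _, offset, length = ins
            if offset < 0 or offset + length > len(base):
                raise ValueError("copy range falls outside the base object")
            out += base[offset:offset + length]
        elif ins[0] == "insert":
            out += ins[1]
        else:
            raise ValueError(f"unknown instruction: {ins[0]!r}")
    return bytes(out)


def resolve_chain(base: bytes, deltas: List[List[Instruction]],
                  max_depth: int = 50) -> bytes:
    """Apply a chain of deltas, each one relative to the previous result."""
    if len(deltas) > max_depth:
        raise ValueError("delta chain is deeper than the allowed depth")
    current = base
    for delta in deltas:
        current = apply_delta(current, delta)
    return current


if __name__ == "__main__":
    newest = b"hello world\n"
    older = [("copy", 0, 6), ("insert", b"git"), ("copy", 11, 1)]
    print(apply_delta(newest, older))
```

In the example, the older version is stored as three small instructions against the newest version instead of as a full copy.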
3. Partial Clones & Promisor Remotes
In massive enterprise architectures tracking millions of files, downloading a full repository history to a local machine is completely impractical.
- The Protocol: Git overcomes this scale bottleneck through Partial Clones, applying a structural filter at download time that tells the server to skip fetching bulky assets or historical blobs unless they are explicitly called for.
- The Promise: The local repository relies on configured Promisor Remotes, upstream servers that formally promise to pack and deliver missing object data on-demand the exact millisecond a local terminal operation (like git checkout or git log -p) queries them.
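The on-demand behaviour can be pictured with a toy model. This is a sketch under loud assumptions: PromisorRemote and PartialClone are hypothetical Python classes, object IDs are plain strings, and there is no real networking. It only shows that a missing object is fetched once and then stays on local disk.

```python
from typing import Dict


class PromisorRemote:
    """Stand-in for a server that promised to serve any object on request."""

    def __init__(self, objects: Dict[str, bytes]) -> None:
        self._objects = dict(objects)
        self.requests = 0

    def fetch(self, oid: str) -> bytes:
        self.requests += 1
        return self._objects[oid]


class PartialClone:
    """Local object store that starts without blobs and fills in lazily."""

    def __init__(self, remote: PromisorRemote) -> None:
        self.remote = remote
        self.local: Dict[str, bytes] = {}

    def read(self, oid: str) -> bytes:
        if oid not in self.local:
            # Missing locally: fetch on demand and keep the copy.
            self.local[oid] = self.remote.fetch(oid)
        return self.local[oid]

    def local_bytes(self) -> int:
        return sum(len(data) for data in self.local.values())


if __name__ == "__main__":
    remote = PromisorRemote({"big-asset": b"x" * 1_000_000, "readme": b"hi\n"})
    clone = PartialClone(remote)
    print(clone.local_bytes())   # nothing downloaded yet
    clone.read("big-asset")      # e.g. a checkout touches the file
    clone.read("big-asset")      # second read is served locally
    print(remote.requests, clone.local_bytes())
```

Note that nothing in this model ever removes an entry from the local store, so local usage only grows.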
4. Wire Optimization: Protocol v2 object-info
Historically, if a local partial clone simply needed to check a missing file's size via git cat-file -s <hash>, it was forced to download the entire missing file across the network just to extract that metric.
Protocol v2 introduces the object-info capability to create an instantaneous metadata lookup channel over lightweight packet line (pkt-line) network streams:
// Client Request Stream
0018command=object-info\n   <- Request capability
0001                        <- Delimiter Packet (starts arguments)
0009size\n                  <- Request metadata size attribute
0031oid <target-hash>\n     <- Target object ID
0000                        <- Flush Packet (ends request)
On the server side, this request routes into protocol-caps.c where cap_object_info() parses the arguments and invokes oid_object_info_extended(). Because only the object size is requested, it skips slow decompressions entirely. It searches the .idx file, jumps to the byte offset in the .pack archive, reads the raw header bytes stating the uncompressed size, and streams it back in milliseconds:
// Server Response Stream
0009size\n
<length><target-hash> <integer-size-bytes>\n   <- <length> depends on how many digits the size has
0000                                           <- Flush Packet (ends stream)
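The length prefixes in these streams follow the pkt-line rule: four hex digits giving the total length of the line, counting the four prefix bytes themselves and any trailing newline. Here is a small Python sketch of that framing. The helper names are hypothetical, a real client may send extra capability lines before the delimiter, and other special packets are not handled.

```python
from typing import List

FLUSH_PKT = b"0000"
DELIM_PKT = b"0001"
MAX_PKT_LEN = 65520  # largest total pkt-line length, prefix included


def pkt_line(payload: str) -> bytes:
    """Frame one payload as a pkt-line: 4 hex digits of length, then data."""
    data = payload.encode("utf-8")
    length = len(data) + 4
    if length > MAX_PKT_LEN:
        raise ValueError("payload too long for a single pkt-line")
    return f"{length:04x}".encode("ascii") + data


def object_info_request(oids: List[str]) -> bytes:
    """Build a size-only object-info request for the given object IDs."""
    parts = [pkt_line("command=object-info\n"), DELIM_PKT, pkt_line("size\n")]
    parts += [pkt_line(f"oid {oid}\n") for oid in oids]
    parts.append(FLUSH_PKT)
    return b"".join(parts)


def read_pkt_lines(stream: bytes) -> List[str]:
    """Split a stream into payloads, marking delimiter and flush packets."""
    items: List[str] = []
    pos = 0
    while pos < len(stream):
        length = int(stream[pos:pos + 4], 16)
        if length == 0:
            items.append("<flush>")
            pos += 4
        elif length == 1:
            items.append("<delim>")
            pos += 4
        elif length < 4:
            raise ValueError("special packet not handled in this sketch")
        else:
            items.append(stream[pos + 4:pos + length].decode("utf-8"))
            pos += length
    return items


if __name__ == "__main__":
    oid = "ce013625030ba8dba906f756967f9e9ca394464a"
    request = object_info_request([oid])
    print(request)
    print(read_pkt_lines(request))
```

Because the size in the response has a variable number of digits, the prefix of each response line has to be computed per object in the same way.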
On another exciting note, I recently saw an email on the Git mailing list and heard from my mentors about Git Merge 2026, which is happening in Portugal this year! Working on systems code is cool, but getting to meet the community in person sounds incredible, so I am definitely planning to apply for travel support to attend.
As of today, my exams are officially behind me, and I have already sent over my Week 1 design proposal email to my mentors. I'll be covering that architecture breakdown in deep detail in the next blog.
The coding period officially begins tomorrow. Let's do this!
31st May 2026 · GSoC Week 1
#04 - Week 1: The Rollercoaster Begins
The official Google Summer of Code coding period kicked off this week!
For folks new to this series: my project is on the core Git system, Improving Disk Space Recovery for Partial Clones. Partial clones are great: they let you work with massive repos without downloading every blob upfront. But over time, your local repo accumulates blobs through lazy fetches in the background.
Currently there's no built-in way to drop those accumulated blobs and reclaim disk space short of doing a full re-clone. My project is about giving Git a safe way to do that.
My first week was a total rollercoaster. I went from a highly structured plan, to a mid-week crisis, and finally to a working prototype. (hopefully that gets approved, but we will see about that :) )
Start of the week: The masterplan
I went into Monday feeling good. My proposal had a clean technical plan: build a custom blob enumerator from scratch that walks every locally-held object, filters by type and size, and produces a list of drop candidates.
The whole thing had to stay strictly local, not triggering lazy fetches during enumeration, which meant passing OBJECT_INFO_SKIP_FETCH_OBJECT to oid_object_info_extended() on every single call.
I pushed those 200 lines with the unearned confidence of a developer blissfully unaware that this massive, elegant codebase had already solved my problem years ago.
The pipeline looked like this:
THE INITIAL PLAN (CUSTOM ENUMERATOR)
====================================
[ Incoming Object ID ]
│
▼
1. Is it a Promisor object? ───(Yes)───► [ Skip & Track ]
│ (No)
▼
2. Is it a Blob? ──────────────(No)────► [ Skip & Track ]
│ (Yes)
▼
3. Is it big enough? ──────────(No)────► [ Skip & Track ]
│ (Yes)
▼
4. Is it already indexed? ─────(Yes)───► [ Skip & Track ]
│ (No)
▼
[ Append to Candidates List ]
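For readers who prefer code to boxes, the same pipeline can be sketched in Python. This is not the C I actually wrote: LocalObject and select_candidates are illustrative names, "big enough" is assumed to mean size >= min_size, and step 4 is read as "already recorded in the candidate list", which is also an assumption.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass(frozen=True)
class LocalObject:
    oid: str
    kind: str          # "blob", "tree", "commit" or "tag"
    size: int          # uncompressed size in bytes
    is_promisor: bool


def select_candidates(objects: Iterable[LocalObject],
                      min_size: int) -> Tuple[List[str], Counter]:
    """Run each object through the four checks, tracking why it was skipped."""
    candidates: List[str] = []
    seen = set()
    skipped: Counter = Counter()
    for obj in objects:
        if obj.is_promisor:
            skipped["promisor"] += 1
        elif obj.kind != "blob":
            skipped["not a blob"] += 1
        elif obj.size < min_size:
            skipped["too small"] += 1
        elif obj.oid in seen:
            skipped["already listed"] += 1
        else:
            seen.add(obj.oid)
            candidates.append(obj.oid)
    return candidates, skipped


if __name__ == "__main__":
    objects = [
        LocalObject("aaa", "commit", 300, False),
        LocalObject("bbb", "blob", 50, False),
        LocalObject("ccc", "blob", 20_000, False),
        LocalObject("ccc", "blob", 20_000, False),
    ]
    print(select_candidates(objects, min_size=10_240))
```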
I spent Monday and most of Tuesday writing this. About 200 lines of fresh enumeration code, a candidate list structure, callbacks for loose and packed object iteration, the works. And then I sent it to my mentors.
And... Then came the mentor feedback. :_)
Mid-week: the pivot
My mentor, Christian, came back with a piece of feedback that completely reframed the project. The short version: don't duplicate machinery that already exists.
git repack --filter already evaluates and filters objects safely. Taylor Blau had just refactored that whole pipeline in October 2025 (commits 9251edd257..7ac4231b42), almost certainly to make it reusable. Writing a parallel walker wasn't just redundant, it was exactly the kind of code bloat the Git maintainers would push back on during review.
He also flagged something more important. He told me about when he and John Cai first worked on git repack --filter years ago, their early versions deleted filtered objects without checking they were safely available on the promisor remote first.
Junio, Git's core maintainer, rejected the work outright. The real challenge of my project isn't writing an enumerator; it's making sure that everything I delete is guaranteed to be fetchable back from the remote. That's the part that has to be airtight.
THE ARCHITECTURAL EPIPHANY
==========================
Instead of: [ My Custom Walker ] ──(Duplicates)──► [ pack-objects ]
Do this: [ Git's existing repack machinery ]
│
(Filters objects)
│
▼
[ write_filtered_pack() ] ◄── My new hook
Honestly? I was stressed. Throwing away 200 LOC right at the start of the week felt like I did something massively wrong. I was worried that I was already lagging behind. But Christian assured me that this was actually a great thing. It meant I had learned enough about the codebase to recognize when I was reinventing the wheel, and that I could pivot to a much more elegant solution.
Reading before writing
There's a famous joke that "weeks of coding can save you hours of reading." Christian kindly suggested I try the reverse. So Wednesday I didn't write any code. I just read.
I worked through Taylor Blau's refactor commit by commit, mapped out which functions had been extracted, and identified write_filtered_pack() in repack-filtered.c as the entry point I needed. I traced how it calls finish_pack_objects_cmd(), how the names string list flows between repack's internal stages, how the install loop at the end of cmd_repack moves temp packs into .git/objects/pack/.
This reading kept paying off all week. Multiple times I almost wrote code based on assumptions that ten minutes of reading would have corrected. (And the few times I didn't read first, I paid for it. More on that below.)
Putting the pieces together
I started fresh on a drop-filtered branch off master. Commit 1 was small: just add the --drop-filtered and --dry-run flags to git repack with proper validation (--drop-filtered requires --filter, can't combine with --filter-to, etc.). Just flag parsing, no real work yet.
By the end of the week, I had a working prototype that perfectly executed the dry-run behavior. Here is how my final architecture for Week 1 actually works:
THE WEEK 1 FINAL ARCHITECTURE (DRY-RUN)
=======================================
User: git repack --drop-filtered --dry-run --filter=blob:limit=10k
│
▼
┌────────────────────────────────────────────────────────┐
│ 1. Existing machinery generates a temp .pack of │
│ objects matching the filter (e.g., blobs > 10k). │
│ Done by write_filtered_pack(). │
└───────────────────────────┬────────────────────────────┘
│
▼
┌────────────────────────────────────────────────────────┐
│ 2. My enumerate_filtered_objects() opens the temp │
│ pack and reads out its OIDs. No network calls. │
└───────────────────────────┬────────────────────────────┘
│
▼
┌────────────────────────────────────────────────────────┐
│ 3. Print the candidate OIDs to stdout (dry-run). │
└───────────────────────────┬────────────────────────────┘
│
▼
┌────────────────────────────────────────────────────────┐
│ 4. Cleanup: remove the filtered pack's hash from the │
│ 'names' list so the install loop doesn't move it │
│ into packdir, then unlink_pack_path() the temp │
│ files so nothing leaks onto disk. │
└────────────────────────────────────────────────────────┘
The cleanup step (4) was important to get right. The shared names list inside cmd_repack is what drives the final install loop that moves packs into .git/objects/pack/.
If I left the filtered pack's hash in there, the user would run --dry-run and get a new pack installed in their repo containing exactly the objects they said "don't touch."
A dry-run that isn't dry. Catching that side effect required tracing what happens to names after my code runs, all the way to the end of cmd_repack.
Rabbit holes I fell into
Testing was where the week got genuinely painful. I set up a bare clone of git/git (~300MB, ~413k objects) for a realistic test. The first run printed 413,263 OIDs. Ah yes, my brilliant code successfully identified literally every single object in Git's history as a candidate for deletion.
Very space-efficient, I guess?
Obviously wrong. I spent most of the next 2 days chasing it down.
1. The --filter-to prefix confusion:
Was the names list corrupted? Was I reading the wrong pack? Was the upstream filter machinery broken? At one point, I was convinced I'd found a bug in git repack itself, until I noticed the filtered packs were showing up in /tmp/ as filtered-git-<hash>.pack, acting as siblings to the directory I expected them in.
Turns out, --filter-to is a prefix, not a directory. The docs say "Write the pack containing filtered out objects to the directory <dir>". Naturally, I assumed this meant "write into the directory." Nope! It actually means "use this exact string as a filename prefix." I love the English language. That misunderstanding cost me hours.
2. The optimal pack short-circuit:
The actual issue was much smaller. When I ran my command without -a -d -f --no-write-bitmap-index, pack-objects would short-circuit because the freshly cloned repo was already optimally packed. It produced no filtered pack at all. Because of this, the names list only contained the main pack's hash, which my code happily opened and dumped all 413k OIDs from.
With the right flags, the command correctly identified 98,798 candidate blobs, all verified to be > 10,240 bytes on a fresh clone.
I added a names.nr > names_before guard to handle this no-op case gracefully.
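The shape of that guard, together with the dry-run cleanup from step 4, can be sketched in Python. Everything here is a stand-in: names is a plain list instead of the C string list, the two callables are hypothetical, and the sketch assumes new pack hashes are appended at the end of the list.

```python
from typing import Callable, List


def dry_run_candidates(
    names: List[str],
    write_filtered_pack: Callable[[List[str]], None],
    read_pack_oids: Callable[[str], List[str]],
) -> List[str]:
    """List the OIDs of any newly written filtered pack, then forget it."""
    names_before = len(names)
    write_filtered_pack(names)          # may append a pack hash, or nothing

    oids: List[str] = []
    if len(names) > names_before:       # the no-op guard
        for pack_hash in names[names_before:]:
            oids.extend(read_pack_oids(pack_hash))
        # Dry-run cleanup: drop the filtered pack from the list so a later
        # install step never sees it.
        del names[names_before:]
    return oids


if __name__ == "__main__":
    packs = {"main": ["c1", "t1"], "filtered": ["b1", "b2"]}
    names = ["main"]
    print(dry_run_candidates(names, lambda n: n.append("filtered"), packs.__getitem__))
    print(dry_run_candidates(names, lambda n: None, packs.__getitem__))
    print(names)
```

Without the guard, the second call would fall back to reading the main pack, which is exactly the 413k-OID mistake described above.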
3. Strict C89 Compliance:
I also got a friendly scolding from -Werror=declaration-after-statement for daring to put an if condition before some variable declarations. Git's coding style still keeps the C89 rule that declarations come at the start of a block.
Lesson learned!
Looking ahead to Week 2
Week 1 was a lot. I went from confidently building a standalone engine, to throwing it away, reading for two days, and then carefully wiring my logic into existing machinery instead.
The end result is a much smaller diff with a significantly higher chance of getting merged.
Week 2 is where the real challenge starts: implementing the promisor-remote verification. Before any blob can actually be deleted, I need to prove it is fetchable back from the promisor remote.
This is the critical safety check that Christian's first attempt at --filter lacked years ago, and it is the exact component Junio will scrutinize the hardest when this eventually goes to the mailing list. I'll be modeling it closely on remove_fetched_oids() in promisor-remote.c—and I'll be writing the tests first this time.
My goal for Week 2 is simple: write code that actually works, and try to ensure that the only thing getting deleted is my 'todo' list rather than the user's entire repository. If I manage that, I’ll consider it a standing ovation-worthy success. :)
More next week.
08th June 2026 · GSoC Week 2
#05 - Week 2: Steady Building
Welcome back!
This blog is shorter than my others, since I didn't do a massive amount of coding this week. Week 2 was much calmer than the rollercoaster of Week 1, more steady building, punctuated by a couple of debugging detours that taught me a lot about how git repack actually behaves under the hood.
Where I started the week
Coming out of Week 1, I had two main commits ready to go. The first was the basic flag scaffolding to accept the new commands from the user (--drop-filtered and --dry-run).[1] The second was the enumeration logic that reads the filtered pack and prints out the candidate objects.[2] Both worked on a quick manual test during Week 1, but Week 2's job was to harden them.
So I spent a good amount of time hardening the validation for different flag combinations and thoroughly testing my implementation locally against the Git codebase.[3]
Once I felt confident, I shared the progress with my mentors, and the feedback was encouraging. Christian looked it over and gave me a green light. We had a brief chat about the command names: he suggested that what I was calling --dry-run might be better named --enumerate-filtered, since it could technically work with or without actually dropping the filtered objects. But we agreed to look at the UI discussions for the mailing list later. For now my direction is clear: move forward and implement the actual checks to verify whether the filtered objects can safely be removed, by confirming them against the promisor remote.
The parse-options side quest
What started as a one-line review comment turned into an API contribution outside the GSoC project itself.
While looking at how I was checking whether certain flags were used together, Christian asked if I could use an existing helper called die_for_incompatible_opt2(). I had been using plain die() statements everywhere, simply because I didn't know the helper existed.
The distinction turned out to be meaningful. There are two kinds of option errors:
THE TWO KINDS OF OPTION ERRORS
==============================
Prerequisite:    "X requires Y"       → plain die()
                 (--drop-filtered requires --filter)
Incompatibility: "X can't go with Y"  → die_for_incompatible_opt2()
                 (--drop-filtered + --filter-to)
So I converted the genuine incompatibilities (combining --drop-filtered with --filter-to, or with bitmap writing) to use the helper.
But here's the interesting part: the existing helpers in Git only covered mutually exclusive options, not prerequisites. There was no clean helper for the "X requires Y" case. So this turned into a small side-patch: I sent a short series introducing a new helper to the Git parse-options API to centralize prerequisite checks.[4]
The feedback was great. Jean-Noël Avila pointed out that my original function name was a bit of a misnomer and suggested a clearer alternative that would also be easier on translators. Christian suggested further improvements. My rebase with all these suggestions is ready, and I'm preparing to send the updated version soon.
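The two kinds of check can be sketched side by side in Python. None of these names or error messages come from Git: die_if_incompatible loosely mirrors the idea behind die_for_incompatible_opt2(), and die_if_missing_prerequisite is a hypothetical stand-in for the "X requires Y" helper, not the function from my series.

```python
class UsageError(Exception):
    """Raised when command-line options are combined in an invalid way."""


def die_if_incompatible(opt1_set: bool, opt1_name: str,
                        opt2_set: bool, opt2_name: str) -> None:
    """Incompatibility: fail only when both options are given."""
    if opt1_set and opt2_set:
        raise UsageError(
            f"options '{opt1_name}' and '{opt2_name}' cannot be used together")


def die_if_missing_prerequisite(opt_set: bool, opt_name: str,
                                required_set: bool, required_name: str) -> None:
    """Prerequisite: fail when the option is given without what it needs."""
    if opt_set and not required_set:
        raise UsageError(f"option '{opt_name}' requires '{required_name}'")


def validate_repack_options(drop_filtered: bool, has_filter: bool,
                            filter_to: bool, write_bitmaps: bool) -> None:
    """Apply the checks described in this post to one set of flags."""
    die_if_missing_prerequisite(drop_filtered, "--drop-filtered",
                                has_filter, "--filter")
    die_if_incompatible(drop_filtered, "--drop-filtered",
                        filter_to, "--filter-to")
    die_if_incompatible(drop_filtered, "--drop-filtered",
                        write_bitmaps, "--write-bitmap-index")


if __name__ == "__main__":
    validate_repack_options(True, True, False, False)   # accepted
    try:
        validate_repack_options(True, False, False, False)
    except UsageError as err:
        print(err)
```

Keeping the two helpers separate means each error message has a single fixed template, which is also friendlier to translators.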
The -a requirement: a safety constraint hiding in plain sight
While testing, I discovered something critical. If you run --drop-filtered without the -a flag, the filtered pack doesn't contain just the large blobs, it contains everything in the repository.
To understand why, you have to know how the filtered pack is actually computed. It isn't built by directly asking "which objects are large?" Instead, git repack builds it by subtraction: the filtered pack is "every object in the existing packs, minus whatever the main pack already holds."
So the contents of the filtered pack depend entirely on what ended up in the main pack first. And that's where -a comes in.
With -a, repack packs everything reachable into a single new main pack, with the filter applied. The filter keeps the small objects (commits, trees, small blobs) in the main pack and leaves the large blobs out. Then the subtraction runs:
existing objects: [commit, tree, small, small, LARGE, LARGE]
main pack (-a): [commit, tree, small, small] ← filter kept these
filtered pack = existing − main
= [LARGE, LARGE] ← exactly right
Without -a, repack runs in incremental mode. It only looks at loose (unpacked) objects, it ignores everything already sitting in existing packs. On a normal, already-packed repository there are no loose objects, so the main pack-objects run produces nothing. No main pack gets built. Now the subtraction has nothing to subtract:
existing objects: [commit, tree, small, small, LARGE, LARGE]
main pack: (empty — incremental mode packed nothing)
filtered pack = existing − nothing
= [commit, tree, small, small, LARGE, LARGE] ← everything!
With no main pack to subtract against, the "filtered" pack swallows the entire repository.
WHY -a MATTERS
==============
With -a: pack-objects splits objects into two piles
┌─────────────┐ ┌──────────────┐
│ main pack │ │ filtered pack│
│ (small) │ │ (large blobs)│ ← what we drop
└─────────────┘ └──────────────┘
Without -a: no main pile to split against
┌──────────────┐
│ filtered pack│
│ (EVERYTHING) │ ← dropping this = disaster
└──────────────┘
Right now my code only prints object IDs, so a wrong list is harmless. But once deletion lands, acting on a filtered pack that contains the whole repo would delete the entire repository. So I made --drop-filtered outright refuse to run without -a. This is exactly the kind of safety this project lives and dies on. I also had to reject bitmap writing, because filtering breaks the "all objects in one pack" closure that bitmap indexes need.
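The subtraction is easy to replay with sets. This Python sketch is a toy model of the behaviour described above, not of the real pack-objects code: objects are (kind, size) pairs keyed by a made-up ID, and build_packs and drop_candidates are hypothetical names.

```python
from typing import Dict, Set, Tuple

Obj = Tuple[str, int]  # (kind, size in bytes)


def kept_by_filter(obj: Obj, blob_limit: int) -> bool:
    """blob:limit=<n> keeps everything except blobs of at least n bytes."""
    kind, size = obj
    return kind != "blob" or size < blob_limit


def build_packs(packed: Dict[str, Obj], loose: Dict[str, Obj],
                blob_limit: int, pack_all: bool) -> Tuple[Set[str], Set[str]]:
    """Return (main pack, filtered pack) as sets of object IDs."""
    # With -a every object is considered; without it, only loose objects.
    source = {**packed, **loose} if pack_all else loose
    main = {oid for oid, obj in source.items() if kept_by_filter(obj, blob_limit)}
    filtered = set(packed) - main       # existing minus main
    return main, filtered


def drop_candidates(packed: Dict[str, Obj], loose: Dict[str, Obj],
                    blob_limit: int, pack_all: bool) -> Set[str]:
    """Refuse to produce drop candidates unless everything was repacked."""
    if not pack_all:
        raise ValueError("--drop-filtered requires -a")
    return build_packs(packed, loose, blob_limit, pack_all)[1]


if __name__ == "__main__":
    packed = {"c1": ("commit", 200), "t1": ("tree", 100),
              "s1": ("blob", 10), "L1": ("blob", 50_000)}
    print(build_packs(packed, {}, 10_240, pack_all=True)[1])
    print(build_packs(packed, {}, 10_240, pack_all=False)[1])
```

The second call reproduces the dangerous case: with an empty main pack, every packed object lands in the "filtered" set.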
Writing the tests, and the debugging detours
The test file is where Week 2 got fiddly. I hit two main detours:
Detour 1: grep and the leading dashes. Several validation tests failed mysteriously. The cause was that test patterns starting with dashes (like searching for the error message --drop-filtered requires...) were being parsed by grep as command-line options instead of search patterns. The fix was to rephrase the patterns so they don't begin with dashes. This is obvious in hindsight, but I was baffled for twenty minutes.
Detour 2: the empty filtered pack. My functional test kept producing empty output. I burned a while on this before realizing that a freshly-cloned bare repo is already optimally packed, so repack decides there's nothing to do and the filter never runs. The fix was to run a standard git repack -a -d first in the setup phase to consolidate the repo, then run the filter operation. With a clean baseline pack to filter against, the enumeration produced exactly the right object IDs.
Where Week 2 ended
I now have three clean commits on my branch (validation flags, enumeration logic, and tests) with all 8 tests passing. That's the complete enumeration half of the feature. You can run the repack command with the right flags and see exactly which blobs would be dropped, with the command refusing to run in any configuration where the answer would be unsafe or meaningless.
To see it in action, here's the dry-run pointed at the Git source repository itself, asking which blobs over 100 KiB it would drop:
$ git repack --drop-filtered --filter=blob:limit=100k --dry-run -a -f
Total candidates: 9734
00008b50f517e2f91483f76f908bf3663ff824e7 blob 770118 bytes
00024313e727b3c3d6b5698b573d6035ddd29ad4 blob 848522 bytes
00037976970d57888e2db09eec77d836b2215bbb blob 285381 bytes
0009d733f6376e208b5e31ecd53f41f542626aa6 blob 109684 bytes
000e95e9815a3afda917c1d5f5af71405dda4a29 blob 477426 bytes
... (9,729 more)
Every candidate is a blob, and every one is >= 100 KiB.
My plan for Week 3
Diving into the promisor-remote verification. Before any object gets deleted, I need to confirm the promisor remote actually has it and would hand it back on a lazy fetch. I'll build this verification using Protocol v2's object-info capability to ask the remote "do you have these?" without actually downloading anything.
Getting this right, and failing safe when in doubt (always keep an object rather than risk dropping one the remote lacks), is what determines whether this feature can ever be merged. Before I write any of it, I'll be reading the existing logic first.
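The fail-safe rule itself is simple enough to sketch before any real code exists. In this Python sketch, query_remote is a hypothetical callable standing in for whatever ends up talking to the promisor remote, and modelling "no answer" as None is my assumption, not how the protocol reports it.

```python
from typing import Callable, Dict, Iterable, List, Optional, Tuple


def split_safe_to_drop(
    candidates: Iterable[str],
    query_remote: Callable[[List[str]], Dict[str, Optional[int]]],
) -> Tuple[List[str], List[str]]:
    """Return (droppable, kept); anything not positively confirmed is kept."""
    wanted = list(candidates)
    try:
        answers = query_remote(wanted)
    except Exception:
        # Any failure to reach or understand the remote: keep everything.
        return [], wanted

    droppable: List[str] = []
    kept: List[str] = []
    for oid in wanted:
        if answers.get(oid) is not None:
            droppable.append(oid)
        else:
            kept.append(oid)    # missing or unknown: do not risk it
    return droppable, kept


if __name__ == "__main__":
    def fake_remote(oids: List[str]) -> Dict[str, Optional[int]]:
        known = {"b1": 770118, "b2": 848522}
        return {oid: known.get(oid) for oid in oids}

    print(split_safe_to_drop(["b1", "b2", "b3"], fake_remote))
```

The important property is the default: an object only moves to the droppable list after an explicit positive answer.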
[1]: https://gitlab.com/siddharthshrimali/git/-/commit/968ac4a609def757c27c2a4823aa2d5dcf6de10b
[2]: https://gitlab.com/siddharthshrimali/git/-/commit/c5250b7ce3259fc40cac100fc60481633aa675f3
[3]: https://gitlab.com/siddharthshrimali/git/-/commit/49d7f64037eb1fbcaffe293ee9a1ca816ec8f521
[4]: https://lore.kernel.org/git/20260603111044.39116-1-r.siddharth.shrimali@gmail.com/T/#u
22nd June 2026 · GSoC Weeks 3 & 4
#06 - Weeks 3 & 4: The Scenic Route to Obvious Solutions
Welcome back!
I know, I know, it's been a long time, and I haven't posted anything, but... honestly speaking, I wasn't very productive recently because I had the privilege to attend the Open Source Summit in India, held by the Linux Foundation. I got to meet a few Linux maintainers and, let's be honest... when do you get to listen to Linus himself in person? It was an incredible experience, so I decided to combine the last two weeks into one big update.
Nevertheless, coming straight to the point: these two weeks were the heart of the project so far. In Week 3, I stopped writing code and started reading it, chasing one simple question: "How do I know it's safe to delete a blob?" Then, in Week 4, I discovered the foundation I'd carefully built my enumeration on couldn't actually see the objects I needed. Yup, you read that right :), I faced a wall again. Well, that’s always a fun thing to learn after you’ve already told your mentor it works, lol!
Here's the whole arc.
Part 1: How do I know it's safe to delete a blob?
After Week 2 wrapped up the enumeration (listing which blobs could be dropped), the next job was the safety check: before deleting any blob, I must be sure the promisor remote actually has it, so it can be fetched back. Get this wrong, and you corrupt someone's repository. No pressure :)
This is the exact concern that got the original git repack --filter work by my mentor Christian (who works at GitLab) rejected by Junio years ago. So when your mentor tells you "be careful here," and he has the scar tissue to prove it, you listen.
Two ways to check "does the remote have this blob?"
Reading through is_promisor_object() and promisor-remote.c, I found two fundamentally different ways to answer the safety question:
- • Local check — ask my own machine "is this object marked as coming from the promisor remote?"
- • Network check — ask the remote server directly "do you have this object?"
The network check sounds more authoritative and grown-up. It is also, as I discovered, currently impossible. Understanding why taught me a lot about how Git talks to remotes.
How Git talks to a remote
When you run git fetch, your computer has a two-way conversation with the server:
Client (your laptop) Server (GitHub/GitLab remote)
| |
| "I have X, what's new?" |
|----------------------------->|
| |
| "Here are the objects" |
|<-----------------------------|
| |
This conversation follows a format called Protocol v2. Each kind of question you can ask is called a capability.
What is object-info?
object-info is a capability that asks a different kind of question. Instead of "give me these objects" (a download), it asks:
"Do you have these objects? Don't send them, just tell me yes or no."
Client                         Server
  |                              |
  | "Do you have object ABC?"    |
  |----------------------------->|
  |                              |
  | "Yes, I have it"             |
  |<-----------------------------|
  |                              |
This is exactly what my project needs: confirm the remote has a blob before deleting it locally, without downloading it. Perfect. Solved. Blog over. Thanks for reading!
Except, no!
What actually exists today
For a capability to work, both computers need matching code: the server needs code to answer the question, and the client needs code to ask it. Here's today's state:
- • Server code (answer the question) → BUILT, merged, works today
- • Client code (ask the question) → NOT built for real use
The server half is finished. The client half is missing.
When you hear "object-info asks the server a question," your natural reaction is: "So the client CAN ask, right?" That was my reaction too. Here's the catch: there are two different things called "client code."
1. A test program that lives in Git's test folder. It can send one object-info request. Its entire job is to prove the server's answering code works, so the server side could be tested before it got merged.
2. Real client code inside everyday commands like git fetch or git repack, the code a normal command would use to send the request during actual work. This does not exist.
Test program      → can send object-info (exists, only to test the server)
Real git commands → can send object-info (does NOT exist)
So yes, something can send the question, but it's a test fixture, not something my git repack can call. I have no function like ask_remote(oids) to invoke.
That's what "the client can't ask" really means: not that the question is undefined, but that no usable function exists in real Git commands to send it.
The decision: use the local check
Since I can't phone the server, I check locally. When Git lazy-fetches an object from the promisor remote, it records on disk: "this came from the remote." is_promisor_object() reads that record. If the object came from the remote, the remote still has it, so it's safe to delete and re-fetch later. No network required.
One-line summary:
The server can answer "do you have object X?", but real Git commands have no code to ask it (only a test program can). So I use the local promisor check instead.
My mentors confirmed this is the right first step. Pablo, another GSoC contributor @Git from Spain, is in fact building the real client-side code, the missing ask_remote(), so that one day any command can ask the network authoritatively.
Thanks, Cheers Pablo! :)
When it lands, I can upgrade. Until then: local check it is.
Mapping the safety guards
With the approach decided, I mapped out the full safety decision tree:
--drop-filtered invoked
│
▼
Is a promisor remote configured?
│
NO ──┴──► die() ← no remote to fetch from = permanent data loss
│
YES
│
▼
(future) Does remote support object-info?
│
NO ──┴──► warn() + fall back to local check
│
YES
│
▼
(future) Ask remote: does it have these OIDs?
│
MISSING ─┴──► die() ← object not recoverable
│
ALL PRESENT
│
▼
Safe to drop
The top of the tree (no promisor remote → die()) is what I implement now. The bottom branches wait for that client-side object-info code to exist.
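The tree above can be sketched as one small pure function. This is only a model of the decision order, assuming hypothetical names (struct drop_inputs, decide_drop(), the DROP_* values); the two "future" inputs stand in for object-info answers that no real Git command can obtain yet.

```c
#include <assert.h>
#include <stddef.h>

/* Possible outcomes of the safety decision tree. */
enum drop_decision {
	DROP_DIE_NO_PROMISOR,	/* no remote to fetch from: refuse */
	DROP_LOCAL_CHECK,	/* warn, fall back to the local promisor check */
	DROP_DIE_MISSING,	/* remote lacks an object: refuse */
	DROP_SAFE		/* remote confirmed every object */
};

/* Hypothetical inputs; the last two model the future object-info path. */
struct drop_inputs {
	int has_promisor_remote;
	int remote_supports_object_info;
	int remote_has_all_oids;
};

/* Walk the tree top to bottom; the first failing guard decides. */
enum drop_decision decide_drop(const struct drop_inputs *in)
{
	assert(in != NULL);
	if (!in->has_promisor_remote)
		return DROP_DIE_NO_PROMISOR;
	if (!in->remote_supports_object_info)
		return DROP_LOCAL_CHECK;
	if (!in->remote_has_all_oids)
		return DROP_DIE_MISSING;
	return DROP_SAFE;
}
```

Keeping the guards in one function with no side effects makes the fail-safe ordering easy to test: every path that is not fully confirmed ends in a refusal or the local fallback, never in a silent drop.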
Part 2: The plot twist (or: my code worked for entirely the wrong reasons)
With the safety model clear, I went to test the enumeration properly. And here is where the week turned into a detective story.
The problem (yet another problem): empty output on a real partial clone
My --drop-filtered --dry-run worked beautifully on a full clone of git/git, listing thousands of large blobs. I was feeling great about myself.
Then I tested it on a real partial clone, you know, the actual use case the entire feature exists for, and the output was empty. Nothing. Zero blobs.
Nothing humbles you quite like your code confidently doing absolutely nothing on the one input that matters.
Tracing the cause
I tested git repack with GIT_TRACE and inspected packs with verify-pack. Turns out git repack handles promisor objects through a completely separate path, before my filter ever runs:
git repack -a runs these steps IN ORDER:
Step 1: repack_promisor_objects()
→ grabs ALL promisor objects (the lazy-fetched blobs)
→ repacks them into a separate promisor pack
Step 2: main pack-objects --exclude-promisor-objects --filter
→ builds the main pack, EXCLUDING promisor objects
→ the filter only ever sees non-promisor objects
Step 3: write_filtered_pack() ← my code read from here
→ "existing packs MINUS main pack"
→ but the promisor blobs already left the building
→ result: EMPTY
In plain terms:
git repack --filter is built to separate non-promisor objects. But --drop-filtered needs the promisor objects. These two sets never overlap. My code was diligently searching the one place the objects were guaranteed not to be.
WHAT --filter SEES: WHAT I NEED:
┌──────────────────┐ ┌──────────────────┐
│ non-promisor │ │ promisor blobs │
│ objects │ │ (lazy-fetched) │
└──────────────────┘ └──────────────────┘
↑ ↑
the filter what I want to
operates here drop is HERE
└──────── no overlap ───────┘
Why it "worked" on a full clone?
A full clone has zero promisor objects. So the filter happily separated large blobs and my code listed them, except those blobs weren't promisor objects and couldn't be safely dropped anyway. My demo had been passing by testing the one scenario where being wrong looks identical to being right. Classic.
The pivot (and a familiar idea coming back around)
I took this to my mentors, and their guidance reshaped the design. The interesting twist: way back in Week 1, my original instinct had been to walk the object database directly and pull out promisor blobs myself. Then early on, the advice had been to lean on Git's existing repack --filter machinery rather than a whole walk, which is generally excellent advice, reuse beats reinvention.
But this particular built-in machinery is built for the opposite of what I need. So my mentors suggested moving away from leaning on repack's internal filtered-pack output, and back toward direct enumeration, essentially my Week 1 design, but done properly.
After two weeks I had taken the scenic route to arrive exactly where I started, except now I actually understood why it was right. Sometimes the long way around is just the tutorial.
The fix: enumerate promisor objects directly
Instead of fetching promisor blobs out of a filter that excludes them by design, I walk them directly:
For each object in promisor packs (ODB_FOR_EACH_OBJECT_PROMISOR_ONLY):
is it a blob? → if not, skip
is it bigger than limit? → if not, skip
otherwise → it's a drop candidate
In code:
odb_for_each_object(repo->objects, NULL, collect_promisor_blob, &cb,
ODB_FOR_EACH_OBJECT_PROMISOR_ONLY);
Every object found this way is a promisor object by construction, so it's guaranteed recoverable from the remote. The walk itself is the safety guarantee.
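Here is a self-contained toy model of that walk, to show why a local blob can never become a candidate. Everything prefixed toy_ is made up for illustration and is not Git's object database API; the walk simply skips anything outside a promisor pack before the callback runs, and the callback applies the "is it a blob, is it at least the limit" checks.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_MAX_CANDIDATES 16

enum toy_type { TOY_COMMIT, TOY_TREE, TOY_BLOB };

/* Toy object record; the fields are illustrative, not Git's structs. */
struct toy_object {
	const char *name;
	enum toy_type type;
	uint64_t size;
	int in_promisor_pack;
};

/* Callback state: the size limit plus the candidates found so far. */
struct toy_collect_cb {
	uint64_t limit;
	const struct toy_object *candidates[TOY_MAX_CANDIDATES];
	size_t nr;
};

typedef int (*toy_each_fn)(const struct toy_object *obj, void *data);

/* Promisor-only walk: objects outside promisor packs are never visited. */
int toy_for_each_promisor_object(const struct toy_object *objs, size_t n,
				 toy_each_fn fn, void *data)
{
	size_t i;

	for (i = 0; i < n; i++) {
		int ret;

		if (!objs[i].in_promisor_pack)
			continue;
		ret = fn(&objs[i], data);
		if (ret)
			return ret;	/* stop on the first callback error */
	}
	return 0;
}

/* Keep blobs whose size is at least the limit; skip everything else. */
int toy_collect_large_blob(const struct toy_object *obj, void *data)
{
	struct toy_collect_cb *cb = data;

	if (obj->type != TOY_BLOB)
		return 0;
	if (obj->size < cb->limit)
		return 0;
	if (cb->nr == TOY_MAX_CANDIDATES)
		return -1;	/* fixed-size list is full */
	cb->candidates[cb->nr++] = obj;
	return 0;
}
```

Feeding it a promisor commit, a large promisor blob, a small promisor blob and a large local blob yields exactly one candidate: the large promisor blob. The local blob is filtered out by the walk itself, before any size check happens.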
Reusing the filter machinery (the right way)
My mentors made one more important point: even though I'm no longer using --filter to drive the enumeration, I still shouldn't reinvent the size-checking logic. So instead of hand-writing "is this blob bigger than the limit?", I'm building on Git's existing filter code with a small helper:
int list_objects_filter__filter_oidset(struct repository *r,
struct list_objects_filter_options *opts,
const struct oidset *in,
struct oidset *omitted);
Give it a set of OIDs, it returns the ones the filter selects. My enumeration collects all promisor blobs, hands them to this helper, and gets back the ones over the threshold. So it's the best of both: my own walk (so I actually find the objects), plus Git's battle-tested filter logic (so I'm not duplicating it). This list_objects_filter__* work is what I'm building out now.
The safety test that matters
The crucial test: a locally-created large blob, one that exists only on my machine and never on the remote, must never be listed as a drop candidate. Dropping it would be unrecoverable data loss, the cardinal sin.
Setup:
  remote-big.bin → lazy-fetched from remote (promisor) → should be LISTED
  local-big.bin  → created locally (not promisor)      → must NOT be listed
Result:
  remote-big: listed (correct)     <- works
  local-big:  not listed (correct) <- works
The promisor walk protects locally-created objects for free, they're simply not in any promisor pack, so the walk never sees them. The safest code is the code that never even looks.
Looking ahead
The safety model is decided, the enumeration finally works on the inputs that actually matter, and it leans on Git's existing filter machinery instead of my own arithmetic.
Next up: the actual deletion path, and wiring in the safety guards from the decision tree, the part where the stakes go from "prints nothing" to "deletes everything" :) Lol, no, I'm kidding: "deletes blobs that should be deleted." Should be relaxing. (I hope so).
Thanks for reading, see you next week!
29th June 2026 · GSoC Week 5
#07 - Week 5: The Plan That Almost Wrote Itself
Welcome back!
This week was calmer (and so this will be a relatively smaller blog :) (emphasis on relatively :) )), mostly consisting of reviews, feedback, and test writing.
Review cleanup
A couple of review comments first. My commit message claimed a helper "reuses the existing filter machinery", except it doesn't, it reimplements the size check by hand, because the real machinery only works during an object walk and I just had a set of OIDs. So I rewrote the commit message to explain the problem first (the real filter API only works during a walk), then what the commit does, with a NEEDSWORK comment marking it for a future refactor.
The test file I'd been ignoring
My test file was still the Week 1 version, testing the old approach on a full clone, the one scenario where being wrong looks identical to being right. The whole point of the feature is partial clones, so I rewrote it.
The fun part was learning how Git's test suite fakes a promisor pack.. ;), yeah, you read it right:
pack_as_from_promisor () {
HASH=$(git -C repo pack-objects .git/objects/pack/pack) &&
>repo/.git/objects/pack/pack-$HASH.promisor &&
echo $HASH
}
You pack an object and drop an empty .promisor file next to it. That >file creates a completely empty file. Zero bytes, and that's enough to mark the pack's objects as "promised." (Hold that thought.)
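The "existence is enough" idea can be sketched in C as well. The helper names below (promisor_marker_path(), pack_is_promisor()) are hypothetical and this is not Git's code; the sketch only assumes what the test helper shows, namely that the marker sits next to the pack with the same base name and a .promisor extension.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/*
 * ".../pack-<hash>.pack" -> ".../pack-<hash>.promisor".
 * Returns 0 on success, -1 if the name does not end in ".pack"
 * or the output buffer is too small.
 */
int promisor_marker_path(const char *pack_path, char *out, size_t size)
{
	static const char ext[] = ".pack";
	size_t len, ext_len = sizeof(ext) - 1;
	int n;

	assert(pack_path != NULL && out != NULL);
	len = strlen(pack_path);
	if (len < ext_len || strcmp(pack_path + len - ext_len, ext) != 0)
		return -1;

	n = snprintf(out, size, "%.*s.promisor",
		     (int)(len - ext_len), pack_path);
	if (n < 0 || (size_t)n >= size)
		return -1;
	return 0;
}

/* Only existence is checked; a zero-byte marker counts. */
int pack_is_promisor(const char *pack_path)
{
	char marker[4096];

	if (promisor_marker_path(pack_path, marker, sizeof(marker)) < 0)
		return 0;
	return access(marker, F_OK) == 0;
}
```

Nothing here reads the marker's contents, which is why the empty file created by the shell redirection is sufficient in the test.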
I added a safety check in the tests, since the project lives or dies on never deleting something you can't get back:
remote-big (lazy-fetched, promisor) ──► LISTED
local-big (created locally)         ──► NOT LISTED (deleting = data loss)
The local blob is protected, it's not in any promisor pack, so the enumeration never sees it.
The plot that almost didn't twist
Then I started designing deletion. My proposal had a careful write-before-delete-then-fsync to survive a crash mid-deletion, and I was planning to build all of it, until I realised git repack already does exactly that when it rebuilds a pack!
1. write the new pack to a temp location
2. fsync
3. install
4. only then delete the old one
Crash-safe ;)
Since you can't remove one object out of a pack, you rebuild the pack without it. So deletion collapses into: rebuild the promisor pack excluding my to_drop list, and let repack handle the dangerous parts.
promisor pack: [ commit | tree | BIG1 | BIG2 | small ]
to_drop = { BIG1, BIG2 }
rebuild → [ commit | tree | small ] + .promisor marker
BIG1/BIG2 gone locally, but remote.origin.promisor = true
→ lazy-fetches them back
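For readers who have not seen the write, fsync, install, delete ordering spelled out, here is a generic POSIX sketch of it. It is not repack's implementation, and install_file_atomically() and replace_pack() are names invented for this example; the point is only the order of operations, where the old file is removed strictly after the new one is durably in place.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Write len bytes to tmp_path, fsync them, then rename onto final_path.
 * Returns 0 on success, -1 on error (the temp file is cleaned up).
 */
int install_file_atomically(const char *tmp_path, const char *final_path,
			    const void *buf, size_t len)
{
	const char *p = buf;
	int fd;

	assert(tmp_path != NULL && final_path != NULL);
	fd = open(tmp_path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
	if (fd < 0)
		return -1;

	while (len > 0) {
		ssize_t w = write(fd, p, len);

		if (w < 0) {
			if (errno == EINTR)
				continue;	/* interrupted, retry */
			goto fail;
		}
		p += w;
		len -= (size_t)w;
	}
	if (fsync(fd) < 0)		/* data must be on disk before install */
		goto fail;
	if (close(fd) < 0) {
		unlink(tmp_path);
		return -1;
	}
	if (rename(tmp_path, final_path) < 0) {
		unlink(tmp_path);
		return -1;
	}
	return 0;

fail:
	close(fd);
	unlink(tmp_path);
	return -1;
}

/* The old file is deleted only after the new one is installed. */
int replace_pack(const char *old_path, const char *tmp_path,
		 const char *new_path, const void *buf, size_t len)
{
	if (install_file_atomically(tmp_path, new_path, buf, len) < 0)
		return -1;	/* old pack is left untouched */
	return unlink(old_path);
}
```

A crash at any point before the rename leaves the old pack intact, and a crash after it leaves both packs present, which is wasteful but not lossy. A fuller version would also fsync the containing directory so the rename itself is durable.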
I had also assumed I needed to write the dropped OIDs into the .promisor sidecar to keep them re-fetchable. But opening a real one showed it just contains the fetch's ref tips (<oid> HEAD, <oid> refs/heads/master), provenance, not a fetch list. Re-fetchability comes entirely from the promisor remote config. So my assumption in the proposal was wrong.
...and then it twisted anyway
I sent this to my mentors. They agreed on the mechanism, then raised two things I had waved off:
First: what happens when "recoverable" stops being true? If a branch is later deleted on the remote, the dropped object becomes unrecoverable, and a future lazy fetch fails with a generic could not fetch <oid> error, the user has no idea they dropped it on purpose. The fix is a persistent, reflog-style drop log: "dropped object X on date D, matched filter F, attested recoverable from remote R", in its own file. One day that will turn a baffling error into a clear explanation :)
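To picture what one entry in such a drop log might look like, here is a minimal sketch. The record layout, the single-line format and the function names are all assumptions made for illustration; the actual file format has not been designed yet.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical drop-log record; field names are illustrative only. */
struct drop_record {
	const char *oid_hex;		/* dropped object */
	long long timestamp;		/* seconds since the epoch */
	const char *filter_spec;	/* filter that matched it */
	const char *remote;		/* remote it was attested recoverable from */
};

/*
 * Format one line: "<oid> <timestamp> <filter> <remote>\n".
 * Returns the line length, or -1 if buf is too small.
 */
int format_drop_record(char *buf, size_t size, const struct drop_record *r)
{
	int n;

	assert(buf != NULL && r != NULL);
	n = snprintf(buf, size, "%s %lld %s %s\n",
		     r->oid_hex, r->timestamp, r->filter_spec, r->remote);
	if (n < 0 || (size_t)n >= size)
		return -1;
	return n;
}

/* Append-only write, so earlier entries are never rewritten. */
int append_drop_record(const char *path, const struct drop_record *r)
{
	char line[512];
	FILE *f;

	if (format_drop_record(line, sizeof(line), r) < 0)
		return -1;
	f = fopen(path, "a");
	if (!f)
		return -1;
	if (fputs(line, f) == EOF) {
		fclose(f);
		return -1;
	}
	return fclose(f) == 0 ? 0 : -1;
}
```

The append would have to happen before the object is actually removed, so that a crash in between leaves a log entry for an object that still exists rather than a missing object with no explanation.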
Second: are the blobs always packed? I had assumed yes. But large files could sometimes stay loose, right? So I tested, and the lazy fetch produced zero loose objects:
60 KB blob, defaults                  → packed
60 KB blob, fetch.unpackLimit=1       → still packed
5 MB blobs, core.bigFileThreshold=1m  → still packed
loose objects produced: zero
In every config I tried, the lazy-fetch path stored the objects in a promisor pack. The .promisor marker is pack-level, so the fetch keeps a pack to attach it to. So droppable blobs are always packed in practice, at least in these configurations. I am still reading more on this.
Looking ahead
Assuming I'm going in the right direction: wire the exclusion into repack's rebuild, append to the drop log before deleting, add the remaining safety guards, and write the test that matters, dropping a blob for real :)
The stakes should go from "prints a list" to "removes things from your disk."
See you next week, presumably from a repository that still exists. :)
Thanks for reading!