Sifting Through the Archive: Private Set Membership for Blossom

Blossom is a protocol for storing blobs on media servers, using SHA-256 hashes as content addresses - simple, elegant, and effective. But lurking beneath this simplicity is a privacy problem: every time a client checks whether a file exists via the HEAD /<sha256> endpoint, the server learns exactly which file the client is looking for. For users who wish to synchronize thousands of files without revealing their entire library to the server, a different approach is needed.

The Problem

Imagine you have 10,000 files and want to determine which ones already exist on a Blossom server before uploading. The naive approach - issuing 10,000 individual HEAD requests - certainly works, but it hands the server a complete manifest of your interests. The server can build a profile of what you store, correlate your queries over time, and potentially share or monetize that information. What we need is a privacy-preserving solution with specific properties: minimal server computation (ideally zero per query), minimal bandwidth consumption, the ability to push computation to the client, and most importantly, a guarantee that the server learns nothing about which hashes the client queries.

Why Probabilistic Filters Work

The key insight comes from understanding how probabilistic filters like Bloom, Cuckoo, and Binary Fuse filters behave. These data structures exhibit an asymmetric error profile: they never produce false negatives, meaning that when the filter declares an item "not present," that answer is absolutely reliable. However, they can produce false positives, occasionally claiming an item is "maybe present" when it actually is not. While zero false positives would obviously be preferable, this asymmetry turns out to be acceptable for our use case.

For file deduplication, this asymmetry aligns beautifully with our needs. When the filter says "not present," we know with certainty the file is missing and should be uploaded. When it says "maybe present," the file probably exists and we can safely skip the upload. The occasional false positive means we might skip uploading something that is actually missing, but this merely delays the upload until the next synchronization cycle after the filter rebuilds - an acceptable trade-off for most applications.

How Binary Fuse Filters Work

Binary Fuse filters, introduced by Graf and Lemire in 2022, represent the current state of the art in static set membership data structures. Understanding their elegance requires a brief journey through their construction.

The filter works by mapping each element to three distinct positions in an array through hash functions. During construction, the algorithm finds a "peeling order" - a sequence in which elements can be processed such that each element, when its turn comes, has at least one array position that no remaining element touches. Think of it like carefully removing books from a precarious stack: you must find the right order so that each book you remove doesn't disturb the others still waiting.

Once this peeling order is discovered, the filter assigns fingerprints (small hash values) to array positions in reverse order. For each element, the algorithm stores a value at one of its three positions such that XORing the values at all three positions yields the element's fingerprint. This XOR-based construction is what gives the filter its name and its efficiency.

To query whether an element exists, the client simply hashes the element to find its three positions, XORs the values stored there, and checks if the result matches the element's expected fingerprint. A match indicates the element is probably in the set; a mismatch proves definitively that it is not. The entire query requires just three memory lookups and two XOR operations - extraordinarily fast.
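To make the construction concrete, here is a minimal, illustrative sketch of a plain xor filter with 8-bit fingerprints - the un-segmented ancestor of the Binary Fuse design, without the segment layout that gives the latter its name. All names here are our own; it uses SHA-256 for hashing and three disjoint blocks so an element's three positions never collide. A real deployment would use one of the libraries listed later, not this toy.

```python
import hashlib

def _h(item: bytes, seed: int, block: int) -> int:
    """One array position for an item, derived from SHA-256."""
    d = hashlib.sha256(seed.to_bytes(4, "big") + item).digest()
    return int.from_bytes(d[:8], "big") % block

def _fingerprint(item: bytes) -> int:
    """An 8-bit fingerprint of the item."""
    return hashlib.sha256(b"fp" + item).digest()[0]

def _positions(item: bytes, block: int, attempt: int) -> list:
    # Three positions in three disjoint blocks, so they never collide.
    return [s * block + _h(item, attempt * 3 + s, block) for s in range(3)]

def build_filter(items):
    """Find a peeling order, then assign fingerprints in reverse order."""
    n = int(len(items) * 1.23) + 33          # slack so peeling succeeds
    block = n // 3
    for attempt in range(32):                # retry with fresh seeds on failure
        pos = {it: _positions(it, block, attempt) for it in items}
        count = [0] * (3 * block)
        occupants = [[] for _ in range(3 * block)]
        for it, ps in pos.items():
            for p in ps:
                count[p] += 1
                occupants[p].append(it)
        stack, peeled = [], set()
        queue = [i for i in range(3 * block) if count[i] == 1]
        while queue:                         # peel items that own a free slot
            slot = queue.pop()
            if count[slot] != 1:
                continue
            it = next(x for x in occupants[slot] if x not in peeled)
            peeled.add(it)
            stack.append((it, slot))
            for p in pos[it]:
                count[p] -= 1
                if count[p] == 1:
                    queue.append(p)
        if len(peeled) == len(items):        # peeling succeeded
            table = [0] * (3 * block)
            for it, slot in reversed(stack):
                acc = _fingerprint(it)       # the XOR of all three slots
                for p in pos[it]:            # must equal the fingerprint
                    if p != slot:
                        acc ^= table[p]
                table[slot] = acc
            return table, attempt
    raise RuntimeError("construction failed; add more slack")

def query(table, attempt: int, item: bytes) -> bool:
    """Three lookups and two XORs: 'maybe present' or definitely absent."""
    block = len(table) // 3
    p0, p1, p2 = _positions(item, block, attempt)
    return table[p0] ^ table[p1] ^ table[p2] == _fingerprint(item)
```

Every inserted element is guaranteed to query as "maybe present"; a non-member matches only when its fingerprint happens to collide, at roughly the 1/256 rate an 8-bit fingerprint implies.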

The "binary fuse" innovation specifically refers to how the array is partitioned into segments with a binary structure, which dramatically improves construction success rates and reduces the space overhead compared to earlier Xor filter designs.

Three Variants for Different Needs

Binary Fuse filters come in three variants that trade space for accuracy:

Variant         False positive rate     Size per element   Use case
Binary Fuse8    0.4% (1/256)            ~9 bits            General purpose, best space efficiency
Binary Fuse16   0.0015% (1/65536)       ~18 bits           When higher accuracy is required
Binary Fuse32   0.00000002% (1/2^32)    ~36 bits           Effectively perfect, for critical applications

All three variants share remarkably fast construction times - approximately 350 milliseconds for 10 million keys - and query times of roughly 25 nanoseconds per lookup.

Binary Fuse8 offers the best space efficiency, but with a 0.4% false positive rate, roughly 40 out of every 10,000 queries might incorrectly return "maybe present."

One crucial characteristic deserves emphasis: these filters are deterministic. The same hash queried against the same filter will always return the same result, which means a false positive will not resolve itself through repeated queries. It persists stubbornly until the server rebuilds the filter with a fresh structure. This is precisely why periodic rebuilds matter - they incorporate new files while also reshuffling which hashes happen to collide into false positives.

Binary Fuse16 is the sweet spot for most applications. It reduces the false positive rate to 0.0015%, exactly 256 times better than Fuse8, at the cost of doubling the filter size. At this rate, you would expect roughly one false positive per 65,000 queries - essentially negligible in practice.

Binary Fuse32 achieves what is effectively perfection for practical purposes - you would need billions of queries before encountering a single false positive. The filter grows to four times the size of Fuse8, but for applications where correctness is paramount, this trade-off is worthwhile.

For a server hosting 10 million files, the concrete numbers look like this:

Variant         Filter size   False positives per 10K queries
Binary Fuse8    ~11 MB        ~40
Binary Fuse16   ~22 MB        ~0.15
Binary Fuse32   ~45 MB        ~0.000002
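These figures follow directly from the per-element sizes and false positive rates above; a quick back-of-envelope check:

```python
# Back-of-envelope numbers for a server hosting 10 million files,
# using the approximate bits-per-element figures quoted above.
N = 10_000_000
variants = {
    "Binary Fuse8":  (9,  2**8),    # (bits per element, 1/FP-rate)
    "Binary Fuse16": (18, 2**16),
    "Binary Fuse32": (36, 2**32),
}
for name, (bits_per_key, fp_denominator) in variants.items():
    size_mb = N * bits_per_key / 8 / 1_000_000
    fp_per_10k = 10_000 / fp_denominator
    print(f"{name}: ~{size_mb:.2f} MB, ~{fp_per_10k:.6g} false positives per 10K queries")
```

Running this reproduces the table: roughly 11, 22, and 45 MB, with about 39, 0.15, and 0.000002 expected false positives per ten thousand queries.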

Once the client downloads the filter, all queries happen locally with perfect privacy - the server never learns which hashes the client checks.

The Dynamic Data Problem

Blossom servers, however, are not static archives. Files arrive constantly - perhaps one every ten seconds on a busy server - and Binary Fuse filters are immutable by design. You cannot incrementally add or remove elements; the entire structure must be rebuilt. Does this limitation doom the approach?

Not at all. The mathematics work decisively in our favor. Rebuilding a filter over 10 million hashes costs merely 350 milliseconds of computation. Performing this rebuild every hour consumes just 0.01% of a single CPU core - utterly negligible. Between hourly rebuilds, approximately 360 new files might arrive, and these files simply will not appear in the filter until the next rebuild cycle.
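The arithmetic behind those claims is simple enough to check directly:

```python
# Sanity check of the rebuild arithmetic above.
rebuild_seconds = 0.35        # ~350 ms to rebuild a 10M-key filter
interval_seconds = 3600       # hourly rebuilds
uploads_per_second = 0.1      # one new file every ten seconds

cpu_fraction = rebuild_seconds / interval_seconds
backlog = uploads_per_second * interval_seconds

print(f"CPU share of one core: {cpu_fraction:.4%}")          # ~0.0097%
print(f"Files missing from filter between rebuilds: {backlog:.0f}")  # 360
```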

When a client queries for one of these recently-added files, the filter correctly responds "not present" (after all, the file was not present when the filter was constructed), and the client attempts to upload it. The server, already possessing the file, deduplicates the upload and responds accordingly. The cost is minor bandwidth waste, not a correctness failure.

The Hybrid Solution

For applications demanding near-real-time accuracy without the overhead of constant filter rebuilds, the elegant solution combines the filter with compact delta lists:

GET /filter
  Returns:
    - Binary Fuse filter (rebuilt hourly/daily)
    - List of hashes added since filter was built
    - List of hashes removed since filter was built
    - Timestamp of filter generation

The client logic becomes straightforward: download the filter and delta lists (caching them with ETags for efficiency), then for each hash to check, first consult the removal list (if present, the file is definitely gone), then the addition list (if present, the file definitely exists), and finally the filter itself. Periodic refreshes keep the client synchronized with server state.
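That decision procedure fits in a few lines. Here is a sketch, with names of our own invention - `FilterSnapshot` stands in for whatever the client parses out of the `GET /filter` response, and `filter_query` wraps the local probabilistic lookup:

```python
from dataclasses import dataclass, field

@dataclass
class FilterSnapshot:
    """Client-side view of a GET /filter response (names are illustrative)."""
    filter_query: callable              # hash -> bool, local filter lookup
    added: set = field(default_factory=set)    # delta: uploaded since build
    removed: set = field(default_factory=set)  # delta: deleted since build

def needs_upload(snap: FilterSnapshot, sha256_hex: str) -> bool:
    if sha256_hex in snap.removed:
        return True            # definitely gone: deleted after the build
    if sha256_hex in snap.added:
        return False           # definitely present: uploaded after the build
    # "not present" from the filter is authoritative; "maybe present" means skip
    return not snap.filter_query(sha256_hex)
```

Note the ordering: the removal list must be consulted before the filter, since a deleted file may still produce "maybe present" from the stale filter.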

The beauty of this approach lies in the size disparity between the components. The delta lists remain tiny - a few hundred hashes accumulating between rebuilds amounts to perhaps 10-20 KB - while the filter efficiently handles the bulk of millions of hashes. Together, they deliver real-time accuracy with minimal bandwidth overhead.

Server Implementation

The server's responsibilities are minimal. It maintains three components: a base filter rebuilt on a schedule (hourly for busy servers, daily for quieter ones), an addition list accumulating hashes of files uploaded since the last filter build, and a removal list tracking deletions over the same period.

When a file arrives, its hash is appended to the addition list. When a file is deleted, its hash joins the removal list. When rebuild time comes, the server regenerates the filter from its database and clears both lists. The total server cost per client query is zero - all the computational work happens on the client side.
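The bookkeeping is little more than two lists and a timer. A sketch, with `build_filter` standing in for a call into a real Binary Fuse library:

```python
import threading
import time

class FilterState:
    """Sketch of the server-side bookkeeping described above."""

    def __init__(self, build_filter):
        self.build_filter = build_filter   # iterable of hashes -> filter
        self.added, self.removed = [], []  # deltas since the last build
        self.filter = build_filter([])
        self.built_at = time.time()
        self.lock = threading.Lock()

    def on_upload(self, sha256_hex: str):
        with self.lock:
            self.added.append(sha256_hex)

    def on_delete(self, sha256_hex: str):
        with self.lock:
            self.removed.append(sha256_hex)

    def rebuild(self, all_hashes):
        """Regenerate the filter from the database and clear both deltas."""
        with self.lock:
            self.filter = self.build_filter(all_hashes)
            self.added.clear()
            self.removed.clear()
            self.built_at = time.time()
```

A cron job or background thread calls `rebuild` on whatever schedule the server's upload rate warrants.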

A Potential BUD

This scheme could be formalized as a Blossom Upgrade Document:

GET /filter
  Headers:
    X-Filter-Type: binary-fuse-16
    X-Filter-Timestamp: 2024-01-15T12:00:00Z
    X-Element-Count: 10485760
    ETag: "abc123"
  
  Body: {
    "filter": <base64-encoded Binary Fuse filter>,
    "added": ["<sha256>", "<sha256>", ...],
    "removed": ["<sha256>", "<sha256>", ...]
  }
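Assembling that response is a one-liner per field. A sketch of the server side, under the assumption that the filter library can serialize to bytes; the header and field names simply mirror the hypothetical BUD above:

```python
import base64
import json

def encode_filter_response(filter_bytes: bytes, added, removed,
                           built_at: str, element_count: int):
    """Assemble headers and JSON body for the hypothetical GET /filter
    endpoint sketched above. Names mirror the draft, not a published BUD."""
    headers = {
        "X-Filter-Type": "binary-fuse-16",
        "X-Filter-Timestamp": built_at,
        "X-Element-Count": str(element_count),
    }
    body = json.dumps({
        "filter": base64.b64encode(filter_bytes).decode("ascii"),
        "added": list(added),
        "removed": list(removed),
    })
    return headers, body
```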

Clients indifferent to privacy can continue using HEAD /<sha256> as before, while privacy-conscious clients fetch the filter and query entirely locally.

Implementation

Binary Fuse filter libraries exist for most popular languages:

  • C: github.com/FastFilter/xor_singleheader
  • Go: github.com/FastFilter/xorfilter
  • Rust: docs.rs/xorf
  • Zig: github.com/hexops/fastfilter
  • JavaScript: Various npm packages

All of these libraries support serialization, making transmission of the filter over HTTP straightforward.

Conclusion

The combination of Binary Fuse filters and delta lists solves private set membership for Blossom with remarkable elegance. Clients enjoy perfect privacy since queries never leave their machines. Servers expend zero computation per query since all work happens client-side. Bandwidth remains minimal - a compact filter plus tiny delta lists. Real-time accuracy comes from the delta lists covering changes since the last rebuild. And implementation is straightforward thanks to well-tested libraries available in most languages.

The server rebuilds a filter periodically - a few hundred milliseconds of work - serves it to clients alongside two small lists, and that is all. Clients download once, query locally as often as they wish, and refresh periodically. No cryptographic complexity, no multi-round protocols, just efficient data structures doing precisely what they were designed to do.

Sometimes the simplest solution is also the right one.