01 — Object Storage Abstraction

Topic: object storage abstraction. Refines ADR-0005. The seam between BitVault and the five providers (MinIO, S3, R2, GCS, Azure Blob). The design problem: expose enough power to be efficient, hide enough difference to stay portable, and never silently emulate a capability a provider lacks.

1. The interface (narrow, capability-flagged)

The abstraction is a small surface; everything else is built on top in BitVault code (chunking, dedup, packing are ours, not the provider’s).

Core (every adapter MUST implement): Put · Get(range) · Head · Delete · List(prefix, cursor) · Copy(intra-provider) · Presign(method, key, constraints, ttl) · multipart group (InitMultipart · PresignPart · CompleteMultipart · AbortMultipart · ListParts).

Capability-flagged (advertised, queried by callers, never emulated): ConditionalPut (write-once / If-None-Match) · ObjectLock (WORM) · ServerSideCopyCrossBucket · BatchDelete · StorageClasses[] (tier set) · ChecksumOnPut[] (CRC32C/SHA-256/CRC64) · PresignContentLengthRange · StrongListAfterWrite · RemoteTieringILM.

Principle: a missing capability is surfaced, never faked. If ConditionalPut is absent, the caller chooses a documented fallback (e.g. content addressing makes writes idempotent anyway; see §4). It does not get a slow read-modify-write pretending to be atomic.
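
A minimal sketch of the capability-flag idea, assuming a Python rendering of the interface. The names `Capability`, `Adapter`, `CapabilityMissing`, and `choose_write_strategy` are hypothetical and only a subset of the flags is shown; the point is that the caller queries what the adapter advertises and branches explicitly, and that an unsupported feature raises instead of being emulated.

```python
from enum import Flag, auto


class Capability(Flag):
    """Optional features an adapter may advertise (subset of the list above)."""
    NONE = 0
    CONDITIONAL_PUT = auto()
    OBJECT_LOCK = auto()
    BATCH_DELETE = auto()
    STRONG_LIST_AFTER_WRITE = auto()


class CapabilityMissing(Exception):
    """Raised instead of silently emulating an unsupported feature."""


class Adapter:
    """Hypothetical adapter base: advertises capabilities, never fakes them."""

    def __init__(self, name: str, capabilities: Capability):
        self.name = name
        self.capabilities = capabilities

    def supports(self, cap: Capability) -> bool:
        return cap in self.capabilities

    def require(self, cap: Capability) -> None:
        # Fail loudly; the caller owns the fallback decision.
        if not self.supports(cap):
            raise CapabilityMissing(f"{self.name} does not advertise {cap.name}")


def choose_write_strategy(adapter: Adapter) -> str:
    """Caller-side branch: use ConditionalPut when present, else the CAS fallback."""
    if adapter.supports(Capability.CONDITIONAL_PUT):
        return "conditional-put"
    return "idempotent-cas-put"
```

The fallback label returned here stands for the documented alternative from §4; the adapter itself contains no emulation path.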

2. Capability matrix (the differences that actually bite)

Capability	MinIO	S3	R2	GCS	Azure Blob	How we cope
Presigned URL (GET/PUT)	✓	✓	✓	✓ (V4)	✓ (SAS)	required of all; core to ADR-0011
Multipart upload	✓ (S3 API)	✓	✓ (S3 API)	✓ (XML MPU / resumable / compose)	✓ (Put Block + Block List)	adapter maps to native; see 05
Part-size limits	S3-like	5 MiB–5 GiB, ≤10k parts	S3-like	differs	block ≤4000 MiB, ≤50k blocks	adapter exposes `MaxPart`,`MaxParts`
Conditional write-once (If-None-Match)	partial	✓ (2024)	✓	✓ (`x-goog-if-generation-match:0`)	✓ (`If-None-Match:*`)	optimization only; correctness rests on CAS idempotency (§4)
Object Lock / WORM	✓	✓	partial	✓ (retention/bucket lock)	✓ (immutability policy)	used for compliance retention (07)
Storage classes / tiers	1 (+ remote ILM)	Std/IA/Glacier…	1 (no archive)	Std/Nearline/Coldline/Archive	Hot/Cool/Cold/Archive	tiering policy adapts to available classes (10)
Checksum on PUT	CRC/SHA	CRC32C/SHA-256	partial	CRC32C/MD5	CRC64/MD5	belt-and-suspenders to our own hash (04)
Batch delete	✓	✓ (1000/call)	✓	per-object / batch API	per-object / batch	GC batches where supported (11)
Strong read-after-write	✓	✓ (since 2020)	✓	✓	✓	we still verify with Head (04)

The matrix is the artifact that keeps the abstraction honest. Every adapter ships a populated capability set and passes the same conformance suite (ADR-0005), so “supported” means “tested”, not “claimed”.
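
As an illustration of how a caller might use the `MaxPart` / `MaxParts` values an adapter exposes, the sketch below picks a part size for a multipart upload. The `PartLimits` type, the `plan_parts` function, and the 64 MiB preferred part size are assumptions made for this example; only the S3 figures (5 MiB to 5 GiB per part, at most 10,000 parts) come from the matrix.

```python
from dataclasses import dataclass
from typing import Tuple

MiB = 1024 * 1024
GiB = 1024 * MiB


@dataclass(frozen=True)
class PartLimits:
    min_part: int   # smallest allowed non-final part, in bytes
    max_part: int   # adapter's MaxPart, in bytes
    max_parts: int  # adapter's MaxParts


# S3 figures from the matrix: 5 MiB to 5 GiB per part, at most 10,000 parts.
S3_LIMITS = PartLimits(min_part=5 * MiB, max_part=5 * GiB, max_parts=10_000)


def ceil_div(a: int, b: int) -> int:
    """Integer ceiling division (avoids float rounding on large sizes)."""
    return -(-a // b)


def plan_parts(object_size: int, limits: PartLimits,
               preferred: int = 64 * MiB) -> Tuple[int, int]:
    """Return (part_size, part_count) that fits the limits, or raise ValueError."""
    if object_size <= 0:
        raise ValueError("object_size must be positive")
    # Smallest part size that keeps the part count within max_parts.
    floor_for_count = ceil_div(object_size, limits.max_parts)
    part_size = max(preferred, limits.min_part, floor_for_count)
    if part_size > limits.max_part:
        raise ValueError("object too large for this provider's multipart limits")
    return part_size, ceil_div(object_size, part_size)
```

With these limits a 1 GiB object is planned as 16 parts of 64 MiB, while a 1 TiB object forces a larger part size so the count stays within 10,000.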

3. Addressing scheme (bucket & key layout)

Keys are content-addressed and tenant-prefixed, designed for isolation, even request distribution, and cheap lifecycle scoping:

<tenant_id>/<class>/<hash[0:2]>/<hash[2:4]>/<hash>
                 │          └─ 2-level fan-out prefix: spreads load, bounds list size
                 └─ class ∈ { chunk, pack, manifest } (manifests may live in DB instead)

Tenant prefix → isolation (ADR-0007), per-tenant lifecycle rules, per-tenant usage accounting, and clean tenant deletion (drop the prefix).
Hash fan-out (hash[0:2]/hash[2:4]) → avoids hot prefixes and unbounded flat listings. With a uniform hash, two levels of two hex characters each give 256 × 256 = 65 536 leaf prefixes, so a List of any single leaf prefix stays bounded even at billions of objects: roughly 15 000 keys per leaf prefix for every billion objects (the classic S3 “sequential key prefix = hot partition” failure is avoided by construction).
Bucket strategy: few buckets per (provider, region, tier); not a bucket per tenant (providers cap buckets ~100–1000, so a bucket per tenant cannot reach millions of tenants). Tenancy is a prefix; isolation is enforced by scoped credentials + presign, not by bucket boundaries.
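
The key layout above can be sketched as follows. This is an illustrative helper, not the project's code: `object_key` and `tenant_prefix` are hypothetical names, and SHA-256 with lowercase hex is an assumption, since the section does not name the hash function.

```python
import hashlib

CLASSES = {"chunk", "pack", "manifest"}

# Two levels of two hex characters each: 256 * 256 leaf prefixes.
LEAF_PREFIXES = 16 ** 4


def object_key(tenant_id: str, cls: str, content: bytes) -> str:
    """Build <tenant_id>/<class>/<hash[0:2]>/<hash[2:4]>/<hash>."""
    if cls not in CLASSES:
        raise ValueError(f"unknown class: {cls}")
    if not tenant_id or "/" in tenant_id:
        raise ValueError("tenant_id must be non-empty and must not contain '/'")
    # Assumption: SHA-256, lowercase hex. The source does not fix the algorithm.
    digest = hashlib.sha256(content).hexdigest()
    return f"{tenant_id}/{cls}/{digest[0:2]}/{digest[2:4]}/{digest}"


def tenant_prefix(tenant_id: str) -> str:
    """Prefix that scopes lifecycle rules, accounting, and tenant deletion."""
    return f"{tenant_id}/"
```

Because the key depends only on tenant, class, and content, two uploads of the same bytes by the same tenant always resolve to the same key.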

Tradeoffs. Prefix fan-out adds key length and a tiny indirection vs flat keys, but is mandatory at scale. Alternative (bucket-per-tenant) gives stronger blast-radius isolation and per-bucket policies but collapses past a few thousand tenants — reserved as an enterprise dedicated-deployment option, not the default. Scaling: with content-hash keys, write load is uniformly distributed for free; no manual partitioning needed.

4. Consistency & idempotency (how CAS rescues us from provider quirks)

Content addressing turns most consistency problems into non-problems:

Writes are idempotent. Put(hash, bytes) for an existing object is a no-op by definition (same bytes, same name). Two racing uploads of the same chunk converge; no locking, no conditional write required for correctness. (This is why ConditionalPut is an optimization, not a dependency.)
No update-in-place, ever. Objects are immutable; “change” = write a new hash. This removes read-after-update consistency from the design entirely.
Read-after-write: all five now offer strong read-after-write, but we still Head-verify (size + checksum) before committing a manifest (SI-1), so even a weakly-consistent or buggy provider cannot produce a committed-but-absent chunk.
List is used only by maintenance jobs (GC/scrub/packer), never on the hot path, so list consistency matters only there; those jobs are designed to tolerate eventual or lagging listings (grace periods, re-scan) (11).
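
The write-then-verify flow described above can be sketched with a toy in-memory store. Everything here is hypothetical: `InMemoryStore` stands in for a provider adapter, SHA-256 is an assumed hash, and the toy `head` recomputes the checksum where a real adapter would return the provider's stored metadata.

```python
import hashlib
from typing import Dict, Optional, Tuple


class InMemoryStore:
    """Toy stand-in for a provider adapter; only Put and Head are modeled."""

    def __init__(self) -> None:
        self._objects: Dict[str, bytes] = {}

    def put(self, key: str, data: bytes) -> None:
        # Re-putting identical bytes under the same key leaves the store unchanged.
        self._objects[key] = data

    def head(self, key: str) -> Optional[Tuple[int, str]]:
        data = self._objects.get(key)
        if data is None:
            return None
        return len(data), hashlib.sha256(data).hexdigest()


def put_chunk(store: InMemoryStore, data: bytes) -> str:
    """Content-addressed write: the key is derived from the bytes, so retries and races converge."""
    key = hashlib.sha256(data).hexdigest()
    store.put(key, data)
    return key


def verify_before_commit(store: InMemoryStore, key: str, expected_size: int) -> bool:
    """Head-verify size and checksum; commit the manifest only if this returns True."""
    meta = store.head(key)
    if meta is None:
        return False  # absent chunk: never commit a manifest that references it
    size, checksum = meta
    return size == expected_size and checksum == key
```

Two racing calls to `put_chunk` with the same bytes leave exactly one object behind, which is why no lock or conditional write is needed for correctness.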

5. Tradeoffs / Alternatives / Scaling (the abstraction itself)

Tradeoffs. A capability-flagged interface pushes some branching to callers (e.g. “tier to Archive” only where StorageClasses includes it). That is the honest cost of multi-cloud; the alternative (hidden emulation) produces silent correctness/perf cliffs.
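
The caller-side branching mentioned above might look like the sketch below. `pick_storage_class` is a hypothetical helper and the class names in the usage note are illustrative labels, not exact provider identifiers; the only behavior it demonstrates is that a missing tier is reported to the caller rather than substituted silently.

```python
from typing import Optional, Sequence


def pick_storage_class(wanted: Sequence[str], available: Sequence[str]) -> Optional[str]:
    """Return the first class in the caller's preference order that the adapter advertises.

    Returns None when nothing matches, so the caller must decide what to do;
    no class is substituted silently.
    """
    offered = set(available)
    for cls in wanted:
        if cls in offered:
            return cls
    return None
```

For example, with a preference order of `["Archive", "Coldline"]`, an adapter advertising `["Standard", "Nearline", "Coldline", "Archive"]` yields `"Archive"`, while an adapter advertising only `["Standard"]` yields `None` and the tiering policy has to skip or defer the move.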

Alternatives considered.

Adopt a third-party multi-cloud blob library wholesale (e.g. a gocloud-style package): fine as an adapter implementation detail, but we keep the interface + capability model + conformance ours, so that provider quirks never leak upstream and so that we can express BitVault-specific needs (presign constraints, tiering, checksums) that the library may not support.
Use each provider’s higher-level features (provider-side lifecycle, provider-side dedup): rejected as the primary mechanism — dedup/chunking/packing are core value and must behave identically across providers; we use provider lifecycle only as an executor of our policy where available (10).
Lowest-common-denominator interface: rejected — would forfeit multipart, tiering, WORM, and checksums that materially affect cost and durability.

Scaling concerns.

Request cost & rate limits dominate at scale, not capacity. S3 sustains roughly 3,500 PUT and 5,500 GET requests per second per prefix; hash fan-out spreads this. Packing (02, 11) collapses billions of chunk requests into far fewer pack requests, which makes it the main lever.
Listing at scale is expensive and paginated; we never rely on List for authoritative state (that’s the Chunk/Pack Index in Postgres, 08). List is a reconciliation tool only.
Per-provider quotas (buckets, multipart lifetime, max object) are encoded as adapter constants and respected by Placement (09).

References

AWS S3 multipart limits & quick facts: https://docs.aws.amazon.com/AmazonS3/latest/userguide/qfacts.html
S3 strong consistency (2020): https://aws.amazon.com/s3/consistency/
GCS conditional requests (x-goog-if-generation-match): https://cloud.google.com/storage/docs/request-preconditions
Azure Blob block blobs (Put Block / Put Block List): https://learn.microsoft.com/azure/storage/blobs/