01 — Object Storage Abstraction
Topic: object storage abstraction. Refines ADR-0005. The seam between BitVault and the five providers (MinIO, S3, R2, GCS, Azure Blob). The design problem: expose enough power to be efficient, hide enough difference to stay portable, and never silently emulate a capability a provider lacks.
1. The interface (narrow, capability-flagged)
The abstraction is a small surface; everything else is built on top in BitVault code (chunking, dedup, packing are ours, not the provider’s).
Core (every adapter MUST implement):
Put · Get(range) · Head · Delete · List(prefix, cursor) · Copy(intra-provider) ·
Presign(method, key, constraints, ttl) · multipart group (InitMultipart ·
PresignPart · CompleteMultipart · AbortMultipart · ListParts).
Capability-flagged (advertised, queried by callers, never emulated):
ConditionalPut (write-once / If-None-Match) · ObjectLock (WORM) ·
ServerSideCopyCrossBucket · BatchDelete · StorageClasses[] (tier set) ·
ChecksumOnPut[] (CRC32C/SHA-256/CRC64) · PresignContentLengthRange ·
StrongListAfterWrite · RemoteTieringILM.
Principle: a missing capability is surfaced, never faked. If ConditionalPut
is absent, the caller chooses a documented fallback (e.g. content addressing
makes writes idempotent anyway; see §4). It does not get a slow
read-modify-write pretending to be atomic.
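The principle above can be sketched as follows. This is a minimal illustration, not the BitVault interface: `Capability`, `InMemoryAdapter`, `CapabilityMissing`, and `store_chunk` are hypothetical names, and the in-memory adapter stands in for a real provider client. The point is that the adapter refuses an unsupported call and the caller branches on the advertised flag.

```python
from dataclasses import dataclass, field
from enum import Flag, auto


class Capability(Flag):
    """Boolean capabilities an adapter may advertise (subset, names assumed)."""
    CONDITIONAL_PUT = auto()
    OBJECT_LOCK = auto()
    BATCH_DELETE = auto()


class CapabilityMissing(Exception):
    """Raised instead of emulating a capability the provider lacks."""


@dataclass
class InMemoryAdapter:
    """Hypothetical adapter for illustration only; not a real provider client."""
    capabilities: Capability
    objects: dict = field(default_factory=dict)

    def put(self, key: str, data: bytes) -> None:
        self.objects[key] = data

    def put_if_absent(self, key: str, data: bytes) -> bool:
        # Refuse rather than fake atomicity with a read-modify-write.
        if Capability.CONDITIONAL_PUT not in self.capabilities:
            raise CapabilityMissing("ConditionalPut")
        if key in self.objects:
            return False
        self.objects[key] = data
        return True


def store_chunk(adapter: InMemoryAdapter, key: str, data: bytes) -> str:
    """Caller-side branching: use ConditionalPut when advertised, else plain Put."""
    if Capability.CONDITIONAL_PUT in adapter.capabilities:
        created = adapter.put_if_absent(key, data)
        return "conditional-created" if created else "conditional-exists"
    # Documented fallback: content-addressed keys make a plain Put idempotent.
    adapter.put(key, data)
    return "plain-put"
```

An adapter built with `Capability(0)` advertises nothing, so `store_chunk` takes the plain-Put path and a direct `put_if_absent` call raises `CapabilityMissing`.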
2. Capability matrix (the differences that actually bite)
| Capability | MinIO | S3 | R2 | GCS | Azure Blob | How we cope |
|---|---|---|---|---|---|---|
| Presigned URL (GET/PUT) | ✓ | ✓ | ✓ | ✓ (V4) | ✓ (SAS) | required of all; core to ADR-0011 |
| Multipart upload | ✓ (S3 API) | ✓ | ✓ (S3 API) | ✓ (XML MPU / resumable / compose) | ✓ (Put Block + Block List) | adapter maps to native; see 05 |
| Part-size limits | S3-like | 5 MiB–5 GiB, ≤10k parts | S3-like | differs | block ≤4000 MiB, ≤50k blocks | adapter exposes MaxPart,MaxParts |
| Conditional write-once (If-None-Match) | partial | ✓ (2024) | ✓ | ✓ (x-goog-if-generation-match:0) | ✓ (If-None-Match:*) | optimization only; correctness rests on CAS idempotency (§4) |
| Object Lock / WORM | ✓ | ✓ | partial | ✓ (retention/bucket lock) | ✓ (immutability policy) | used for compliance retention (07) |
| Storage classes / tiers | 1 (+ remote ILM) | Std/IA/Glacier… | 1 (no archive) | Std/Nearline/Coldline/Archive | Hot/Cool/Cold/Archive | tiering policy adapts to available classes (10) |
| Checksum on PUT | CRC/SHA | CRC32C/SHA-256 | partial | CRC32C/MD5 | CRC64/MD5 | belt-and-suspenders to our own hash (04) |
| Batch delete | ✓ | ✓ (1000/call) | ✓ | per-object / batch API | per-object / batch | GC batches where supported (11) |
| Strong read-after-write | ✓ | ✓ (since 2020) | ✓ | ✓ | ✓ | we still verify with Head (04) |
The matrix is the artifact that keeps the abstraction honest. Every adapter ships a populated capability set + passes one conformance suite (ADR-0005) so “supported” means “tested”, not “claimed”.
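As an illustration of how a caller might consume the part-size limits an adapter exposes, the sketch below picks a part size that respects both bounds. `PartLimits`, `plan_parts`, the `min_part` field, and the 64 MiB preferred size are assumptions made for this example; only the S3 figures (5 MiB to 5 GiB, at most 10,000 parts) come from the matrix.

```python
from dataclasses import dataclass

MIB = 1024 * 1024
GIB = 1024 * MIB


@dataclass(frozen=True)
class PartLimits:
    """Hypothetical shape for the limits an adapter exposes."""
    min_part: int   # smallest allowed part (bytes), last part excepted
    max_part: int   # largest allowed part (bytes)
    max_parts: int  # maximum number of parts per upload


# Values taken from the S3 column of the matrix above.
S3_LIMITS = PartLimits(min_part=5 * MIB, max_part=5 * GIB, max_parts=10_000)


def ceil_div(a: int, b: int) -> int:
    """Integer ceiling division, avoiding float rounding on large sizes."""
    return (a + b - 1) // b


def plan_parts(object_size: int, limits: PartLimits,
               preferred_part: int = 64 * MIB) -> tuple:
    """Return (part_size, part_count) that fits within the adapter's limits."""
    if object_size <= 0:
        raise ValueError("object_size must be positive")
    # Smallest part size that keeps the part count within max_parts.
    floor_size = ceil_div(object_size, limits.max_parts)
    part_size = max(preferred_part, floor_size, limits.min_part)
    if part_size > limits.max_part:
        raise ValueError("object too large for this provider's multipart limits")
    return part_size, ceil_div(object_size, part_size)
```

For a 1 GiB object under these limits the preferred 64 MiB part size is kept. For a 1 TiB object the part size grows so that the count stays at or below 10,000.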
3. Addressing scheme (bucket & key layout)
Keys are content-addressed and tenant-prefixed, designed for isolation, even request distribution, and cheap lifecycle scoping:
<tenant_id>/<class>/<hash[0:2]>/<hash[2:4]>/<hash>
            │       └─ 2-level fan-out prefix: spreads load, bounds list size
            └─ class ∈ { chunk, pack, manifest } (manifests may live in DB instead)
- Tenant prefix → isolation (ADR-0007), per-tenant lifecycle rules, per-tenant usage accounting, and clean tenant deletion (drop the prefix).
- Hash fan-out (hash[0:2]/hash[2:4]) → avoids hot prefixes and unbounded flat listings. With a uniform hash, two levels of two hex characters each give 256 × 256 = 65,536 prefixes, keeping any single List bounded even at billions of objects (the classic S3 “sequential key prefix = hot partition” failure is avoided by construction).
- Bucket strategy: few buckets per (provider, region, tier); not a bucket per tenant (providers cap buckets ~100–1000; millions of tenants is impossible). Tenancy is a prefix, isolation is enforced by scoped credentials + presign, not by bucket boundaries.
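The key layout above can be sketched as a small helper. This is illustrative only: `object_key` is a hypothetical name, and SHA-256 is assumed as the content hash because this section does not name the hash function.

```python
import hashlib

# Object classes from the layout above.
CLASSES = {"chunk", "pack", "manifest"}

# Two fan-out levels of two hex characters each: 16**4 distinct prefixes.
FANOUT_PREFIXES = 16 ** 4


def object_key(tenant_id: str, cls: str, data: bytes) -> str:
    """Build <tenant_id>/<class>/<hash[0:2]>/<hash[2:4]>/<hash> for some bytes."""
    if cls not in CLASSES:
        raise ValueError(f"unknown class: {cls!r}")
    if not tenant_id or "/" in tenant_id:
        # A slash in the tenant id would break prefix-based isolation.
        raise ValueError("tenant_id must be non-empty and contain no '/'")
    # Assumption: SHA-256 hex digest as the content address.
    digest = hashlib.sha256(data).hexdigest()
    return f"{tenant_id}/{cls}/{digest[0:2]}/{digest[2:4]}/{digest}"
```

Because the key is derived purely from the tenant, the class, and the bytes, two uploads of the same chunk by the same tenant always resolve to the same key.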
Tradeoffs. Prefix fan-out adds key length and a tiny indirection vs flat keys, but is mandatory at scale. Alternative (bucket-per-tenant) gives stronger blast-radius isolation and per-bucket policies but collapses past a few thousand tenants — reserved as an enterprise dedicated-deployment option, not the default. Scaling: with content-hash keys, write load is uniformly distributed for free; no manual partitioning needed.
4. Consistency & idempotency (how CAS rescues us from provider quirks)
Content addressing turns most consistency problems into non-problems:
- Writes are idempotent. Put(hash, bytes) for an existing object is a no-op by definition (same bytes, same name). Two racing uploads of the same chunk converge; no locking, no conditional write required for correctness. (This is why ConditionalPut is an optimization, not a dependency.)
- No update-in-place, ever. Objects are immutable; “change” = write a new hash. This removes read-after-update consistency from the design entirely.
- Read-after-write: all five now offer strong read-after-write, but we still Head-verify (size + checksum) before committing a manifest (SI-1), so even a weakly-consistent or buggy provider cannot produce a committed-but-absent chunk.
- List consistency is only used by maintenance (GC/scrub/packer), never on the hot path, and those jobs are designed to tolerate eventual/lagging listings (grace periods, re-scan) (11).
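The idempotent-write and Head-verify rules above can be sketched with an in-memory stand-in. Everything here is hypothetical (`FakeStore`, `put_chunk`, `commit_manifest`, `ChunkMissing`). As a simplification, `head` recomputes SHA-256 over the stored bytes, whereas a real provider would return its own stored checksum.

```python
import hashlib
from dataclasses import dataclass, field


class ChunkMissing(Exception):
    """A manifest references a chunk that Head cannot confirm."""


@dataclass
class FakeStore:
    """Hypothetical in-memory stand-in for a provider adapter."""
    objects: dict = field(default_factory=dict)

    def put(self, key: str, data: bytes) -> None:
        self.objects[key] = data

    def head(self, key: str):
        data = self.objects.get(key)
        if data is None:
            return None
        # Simplification: recompute the checksum instead of reading stored metadata.
        return {"size": len(data), "sha256": hashlib.sha256(data).hexdigest()}


def put_chunk(store: FakeStore, data: bytes) -> str:
    """Content-addressed Put: same bytes always map to the same key."""
    digest = hashlib.sha256(data).hexdigest()
    store.put(digest, data)  # re-writing identical bytes changes nothing
    return digest


def commit_manifest(store: FakeStore, manifest: list, committed: list) -> None:
    """Head-verify every (digest, size) entry before recording the manifest."""
    for digest, size in manifest:
        meta = store.head(digest)
        if meta is None or meta["size"] != size or meta["sha256"] != digest:
            raise ChunkMissing(digest)
    committed.append(manifest)
```

If any chunk fails verification the manifest is not appended, so a committed manifest never references an absent chunk in this sketch.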
5. Tradeoffs / Alternatives / Scaling (the abstraction itself)
Tradeoffs. A capability-flagged interface pushes some branching to callers
(e.g. “tier to Archive” only where StorageClasses includes it). That is the
honest cost of multi-cloud; the alternative (hidden emulation) produces silent
correctness/perf cliffs.
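The caller-side branching described above might look like the following for a set-valued capability. This is a sketch under assumptions: `choose_storage_class` is a hypothetical helper, and the class names are copied from the matrix in §2 purely as sample data.

```python
from typing import Iterable, Optional, Sequence


def choose_storage_class(available: Iterable[str],
                         preference: Sequence[str]) -> Optional[str]:
    """Return the first preferred class the adapter advertises, else None.

    None means "do not tier here"; the caller skips the transition rather
    than emulating a class the provider does not have.
    """
    advertised = set(available)
    for cls in preference:
        if cls in advertised:
            return cls
    return None


# Sample capability sets, taken from the matrix in section 2.
GCS_CLASSES = ["Standard", "Nearline", "Coldline", "Archive"]
R2_CLASSES = ["Standard"]

# A cold-data policy expressed as an ordered preference (illustrative).
COLD_PREFERENCE = ["Archive", "Coldline"]
```

With these sample sets, the policy resolves to a class on GCS and to `None` on R2, where the caller leaves the object in place.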
Alternatives considered.
- Adopt a third-party multi-cloud blob library wholesale (e.g. a gocloud-style package): fine as an adapter implementation detail, but we keep the interface + capability model + conformance ours so provider quirks never leak upstream and so we can express BitVault-specific needs (presign constraints, tiering, checksums) the library may not.
- Use each provider’s higher-level features (provider-side lifecycle, provider-side dedup): rejected as the primary mechanism, because dedup/chunking/packing are core value and must behave identically across providers; we use provider lifecycle only as an executor of our policy where available (10).
- Lowest-common-denominator interface: rejected — would forfeit multipart, tiering, WORM, and checksums that materially affect cost and durability.
Scaling concerns.
- Request cost & rate limits dominate at scale, not capacity. S3 sustains roughly 3,500 PUT and 5,500 GET requests per second per prefix; hash fan-out spreads this. Packing (§ 02, 11) collapses billions of chunk requests into far fewer pack requests, which is the main lever.
- Listing at scale is expensive and paginated; we never rely on List for authoritative state (that’s the Chunk/Pack Index in Postgres, 08). List is a reconciliation tool only.
- Per-provider quotas (buckets, multipart lifetime, max object) are encoded as adapter constants and respected by Placement (09).
References
- AWS S3 multipart limits & quick facts: https://docs.aws.amazon.com/AmazonS3/latest/userguide/qfacts.html
- S3 strong consistency (2020): https://aws.amazon.com/s3/consistency/
- GCS conditional requests (x-goog-if-generation-match): https://cloud.google.com/storage/docs/request-preconditions
- Azure Blob block blobs (Put Block / Put Block List): https://learn.microsoft.com/azure/storage/blobs/