UnixFS CID Profiles
Summary
This proposal introduces configuration profiles for CIDs that represent files and directories using UnixFS. The legacy profiles table also documents non-UnixFS implementations for reference.
Motivation
While CIDs and UnixFS DAGs are cryptographically verifiable, the same file or directory can produce different CIDs across UnixFS implementations, because DAG construction parameters like chunk size, DAG width, and layout vary between tools. Often, these parameters are not even configurable by users.
This creates two problems:
- Broken hash semantics: Unlike standard hash functions where identical input produces identical output, UnixFS CIDs depend on DAG construction parameters. Simple CID comparison leads to false-negatives.
- Verification overhead: Without knowing the original parameters, users must retrieve and compare entire DAGs to verify content, adding storage, bandwidth, and complexity.
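To make the first problem concrete, here is a minimal sketch of how one construction parameter changes the root hash of identical input. It uses a toy two-level Merkle construction (hash each chunk, then hash the concatenated chunk digests) and is not real UnixFS or dag-pb encoding; `toy_root_hash` is a hypothetical helper written for this illustration.

```python
import hashlib


def toy_root_hash(data: bytes, chunk_size: int) -> str:
    """Hash each fixed-size chunk, then hash the concatenated chunk digests.

    Simplified stand-in for a chunked DAG; NOT real UnixFS/dag-pb.
    """
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    leaf_digests = [hashlib.sha256(chunk).digest() for chunk in chunks]
    return hashlib.sha256(b"".join(leaf_digests)).hexdigest()


data = bytes(range(256)) * 4096  # 1 MiB of deterministic sample input

root_256k = toy_root_hash(data, 256 * 1024)   # four chunks
root_1m = toy_root_hash(data, 1024 * 1024)    # one chunk

# Same bytes, different chunk size: the roots do not match.
print(root_256k == root_1m)
# Same bytes, same chunk size: the root is reproducible.
print(root_256k == toy_root_hash(data, 256 * 1024))
```

The same effect applies to every parameter that shapes the DAG, which is why comparing two CIDs alone cannot confirm that two inputs differ.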
A potential solution is to define configuration profiles: well-known parameter presets that implementations can adopt when common conventions for DAG creation are desired.
See related discussion at https://discuss.ipfs.tech/t/should-we-profile-cids/18507
UnixFS parameters
The following UnixFS parameters were identified as factors that affect the resulting CID:
- CID version, e.g. CIDv0 or CIDv1
- Multibase encoding for the CID, e.g. `base32`
- Hash function used for all nodes in the DAG, e.g. `sha2-256`
- UnixFS file chunking algorithm and chunk size (e.g., fixed-size chunks of 256KiB)
- UnixFS DAG layout:
  - `balanced`: builds a balanced tree where all leaf nodes are at the same depth. Optimized for random access, seeking, and range requests within files (e.g., video).
  - `balanced-packed`: variant of `balanced` that may produce different tree structure for large files. See Balanced DAG layout variants below.
  - `trickle`: builds a tree optimized for on-the-fly one-time streaming, where data can be consumed before the entire file is available. Useful for logs and other append-only data structures where random access is not important.
- UnixFS DAG width (max number of links per `File` node)
- HAMTDirectory fanout: the branching factor at each level of the HAMT tree (e.g., 256 leaves).
- HAMTDirectory threshold: max `Directory` size before converting to `HAMTDirectory`, based on `PBNode.Links` count or estimated serialized dag-pb size. See Historical inconsistency in HAMT sharding below.
  - `links-count`: `PBNode.Links` length (child count). Simple but ignores varying entry sizes.
  - `links-bytes`: sum of `PBNode.Links[].Name` and `PBNode.Links[].Hash` byte lengths. Underestimates actual size by ignoring UnixFS Data, Tsize, and protobuf overhead.
  - `block-bytes`: full serialized dag-pb node size. Most accurate, accounts for varint `Tsize` and optional metadata such as `mode` or `mtime`.
- Leaves: either dag-pb wrapped or raw
- Whether empty directories are included in the DAG. Some implementations may apply filtering.
- Whether hidden entities (including dot files) are included in the DAG. Some implementations may apply filtering.
- Directory wrapping for single files: in order to retain the name of a single file, some implementations have the option to wrap the file in a `Directory` with link to the file.
- Presence and accurate setting of `Tsize` (correct UnixFS has `Tsize` of child sub-DAGs).
- Symlink handling: preserved as UnixFS Type=4 nodes, or followed (dereferenced to target).
- Mode: optional POSIX file permissions.
- Mtime: optional modification timestamp.
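One way to think about a profile is as a fixed record of these parameters. The sketch below is illustrative only: `UnixFSProfile`, its field names, and `differing_parameters` are hypothetical and not part of any implementation or of this proposal. The two sample records copy a subset of the default values from the kubo (CIDv0) and helia columns of the divergence table in this document.

```python
from dataclasses import dataclass, fields
from typing import List

KIB = 1024
MIB = 1024 * KIB


@dataclass(frozen=True)
class UnixFSProfile:
    """Hypothetical record of a subset of the CID-affecting parameters."""
    cid_version: int
    hash_function: str
    chunk_size: int             # bytes, fixed-size chunking
    dag_layout: str             # "balanced", "balanced-packed", or "trickle"
    dag_width: int              # max links per File node
    hamt_fanout: int
    hamt_threshold: int
    hamt_threshold_method: str  # "links-count", "links-bytes", or "block-bytes"
    raw_leaves: bool


def differing_parameters(a: UnixFSProfile, b: UnixFSProfile) -> List[str]:
    """Return the names of the fields whose values differ between two profiles."""
    return [f.name for f in fields(a) if getattr(a, f.name) != getattr(b, f.name)]


# Values taken from the divergence table (kubo CIDv0 and helia columns).
kubo_cidv0 = UnixFSProfile(0, "sha2-256", 256 * KIB, "balanced", 174,
                           256, 256 * KIB, "links-bytes", raw_leaves=False)
helia = UnixFSProfile(1, "sha2-256", 1 * MIB, "balanced", 1024,
                      256, 256 * KIB, "links-bytes", raw_leaves=True)

print(differing_parameters(kubo_cidv0, helia))
```

Any non-empty result means the two tools can be expected to produce different CIDs for the same input, even before behaviours such as symlink handling or the HAMT comparison operator are considered.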
Balanced DAG layout variants
The balanced DAG layout has implementation variants that affect CID determinism for large files. CID mismatches have been observed and investigated when comparing [kubo][] and [Singularity][singularity] outputs for files exceeding 1 GiB. This IPIP introduces the name balanced-packed to distinguish Singularity's variant from the original balanced layout.
Implementations adopting a profile SHOULD specify which balanced variant they use. The unixfs-v1-2025 profile uses balanced for maximum compatibility with existing implementations.
balanced
The original balanced layout used by [kubo][]/[boxo][], [helia][], and others in the ecosystem. Builds the tree incrementally as chunks stream in:
- Starts with first chunk as root, grows tree upward as needed
- Uses explicit depth tracking to fill nodes recursively
- All leaf nodes end up at the same depth from the root
- Reference: `boxo/ipld/unixfs/importer/balanced/builder.go`
balanced-packed
Name introduced by this IPIP for [Singularity][singularity]'s variant. Groups pre-computed links in batch:
- Takes all chunk links as input, then packs them into parent nodes (up to max width)
- Repeats packing level-by-level until single root remains
- Trailing nodes may have fewer children, causing leaf depth to vary
- Optimized for batch processing of pre-chunked data in CAR files
- Reference: `singularity/pack/packutil/util.go` `AssembleFileFromLinks()`
According to Singularity issue #525, "in Singularity's DAG, the last leaf node is not at the same distance from the root as the others." This structural difference causes CID mismatches for files larger than chunk_size * dag_width (e.g., >1 GiB with 1 MiB chunks and 1024 links per node), even when all other parameters match.
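The size boundary quoted above can be checked with a short calculation. The sketch below only computes where a second level of intermediate nodes becomes necessary in a fully balanced tree; it does not reproduce the tree-building code of either kubo or Singularity, and the function names are hypothetical.

```python
KIB = 1024
MIB = 1024 * KIB
GIB = 1024 * MIB


def single_level_limit(chunk_size: int, dag_width: int) -> int:
    """Largest file size in bytes whose chunks all fit under a single root node."""
    return chunk_size * dag_width


def balanced_depth(file_size: int, chunk_size: int, dag_width: int) -> int:
    """Levels of link nodes above the leaves in a fully balanced tree.

    0 means the file fits in one chunk and needs no parent node.
    """
    leaves = max(1, -(-file_size // chunk_size))  # ceiling division
    depth, capacity = 0, 1
    while capacity < leaves:
        capacity *= dag_width
        depth += 1
    return depth


# 1 MiB chunks and 1024 links per node: the boundary is exactly 1 GiB.
print(single_level_limit(1 * MIB, 1024) == GIB)   # True
print(balanced_depth(GIB, 1 * MIB, 1024))         # 1
print(balanced_depth(GIB + 1, 1 * MIB, 1024))     # 2

# 256 KiB chunks and 174 links per node: the boundary is 43.5 MiB.
print(single_level_limit(256 * KIB, 174) / MIB)   # 43.5
```

Below the boundary, every chunk hangs directly off the root, so the two layouts have no room to differ. Above it, how the final, partially filled nodes are arranged determines the CID.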
Historical inconsistency in HAMT sharding
The IPFS ecosystem was never fully consistent in HAMT directory sharding behavior. This section documents the implementation history to explain why standardization through profiles is necessary.
Timeline of Go implementation changes:
- 2017-03: kubo#3042 introduced HAMT sharding with a global `Experimental.ShardingEnabled` flag. When enabled, all directories were sharded regardless of size. This is why historical snapshots like `/ipfs/bafybeiaysi4s6lnjev27ln5icwm6tueaw2vdykrtjkwiphwekaywqhcjze` (Wikipedia) have HAMTDirectory nodes even for parent directories with few entries.
- 2021-05: go-unixfs#91 introduced `HAMTShardingSize` threshold for automatic sharding based on estimated directory size, using `links-bytes` estimation. This was part of the work tracked in kubo#8106.
- 2021-11: go-unixfs#94 added size-based unsharding (switching from HAMT back to basic directory), completing bidirectional automatic sharding with `>=` comparison. The go-unixfs repository has since been archived; its code now lives in boxo.
- 2021-12: go-ipfs v0.11.0 (now Kubo) shipped with automatic HAMT autosharding, deprecating the global `Experimental.ShardingEnabled` flag.
- 2023-03: boxo created via Über Migration, inheriting the `>=` comparison behavior from go-unixfs.
- 2026-01: boxo#1088 fixed threshold comparison from `>=` to `>`, aligning with JS implementation and documentation. Shipped in Kubo 0.40.
Timeline of JavaScript implementation changes:
- 2017-03: js-ipfs-unixfs#14 added HAMT data types to UnixFS protobuf definitions.
- 2018-12: js-ipfs#1734 added HAMT sharding support to MFS, using entry count threshold (`shardSplitThreshold`, default 1000 entries).
- 2019-01: js-ipfs v0.34.0 shipped with HAMT support in MFS. The threshold was based on entry count, not size, which differed from Go's size-based approach.
- 2022-10: Helia created as the successor to js-ipfs.
- 2023-02: js-ipfs-unixfs#171 changed from entry count to DAGNode size threshold (`shardSplitThresholdBytes`, default 256 KiB), aligning with Go implementation. Uses `>` comparison. This was tracked in js-ipfs-unixfs#149. The js-ipfs-unixfs library remains active and is used by Helia.
- 2023-05: js-ipfs archived; Helia became the recommended JS implementation.
The JavaScript implementation in Helia uses size > threshold (strictly greater than) in is-over-shard-threshold.ts, consistent with Go after the 2026 fix.
These inconsistencies between Go and JS implementations over the years, combined with differing threshold methods (entry count vs size) and comparison operators (>= vs >), meant cross-implementation CID determinism for large directories was never reliably achievable. The unixfs-v1-2025 profile addresses this by standardizing on block-bytes estimation and explicit > comparison.
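The difference between the estimation methods and the comparison operators can be sketched as follows. This is a minimal illustration, not code from boxo or Helia: the helper names are hypothetical, the directory node is assumed to be a plain UnixFS `Directory` with no `mode` or `mtime`, and the dag-pb encoding is hand-rolled from the published protobuf layout (`PBLink`: Hash=1, Name=2, Tsize=3; `PBNode`: Links=2 written before Data=1).

```python
from typing import List, Tuple

Link = Tuple[str, bytes, int]  # (name, CID bytes, Tsize)

# Serialized UnixFS Data message with Type = Directory (assumes no mode/mtime).
UNIXFS_DIRECTORY_DATA = b"\x08\x01"


def varint(n: int) -> bytes:
    """Encode a non-negative integer as a protobuf varint."""
    out = bytearray()
    while True:
        low = n & 0x7F
        n >>= 7
        if n:
            out.append(low | 0x80)
        else:
            out.append(low)
            return bytes(out)


def length_delimited(tag: int, payload: bytes) -> bytes:
    """Encode a length-delimited protobuf field with a single-byte tag."""
    return bytes([tag]) + varint(len(payload)) + payload


def links_count(links: List[Link]) -> int:
    return len(links)


def links_bytes(links: List[Link]) -> int:
    """Sum of Name and Hash byte lengths only."""
    return sum(len(name.encode("utf-8")) + len(cid) for name, cid, _ in links)


def block_bytes(links: List[Link]) -> int:
    """Size of the full serialized dag-pb node."""
    size = 0
    for name, cid, tsize in links:
        pb_link = (length_delimited(0x0A, cid)                      # Hash
                   + length_delimited(0x12, name.encode("utf-8"))   # Name
                   + b"\x18" + varint(tsize))                       # Tsize
        size += len(length_delimited(0x12, pb_link))                # PBNode.Links
    size += len(length_delimited(0x0A, UNIXFS_DIRECTORY_DATA))      # PBNode.Data
    return size


def should_shard(size: int, threshold: int, strict: bool) -> bool:
    """strict=True models `>`; strict=False models the older `>=` behaviour."""
    return size > threshold if strict else size >= threshold


# Placeholder CIDv1 (dag-pb, sha2-256) with an all-zero digest: 36 bytes.
cid = bytes([0x01, 0x70, 0x12, 0x20]) + bytes(32)
links = [("a.txt", cid, 100), ("b.txt", cid, 200), ("c.txt", cid, 300)]

print(links_count(links), links_bytes(links), block_bytes(links))

threshold = 256 * 1024
print(should_shard(threshold, threshold, strict=False))  # True: `>=` shards at the boundary
print(should_shard(threshold, threshold, strict=True))   # False: `>` does not
```

Two implementations that agree on every other parameter can therefore still disagree on a directory whose estimated size lands exactly on the threshold, or whose `links-bytes` estimate falls below the threshold while its `block-bytes` size exceeds it.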
Divergences across ecosystem
We analyzed the default settings across the most popular UnixFS implementations in the ecosystem. The table below documents the divergences that prevent deterministic CID generation today:
| Parameter | [kubo][] (CIDv0) | [helia][] | [storacha][] | [kubo][] (CIDv1) | [singularity][] | [dasl][] | [pinata][] | [filebase][] |
| ----------------------------- | ------------------------ | -------------------- | ------------------ | ----------------------------- | ----------------------------------- | ------------ | ------------ | ------------------- |
| Based on | v0.39 (unixfs-v0-2015) | @helia/unixfs 6.0.4 | w3cli 7.12.0 | v0.39 (test-cid-v1 profile) | v0.6.0-RC4 (454b630) | spec 2025-12 | ? | add via rpc |
| CID version | CIDv0 | CIDv1 | CIDv1 | CIDv1 | CIDv1 | CIDv1 | ? | CIDv0 |
| Hash function | sha2-256 | sha2-256 | sha2-256 | sha2-256 | sha2-256 | sha2-256 | ? | sha2-256 |
| Chunking algorithm | fixed-size | fixed-size | fixed-size | fixed-size | fixed-size | N/A | ? | fixed-size |
| Max chunk size | 256KiB | 1MiB | 1MiB | 1MiB | 1MiB | N/A | ? | 256KiB |
| DAG layout | balanced | balanced | balanced | balanced | balanced-packed | N/A | ? | ? |
| DAG width (children per node) | 174 | 1024 | 1024 | 174 | 1024 | N/A | ? | ? |
| HAMTDirectory fanout | 256 blocks | 256 blocks | 256 blocks | 256 blocks | 256 blocks (boxo) | N/A | ? | ? |
| HAMTDirectory threshold | 256KiB (links-bytes) | 256KiB (links-bytes) | 1000 (links-count) | 256KiB (links-bytes) | 256KiB (links-bytes) (boxo) | N/A | ? | ? |
| HAMT switch comparison | >= | > | > | >= | >= (boxo) | N/A | ? | ? |
| Leaves | dag-pb | raw | raw | raw | raw | N/A | ? | ? |
| Empty directories | included | included | excluded | included | included | N/A | ? | ? |
| Hidden entities | excluded (opt-in) | excluded (opt-in) | excluded (opt-in) | excluded (opt-in) | included (rclone) | N/A | ? | ? |
| Symlinks | preserved | followed | followed | preserved | skipped (rclone) | N/A | ? | ? |
| Mode (permissions) | excluded (opt-in) | excluded (opt-in) | not supported | excluded (opt-in) | not supported | N/A | ? | ? |
| Mtime (modification time) | excluded (opt-in) | excluded (opt-in) | not supported | excluded (opt-in) | not supported | N/A | ? | ? |
Terminology:
- `included`: Always included in the DAG (no option to exclude)
- `excluded`: Always excluded from the DAG (no option to include)
- `opt-in`: Excluded by default; implementations provide a flag to include (e.g., `--hidden` in Kubo/Storacha, `hidden: true` in Helia)
- `opt-out`: Included by default; implementations provide a flag to exclude
- `preserved`: Symlinks stored as UnixFS Type=4 nodes with target path (per UnixFS spec). Note: Kubo (v0.39) `--dereference-args` only follows symlinks passed as CLI arguments; symlinks found during recursive traversal are always preserved.
- `followed`: Symlinks dereferenced and treated as target files/directories
- `skipped`:
[Content truncated — view full spec at source]