IPIP 499 · specification · ratified · key-management · ipfs · content-addressing

UnixFS CID Profiles

Updated Mar 29, 2026

Specification

Summary

This proposal introduces configuration profiles for CIDs that represent files and directories using UnixFS. The legacy profiles table also documents non-UnixFS implementations for reference.

Motivation

While CIDs and UnixFS DAGs are cryptographically verifiable, the same file or directory can produce different CIDs across UnixFS implementations, because DAG construction parameters like chunk size, DAG width, and layout vary between tools. Often, these parameters are not even configurable by users.

This creates two problems:

  • Broken hash semantics: Unlike standard hash functions, where identical input produces identical output, UnixFS CIDs depend on DAG construction parameters. Simple CID comparison therefore produces false negatives: two CIDs can differ even though they represent the same content.
  • Verification overhead: Without knowing the original parameters, users must retrieve and compare entire DAGs to verify content, adding storage, bandwidth, and complexity.
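The first problem can be reproduced without any IPFS tooling. The sketch below is a toy two-level Merkle construction, not real UnixFS or dag-pb encoding; the function name `toy_root` and the input data are assumptions made for illustration. It only shows that the root hash depends on the chunk size as well as on the bytes.

```python
import hashlib


def toy_root(data: bytes, chunk_size: int) -> str:
    """Hash each fixed-size chunk, then hash the concatenated chunk digests.

    Toy stand-in for a DAG root: real UnixFS wraps links in dag-pb nodes.
    """
    digests = [
        hashlib.sha256(data[i:i + chunk_size]).digest()
        for i in range(0, len(data), chunk_size)
    ]
    return hashlib.sha256(b"".join(digests)).hexdigest()


data = bytes(range(256)) * 4096  # 1 MiB of deterministic input

root_256k = toy_root(data, 256 * 1024)   # four chunks
root_1m = toy_root(data, 1024 * 1024)    # one chunk

print(root_256k == root_1m)                      # False: same bytes, different roots
print(root_256k == toy_root(data, 256 * 1024))   # True: same bytes, same parameters
```

Comparing the two roots alone cannot tell a reader whether the inputs differed or only the parameters did, which is the false-negative case described above.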

A potential solution is to define configuration profiles: well-known parameter presets that implementations can adopt when common conventions for DAG creation are desired.

See related discussion at https://discuss.ipfs.tech/t/should-we-profile-cids/18507

UnixFS parameters

The following UnixFS parameters were identified as factors that affect the resulting CID:

  1. CID version, e.g. CIDv0 or CIDv1
  2. Multibase encoding for the CID, e.g. base32
  3. Hash function used for all nodes in the DAG, e.g. sha2-256
  4. UnixFS file chunking algorithm and chunk size (e.g., fixed-size chunks of 256KiB)
  5. UnixFS DAG layout:
    • balanced: builds a balanced tree where all leaf nodes are at the same depth. Optimized for random access, seeking, and range requests within files (e.g., video).
    • balanced-packed: variant of balanced that may produce different tree structure for large files. See Balanced DAG layout variants below.
    • trickle: builds a tree optimized for on-the-fly one-time streaming, where data can be consumed before the entire file is available. Useful for logs and other append-only data structures where random access is not important.
  6. UnixFS DAG width (max number of links per File node)
  7. HAMTDirectory fanout: the branching factor at each level of the HAMT tree (e.g., 256 leaves).
  8. HAMTDirectory threshold: max Directory size before converting to HAMTDirectory, based on PBNode.Links count or estimated serialized dag-pb size. See Historical inconsistency in HAMT sharding below.
    • links-count: PBNode.Links length (child count). Simple but ignores varying entry sizes.
    • links-bytes: sum of PBNode.Links[].Name and PBNode.Links[].Hash byte lengths. Underestimates actual size by ignoring UnixFS Data, Tsize, and protobuf overhead.
    • block-bytes: full serialized dag-pb node size. Most accurate, accounts for varint Tsize and optional metadata such as mode or mtime.
  9. Leaves: either dag-pb wrapped or raw
  10. Whether empty directories are included in the DAG. Some implementations may apply filtering.
  11. Whether hidden entities (including dot files) are included in the DAG. Some implementations may apply filtering.
  12. Directory wrapping for single files: to retain the name of a single file, some implementations offer the option to wrap the file in a Directory with a link to the file.
  13. Presence and accurate setting of Tsize (correct UnixFS has Tsize of child sub-DAGs).
  14. Symlink handling: preserved as UnixFS Type=4 nodes, or followed (dereferenced to target).
  15. Mode: optional POSIX file permissions.
  16. Mtime: optional modification timestamp.
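To make parameters 1, 2, 3, and 9 concrete, the following minimal sketch assembles a CIDv1 string by hand for a single block. It assumes sha2-256 and base32, and it relies on the version, codec, and multihash codes all being below 0x80 so that each varint fits in one byte. The helper name `cid_v1` is hypothetical and is not part of any IPFS library.

```python
import base64
import hashlib

RAW_CODEC = 0x55     # multicodec code for raw binary blocks
DAG_PB_CODEC = 0x70  # multicodec code for dag-pb blocks
SHA2_256 = 0x12      # multihash code for sha2-256


def cid_v1(block: bytes, codec: int) -> str:
    """Return the base32 CIDv1 string for one block (hypothetical helper)."""
    digest = hashlib.sha256(block).digest()
    multihash = bytes([SHA2_256, len(digest)]) + digest
    cid_bytes = bytes([0x01, codec]) + multihash  # version 1, then codec
    b32 = base64.b32encode(cid_bytes).decode("ascii").lower().rstrip("=")
    return "b" + b32  # "b" is the multibase prefix for lowercase base32


raw_cid = cid_v1(b"hello", RAW_CODEC)

# Same digest under a different codec. b"hello" is not a valid dag-pb block;
# this line only shows that the codec is part of the CID itself.
pb_cid = cid_v1(b"hello", DAG_PB_CODEC)

print(raw_cid[:7], pb_cid[:7])  # bafkrei bafybei
```

The version, codec, and hash function all feed into the CID bytes, so a change to any one of them changes the CID. The multibase is different in kind: it affects only the string form, not the underlying bytes.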

Balanced DAG layout variants

The balanced DAG layout has implementation variants that affect CID determinism for large files. CID mismatches have been observed and investigated when comparing [kubo][] and [Singularity][singularity] outputs for files exceeding 1 GiB. This IPIP introduces the name balanced-packed to distinguish Singularity's variant from the original balanced layout.

Implementations adopting a profile SHOULD specify which balanced variant they use. The unixfs-v1-2025 profile uses balanced for maximum compatibility with existing implementations.

balanced

The original balanced layout used by [kubo][]/[boxo][], [helia][], and others in the ecosystem. Builds the tree incrementally as chunks stream in:

  • Starts with first chunk as root, grows tree upward as needed
  • Uses explicit depth tracking to fill nodes recursively
  • All leaf nodes end up at the same depth from the root
  • Reference: boxo/ipld/unixfs/importer/balanced/builder.go

balanced-packed

Name introduced by this IPIP for [Singularity][singularity]'s variant. Groups pre-computed links in batch:

  • Takes all chunk links as input, then packs them into parent nodes (up to max width)
  • Repeats packing level-by-level until single root remains
  • Trailing nodes may have fewer children, causing leaf depth to vary
  • Optimized for batch processing of pre-chunked data in CAR files
  • Reference: singularity/pack/packutil/util.go AssembleFileFromLinks()
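The batch packing steps above can be sketched on plain labels instead of real links. This is an illustration under a stated assumption, not a transcription of Singularity's code: it assumes that a trailing group holding a single node is promoted to the next level unwrapped, which is one way leaf depth could end up varying. The names `pack` and `leaf_depths` are hypothetical.

```python
from typing import Dict, List, Union

Node = Union[str, tuple]  # a leaf label, or a tuple of child nodes


def pack(leaves: List[str], width: int) -> Node:
    """Pack nodes level by level until a single root remains.

    ASSUMPTION: a group containing one node is carried up without a wrapper.
    """
    if width < 2:
        raise ValueError("width must be at least 2")
    if not leaves:
        raise ValueError("at least one leaf is required")
    level: List[Node] = list(leaves)
    while len(level) > 1:
        groups = [level[i:i + width] for i in range(0, len(level), width)]
        level = [g[0] if len(g) == 1 else tuple(g) for g in groups]
    return level[0]


def leaf_depths(node: Node, depth: int = 0) -> Dict[str, int]:
    """Map every leaf label to its distance from the root."""
    if isinstance(node, str):
        return {node: depth}
    depths: Dict[str, int] = {}
    for child in node:
        depths.update(leaf_depths(child, depth + 1))
    return depths


root = pack(["c0", "c1", "c2", "c3", "c4"], width=4)
print(leaf_depths(root))
# {'c0': 2, 'c1': 2, 'c2': 2, 'c3': 2, 'c4': 1}
```

With five leaves and a width of four, the last leaf sits one level closer to the root than the others. A layout that keeps all leaves at the same depth would give the fifth leaf its own parent node, producing a different root node and therefore a different CID.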

According to Singularity issue #525, "in Singularity's DAG, the last leaf node is not at the same distance from the root as the others." This structural difference causes CID mismatches for files larger than chunk_size * dag_width (e.g., >1 GiB with 1 MiB chunks and 1024 links per node), even when all other parameters match.
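The size at which a second level of File nodes becomes necessary follows directly from chunk_size * dag_width. A small sketch of that arithmetic, with hypothetical helper names, using the defaults quoted in this document:

```python
KIB = 1024
MIB = 1024 * KIB
GIB = 1024 * MIB


def single_level_limit(chunk_size: int, dag_width: int) -> int:
    """Largest file size (bytes) whose chunks fit under one parent node."""
    return chunk_size * dag_width


def balanced_depth(file_size: int, chunk_size: int, dag_width: int) -> int:
    """Levels of parent nodes above the leaves when all leaves share one depth."""
    chunks = max(1, -(-file_size // chunk_size))  # ceiling division
    depth, capacity = 0, 1
    while capacity < chunks:
        capacity *= dag_width
        depth += 1
    return depth


print(single_level_limit(1 * MIB, 1024) == 1 * GIB)   # True
print(balanced_depth(1 * GIB, 1 * MIB, 1024))         # 1
print(balanced_depth(1 * GIB + 1, 1 * MIB, 1024))     # 2
print(single_level_limit(256 * KIB, 174))             # 45613056 (43.5 MiB)
```

A file of exactly 1 GiB still fits under a single parent with 1 MiB chunks and 1024 links per node; one extra byte adds a chunk and forces a second level, which is where the two balanced variants can start to disagree.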

Historical inconsistency in HAMT sharding

The IPFS ecosystem was never fully consistent in HAMT directory sharding behavior. This section documents the implementation history to explain why standardization through profiles is necessary.

Timeline of Go implementation changes:

  • 2017-03: kubo#3042 introduced HAMT sharding with a global Experimental.ShardingEnabled flag. When enabled, all directories were sharded regardless of size. This is why historical snapshots like /ipfs/bafybeiaysi4s6lnjev27ln5icwm6tueaw2vdykrtjkwiphwekaywqhcjze (Wikipedia) have HAMTDirectory nodes even for parent directories with few entries.

  • 2021-05: go-unixfs#91 introduced HAMTShardingSize threshold for automatic sharding based on estimated directory size, using links-bytes estimation. This was part of the work tracked in kubo#8106.

  • 2021-11: go-unixfs#94 added size-based unsharding (switching from HAMT back to basic directory), completing bidirectional automatic sharding with >= comparison. The go-unixfs repository has since been archived; its code now lives in boxo.

  • 2021-12: go-ipfs v0.11.0 (now Kubo) shipped with HAMT autosharding, deprecating the global Experimental.ShardingEnabled flag.

  • 2023-03: boxo created via Über Migration, inheriting the >= comparison behavior from go-unixfs.

  • 2026-01: boxo#1088 fixed the threshold comparison from >= to >, aligning with the JS implementation and documentation. Shipped in Kubo 0.40.

Timeline of JavaScript implementation changes:

  • 2017-03: js-ipfs-unixfs#14 added HAMT data types to UnixFS protobuf definitions.

  • 2018-12: js-ipfs#1734 added HAMT sharding support to MFS, using entry count threshold (shardSplitThreshold, default 1000 entries).

  • 2019-01: js-ipfs v0.34.0 shipped with HAMT support in MFS. The threshold was based on entry count, not size, which differed from Go's size-based approach.

  • 2022-10: Helia created as the successor to js-ipfs.

  • 2023-02: js-ipfs-unixfs#171 changed from an entry count threshold to a DAGNode size threshold (shardSplitThresholdBytes, default 256 KiB), aligning with the Go implementation. Uses > comparison. This was tracked in js-ipfs-unixfs#149. The js-ipfs-unixfs library remains active and is used by Helia.

  • 2023-05: js-ipfs archived; Helia became the recommended JS implementation.

The JavaScript implementation in Helia uses size > threshold (strictly greater than) in is-over-shard-threshold.ts, consistent with Go after the 2026 fix.

These inconsistencies between Go and JS implementations over the years, combined with differing threshold methods (entry count vs size) and comparison operators (>= vs >), meant cross-implementation CID determinism for large directories was never reliably achievable. The unixfs-v1-2025 profile addresses this by standardizing on block-bytes estimation and explicit > comparison.
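The effect of the estimation method and the comparison operator can be shown with a directory whose estimated size lands exactly on the threshold. The sketch below covers only the links-count and links-bytes estimators described earlier; block-bytes is left out because it needs a full dag-pb serializer. All function names are hypothetical, and the 36-byte placeholder stands in for a binary CIDv1 with a sha2-256 digest.

```python
from typing import List, Tuple

Link = Tuple[str, bytes]  # (entry name, binary CID of the child)


def links_count(links: List[Link]) -> int:
    """links-count: number of entries in the directory."""
    return len(links)


def links_bytes(links: List[Link]) -> int:
    """links-bytes: sum of name and CID byte lengths over all entries."""
    return sum(len(name.encode("utf-8")) + len(cid) for name, cid in links)


def should_shard(estimate: int, threshold: int, inclusive: bool) -> bool:
    """inclusive=True models >=, inclusive=False models >."""
    return estimate >= threshold if inclusive else estimate > threshold


THRESHOLD_BYTES = 256 * 1024
placeholder_cid = bytes(36)  # zero-filled stand-in, not a real CID

# 4096 entries, each with a 28-byte name and a 36-byte CID: exactly 256 KiB.
links = [(f"{i:028d}", placeholder_cid) for i in range(4096)]

estimate = links_bytes(links)
print(estimate == THRESHOLD_BYTES)                               # True
print(should_shard(estimate, THRESHOLD_BYTES, inclusive=True))   # True  (>=)
print(should_shard(estimate, THRESHOLD_BYTES, inclusive=False))  # False (>)
print(should_shard(links_count(links), 1000, inclusive=False))   # True  (entry count)
```

The same directory is sharded under >= and left as a basic directory under >, so the two root CIDs differ even though every other parameter matches.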

Divergences across ecosystem

We analyzed the default settings across the most popular UnixFS implementations in the ecosystem. The table below documents the divergences that prevent deterministic CID generation today:

| Parameter | [kubo][] (CIDv0) | [helia][] | [storacha][] | [kubo][] (CIDv1) | [singularity][] | [dasl][] | [pinata][] | [filebase][] |
| ----------------------------- | ------------------------ | -------------------- | ------------------ | ----------------------------- | ----------------------------------- | ------------ | ------------ | ------------------- |
| Based on | v0.39 (unixfs-v0-2015) | @helia/unixfs 6.0.4 | w3cli 7.12.0 | v0.39 (test-cid-v1 profile) | v0.6.0-RC4 (454b630) | spec 2025-12 | ? | add via rpc |
| CID version | CIDv0 | CIDv1 | CIDv1 | CIDv1 | CIDv1 | CIDv1 | ? | CIDv0 |
| Hash function | sha2-256 | sha2-256 | sha2-256 | sha2-256 | sha2-256 | sha2-256 | ? | sha2-256 |
| Chunking algorithm | fixed-size | fixed-size | fixed-size | fixed-size | fixed-size | N/A | ? | fixed-size |
| Max chunk size | 256KiB | 1MiB | 1MiB | 1MiB | 1MiB | N/A | ? | 256KiB |
| DAG layout | balanced | balanced | balanced | balanced | balanced-packed | N/A | ? | ? |
| DAG width (children per node) | 174 | 1024 | 1024 | 174 | 1024 | N/A | ? | ? |
| HAMTDirectory fanout | 256 blocks | 256 blocks | 256 blocks | 256 blocks | 256 blocks (boxo) | N/A | ? | ? |
| HAMTDirectory threshold | 256KiB (links-bytes) | 256KiB (links-bytes) | 1000 (links-count) | 256KiB (links-bytes) | 256KiB (links-bytes) (boxo) | N/A | ? | ? |
| HAMT switch comparison | >= | > | > | >= | >= (boxo) | N/A | ? | ? |
| Leaves | dag-pb | raw | raw | raw | raw | N/A | ? | ? |
| Empty directories | included | included | excluded | included | included | N/A | ? | ? |
| Hidden entities | excluded (opt-in) | excluded (opt-in) | excluded (opt-in) | excluded (opt-in) | included (rclone) | N/A | ? | ? |
| Symlinks | preserved | followed | followed | preserved | skipped (rclone) | N/A | ? | ? |
| Mode (permissions) | excluded (opt-in) | excluded (opt-in) | not supported | excluded (opt-in) | not supported | N/A | ? | ? |
| Mtime (modification time) | excluded (opt-in) | excluded (opt-in) | not supported | excluded (opt-in) | not supported | N/A | ? | ? |
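Reading divergences out of a table this wide is error-prone, so a short sketch of a mechanical comparison may help. The two dictionaries copy the kubo (CIDv1) and helia columns from the table above; the dictionary layout and the `divergences` helper are assumptions for illustration and do not correspond to any existing tool.

```python
from typing import Dict, Tuple

KUBO_CIDV1: Dict[str, str] = {
    "CID version": "CIDv1",
    "Hash function": "sha2-256",
    "Chunking algorithm": "fixed-size",
    "Max chunk size": "1MiB",
    "DAG layout": "balanced",
    "DAG width": "174",
    "HAMTDirectory fanout": "256",
    "HAMTDirectory threshold": "256KiB (links-bytes)",
    "HAMT switch comparison": ">=",
    "Leaves": "raw",
    "Empty directories": "included",
    "Hidden entities": "excluded (opt-in)",
    "Symlinks": "preserved",
    "Mode": "excluded (opt-in)",
    "Mtime": "excluded (opt-in)",
}

# Start from the kubo column and override only the cells that differ in the table.
HELIA: Dict[str, str] = dict(
    KUBO_CIDV1,
    **{"DAG width": "1024", "HAMT switch comparison": ">", "Symlinks": "followed"},
)


def divergences(a: Dict[str, str], b: Dict[str, str]) -> Dict[str, Tuple[str, str]]:
    """Return every parameter whose value differs between two columns."""
    return {key: (a[key], b[key]) for key in a if a[key] != b[key]}


for name, (left, right) in divergences(KUBO_CIDV1, HELIA).items():
    print(f"{name}: {left} vs {right}")
# DAG width: 174 vs 1024
# HAMT switch comparison: >= vs >
# Symlinks: preserved vs followed
```

Even between these two closely aligned columns, three parameters differ, and any one of them is enough to change the root CID for inputs that exercise it.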

Terminology:

  • included: Always included in the DAG (no option to exclude)
  • excluded: Always excluded from the DAG (no option to include)
  • opt-in: Excluded by default; implementations provide a flag to include (e.g., --hidden in Kubo/Storacha, hidden: true in Helia)
  • opt-out: Included by default; implementations provide a flag to exclude
  • preserved: Symlinks stored as UnixFS Type=4 nodes with target path (per UnixFS spec). Note: Kubo (v0.39) --dereference-args only follows symlinks passed as CLI arguments; symlinks found during recursive traversal are always preserved.
  • followed: Symlinks dereferenced and treated as target files/directories
  • skipped:

[Content truncated; view full spec at source]
