
Conversation

@mishmosh
Contributor

@mishmosh mishmosh commented Apr 3, 2025

Currently, CIDs can be generated with a variety of settings and optimizations for chunking, DAG width, and more. This means the same file can yield multiple, different CIDs depending on which tools and settings are used, and it is not possible to reliably reproduce or verify the CID.

This proposal introduces profiles for IPFS CIDs. Profiles explicitly define CID version, hash algorithm, chunk size, DAG width, layout, and other parameters. They can be used to verify data across implementations, provide recommended settings depending on retrieval performance goals, and more.
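To make the idea concrete, a profile can be pictured as a named, immutable set of parameters, and two tools produce the same CID for the same input only if they agree on every one of them. The sketch below is illustrative only: the class name `CidProfile`, the helper `diff_profiles`, the profile names, and the chosen field subset are assumptions, and the values are examples rather than the normative tables from the IPIP.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class CidProfile:
    """Illustrative subset of the parameters a profile would pin down."""
    name: str
    cid_version: int
    hash_function: str
    chunk_size: int   # bytes per leaf chunk
    dag_width: int    # maximum children per file node
    dag_layout: str   # for example "balanced" or "trickle"


def diff_profiles(a: CidProfile, b: CidProfile) -> dict:
    """Return {parameter: (a_value, b_value)} for every parameter that differs."""
    return {
        f.name: (getattr(a, f.name), getattr(b, f.name))
        for f in fields(a)
        if f.name != "name" and getattr(a, f.name) != getattr(b, f.name)
    }


# Hypothetical profiles: example values, not the normative ones.
narrow = CidProfile("example-narrow", 1, "sha2-256", 256 * 1024, 174, "balanced")
wide = CidProfile("example-wide", 1, "sha2-256", 1024 * 1024, 1024, "balanced")

print(diff_profiles(narrow, wide))
# {'chunk_size': (262144, 1048576), 'dag_width': (174, 1024)}
```

Any non-empty diff means the two configurations should be expected to yield different CIDs for sufficiently large inputs.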

@mishmosh mishmosh requested a review from a team as a code owner April 3, 2025 14:03
@mishmosh mishmosh changed the title Create ipip-0000.md: CID profiles IPIP 0499: CID Profiles Apr 3, 2025
lidel added a commit to ipfs/kubo that referenced this pull request Apr 15, 2025
lets make the fanout match the max links from files
and rename profile to `-wide`

this will make it easier to discuss in ipfs/specs#499
lidel and others added 2 commits April 15, 2025 23:41
Co-authored-by: Bumblefudge <bumblefudge@learningproof.xyz>
Import.* config params for controlling DAG width were added in:
ipfs/kubo#10774
@lidel
Member

lidel commented Apr 15, 2025

Thank you for kicking this off and filling in the initial state.

I've incorporated specific "dag width" settings for File, Directory and HAMTDirectory nodes,
and updated the table to reflect the state from ipfs/kubo#10774
and the profiles that exist in the Kubo master branch: legacy-cid-v0, test-cid-v1 and test-cid-v1-wide.

Next:

  • agree what "cid-2025" profile should look like
    • this will be new default in "Kubo v1.0"
    • we have test-cid-v1 and test-cid-v1-wide in Kubo as potential candidates
  • switch to PR from local branch (so we have build preview)
  • figure out how to render the information (currently the table is not supported by https://github.com/ipfs/spec-generator)

@SethDocherty

This comment was marked as off-topic.

clarify empty directories and hidden entities handling with precise
terminology based on kubo v0.39, helia, and storacha implementations:

- `included`: always in DAG, no option to exclude (kubo/helia empty dirs)
- `excluded`: never in DAG, no option to include (storacha empty dirs)
- `opt-in`: excluded by default, flag to include (all hidden entities)
- `opt-out`: included by default, flag to exclude

add terminology note to explain these terms
add "Based on" row with package/tool versions and kubo profile names
- unixfs-2025: mark threshold as TODO, prefer Helia's block size approach
- unixfs-2025: note kubo needs opt-out flag for empty directories
- legacy profiles: add estimation method to kubo profiles
- parameters section: add backticks, clarify threshold estimation methods
- add Symlinks parameter to UnixFS parameters list
- add Symlinks row to unixfs-2025 (TODO) and legacy profiles tables
- kubo: preserved, helia/storacha: followed, dasl: not specified
- add terminology for preserved/followed with UnixFS spec reference
- clarify kubo --dereference-args behavior
Member

@lidel lidel left a comment


Quick update: I've pushed several commits addressing feedback and gaps in the document:

Resolved / Research done

  • Documented symlink behavior as suggested by @icidasset
    • Only Kubo 0.39 preserves symlinks; everything else dereferences them on the fly by default, replacing each symlink with the real file or directory it points at
  • Added Chunking algorithm row to both profile tables for completeness
  • Fixed kubo-legacy-2025 profile: corrected Leaves from raw to dag-pb (verified against kubo v0.39 legacy-cid-v0 profile where UnixFSRawLeaves=false)
  • Documented filtering behavior with clear terminology:
    • included: always in DAG (no option to exclude)
    • excluded: always excluded (no option to include)
    • opt-in: excluded by default, flag to include (e.g., --hidden)
    • opt-out: included by default, flag to exclude
  • Added Based on row with implementation versions and kubo profile names (legacy-cid-v0, test-cid-v1, test-cid-v1-wide)
  • Clarified HAMTDirectory threshold estimation methods in the parameters section: link count (naive), PBNode.Links size (name + CID), or full dag-pb block size (most accurate)
  • Noted that legacy table includes non-UnixFS implementations (DASL) in Summary section
  • Added estimation method suffix (est:links[name+cid]) to kubo profiles in legacy table
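
The four filtering terms above can be read as a small decision rule. The following is a minimal sketch of that reading; the names `Policy` and `in_dag` are hypothetical, and `flag_set` stands for whichever include or exclude flag a given implementation exposes.

```python
from enum import Enum


class Policy(Enum):
    INCLUDED = "included"  # always in DAG, no option to exclude
    EXCLUDED = "excluded"  # never in DAG, no option to include
    OPT_IN = "opt-in"      # excluded by default, flag to include
    OPT_OUT = "opt-out"    # included by default, flag to exclude


def in_dag(policy: Policy, flag_set: bool = False) -> bool:
    """Decide whether an entity ends up in the DAG under the given policy."""
    if policy is Policy.INCLUDED:
        return True
    if policy is Policy.EXCLUDED:
        return False
    if policy is Policy.OPT_IN:
        return flag_set       # the flag turns inclusion on
    return not flag_set       # OPT_OUT: the flag turns inclusion off


# Hidden entities are opt-in: absent unless the flag is passed.
print(in_dag(Policy.OPT_IN), in_dag(Policy.OPT_IN, flag_set=True))  # False True
```

For `included` and `excluded` the flag is ignored, which matches the "no option" wording in the definitions.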

Remaining TODOs in unixfs-2025

| Parameter | Status |
| --- | --- |
| HAMTDirectory threshold | TODO - fix kubo: likely based on full block size estimation (Helia approach) |
| Empty directories | TODO - use kubo? needs opt-out flag + Import.* |
| Hidden entities | TODO - use kubo? needs opt-in flag + Import.* |
| Symlinks | TODO - use kubo? needs flag + Import.* for controlling whether all symlinks in the imported directory tree are preserved or dereferenced |
| Test fixtures | TODO - reuse kubo: will reuse once kubo has them for *-2025 profiles |

Other:

Implementation Plan (Kubo 0.40, ETA 2026 Q1)

To finalize this IPIP, Kubo needs to support additional Import.* configuration flags for:

  1. Empty directories: opt-out flag to exclude them from DAG
  2. Hidden files: already has --hidden, just need to wire it up from config
  3. HAMTDirectory threshold: configurable to support both legacy estimation (name + CID size) and Helia-style full block size calculation

Test fixtures will likely be included in the same Kubo PR that adds these missing features.

I also think we may replace two kubo-2025 and kubo-2025-wide profiles with a single one, that makes decision on what remains narrow and what is wide, but will update once Kubo changes land. (now that we have convention of doing IPIPs with profiles, we can always course-correct in `-202

recommend full serialized PBNode size, link to dag-pb spec

ref: ipfs#499 (comment)
- rename to UnixFS CID Profiles
- add lidel as editor
- add thanks section with PR reviewers
@lidel lidel changed the title IPIP 0499: CID Profiles IPIP-499: UnixFS CID Profiles Dec 13, 2025
lidel added 3 commits January 13, 2026 23:26
- add `links-count`, `links-bytes`, `block-bytes` estimation methods
- fill unixfs-2025 profile: 256KiB (block-bytes), empty dirs included (opt-out), hidden opt-in, symlinks preserved
- update legacy profiles table with cleaner method names
- clarify profile naming allows YYYY or YYYY-MM suffix
- add reference from unixfs.md to IPIP-499 for threshold methods
- fix broken markdown links and formatting
- clarify compliance requirements: MUST support unixfs-2025, MAY support legacy profiles
adds GFM table support for markdown
- rename `unixfs-2025` to `unixfs-v1-2025` for clarity on CID version
- add `unixfs-v0-2015` legacy profile for backward compatibility with kubo
- move divergence analysis from separate section into Motivation
- clarify motivation: CIDs are verifiable, problem is DAG construction variance
- consolidate three problems into two: broken hash semantics, verification overhead
- add Mode and Mtime parameters to all profile tables
- improve UnixFS parameters section with inline links and explanations
- add balanced vs trickle DAG layout descriptions
- update unixfs.md to reference IPIP-499 in a note block with full URL
@lidel
Member

lidel commented Jan 14, 2026

Some progress on spec side:

  • specs.ipfs.tech website generator now supports Github-style tables (preview)
  • renamed unixfs-2025 to unixfs-v1-2025 for clarity on CIDv1 version
    • resolved TODOs by picking the least controversial / disruptive values (I think)
  • added unixfs-v0-2015 legacy profile for backward compatibility with kubo CIDv0 default
    • after discussions with some stakeholders, I believe we have to document and standardize legacy CIDv0 behavior, as people relied on it as an unspoken standard for close to a decade
  • moved divergence analysis from separate section into Motivation
  • consolidated three problems into two: broken hash semantics, verification overhead
  • added Mode and Mtime parameters to all profile tables
  • improved UnixFS parameters section with inline links and explanations

Remaining work is in Kubo/Boxo (expose configuration for things that were previously hardcoded, create two profiles from this IPIP).

…eters

add user benefit explaining why 1 MiB chunks with 1024 links per node
results in shallower DAG trees, fewer total nodes, faster seeking,
and reduced DHT announcement overhead compared to legacy 256 KiB/174 params
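
The claim in this commit message can be checked with simple arithmetic. The sketch below counts nodes in an idealized balanced DAG where every intermediate node is filled to `width` before a new one is started; the function name `balanced_dag_stats` is hypothetical, and real implementations (including the layout variants discussed in this thread) may build the upper levels differently.

```python
def balanced_dag_stats(file_size: int, chunk_size: int, width: int):
    """Return (leaves, levels_above_leaves, total_nodes) for an idealized balanced DAG."""
    leaves = max(1, -(-file_size // chunk_size))  # integer ceiling division
    total = leaves
    levels = 0
    nodes = leaves
    while nodes > 1:
        nodes = -(-nodes // width)  # parents needed to hold this level
        total += nodes
        levels += 1
    return leaves, levels, total


GIB = 1024 ** 3
print(balanced_dag_stats(GIB, 256 * 1024, 174))    # (4096, 2, 4121)
print(balanced_dag_stats(GIB, 1024 * 1024, 1024))  # (1024, 1, 1025)
```

Under these assumptions, a 1 GiB file needs roughly a quarter of the blocks and one fewer level with 1 MiB chunks and 1024 links per node than with the legacy 256 KiB/174 parameters.
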
1. [Mode](https://specs.ipfs.tech/unixfs/#mode-field): optional POSIX file permissions.
1. [Mtime](https://specs.ipfs.tech/unixfs/#mtime-field): optional modification timestamp.

### Divergence in current implementations

@obo20 any chance someone from the Pinata team can provide what defaults you use currently, so we can list your "CID profile" next to the Storacha one? Understanding divergence would be very useful.

@acejam any chance someone from Filebase team can provide what defaults you use currently when user data is onboarded and chunked into UnixFS? (or are you using default from specific Kubo version behind the scenes?)

We would like to document your "CID profile" next to other popular ones in the ecosystem.

include singularity as example showing balanced layout has implementation
variants that affect CID determinism for large files:
- document balanced-packed DAG layout variant
  (data-preservation-programs/singularity#525)
- note boxo defaults for HAMT parameters
- note rclone defaults for hidden files and symlinks
This structural difference causes CID mismatches for files larger than `chunk_size * dag_width` (e.g., >1 GiB with 1 MiB chunks and 1024 links per node), even when all other parameters match.
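
As a quick worked check of the threshold in the sentence above, the helper below computes the largest file that still fits under a single level of intermediate nodes. The function names are hypothetical and the calculation is just `chunk_size * dag_width`, as stated in the text.

```python
def single_level_capacity(chunk_size: int, dag_width: int) -> int:
    """Largest file size (bytes) whose chunks all fit under one parent node."""
    return chunk_size * dag_width


def layout_variant_matters(file_size: int, chunk_size: int, dag_width: int) -> bool:
    """True when the file is large enough that upper-level packing can change the CID."""
    return file_size > single_level_capacity(chunk_size, dag_width)


MIB = 1024 ** 2
print(single_level_capacity(1 * MIB, 1024))         # 1073741824 (1 GiB)
print(layout_variant_matters(2 * 1024 * MIB, 1 * MIB, 1024))  # True
```

With the legacy 256 KiB/174 parameters the same boundary sits at 45613056 bytes (43.5 MiB), so the divergence would show up on much smaller files.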

### Divergence in current implementations

I've included Singularity as an example showing that even among different "balanced" layout implementations we can see variants that affect CID determinism for large files.

  • documented balanced and balanced-packed DAG layout variants
  • noted implicit boxo defaults for HAMT parameters that the project seems to be using
  • assumed it uses rclone's defaults for hidden files and symlinks

cc data-preservation-programs/singularity#525 @SethDocherty @parkan @2color to proofread if the "singularity" column here reflects reality or if there is more nuance to what Singularity does

Contributor Author

@mishmosh mishmosh Jan 16, 2026


Nice, thanks.

lidel added a commit to ipfs/go-ipfs-cmds that referenced this pull request Jan 17, 2026
add new --dereference-symlinks boolean flag that recursively resolves
all symlinks to their target content during file collection. this works
on symlinks inside directories, not just CLI arguments.

the flag is wired through cli/parse.go to boxo's SerialFileOptions.DereferenceSymlinks.

deprecate --dereference-args which only worked on symlinks passed directly
as CLI arguments. the help text now indicates it is deprecated and directs
users to use --dereference-symlinks instead.

ref: ipfs/specs#499
lidel added a commit to ipfs/kubo that referenced this pull request Jan 17, 2026
add CLI flags for controlling file collection behavior during ipfs add:

- `--dereference-symlinks`: recursively resolve symlinks to their target
  content (replaces deprecated --dereference-args which only worked on
  CLI arguments). wired through go-ipfs-cmds to boxo's SerialFileOptions.
- `--empty-dirs` / `-E`: include empty directories (default: true)
- `--hidden` / `-H`: include hidden files (default: false)

these flags are CLI-only and not wired to Import.* config options because
go-ipfs-cmds library handles input file filtering before the directory
tree is passed to kubo. removed unused Import.UnixFSSymlinkMode config
option that was defined but never actually read by the CLI.

also:
- wire --trickle to Import.UnixFSDAGLayout config default
- update go-ipfs-cmds to v0.15.1-0.20260117043932-17687e216294
- add SYMLINK HANDLING section to ipfs add help text
- add CLI tests for all three flags

ref: ipfs/specs#499
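
The effect of the three flags on file collection can be modelled with a short walker. This is a sketch of the behavior described in the commit message, not Kubo's or go-ipfs-cmds' code: the function name `collect` and its return shape are hypothetical, a directory left with no entries after hidden-file filtering is treated as empty (an assumption), and symlink loops or broken links are not handled.

```python
from pathlib import Path


def collect(root, hidden=False, empty_dirs=True, dereference_symlinks=False):
    """Return (relative_path, kind) entries that would be passed on for import."""
    entries = []

    def visit(directory: Path, rel: str) -> bool:
        """Walk one directory; return True if it contributed any entry."""
        added = False
        for child in sorted(directory.iterdir()):
            if child.name.startswith(".") and not hidden:
                continue  # hidden entities are opt-in
            child_rel = f"{rel}/{child.name}" if rel else child.name
            if child.is_symlink() and not dereference_symlinks:
                entries.append((child_rel, "symlink"))  # preserved as a symlink
                added = True
            elif child.is_dir():  # follows symlinks when dereferencing
                position = len(entries)
                has_children = visit(child, child_rel)
                if has_children or empty_dirs:
                    entries.insert(position, (child_rel, "dir"))
                    added = True
            else:
                entries.append((child_rel, "file"))
                added = True
        return added

    visit(Path(root), "")
    return entries
```

The defaults mirror the ones listed above (`hidden` off, `empty_dirs` on, symlinks preserved), so `collect(path)` approximates a plain recursive add and each keyword argument corresponds to one of the new flags.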
| DAG layout | balanced |
| DAG width (children per node) | 1024 |
| HAMTDirectory fanout | 256 blocks |
| HAMTDirectory threshold | 256KiB (block-bytes) |

I don't think Helia can currently do block-bytes. How much of a problem is this for making this the "modern" profile? Or does that just mean there's work to do?

https://github.com/ipfs/helia/blob/005c2a7a5e45349398cf750fd73f3c47591bb00a/packages/unixfs/src/commands/utils/is-over-shard-threshold.ts#L34-L45


Oh, I see from your table that links-bytes is the most common across the implementations, so why not just use it? It's not precise, but it means that you don't need to change implementations as much to conform.
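
To see how far apart the estimation methods can be, the sketch below computes all three for the same directory node. It follows the descriptions given earlier in this thread (link count, name + CID bytes per link, full serialized dag-pb block size); the function names are hypothetical, every link is assumed to carry a `Tsize`, and the block-size calculation is a hand-rolled protobuf size count rather than a call into any real IPLD library.

```python
def varint_len(n: int) -> int:
    """Number of bytes needed to encode n as a protobuf varint."""
    length = 1
    while n >= 0x80:
        n >>= 7
        length += 1
    return length


def links_count(links) -> int:
    """Naive method: number of directory entries."""
    return len(links)


def links_bytes(links) -> int:
    """Sum of name length and CID byte length for each link."""
    return sum(len(name.encode("utf-8")) + len(cid) for name, cid, _ in links)


def block_bytes(links, data=None) -> int:
    """Size of the serialized PBNode: Links (field 2) plus optional Data (field 1)."""
    total = 0
    for name, cid, tsize in links:
        name_b = name.encode("utf-8")
        link = (1 + varint_len(len(cid)) + len(cid)           # Hash, field 1
                + 1 + varint_len(len(name_b)) + len(name_b)   # Name, field 2
                + 1 + varint_len(tsize))                      # Tsize, field 3
        total += 1 + varint_len(link) + link                  # wrapped PBLink
    if data is not None:
        total += 1 + varint_len(len(data)) + len(data)
    return total


# Two entries with 36-byte CIDs (CIDv1, sha2-256) and 2 bytes of UnixFS Data.
links = [("a", bytes(36), 100), ("b", bytes(36), 100)]
print(links_count(links), links_bytes(links), block_bytes(links, b"\x08\x01"))
# 2 74 94
```

The gap between `links_bytes` and `block_bytes` is the per-link protobuf framing plus `Tsize` and `Data`, which is why a threshold such as 256KiB triggers sharding at different directory sizes depending on the method.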
