61 changes: 61 additions & 0 deletions docs/concepts/upath.md
@@ -239,9 +239,70 @@ process_file(
)
```

## Path Equality and Filesystem Identity

Unlike `pathlib.Path`, which compares paths by their string representation alone, `UPath` also considers **filesystem identity** when comparing paths. Two UPaths are equal if they refer to the same file on the same filesystem.

### How Equality Works

```python
from upath import UPath

# Same path, same filesystem -> equal (even with different options)
UPath('s3://bucket/file.txt') == UPath('s3://bucket/file.txt', anon=True) # True

# Same path, different filesystem -> not equal
UPath('s3://bucket/file.txt') != UPath(
    's3://bucket/file.txt', endpoint_url='http://localhost:9000'
)  # True
```

### Filesystem Identity (fsid)

UPath uses **fsid** (filesystem identifier) to determine if two paths are on the same filesystem. If a cached filesystem exists and implements fsid, that value is used. Otherwise, fsid is computed from the protocol, storage_options, and fsspec global config (`fsspec.config.conf`), **without instantiating the filesystem**. This allows path comparison to work abstractly without requiring credentials or network access.

Unlike fsspec filesystems, which raise `NotImplementedError` when fsid is not implemented, `UPath.fsid` returns `None` if the filesystem identity cannot be determined (e.g., for unknown protocols or wrapper filesystems). When fsid is `None`, path comparison falls back to comparing `storage_options` directly.

The following table lists what the computed fsid is based on for common filesystems:

| Filesystem | Identity Based On |
|------------|-------------------|
| Local (`file://`, paths) | Always `"local"` |
| HTTP/HTTPS | Always `"http"` |
| S3 | `endpoint_url` (AWS endpoints normalized) |
| GCS | Always `"gcs"` (single global endpoint) |
| Azure Blob | `account_name` |
| SFTP/SSH | `host` + `port` |
| SMB | `host` + `port` |

Options like authentication (`anon`, `key`, `token`), performance settings (`block_size`), and behavior flags (`auto_mkdir`) don't affect filesystem identity.
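
The rules in the table can be sketched in a few lines. This is a simplified, hypothetical illustration and not the library's code: `toy_fsid` and its `global_config` parameter are invented names, `hashlib` stands in for whatever hashing the library uses, and only three protocol families are covered.

```python
from collections import ChainMap
from hashlib import sha256
from urllib.parse import urlparse


def toy_fsid(protocol, storage_options, global_config=None):
    """Hypothetical, simplified identity function (not the library's code)."""
    # Explicit storage_options take precedence over per-protocol global defaults.
    defaults = (global_config or {}).get(protocol, {})
    opts = ChainMap(dict(storage_options), defaults)

    if protocol in ("", "file", "local"):
        return "local"
    if protocol in ("s3", "s3a"):
        endpoint = opts.get("endpoint_url", "https://s3.amazonaws.com")
        host = urlparse(endpoint).hostname or ""
        if host.endswith(".amazonaws.com"):
            return "s3_aws"  # every AWS endpoint collapses to one identity
        return "s3_" + sha256(endpoint.encode()).hexdigest()[:16]
    if protocol in ("sftp", "ssh"):
        host = opts.get("host")
        if not host:
            return None  # no host: identity cannot be determined
        port = opts.get("port", 22)
        return "sftp_" + sha256(f"{host}:{port}".encode()).hexdigest()[:16]
    return None  # unknown protocol: identity cannot be determined


# Authentication options do not change the identity; the endpoint does.
print(toy_fsid("s3", {}) == toy_fsid("s3", {"anon": True}))
print(toy_fsid("s3", {"endpoint_url": "http://localhost:9000"}) == toy_fsid("s3", {}))
```

Note that nothing in the sketch opens a connection or reads credentials, which is the property the section describes: identity is derived from configuration alone.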

### Impact on Path Operations

Filesystem identity affects `relative_to()`, `is_relative_to()`, and parent comparisons:

```python
from upath import UPath

base = UPath('s3://bucket/data')
child = UPath('s3://bucket/data/file.txt', anon=True)

# Works: same filesystem despite different storage_options
child.relative_to(base) # PurePosixPath('file.txt')
child.is_relative_to(base) # True
base in child.parents # True
```

### Comparison with pathlib.Path

| Aspect | `pathlib.Path` | `UPath` |
|--------|----------------|---------|
| Equality based on | Path string only | Protocol + path + filesystem identity |
| `storage_options` | N/A | Ignored if fsid can be determined |
| Different credentials | N/A | Equal (same filesystem) |
| Different endpoints | N/A | Not equal (different filesystem) |

## Learn More

- **pathlib concepts**: See [pathlib.md](pathlib.md) for details on the pathlib API
- **fsspec backends**: See [filesystems.md](fsspec.md) for information about available filesystems
- **API reference**: Check the [API documentation](../api/index.md) for complete method details
- **fsspec details**: Visit [fsspec documentation](https://filesystem-spec.readthedocs.io/) for filesystem-specific options
- **Migration guide**: See [migration.md](../migration.md) for version-specific changes
98 changes: 98 additions & 0 deletions docs/migration.md
@@ -9,6 +9,104 @@ This guide helps you migrate to newer versions of universal-pathlib.
and this guide is missing information.


## Migrating to v0.4.0

Version `0.4.0` changes how `UPath` determines path equality. Previously, paths with different `storage_options` were always considered unequal. Now, equality is based on **filesystem identity** (fsid), which ignores options that don't affect which filesystem is being accessed.

### Background: The Problem with storage_options Equality

In versions prior to `0.4.0`, `UPath.__eq__` compared `storage_options` directly:

```python
# Pre-0.4.0 behavior (unintuitive)
from upath import UPath

# Same S3 file, but different auth options -> NOT equal
UPath('s3://bucket/file.txt') == UPath('s3://bucket/file.txt', anon=True) # False

# Same local file, but different behavior options -> NOT equal
UPath('/tmp/file.txt') == UPath('/tmp/file.txt', auto_mkdir=True) # False
```

This caused subtle bugs when comparing paths that referred to the same filesystem resource. Methods like `relative_to()` and `is_relative_to()` would fail unexpectedly:

```python
# Pre-0.4.0: This raised ValueError despite referring to the same S3 bucket
p1 = UPath('s3://bucket/dir/file.txt', anon=True)
p2 = UPath('s3://bucket/dir')
p1.relative_to(p2) # ValueError: incompatible storage_options
```

### New Behavior: Filesystem Identity (fsid)

Starting with `0.4.0`, equality is based on filesystem identity. Two UPaths are equal if they have the same protocol, path, and filesystem identity—regardless of authentication or performance options:

```python
# v0.4.0+ behavior
from upath import UPath

# Same filesystem, different options -> equal
UPath('s3://bucket/file.txt') == UPath('s3://bucket/file.txt', anon=True) # True
UPath('/tmp/file.txt') == UPath('/tmp/file.txt', auto_mkdir=True) # True

# Different filesystems -> not equal
UPath('s3://bucket/file.txt') != UPath(
    's3://bucket/file.txt', endpoint_url='http://localhost:9000'
)  # True (MinIO vs AWS)
```

**Options ignored for equality** (don't affect filesystem identity):

- Authentication: `anon`, `key`, `secret`, `token`, `profile`
- Performance: `default_block_size`, `default_cache_type`, `max_concurrency`
- Behavior: `auto_mkdir`, `default_acl`, `requester_pays`

**Options that affect equality** (change which filesystem is accessed):

- S3: Different `endpoint_url` (e.g., AWS vs MinIO vs LocalStack)
- Azure: Different `account_name`
- SFTP/SMB/FTP: Different `host` or `port`

### Impact on Path Operations

The `relative_to()` and `is_relative_to()` methods now use filesystem identity:

```python
from upath import UPath

p1 = UPath('s3://bucket/dir/file.txt', anon=True)
p2 = UPath('s3://bucket/dir') # Different storage_options, same filesystem

# v0.4.0+: Works because both paths are on the same S3 filesystem
p1.is_relative_to(p2) # True
p1.relative_to(p2) # PurePosixPath('file.txt')

# Different endpoints are correctly rejected
p3 = UPath('s3://bucket/dir', endpoint_url='http://localhost:9000')
p1.is_relative_to(p3) # False (different filesystem)
p1.relative_to(p3) # ValueError: incompatible filesystems
```

### Migration Checklist

If your code relied on the previous behavior where different `storage_options` meant different paths:

1. **Review equality checks**: Code that expected `UPath(url, opt1=x) != UPath(url, opt1=y)` to be `True` may now get `False`, because the two paths compare equal when they are on the same filesystem.

2. **Check set/dict usage**: Paths that were previously distinct dict keys or set members may now collide. Note that `__hash__` already ignored `storage_options`, so this is unlikely to be a new issue.

3. **Update tests**: Tests that asserted inequality based on `storage_options` differences may need updating.
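
If some call sites still need the stricter pre-0.4.0 semantics, one option is to compare `storage_options` explicitly alongside the path. The helper below is a minimal sketch under that assumption: `same_path_and_options` and `StubPath` are hypothetical names, and the stub exists only so the example runs without universal-pathlib installed.

```python
def same_path_and_options(a, b):
    """True only if the paths are equal AND carry identical storage_options."""
    return a == b and dict(a.storage_options) == dict(b.storage_options)


class StubPath:
    """Hypothetical stand-in for a path object with a storage_options attribute."""

    def __init__(self, url, **storage_options):
        self.url = url
        self.storage_options = storage_options

    def __eq__(self, other):
        # Mimics the v0.4.0 idea: options do not take part in equality.
        return isinstance(other, StubPath) and self.url == other.url

    def __hash__(self):
        return hash(self.url)


a = StubPath("s3://bucket/file.txt")
b = StubPath("s3://bucket/file.txt", anon=True)
print(a == b)                       # equal under identity-based comparison
print(same_path_and_options(a, b))  # strict check still tells them apart
```

Keeping the strict check in a named helper makes the intent visible at the call site instead of relying on `==` to mean two different things across versions.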

### Fallback Behavior

For filesystems where UPath cannot determine identity (e.g., memory filesystem, unknown protocols), it falls back to comparing `storage_options` directly—preserving pre-0.4.0 behavior:

```python
from upath import UPath

# Memory filesystem: no fsid, falls back to storage_options comparison
UPath('memory:///file.txt', opt=1) != UPath('memory:///file.txt', opt=2) # True
```
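
The decision order described above can be sketched as follows. `ToyPath` and `toy_equal` are hypothetical names used only for illustration. How the library treats the case where only one side has an fsid is not stated here, so the sketch assumes it also falls back to `storage_options`.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class ToyPath:
    """Hypothetical stand-in for a UPath; fsid is supplied by the caller."""

    protocol: str
    path: str
    fsid: str | None = None
    storage_options: dict = field(default_factory=dict)


def toy_equal(a: ToyPath, b: ToyPath) -> bool:
    # Protocol and path must always match.
    if (a.protocol, a.path) != (b.protocol, b.path):
        return False
    # When both identities are known, they alone decide the comparison.
    if a.fsid is not None and b.fsid is not None:
        return a.fsid == b.fsid
    # Assumption: otherwise fall back to comparing storage_options directly.
    return a.storage_options == b.storage_options


s3_a = ToyPath("s3", "bucket/file.txt", "s3_aws")
s3_b = ToyPath("s3", "bucket/file.txt", "s3_aws", {"anon": True})
mem_a = ToyPath("memory", "/file.txt", None, {"opt": 1})
mem_b = ToyPath("memory", "/file.txt", None, {"opt": 2})
print(toy_equal(s3_a, s3_b))    # identity known: options are ignored
print(toy_equal(mem_a, mem_b))  # identity unknown: options decide
```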

## Migrating to v0.3.0

Version `0.3.0` introduced a breaking change to fix a longstanding bug related to `os.PathLike` protocol compliance. This change affects how UPath instances work with standard library functions that expect local filesystem paths.
129 changes: 129 additions & 0 deletions upath/_fsid.py
@@ -0,0 +1,129 @@
"""Filesystem identity (fsid) fallback computation.

This module provides `_fallback_fsid` to compute filesystem identity from
protocol, storage_options, and fsspec global config (`fsspec.config.conf`)
without instantiating the filesystem.

The fsid is used by __eq__, relative_to, and is_relative_to to determine
if two paths are on the same filesystem. The key insight is that many
storage_options (like authentication or performance settings) don't affect
*which* filesystem is being accessed, only *how* it's accessed.

For filesystems where fsid cannot be determined (e.g., memory filesystem,
unknown protocols), returns None and callers fall back to comparing
storage_options directly.
"""

from __future__ import annotations

from collections import ChainMap
from collections.abc import Mapping
from typing import Any

from fsspec.config import conf as fsspec_conf
from fsspec.utils import tokenize

__all__ = ["_fallback_fsid"]


def _fallback_fsid(protocol: str, storage_options: Mapping[str, Any]) -> str | None:
    """Compute fsid from protocol, storage_options, and fsspec global config."""
    global_opts = fsspec_conf.get(protocol)
    opts: Mapping[str, Any] = (
        ChainMap(storage_options, global_opts)  # type: ignore[arg-type]
        if global_opts
        else storage_options
    )

    match protocol:
        # Static fsid (no instance attributes needed)
        case "" | "file" | "local":
            return "local"
        case "http" | "https":
            return "http"
        case "memory" | "memfs":
            return None  # Non-durable, fall back to storage_options
        case "data":
            return None  # Non-durable

        # Host + port based
        case "sftp" | "ssh":
            host = opts.get("host", "")
            port = opts.get("port", 22)
            return f"sftp_{tokenize(host, port)}" if host else None
        case "smb":
            host = opts.get("host", "")
            port = opts.get("port", 445)
            return f"smb_{tokenize(host, port)}" if host else None
        case "ftp":
            host = opts.get("host", "")
            port = opts.get("port", 21)
            return f"ftp_{tokenize(host, port)}" if host else None
        case "webhdfs" | "webHDFS":
            host = opts.get("host", "")
            port = opts.get("port", 50070)
            return f"webhdfs_{tokenize(host, port)}" if host else None

        # Cloud object storage
        case "s3" | "s3a":
            endpoint = opts.get("endpoint_url", "https://s3.amazonaws.com")
            # Normalize AWS endpoints
            from urllib.parse import urlparse

            parsed = urlparse(endpoint)
            if parsed.netloc.endswith(".amazonaws.com"):
                return "s3_aws"
            return f"s3_{tokenize(endpoint)}"
        case "gcs" | "gs":
            return "gcs"  # Single global endpoint
        case "abfs" | "az":
            account = opts.get("account_name", "")
            return f"abfs_{tokenize(account)}" if account else None
        case "adl":
            tenant = opts.get("tenant_id", "")
            store = opts.get("store_name", "")
            return f"adl_{tokenize(tenant, store)}" if tenant and store else None
        case "oci":
            region = opts.get("region", "")
            return f"oci_{tokenize(region)}" if region else None
        case "oss":
            endpoint = opts.get("endpoint", "")
            return f"oss_{tokenize(endpoint)}" if endpoint else None

        # Git-based
        case "git":
            path = opts.get("path", "")
            ref = opts.get("ref", "")
            return f"git_{tokenize(path, ref)}" if path else None
        case "github":
            org = opts.get("org", "")
            repo = opts.get("repo", "")
            sha = opts.get("sha", "")
            return f"github_{tokenize(org, repo, sha)}" if org and repo else None

        # Platform-specific
        case "hf":
            endpoint = opts.get("endpoint", "huggingface.co")
            return f"hf_{tokenize(endpoint)}"
        case "lakefs":
            host = opts.get("host", "")
            return f"lakefs_{tokenize(host)}" if host else None
        case "webdav":
            base_url = opts.get("base_url", "")
            return f"webdav_{tokenize(base_url)}" if base_url else None
        case "box":
            return "box"
        case "dropbox":
            return "dropbox"

        # Wrappers - delegate to underlying
        case "simplecache" | "filecache" | "blockcache" | "cached":
            return None  # Complex, fall back

        # Archive filesystems - need underlying fs info
        case "zip" | "tar":
            return None  # Complex, fall back

        # Default: unknown protocol, fall back to storage_options
        case _:
            return None