Skip to content
Draft
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
6 changes: 3 additions & 3 deletions mkdocs/docs/geospatial.md
Original file line number Diff line number Diff line change
Expand Up @@ -84,11 +84,11 @@ point_wkb = bytes.fromhex("0101000000000000000000000000000000000000")

1. **WKB/WKT Conversion**: Converting between WKB bytes and WKT strings requires external libraries (like Shapely). PyIceberg does not include this conversion to avoid heavy dependencies.

2. **Spatial Predicates**: Spatial filtering (e.g., ST_Contains, ST_Intersects) is not yet supported for query pushdown.
2. **Spatial Predicates Execution**: Spatial predicate APIs (`st-contains`, `st-intersects`, `st-within`, `st-overlaps`) are available in expression trees and binding. Row-level execution and metrics/pushdown evaluation are not implemented yet.

3. **Bounds Metrics**: Geometry/geography columns do not currently contribute to data file bounds metrics.
3. **Without geoarrow-pyarrow**: When the `geoarrow-pyarrow` package is not installed, geometry and geography columns are stored as binary without GeoArrow extension type metadata. The Iceberg schema preserves type information, but other tools reading the Parquet files directly may not recognize them as spatial types. Install with `pip install pyiceberg[geoarrow]` for full GeoArrow support.

4. **Without geoarrow-pyarrow**: When the `geoarrow-pyarrow` package is not installed, geometry and geography columns are stored as binary without GeoArrow extension type metadata. The Iceberg schema preserves type information, but other tools reading the Parquet files directly may not recognize them as spatial types. Install with `pip install pyiceberg[geoarrow]` for full GeoArrow support.
4. **GeoArrow planar ambiguity**: In GeoArrow metadata, `geometry` and `geography(..., 'planar')` can be encoded identically (no explicit edge metadata). PyIceberg resolves this ambiguity at the Arrow/Parquet schema-compatibility boundary by treating them as compatible when CRS matches, while keeping core schema compatibility strict elsewhere.

## Format Version

Expand Down
79 changes: 79 additions & 0 deletions mkdocs/docs/plans/Geospatial_PR_How_To_Review.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,79 @@
# How To Review: Geospatial Compatibility, Metrics, and Expressions PR

## Goal

This PR is large because it spans expression APIs, Arrow/Parquet conversion, metrics generation, and documentation.
Recommended strategy: review in focused passes by concern, not file order.

## Recommended Review Order

1. **Decisions and intended behavior**
- `mkdocs/docs/dev/geospatial-types-decisions-v1.md`
- Confirm the intended policy, especially Decision 9 (planar ambiguity boundary handling).

2. **Core geospatial utility correctness**
- `pyiceberg/utils/geospatial.py`
- `tests/utils/test_geospatial.py`
- Focus on envelope extraction, antimeridian behavior, and bound encoding formats.

3. **Metrics integration and write/import paths**
- `pyiceberg/io/pyarrow.py`
- `tests/io/test_pyarrow_stats.py`
- Focus on:
- geospatial bounds derived from row WKB values
- skipping Parquet binary min/max for geospatial columns
- partition inference not using geospatial envelope bounds

4. **GeoArrow compatibility and ambiguity boundary**
- `pyiceberg/schema.py`
- `pyiceberg/io/pyarrow.py`
- `tests/io/test_pyarrow.py`
- Confirm:
- planar equivalence enabled only at Arrow/Parquet boundary
- spherical mismatch still fails
- fallback warning when GeoArrow dependency is absent

5. **Spatial expression surface area**
- `pyiceberg/expressions/__init__.py`
- `pyiceberg/expressions/visitors.py`
- `tests/expressions/test_spatial_predicates.py`
- Confirm:
- bind-time type checks (geometry/geography only)
- visitor plumbing is complete
- conservative evaluator behavior is explicit and documented

6. **User-facing docs**
- `mkdocs/docs/geospatial.md`
- Check limitations and behavior notes match implementation.

## High-Risk Areas To Inspect Closely

1. **Boundary scope leakage**
- Ensure planar-equivalence relaxation is not enabled globally.

2. **Envelope semantics**
- Geography antimeridian cases (`xmin > xmax`) are expected and intentional.

3. **Metrics correctness**
- Geospatial bounds are serialized envelopes, not raw value min/max.

4. **Conservative evaluator behavior**
- Spatial predicates should not accidentally become strict in metrics/manifest evaluators.

## Quick Validation Commands

```bash
uv run --extra hive --extra bigquery python -m pytest tests/utils/test_geospatial.py -q
uv run --extra hive --extra bigquery python -m pytest tests/io/test_pyarrow_stats.py -k "geospatial or planar_geography_schema or partition_inference_skips_geospatial_bounds" -q
uv run --extra hive --extra bigquery python -m pytest tests/io/test_pyarrow.py -k "geoarrow or planar_geography_geometry_equivalence or spherical_geography_geometry_equivalence or logs_warning_once" -q
uv run --extra hive --extra bigquery python -m pytest tests/expressions/test_spatial_predicates.py tests/expressions/test_visitors.py -k "spatial or translate_column_names" -q
```

## Review Outcome Checklist

1. Geometry/geography bounds are present and correctly encoded for write/import paths.
2. `geometry` vs `geography(planar)` is only equivalent at Arrow/Parquet compatibility boundary with CRS equality.
3. `geography(spherical)` remains incompatible with `geometry`.
4. Spatial predicates are correctly modeled/bound; execution and pushdown remain intentionally unimplemented.
5. Missing GeoArrow dependency degrades gracefully with explicit warning.
6. Docs match implemented behavior and limitations.
189 changes: 187 additions & 2 deletions pyiceberg/expressions/__init__.py
Original file line number Diff line number Diff line change
Expand Up @@ -17,10 +17,11 @@

from __future__ import annotations

import builtins
from abc import ABC, abstractmethod
from collections.abc import Callable, Iterable, Sequence
from functools import cached_property
from typing import Any, TypeAlias
from typing import Any, TypeAlias, cast
from typing import Literal as TypingLiteral

from pydantic import ConfigDict, Field, SerializeAsAny, model_validator
Expand All @@ -29,7 +30,7 @@
from pyiceberg.expressions.literals import AboveMax, BelowMin, Literal, literal
from pyiceberg.schema import Accessor, Schema
from pyiceberg.typedef import IcebergBaseModel, IcebergRootModel, L, LiteralValue, StructProtocol
from pyiceberg.types import DoubleType, FloatType, NestedField
from pyiceberg.types import DoubleType, FloatType, GeographyType, GeometryType, NestedField
from pyiceberg.utils.singleton import Singleton


Expand All @@ -48,6 +49,16 @@ def _to_literal(value: L | Literal[L]) -> Literal[L]:
return literal(value)


def _to_bytes(value: bytes | bytearray | memoryview) -> bytes:
if isinstance(value, bytes):
return value
if isinstance(value, bytearray):
return bytes(value)
if isinstance(value, memoryview):
return value.tobytes()
raise TypeError(f"Expected bytes-like value, got {type(value)}")


class BooleanExpression(IcebergBaseModel, ABC):
"""An expression that evaluates to a boolean."""

Expand Down Expand Up @@ -109,6 +120,14 @@ def handle_primitive_type(cls, v: Any, handler: ValidatorFunctionWrapHandler) ->
return StartsWith(**v)
elif field_type == "not-starts-with":
return NotStartsWith(**v)
elif field_type == "st-contains":
return STContains(**v)
elif field_type == "st-intersects":
return STIntersects(**v)
elif field_type == "st-within":
return STWithin(**v)
elif field_type == "st-overlaps":
return STOverlaps(**v)

# Set
elif field_type == "in":
Expand Down Expand Up @@ -1106,3 +1125,169 @@ def __invert__(self) -> StartsWith:
@property
def as_bound(self) -> type[BoundNotStartsWith]: # type: ignore
return BoundNotStartsWith


class SpatialPredicate(UnboundPredicate, ABC):
type: TypingLiteral["st-contains", "st-intersects", "st-within", "st-overlaps"] = Field(alias="type")
term: UnboundTerm
value: bytes = Field()
model_config = ConfigDict(populate_by_name=True, frozen=True, arbitrary_types_allowed=True)

def __init__(
self,
term: str | UnboundTerm,
geometry: bytes | bytearray | memoryview | None = None,
**kwargs: Any,
) -> None:
if geometry is None and "value" in kwargs:
geometry = kwargs["value"]
if geometry is None:
raise TypeError("Spatial predicates require WKB bytes")

super().__init__(term=_to_unbound_term(term), value=_to_bytes(geometry))

@property
def geometry(self) -> bytes:
return self.value

def bind(self, schema: Schema, case_sensitive: bool = True) -> BoundSpatialPredicate:
bound_term = self.term.bind(schema, case_sensitive)
if not isinstance(bound_term.ref().field.field_type, (GeometryType, GeographyType)):
raise TypeError(f"Spatial predicates can only be bound against geometry/geography fields: {bound_term.ref().field}")
bound_cls = cast(Any, self.as_bound)
return bound_cls(bound_term, self.geometry)

def __eq__(self, other: Any) -> bool:
"""Return whether two spatial predicates are equivalent."""
if isinstance(other, self.__class__):
return self.term == other.term and self.geometry == other.geometry
return False

def __str__(self) -> str:
"""Return a human-readable representation."""
return f"{str(self.__class__.__name__)}(term={repr(self.term)}, geometry={self.geometry!r})"

def __repr__(self) -> str:
"""Return the debug representation."""
return f"{str(self.__class__.__name__)}(term={repr(self.term)}, geometry={self.geometry!r})"

@property
@abstractmethod
def as_bound(self) -> builtins.type[BoundSpatialPredicate]: ...


class BoundSpatialPredicate(BoundPredicate, ABC):
value: bytes = Field()

def __init__(self, term: BoundTerm, geometry: bytes | bytearray | memoryview):
super().__init__(term=term, value=_to_bytes(geometry))

@property
def geometry(self) -> bytes:
return self.value

def __eq__(self, other: Any) -> bool:
"""Return whether two bound spatial predicates are equivalent."""
if isinstance(other, self.__class__):
return self.term == other.term and self.geometry == other.geometry
return False

def __str__(self) -> str:
"""Return a human-readable representation."""
return f"{self.__class__.__name__}(term={str(self.term)}, geometry={self.geometry!r})"

def __repr__(self) -> str:
"""Return the debug representation."""
return f"{str(self.__class__.__name__)}(term={repr(self.term)}, geometry={self.geometry!r})"

@property
@abstractmethod
def as_unbound(self) -> type[SpatialPredicate]: ...


class BoundSTContains(BoundSpatialPredicate):
def __invert__(self) -> BooleanExpression:
"""Return the negated expression."""
return Not(child=self)

@property
def as_unbound(self) -> type[STContains]:
return STContains


class BoundSTIntersects(BoundSpatialPredicate):
def __invert__(self) -> BooleanExpression:
"""Return the negated expression."""
return Not(child=self)

@property
def as_unbound(self) -> type[STIntersects]:
return STIntersects


class BoundSTWithin(BoundSpatialPredicate):
def __invert__(self) -> BooleanExpression:
"""Return the negated expression."""
return Not(child=self)

@property
def as_unbound(self) -> type[STWithin]:
return STWithin


class BoundSTOverlaps(BoundSpatialPredicate):
def __invert__(self) -> BooleanExpression:
"""Return the negated expression."""
return Not(child=self)

@property
def as_unbound(self) -> type[STOverlaps]:
return STOverlaps


class STContains(SpatialPredicate):
type: TypingLiteral["st-contains"] = Field(default="st-contains", alias="type")

def __invert__(self) -> BooleanExpression:
"""Return the negated expression."""
return Not(child=self)

@property
def as_bound(self) -> builtins.type[BoundSTContains]:
return BoundSTContains


class STIntersects(SpatialPredicate):
type: TypingLiteral["st-intersects"] = Field(default="st-intersects", alias="type")

def __invert__(self) -> BooleanExpression:
"""Return the negated expression."""
return Not(child=self)

@property
def as_bound(self) -> builtins.type[BoundSTIntersects]:
return BoundSTIntersects


class STWithin(SpatialPredicate):
type: TypingLiteral["st-within"] = Field(default="st-within", alias="type")

def __invert__(self) -> BooleanExpression:
"""Return the negated expression."""
return Not(child=self)

@property
def as_bound(self) -> builtins.type[BoundSTWithin]:
return BoundSTWithin


class STOverlaps(SpatialPredicate):
type: TypingLiteral["st-overlaps"] = Field(default="st-overlaps", alias="type")

def __invert__(self) -> BooleanExpression:
"""Return the negated expression."""
return Not(child=self)

@property
def as_bound(self) -> builtins.type[BoundSTOverlaps]:
return BoundSTOverlaps
Loading