docs(rfc): add static CSV provider specification#1701

Draft
LNSD wants to merge 1 commit into main from lnsd/feat-providers-static-external-table

LNSD (Contributor) commented Feb 5, 2026

Define the full design for amp-providers-static, covering provider config schema, CSV schema inference with column name sanitization, small-file in-memory caching, and lazy catalog integration into the providers registry.

  • Specify three-phase implementation plan with dependency ordering
  • Document provider TOML config with grouped tables and column mapping
  • Define schema inference rules, header auto-detection, and sanitization
  • Outline in-memory cache strategy with configurable byte threshold
  • Record all resolved design decisions in verification log
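
As an illustration of the column name sanitization step mentioned above, here is a minimal sketch. The rules shown (lowercasing, replacing non-alphanumeric runs with `_`, prefixing names that start with a digit, numeric suffixes for collisions) and the function name `sanitize_column_names` are assumptions made for the example; they are not taken from the RFC, which may define different rules.

```python
import re


def sanitize_column_names(raw_names):
    """Map raw CSV header cells to unique, SQL-friendly identifiers.

    Hypothetical rules (not taken from the RFC): lowercase, replace runs of
    characters outside [0-9a-z] with "_", prefix names that start with a
    digit, and append a numeric suffix to resolve collisions.
    """
    seen = set()
    result = []
    for index, raw in enumerate(raw_names):
        # Collapse every run of unsupported characters into one underscore.
        name = re.sub(r"[^0-9a-z]+", "_", raw.strip().lower()).strip("_")
        if not name:
            # Empty or fully non-alphanumeric header: fall back to position.
            name = f"column_{index}"
        elif name[0].isdigit():
            # Identifiers starting with a digit are awkward in SQL.
            name = f"_{name}"
        # Resolve collisions with a numeric suffix (_2, _3, ...).
        candidate, suffix = name, 1
        while candidate in seen:
            suffix += 1
            candidate = f"{name}_{suffix}"
        seen.add(candidate)
        result.append(candidate)
    return result
```

Under these assumed rules, the headers `Block Number` and `block number` in the same file would map to `block_number` and `block_number_2`, so the inferred schema never contains duplicate column names.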

Signed-off-by: Lorenzo Delgado <lorenzo@edgeandnode.com>

LNSD self-assigned this Feb 5, 2026
LNSD added the data-plane label Feb 5, 2026
LNSD changed the title from "docs: add static CSV provider specification" to "docs(rfc): add static CSV provider specification" Feb 5, 2026

leoyvens (Collaborator) commented Feb 5, 2026

Did you consider using datasets instead of providers for this? Then this would benefit from the tooling for dataset discoverability.

LNSD (Contributor, Author) commented Feb 5, 2026

> Did you consider using datasets instead of providers for this? Then this would benefit from the tooling for dataset discoverability.

That's a very good point, and yes, it is something we should consider after the POC.

I see, at this moment, two main issues:

Coupling between datasets and materialization

Datasets require writing Parquet files to the Amp data lake.

Basically, there is no separation between extractors and datasets. These two concepts are tightly coupled. With the work in #1673, we'll be able to separate the two concepts (the materialized data from the dataset definition).

A "static-file" dataset: permissioned nature

The issue stems from the nature of the data access: a CSV file stored in an object store.

Datasets, as we understand them, are building blocks, distributable units. If one needs credentials to access that file (i.e., it is permissioned), that would limit the utility of that dataset.

In the end, for me, the provider concept (external services that act as a data source) fits naturally in the mental model.

I am advocating for a POC to enable some use cases in the short term, and that can evolve alongside the dataset authoring work happening in parallel.

LNSD (Contributor, Author) commented Feb 5, 2026

Note that the schema description here is a proposal that could be included or replaced completely by the dataset authoring design (e.g., by introducing a new dataset kind).

leoyvens (Collaborator) commented Feb 5, 2026

Alright, we can try out this design then.

leoyvens (Collaborator) commented Feb 6, 2026

> Datasets require writing Parquet files to the Amp data lake.

Just to comment on this aspect: there are tradeoffs, but it wouldn't be unreasonable to design this such that the CSV data is copied over into the Amp table format.
