fix(glue): Support create_table for S3 Tables federated databases#3058
fix(glue): Support create_table for S3 Tables federated databases#3058jamesbornholt wants to merge 4 commits intoapache:mainfrom
Conversation
GlueCatalog.create_table() fails for S3 Tables because it tries to write Iceberg metadata to a warehouse-derived location before registering the table in Glue. S3 Tables manages storage internally, so the location is not known until the table exists in the service. Detect S3 Tables federated databases (FederatedDatabase.ConnectionType == "aws:s3tables") and use a two-phase create: first create a minimal table entry in Glue to have S3 Tables allocate storage, then write Iceberg metadata to the managed location and update the Glue table with the metadata pointer. On failure, clean up the table entry. This follows the same "pre-create then update" pattern used by the Java GlueCatalog when LakeFormation is enabled (GlueTableOperations.createGlueTempTableIfNecessary).
5633f4e to
9708909
Compare
|
thanks for the PR! do you know of a way to test this change? |
Add three moto-based tests for the S3 Tables code path in GlueCatalog: - create_table succeeds and uses the table warehouse location - create_table rejects an explicit location for S3 Tables databases - create_table raises TableAlreadyExistsError for duplicate tables Moto does not support FederatedDatabase or S3 Tables storage allocation, so the tests patch FakeDatabase.as_dict to return FederatedDatabase and FakeTable.__init__ to inject a StorageDescriptor.Location.
0c4f375 to
4f9299e
Compare
|
I took a stab at writing some tests. It requires monkeypatching moto to emulate Glue's behavior here, which is not super elegant, but it captures the right behavior. |
|
@geruh could you take a look when you get a chance? |
geruh
left a comment
There was a problem hiding this comment.
Thanks for raising this @jamesbornholt! I left a few comments. At a high level, Looks like we can already both load and commit against s3t tables federated to glue. This would just fill the create_table gap.
Initially I was a bit hesitant about adding support for this use case as it requires 4/5 api calls and in the non s3t federated case it adds an additional call. Although I don't think there is any way around this.
WDYT?
|
|
||
| def _create_table_s3tables( | ||
| self, | ||
| identifier: str | Identifier, |
There was a problem hiding this comment.
nit: we have some redundancy with these params passed in: ident, db_name, tbl_name maybe we can derive?
pyiceberg/catalog/glue.py
Outdated
| ) | ||
| raise | ||
|
|
||
| return self.load_table(identifier=identifier) |
There was a problem hiding this comment.
we could probably avoid the last load table here with the update table response
There was a problem hiding this comment.
glue UpdateTable doesn't return anything, but we can actually just construct the table from the staged metadata the same way we do it in the commit path: https://github.com/jamesbornholt/iceberg-python/blob/89a129c61b5f6320e78d7875a615b2383d590fa1/pyiceberg/catalog/glue.py#L381-L387. i've done that for both the S3 Tables and regular creation paths now.
…reate_table_s3tables signature - Extract "aws:s3tables" string literal into GLUE_CONNECTION_S3_TABLES constant - Remove redundant database_name/table_name params from _create_table_s3tables; derive them from identifier inside the method instead
Avoid a redundant GetTable + GetObject round-trip at the end of create_table and _create_table_s3tables by constructing the Table from the staged metadata that was just written, matching the pattern used in commit_table.
I was worried about this too and it's definitely unfortunate, but I also don't see a way around it, and since it's only on the creation path I think it's OK. I took the suggestion of removing the extra LoadTable request at the end of creation, so the non-federated path nets out as the same number of roundtrips now. |
Rationale for this change
GlueCatalog.create_table() fails for S3 Tables because it tries to write Iceberg metadata to a warehouse-derived location before registering the table in Glue. S3 Tables manages storage internally, so the location is not known until the table exists in the service.
Detect S3 Tables federated databases (FederatedDatabase.ConnectionType == "aws:s3tables") and use a two-phase create: first create a minimal table entry in Glue to have S3 Tables allocate storage, then write Iceberg metadata to the managed location and update the Glue table with the metadata pointer. On failure, clean up the table entry.
This follows the same "pre-create then update" pattern used by the Java GlueCatalog when LakeFormation is enabled
(GlueTableOperations.createGlueTempTableIfNecessary).
Are these changes tested?
moto doesn't mock with enough fidelity to test this directly. I've tested locally against both S3 Tables and vanilla Glue Iceberg and confirmed CreateTable still works on both paths.
Are there any user-facing changes?
No.