[SPEC | CORE] : Allow table level override for scan planning #14867
While the changes make sense to me, we may want to discuss this in the broader community to decide whether we want to override server-side scan planning at the table level.
Thanks for raising this @singhpk234! I like the direction here. As a user today, there are two modes: either use scan planning or not. That raises the question: when should I use one versus the other? Right now, there is no clear insight or story from the user's perspective. From a catalog's perspective, the modes make sense. For instance, if the catalog is using planning to enforce governance, the
Thanks for the feedback @geruh !
Optional in this context means that the catalog really doesn't have an opinion on what the client decides; it can choose local or remote. The way I was thinking: let's say you are running a lot of concurrent queries in your Spark cluster and your driver is slim; even though we are using Spark, we may prefer spark. That being said, yes, in this impl, if the catalog supports the plan endpoint and the catalog doesn't have any opinion on this, the Java impl always does scan planning. Yes, being able to toggle this based on client-side config would be ideal. Maybe when the server sends optional, and on the client side we have configured required, we should not allow overwriting the key to optional, and reuse the optional? WDYT? I see @RussellSpitzer has similar feedback in the ML thread too; let me take a deeper look at their feedback and respond there as well.
Decision matrix : scan planning mode (required | optional | none) :
Decision matrix : scan planning mode(client only | client preferred | catalog preferred | catalog only)
I wasn't thinking about it quite that way. I was assuming the client is configured independently of the catalog. The client can either have a user preference or none. If none, it does whatever the catalog feeds back to it, provided the client supports that mode. Otherwise it does what is manually specified. So:
Client (None) -> Follow Catalog Config (Client Only, Client Preferred, Catalog Preferred, or Catalog Only); fail if the client doesn't support the config.
A user without a preference, or who wants the catalog to make the determination, just leaves this unset on the client. A client that wants to override the catalog can set a specific mode and fail fast if the catalog doesn't support it.
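The model described in this comment can be sketched roughly as follows. This is only an illustration under stated assumptions: `PlanningMode`, `ClientOverrideResolver`, and the two "supported" sets are hypothetical names, not part of the Iceberg API or of this PR, and how a catalog would advertise which modes it supports is not specified in the thread.

```java
import java.util.Optional;
import java.util.Set;

// Hypothetical enum for illustration; not the enum introduced by the PR.
enum PlanningMode {
  CLIENT_ONLY,
  CLIENT_PREFERRED,
  CATALOG_PREFERRED,
  CATALOG_ONLY
}

// Hypothetical helper: the client is configured independently of the catalog.
final class ClientOverrideResolver {
  private ClientOverrideResolver() {}

  /**
   * @param clientOverride mode set explicitly by the user, or empty if left unset
   * @param catalogMode mode reported by the catalog
   * @param clientSupported modes this client is able to execute (assumption)
   * @param catalogSupported modes the catalog is willing to honor (assumption)
   */
  static PlanningMode resolve(
      Optional<PlanningMode> clientOverride,
      PlanningMode catalogMode,
      Set<PlanningMode> clientSupported,
      Set<PlanningMode> catalogSupported) {
    if (clientOverride.isPresent()) {
      // An explicit client setting overrides the catalog, but fails fast
      // when the catalog cannot honor it.
      PlanningMode requested = clientOverride.get();
      if (!catalogSupported.contains(requested)) {
        throw new IllegalStateException("Catalog does not support mode: " + requested);
      }
      return requested;
    }
    // No client preference: follow the catalog, failing if the client
    // cannot execute the mode the catalog asks for.
    if (!clientSupported.contains(catalogMode)) {
      throw new IllegalStateException("Client does not support mode: " + catalogMode);
    }
    return catalogMode;
  }
}
```

With the client setting left unset, `resolve` simply returns whatever the catalog reports; an explicit setting wins unless the catalog cannot honor it, in which case the call throws instead of silently falling back.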
core/src/main/java/org/apache/iceberg/rest/RESTCatalogProperties.java
 * <p>Values:
 *
 * <ul>
 * <li>CLIENT_ONLY - MUST use client-side planning. Fails if paired with CATALOG_ONLY from other
I think using ENFORCED might be a better fit instead of ONLY, wdyt? That explains the intent more naturally
You mean CLIENT_ENFORCED | CATALOG_ENFORCED? I believe it does sound more authoritative; since we are including this in the spec, this might be the language we prefer. Let me think a bit more on this.
I would actually prefer that we just remove it and have client, client-preferred, catalog-preferred, and catalog.
Using words like ENFORCED or REQUIRED doesn't quite feel right and ultimately, if we're going with this enumeration, it is explicit.
I'm still a bit skeptical that we even need the notion of preferences, e.g. client-preferred, catalog-preferred. It's plausible that servers could have more insight and give a more intelligent preference, but it feels overcomplicated compared to just having a "clients-choice" (not a real mode, just something that's inferred when the endpoint is supported but not required) instead of 2 preferences. It simplifies the decision matrix logic below, and clients can then use their own heuristics.
I think the decision about whether preferences make sense comes down to this:
is it better to have clients just make intelligent choices when server side planning is available but not required, or is it better for servers to indicate preferences. My thought process is if a server really feels like it's advantageous to do remote planning, may as well just send it back as required.
I should note: I'm not super opinionated on this, but I do think it'd be great if we could outline some concrete cases where we think a preference is advantageous (in both directions) just to make it clear if the complexity is worth it.
> is it better to have clients just make intelligent choices when server side planning is available but not required, or is it better for servers to indicate preferences. My thought process is if a server really feels like it's advantageous to do remote planning, may as well just send it back as required
This is mostly from the POV that it depends on the load each side has at the moment the call is made. For example, take the following cases:
- I am using py-iceberg and I know I am low on resources, so it is better to do remote planning if possible when the table is big. py-iceberg can say it prefers planning by the catalog, and the server, based on catalog_only / catalog_preferred, can have that negotiation.
- Let's say I am Spark and I have big compute infra, but, based on the current workload:
  - in an environment with a lot of concurrent queries, I will not have a lot of memory available to plan this, so I would start by saying I prefer catalog
  - with a dedicated cluster, rather than doing a remote plan I would plan in my JVM, so I would say client_only from the client side
Server Side
- If the server is loaded and the client is open to planning on the client end, it is better for the server to say: I am burdened / low on resources, are you open to planning on the client end? Hence the soft signal client_preferred. The server has no clue what the client is; it sends this decision purely based on its own state. Sending client_only would have caused trouble for clients like py-iceberg in case they are configured to catalog_only.
Please let me know what you think of these cases.
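As a rough illustration of the cases above, each side could derive the mode it sends from its own current state. Everything in this sketch is hypothetical: the class name, the boolean inputs, and the idea that a lightly loaded server sends no preference are assumptions made for the example, not behavior defined by the PR or the spec. Only the mode strings come from the thread.

```java
import java.util.Optional;

// Hypothetical heuristics; inputs and names are illustrative only.
final class PlanningPreferenceExamples {
  private PlanningPreferenceExamples() {}

  /** Mode a client might configure, given its own resources. */
  static String clientPreference(boolean resourceConstrained, boolean dedicatedCluster) {
    if (resourceConstrained) {
      // e.g. py-iceberg on a small machine, or a Spark driver serving
      // many concurrent queries: ask the catalog to plan if it can.
      return "catalog-preferred";
    }
    if (dedicatedCluster) {
      // Plenty of local compute: plan in the client's own JVM.
      return "client-only";
    }
    return "client-preferred";
  }

  /** Soft signal a server might send, given only its own load. */
  static Optional<String> serverPreference(boolean underHeavyLoad) {
    // The server does not know what the client is, so when it is burdened
    // it only hints (client-preferred) rather than forcing client-only.
    return underHeavyLoad ? Optional.of("client-preferred") : Optional.empty();
  }
}
```

The point of the soft signal is visible in `serverPreference`: a hard `client-only` from an overloaded server would conflict with a client configured as `catalog-only`, whereas `client-preferred` leaves room for the other side's requirement to win.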
// Negotiation rules: ONLY beats PREFERRED, both PREFERRED = client wins
// Default when neither client nor server provides: client-preferred
public static final String SCAN_PLANNING_MODE = "scan-planning-mode";
public static final String SCAN_PLANNING_MODE_DEFAULT =
I think it would make sense to split out introducing the different planning modes from the option of overriding this at the table level
Is this more from a backward compatibility POV? Asking because we haven't shipped any Iceberg Java version with this config yet.
 * <li><b>Neither configured</b>: Use default (CLIENT_PREFERRED)
 * </ul>
 */
public class ScanPlanningNegotiator {
I'm not really sure we need a whole separate "negotiator" class; it feels a bit over the top, and it's not really a negotiation imo. We're using a defined set of rules to determine the planning mode. Have we considered just having a static ScanPlanningMode#determinePlanningDecision?
- **Both PREFERRED**: When both are PREFERRED (different types), client config wins
- **Both same**: When both have the same value, use that planning type
- **Only one configured**: Use the configured side (client or server)
- **Neither configured**: Use default (`client-preferred`)
I have a concern about some catalogs starting to make every table CATALOG_ONLY, which would essentially lock users to the catalog without providing a way to migrate the data to another catalog.
Maybe we add a sentence to the spec requiring that there be some users for whom the catalog MUST provide access to the metadata files.
WDYT?
> the catalog without providing a way to migrate the data to another catalog
We would still have a way to migrate, mostly because loadTable gives back the metadata.json pointer (which self-describes the table state), and the catalog ADMIN would be able to use that pointer to register the table with another REST or Metastore-backed catalog. In the model where storage is decoupled from compute, it is the administrator of the catalog who has given the catalog access to vend storage creds, and they can very well take it back.
This feature is mostly: I want to read the table, can you help me with the data | delete files that correspond to the table? Nevertheless, I believe CATALOG_ONLY would be used primarily for governance cases, and also for things like scanning huge tables where planning can put a lot of pressure on the JVM (Trino coordinator instability | Spark requiring distributed planning), where the catalog can do some efficient indexing (with something like Redis) to help these engines.
All in all, IMHO exposing this option would not lead to vendor lock-in or an inability to migrate; please let me know if I am missing something.
About the change
Scan Planning Modes
Single enum ScanPlanningMode with 4 values:
- client-only - MUST use client-side planning
- client-preferred (default) - Prefer client-side, but flexible
- catalog-preferred - Prefer server-side, but flexible (fallback to client if unavailable)
- catalog-only - MUST use server-side planning
Negotiation Logic
When both client and server configure scan-planning-mode:
ML : https://lists.apache.org/thread/z1g4y8b4ogdrn0jjtjlgg7yjgxdbzpvg
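The negotiation rules quoted in this thread (ONLY beats PREFERRED, both PREFERRED means the client wins, same value is used as-is, one side configured means that side is used, neither configured falls back to `client-preferred`, and CLIENT_ONLY paired with CATALOG_ONLY fails) can be sketched as a single static method on an enum, in the spirit of the review suggestion. This is not the PR's implementation: `ScanMode`, `fromConfig`, and `determine` are hypothetical names, and using `null` to mean "not configured" is an assumption made to keep the example short.

```java
// Hypothetical enum for illustration; the PR's own type is ScanPlanningMode.
enum ScanMode {
  CLIENT_ONLY("client-only"),
  CLIENT_PREFERRED("client-preferred"),
  CATALOG_PREFERRED("catalog-preferred"),
  CATALOG_ONLY("catalog-only");

  private final String configValue;

  ScanMode(String configValue) {
    this.configValue = configValue;
  }

  String configValue() {
    return configValue;
  }

  /** True for the two MUST modes. */
  boolean isStrict() {
    return this == CLIENT_ONLY || this == CATALOG_ONLY;
  }

  /** Parses a scan-planning-mode property value such as "catalog-only". */
  static ScanMode fromConfig(String value) {
    for (ScanMode mode : values()) {
      if (mode.configValue.equalsIgnoreCase(value)) {
        return mode;
      }
    }
    throw new IllegalArgumentException("Unknown scan-planning-mode: " + value);
  }

  /** Either argument may be null, meaning that side did not configure a mode. */
  static ScanMode determine(ScanMode client, ScanMode server) {
    if (client == null && server == null) {
      return CLIENT_PREFERRED; // neither configured: default
    }
    if (client == null) {
      return server; // only the server configured
    }
    if (server == null) {
      return client; // only the client configured
    }
    if (client == server) {
      return client; // both same
    }
    if (client.isStrict() && server.isStrict()) {
      // CLIENT_ONLY paired with CATALOG_ONLY cannot be reconciled.
      throw new IllegalStateException(
          "Conflicting modes: client=" + client.configValue + ", server=" + server.configValue);
    }
    if (client.isStrict()) {
      return client; // ONLY beats PREFERRED
    }
    if (server.isStrict()) {
      return server; // ONLY beats PREFERRED
    }
    return client; // both PREFERRED with different values: client wins
  }
}
```

For example, `ScanMode.determine(ScanMode.CLIENT_PREFERRED, ScanMode.CATALOG_ONLY)` returns `CATALOG_ONLY`, while `ScanMode.determine(ScanMode.CLIENT_ONLY, ScanMode.CATALOG_ONLY)` throws. Keeping the rules in one method on the enum avoids a separate negotiator class, which is the trade-off raised in the review.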