refactor(cost-model): use linear costing for valueContains #7476
Conversation
Change valueContains costing from multiplied_sizes to
const_above_diagonal with linear_in_x_and_y inner model.
The new model:
- Above diagonal (y > x): constant cost (early exit, returns False)
- Below diagonal (x >= y): intercept + slope1*x + slope2*y
This better reflects the actual complexity bound O(m*log(n/m+1))
which Kenneth MacKenzie showed is bounded above by a linear function.
Changes:
- models.R: Update valueContainsModel to use const_above_diagonal
- SimpleJSON.hs: Add LinearInXAndY constructor for JSON parsing
- utils.js: Add evaluators for linear_in_x_and_y and const_above_diagonal
- builtinCostModel{A,B,C}.json: Regenerated with new model type
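The piecewise model described in this commit message can be sketched as a small evaluator in the style of the utils.js additions. This is an illustrative sketch, not the actual Plutus code: the function names, object shapes, and the concrete numbers in the usage example are assumptions.

```javascript
// Sketch of the two-variable cost model: linear below the diagonal,
// constant above it. Names and shapes are illustrative.
function linearInXAndY({ intercept, slope1, slope2 }) {
  return (x, y) => intercept + slope1 * x + slope2 * y;
}

function constAboveDiagonal({ constant, model }) {
  const below = linearInXAndY(model);
  // Above the diagonal (y > x) containment is impossible, so the builtin
  // exits early and the cost is a constant; below it the cost is linear.
  return (x, y) => (y > x ? constant : below(x, y));
}

// Hypothetical parameters, purely for illustration.
const cost = constAboveDiagonal({
  constant: 100,
  model: { intercept: 5, slope1: 2, slope2: 3 },
});
```

For example, `cost(10, 3)` takes the below-diagonal branch and evaluates to 5 + 2*10 + 3*3 = 34, while `cost(3, 10)` hits the early-exit branch and returns the constant 100.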
Remove const_above_diagonal optimization since benchmark data shows linear behavior regardless of input size relationship. The diagonal optimization assumed quick exit when needle size > haystack size, but benchmark measurements don't reflect this behavior.
Change valueContains benchmark to use ValueTotalSize for both container and contained arguments, enabling meaningful diagonal comparison:
- When contained size > container size, containment is impossible
- This allows the const_above_diagonal optimization in the cost model

Also update the R model to fit const_above_diagonal with proper diagonal filtering for above/below-diagonal data points.
Revert builtinCostModel{A,B,C}.json to master to avoid unnecessary
re-indentation diff. The valueContains cost model will be updated
once new benchmarks complete with ValueTotalSize for both args.
… parameters
- Changed the CPU cost model from a simple linear model to a more complex model with constant and intercept values.
- Updated the type from "multiplied_sizes" to "const_above_diagonal" for better accuracy in cost estimation.
- Adjusted arguments for intercept, slope1, and slope2 to reflect new performance metrics.
zliu41
left a comment
👍 looks quite reasonable, but I'll leave it to @kwxm
This PR replaces #7476
    "model": {
      "arguments": {
        "intercept": 386227,
        "slope1": 1970,
The theoretical linear upper bound of k*(m+n) would have slope1 = slope2, but the two fitted slopes are very different! I think that's OK though: the theoretical bound is very conservative, and this model seems to fit the data very well.
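The contrast drawn above can be made concrete with a quick sketch: the conservative theoretical bound weights both arguments equally, while an independently fitted linear model need not. Here slope1 = 1970 and the intercept 386227 come from the diff excerpt; the slope2 value is a placeholder assumption, since its actual value is truncated above.

```javascript
// Theoretical bound k*(m+n): both arguments get the same weight k.
const theoreticalCost = (k) => (m, n) => k * (m + n);

// Fitted model: intercept and slope1 from the diff; slope2 is a placeholder.
const fittedCost = (x, y) => 386227 + 1970 * x + 100 * y;
```

Under the theoretical bound, swapping m and n never changes the cost; under the fitted model, x and y contribute very differently.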
This looks good, but again I urge you to reduce the number of benchmarking inputs: see this earlier comment. We have over 1000 points here, which makes the benchmark take a very long time (> 90 minutes for valueContains alone) and doesn't contribute anything to the accuracy of the model: you can see this by taking subsets of the existing data and fitting models to them.

Also, the data is very unevenly distributed, which is probably not a good idea. There are 500 inputs with x < 200 and y < 200, so 1/25 of the space of inputs contributes half of the data points, which will bias the model. Something like the below-diagonal part of a 25x25 or 30x30 grid of evenly spaced inputs (plus some points above the diagonal) would probably suffice without losing any precision.
Also, the inputs aren't symmetric: the x values go up to 2047, but the y values only go up to 1000, so we're not always checking cases like a value containing itself, which might take some time. Can you remind me if there's some reason for this? We had earlier discussions about what the worst case would be, but I don't recall the details at the moment.
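The evenly spaced grid suggested above could be generated along these lines. This is only a sketch of the idea: the grid size, maximum input size, number of above-diagonal points, and their placement are all illustrative choices, not values from the PR.

```javascript
// Generate benchmark input sizes: the below-diagonal part of an evenly
// spaced n-by-n grid, plus a few above-diagonal points so the quick-exit
// branch is always measured.
function benchmarkInputs(n, maxSize, aboveCount) {
  const sizes = Array.from({ length: n }, (_, i) => Math.round(((i + 1) * maxSize) / n));
  const points = [];
  for (const x of sizes) {
    for (const y of sizes) {
      if (y <= x) points.push([x, y]); // on or below the diagonal
    }
  }
  // Above-diagonal points (y > x), spread evenly across the x range.
  for (let i = 1; i <= aboveCount; i++) {
    const x = Math.round((i * maxSize) / (2 * aboveCount));
    points.push([x, 2 * x]);
  }
  return points;
}
```

For a 25x25 grid this yields 25*26/2 = 325 below-diagonal points plus the chosen number of above-diagonal points, far fewer than the 1000+ inputs used currently.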
    "exp_mod_cost"                -> ExpModCost <$> parseJSON args
    "literal_in_y_or_linear_in_z" -> LiteralInYOrLinearInZ <$> parseJSON args
    "linear_in_max_yz"            -> LinearInMaxYZ <$> parseJSON args
    "linear_in_x_and_y"           -> LinearInXAndY <$> parseJSON args
We also need a case for LinearInXAndY in the print-cost-model executable here.
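The dispatch being discussed is the Haskell `parseJSON` case table in SimpleJSON.hs (and its counterpart in the print-cost-model executable); the shape of that tag-to-constructor mapping can be mirrored in a small JavaScript sketch. The tag strings come from the diff above, but the object representation here is an assumption for illustration only.

```javascript
// Sketch of the tag dispatch performed by the parser: a "type" tag selects
// the constructor, and the model's arguments are carried along with it.
function parseModel({ type, arguments: args }) {
  switch (type) {
    case "linear_in_max_yz":
      return { tag: "LinearInMaxYZ", ...args };
    case "linear_in_x_and_y":
      return { tag: "LinearInXAndY", ...args };
    default:
      throw new Error(`unknown model type: ${type}`);
  }
}
```

Any consumer of the model (parser, evaluator, or pretty-printer) needs a case for every tag, which is why the new `linear_in_x_and_y` tag must be added in each place.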
    # Above diagonal (y > x): containment impossible, quick False return
    # Use minimum observed time as constant for this case
    above_diag <- filtered %>% filter(y_mem > x_mem)
    constant <- if (nrow(above_diag) > 0) ceiling(mean(above_diag$t)) else min(filtered$t)
I suggest adding some points (maybe 20 or so) above the diagonal in the benchmarking code to ensure that above_diag is always nonempty and then we won't need to have the two ifs in this code. There's redundancy here that can be avoided by making sure that the benchmark results always contain points on both sides of the diagonal. If someone was later to remove the points above the diagonal then we'd get an error here anyway when we get to mean(above_diag), so there's no real reason to have two different cases to deal with the presence or absence of points above the diagonal.
Summary
- Change `valueContains` costing from `multiplied_sizes` to `const_above_diagonal` with a `linear_in_x_and_y` inner model
- Add a `LinearInXAndY` constructor to SimpleJSON.hs for JSON parsing

Motivation
The true complexity of `valueContains` is `O(m*log(n/m+1))`, where n is the container size and m is the contained size. Kenneth MacKenzie showed that this is bounded above by a linear function `(m+n)/k` for some constant k. The new model better reflects this:

    intercept + slope1*x + slope2*y

Changes
- models.R: update `valueContainsModel` to use `const_above_diagonal` with `linear_in_x_and_y`
- SimpleJSON.hs: add the `LinearInXAndY` constructor for JSON parsing
- utils.js: add evaluators for `linear_in_x_and_y` and `const_above_diagonal`
- builtinCostModel{A,B,C}.json: regenerated with the new model type

New Cost Model Structure
Test plan
Closes https://github.com/IntersectMBO/plutus-private/issues/1984