add GHArchive random access benchmark #6372
base: develop
Conversation
Signed-off-by: Andrew Duffy <andrew@a10y.dev>
Force-pushed from 5d03562 to b99febe.
Pull request overview
Adds a new GHArchive-backed random access benchmark path, including nested field projection support for both Parquet (Arrow) and Vortex readers.
Changes:
- Introduces a GHArchive dataset module that downloads/converts GitHub Archive events into Parquet/Vortex for benchmarking.
- Adds projecting random-access traits/accessors and Parquet/Vortex implementations to benchmark nested field reads.
- Extends the random-access benchmark CLI to select between Taxi and GHArchive datasets and emits separate benchmark IDs.
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 7 comments.
| File | Description |
|---|---|
| vortex-bench/src/random_access/take.rs | Adds projecting accessors and Parquet/Vortex “take with projection” implementations. |
| vortex-bench/src/random_access/mod.rs | Exposes new projecting accessor types and defines the FieldPath + ProjectingRandomAccessor trait (sketched below). |
| vortex-bench/src/datasets/mod.rs | Registers the new gharchive dataset module. |
| vortex-bench/src/datasets/gharchive.rs | Implements GHArchive dataset download/conversion and exposes useful nested-field paths. |
| benchmarks/random-access-bench/src/main.rs | Adds dataset selection and a projected-random-access benchmark path for GHArchive. |
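Before the inline comments, a rough sketch of the projecting-accessor surface these files introduce. This is a hypothetical reconstruction from the summary above: the names FieldPath and ProjectingRandomAccessor appear in the PR summary, but the signatures, async_trait usage, and anyhow error type are guesses, not the PR's actual code.

```rust
// Hypothetical sketch only: FieldPath / ProjectingRandomAccessor are named in the
// PR summary; the signatures and field layout here are guesses, not the PR's code.
use arrow::record_batch::RecordBatch;
use async_trait::async_trait;

/// A path to a (possibly nested) field, e.g. ["actor", "login"].
pub struct FieldPath(pub Vec<String>);

#[async_trait]
pub trait ProjectingRandomAccessor {
    /// Fetch the rows at `indices`, decoding only the column named by `path`.
    async fn take_projected(
        &self,
        indices: &[u64],
        path: &FieldPath,
    ) -> anyhow::Result<RecordBatch>;
}
```

Per the file table, the Parquet and Vortex accessors in take.rs would then implement this trait; that is the code the review comments below examine.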
```rust
let row_group_indices = row_groups
    .keys()
    .sorted()
    .map(|i| row_groups[i].clone())
    .collect_vec();

let reader = builder
    .with_projection(projection_mask)
    .with_row_groups(row_groups.keys().copied().collect_vec())
```
Copilot AI commented on Feb 9, 2026:
`row_group_indices` is built from sorted `row_groups.keys()`, but `with_row_groups(row_groups.keys().copied().collect_vec())` passes row groups in HashMap iteration order. The stream yields batches in the order requested, so indexing `row_group_indices[idx]` can apply the wrong per-row-group indices (or panic). Build a single `sorted_row_groups` vec once and use it both to construct `row_group_indices` and to pass into `with_row_groups(...)`.
Suggested change:

```rust
// Build a single, consistently ordered list of row group indices.
let sorted_row_groups = row_groups.keys().copied().sorted().collect_vec();
let row_group_indices = sorted_row_groups
    .iter()
    .map(|i| row_groups[i].clone())
    .collect_vec();

let reader = builder
    .with_projection(projection_mask)
    .with_row_groups(sorted_row_groups)
```
```rust
for idx in indices {
    let row_group_idx = row_group_offsets
        .binary_search(&(idx as i64))
        .unwrap_or_else(|e| e - 1);
    row_groups
        .entry(row_group_idx)
        .or_insert_with(Vec::new)
        .push((idx as i64) - row_group_offsets[row_group_idx]);
```
Copilot AI commented on Feb 9, 2026:
`row_group_offsets` includes the final total row count, so an index equal to (or greater than) the total number of rows can produce `row_group_idx == row_group_count`, which is not a valid row group index. This will later lead to out-of-bounds math and/or an invalid row group request. Consider validating that all indices are `< total_rows` up front (and returning a clear error) before computing row group selections.
Suggested change:

```rust
// The last entry is the total number of rows across all row groups.
let total_rows = *row_group_offsets.last().unwrap_or(&0);
for idx in indices {
    let idx_i64 = idx as i64;
    // Skip indices that are out of bounds for the file to avoid invalid row group indices.
    if idx_i64 < 0 || idx_i64 >= total_rows {
        continue;
    }
    let row_group_idx = row_group_offsets
        .binary_search(&idx_i64)
        .unwrap_or_else(|e| e - 1);
    row_groups
        .entry(row_group_idx)
        .or_insert_with(Vec::new)
        .push(idx_i64 - row_group_offsets[row_group_idx]);
```
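To see the boundary behavior concretely: `binary_search` over cumulative row-group offsets returns `Ok(i)` when the index sits exactly on a group boundary and `Err(insertion_point)` when it falls inside a group, so `unwrap_or_else(|e| e - 1)` picks the containing group, and an index equal to the total row count lands on the sentinel offset. A standalone illustration with made-up offsets:

```rust
/// Map a global row index to the row group that contains it.
fn row_group_for(offsets: &[i64], idx: i64) -> usize {
    // Ok(i): idx is the first row of group i. Err(i): idx falls in group i - 1.
    offsets.binary_search(&idx).unwrap_or_else(|e| e - 1)
}

fn main() {
    // Three row groups of 100, 150, and 150 rows; the final offset is the total.
    let offsets = vec![0, 100, 250, 400];
    assert_eq!(row_group_for(&offsets, 0), 0);
    assert_eq!(row_group_for(&offsets, 99), 0);
    assert_eq!(row_group_for(&offsets, 100), 1);
    assert_eq!(row_group_for(&offsets, 399), 2);
    // The flagged bug: an index equal to the total row count maps to
    // "group 3", which does not exist.
    assert_eq!(row_group_for(&offsets, 400), 3);
}
```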
```rust
.enumerate()
.map(|(idx, batch)| {
    let batch = batch.unwrap();
    let indices = PrimitiveArray::<Int64Type>::from(row_group_indices[idx].clone());
    take_record_batch(&batch, &indices).unwrap()
})
```
Copilot AI commented on Feb 9, 2026:
This stream processing uses `unwrap()` for both the Parquet batch and `take_record_batch(...)`. Any IO/decoding error or invalid index will panic the entire benchmark run. Prefer propagating these as `anyhow::Result` (e.g., map to `Result<RecordBatch>` and `try_collect`) so failures surface as errors rather than panics.
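A minimal sketch of that shape, assuming the stream yields `parquet::errors::Result<RecordBatch>` and `row_group_indices` is the per-group index list built earlier; the function boundary and error messages are illustrative, not the PR's code:

```rust
// Illustrative only: propagates decode/take errors instead of panicking.
use anyhow::Context;
use arrow::array::PrimitiveArray;
use arrow::compute::take_record_batch;
use arrow::datatypes::Int64Type;
use arrow::record_batch::RecordBatch;
use futures::{Stream, StreamExt, TryStreamExt};

async fn take_batches(
    stream: impl Stream<Item = parquet::errors::Result<RecordBatch>>,
    row_group_indices: &[Vec<i64>],
) -> anyhow::Result<Vec<RecordBatch>> {
    stream
        .enumerate()
        .map(|(idx, batch)| -> anyhow::Result<RecordBatch> {
            let batch = batch.context("failed to decode Parquet record batch")?;
            let indices = PrimitiveArray::<Int64Type>::from(row_group_indices[idx].clone());
            take_record_batch(&batch, &indices).context("take with out-of-range indices")
        })
        .try_collect() // surfaces the first error instead of panicking mid-run
        .await
}
```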
```rust
}
Dataset::GhArchive => {
    // Run gharchive benchmark with nested field projection.
    // The field path is payload.ref - a deeply nested string field.
```
Copilot AI commented on Feb 9, 2026:
The comment says the field path is `payload.ref`, but the code actually benchmarks `actor.login`. Please update the comment (or change the field path) so the benchmark output/intent is clear and doesn't drift from the implementation.
Suggested change:

```rust
// The field path is actor.login - a deeply nested string field.
```
```rust
// For now, use the same path as OnDiskVortex (compact not yet implemented for gharchive)
let path = gharchive_vortex().await?;
Ok(Box::new(VortexProjectingAccessor::compact(path)))
```
Copilot AI commented on Feb 9, 2026:
`Format::VortexCompact` currently uses `gharchive_vortex()` (non-compact) while labeling the accessor/target as compact via `VortexProjectingAccessor::compact(...)`. This will misreport results and can confuse comparisons. Either use `gharchive_vortex_compact()` here, or return `unimplemented!`/an error until a compact dataset path is available.
Suggested change:

```rust
unimplemented!("Compact gharchive dataset path is not yet implemented");
```
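Alternatively, once a compact conversion lands, the arm could be wired as below; `gharchive_vortex_compact()` is the not-yet-existing helper named in the comment above, shown here purely as a sketch:

```rust
// Hypothetical: gharchive_vortex_compact() is the not-yet-written counterpart
// of gharchive_vortex(), producing a compact Vortex file for the dataset.
Format::VortexCompact => {
    let path = gharchive_vortex_compact().await?;
    Ok(Box::new(VortexProjectingAccessor::compact(path)))
}
```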
The CI bot also posted benchmark comments on this PR, each with a collapsed summary and detailed results table: PolarSignals Profiling, TPC-H SF=1 on NVMe, FineWeb on NVMe, Random Access, TPC-H SF=1 on S3, TPC-DS SF=1 on NVMe, TPC-H SF=10 on NVMe, Statistical and Population Genetics, Clickbench on NVMe, and Compression. One SQL benchmark run reported a failure.