-
Notifications
You must be signed in to change notification settings - Fork 0
Rename fields in CaseRecord for consistency with langfuse evaluators
#38
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
Renames CaseRecord fields to align with langfuse evaluator conventions (input, expected_output, output) and updates agent + tests accordingly.
Changes:
- Renamed
CaseRecordfields (case→input,groundtruth→expected_output,analysis→output). - Updated AML investigation agent logic to read/write/analyze using the new field names.
- Updated AML investigation test cases to assert against the renamed fields.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| implementations/aml_investigation/agent.py | Updates analysis pipeline, resume logic, and metrics computation to use input/expected_output/output. |
| aieng-eval-agents/aieng/agent_evals/aml_investigation/data/cases.py | Renames the CaseRecord schema and updates case-building helpers accordingly. |
| aieng-eval-agents/tests/aieng/agent_evals/aml_investigation/data/test_cases.py | Updates tests to reference the renamed CaseRecord fields. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| input: CaseFile = Field(..., description="Metadata for the laundering case.") | ||
| expected_output: GroundTruth = Field(..., description="Ground truth information for the laundering case.") | ||
| output: AnalystOutput | None = Field( | ||
| default=None, | ||
| description="Optional analyst output for the laundering case. Typically populated after investigation.", | ||
| ) |
Copilot
AI
Feb 6, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Renaming these serialized fields will break parsing of any existing JSON/JSONL persisted with the old keys (case, groundtruth, analysis). Since the agent reads prior results via CaseRecord.model_validate_json(...) for resume behavior, legacy lines will fail validation and be skipped. Consider adding Pydantic v2 compatibility via validation_alias (e.g., AliasChoices('input', 'case'), AliasChoices('expected_output', 'groundtruth'), AliasChoices('output', 'analysis')) so existing artifacts remain readable while emitting the new field names.
| input_records = _load_records(input_path) | ||
| existing_results = {record.case.case_id: record for record in _load_records(output_path)} | ||
| to_run = [r for r in input_records if existing_results.get(r.case.case_id, r).analysis is None] | ||
| existing_results = {record.input.case_id: record for record in _load_records(output_path)} |
Copilot
AI
Feb 6, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With the CaseRecord field rename, _load_records(output_path) will skip any legacy JSONL rows (old schema) as invalid, which can cause existing_results to be incomplete and make the resume logic re-run cases unnecessarily (and potentially overwrite/duplicate outputs). After adding schema aliases (or a migration path), it would also help to make legacy parsing failures more explicit here (e.g., fail fast or log a clear warning/count) so users don’t silently lose resume behavior.
| existing_results = {record.input.case_id: record for record in _load_records(output_path)} | |
| # Load existing results from the output file, and detect any lines that could not be parsed | |
| existing_records = list(_load_records(output_path)) | |
| existing_results = {record.input.case_id: record for record in existing_records} | |
| # Compare parsed records against total lines to surface potential legacy/invalid rows | |
| if output_path.exists(): | |
| try: | |
| with output_path.open("r", encoding="utf-8") as f: | |
| total_lines = sum(1 for _ in f) | |
| except OSError: | |
| total_lines = None | |
| if total_lines is not None and total_lines > len(existing_records): | |
| logger.warning( | |
| "Detected %d/%d records in %s that could not be parsed. " | |
| "These may be legacy or invalid rows, and resume behavior may be affected.", | |
| total_lines - len(existing_records), | |
| total_lines, | |
| output_path, | |
| ) |
amrit110
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems fine!
Summary
Rename fields in
CaseRecordfor consistency with langfuse evaluatorsClickup Ticket(s): N/A
Type of Change
Changes Made
casetoinput.groundtruthtoexpected_output.analysistooutput.Testing
uv run pytest tests/)uv run mypy <src_dir>)uv run ruff check src_dir/)Manual testing details:
N/A
Screenshots/Recordings
N/A
Related Issues
N/A
Deployment Notes
N/A
Checklist