fix(cdk): upgrade unstructured from 0.10.27 to 0.18.18 to fix CVE-2025-64712 by devin-ai-integration[bot] · Pull Request #907 · airbytehq/airbyte-python-cdk

devin-ai-integration · 2026-02-17T15:46:35Z

fix(cdk): upgrade unstructured 0.10.27 → 0.18.18 to fix CVE-2025-64712

Summary

Upgrades the unstructured library to remediate critical path traversal vulnerability GHSA-gm8q-m8mv-jj5m (CVSS 9.8) in partition_msg. Affects file-based connectors: source-gcs, source-azure-blob-storage, source-s3.

The upgrade required adapting unstructured_parser.py to API changes between 0.10.27 and 0.18.18:

Removed dict lookups → New FileType methods: EXT_TO_FILETYPE[ext] → FileType.from_extension(ext), STR_TO_FILETYPE[mime] → FileType.from_mime_type(mime), FILETYPE_TO_MIMETYPE[ft] → ft.mime_type
Renamed parameter: detect_filetype(filename=...) → detect_filetype(file_path=...)
New dependency: Added pi-heif (required at import time by partition_pdf in 0.18.18)
partition_pdf import made optional: In 0.18.18, partition_pdf requires the heavy unstructured_inference package. The import is now wrapped in try/except; availability is checked only when actually processing PDF files. MD/TXT files no longer fail due to missing PDF dependencies.
File type detection reordered: Extension-based detection is now checked before content-based sniffing in _get_filetype. Content sniffing in 0.18.18 can return TXT for plain text content on some Python versions, which would mask the actual file type indicated by the extension (e.g., a .csv file with text content). A None guard was also added for FileType.from_extension() which returns None for unknown extensions (rather than raising), ensuring content-based fallback still works for extensionless files.
Per-filetype availability checks: Removed blanket "all three partitioners must be non-None" guards from _read_file() and _read_file_locally(). Each filetype now checks its own partition function availability, with a clear error message (e.g., PDF parsing requires unstructured_inference).
Exception propagation: Added except RecordParseError: raise before the generic exception handler in _read_file_locally() to ensure parse errors propagate correctly instead of being re-wrapped.

Test mocks updated to patch module-level globals instead of direct unstructured.partition.* paths, and to mock _import_unstructured to avoid pulling in unstructured_inference (heavy ML dep not needed for unit tests).

Scenario test expectation changes: Because unstructured_inference is not installed in the test environment, PDF files now produce _ab_source_file_parse_error records instead of parsed content. DOCX content rendering also changed from "# Content" to "Content" (the new unstructured version strips markdown heading syntax from .docx output).

Resolves https://github.com/airbytehq/oncall/issues/11267:

https://github.com/airbytehq/oncall/issues/11267

Review & Testing Checklist for Human

This is a large version jump (0.10.27 → 0.18.18) with significant API and behavioral changes. Unit tests pass locally, but there are meaningful risks that CI and unit tests alone cannot cover:

Test end-to-end with real files: Run actual file parsing with PDF/DOCX/PPTX/MD files against a real source (e.g., source-s3 or source-gcs) to verify parsing behavior hasn't regressed. Unit tests mock all partition functions, so real parsing with 0.18.18 is completely untested. Critical: DOCX content rendering changed (markdown heading syntax is now stripped), which could affect downstream consumers.
Verify PDF parsing WITH unstructured_inference installed: In production environments where unstructured_inference is installed, PDFs should parse successfully. Test this end-to-end because the test suite now only validates the "missing unstructured_inference" error path, not actual PDF parsing.
Verify file type detection priority change: The reordering of extension check before content sniffing could affect edge cases. Test files where extension doesn't match content (e.g., a .txt file containing CSV data) to ensure detection still works as expected.
Check for other breaking changes: Review unstructured changelog between 0.10.27 and 0.18.18 for any other API or behavioral changes that could affect file parsing (e.g., changes to markdown rendering, element extraction, etc.)
Verify CI passes on all platforms: Prior attempt to bump to 0.18.15 (airbytehq/airbyte-python-cdk#767) had 12 CI failures. Ensure all CI checks pass, especially on different platforms (pi-heif is a native dependency)

Notes

Mypy: Added cast(IO[bytes], file) to resolve type incompatibility with detect_filetype. All mypy errors resolved.
Test coverage gaps:
- By mocking _import_unstructured, tests no longer verify the lazy import mechanism works correctly with the new version or that partition functions are actually callable.
- The corrupted_file_scenario no longer tests actual PDF corruption handling - it now tests the "missing unstructured_inference" error path, same as all other PDF tests.
- PDF parsing is not validated at all in the test suite (only the error path when unstructured_inference is missing).
New transitive dependencies: poetry.lock shows many new packages (aiofiles, html5lib, nest-asyncio, olefile, unstructured-client, webencodings, eval-type-backport). These increase the dependency surface area.
Removed dependency: chardet was removed from the lock file (previously a transitive dep).
CI note: One flaky test failure in test_concurrent_declarative_source.py (unrelated to unstructured changes) - passes locally on both main and PR branch. All required checks pass.

Requested by Danylo Jablonski (@DanyloGL) via /ai-fix command
Devin session: https://app.devin.ai/sessions/321f6bedb61f4d809581e073e864a026

…5-64712 Upgrades the unstructured library to address critical path traversal vulnerability GHSA-gm8q-m8mv-jj5m (CVSS 9.8) in partition_msg. Changes: - Update unstructured dependency from 0.10.27 to 0.18.18 - Add pi-heif dependency required by new unstructured version - Adapt unstructured_parser.py to new API: - Replace removed EXT_TO_FILETYPE/STR_TO_FILETYPE/FILETYPE_TO_MIMETYPE with FileType.from_extension()/from_mime_type()/mime_type property - Update detect_filetype() parameter from filename= to file_path= - Update test mocks to match new API surface Co-Authored-By: unknown <>

devin-ai-integration · 2026-02-17T15:46:40Z

🤖 Devin AI Engineer

I'll be helping with this pull request! Here's what you should know:

✅ I will automatically:

Address comments on this PR. Add '(aside)' to your comment to have me ignore it.
Look at CI failures and help fix them

Note: I can only respond to comments from users who have write access to this repository.

⚙️ Control Options:

Disable automatic comment and CI monitoring

github-actions · 2026-02-17T15:46:49Z

👋 Greetings, Airbyte Team Member!

Here are some helpful tips and reminders for your convenience.

💡 Show Tips and Tricks

Testing This CDK Version

You can test this version of the CDK using the following:

# Run the CLI from this branch:
uvx 'git+https://github.com/airbytehq/airbyte-python-cdk.git@devin/1771342600-bump-unstructured-0.18.18#egg=airbyte-python-cdk[dev]' --help

# Update a connector to use the CDK from this branch ref:
cd airbyte-integrations/connectors/source-example
poe use-cdk-branch devin/1771342600-bump-unstructured-0.18.18

PR Slash Commands

Airbyte Maintainers can execute the following slash commands on your PR:

/autofix - Fixes most formatting and linting issues
/poetry-lock - Updates poetry.lock file
/test - Runs connector tests with the updated CDK
/prerelease - Triggers a prerelease publish with default arguments
/poe build - Regenerate git-committed build artifacts, such as the pydantic models which are generated from the manifest JSON schema in YAML.
/poe <command> - Runs any poe command in the CDK environment

📚 Show Repo Guidance

Helpful Resources

CDK API Reference

📝 Edit this welcome message.

…etype Co-Authored-By: unknown <>

github-actions · 2026-02-17T15:57:09Z

PyTest Results (Fast)

590 tests - 3 279 578 ✅ - 3 279 3m 15s ⏱️ - 3m 23s
1 suites ± 0 11 💤 - 1
1 files ± 0 1 ❌ + 1

For more details on these failures, see this check.

Results for commit 76c7faf. ± Comparison against base commit e9144e2.

This pull request removes 3279 tests.

unit_tests.sources.declarative.async_job.test_integration.JobDeclarativeStreamTest ‑ test_when_read_then_call_stream_slices_only_once
unit_tests.sources.declarative.async_job.test_integration.JobDeclarativeStreamTest ‑ test_when_read_then_return_records_from_repository
unit_tests.sources.declarative.async_job.test_job.AsyncJobTest ‑ test_given_status_is_terminal_when_update_status_then_stop_timer
unit_tests.sources.declarative.async_job.test_job.AsyncJobTest ‑ test_given_timer_is_not_out_when_status_then_return_actual_status
unit_tests.sources.declarative.async_job.test_job.AsyncJobTest ‑ test_given_timer_is_out_when_status_then_return_timed_out
unit_tests.sources.declarative.async_job.test_job_orchestrator.AsyncJobOrchestratorTest ‑ test_given_exception_on_single_job_when_create_and_get_completed_partitions_then_return
unit_tests.sources.declarative.async_job.test_job_orchestrator.AsyncJobOrchestratorTest ‑ test_given_exception_to_break_when_start_job_and_raise_this_exception_and_abort_jobs
unit_tests.sources.declarative.async_job.test_job_orchestrator.AsyncJobOrchestratorTest ‑ test_given_exception_when_start_job_and_skip_this_exception
unit_tests.sources.declarative.async_job.test_job_orchestrator.AsyncJobOrchestratorTest ‑ test_given_failure_when_create_and_get_completed_partitions_then_raise_exception
unit_tests.sources.declarative.async_job.test_job_orchestrator.AsyncJobOrchestratorTest ‑ test_given_jobs_failed_more_than_max_attempts_when_create_and_get_completed_partitions_then_free_job_budget
…

♻️ This comment has been updated with latest results.

…et_filetype Move extension check before content-based detection to ensure deterministic behavior across Python versions. Content-based detection in unstructured 0.18.18 may return TXT for plain text content, which could mask the actual file type indicated by the extension. Co-Authored-By: unknown <>

….18.18 partition_pdf now requires unstructured_inference package which may not be installed. Make the import optional and check availability only when actually processing PDF files. MD/TXT files don't need partition functions and should not fail due to missing PDF dependencies. Co-Authored-By: unknown <>

Co-Authored-By: unknown <>

airbyte_cdk/sources/file_based/file_types/unstructured_parser.py

Co-Authored-By: unknown <>

github-actions · 2026-02-17T16:29:57Z

PyTest Results (Full)

3 872 tests ±0 3 860 ✅ ±0 11m 24s ⏱️ +7s
1 suites ±0 12 💤 ±0
1 files ±0 0 ❌ ±0

Results for commit 76c7faf. ± Comparison against base commit e9144e2.

♻️ This comment has been updated with latest results.

…uctured scenarios Co-Authored-By: unknown <>

fix: resolve mypy error by casting IOBase to IO[bytes] for detect_fil…

5594c29

…etype Co-Authored-By: unknown <>

devin-ai-integration bot added 3 commits February 17, 2026 16:01

style: apply ruff formatting to unstructured_parser.py

b3389f7

Co-Authored-By: unknown <>

github-code-quality bot found potential problems Feb 17, 2026

View reviewed changes

airbyte_cdk/sources/file_based/file_types/unstructured_parser.py Fixed Show fixed Hide fixed

fix: add logging to empty except block for partition_pdf import failure

e011d9b

Co-Authored-By: unknown <>

fix: handle unstructured FileType.from_extension(None) + update unstr…

76c7faf

…uctured scenarios Co-Authored-By: unknown <>

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Comments

fix(cdk): upgrade unstructured from 0.10.27 to 0.18.18 to fix CVE-2025-64712#907

fix(cdk): upgrade unstructured from 0.10.27 to 0.18.18 to fix CVE-2025-64712#907
devin-ai-integration[bot] wants to merge 7 commits intomainfrom
devin/1771342600-bump-unstructured-0.18.18

devin-ai-integration bot commented Feb 17, 2026 •

edited

Loading

Uh oh!

devin-ai-integration bot commented Feb 17, 2026

Uh oh!

github-actions bot commented Feb 17, 2026

Testing This CDK Version

PR Slash Commands

Helpful Resources

Uh oh!

github-actions bot commented Feb 17, 2026 •

edited

Loading

Uh oh!

Uh oh!

github-actions bot commented Feb 17, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

0 participants

Comments

Conversation

devin-ai-integration bot commented Feb 17, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

fix(cdk): upgrade unstructured 0.10.27 → 0.18.18 to fix CVE-2025-64712

Summary

Review & Testing Checklist for Human

Notes

Uh oh!

devin-ai-integration bot commented Feb 17, 2026

🤖 Devin AI Engineer

Uh oh!

github-actions bot commented Feb 17, 2026

👋 Greetings, Airbyte Team Member!

Testing This CDK Version

PR Slash Commands

Helpful Resources

Uh oh!

github-actions bot commented Feb 17, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

PyTest Results (Fast)

Uh oh!

Uh oh!

github-actions bot commented Feb 17, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

PyTest Results (Full)

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

0 participants

devin-ai-integration bot commented Feb 17, 2026 •

edited

Loading

github-actions bot commented Feb 17, 2026 •

edited

Loading

github-actions bot commented Feb 17, 2026 •

edited

Loading