fix: enhance sanitizeDocuments to remove problematic characters for Postgresql #5755

Open

prd-hoang-doan wants to merge 1 commit into FlowiseAI:main from prd-hoang-doan:fix/enhance_sanitize_docs_posgresql

Conversation

@prd-hoang-doan (Contributor)

Fix: Enhance sanitizeDocuments to Remove Problematic Characters for PostgreSQL Compatibility

Summary

Enhances the sanitizeDocuments method in the TypeORM PostgreSQL driver to remove additional problematic characters that can cause errors when storing documents in PostgreSQL.

Ticket: #5611

Problem

The previous implementation removed only NULL characters (\0), but PostgreSQL and downstream document processing can also trip over other problematic characters, particularly those introduced by PDF extraction:

  • Soft hyphens (U+00AD) - common in PDF extraction
  • Control characters (0x00-0x1F) except tab, newline, and carriage return
  • DEL character (0x7F)
  • Zero-width and invisible Unicode characters (U+200B-U+200D, U+FEFF)
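
Taken together, the sanitization described above strips each of these character classes while preserving tab, newline, and carriage return. A minimal standalone sketch of that logic (illustrative only; `sanitizePageContent` is a hypothetical helper name, not the actual driver method):

```typescript
// Hypothetical sketch of the character stripping described above.
// Not the actual TypeORM driver code; the function name is illustrative.
function sanitizePageContent(content: string): string {
    return content
        // soft hyphens (U+00AD), common in PDF extraction
        .replace(/\u00AD/g, '')
        // control characters 0x00-0x1F (except \t, \n, \r) and DEL (0x7F)
        // eslint-disable-next-line no-control-regex
        .replace(/[\x00-\x08\x0B\x0C\x0E-\x1F\x7F]/g, '')
        // zero-width and invisible Unicode characters
        .replace(/[\u200B-\u200D\uFEFF]/g, '')
}

console.log(sanitizePageContent('soft\u00ADhyphen, NUL\u0000byte, zero\u200Bwidth'))
// → 'softhyphen, NULbyte, zerowidth'
```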

Test Coverage:

[Screenshot: unit test results demonstrating coverage]

@gemini-code-assist (Contributor)

Summary of Changes

Hello @prd-hoang-doan, I'm Gemini Code Assist [1]! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request improves the robustness of document storage in PostgreSQL by expanding the character sanitization logic within the TypeORM driver. It proactively filters out various non-printable and problematic Unicode characters that commonly lead to database errors, especially when processing text from diverse sources like PDFs. This ensures that document content is clean and compatible before being persisted, preventing data integrity issues and enhancing system stability.

Highlights

  • Enhanced Document Sanitization: The sanitizeDocuments method in the TypeORM PostgreSQL driver has been significantly enhanced to remove a wider range of problematic characters that can cause errors in PostgreSQL.
  • Expanded Character Removal: Beyond NULL characters, the method now targets soft hyphens (U+00AD), most control characters (0x00-0x1F, excluding tab, newline, carriage return), the DEL character (0x7F), and zero-width/invisible Unicode characters (U+200B-U+200D, U+FEFF).
  • Improved PostgreSQL Compatibility: This change specifically addresses issues encountered when storing documents, particularly those extracted from PDFs, ensuring better compatibility and preventing database errors.
  • Comprehensive Test Coverage: New unit tests have been added to thoroughly validate the removal of each newly targeted problematic character, as well as to ensure preservation of valid characters and metadata.
Changelog
  • packages/components/nodes/vectorstores/Postgres/driver/TypeORM.test.ts
    • Added comprehensive unit tests for the sanitizeDocuments method, covering various problematic characters like soft hyphens, control characters, DEL, and zero-width spaces, and ensuring correct preservation of valid characters and metadata.
  • packages/components/nodes/vectorstores/Postgres/driver/TypeORM.ts
    • Modified the sanitizeDocuments method to include removal of soft hyphens (U+00AD), a broader range of control characters (0x00-0x1F excluding tab, newline, CR), DEL character (0x7F), and zero-width/invisible Unicode characters (U+200B-U+200D, U+FEFF), in addition to existing NULL character removal.
Activity
  • The pull request was initiated to resolve an issue (Upsert into pgvector Postgres Vector store is buggy #5611) where problematic characters caused errors when storing documents in PostgreSQL.
  • The author provided a detailed problem description, highlighting specific character types that were causing issues, particularly those originating from PDF extraction.
  • Test coverage was explicitly demonstrated with a screenshot in the PR description, indicating a focus on validating the new sanitization logic.
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature              | Command             | Description
Code Review          | /gemini review      | Performs a code review for the current pull request in its current state.
Pull Request Summary | /gemini summary     | Provides a summary of the current pull request in its current state.
Comment              | @gemini-code-assist | Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help                 | /gemini help        | Displays a list of available commands.

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

You can also get AI-powered code generation, chat, and code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double-check it and use code with caution.

@gemini-code-assist (bot) left a comment

Code Review

This pull request enhances the sanitizeDocuments method to remove additional problematic characters for PostgreSQL compatibility, which is a valuable improvement. The addition of a comprehensive test suite for this functionality is also excellent. I've provided a couple of suggestions to improve the implementation's efficiency and the clarity of one of the tests.

Comment on lines +252 to +265
        it('should not mutate the original document array', () => {
            const originalContent = 'Text\0with\u00ADissues'
            const documents = [
                new Document({
                    pageContent: originalContent,
                    metadata: {}
                })
            ]

            driver.sanitizeDocuments(documents)

            // After sanitization, the documents array should be modified in place
            expect(documents[0].pageContent).toBe('Textwithissues')
        })
Severity: medium

The test name 'should not mutate the original document array' is misleading because the function does mutate the pageContent of documents within the array. The test implementation and the comment on line 263 correctly reflect this in-place modification. To avoid confusion, the test should be renamed and the redundant comment removed.

        it('should modify the document content in place', () => {
            const originalContent = 'Text\0with\u00ADissues'
            const documents = [
                new Document({
                    pageContent: originalContent,
                    metadata: {}
                })
            ]

            driver.sanitizeDocuments(documents)

            expect(documents[0].pageContent).toBe('Textwithissues')
        })

Comment on lines 78 to 96
        for (var i in documents) {
            documents[i].pageContent = documents[i].pageContent.replace(/\0/g, '')
            let content = documents[i].pageContent

            // Remove NULL characters (0x00) - PostgreSQL doesn't allow these
            content = content.replace(/\0/g, '')

            // Remove soft hyphens (U+00AD) - common in PDF extraction, can cause encoding issues
            content = content.replace(/\u00AD/g, '')

            // Remove other control characters (0x00-0x1F except tab, newline, carriage return)
            // and DEL (0x7F), which can cause issues
            // eslint-disable-next-line no-control-regex
            content = content.replace(/[\x00-\x08\x0B\x0C\x0E-\x1F\x7F]/g, '')

            // Remove零-width and other invisible Unicode characters that can cause issues
            content = content.replace(/[\u200B-\u200D\uFEFF]/g, '')

            documents[i].pageContent = content
        }
Severity: medium

This implementation can be improved for efficiency and readability:

  • Using for...of is the modern and safer way to iterate over array elements compared to for...in.
  • Chaining multiple replace() calls is less performant than a single call with a combined regular expression.
  • The first replace for \0 is redundant as it's covered by the control character regex.
  • There's a typo in the comment on line 92 ('零-width' instead of 'zero-width').

Combining these into a single, more efficient operation improves the code.

        for (const doc of documents) {
            // This combines all replacements into a single, more efficient operation.
            // It removes:
            // - NULL characters (\x00)
            // - Soft hyphens (\u00AD)
            // - Other C0 control characters (\x01-\x08, \x0B, \x0C, \x0E-\x1F) and DEL (\x7F)
            // - Zero-width and invisible Unicode characters (\u200B-\u200D, \uFEFF)
            // eslint-disable-next-line no-control-regex
            doc.pageContent = doc.pageContent.replace(/[\x00-\x08\x0B\x0C\x0E-\x1F\x7F\u00AD\u200B-\u200D\uFEFF]/g, '');
        }
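
To gain confidence that the combined regex is behavior-equivalent to the chained replace() calls it supersedes, a quick throwaway check can be run (illustrative; `chained` and `combined` are ad-hoc names, not part of the codebase):

```typescript
// Throwaway equivalence check: the chained replacements from the original diff
// versus the single combined regex suggested above.
const chained = (s: string): string =>
    s
        .replace(/\0/g, '')
        .replace(/\u00AD/g, '')
        // eslint-disable-next-line no-control-regex
        .replace(/[\x00-\x08\x0B\x0C\x0E-\x1F\x7F]/g, '')
        .replace(/[\u200B-\u200D\uFEFF]/g, '')

// eslint-disable-next-line no-control-regex
const combined = (s: string): string =>
    s.replace(/[\x00-\x08\x0B\x0C\x0E-\x1F\x7F\u00AD\u200B-\u200D\uFEFF]/g, '')

const sample = 'A\0B\u00ADC\tD\x1FE\x7FF\u200BG\uFEFFH\r\n'
console.log(chained(sample) === combined(sample)) // true; both yield 'ABC\tDEFGH\r\n'
```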
