fix: enhance sanitizeDocuments to remove problematic characters for Postgresql #5755

Open

prd-hoang-doan wants to merge 1 commit into FlowiseAI:main from prd-hoang-doan:fix/enhance_sanitize_docs_posgresql

Conversation

@prd-hoang-doan (Contributor)

Fix: Enhance sanitizeDocuments to Remove Problematic Characters for PostgreSQL Compatibility

Summary

Enhances the sanitizeDocuments method in the TypeORM PostgreSQL driver to remove additional problematic characters that can cause errors when storing documents in PostgreSQL.

Ticket: #5611

Problem

The previous implementation removed only NULL characters (\0), but PostgreSQL and downstream document processing can also trip over other problematic characters, particularly those introduced by PDF extraction:

  • Soft hyphens (U+00AD) - common in PDF extraction
  • Control characters (0x00-0x1F) except tab, newline, and carriage return
  • DEL character (0x7F)
  • Zero-width and invisible Unicode characters (U+200B-U+200D, U+FEFF)
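
Taken together, the sanitization described above strips each of these character classes while preserving tab, newline, and carriage return. A minimal standalone sketch of that logic (illustrative only; `sanitizePageContent` is a hypothetical helper name, not the actual driver method):

```typescript
// Hypothetical sketch of the character stripping described above.
// Not the actual TypeORM driver code; the function name is illustrative.
function sanitizePageContent(content: string): string {
    return content
        // soft hyphens (U+00AD), common in PDF extraction
        .replace(/\u00AD/g, '')
        // control characters 0x00-0x1F (except \t, \n, \r) and DEL (0x7F)
        // eslint-disable-next-line no-control-regex
        .replace(/[\x00-\x08\x0B\x0C\x0E-\x1F\x7F]/g, '')
        // zero-width and invisible Unicode characters
        .replace(/[\u200B-\u200D\uFEFF]/g, '')
}

console.log(sanitizePageContent('soft\u00ADhyphen, NUL\u0000byte, zero\u200Bwidth'))
// → 'softhyphen, NULbyte, zerowidth'
```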

Test Coverage:

[Screenshot: unit test results demonstrating coverage]

@gemini-code-assist (Contributor)

Summary of Changes

Hello @prd-hoang-doan, I'm Gemini Code Assist [1]! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request improves the robustness of document storage in PostgreSQL by expanding the character sanitization logic within the TypeORM driver. It proactively filters out various non-printable and problematic Unicode characters that commonly lead to database errors, especially when processing text from diverse sources like PDFs. This ensures that document content is clean and compatible before being persisted, preventing data integrity issues and enhancing system stability.

Highlights

  • Enhanced Document Sanitization: The sanitizeDocuments method in the TypeORM PostgreSQL driver has been significantly enhanced to remove a wider range of problematic characters that can cause errors in PostgreSQL.
  • Expanded Character Removal: Beyond NULL characters, the method now targets soft hyphens (U+00AD), most control characters (0x00-0x1F, excluding tab, newline, carriage return), the DEL character (0x7F), and zero-width/invisible Unicode characters (U+200B-U+200D, U+FEFF).
  • Improved PostgreSQL Compatibility: This change specifically addresses issues encountered when storing documents, particularly those extracted from PDFs, ensuring better compatibility and preventing database errors.
  • Comprehensive Test Coverage: New unit tests have been added to thoroughly validate the removal of each newly targeted problematic character, as well as to ensure preservation of valid characters and metadata.
Changelog
  • packages/components/nodes/vectorstores/Postgres/driver/TypeORM.test.ts
    • Added comprehensive unit tests for the sanitizeDocuments method, covering various problematic characters like soft hyphens, control characters, DEL, and zero-width spaces, and ensuring correct preservation of valid characters and metadata.
  • packages/components/nodes/vectorstores/Postgres/driver/TypeORM.ts
    • Modified the sanitizeDocuments method to include removal of soft hyphens (U+00AD), a broader range of control characters (0x00-0x1F excluding tab, newline, CR), DEL character (0x7F), and zero-width/invisible Unicode characters (U+200B-U+200D, U+FEFF), in addition to existing NULL character removal.
Activity
  • The pull request was initiated to resolve an issue (Upsert into pgvector Postgres Vector store is buggy #5611) where problematic characters caused errors when storing documents in PostgreSQL.
  • The author provided a detailed problem description, highlighting specific character types that were causing issues, particularly those originating from PDF extraction.
  • Test coverage was explicitly demonstrated with a screenshot in the PR description, indicating a focus on validating the new sanitization logic.
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature              | Command             | Description
Code Review          | /gemini review      | Performs a code review for the current pull request in its current state.
Pull Request Summary | /gemini summary     | Provides a summary of the current pull request in its current state.
Comment              | @gemini-code-assist | Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help                 | /gemini help        | Displays a list of available commands.

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

You can also get AI-powered code generation, chat, and code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double-check it and use code with caution.

@gemini-code-assist (bot) left a comment

Code Review

This pull request enhances the sanitizeDocuments method to remove additional problematic characters for PostgreSQL compatibility, which is a valuable improvement. The addition of a comprehensive test suite for this functionality is also excellent. I've provided a couple of suggestions to improve the implementation's efficiency and the clarity of one of the tests.

Comment on lines +252 to +265
        it('should not mutate the original document array', () => {
            const originalContent = 'Text\0with\u00ADissues'
            const documents = [
                new Document({
                    pageContent: originalContent,
                    metadata: {}
                })
            ]

            driver.sanitizeDocuments(documents)

            // After sanitization, the documents array should be modified in place
            expect(documents[0].pageContent).toBe('Textwithissues')
        })
Severity: medium

The test name 'should not mutate the original document array' is misleading because the function does mutate the pageContent of documents within the array. The test implementation and the comment on line 263 correctly reflect this in-place modification. To avoid confusion, the test should be renamed and the redundant comment removed.

        it('should modify the document content in place', () => {
            const originalContent = 'Text\0with\u00ADissues'
            const documents = [
                new Document({
                    pageContent: originalContent,
                    metadata: {}
                })
            ]

            driver.sanitizeDocuments(documents)

            expect(documents[0].pageContent).toBe('Textwithissues')
        })

Comment on lines 78 to 96
        for (var i in documents) {
            documents[i].pageContent = documents[i].pageContent.replace(/\0/g, '')
            let content = documents[i].pageContent

            // Remove NULL characters (0x00) - PostgreSQL doesn't allow these
            content = content.replace(/\0/g, '')

            // Remove soft hyphens (U+00AD) - common in PDF extraction, can cause encoding issues
            content = content.replace(/\u00AD/g, '')

            // Remove other control characters (0x00-0x1F except tab, newline, carriage return)
            // and DEL (0x7F), which can cause issues
            // eslint-disable-next-line no-control-regex
            content = content.replace(/[\x00-\x08\x0B\x0C\x0E-\x1F\x7F]/g, '')

            // Remove零-width and other invisible Unicode characters that can cause issues
            content = content.replace(/[\u200B-\u200D\uFEFF]/g, '')

            documents[i].pageContent = content
        }
Severity: medium

This implementation can be improved for efficiency and readability:

  • Using for...of is the modern and safer way to iterate over array elements compared to for...in.
  • Chaining multiple replace() calls is less performant than a single call with a combined regular expression.
  • The first replace for \0 is redundant as it's covered by the control character regex.
  • There's a typo in the comment on line 92 ('零-width' instead of 'zero-width').

Combining these into a single, more efficient operation improves the code.

        for (const doc of documents) {
            // This combines all replacements into a single, more efficient operation.
            // It removes:
            // - NULL characters (\x00)
            // - Soft hyphens (\u00AD)
            // - Other C0 control characters (\x01-\x08, \x0B, \x0C, \x0E-\x1F) and DEL (\x7F)
            // - Zero-width and invisible Unicode characters (\u200B-\u200D, \uFEFF)
            // eslint-disable-next-line no-control-regex
            doc.pageContent = doc.pageContent.replace(/[\x00-\x08\x0B\x0C\x0E-\x1F\x7F\u00AD\u200B-\u200D\uFEFF]/g, '');
        }
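
To gain confidence that the combined regex is behavior-equivalent to the chained replace() calls it supersedes, a quick throwaway check can be run (illustrative; `chained` and `combined` are ad-hoc names, not part of the codebase):

```typescript
// Throwaway equivalence check: the chained replacements from the original diff
// versus the single combined regex suggested above.
const chained = (s: string): string =>
    s
        .replace(/\0/g, '')
        .replace(/\u00AD/g, '')
        // eslint-disable-next-line no-control-regex
        .replace(/[\x00-\x08\x0B\x0C\x0E-\x1F\x7F]/g, '')
        .replace(/[\u200B-\u200D\uFEFF]/g, '')

// eslint-disable-next-line no-control-regex
const combined = (s: string): string =>
    s.replace(/[\x00-\x08\x0B\x0C\x0E-\x1F\x7F\u00AD\u200B-\u200D\uFEFF]/g, '')

const sample = 'A\0B\u00ADC\tD\x1FE\x7FF\u200BG\uFEFFH\r\n'
console.log(chained(sample) === combined(sample)) // true; both yield 'ABC\tDEFGH\r\n'
```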
