fix: enhance sanitizeDocuments to remove problematic characters for PostgreSQL #5755
prd-hoang-doan wants to merge 1 commit into FlowiseAI:main from
Conversation
Summary of Changes

Hello @prd-hoang-doan, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request improves the robustness of document storage in PostgreSQL by expanding the character sanitization logic within the TypeORM driver. It proactively filters out various non-printable and problematic Unicode characters that commonly lead to database errors, especially when processing text from diverse sources such as PDFs. This ensures that document content is clean and compatible before being persisted, preventing data integrity issues and enhancing system stability.

Highlights
Changelog
Activity
Code Review
This pull request enhances the sanitizeDocuments method to remove additional problematic characters for PostgreSQL compatibility, which is a valuable improvement. The addition of a comprehensive test suite for this functionality is also excellent. I've provided a couple of suggestions to improve the implementation's efficiency and the clarity of one of the tests.
it('should not mutate the original document array', () => {
    const originalContent = 'Text\0with\u00ADissues'
    const documents = [
        new Document({
            pageContent: originalContent,
            metadata: {}
        })
    ]

    driver.sanitizeDocuments(documents)

    // After sanitization, the documents array should be modified in place
    expect(documents[0].pageContent).toBe('Textwithissues')
})
The test name 'should not mutate the original document array' is misleading because the function does mutate the pageContent of documents within the array. The test implementation and the comment on line 263 correctly reflect this in-place modification. To avoid confusion, the test should be renamed and the redundant comment removed.
it('should modify the document content in place', () => {
    const originalContent = 'Text\0with\u00ADissues'
    const documents = [
        new Document({
            pageContent: originalContent,
            metadata: {}
        })
    ]
    driver.sanitizeDocuments(documents)
    expect(documents[0].pageContent).toBe('Textwithissues')
})

for (var i in documents) {
    documents[i].pageContent = documents[i].pageContent.replace(/\0/g, '')
    let content = documents[i].pageContent

    // Remove NULL characters (0x00) - PostgreSQL doesn't allow these
    content = content.replace(/\0/g, '')

    // Remove soft hyphens (U+00AD) - common in PDF extraction, can cause encoding issues
    content = content.replace(/\u00AD/g, '')

    // Remove other control characters (0x00-0x1F except tab, newline, carriage return)
    // and DEL (0x7F), which can cause issues
    // eslint-disable-next-line no-control-regex
    content = content.replace(/[\x00-\x08\x0B\x0C\x0E-\x1F\x7F]/g, '')

    // Remove零-width and other invisible Unicode characters that can cause issues
    content = content.replace(/[\u200B-\u200D\uFEFF]/g, '')

    documents[i].pageContent = content
}
This implementation can be improved for efficiency and readability:

- Using for...of is the modern and safer way to iterate over array elements compared to for...in.
- Chaining multiple replace() calls is less performant than a single call with a combined regular expression.
- The first replace for \0 is redundant, as it's covered by the control character regex.
- There's a typo in the comment on line 92 ('零-width' instead of 'zero-width').

Combining these into a single, more efficient operation improves the code.
for (const doc of documents) {
    // This combines all replacements into a single, more efficient operation.
    // It removes:
    // - NULL characters (\x00)
    // - Soft hyphens (\u00AD)
    // - Other C0 control characters (\x01-\x08, \x0B, \x0C, \x0E-\x1F) and DEL (\x7F)
    // - Zero-width and invisible Unicode characters (\u200B-\u200D, \uFEFF)
    // eslint-disable-next-line no-control-regex
    doc.pageContent = doc.pageContent.replace(/[\x00-\x08\x0B\x0C\x0E-\x1F\x7F\u00AD\u200B-\u200D\uFEFF]/g, '')
}
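As a rough illustration of this suggestion's behavior, here is a standalone Node.js sketch (not the actual Flowise driver code; the `sanitize` helper and the sample strings are hypothetical) showing that the combined pattern strips the problematic characters while leaving tab, newline, and carriage return intact:

```javascript
// Standalone sketch of the combined sanitization pattern suggested above.
// eslint-disable-next-line no-control-regex
const PROBLEMATIC = /[\x00-\x08\x0B\x0C\x0E-\x1F\x7F\u00AD\u200B-\u200D\uFEFF]/g

function sanitize(text) {
    return text.replace(PROBLEMATIC, '')
}

// NULL bytes and soft hyphens are removed in one pass...
console.log(sanitize('Text\0with\u00ADissues')) // "Textwithissues"

// ...while whitespace PostgreSQL accepts (tab, newline, carriage return) survives.
console.log(JSON.stringify(sanitize('a\tb\nc\rd'))) // "a\tb\nc\rd"
```

Because tab (0x09), newline (0x0A), and carriage return (0x0D) fall in the deliberate gaps of the control-character ranges, the single regex preserves the same whitespace exceptions as the chained version.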
Fix: Enhance sanitizeDocuments to Remove Problematic Characters for PostgreSQL Compatibility
Summary
Enhances the sanitizeDocuments method in the TypeORM PostgreSQL driver to remove additional problematic characters that can cause errors when storing documents in PostgreSQL.
Ticket: #5611
Problem
The previous implementation only removed NULL characters (\0), but PostgreSQL and document processing can encounter issues with additional problematic characters, particularly those introduced by PDF extraction.

Test Coverage:
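For context, here is a minimal sketch (hypothetical sample strings, not taken from the PR) of the kinds of input PDF extraction can produce and why a NULL-only sanitizer is insufficient:

```javascript
// The old approach: strip only NULL characters, as described above.
const oldSanitize = (text) => text.replace(/\0/g, '')

// Hypothetical PDF-extracted strings containing characters the old
// sanitizer leaves untouched.
const samples = [
    'hy\u00ADphen\u00ADated',    // soft hyphens from line-break hyphenation
    'zero\u200Bwidth\u200Djoin', // zero-width space / zero-width joiner
    '\uFEFFbom-prefixed',        // byte-order mark leaking into content
    'ctrl\x07chars\x1B'          // stray C0 control characters
]

for (const s of samples) {
    // None of these contain \0, so the old sanitizer returns them
    // unchanged and the problematic characters reach PostgreSQL.
    console.log(oldSanitize(s) === s) // true for every sample
}
```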