Copilot AI commented Aug 3, 2025

This PR implements a new keep_order parameter for the BS2Json class that preserves the original order of HTML elements instead of grouping them by type. This addresses the need for document generation workflows that require maintaining the structural flow of HTML content.

Problem

The current implementation groups all elements of the same type together, which loses the original document structure:

from bs2json import BS2Json
import json

html = '''<html><body>
<h3>Chapter 1</h3>
<p>Introduction</p>
<hr>
<h3>Chapter 2</h3>
<p>Content</p>
</body></html>'''

# Current behavior - elements grouped by type
result = BS2Json(html).convert()
print(json.dumps(result))
# Output: {"html": {"body": {"h3": ["Chapter 1", "Chapter 2"], "p": ["Introduction", "Content"], "hr": null}}}
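The loss of structure can be shown without the library at all. The sketch below is only an illustration: `group_by_tag` is a hypothetical helper that models grouping by type on flat (tag, text) pairs, not the code BS2Json actually runs. Two documents with different layouts collapse to the same grouped result, so the original order cannot be recovered from it.

```python
from collections import defaultdict


def group_by_tag(elements):
    """Collapse (tag, text) pairs into {tag: text or [texts]}.

    Hypothetical stand-in for grouping by element type.
    """
    grouped = defaultdict(list)
    for tag, text in elements:
        grouped[tag].append(text)
    # A tag that occurs once maps to a bare value instead of a one-item list.
    return {tag: texts[0] if len(texts) == 1 else texts
            for tag, texts in grouped.items()}


# Layout from the example above: heading, paragraph, rule, heading, paragraph.
doc_a = [("h3", "Chapter 1"), ("p", "Introduction"), ("hr", None),
         ("h3", "Chapter 2"), ("p", "Content")]

# Same elements in a different layout: both headings first.
doc_b = [("h3", "Chapter 1"), ("h3", "Chapter 2"), ("p", "Introduction"),
         ("p", "Content"), ("hr", None)]

same_grouping = group_by_tag(doc_a) == group_by_tag(doc_b)
print(same_grouping)  # True: the grouped form cannot tell the two apart
```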

Solution

Added an optional keep_order parameter that preserves element sequence:

# New behavior - elements in original order
result = BS2Json(html, keep_order=True).convert()
print(json.dumps(result))
# Output: {"html": [{"body": [{"h3": "Chapter 1"}, {"p": "Introduction"}, {"hr": null}, {"h3": "Chapter 2"}, {"p": "Content"}]}]}

Key Features

  • Backward Compatible: Default behavior unchanged (keep_order=False)
  • Order Preservation: Elements appear in their original HTML sequence
  • Clean Output: Text-only elements collapse to a plain string (e.g., {"h3": "Chapter 1"})
  • Mixed Content Support: Preserves order within elements that have mixed content
  • Full API Support: Works with both main BS2Json class and tag extension methods

Implementation Details

  • Modified to_json() method to handle ordered output when keep_order=True
  • Added simplification logic for clean single-text element output
  • Updated extension method signatures to support the new parameter
  • Comprehensive test coverage ensuring both new functionality and backward compatibility
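A rough sketch of how a single conversion routine might branch on keep_order, under stated assumptions: the (tag, children) tuple model, the `convert` function, and the "text" key used for loose text in mixed content are all hypothetical and chosen for illustration. They are not taken from the actual to_json() implementation.

```python
def convert(node, keep_order=False):
    """Convert a (tag, children) tuple into a JSON-like value.

    children is a list of strings (text) and nested (tag, children)
    tuples. This node model is hypothetical.
    """
    tag, children = node
    if not children:
        return {tag: None}               # empty element such as <hr>
    if all(isinstance(c, str) for c in children):
        return {tag: "".join(children)}  # simplify text-only elements

    if keep_order:
        # One entry per child, in source order.
        items = [{"text": c} if isinstance(c, str) else convert(c, True)
                 for c in children]
        return {tag: items}

    # Default path: merge children that share a tag.
    grouped = {}
    for c in children:
        if isinstance(c, str):
            key, value = "text", c
        else:
            key, value = next(iter(convert(c).items()))
        grouped.setdefault(key, []).append(value)
    return {tag: {k: v[0] if len(v) == 1 else v for k, v in grouped.items()}}


body = ("body", [("h3", ["Chapter 1"]), ("p", ["Introduction"]), ("hr", []),
                 ("h3", ["Chapter 2"]), ("p", ["Content"])])

grouped_result = convert(body)
ordered_result = convert(body, keep_order=True)
```

Keeping both paths behind one flag that defaults to False is what leaves existing callers unaffected: they never reach the ordered branch.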

This enables use cases like feeding structured JSON to document generators (e.g., python-docx) while maintaining the original document flow and semantic relationships between adjacent elements.
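As an illustration of that workflow, the ordered output can be walked front to back to recover the document flow. The sketch below assumes the output shape shown in the Solution section (lists of single-key dicts). `iter_blocks` is a hypothetical helper, not part of bs2json, and it does not handle attributes or other nested dict values.

```python
def iter_blocks(node):
    """Yield (tag, text) pairs in document order from keep_order-style output."""
    for tag, value in node.items():
        if isinstance(value, list):
            # A list holds child elements in their original sequence.
            for child in value:
                yield from iter_blocks(child)
        else:
            yield tag, value


ordered = {"html": [{"body": [{"h3": "Chapter 1"}, {"p": "Introduction"},
                               {"hr": None},
                               {"h3": "Chapter 2"}, {"p": "Content"}]}]}

blocks = list(iter_blocks(ordered))
for tag, text in blocks:
    print(tag, text)
```

A document generator would dispatch on the tag inside that loop. With python-docx, for instance, an h3 could map to Document.add_heading and a p to Document.add_paragraph.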

Fixes #8.


@MrDebugger MrDebugger marked this pull request as ready for review August 3, 2025 22:29
@MrDebugger MrDebugger marked this pull request as draft August 3, 2025 22:30
@MrDebugger MrDebugger self-requested a review August 3, 2025 22:30
Copilot AI and others added 2 commits August 3, 2025 22:32
…ping by type

Co-authored-by: MrDebugger <25988388+MrDebugger@users.noreply.github.com>
Co-authored-by: MrDebugger <25988388+MrDebugger@users.noreply.github.com>
Copilot AI changed the title from "[WIP] new feature request: option to keep elements order" to "Add keep_order option to preserve HTML element sequence instead of grouping by type" Aug 3, 2025
Development

Successfully merging this pull request may close these issues.

new feature request: option to keep elements order