@x15sr71 x15sr71 commented Jan 14, 2026

In raising this pull request, I confirm the following (please check boxes):

  • I have read and understood the contributors guide.
  • I have checked that another pull request for this purpose does not exist.
  • I have considered, and confirmed that this submission will be valuable to others.
  • I accept that this submission may not be used, and the pull request closed at the will of the maintainer.
  • I give this submission freely, and claim no ownership to its content.
  • I have mentioned this change in the changelog.

My familiarity with the project is as follows (check one):

  • I have never used CCExtractor.
  • I have used CCExtractor just a couple of times.
  • I absolutely love CCExtractor, but have not contributed previously.
  • I am an active contributor to CCExtractor.

Summary

This PR implements machine-readable JSON output for the -out=report feature, addressing issue #1399. Users can now generate structured reports that can be parsed with tools like jq, enabling seamless integration with automated workflows.

Background

Currently, CCExtractor’s report output is human-readable text that requires custom parsing for automation. While other media analysis tools such as ffprobe and mediainfo provide JSON output, structured closed-caption reporting is not consistently available across tools or versions. This feature enables CCExtractor to expose its existing report data in a structured JSON format.

Use case: Users running CCExtractor in automated environments (e.g., CI/CD pipelines, media processing workflows) need to programmatically determine if streams contain closed captions without writing custom parsers.

Changes

-out=report Option

ccextractor -out=report input.ts

Existing Text Output (-out=report)

File: ../20251206ch29FullTS.ts
Stream Mode: Transport Stream
Program Count: 5
Program Numbers: 1 2 3 4 5
PID: 49, Program: 1, MPEG-2 video
PID: 52, Program: 1, AC3 audio
PID: 53, Program: 1, AC3 audio
PID: 65, Program: 2, MPEG-2 video
PID: 68, Program: 2, AC3 audio
PID: 81, Program: 3, MPEG-2 video
PID: 84, Program: 3, AC3 audio
PID: 97, Program: 4, MPEG-2 video
PID: 100, Program: 4, AC3 audio
PID: 113, Program: 5, MPEG-2 video
PID: 116, Program: 5, AC3 audio
//////// Program #5: ////////
DVB Subtitles: No
Teletext: No
ATSC Closed Caption: Yes
EIA-608: Yes
XDS: No
CC1: Yes
CC2: No
CC3: No
CC4: No
CEA-708: Yes
Services: 1 2 3 4 5 6 9
Primary Language Present: Yes
Secondary Language Present: Yes
Width: 704
Height: 480
Aspect Ratio: 03 - 16:9
Frame Rate: 04 - 29.97


(More programs omitted for brevity)

JSON Output Structure (v1.0)

The output follows a versioned JSON report structure:

JSON output via --report-format json

ccextractor --report-format json -out=report input.ts
{
  "schema": {
    "name": "ccextractor-report",
    "version": "1.0"
  },
  "input": {
    "source": "file",
    "path": "../20251206ch29FullTS.ts"
  },
  "stream": {
    "mode": "Transport Stream",
    "program_count": 5,
    "program_numbers": [
      1,
      2,
      3,
      4,
      5
    ],
    "pids": [
      {
        "pid": 49,
        "program_number": 1,
        "codec": "MPEG-2 video"
      },
      {
        "pid": 52,
        "program_number": 1,
        "codec": "AC3 audio"
      },
      {
        "pid": 53,
        "program_number": 1,
        "codec": "AC3 audio"
      },
      {
        "pid": 65,
        "program_number": 2,
        "codec": "MPEG-2 video"
      },
      {
        "pid": 68,
        "program_number": 2,
        "codec": "AC3 audio"
      },
      {
        "pid": 81,
        "program_number": 3,
        "codec": "MPEG-2 video"
      },
      {
        "pid": 84,
        "program_number": 3,
        "codec": "AC3 audio"
      },
      {
        "pid": 97,
        "program_number": 4,
        "codec": "MPEG-2 video"
      },
      {
        "pid": 100,
        "program_number": 4,
        "codec": "AC3 audio"
      },
      {
        "pid": 113,
        "program_number": 5,
        "codec": "MPEG-2 video"
      },
      {
        "pid": 116,
        "program_number": 5,
        "codec": "AC3 audio"
      }
    ]
  },
  "programs": [
    {
      "program_number": 1,
      "summary": {
        "has_any_captions": true,
        "has_608": true,
        "has_708": true
      },
      "services": {
        "dvb_subtitles": false,
        "teletext": false,
        "atsc_closed_caption": true
      },
      "captions": {
        "present": true,
        "eia_608": {
          "present": true,
          "xds": false,
          "channels": {
            "cc1": true,
            "cc2": false,
            "cc3": false,
            "cc4": false
          }
        },
        "cea_708": {
          "present": true,
          "services": [
            1,
            2,
            3,
            4,
            5,
            6,
            9
          ]
        }
      },
      "video": {
        "width": 1920,
        "height": 1080,
        "aspect_ratio": "03 - 16:9",
        "frame_rate": "04 - 29.97"
      }
    },

(More programs omitted for brevity)

Schema Notes

  • The JSON schema is intentionally descriptive rather than prescriptive.
  • Field presence and values depend on the input container, stream type, and available metadata.
  • Codec strings reflect CCExtractor's internal stream type descriptions and are container-dependent (e.g., "AC3 audio" vs "AC3").
  • The services object under programs[] indicates which captioning systems are present (DVB, Teletext, ATSC), while captions.cea_708.services[] lists active CEA-708 caption service numbers.

Program Ordering:

  • JSON output: Programs are sorted in ascending order by program number (1, 2, 3, 4, 5) for predictable parsing
  • Text output: Programs are displayed in descending order (5, 4, 3, 2, 1) as they're processed
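
The ascending order described above can be produced with a standard qsort call. The following is a minimal sketch, not the PR's actual code: struct program_entry, compare_program_number and sort_programs are hypothetical names used only for illustration.

```c
#include <stdlib.h>

/* Hypothetical stand-in for a per-program record; the real CCExtractor
 * structures carry much more than the program number. */
struct program_entry
{
    unsigned program_number;
};

/* qsort comparator: ascending by program number. */
static int compare_program_number(const void *a, const void *b)
{
    const struct program_entry *pa = a;
    const struct program_entry *pb = b;
    /* (x > y) - (x < y) yields -1, 0 or 1 and, unlike a plain subtraction,
     * cannot wrap around for unsigned values. */
    return (pa->program_number > pb->program_number) -
           (pa->program_number < pb->program_number);
}

/* Sorts the array in place so programs are emitted as 1, 2, 3, ... */
static void sort_programs(struct program_entry *programs, size_t count)
{
    qsort(programs, count, sizeof programs[0], compare_program_number);
}
```

Sorting a copy of the program list, rather than the list the text report walks, would keep the text output order unchanged; whether the PR does this is not stated here.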

Field Mapping (text report to JSON):

Text Output Field          | JSON Field
---------------------------+-------------------------------------------
File:                      | input.path
Stream Mode                | stream.mode
Program Count              | stream.program_count
Program Numbers            | stream.program_numbers[]
PID: X, Program: Y, Codec  | stream.pids[]
DVB Subtitles              | programs[].services.dvb_subtitles
Teletext                   | programs[].services.teletext
ATSC Closed Caption        | programs[].services.atsc_closed_caption
EIA-608                    | programs[].captions.eia_608.present
XDS                        | programs[].captions.eia_608.xds
CC1..CC4                   | programs[].captions.eia_608.channels.*
CEA-708                    | programs[].captions.cea_708.present
Services:                  | programs[].captions.cea_708.services[]
Primary Language Present   | (not in JSON)
Secondary Language Present | (not in JSON)
Width / Height             | programs[].video.width / height
Aspect Ratio               | programs[].video.aspect_ratio
Frame Rate                 | programs[].video.frame_rate
MPEG-4 Timed Text          | container.mp4.timed_text_tracks
(JSON-only)                | schema.*
(JSON-only)                | programs[].summary.*
(JSON-only)                | programs[].captions.present

Key Features:

  • Structured, machine-readable JSON output for -out=report
  • Versioned schema (v1.0) for future extensibility
  • Backward compatible (existing text report remains the default)
  • Caption presence reporting for:
    • ATSC Closed Captions (EIA-608 / CEA-708)
    • DVB subtitles (presence flag)
    • Teletext (presence flag)
    • Note: the has_any_captions summary field reflects EIA-608 / CEA-708 only.
  • Program-level summary fields for fast closed-caption automation checks
  • PID and codec metadata per program (preserving CCExtractor’s existing codec string formats)
  • Guarded video metadata (emitted only when valid)
  • Multi-program stream support with deterministic ordering
  • Container-level metadata when available (e.g., MP4 timed text track count)
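
As an illustration of the "guarded video metadata" item, the sketch below formats a video object only when the dimensions look usable. The helper name format_video_object is hypothetical, and treating a zero width or height as "not valid" is an assumption; the PR text does not spell out the actual validity check.

```c
#include <stdio.h>

/* Hypothetical helper: writes a compact "video" member into buf, or an empty
 * string when the dimensions look unset. Treating zero as invalid is an
 * assumption made for this sketch.
 * Returns the number of characters written, 0 when the object is skipped,
 * or -1 if buf is too small. */
static int format_video_object(char *buf, size_t size,
                               unsigned width, unsigned height)
{
    if (size == 0)
        return -1;
    if (width == 0 || height == 0)
    {
        buf[0] = '\0'; /* nothing emitted: the key is simply absent */
        return 0;
    }
    int n = snprintf(buf, size, "\"video\": {\"width\": %u, \"height\": %u}",
                     width, height);
    if (n < 0 || (size_t)n >= size)
        return -1; /* output would have been truncated */
    return n;
}
```

Consumers should therefore treat the video key as optional, which is what the select(. != null) filter in the jq examples accounts for.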

Technical Approach

  • JSON generation is implemented in C on top of CCExtractor’s existing internal data structures; caption extraction and decoding logic are not modified.
  • String values are escaped to ensure valid JSON output.
  • Format selection uses case-insensitive comparison (strcasecmp / _stricmp).
  • Memory allocation and cleanup follow existing project patterns.
  • Programs are sorted by program number to provide stable and predictable output.
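
To make the string-escaping point concrete, here is a minimal sketch of a JSON string escaper in C. It is not taken from the PR: json_escape is a hypothetical name, and the sketch only covers the escapes JSON requires (quote, backslash and control characters below 0x20), passing all other bytes through unchanged.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: copies src into dst as the body of a JSON string
 * (without the surrounding quotes), escaping characters that JSON requires
 * to be escaped. Returns the number of bytes written, excluding the
 * terminating NUL, or -1 if dst is too small. */
static int json_escape(const char *src, char *dst, size_t dst_size)
{
    size_t out = 0;

    if (dst_size == 0)
        return -1;

    for (const unsigned char *p = (const unsigned char *)src; *p; p++)
    {
        char buf[8]; /* large enough for a six-character \uXXXX escape plus NUL */
        size_t n;

        switch (*p)
        {
            case '"':  strcpy(buf, "\\\""); break;
            case '\\': strcpy(buf, "\\\\"); break;
            case '\n': strcpy(buf, "\\n");  break;
            case '\r': strcpy(buf, "\\r");  break;
            case '\t': strcpy(buf, "\\t");  break;
            default:
                if (*p < 0x20)
                    /* remaining control characters use the \uXXXX form */
                    snprintf(buf, sizeof buf, "\\u%04x", (unsigned)*p);
                else
                {
                    buf[0] = (char)*p;
                    buf[1] = '\0';
                }
        }

        n = strlen(buf);
        if (out + n + 1 > dst_size) /* keep room for the terminating NUL */
            return -1;
        memcpy(dst + out, buf, n);
        out += n;
    }

    dst[out] = '\0';
    return (int)out;
}
```

File paths are the most likely place for such characters to appear in a report, for example backslashes in Windows paths passed as input.path.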

Example Testing Commands

# Test JSON output
ccextractor --report-format json -out=report sample.ts | jq .

# Verify caption presence
ccextractor --report-format json -out=report sample.ts | jq '.programs[0].summary.has_any_captions'

# Extract specific caption channels
ccextractor --report-format json -out=report sample.ts | jq '.programs[].captions.eia_608.channels'

# Check which CC channels are active
ccextractor --report-format json -out=report sample.ts | jq '.programs[].captions.eia_608.channels | to_entries | map(select(.value == true)) | .[].key'

# Get video dimensions
ccextractor --report-format json -out=report sample.ts | jq '.programs[].video | select(. != null) | {width, height}'

# Default text format still works
ccextractor -out=report sample.ts

Field Value Formats:

  • String values like aspect_ratio and frame_rate preserve CCExtractor's internal enum formatting (e.g., "03 - 16:9", "04 - 29.97")
  • This design choice maintains transparency and aids debugging
  • Users needing normalized values can post-process with simple string operations:
    jq '.programs[].video.aspect_ratio | split(" - ")[1]'
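
For consumers written in C rather than jq, the same normalization can be sketched as follows. strip_enum_prefix is a hypothetical helper name, and the " - " separator is assumed from the two example values shown above.

```c
#include <string.h>

/* Hypothetical helper: returns a pointer to the text after the first " - "
 * separator (e.g. "16:9" for "03 - 16:9"), or the original string when no
 * separator is present. The result points into the input; nothing is copied. */
static const char *strip_enum_prefix(const char *value)
{
    const char *sep = strstr(value, " - ");
    return sep ? sep + 3 : value;
}
```

Because the function falls back to the unmodified string, it stays usable if a future schema version emits already-normalized values.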

Benefits

  1. Automation-Friendly: Enables programmatic parsing without regex/custom parsers
  2. Familiar Structure: Uses JSON patterns similar to tools like ffprobe and mediainfo
  3. Extensible: Versioned schema to support future extensions
  4. Backward Compatible: Existing workflows continue to work unchanged
  5. Addresses Real Need: Solves a problem raised by multiple community members (issue #1399, "[PROPOSAL] - Structured data JSON output of ccextractor -out=report", and related discussions)
  6. Quick Caption Detection: Provides has_any_captions summary field for fast EIA-608 / CEA-708 closed-caption checks

Notes

  • Platform compatibility: uses strcasecmp on POSIX systems and maps to _stricmp on Windows via platform-specific preprocessor guards.
  • Video and container metadata are emitted conditionally when applicable
  • Temporary allocations used for program ordering are properly released
  • The implementation follows existing CCExtractor coding conventions
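
The platform guard mentioned in the first note might look roughly like the sketch below. The enum and the parse_report_format function are hypothetical names, and accepting "text" as an explicit value is an assumption; only the json value appears in the examples above.

```c
#include <stddef.h>
#include <string.h>
#ifdef _WIN32
#define strcasecmp _stricmp /* Windows C runtime spelling, declared in <string.h> */
#else
#include <strings.h>        /* POSIX header that declares strcasecmp */
#endif

/* Hypothetical names for this sketch; the PR may use different ones. */
enum report_format
{
    REPORT_FORMAT_TEXT,
    REPORT_FORMAT_JSON,
    REPORT_FORMAT_UNKNOWN
};

/* Maps the option value to a format, ignoring case. A NULL value (option
 * not given) falls back to the text report, which remains the default. */
static enum report_format parse_report_format(const char *value)
{
    if (value == NULL || strcasecmp(value, "text") == 0)
        return REPORT_FORMAT_TEXT;
    if (strcasecmp(value, "json") == 0)
        return REPORT_FORMAT_JSON;
    return REPORT_FORMAT_UNKNOWN; /* caller can report an invalid format */
}
```

Returning a distinct unknown value lets the caller reject a mistyped format instead of silently falling back to text.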

@ccextractor-bot

CCExtractor CI platform finished running the test files on linux. Below is a summary of the test results, when compared to test for commit 2028754...:
Report Name | Tests Passed
------------+-------------
Broken      | 13/13
CEA-708     | 14/14
DVB         | 7/7
DVD         | 3/3
DVR-MS      | 2/2
General     | 25/27
Hardsubx    | 1/1
Hauppage    | 3/3
MP4         | 3/3
NoCC        | 10/10
Options     | 81/86
Teletext    | 21/21
WTV         | 13/13
XDS         | 34/34

Your PR breaks these cases:

  • ccextractor --autoprogram --out=ttxt --latin1 --ucla dab1c1bd65...
  • ccextractor --out=srt --latin1 --autoprogram 29e5ffd34b...
  • ccextractor --startcreditstext "CCextractor Start crdit Testing" c4dd893cb9...
  • ccextractor --startcreditsnotbefore 1 --startcreditstext "CCextractor Start crdit Testing" c4dd893cb9...
  • ccextractor --startcreditsnotafter 2 --startcreditstext "CCextractor Start crdit Testing" c4dd893cb9...
  • ccextractor --startcreditsforatleast 1 --startcreditstext "CCextractor Start crdit Testing" c4dd893cb9...
  • ccextractor --startcreditsforatmost 2 --startcreditstext "CCextractor Start crdit Testing" c4dd893cb9...

Congratulations: Merging this PR would fix the following tests:

  • ccextractor --autoprogram --out=srt --latin1 --quant 0 85271be4d2..., Last passed: Never
  • ccextractor --out=spupng c83f765c66..., Last passed: Never

It seems that not all tests were passed completely. This is an indication that the output of some files is not as expected (but might be according to you).

Check the result page for more info.

@ccextractor-bot

CCExtractor CI platform finished running the test files on windows. Below is a summary of the test results, when compared to test for commit 2028754...:
Report Name | Tests Passed
------------+-------------
Broken      | 13/13
CEA-708     | 14/14
DVB         | 6/7
DVD         | 3/3
DVR-MS      | 2/2
General     | 25/27
Hardsubx    | 1/1
Hauppage    | 3/3
MP4         | 3/3
NoCC        | 10/10
Options     | 81/86
Teletext    | 21/21
WTV         | 13/13
XDS         | 34/34

Your PR breaks these cases:

  • ccextractor --autoprogram --out=srt --latin1 --quant 0 85271be4d2...
  • ccextractor --autoprogram --out=ttxt --latin1 --ucla dab1c1bd65...
  • ccextractor --out=srt --latin1 --autoprogram 29e5ffd34b...
  • ccextractor --startcreditstext "CCextractor Start crdit Testing" c4dd893cb9...
  • ccextractor --startcreditsnotbefore 1 --startcreditstext "CCextractor Start crdit Testing" c4dd893cb9...
  • ccextractor --startcreditsnotafter 2 --startcreditstext "CCextractor Start crdit Testing" c4dd893cb9...
  • ccextractor --startcreditsforatleast 1 --startcreditstext "CCextractor Start crdit Testing" c4dd893cb9...
  • ccextractor --startcreditsforatmost 2 --startcreditstext "CCextractor Start crdit Testing" c4dd893cb9...

Congratulations: Merging this PR would fix the following tests:


It seems that not all tests were passed completely. This is an indication that the output of some files is not as expected (but might be according to you).

Check the result page for more info.
