
PDF title extraction in FileUpload picks up partial text or wrong content #469

@amahuli03


Problem

I've been manually testing sample PDFs and found some issues with title extraction. Specifically, the fallback that takes the title from the first matching text block sometimes extracts the wrong text.

Here are some examples:

| # | Filename | Actual Title | Extracted Title |
|---|----------|--------------|-----------------|
| 1 | patientperception.pdf | Patients' perceptions of information received about medication prescribed for bipolar disorder: Implications for informed choice | a Postgraduate Medical School, University of Brighton, United Kingdom b Centre for Behavioural Medicine... |
| 2 | The World Federation of Societies of Biological Psychiatry WFSBP Guidelines for the Biological Treatment of Bipolar Disorders Update 2009 on the Tr.pdf | The World Federation of Societies of Biological Psychiatry (WFSBP) Guidelines for the Biological Treatment of Bipolar Disorders: Update 2009 on the Treatment of Acute Mania | The World Federation of Societies of Biological Psychiatry (WFSBP) Guidelines for the Biological Tre |
| 3 | Pharmacological Treatment of Bipolar Depression_ What are the Current and Emerging Options_ - PMC.pdf | Pharmacological Treatment of Bipolar Depression: What are the Current and Emerging Options? | imply endorsement of, or agreement with, the contents by NLM or the National Institutes of |
| 4 | pharmacotherapyexposure.pdf | Pharmacotherapy exposure as a marker of disease complexity in bipolar disorder: Associations with clinical & genetic risk factors | A R T I C L E I N F O |

Causes

(I added some debug logs to see what was going on during title extraction.)

Error 1:

The first block is wrapped in quotes, so it doesn't match the title regex (which requires the text to start with an alphanumeric character). Even if the regex accepted the first block, the block includes "Brief report," which is a title prefix and not part of the actual title. The text block approach we currently use has no way to distinguish the prefix from the actual title, since both are in the same block.

Error 2:

It correctly extracts the title from the file's metadata. The problem is that the metadata is truncated (probably had a limit of 100 characters). Is this the correct flow? Should we stick to the title metadata if we can?

Another issue this file raised: even if the metadata weren't truncated and we got the full title, our title regex would reject it, since the regex disallows year-like numbers. We'd fall back to extracting text blocks, which would also reject the correct title because of the year. Is this okay/expected?

Error 3:

This file has a short preamble/disclaimer before the title. Since the disclaimer matches the regex, it is extracted as the title.

Error 4:

The actual title was rejected because of the '&' character.

Proposed Fixes

Use font size to identify titles

Replace the text block scanning entirely. Instead, we can use get_text("dict") on the first few pages, which gives us access to font size metadata. We can find the max font size across the first few pages and collect adjacent spans of that size. If the title is the largest text, this should work to find it.

We could combine this with the regex approach and filter large non-title text using the title regex.

Edge cases to consider

  • Subtitles with smaller font size would be skipped
  • If all text on a page is the same font size, this won't work. Should fall back to OpenAI in this case.
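
A minimal sketch of the font-size idea, assuming PyMuPDF's `get_text("dict")` layout (blocks → lines → spans, each span carrying `size` and `text`). The names `iter_spans`, `title_by_font_size`, `load_page_dicts`, `SIZE_TOLERANCE`, and `max_body_share` are hypothetical and not part of the current code; the 0.5 thresholds are guesses that would need tuning against the sample PDFs. It returns `None` when the largest size covers most of the text, which is the "everything is the same font size" edge case above.

```python
import re

SIZE_TOLERANCE = 0.5  # assumed: sizes within 0.5pt of the max count as "the same size"


def iter_spans(page_dict):
    """Yield (size, text) for each non-empty span, in the order PyMuPDF returns them."""
    for block in page_dict.get("blocks", []):
        for line in block.get("lines", []):  # image blocks carry no "lines" key
            for span in line.get("spans", []):
                text = span.get("text", "").strip()
                if text:
                    yield span["size"], text


def title_by_font_size(page_dicts, max_body_share=0.5):
    """Return the first run of adjacent largest-font spans, or None to signal a fallback."""
    spans = [s for page in page_dicts for s in iter_spans(page)]
    if not spans:
        return None

    max_size = max(size for size, _ in spans)

    def is_max(size):
        return max_size - size <= SIZE_TOLERANCE

    # If the largest size accounts for most of the text, font size cannot
    # single out a title (e.g. uniformly sized pages), so give up.
    total_chars = sum(len(text) for _, text in spans)
    max_chars = sum(len(text) for size, text in spans if is_max(size))
    if max_chars / total_chars > max_body_share:
        return None

    # Collect the first run of adjacent spans at the largest size.
    run = []
    for size, text in spans:
        if is_max(size):
            run.append(text)
        elif run:
            break
    return re.sub(r"\s+", " ", " ".join(run)).strip()


def load_page_dicts(path, max_pages=2):
    """Read the first few pages with PyMuPDF (imported lazily; requires `pip install pymupdf`)."""
    import fitz

    with fitz.open(path) as doc:
        return [doc[i].get_text("dict") for i in range(min(max_pages, len(doc)))]
```

A caller would do something like `title_by_font_size(load_page_dicts("pharmacotherapyexposure.pdf"))` and treat `None` as the cue to fall back to another method. The result could still be passed through the title regex to filter out large non-title text. Subtitles in a smaller font are skipped by this sketch, as noted above.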

Fall back to OpenAI title extraction in more cases

This depends on our upload volume and the cost per extraction. File uploads are admin-only, and our current approach is failing a lot, so it's something to consider.
