Description
Problem
I've been manually testing sample PDFs and found some issues with title extraction. Specifically, the fallback that searches for the title in the first matching text block sometimes extracts the wrong text.
Here are some examples:
| # | Filename | Actual Title | Extracted Title |
|---|---|---|---|
| 1 | patientperception.pdf | Patients' perceptions of information received about medication prescribed for bipolar disorder: Implications for informed choice | a Postgraduate Medical School, University of Brighton, United Kingdom b Centre for Behavioural Medicine... |
| 2 | The World Federation of Societies of Biological Psychiatry WFSBP Guidelines for the Biological Treatment of Bipolar Disorders Update 2009 on the Tr.pdf | The World Federation of Societies of Biological Psychiatry (WFSBP) Guidelines for the Biological Treatment of Bipolar Disorders: Update 2009 on the Treatment of Acute Mania | The World Federation of Societies of Biological Psychiatry (WFSBP) Guidelines for the Biological Tre |
| 3 | Pharmacological Treatment of Bipolar Depression_ What are the Current and Emerging Options_ - PMC.pdf | Pharmacological Treatment of Bipolar Depression: What are the Current and Emerging Options? | imply endorsement of, or agreement with, the contents by NLM or the National Institutes of |
| 4 | pharmacotherapyexposure.pdf | Pharmacotherapy exposure as a marker of disease complexity in bipolar disorder: Associations with clinical & genetic risk factors | A R T I C L E I N F O |
Causes
(I added some debug logs to see what was going on during title extraction.)
Error 1:
The first block is wrapped in quotes, so it's not matching the title regex (which has to start with alphanumeric).
Even if the regex accepted the first block, the result would include "Brief report," which is just a title prefix and not part of the actual title. The text block approach we currently use has no way to distinguish the prefix from the actual title, since they're in the same block.
Error 2:
It correctly extracts the title from the file's metadata. The problem is that the metadata is truncated (probably had a limit of 100 characters). Is this the correct flow? Should we stick to the title metadata if we can?
Another issue this file raised: even if the metadata weren't truncated and we got the full title, our title regex would reject it, since it requires that the title contain no year-like numbers. We'd fall back to extracting text blocks, which would also end up rejecting the correct title because of the year. Is this okay/expected?
Error 3:
This file has a short preamble/disclaimer before the title. Since it matches the regex, it is extracted as the title.
Error 4:
It rejected the actual title because of the '&' character.
Proposed Fixes
Use font size to identify titles
Replace the text block scanning entirely. Instead, we can use get_text("dict") on the first few pages, which gives us access to font size metadata. We can find the max font size across the first few pages and collect adjacent spans of that size. If the title is the largest text, this should work to find it.
We could combine this with the regex approach and filter large non-title text using the title regex.
Edge cases to consider
- Subtitles with smaller font size would be skipped
- If all text on a page is the same font size, this won't work. Should fall back to OpenAI in this case.
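
A minimal sketch of the font-size idea, to make the proposal concrete. It assumes the input is a list of dicts shaped like the output of PyMuPDF's `page.get_text("dict")` (blocks, then lines, then spans with `size` and `text`). The function name `title_from_page_dicts` and the `tolerance` parameter are hypothetical, not existing code in this repo, and the sketch does not include the title-regex filter mentioned above.

```python
def title_from_page_dicts(page_dicts, tolerance=0.5):
    """Return the first run of adjacent spans set in the largest font size.

    page_dicts: dicts shaped like PyMuPDF's page.get_text("dict") output.
    tolerance:  hypothetical slack (in points) for treating sizes as equal.
    Returns None when there is no text or all text is one size, so the
    caller can fall back to another strategy.
    """
    spans = []  # (font size, text) in reading order
    for page in page_dicts:
        for block in page.get("blocks", []):
            if block.get("type") != 0:  # type 0 is text; skip image blocks
                continue
            for line in block.get("lines", []):
                for span in line.get("spans", []):
                    text = span.get("text", "").strip()
                    if text:
                        spans.append((span["size"], text))

    if not spans:
        return None

    sizes = [size for size, _ in spans]
    max_size = max(sizes)
    if max_size - min(sizes) <= tolerance:
        # Uniform font size: this heuristic cannot single out a title.
        return None

    parts = []
    for size, text in spans:
        if max_size - size <= tolerance:
            parts.append(text)
        elif parts:
            break  # the run of largest-size spans has ended
    return " ".join(parts)
```

With PyMuPDF, the input could be built as `[doc[i].get_text("dict") for i in range(min(3, len(doc)))]`, where `doc` is an opened document and `3` is an arbitrary choice for "the first few pages". Returning `None` for uniform font sizes leaves room for the OpenAI fallback described below.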
Fall back to OpenAI title extraction in more cases
Whether this is viable depends on our upload volume and cost per extraction. File uploads are admin-only, and our current approach is failing a lot, so it's something to consider.