PDF title extraction in FileUpload picks up partial text or wrong content

## Problem

I've been manually testing sample PDFs and found some issues with title extraction. Specifically, the fallback that takes the first text block matching the title regex sometimes extracts the wrong text.

Here are some examples:
| # | Filename | Actual Title | Extracted Title |
|---|---|---|---|
| 1 | `patientperception.pdf` | Patients' perceptions of information received about medication prescribed for bipolar disorder: Implications for informed choice | a Postgraduate Medical School, University of Brighton, United Kingdom b Centre for Behavioural Medicine... |
| 2 | `The World Federation of Societies of Biological Psychiatry  WFSBP  Guidelines for the Biological Treatment of Bipolar Disorders  Update 2009 on the Tr.pdf` | The World Federation of Societies of Biological Psychiatry (WFSBP) Guidelines for the Biological Treatment of Bipolar Disorders: Update 2009 on the Treatment of Acute Mania | The World Federation of Societies of Biological Psychiatry (WFSBP) Guidelines for the Biological Tre |
| 3 | `Pharmacological Treatment of Bipolar Depression_ What are the Current and Emerging Options_ - PMC.pdf` | Pharmacological Treatment of Bipolar Depression: What are the Current and Emerging Options? | imply endorsement of, or agreement with, the contents by NLM or the National Institutes of |
| 4 | `pharmacotherapyexposure.pdf` | Pharmacotherapy exposure as a marker of disease complexity in bipolar disorder: Associations with clinical & genetic risk factors | A R T I C L E I N F O |


### Causes
(I added some debug logs to see what was going on during title extraction.)

#### Error 1: 

<img width="1512" height="123" alt="Image" src="https://github.com/user-attachments/assets/6fbc440f-2cd1-4dad-8797-ca3a8aea6bcc" />
The first block is wrapped in quotes, so it doesn't match the title regex (which requires the text to start with an alphanumeric character).
Even if the regex accepted the first block, the block includes "Brief report," which is just a title prefix and not part of the actual title. The text block approach we currently use can't distinguish the prefix from the actual title, since both are in the same block.

#### Error 2:
Here the title is correctly extracted from the file's metadata. The problem is that the metadata itself is truncated (probably had a limit of 100 characters). Is this the correct flow? Should we stick to the title metadata if we can?

This file raised another issue: even if the metadata weren't truncated and we got the full title, our title regex would reject it, since it disallows year-like numbers ("2009"). We'd fall back to extracting text blocks, which would also end up rejecting the correct title because of the year. Is this okay/expected?

#### Error 3:
<img width="1192" height="109" alt="Image" src="https://github.com/user-attachments/assets/71eed437-457b-430c-a8c6-226f9ac60b0c" />
This file has a short preamble/disclaimer before the title. Since the disclaimer text matches the regex, it is extracted as the title.

#### Error 4:
<img width="1758" height="294" alt="Image" src="https://github.com/user-attachments/assets/581bb051-66ac-4176-a0be-195b32f3fdbf" />
It rejected the actual title because of the '&' character.
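
To make the four failures concrete, here is a minimal sketch built around a hypothetical stand-in for the title regex. `TITLE_RE`, `YEAR_RE`, and `looks_like_title` are assumptions for illustration, not the project's actual code. They only encode the rules described above: the text must start with an alphanumeric character, must not contain `&`, and must not contain a year-like number.

```python
import re

# Hypothetical stand-in for the project's title regex (assumption, not the real pattern):
# must start with an alphanumeric character and may only contain a small set of punctuation.
TITLE_RE = re.compile(r"[A-Za-z0-9][A-Za-z0-9 ,.:;'()?/-]*")
# Hypothetical year check: any standalone 19xx or 20xx number.
YEAR_RE = re.compile(r"\b(?:19|20)\d{2}\b")


def looks_like_title(text: str) -> bool:
    """Return True if the text passes the (assumed) title rules."""
    text = text.strip()
    return bool(TITLE_RE.fullmatch(text)) and not YEAR_RE.search(text)


# Error 1: block starts with a quote character, so it is rejected.
quoted_block = '"Brief report, Patients\' perceptions of information received about medication"'
# Error 2: the full title contains "2009", so it is rejected.
year_title = (
    "The World Federation of Societies of Biological Psychiatry (WFSBP) Guidelines "
    "for the Biological Treatment of Bipolar Disorders: Update 2009 on the Treatment of Acute Mania"
)
# Error 3: the disclaimer text passes, so it is wrongly accepted as the title.
disclaimer = "imply endorsement of, or agreement with, the contents by NLM or the National Institutes of"
# Error 4: the real title contains "&", so it is rejected.
ampersand_title = (
    "Pharmacotherapy exposure as a marker of disease complexity in bipolar disorder: "
    "Associations with clinical & genetic risk factors"
)
```

Under these assumed rules, only the disclaimer passes `looks_like_title`, which matches the behavior in the table: a character-level regex cannot tell a title from any other clean sentence.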

## Proposed Fixes

### Use font size to identify titles
Replace the text block scanning entirely. Instead, we can use `get_text("dict")` on the first few pages, which [gives us access to font size metadata](https://pymupdf.readthedocs.io/en/latest/app1.html#dict-or-json). We can find the max font size across the first few pages and collect adjacent spans of that size. If the title is the largest text, this should find it.

We could combine this with the regex approach and filter large non-title text using the title regex.
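
A minimal sketch of the font-size idea, assuming the input is a list of dicts shaped like the output of PyMuPDF's `page.get_text("dict")` (blocks, then lines, then spans with `size` and `text`). The function name `title_from_largest_font` is hypothetical, and taking only the first run of adjacent max-size spans is one possible reading of "collect adjacent spans", not a settled design.

```python
from typing import Optional


def title_from_largest_font(page_dicts: list) -> Optional[str]:
    """Join the first run of adjacent spans set in the largest font size.

    Returns None when the pages contain no text spans.
    """
    spans = []  # (font size, text) pairs in reading order
    for page in page_dicts:
        for block in page.get("blocks", []):
            if block.get("type") != 0:  # 0 = text block; image blocks carry no lines
                continue
            for line in block.get("lines", []):
                for span in line.get("spans", []):
                    text = span.get("text", "").strip()
                    if text:
                        # Round so tiny float differences do not split one title
                        spans.append((round(span["size"], 1), text))
    if not spans:
        return None

    max_size = max(size for size, _ in spans)
    run = []
    for size, text in spans:
        if size == max_size:
            run.append(text)
        elif run:
            break  # the first run of max-size spans has ended
    return " ".join(run)
```

With PyMuPDF installed, the input could be built with something like `[doc[i].get_text("dict") for i in range(min(2, len(doc)))]` after `doc = fitz.open(path)`. The result could then be passed through the title regex to filter out large non-title text.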

#### Edge cases to consider
- Subtitles with smaller font size would be skipped
- If all text on a page is the same font size, this won't work. We should fall back to OpenAI in this case.
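
The second edge case could be detected before trusting the font-size result. The sketch below assumes the same `get_text("dict")` page structure. The function name `has_distinct_title_size` and the `min_ratio` threshold of 1.2 are hypothetical and would need tuning against real files.

```python
from collections import Counter


def has_distinct_title_size(page_dict: dict, min_ratio: float = 1.2) -> bool:
    """Return True if the largest font on the page is clearly bigger than the body font.

    The body font is approximated as the size that covers the most characters.
    """
    chars_per_size = Counter()
    for block in page_dict.get("blocks", []):
        for line in block.get("lines", []):
            for span in line.get("spans", []):
                text = span.get("text", "").strip()
                if text:
                    chars_per_size[round(span["size"], 1)] += len(text)
    if not chars_per_size:
        return False  # no text at all, so nothing to distinguish

    body_size = chars_per_size.most_common(1)[0][0]
    largest_size = max(chars_per_size)
    return largest_size >= body_size * min_ratio
```

When this returns `False`, the caller would skip the font-size result and use the fallback instead.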

### Fall back to OpenAI title extraction in more cases
Whether this is viable depends on our upload volume and the cost per extraction. File uploads are admin-only, and our current approach is failing a lot, so it's something to consider.








