Status: Open
Labels: bug (Something isn't working)
Description
The functions ensure_required_columns and normalize_dataframe overlap, and the redundancy between them can be reduced.
normalize_dataframe hard-codes its own list of column names; it should instead use the constant declared at the top of the preprocess module. The two functions could also be combined into one. There is further redundancy inside normalize_dataframe itself: BOARD_NAME, BOARD_ID, and SCHOOL_TYPE are already created by the string_columns loop, so the checks for them at the end of the function can never be true.
I noticed this while debugging an error. The current code is below.
def ensure_required_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Normalize column names and validate that all required columns are present.

    Standardizes column names to uppercase and underscores, then validates that
    the DataFrame contains all required columns for immunization processing.

    Parameters
    ----------
    df : pd.DataFrame
        Input DataFrame with client data (column names may have mixed case/spacing).

    Returns
    -------
    pd.DataFrame
        Copy of input DataFrame with normalized column names.

    Raises
    ------
    ValueError
        If any required columns are missing from the DataFrame.
    """
    df = df.copy()
    df.columns = [col.strip().upper() for col in df.columns]
    missing = [col for col in REQUIRED_COLUMNS if col not in df.columns]
    if missing:
        raise ValueError(
            f"Missing required columns: {missing} \n Found columns: {list(df.columns)} "
        )
    df.rename(columns=lambda x: x.replace(" ", "_"), inplace=True)
    df.rename(columns={"PROVINCE/TERRITORY": "PROVINCE"}, inplace=True)
    return df
def normalize_dataframe(df: pd.DataFrame) -> pd.DataFrame:
    """Standardize data types and fill missing values in the input DataFrame.

    Ensures consistent data types across all columns:
    - String columns are filled with empty strings and trimmed
    - DATE_OF_BIRTH is converted to datetime
    - AGE is converted to numeric (if present)
    - Missing board/school data is initialized with empty dicts

    This normalization is critical for downstream processing as it ensures
    every client record has the expected structure.

    Parameters
    ----------
    df : pd.DataFrame
        Input DataFrame with raw client data.

    Returns
    -------
    pd.DataFrame
        Copy of DataFrame with normalized types and filled values.
    """
    working = df.copy()
    string_columns = [
        "SCHOOL_NAME",
        "FIRST_NAME",
        "LAST_NAME",
        "CITY",
        "PROVINCE",
        "POSTAL_CODE",
        "STREET_ADDRESS_LINE_1",
        "STREET_ADDRESS_LINE_2",
        "SCHOOL_TYPE",
        "BOARD_NAME",
        "BOARD_ID",
        "SCHOOL_ID",
        "UNIQUE_ID",
    ]
    for column in string_columns:
        if column not in working.columns:
            working[column] = ""
        working[column] = working[column].fillna(" ").astype(str).str.strip()
    working["DATE_OF_BIRTH"] = pd.to_datetime(working["DATE_OF_BIRTH"], errors="coerce")
    if "AGE" in working.columns:
        working["AGE"] = pd.to_numeric(working["AGE"], errors="coerce")
    else:
        working["AGE"] = pd.NA
    if "BOARD_NAME" not in working.columns:
        working["BOARD_NAME"] = ""
    if "BOARD_ID" not in working.columns:
        working["BOARD_ID"] = ""
    if "SCHOOL_TYPE" not in working.columns:
        working["SCHOOL_TYPE"] = ""
    return working
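One possible shape for the combined function is sketched below. It is only an illustration, not the project's actual code: the name prepare_dataframe is hypothetical, and REQUIRED_COLUMNS and STRING_COLUMNS are stand-ins for the module-level constants, whose real contents are not shown in this issue. The sketch keeps the original order of operations (validate before replacing spaces with underscores), so it assumes REQUIRED_COLUMNS holds the uppercase, space-separated names. The three trailing "not in working.columns" checks are dropped because the string-column loop already guarantees those columns exist.

```python
import pandas as pd

# Hypothetical stand-ins for the constants at the top of the preprocess module.
# The real lists are not shown in the issue; these values are assumptions.
REQUIRED_COLUMNS = ["FIRST NAME", "LAST NAME", "DATE OF BIRTH"]
STRING_COLUMNS = [
    "SCHOOL_NAME", "FIRST_NAME", "LAST_NAME", "CITY", "PROVINCE",
    "POSTAL_CODE", "STREET_ADDRESS_LINE_1", "STREET_ADDRESS_LINE_2",
    "SCHOOL_TYPE", "BOARD_NAME", "BOARD_ID", "SCHOOL_ID", "UNIQUE_ID",
]


def prepare_dataframe(df: pd.DataFrame) -> pd.DataFrame:
    """Validate required columns, then normalize names and types in one pass."""
    working = df.copy()

    # Same order as ensure_required_columns: trim/uppercase, validate, rename.
    working.columns = [str(col).strip().upper() for col in working.columns]
    missing = [col for col in REQUIRED_COLUMNS if col not in working.columns]
    if missing:
        raise ValueError(
            f"Missing required columns: {missing}\n"
            f"Found columns: {list(working.columns)}"
        )
    working = working.rename(columns=lambda name: name.replace(" ", "_"))
    working = working.rename(columns={"PROVINCE/TERRITORY": "PROVINCE"})

    # Creating absent columns here makes any later existence check unnecessary.
    for column in STRING_COLUMNS:
        if column not in working.columns:
            working[column] = ""
        working[column] = working[column].fillna("").astype(str).str.strip()

    working["DATE_OF_BIRTH"] = pd.to_datetime(
        working["DATE_OF_BIRTH"], errors="coerce"
    )
    if "AGE" in working.columns:
        working["AGE"] = pd.to_numeric(working["AGE"], errors="coerce")
    else:
        working["AGE"] = pd.NA
    return working
```

If callers rely on ensure_required_columns and normalize_dataframe separately, a smaller change would be to keep both functions and only replace the local string_columns list with the module constant and delete the three unreachable checks.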