
Reduce redundancy in preprocess functions #149

@kassyray

Description


There is avoidable redundancy between ensure_required_columns and normalize_dataframe.

normalize_dataframe hard-codes its own column list instead of using the REQUIRED_COLUMNS constant declared at the top of the preprocess module, and the two functions could be combined into a single step. normalize_dataframe also checks some working columns more than once: the string-column loop already creates BOARD_NAME, BOARD_ID, and SCHOOL_TYPE when they are missing, and the same columns are checked again afterwards. A possible refactor is sketched after the current code below.

I noticed this when trying to debug an error.

def ensure_required_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Normalize column names and validate that all required columns are present.

    Standardizes column names to uppercase and underscores, then validates that
    the DataFrame contains all required columns for immunization processing.

    Parameters
    ----------
    df : pd.DataFrame
        Input DataFrame with client data (column names may have mixed case/spacing).

    Returns
    -------
    pd.DataFrame
        Copy of input DataFrame with normalized column names.

    Raises
    ------
    ValueError
        If any required columns are missing from the DataFrame.
    """
    df = df.copy()
    df.columns = [col.strip().upper() for col in df.columns]
    missing = [col for col in REQUIRED_COLUMNS if col not in df.columns]
    if missing:
        raise ValueError(
            f"Missing required columns: {missing}\nFound columns: {list(df.columns)}"
        )

    df.rename(columns=lambda x: x.replace(" ", "_"), inplace=True)
    df.rename(columns={"PROVINCE/TERRITORY": "PROVINCE"}, inplace=True)
    return df


def normalize_dataframe(df: pd.DataFrame) -> pd.DataFrame:
    """Standardize data types and fill missing values in the input DataFrame.

    Ensures consistent data types across all columns:
    - String columns are filled with empty strings and trimmed
    - DATE_OF_BIRTH is converted to datetime
    - AGE is converted to numeric (if present)
    - Missing board/school data is initialized with empty dicts

    This normalization is critical for downstream processing as it ensures
    every client record has the expected structure.

    Parameters
    ----------
    df : pd.DataFrame
        Input DataFrame with raw client data.

    Returns
    -------
    pd.DataFrame
        Copy of DataFrame with normalized types and filled values.
    """
    working = df.copy()
    string_columns = [
        "SCHOOL_NAME",
        "FIRST_NAME",
        "LAST_NAME",
        "CITY",
        "PROVINCE",
        "POSTAL_CODE",
        "STREET_ADDRESS_LINE_1",
        "STREET_ADDRESS_LINE_2",
        "SCHOOL_TYPE",
        "BOARD_NAME",
        "BOARD_ID",
        "SCHOOL_ID",
        "UNIQUE_ID",
    ]

    for column in string_columns:
        if column not in working.columns:
            working[column] = ""
        working[column] = working[column].fillna(" ").astype(str).str.strip()

    working["DATE_OF_BIRTH"] = pd.to_datetime(working["DATE_OF_BIRTH"], errors="coerce")
    if "AGE" in working.columns:
        working["AGE"] = pd.to_numeric(working["AGE"], errors="coerce")
    else:
        working["AGE"] = pd.NA

    if "BOARD_NAME" not in working.columns:
        working["BOARD_NAME"] = ""
    if "BOARD_ID" not in working.columns:
        working["BOARD_ID"] = ""
    if "SCHOOL_TYPE" not in working.columns:
        working["SCHOOL_TYPE"] = ""

    return working
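
One possible shape for the refactor, combining the two steps and touching each column only once. This is just a sketch: prepare_dataframe is a placeholder name, and deriving the string columns from REQUIRED_COLUMNS assumes the constant's entries match the post-rename column names, which would need to be confirmed against the module.

def prepare_dataframe(df: pd.DataFrame) -> pd.DataFrame:
    """Normalize column names, validate required columns, and standardize types."""
    working = df.copy()

    # Column-name normalization and validation (previously ensure_required_columns).
    working.columns = [col.strip().upper() for col in working.columns]
    missing = [col for col in REQUIRED_COLUMNS if col not in working.columns]
    if missing:
        raise ValueError(
            f"Missing required columns: {missing}\nFound columns: {list(working.columns)}"
        )
    working.rename(columns=lambda x: x.replace(" ", "_"), inplace=True)
    working.rename(columns={"PROVINCE/TERRITORY": "PROVINCE"}, inplace=True)

    # Type normalization (previously normalize_dataframe). The string columns are
    # derived from the module constant instead of a second hard-coded list, and each
    # column is created (if absent), filled, and trimmed in a single pass, so the
    # later one-off presence checks are no longer needed.
    non_string = {"DATE_OF_BIRTH", "AGE"}  # assumption: only these two need non-string handling
    string_columns = [col for col in REQUIRED_COLUMNS if col not in non_string]
    for column in string_columns:
        if column not in working.columns:
            working[column] = ""
        working[column] = working[column].fillna("").astype(str).str.strip()

    working["DATE_OF_BIRTH"] = pd.to_datetime(working["DATE_OF_BIRTH"], errors="coerce")
    working["AGE"] = (
        pd.to_numeric(working["AGE"], errors="coerce") if "AGE" in working.columns else pd.NA
    )
    return working

Call sites would then invoke the single function instead of chaining ensure_required_columns and normalize_dataframe.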
