
Reduce redundancy in preprocess functions #149

@kassyray

Description


There is avoidable redundancy between ensure_required_columns and normalize_dataframe.

normalize_dataframe hard-codes its own column list instead of using the REQUIRED_COLUMNS constant declared at the top of the preprocess module, and the two functions could be combined into a single step. normalize_dataframe also checks some working columns more than once: the string-column loop already creates BOARD_NAME, BOARD_ID, and SCHOOL_TYPE when they are missing, and the same columns are checked again afterwards. A possible refactor is sketched after the current code below.

I noticed this when trying to debug an error.

def ensure_required_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Normalize column names and validate that all required columns are present.

    Standardizes column names to uppercase and underscores, then validates that
    the DataFrame contains all required columns for immunization processing.

    Parameters
    ----------
    df : pd.DataFrame
        Input DataFrame with client data (column names may have mixed case/spacing).

    Returns
    -------
    pd.DataFrame
        Copy of input DataFrame with normalized column names.

    Raises
    ------
    ValueError
        If any required columns are missing from the DataFrame.
    """
    df = df.copy()
    df.columns = [col.strip().upper() for col in df.columns]
    missing = [col for col in REQUIRED_COLUMNS if col not in df.columns]
    if missing:
        raise ValueError(
            f"Missing required columns: {missing}\nFound columns: {list(df.columns)}"
        )

    df.rename(columns=lambda x: x.replace(" ", "_"), inplace=True)
    df.rename(columns={"PROVINCE/TERRITORY": "PROVINCE"}, inplace=True)
    return df


def normalize_dataframe(df: pd.DataFrame) -> pd.DataFrame:
    """Standardize data types and fill missing values in the input DataFrame.

    Ensures consistent data types across all columns:
    - String columns are filled with empty strings and trimmed
    - DATE_OF_BIRTH is converted to datetime
    - AGE is converted to numeric (if present)
    - Missing board/school data is initialized with empty dicts

    This normalization is critical for downstream processing as it ensures
    every client record has the expected structure.

    Parameters
    ----------
    df : pd.DataFrame
        Input DataFrame with raw client data.

    Returns
    -------
    pd.DataFrame
        Copy of DataFrame with normalized types and filled values.
    """
    working = df.copy()
    string_columns = [
        "SCHOOL_NAME",
        "FIRST_NAME",
        "LAST_NAME",
        "CITY",
        "PROVINCE",
        "POSTAL_CODE",
        "STREET_ADDRESS_LINE_1",
        "STREET_ADDRESS_LINE_2",
        "SCHOOL_TYPE",
        "BOARD_NAME",
        "BOARD_ID",
        "SCHOOL_ID",
        "UNIQUE_ID",
    ]

    for column in string_columns:
        if column not in working.columns:
            working[column] = ""
        working[column] = working[column].fillna(" ").astype(str).str.strip()

    working["DATE_OF_BIRTH"] = pd.to_datetime(working["DATE_OF_BIRTH"], errors="coerce")
    if "AGE" in working.columns:
        working["AGE"] = pd.to_numeric(working["AGE"], errors="coerce")
    else:
        working["AGE"] = pd.NA

    if "BOARD_NAME" not in working.columns:
        working["BOARD_NAME"] = ""
    if "BOARD_ID" not in working.columns:
        working["BOARD_ID"] = ""
    if "SCHOOL_TYPE" not in working.columns:
        working["SCHOOL_TYPE"] = ""

    return working
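
One possible shape for the refactor, combining the two steps and touching each column only once. This is just a sketch: prepare_dataframe is a placeholder name, and deriving the string columns from REQUIRED_COLUMNS assumes the constant's entries match the post-rename column names, which would need to be confirmed against the module.

def prepare_dataframe(df: pd.DataFrame) -> pd.DataFrame:
    """Normalize column names, validate required columns, and standardize types."""
    working = df.copy()

    # Column-name normalization and validation (previously ensure_required_columns).
    working.columns = [col.strip().upper() for col in working.columns]
    missing = [col for col in REQUIRED_COLUMNS if col not in working.columns]
    if missing:
        raise ValueError(
            f"Missing required columns: {missing}\nFound columns: {list(working.columns)}"
        )
    working.rename(columns=lambda x: x.replace(" ", "_"), inplace=True)
    working.rename(columns={"PROVINCE/TERRITORY": "PROVINCE"}, inplace=True)

    # Type normalization (previously normalize_dataframe). The string columns are
    # derived from the module constant instead of a second hard-coded list, and each
    # column is created (if absent), filled, and trimmed in a single pass, so the
    # later one-off presence checks are no longer needed.
    non_string = {"DATE_OF_BIRTH", "AGE"}  # assumption: only these two need non-string handling
    string_columns = [col for col in REQUIRED_COLUMNS if col not in non_string]
    for column in string_columns:
        if column not in working.columns:
            working[column] = ""
        working[column] = working[column].fillna("").astype(str).str.strip()

    working["DATE_OF_BIRTH"] = pd.to_datetime(working["DATE_OF_BIRTH"], errors="coerce")
    working["AGE"] = (
        pd.to_numeric(working["AGE"], errors="coerce") if "AGE" in working.columns else pd.NA
    )
    return working

Call sites would then invoke the single function instead of chaining ensure_required_columns and normalize_dataframe.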
