Cleaning Functions

The cleaning pipeline processes raw CSV tables extracted from ICAR annual reports into standardized, validated datasets. It performs data quality checks, column standardization, state name validation, and verification of calculated values.

Main Wrapper Functions

The cleaning pipeline is orchestrated through a wrapper function that combines multiple cleaning steps in the correct sequence:

FMDData.all_cleaning_steps Function
julia
all_cleaning_steps(
    input_filename::T1,
    input_dir::T1;
    output_filename::T1 = "clean_$(input_filename)",
    output_dir::T1 = icar_cleaned_dir(),
    load_format = DataFrame,
    skiptotals = false,
    [logging_level_parameters...]
) where {T1 <: AbstractString}

A wrapper function that runs all the cleaning steps for seroprevalence tables. This function handles all data formats including tables with multiple rows per state (e.g., 2019 report tables) through the skiptotals parameter and conditional logic.

Arguments

  • input_filename: Name of the CSV file to process

  • input_dir: Directory containing the input file

  • output_filename: Name for the cleaned output file (defaults to "clean_" followed by the input filename)

  • output_dir: Directory for output (defaults to cleaned data directory)

  • load_format: Format for loading data (defaults to DataFrame)

  • skiptotals: If true, skips totals-related processing for datasets without totals rows

  • *_ll: Logging level parameters for each processing step (:Error, :Warn, or :Info)

Returns

  • Try.Ok(nothing) on success

  • Try.Err(message) if any critical errors occur

Examples

julia
# Standard processing
all_cleaning_steps("2022_NADCP-2.csv", icar_inputs_dir())

# Skip totals for 2019 data
all_cleaning_steps("2019_Bihar.csv", icar_inputs_dir(); skiptotals = true)

State Reference Data

The package validates state names against a comprehensive dictionary of Indian states and union territories:

FMDData.states_dict Constant
julia
states_dict

A dictionary of the states and union territories (UTs) that can appear in the data set. The keys are the names returned by the cleaning steps, and the values are the forms matched against the underlying datasets.

FMDData.state_code_dict Constant
julia
state_code_dict

A Dictionary mapping standardized state/UT names to their official two-letter state codes. The keys correspond to the standardized names from states_dict, and the values are the ISO 3166-2:IN state codes used for administrative purposes.


File Management

Core I/O operations for loading and saving CSV data with proper error handling:

FMDData.load_csv Function
julia
load_csv(
    filename::T1,
    dir::T1,
    output_format = DataFrame
) where {T1 <: AbstractString}

A helper function that checks whether a CSV input file and its directory exist and, if so, loads the file (as a DataFrame by default).

FMDData.write_csv Function
julia
write_csv(
    filename::T1,
    dir::T1,
    data::DataFrame
) where {T1 <: AbstractString}

A helper function to check if the specified name and directory exist and are valid, and if so, write the CSV to disk.


Column Name Processing

Functions for standardizing column names and detecting structural issues:

FMDData.clean_colnames Function
julia
clean_colnames(df::DataFrame, allowed_chars_reg::Regex = r"[^\w]")

Standardize column names by replacing special characters and abbreviations with consistent formats.

Performs the following transformations:

  • Converts to lowercase

  • Replaces "/" and spaces with underscores

  • Replaces "(n)" and "n" with "count"

  • Replaces "(%)" and "%" with "pct"

  • Validates that no disallowed characters remain

Arguments

  • df: DataFrame to clean

  • allowed_chars_reg: Regex pattern for disallowed characters (default matches non-alphanumeric/underscore)

Returns

  • Try.Ok(cleaned_df) with standardized column names

  • Try.Err(message) if disallowed characters remain after cleaning
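The transformations listed above can be approximated on plain strings. The sketch below is not the package's implementation: `sketch_clean_name` is a hypothetical helper, the order of the replacements and the rule for a bare "n" are assumptions, and it throws an `ArgumentError` where `clean_colnames` would return a `Try.Err`.

```julia
# Hypothetical helper approximating the documented transformations for one column name.
function sketch_clean_name(name::AbstractString, disallowed::Regex = r"[^\w]")
    s = lowercase(strip(name))
    # Bracketed markers first, then the bare percent sign.
    s = replace(s, "(n)" => "count")
    s = replace(s, "(%)" => "pct")
    s = replace(s, "%" => "pct")
    # Slashes and runs of whitespace become a single underscore.
    s = replace(s, r"[/\s]+" => "_")
    # Assumption: a bare "n" is only replaced when it stands alone between separators.
    s = replace(s, r"(?<![a-z0-9])n(?![a-z0-9])" => "count")
    # Final validation: nothing matching the disallowed pattern may remain.
    if occursin(disallowed, s)
        throw(ArgumentError("disallowed characters remain in \"$s\""))
    end
    return s
end

cleaned = [sketch_clean_name(n) for n in ["States/UT", "Serotype O (%) Pre", "Post N"]]
# cleaned == ["states_ut", "serotype_o_pct_pre", "post_count"]
```

The default pattern `r"[^\w]"` matches any character that is not a letter, digit, or underscore, so a name such as "a#b" fails the final check.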

FMDData.rename_aggregated_pre_post_counts Function
julia
rename_aggregated_pre_post_counts(
    df::DataFrame,
    original_regex::Regex,
    substitution_string::SubstitutionString
)

Rename the aggregated pre/post counts to use the same format as the serotype-specific values.

FMDData.check_duplicated_column_names Function
julia
check_duplicated_column_names(df::DataFrame)

Wrapper function around the two internal functions _check_identical_column_names() and _check_similar_column_names(). Checks for both identical and very similar column names in the DataFrame.

FMDData.check_duplicated_columns Function
julia
check_duplicated_columns(df::DataFrame)

Check if any columns have identical values.


State Data Processing

Functions for validating and standardizing state names using the reference dictionary:

FMDData.correct_all_state_names Function
julia
correct_all_state_names(
    df::DataFrame,
    column::Symbol = :states_ut,
    states_dict::Dict = FMDData.states_dict
)

Correct all state name values in the data.
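As a rough illustration of dictionary-based name correction, the sketch below maps raw spellings to a standardized key. It assumes each dictionary value is a vector of spellings and that matching ignores case and surrounding whitespace. The real structure of `states_dict` may differ, and `sketch_states` and `sketch_correct_state` are hypothetical names.

```julia
# Hypothetical miniature dictionary: standardized name => spellings seen in raw tables.
sketch_states = Dict(
    "Andhra Pradesh" => ["Andhra Pradesh", "Andhra pradesh"],
    "Bihar" => ["Bihar"],
)

# Return the standardized key whose variants contain `raw`, or `nothing` if unmatched.
function sketch_correct_state(raw::AbstractString, dict::Dict)
    needle = lowercase(strip(raw))
    for (standard, variants) in dict
        if any(v -> lowercase(strip(v)) == needle, variants)
            return standard
        end
    end
    return nothing
end

corrected = [sketch_correct_state(s, sketch_states) for s in ["andhra pradesh", " Bihar", "Atlantis"]]
# The unmatched entry comes back as `nothing`, which a caller could report as an error.
```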

FMDData.check_missing_states Function
julia
check_missing_states(
    df::DataFrame,
    column::Symbol = :states_ut,
)

Check if the states column of the data contains missing values.

FMDData.check_duplicated_states Function
julia
check_duplicated_states(
    df::DataFrame,
    column::Symbol = :states_ut,
)

Check if there are duplicated states in the data.

FMDData.add_state_code! Function
julia
add_state_code!(
    df::DataFrame,
    state_code_dict::Dict = state_code_dict,
    column::Symbol = :states_ut;
    totals_key = "total",
)

Add a region code column to the DataFrame based on state names. Maps state names to their corresponding region codes using the provided dictionary.


Data Calculations

Functions for calculating derived values from raw count data:

Count Calculations

FMDData.calculate_state_counts Function
julia
calculate_state_counts(df::DataFrame, allowed_serotypes = default_allowed_serotypes)

Calculate state-specific serotype counts from seroprevalence percentages and total counts.

This function calculates the number of positive samples for each serotype by multiplying the seroprevalence percentage by the total number of samples tested for that state and vaccination timing (pre/post).

Formula: count = (seroprevalence_pct / 100) * total_count

Arguments

  • df: DataFrame containing seroprevalence data

  • allowed_serotypes: Vector of serotype names to process

Returns

DataFrame with additional calculated count columns suffixed with "_calculated"

Note

The calculated values can be compared with existing count columns using check_calculated_values_match_existing().
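The formula can be checked by hand with a one-line function. This is only a sketch of the arithmetic: `sketch_count` is a hypothetical name, and rounding to the nearest integer is an assumption about how fractional results are handled.

```julia
# Count of positives implied by a percentage and the number of samples tested.
# Rounding to the nearest integer is an assumption of this sketch.
sketch_count(pct::Real, total::Integer) = round(Int, pct / 100 * total)

example_count = sketch_count(42.5, 200)  # 42.5% of 200 samples gives 85
```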


Seroprevalence Calculations

FMDData.calculate_state_seroprevalence Function
julia
calculate_state_seroprevalence(df::DataFrame, allowed_serotypes = default_allowed_serotypes)

Calculate state-specific serotype seroprevalence percentages from count data.

This function calculates the seroprevalence percentage for each serotype by dividing the serotype-specific positive count by the total number of samples tested for that state and vaccination timing (pre/post).

Formula: seroprevalence_pct = (serotype_count / total_count) * 100

Arguments

  • df: DataFrame containing count data

  • allowed_serotypes: Vector of serotype names to process

  • reg: Regex pattern to identify count columns (optional)

  • digits: Number of decimal places for percentages (default: 1)

Returns

DataFrame with additional calculated seroprevalence columns suffixed with "_calculated"

Note

The calculated values can be compared with existing percentage columns using check_calculated_values_match_existing().
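The inverse calculation is equally short. The sketch below applies the formula and rounds to one decimal place, mirroring the documented `digits` default. `sketch_pct` is a hypothetical name, not part of the package.

```julia
# Seroprevalence percentage from a positive count and the number of samples tested.
sketch_pct(count::Integer, total::Integer; digits::Integer = 1) =
    round(count / total * 100; digits = digits)

pct_a = sketch_pct(85, 200)  # 42.5
pct_b = sketch_pct(1, 3)     # 33.3
```

Because percentages are rounded, converting a count to a percentage and back does not always recover the original count exactly, which is one reason to compare provided and calculated values explicitly.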


Data Validation

Comprehensive validation functions to ensure data quality and consistency:

Value Checks

FMDData.check_calculated_values_match_existing Function
julia
check_calculated_values_match_existing(
    df::DataFrame,
    allowed_serotypes::T = default_allowed_serotypes;
    digits = 1
) where {T <: AbstractVector{<:AbstractString}}

Check whether the provided counts and seroprevalence values match the corresponding calculated values.

FMDData.check_seroprevalence_as_pct Function
julia
check_seroprevalence_as_pct(df::DataFrame, reg::Regex)

Check if all seroprevalence columns are reported as a percentage, and not as a proportion.
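One plausible heuristic for this kind of check is sketched below. It is an assumption, not a description of the package's rule: a column is flagged when every non-missing value lies in [0, 1]. `looks_like_proportion` is a hypothetical name.

```julia
# Assumption: a column whose non-missing values all fall in [0, 1]
# was probably entered as proportions rather than percentages.
looks_like_proportion(values) = all(x -> 0 <= x <= 1, skipmissing(values))

flag_a = looks_like_proportion([0.42, 0.87, missing])  # true
flag_b = looks_like_proportion([42.0, 87.5, 100.0])    # false
```

A heuristic like this would also flag a genuine percentage column in which every state reports 1% or less, so a flagged column still needs a manual look.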


Serotype Validation

FMDData.check_allowed_serotypes Function
julia
check_allowed_serotypes(
    df::DataFrame,
    allowed_serotypes::Vector{String} = vcat("all", default_allowed_serotypes),
    reg::Regex = r"serotype_(.*)_(?|count|pct)_(pre|post)"
)

Function to confirm that all required and no disallowed serotypes are provided in the data.
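The idea can be illustrated on a plain vector of column names. The sketch below uses a simplified variant of the default pattern with non-capturing groups. The column names and the lowercase serotype list are hypothetical.

```julia
# Hypothetical column names and allowed serotypes (lowercase is an assumption).
colnames = ["states_ut", "serotype_all_count_pre", "serotype_o_pct_pre", "serotype_sat2_pct_post"]
allowed = ["all", "o", "a", "asia1"]
reg = r"serotype_(.*)_(?:count|pct)_(?:pre|post)"

# Collect the serotype captured from every column name that matches the pattern.
matches = [match(reg, c) for c in colnames]
found = unique(String(m.captures[1]) for m in matches if m !== nothing)

missing_serotypes = setdiff(allowed, found)     # required but absent: ["a", "asia1"]
disallowed_serotypes = setdiff(found, allowed)  # present but not allowed: ["sat2"]
```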


Pre/Post Vaccination Checks

FMDData.check_pre_post_exists Function
julia
check_pre_post_exists(df::DataFrame, reg::Regex)

Confirms each combination of serotype and result type (N/%) has both a pre- and post-vaccination results column, but nothing else.
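A minimal sketch of this pairing check groups column names by serotype and result type, then compares the timings seen against the set {pre, post}. The column names and the regex are hypothetical and stand in for whatever `reg` the caller passes.

```julia
# Hypothetical column names: serotype "a" has a pre-vaccination pct column but no post.
colnames = ["serotype_o_count_pre", "serotype_o_count_post", "serotype_a_pct_pre"]
reg = r"serotype_(.*)_(count|pct)_(pre|post)"

# Group the timing suffixes by (serotype, result type).
timings = Dict{Tuple{String, String}, Set{String}}()
for c in colnames
    m = match(reg, c)
    m === nothing && continue
    serotype, kind, timing = String.(m.captures)
    push!(get!(timings, (serotype, kind), Set{String}()), timing)
end

# Any group whose timings are not exactly {pre, post} is incomplete.
incomplete = [k for (k, v) in timings if v != Set(["pre", "post"])]
```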

FMDData.check_aggregated_pre_post_counts_exist Function
julia
check_aggregated_pre_post_counts_exist(
    df::DataFrame,
    columns = ["serotype_all_count_pre", "serotype_all_count_post"]
)

Check if the data contains aggregated counts of pre- and post-vaccination individuals. Should only be used on DataFrames whose columns have already been renamed to meet the standard pattern, e.g. "serotype_all_count_pre".


Totals Processing

Functions for handling and validating totals rows in the datasets:

FMDData.has_totals_row Function
julia
has_totals_row(
    df::DataFrame,
    column::Symbol = :states_ut,
    possible_keys = ["total", "totals"]
)

Check if the table has a totals row.

df should have, at the very least, cleaned column names using the clean_colnames() function.

FMDData.all_totals_check Function
julia
all_totals_check(
    df::DataFrame;
    column::Symbol = :states_ut,
    totals_key = "total",
    allowed_serotypes = vcat("all", default_allowed_serotypes),
    reg::Regex,
    atol = 0.0,
    digits = 1
)

Verify that the totals row in a DataFrame accurately reflects the sum of state-level data.

This function has two main methods:

  1. all_totals_check(df::DataFrame; ...) - Calculates totals and compares with existing totals row

  2. all_totals_check(totals_dict::OrderedDict, df::DataFrame; ...) - Uses pre-calculated totals

Calculation Methods

  • Count columns: Simple sum across all states

  • Percentage columns: Weighted average based on corresponding count columns

Arguments

  • df: DataFrame to validate

  • column: Column containing state names

  • totals_key: String identifying the totals row (case-insensitive)

  • allowed_serotypes: Serotypes to include in validation

  • reg: Regex pattern to identify columns for validation

  • atol: Absolute tolerance for floating-point comparisons

  • digits: Decimal places for percentage calculations

Returns

  • Try.Ok(nothing) if totals are accurate within tolerance

  • Try.Err(message) describing discrepancies found

Note

This function is critical for data quality assurance, ensuring that reported totals match the sum of individual state contributions.
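The two calculation methods can be worked through on a pair of hypothetical states. The sketch below covers only the arithmetic and the tolerance comparison, not the package's internals, and every value in it is made up for illustration.

```julia
# Hypothetical state-level values for one serotype and vaccination timing.
counts = [100, 300]   # samples tested per state
pcts = [50.0, 70.0]   # seroprevalence (%) per state

# Count columns: simple sum across states.
total_count = sum(counts)

# Percentage columns: average weighted by the corresponding counts.
total_pct = round(sum(pcts .* counts) / total_count; digits = 1)

# Compare with a reported totals row using an absolute tolerance.
reported_count, reported_pct = 400, 65.0
atol = 0.0
totals_match = total_count == reported_count && abs(total_pct - reported_pct) <= atol
```

The weighted average here is 65.0, whereas an unweighted mean of the two percentages would give 60.0, so a totals row computed without weights would fail the check.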

FMDData.calculate_all_totals Function
julia
calculate_all_totals(
    df::DataFrame;
    column::Symbol = :states_ut,
    totals_key = "total",
    allowed_serotypes = vcat("all", default_allowed_serotypes),
    reg::Regex, # Note: This is a positional argument in the function signature.
    digits = 1
)

Calculate all totals using the appropriate method of the internal function _calculate_totals!(), chosen according to whether the column is a Float (seroprevalence) or an Integer (count). Uses the internal function _collect_totals_check_args() to identify which arguments need to be passed to _calculate_totals!(), and the internal function _totals_row_selector() to extract the totals row from the DataFrame for use when calculating the weighted total seroprevalence for each serotype.

Arguments

  • df::DataFrame: The input DataFrame.

  • column::Symbol: Symbol of the column containing state/UT names (default: :states_ut).

  • totals_key::String: String key used to identify the totals row (default: "total").

  • allowed_serotypes::Vector{String}: Vector of allowed serotype strings.

  • reg::Regex: A positional Regex argument to select columns for totals calculation.

  • digits::Int: Number of digits for rounding (default: 1).

FMDData.totals_check Function
julia
totals_check(
    totals::DataFrameRow,
    calculated_totals::OrderedDict,
    column::Symbol = :states_ut;
    atol = 0.0
)

Check if the totals provided in a DataFrameRow match the calculated totals.

Arguments

  • totals::DataFrameRow: A row from a DataFrame, typically the 'total' row.

  • calculated_totals::OrderedDict: An OrderedDict where keys are column names and values are the calculated totals for these columns.

  • column::Symbol: The symbol for the column containing state/UT names, used for error messaging if totals don't match (default: :states_ut).

  • atol::Float64: Absolute tolerance used for comparing floating-point numbers (default: 0.0).

Returns Try.Ok(nothing) if all totals match, or Try.Err with a descriptive message if discrepancies are found.

FMDData.select_calculated_totals! Function
julia
select_calculated_totals!(
    df::DataFrame,
    column::Symbol = :states_ut,
    totals_key = "total",
    calculated_totals_key = "total calculated"
)

If the cleaned data contains both a provided and a calculated totals row, strip the provided one and rename the calculated one.


Column Selection and Sorting

Final processing steps to organize and standardize the cleaned data structure:

FMDData.select_calculated_cols! Function
julia
select_calculated_cols!(
    df::DataFrame,
    allowed_serotypes = vcat("all", default_allowed_serotypes),
    reg::Regex
)

Checks if the data contains both provided and calculated columns that refer to the same variables. If the calculated column is a % seroprevalence, keep the calculated values. If the calculated column is a column of counts, keep the provided as they are deemed to be more accurate (counts require no calculation and should be a direct recording/reporting of the underlying data). The cleaning function check_calculated_values_match_existing() should have been run before to ensure there are no surprises during this processing step i.e., accidentally deleting columns that should be retained.
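The selection rule can be expressed on column names alone. In the sketch below, the "_calculated" suffix and the "_pct_" marker used to recognize percentage columns are assumptions based on the naming shown elsewhere on this page, and `sketch_keep` is a hypothetical helper.

```julia
# Hypothetical column names; "_calculated" marks values derived during cleaning.
colnames = [
    "serotype_o_count_pre", "serotype_o_count_pre_calculated",
    "serotype_o_pct_pre", "serotype_o_pct_pre_calculated",
]

# Decide whether a column survives: with a provided/calculated pair,
# keep the calculated percentage but the provided count.
function sketch_keep(name::AbstractString, all_names)
    is_calc = endswith(name, "_calculated")
    base = replace(name, r"_calculated$" => "")
    has_pair = base in all_names && (base * "_calculated") in all_names
    has_pair || return true  # nothing to choose between
    is_pct = occursin("_pct_", base)
    return is_pct ? is_calc : !is_calc
end

kept = filter(n -> sketch_keep(n, colnames), colnames)
# kept == ["serotype_o_count_pre", "serotype_o_pct_pre_calculated"]
```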

FMDData.sort_columns! Function
julia
sort_columns!(
    df::DataFrame;
    statename_column = :states_ut,
    allowed_serotypes = vcat("all", default_allowed_serotypes),
    prefix = "serotype_",
    suffix_order = [
        "_count_pre",
        "_pct_pre",
        "_count_post",
        "_pct_post",
    ]
)

Sort the columns of the cleaned dataframe to have a consistent order. Follows the pattern:

  • state column name

  • serotype all counts (pre then post)

  • serotype specific columns in the order "O", "A", "Asia1"

The serotype specific columns have their data presented in the following order.

  • serotype X pre count

  • serotype X pre pct

  • serotype X post count

  • serotype X post pct
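The target order described above can be generated from the prefix, the serotype list, and the suffix order. The sketch below builds that list of names and then keeps only those present in a hypothetical table. The lowercase serotype names are an assumption.

```julia
statename_column = "states_ut"
prefix = "serotype_"
serotypes = ["o", "a", "asia1"]  # assumption: lowercase after column-name cleaning
suffix_order = ["_count_pre", "_pct_pre", "_count_post", "_pct_post"]

# State column, then aggregated counts (pre, post), then each serotype in suffix order.
ordered = vcat(
    statename_column,
    [prefix * "all" * s for s in ("_count_pre", "_count_post")],
    [prefix * st * s for st in serotypes for s in suffix_order],
)

# Reorder the columns of a hypothetical table, ignoring names it does not contain.
existing = ["serotype_o_pct_pre", "states_ut", "serotype_all_count_post", "serotype_all_count_pre"]
reordered = [c for c in ordered if c in existing]
```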

FMDData.sort_states! Function
julia
sort_states!(
    df::DataFrame;
    statename_column = :states_ut,
    roundname_column = :round_name,
    totals_key = "total"
)

Sort the dataframe by alphabetical order of the states and list the totals row at the bottom. Preserves the original order of rows if there are duplicates.
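A stable sort with a two-part key reproduces this behaviour on plain data. The sketch below uses hypothetical (state, round) tuples in place of DataFrame rows and does not describe how `sort_states!` is implemented.

```julia
# Hypothetical rows as (state, round) tuples; "Bihar" appears twice.
rows = [("Kerala", 1), ("Total", 1), ("Bihar", 2), ("Bihar", 1), ("Assam", 1)]
totals_key = "total"

# Sort key: (is this the totals row?, state name). `false` sorts before `true`,
# so the totals row lands last. MergeSort is stable, so duplicated states
# keep their original relative order.
sorted_rows = sort(rows; by = r -> (lowercase(r[1]) == totals_key, r[1]), alg = MergeSort)
# [("Assam", 1), ("Bihar", 2), ("Bihar", 1), ("Kerala", 1), ("Total", 1)]
```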
