Cleaning Functions
The cleaning pipeline processes raw CSV tables extracted from ICAR annual reports into standardized, validated datasets. The pipeline performs data quality checks, column standardization, state name validation, and verification of calculated values.
Main Wrapper Functions
The cleaning pipeline is orchestrated through a wrapper function that combines multiple cleaning steps in the correct sequence:
FMDData.all_cleaning_steps Function
all_cleaning_steps(
input_filename::T1,
input_dir::T1;
output_filename::T1 = "clean_$(input_filename)",
output_dir::T1 = icar_cleaned_dir(),
load_format = DataFrame,
skiptotals = false,
[logging_level_parameters...]
) where {T1 <: AbstractString}
A wrapper function that runs all the cleaning steps for seroprevalence tables. It handles all data formats, including tables with multiple rows per state (e.g., 2019 report tables), through the skiptotals parameter and conditional logic.
Arguments
input_filename: Name of the CSV file to process
input_dir: Directory containing the input file
output_filename: Name for the cleaned output file (defaults to "clean_" + input filename)
output_dir: Directory for output (defaults to cleaned data directory)
load_format: Format for loading data (defaults to DataFrame)
skiptotals: If true, skips totals-related processing for datasets without totals rows
*_ll: Logging level parameters for each processing step (:Error, :Warn, or :Info)
Returns
Try.Ok(nothing) on success
Try.Err(message) if any critical errors occur
Examples
# Standard processing
all_cleaning_steps("2022_NADCP-2.csv", icar_inputs_dir())
# Skip totals for 2019 data
all_cleaning_steps("2019_Bihar.csv", icar_inputs_dir(); skiptotals = true)
State Reference Data
The package validates state names against a comprehensive dictionary of Indian states and union territories:
FMDData.states_dict Constant
states_dict
A Dictionary of States/UTs that can appear in the data set. The keys will be returned in the cleaning steps, and the values can be matched in the underlying datasets.
FMDData.state_code_dict Constant
state_code_dict
A Dictionary mapping standardized state/UT names to their official two-letter state codes. The keys correspond to the standardized names from states_dict, and the values are the ISO 3166-2:IN state codes used for administrative purposes.
File Management
Core I/O operations for loading and saving CSV data with proper error handling:
FMDData.load_csv Function
load_csv(
filename::T1,
dir::T1,
output_format = DataFrame
) where {T1 <: AbstractString}
A helper function that checks whether a CSV input file and its directory exist and, if so, loads the file (as a DataFrame by default).
FMDData.write_csv Function
write_csv(
filename::T1,
dir::T1,
data::DataFrame
) where {T1 <: AbstractString}
A helper function that checks whether the specified name and directory exist and are valid and, if so, writes the CSV to disk.
Column Name Processing
Functions for standardizing column names and detecting structural issues:
FMDData.clean_colnames Function
clean_colnames(df::DataFrame, allowed_chars_reg::Regex = r"[^\w]")
Standardize column names by replacing special characters and abbreviations with consistent formats.
Performs the following transformations:
Converts to lowercase
Replaces "/" and spaces with underscores
Replaces "(n)" and "n" with "count"
Replaces "(%)" and "%" with "pct"
Validates that no disallowed characters remain
Arguments
df: DataFrame to clean
allowed_chars_reg: Regex pattern for disallowed characters (default matches non-alphanumeric/underscore)
Returns
Try.Ok(cleaned_df) with standardized column names
Try.Err(message) if disallowed characters remain after cleaning
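The transformations listed above can be sketched on a single column name using only Base Julia. This is a minimal illustration, not the package's implementation: the helper name sketch_clean_colname, the order of the steps, and returning nothing (instead of a Try.Err) for names that still contain disallowed characters are all assumptions.

```julia
# Hypothetical helper illustrating the documented transformations on one name.
function sketch_clean_colname(name::AbstractString; disallowed::Regex = r"[^\w]")
    cleaned = lowercase(strip(name))
    # Bracketed abbreviations first, then the bare percent sign.
    cleaned = replace(cleaned, "(n)" => "count")
    cleaned = replace(cleaned, "(%)" => "pct")
    cleaned = replace(cleaned, "%" => "pct")
    # Slashes and runs of whitespace become a single underscore.
    cleaned = replace(cleaned, r"[/\s]+" => "_")
    # A standalone "n" token (not part of a longer word) becomes "count".
    cleaned = replace(cleaned, r"(?<![a-z0-9])n(?![a-z0-9])" => "count")
    # Assumption: signal leftover disallowed characters with `nothing`.
    return occursin(disallowed, cleaned) ? nothing : cleaned
end

sketch_clean_colname("States/UT")            # "states_ut"
sketch_clean_colname("Serotype O (%) Pre")   # "serotype_o_pct_pre"
```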
FMDData.rename_aggregated_pre_post_counts Function
rename_aggregated_pre_post_counts(
df::DataFrame,
original_regex::Regex,
substitution_string::SubstitutionString
)
Rename the aggregated pre/post counts to use the same format as the serotype-specific values.
FMDData.check_duplicated_column_names Function
check_duplicated_column_names(df::DataFrame)
Wrapper function around the two internal functions _check_identical_column_names() and _check_similar_column_names(). Checks for both identical and very similar column names in the DataFrame.
FMDData.check_duplicated_columns Function
check_duplicated_columns(df::DataFrame)
Check if any columns have identical values.
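As an illustration of what checking for identical values can look like, the sketch below compares every pair of columns held in a plain dictionary. The helper name sketch_identical_columns and the dictionary-of-vectors input are assumptions made to keep the example free of dependencies; the package function operates on a DataFrame.

```julia
# Hypothetical helper: return the pairs of column names whose values are identical.
function sketch_identical_columns(columns::AbstractDict{String, <:AbstractVector})
    colnames = sort(collect(keys(columns)))
    return [
        (a, b)
        for (i, a) in enumerate(colnames)
        for b in colnames[(i + 1):end]
        if isequal(columns[a], columns[b])   # isequal also treats missing == missing
    ]
end

table = Dict(
    "serotype_o_count_pre" => [10, 20],
    "serotype_a_count_pre" => [10, 20],   # accidental copy of the column above
    "serotype_o_count_post" => [30, 40],
)
sketch_identical_columns(table)
```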
State Data Processing
Functions for validating and standardizing state names using the reference dictionary:
FMDData.correct_all_state_names Function
correct_all_state_names(
df::DataFrame,
column::Symbol = :states_ut,
states_dict::Dict = FMDData.states_dict
)
Correct all state name values in the data.
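A minimal sketch of dictionary-based name correction follows. It assumes the reference dictionary maps each standardized name (key) to a list of spellings that may appear in the reports (values); the real structure of states_dict may differ, and both sketch_states and sketch_correct_state are hypothetical names with illustrative entries only.

```julia
# Hypothetical reference dictionary: standardized name => spellings seen in the data.
sketch_states = Dict(
    "Andhra Pradesh" => ["Andhra Pradesh", "Andhra pradesh"],
    "Odisha" => ["Odisha", "Orissa"],
)

# Return the standardized name for a raw value, or `nothing` if no spelling matches.
function sketch_correct_state(name::AbstractString, dict::AbstractDict)
    target = lowercase(strip(name))
    for (standard, variants) in dict
        if any(v -> lowercase(v) == target, variants)
            return standard
        end
    end
    return nothing
end

sketch_correct_state(" Orissa ", sketch_states)   # "Odisha"
```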
FMDData.check_missing_states Function
check_missing_states(
df::DataFrame,
column::Symbol = :states_ut,
)
Check if the states column of the data contains missing values.
FMDData.check_duplicated_states Function
check_duplicated_states(
df::DataFrame,
column::Symbol = :states_ut,
)
Check if there are duplicated states in the data.
FMDData.add_state_code! Function
add_state_code!(
df::DataFrame,
state_code_dict::Dict = state_code_dict,
column::Symbol = :states_ut;
totals_key = "total",
)
Add a region code column to the DataFrame based on state names. Maps state names to their corresponding region codes using the provided dictionary.
Data Calculations
Functions for calculating derived values from raw count data:
Count Calculations
FMDData.calculate_state_counts Function
calculate_state_counts(df::DataFrame, allowed_serotypes = default_allowed_serotypes)
Calculate state-specific serotype counts from seroprevalence percentages and total counts.
This function calculates the number of positive samples for each serotype by multiplying the seroprevalence percentage by the total number of samples tested for that state and vaccination timing (pre/post).
Formula: count = (seroprevalence_pct / 100) * total_count
Arguments
df: DataFrame containing seroprevalence data
allowed_serotypes: Vector of serotype names to process
Returns
DataFrame with additional calculated count columns suffixed with "_calculated"
Note
The calculated values can be compared with existing count columns using check_calculated_values_match_existing().
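The count formula can be checked by hand with a few lines of Base Julia. The helper name sketch_count is hypothetical, and rounding the result to the nearest integer is an assumption; the documentation above does not say how the package rounds.

```julia
# Hypothetical helper: count = (seroprevalence_pct / 100) * total_count,
# rounded to a whole number of samples (rounding is an assumption).
sketch_count(pct::Real, total::Integer) = round(Int, pct / 100 * total)

# 45.0% seropositive among 200 pre-vaccination samples.
sketch_count(45.0, 200)   # 90
```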
Seroprevalence Calculations
FMDData.calculate_state_seroprevalence Function
calculate_state_seroprevalence(df::DataFrame, allowed_serotypes = default_allowed_serotypes)
Calculate state-specific serotype seroprevalence percentages from count data.
This function calculates the seroprevalence percentage for each serotype by dividing the serotype-specific positive count by the total number of samples tested for that state and vaccination timing (pre/post).
Formula: seroprevalence_pct = (serotype_count / total_count) * 100
Arguments
df: DataFrame containing count data
allowed_serotypes: Vector of serotype names to process
reg: Regex pattern to identify count columns (optional)
digits: Number of decimal places for percentages (default: 1)
Returns
DataFrame with additional calculated seroprevalence columns suffixed with "_calculated"
Note
The calculated values can be compared with existing percentage columns using check_calculated_values_match_existing().
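The percentage formula, together with the digits rounding described in the arguments, can be sketched as follows. The helper name sketch_seroprevalence is hypothetical, and the sketch does not handle a total of zero.

```julia
# Hypothetical helper: seroprevalence_pct = (serotype_count / total_count) * 100,
# rounded to `digits` decimal places.
function sketch_seroprevalence(count::Integer, total::Integer; digits::Integer = 1)
    return round(count / total * 100; digits = digits)
end

# 37 positive samples out of 120 tested.
sketch_seroprevalence(37, 120)               # 30.8
sketch_seroprevalence(37, 120; digits = 2)   # 30.83
```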
Data Validation
Comprehensive validation functions to ensure data quality and consistency:
Value Checks
FMDData.check_calculated_values_match_existing Function
check_calculated_values_match_existing(
df::DataFrame,
allowed_serotypes::T = default_allowed_serotypes;
digits = 1
) where {T <: AbstractVector{<:AbstractString}}
Check whether the provided counts and seroprevalence values match the corresponding calculated values.
FMDData.check_seroprevalence_as_pct Function
check_seroprevalence_as_pct(df::DataFrame, reg::Regex)
Check if all seroprevalence columns are reported as a percentage, and not as a proportion.
Serotype Validation
FMDData.check_allowed_serotypes Function
check_allowed_serotypes(
df::DataFrame,
allowed_serotypes::Vector{String} = vcat("all", default_allowed_serotypes),
reg::Regex = r"serotype_(.*)_(?|count|pct)_(pre|post)"
)
Function to confirm that all required and no disallowed serotypes are provided in the data.
Pre/Post Vaccination Checks
FMDData.check_pre_post_exists Function
check_pre_post_exists(df::DataFrame, reg::Regex)
Confirms that each combination of serotype and result type (N/%) has both a pre- and a post-vaccination results column, and nothing else.
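One way to express this check is to group column names by serotype and result type and confirm that each group contains exactly the pre and post timings. The sketch below works on a vector of column names; the helper name sketch_pre_post_problems and its default regex (a simplified, anchored pattern in the style of the one documented for check_allowed_serotypes) are assumptions.

```julia
# Hypothetical helper: return the (serotype, result type) pairs that do not have
# exactly one "pre" and one "post" column. Names not matching `reg` are ignored.
function sketch_pre_post_problems(
        colnames,
        reg::Regex = r"^serotype_(.*)_(count|pct)_(pre|post)$"
    )
    timings = Dict{Tuple{String, String}, Set{String}}()
    for name in colnames
        m = match(reg, name)
        m === nothing && continue
        key = (String(m.captures[1]), String(m.captures[2]))
        push!(get!(timings, key, Set{String}()), String(m.captures[3]))
    end
    return sort([key for (key, seen) in timings if seen != Set(["pre", "post"])])
end

cols = [
    "states_ut",
    "serotype_o_count_pre", "serotype_o_count_post",
    "serotype_a_pct_pre",   # no matching post column
]
sketch_pre_post_problems(cols)
```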
FMDData.check_aggregated_pre_post_counts_exist Function
check_aggregated_pre_post_counts_exist(
df::DataFrame,
columns = ["serotype_all_count_pre", "serotype_all_count_post"]
)
Check if the data contains aggregated counts of pre- and post-vaccinated individuals. Should only be used on DataFrames in which these columns have already been renamed to the standard pattern "serotype_all_count_pre" and "serotype_all_count_post".
Totals Processing
Functions for handling and validating totals rows in the datasets:
FMDData.has_totals_row Function
has_totals_row(
df::DataFrame,
column::Symbol = :states_ut,
possible_keys = ["total", "totals"]
)
Check if the table has a totals row.
df should have, at the very least, cleaned column names using the clean_colnames() function.
FMDData.all_totals_check Function
all_totals_check(
df::DataFrame;
column::Symbol = :states_ut,
totals_key = "total",
allowed_serotypes = vcat("all", default_allowed_serotypes),
reg::Regex,
atol = 0.0,
digits = 1
)
Verify that the totals row in a DataFrame accurately reflects the sum of state-level data.
This function has two main methods:
all_totals_check(df::DataFrame; ...) - Calculates totals and compares with existing totals row
all_totals_check(totals_dict::OrderedDict, df::DataFrame; ...) - Uses pre-calculated totals
Calculation Methods
Count columns: Simple sum across all states
Percentage columns: Weighted average based on corresponding count columns
Arguments
df: DataFrame to validate
column: Column containing state names
totals_key: String identifying the totals row (case-insensitive)
allowed_serotypes: Serotypes to include in validation
reg: Regex pattern to identify columns for validation
atol: Absolute tolerance for floating-point comparisons
digits: Decimal places for percentage calculations
Returns
Try.Ok(nothing) if totals are accurate within tolerance
Try.Err(message) describing discrepancies found
Note
This function is critical for data quality assurance, ensuring that reported totals match the sum of individual state contributions.
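The two calculation methods described above (a simple sum for counts, a count-weighted average for percentages) can be sketched with plain vectors. The helper names are hypothetical, and treating the corresponding count column as the weights is an interpretation of the description, not a statement about the internal implementation.

```julia
# Hypothetical helpers for the two documented calculation methods.
sketch_total_count(counts) = sum(counts)

function sketch_total_pct(pcts, counts; digits::Integer = 1)
    # Weighted average: each state's percentage is weighted by its sample count.
    return round(sum(pcts .* counts) / sum(counts); digits = digits)
end

counts = [100, 300]      # samples tested in two states
pcts = [50.0, 70.0]      # seroprevalence (%) in the same states

sketch_total_count(counts)       # 400
sketch_total_pct(pcts, counts)   # 65.0, not the unweighted mean of 60.0
```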
FMDData.calculate_all_totals Function
calculate_all_totals(
df::DataFrame;
column::Symbol = :states_ut,
totals_key = "total",
allowed_serotypes = vcat("all", default_allowed_serotypes),
reg::Regex, # Note: This is a positional argument in the function signature.
digits = 1
)
Calculate all totals using the appropriate method of the internal function _calculate_totals!(), depending on whether the column is a Float (seroprevalence) or an Integer (count). Uses the internal function _collect_totals_check_args() to identify which arguments need to be passed to _calculate_totals!(). Uses the internal function _totals_row_selector() to extract the totals row from the DataFrame, for use when calculating the serotype-weighted total seroprevalence.
Arguments
df::DataFrame: The input DataFrame.
column::Symbol: Symbol of the column containing state/UT names (default: :states_ut).
totals_key::String: String key used to identify the totals row (default: "total").
allowed_serotypes::Vector{String}: Vector of allowed serotype strings.
reg::Regex: A positional Regex argument to select columns for totals calculation.
digits::Int: Number of digits for rounding (default: 1).
FMDData.totals_check Function
totals_check(
totals::DataFrameRow,
calculated_totals::OrderedDict,
column::Symbol = :states_ut;
atol = 0.0
)
Check if the totals provided in a DataFrameRow match the calculated totals.
Arguments
totals::DataFrameRow: A row from a DataFrame, typically the 'total' row.
calculated_totals::OrderedDict: An OrderedDict where keys are column names and values are the calculated totals for these columns.
column::Symbol: The symbol for the column containing state/UT names, used for error messaging if totals don't match (default: :states_ut).
atol::Float64: Absolute tolerance used for comparing floating-point numbers (default: 0.0).
Returns Try.Ok(nothing) if all totals match, or Try.Err with a descriptive message if discrepancies are found.
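The role of atol can be illustrated with an explicit absolute-difference comparison between provided and calculated totals. This is a sketch under assumptions: the helper name sketch_totals_mismatches is hypothetical, plain Dicts stand in for the DataFrameRow and OrderedDict, and the package may compare values differently (for example with isapprox).

```julia
# Hypothetical helper: list the columns whose provided total differs from the
# calculated total by more than the absolute tolerance `atol`.
function sketch_totals_mismatches(provided::AbstractDict, calculated::AbstractDict; atol = 0.0)
    return sort([col for (col, value) in calculated if abs(provided[col] - value) > atol])
end

provided = Dict("serotype_o_count_pre" => 400, "serotype_o_pct_pre" => 65.1)
calculated = Dict("serotype_o_count_pre" => 400, "serotype_o_pct_pre" => 65.0)

sketch_totals_mismatches(provided, calculated)               # flags the pct column
sketch_totals_mismatches(provided, calculated; atol = 0.2)   # nothing flagged
```

With the default atol of 0.0, any rounding difference in a reported percentage is flagged, so a small positive tolerance may be appropriate for percentage columns.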
FMDData.select_calculated_totals! Function
select_calculated_totals!(
df::DataFrame,
column::Symbol = :states_ut,
totals_key = "total",
calculated_totals_key = "total calculated"
)
If the cleaned data contains both a provided and a calculated totals row, strip the provided one and rename the calculated one.
Column Selection and Sorting
Final processing steps to organize and standardize the cleaned data structure:
FMDData.select_calculated_cols! Function
select_calculated_cols!(
df::DataFrame,
allowed_serotypes = vcat("all", default_allowed_serotypes),
reg::Regex
)
Checks if the data contains both provided and calculated columns that refer to the same variables. If the calculated column is a % seroprevalence, keep the calculated values. If the calculated column is a column of counts, keep the provided values, as they are deemed to be more accurate (counts require no calculation and should be a direct recording/reporting of the underlying data). The cleaning function check_calculated_values_match_existing() should be run beforehand to ensure there are no surprises during this processing step, i.e., accidentally deleting columns that should be retained.
FMDData.sort_columns! Function
sort_columns!(
df::DataFrame;
statename_column = :states_ut,
allowed_serotypes = vcat("all", default_allowed_serotypes),
prefix = "serotype_",
suffix_order = [
"_count_pre",
"_pct_pre",
"_count_post",
"_pct_post",
]
)
Sort the columns of the cleaned dataframe to have a consistent order. Follows the pattern:
state column name
serotype all counts (pre then post)
serotype specific columns in the order "O", "A", "Asia1"
The serotype specific columns have their data presented in the following order.
serotype X pre count
serotype X pre pct
serotype X post count
serotype X post pct
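The ordering described above can be generated from the prefix, the serotype list, and the suffix order. The sketch below only builds the expected list of column names; the helper name sketch_column_order is hypothetical, and the serotype spellings follow the description above rather than the actual value of default_allowed_serotypes.

```julia
# Hypothetical helper: build the documented column order as a vector of names.
function sketch_column_order(;
        statename_column = "states_ut",
        serotypes = ["O", "A", "Asia1"],   # assumption: mirrors the order described above
        prefix = "serotype_",
        suffix_order = ["_count_pre", "_pct_pre", "_count_post", "_pct_post"]
    )
    # State column first, then the aggregated counts (pre, then post).
    order = [statename_column, prefix * "all_count_pre", prefix * "all_count_post"]
    # Then every serotype in turn, each with its four columns in suffix order.
    for serotype in serotypes, suffix in suffix_order
        push!(order, prefix * serotype * suffix)
    end
    return order
end

sketch_column_order()[1:5]
```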
FMDData.sort_states! Function
sort_states!(
df::DataFrame;
statename_column = :states_ut,
roundname_column = :round_name,
totals_key = "total"
)
Sort the dataframe by alphabetical order of the states and list the totals row at the bottom. Preserves the original order of rows if there are duplicates.
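The sorting rule (alphabetical, totals last, ties left in their original order) can be sketched on a plain vector of state names with a stable sort and a two-part sort key. The helper name sketch_sort_states is hypothetical, and matching the totals key case-insensitively is an assumption.

```julia
# Hypothetical helper: sort state names alphabetically with the totals row last.
function sketch_sort_states(states::AbstractVector{<:AbstractString}; totals_key = "total")
    is_total(s) = lowercase(s) == totals_key
    # The key (false, name) sorts before (true, name), so totals sink to the bottom.
    # MergeSort is stable, so duplicated states keep their original relative order.
    return sort(states; by = s -> (is_total(s), s), alg = MergeSort)
end

sketch_sort_states(["Kerala", "Total", "Bihar", "Assam"])
# ["Assam", "Bihar", "Kerala", "Total"]
```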