Data Cleaning Process
This document outlines the data cleaning process for the FMDData project. The process is orchestrated by the scripts/icar-cleaning.jl
script, which uses two distinct cleaning pipelines to handle variations in the source data formats.
Cleaning Pipelines
The scripts/icar-cleaning.jl
script segregates files into two main cleaning workflows:
all_cleaning_steps
: The standard pipeline for NADCP and post-2019 reports.all_2019_cleaning_steps
: A specialized pipeline for the 2019 annual report data, which has a different structure.
Main Workflow: all_cleaning_steps
This function, defined in src/icar-cleaning/wrapper-functions.jl
, is a multi-stage process for loading, validating, calculating, and saving the data.
Step 1: Setup and Logging
A
logfiles
directory and a log file for the specific dataset are created.Each step is wrapped in a
_log_try_error
block to capture warnings and errors without halting execution.
Step 2: Initial Loading and Sanitization
load_csv
: Loads the raw CSV file frominputs/ICAR-Reports/extracted-seroprevalence-tables/
into a DataFrame.clean_colnames
: Standardizes column names by converting tosnake_case
, removing special characters, and trimming whitespace.rename_aggregated_pre_post_counts
: Renames columns for pre- and post-vaccination counts for clarity.correct_all_state_names
: Standardizes state names using a predefined mapping insrc/icar-cleaning/state-keys.jl
.
Step 3: Data Validation
A series of validation checks are performed to ensure data integrity.
Structural Checks:
check_duplicated_column_names
: Ensures no duplicate column names exist after cleaning.check_duplicated_states
: Verifies that each state appears only once.
Content and Completeness Checks:
check_missing_states
: Confirms that all expected states are present.check_allowed_serotypes
: Validates that serotype columns match a predefined list.check_seroprevalence_as_pct
: Ensures seroprevalence values are percentages.check_aggregated_pre_post_counts_exist
andcheck_pre_post_exists
: Confirms the presence of essential columns for vaccination counts.
Step 4: Totals Calculation and Verification
has_totals_row
: Checks for the existence of a "Total" row.calculate_all_totals
: Independently calculates totals for all numeric columns.all_totals_check
: Compares calculated totals with existing totals. If they do not match, a warning is logged, and the calculated totals are used. If no totals row exists, the calculated one is added.
Step 5: Core Calculations and Finalization
calculate_state_counts
&calculate_state_seroprevalence
: Calculates state-level counts and seroprevalence.check_calculated_values_match_existing
: Compares these calculations with the original values and logs any discrepancies.select_calculated_cols!
&select_calculated_totals!
: Replaces the original columns with the calculated ones.sort_columns!
&sort_states!
: Sorts columns and rows for consistency.
Step 6: Output
The cleaned DataFrame is saved as a new CSV file in the data/icar-seroprevalence/cleaned/
directory.
Specialized Workflow: all_2019_cleaning_steps
The 2019 data requires a different cleaning process due to its unique structure.
No Totals Row: The 2019 tables do not contain a "Total" row. The function will log an error if one is found.
Duplicate States: The 2019 data may contain multiple rows for the same state. The
check_duplicated_states
validation step is skipped in this pipeline.Conditional Calculations: The core calculations (
calculate_state_counts
,calculate_state_seroprevalence
) are only performed if aggregated count columns are present.
Summary
The FMDData project uses two distinct cleaning pipelines to handle different data formats:
all_cleaning_steps
: A comprehensive pipeline for structured tables with a single row per state and a totals row.all_2019_cleaning_steps
: A flexible pipeline for the 2019 report tables, which lack totals and may have multiple entries per state.
This approach allows for robust and accurate cleaning of various data formats.