Data Cleaning Process
This document outlines the data cleaning process for the FMDData project. The process is orchestrated by the scripts/icar-cleaning.jl script, which uses a unified cleaning pipeline to handle variations in the source data formats.
Cleaning Pipeline
The scripts/icar-cleaning.jl script uses a single, flexible cleaning workflow:
all_cleaning_steps: A comprehensive pipeline that handles all data formats, including NADCP reports, post-2019 reports, and the 2019 annual report data with its different structure.
Main Workflow: all_cleaning_steps
This function, defined in src/icar-cleaning/wrapper-functions.jl, is a multi-stage process for loading, validating, calculating, and saving the data. The function includes flexible parameters to handle different data formats, including the ability to skip totals processing for datasets that don't contain totals rows (such as 2019 data).
Step 1: Setup and Logging
A
logfilesdirectory and a log file for the specific dataset are created.Each step is wrapped in a
_log_try_errorblock to capture warnings and errors without halting execution.
Step 2: Initial Loading and Sanitization
load_csv: Loads the raw CSV file frominputs/ICAR-Reports/extracted-seroprevalence-tables/into a DataFrame.clean_colnames: Standardizes column names by converting tosnake_case, removing special characters, and trimming whitespace.rename_aggregated_pre_post_counts: Renames columns for pre- and post-vaccination counts for clarity.correct_all_state_names: Standardizes state names using a predefined mapping insrc/icar-cleaning/state-keys.jl.
Step 3: Data Validation
A series of validation checks are performed to ensure data integrity.
Structural Checks:
check_duplicated_column_names: Ensures no duplicate column names exist after cleaning.check_duplicated_states: Verifies that each state appears only once.
Content and Completeness Checks:
check_missing_states: Confirms that all expected states are present.check_allowed_serotypes: Validates that serotype columns match a predefined list.check_seroprevalence_as_pct: Ensures seroprevalence values are percentages.check_aggregated_pre_post_counts_existandcheck_pre_post_exists: Confirms the presence of essential columns for vaccination counts.
Step 4: Totals Calculation and Verification
has_totals_row: Checks for the existence of a "Total" row.calculate_all_totals: Independently calculates totals for all numeric columns.all_totals_check: Compares calculated totals with existing totals. If they do not match, a warning is logged, and the calculated totals are used. If no totals row exists, the calculated one is added.
Step 5: Core Calculations and Finalization
calculate_state_counts&calculate_state_seroprevalence: Calculates state-level counts and seroprevalence.check_calculated_values_match_existing: Compares these calculations with the original values and logs any discrepancies.select_calculated_cols!&select_calculated_totals!: Replaces the original columns with the calculated ones.sort_columns!&sort_states!: Sorts columns and rows for consistency.
Step 6: Output
The cleaned DataFrame is saved as a new CSV file in the data/icar-seroprevalence/cleaned/ directory.
Handling Different Data Formats
The all_cleaning_steps function includes flexible parameters to handle different data structures:
skiptotalsParameter: When set totrue, skips totals-related processing for datasets that don't contain totals rows (such as 2019 data).Conditional Logic: The function automatically detects and adapts to different data structures:
Duplicate States: Handles datasets with multiple rows for the same state (common in 2019 data)
Missing Totals: Gracefully processes datasets without totals rows
Variable Columns: Adapts to different column structures across different report years
- Flexible Validation: Validation steps are conditionally applied based on the data structure detected.
Summary
The FMDData project uses a unified, flexible cleaning pipeline that adapts to different data formats:
all_cleaning_steps: A comprehensive pipeline that handles all data formats including:Structured tables with a single row per state and a totals row (NADCP and post-2019 reports)
Flexible tables with multiple entries per state and no totals (2019 report tables)
Various column structures across different report years
This unified approach provides robust and accurate cleaning while maintaining flexibility for different data formats.