Data Cleaning Process

This document outlines the data cleaning process for the FMDData project. The process is orchestrated by the scripts/icar-cleaning.jl script, which uses two distinct cleaning pipelines to handle variations in the source data formats.

Cleaning Pipelines

The scripts/icar-cleaning.jl script segregates files into two main cleaning workflows:

all_cleaning_steps: The standard pipeline for NADCP and post-2019 reports.
all_2019_cleaning_steps: A specialized pipeline for the 2019 annual report data, which has a different structure.

Main Workflow: `all_cleaning_steps`

This function, defined in src/icar-cleaning/wrapper-functions.jl, is a multi-stage process for loading, validating, calculating, and saving the data.

Step 1: Setup and Logging

A logfiles directory and a log file for the specific dataset are created.
Each step is wrapped in a _log_try_error block to capture warnings and errors without halting execution.

Step 2: Initial Loading and Sanitization

load_csv: Loads the raw CSV file from inputs/ICAR-Reports/extracted-seroprevalence-tables/ into a DataFrame.
clean_colnames: Standardizes column names by converting to snake_case, removing special characters, and trimming whitespace.
rename_aggregated_pre_post_counts: Renames columns for pre- and post-vaccination counts for clarity.
correct_all_state_names: Standardizes state names using a predefined mapping in src/icar-cleaning/state-keys.jl.

Step 3: Data Validation

A series of validation checks are performed to ensure data integrity.

Structural Checks:
- check_duplicated_column_names: Ensures no duplicate column names exist after cleaning.
- check_duplicated_states: Verifies that each state appears only once.
Content and Completeness Checks:
- check_missing_states: Confirms that all expected states are present.
- check_allowed_serotypes: Validates that serotype columns match a predefined list.
- check_seroprevalence_as_pct: Ensures seroprevalence values are percentages.
- check_aggregated_pre_post_counts_exist and check_pre_post_exists: Confirms the presence of essential columns for vaccination counts.

Step 4: Totals Calculation and Verification

has_totals_row: Checks for the existence of a "Total" row.
calculate_all_totals: Independently calculates totals for all numeric columns.
all_totals_check: Compares calculated totals with existing totals. If they do not match, a warning is logged, and the calculated totals are used. If no totals row exists, the calculated one is added.

Step 5: Core Calculations and Finalization

calculate_state_counts & calculate_state_seroprevalence: Calculates state-level counts and seroprevalence.
check_calculated_values_match_existing: Compares these calculations with the original values and logs any discrepancies.
select_calculated_cols! & select_calculated_totals!: Replaces the original columns with the calculated ones.
sort_columns! & sort_states!: Sorts columns and rows for consistency.

Step 6: Output

The cleaned DataFrame is saved as a new CSV file in the data/icar-seroprevalence/cleaned/ directory.

Specialized Workflow: `all_2019_cleaning_steps`

The 2019 data requires a different cleaning process due to its unique structure.

No Totals Row: The 2019 tables do not contain a "Total" row. The function will log an error if one is found.
Duplicate States: The 2019 data may contain multiple rows for the same state. The check_duplicated_states validation step is skipped in this pipeline.
Conditional Calculations: The core calculations (calculate_state_counts, calculate_state_seroprevalence) are only performed if aggregated count columns are present.

Summary

The FMDData project uses two distinct cleaning pipelines to handle different data formats:

all_cleaning_steps: A comprehensive pipeline for structured tables with a single row per state and a totals row.
all_2019_cleaning_steps: A flexible pipeline for the 2019 report tables, which lack totals and may have multiple entries per state.

This approach allows for robust and accurate cleaning of various data formats.

Data Cleaning Process ​

Cleaning Pipelines ​

Main Workflow: all_cleaning_steps ​

Specialized Workflow: all_2019_cleaning_steps ​

Summary ​

Data Cleaning Process

Cleaning Pipelines

Main Workflow: `all_cleaning_steps`

Specialized Workflow: `all_2019_cleaning_steps`

Summary