Processing Functions
The processing stage follows the cleaning pipeline. It takes cleaned data, infers missing values, adds contextual metadata, and combines multiple datasets into unified structures ready for analysis.
Data Inference
Functions for inferring missing or incomplete data values based on patterns in the dataset:
FMDData.infer_later_year_values Function
infer_later_year_values(
    cumulative_later_df::T,
    initial_df::T;
    year_column = :sample_year,
    statename_column = :states_ut,
    allowed_serotypes = vcat("all", default_allowed_serotypes),
    reg::Regex,
    atol = 0.0,
    digits = 1
) where {T <: AbstractDataFrame}
Infer single-year values by subtracting initial year data from cumulative data.
This function is used when ICAR reports provide cumulative data (e.g., 2021+2022 combined) and you need to extract the data for just the later year (e.g., 2022 only).
Arguments
cumulative_later_df: DataFrame with cumulative data for multiple years
initial_df: DataFrame with data for the initial year only
year_column: Column containing sample year information
statename_column: Column containing state/UT names
allowed_serotypes: Vector of serotypes to process
reg: Regex pattern to identify count columns for processing
atol: Absolute tolerance for floating-point comparisons
digits: Number of decimal places for calculated percentages
Returns
Try.Ok(inferred_df) with single-year data for the later period
Try.Err(message) if inference fails
Process
Subtracts initial year counts from cumulative counts
Handles rounding errors and missing values
Recalculates totals and seroprevalence percentages
Removes states with no data
Validates final totals
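The steps above can be illustrated with a minimal sketch. It is not the package's implementation: the column names total and positive, the example counts, and the use of an inner join are assumptions made for illustration, and the real function additionally handles serotype columns, rounding tolerances, and validation.

```julia
using DataFrames

# Hypothetical inputs: the column names and counts are illustrative only.
cumulative = DataFrame(states_ut = ["Assam", "Bihar"], total = [300, 150], positive = [90, 30])
initial = DataFrame(states_ut = ["Assam", "Bihar"], total = [100, 150], positive = [40, 30])

# Match rows by state and suffix the columns so the two sources stay distinguishable.
joined = innerjoin(cumulative, initial; on = :states_ut, renamecols = "_cum" => "_init")

# Later-year counts = cumulative counts minus initial-year counts.
later = DataFrame(
    states_ut = joined.states_ut,
    total = joined.total_cum .- joined.total_init,
    positive = joined.positive_cum .- joined.positive_init,
)

# Remove states that contributed no samples in the later year.
filter!(:total => >(0), later)

# Recalculate the percentage from the inferred counts, rounded to one decimal place.
later.positive_pct = round.(100 .* later.positive ./ later.total; digits = 1)
```

In this sketch Bihar is dropped because its cumulative and initial totals are equal, which leaves no later-year samples to report.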
DataFrame Operations
Functions for combining and manipulating cleaned datasets from different sources or time periods:
FMDData.combine_round_dfs Function
combine_round_dfs(dfs::DataFrame...)
Combines multiple DataFrames into a single DataFrame with permissive column handling.
FMDData.combine_all_processed_data Function
combine_all_processed_data(selection::String = "all"; cols = :union)
Load and combine processed CSV files from the processed directory into a single DataFrame.
Selection Options
"all": All processed files in the directory
"2019": Files matching pattern "*_2019.csv" (state-specific 2019 data)
"nadcp": Files matching pattern "nadcp_[123].csv" (NADCP rounds 1, 2, 3)
"organized_farms": Files containing "organized_farms" (farm survey data)
Arguments
selection: String specifying which files to combine
cols: Column handling strategy for vcat (:union allows different columns)
Returns
Combined DataFrame with all selected processed data
Errors
Throws an error if the selection is invalid or no matching files are found
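One way to picture the selection and error behavior described above is the following sketch. The helper select_files and the SELECTION_PATTERNS table are hypothetical, the regexes only approximate the documented filename patterns, and treating "all" as every .csv file is an assumption.

```julia
# Hypothetical lookup table: each regex approximates a documented filename pattern.
const SELECTION_PATTERNS = Dict(
    "all" => r"\.csv$",               # assumption: "all" means every CSV file
    "2019" => r"_2019\.csv$",
    "nadcp" => r"^nadcp_[123]\.csv$",
    "organized_farms" => r"organized_farms",
)

# Hypothetical helper, not part of FMDData: pick the filenames that match a selection.
function select_files(filenames::Vector{String}, selection::String)
    haskey(SELECTION_PATTERNS, selection) || error("Invalid selection: $selection")
    matches = filter(f -> occursin(SELECTION_PATTERNS[selection], f), filenames)
    isempty(matches) && error("No files match selection: $selection")
    return matches
end

# Illustrative filenames only.
files = ["nadcp_1.csv", "nadcp_2.csv", "assam_2019.csv", "organized_farms_2021.csv"]
nadcp_files = select_files(files, "nadcp")
```

Validating the selection before filtering keeps the two failure cases (unknown selection, no matching files) distinct in the error message.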
Examples
# Combine all data
all_data = combine_all_processed_data("all")
# Only NADCP surveillance data
nadcp_data = combine_all_processed_data("nadcp")
These functions handle the combination of multiple cleaned datasets, typically used when processing data from different testing rounds (NADCP 1, NADCP 2, NADCP 3) or organized farm surveys across different time periods.
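To show what the :union column strategy means in practice, here is a small sketch that calls DataFrames' vcat directly. The two DataFrames and their column names are made up for illustration and do not reflect the package's actual schema.

```julia
using DataFrames

# Hypothetical round data with partly different columns.
round1 = DataFrame(states_ut = ["Assam"], serotype_o_pct = [42.0])
round2 = DataFrame(states_ut = ["Bihar"], serotype_a_pct = [17.5])

# cols = :union keeps every column that appears in any input
# and fills the gaps with missing.
combined = vcat(round1, round2; cols = :union)
```

The result has two rows and three columns. Each row is missing the value for the column that exists only in the other input.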
Metadata Operations
Main Metadata Function
The primary function for adding comprehensive metadata to processed datasets:
FMDData.add_all_metadata! Function
add_all_metadata!(
    df_pair::Pair{T, D}
) where {T <: DataFrame, D <: OrderedDict{<:Symbol, <:Any}}
Adds multiple metadata columns to a DataFrame based on a dictionary of metadata.
Arguments
df_pair: A Pair where the key is the DataFrame to modify and the value is an OrderedDict of metadata. The keys of the dictionary should be the names of the metadata columns to add, and the values should be the values to populate those columns with.
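The shape of the df_pair argument may be easier to see in a sketch. The function add_columns_sketch! below is a hypothetical stand-in, not the package's implementation, and it assumes each metadata value is a single scalar that fills every row.

```julia
using DataFrames
using OrderedCollections: OrderedDict

# Hypothetical stand-in for add_all_metadata!; the real function may behave differently.
function add_columns_sketch!(df_pair::Pair{<:AbstractDataFrame, <:OrderedDict{Symbol, <:Any}})
    df, metadata = df_pair
    for (column, value) in metadata
        df[!, column] .= value  # fill every row with the same value
    end
    return df
end

# Illustrative data and metadata values.
df = DataFrame(states_ut = ["Assam", "Bihar"])
metadata = OrderedDict{Symbol, Any}(:round => "NADCP 1", :report_year => 2021)
add_columns_sketch!(df => metadata)
```

Because the dictionary is ordered, the new columns are added in the order the metadata entries were written.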
Specific Metadata Functions
Individual functions for adding specific types of metadata. These are typically called through add_all_metadata! but can be used individually for specific metadata requirements:
FMDData.add_test_threshold! Function
add_test_threshold!(
    df_round_pairs::Pair{T, S}...;
    threshold_column = :test_threshold
) where {T <: AbstractDataFrame, S <: AbstractString}
Adds a test threshold column to one or more DataFrames.
FMDData.add_test_type! Function
add_test_type!(
    df_round_pairs::Pair{T, S}...;
    test_column = :test_type
) where {T <: AbstractDataFrame, S <: AbstractString}
Adds a test type column to one or more DataFrames.
FMDData.add_round_name! Function
add_round_name!(
    df_round_pairs::Pair{T, S}...;
    round_column = :round
) where {T <: AbstractDataFrame, S <: AbstractString}
Adds a round name column to one or more DataFrames.
FMDData.add_report_year! Function
add_report_year!(
    df_year_pairs::Pair{T, I}...;
    year_column = :report_year
) where {T <: AbstractDataFrame, I <: Integer}
Adds a report year column to one or more DataFrames.
FMDData.add_sample_year! Function
add_sample_year!(
    df_year_pairs...;
    year_column = :sample_year
)
Adds a sample year column to one or more DataFrames.
FMDData.add_metadata_col! Function
add_metadata_col!(metadata_column, df_metadata_pairs...)
Adds a metadata column to one or more DataFrames. This is a generic function that can be used to add any metadata column.
add_metadata_col!(
    metadata_column::Symbol,
    df_metadata_pair::Pair{T, I},
) where {T <: AbstractDataFrame, I <: Union{<:Integer, <:AbstractFloat, <:AbstractString}}
Adds a metadata column to a single DataFrame.
The metadata functions enrich the cleaned data with contextual information including:
Test thresholds: Diagnostic test cutoff values used for seroprevalence determination
Test types: The specific diagnostic assay used (e.g., ELISA, VNT)
Round names: Testing campaign identifiers (NADCP 1, NADCP 2, etc.)
Report year: The year the ICAR annual report was published
Sample year: The year when samples were collected (may differ from report year)
Custom metadata: Additional contextual information as needed
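The pair-per-DataFrame calling convention shared by these functions can be sketched as follows. The function add_constant_col! is a hypothetical analogue written for illustration. It is not the package's add_metadata_col!, and the example DataFrames are invented.

```julia
using DataFrames

# Hypothetical analogue of the varargs pattern: each pair maps a DataFrame
# to the value that should fill the new column.
function add_constant_col!(column::Symbol, df_value_pairs::Pair{<:AbstractDataFrame}...)
    for (df, value) in df_value_pairs
        df[!, column] .= value  # same value in every row of this DataFrame
    end
    return nothing
end

# Illustrative round tables.
nadcp_1 = DataFrame(states_ut = ["Assam"])
nadcp_2 = DataFrame(states_ut = ["Assam", "Bihar"])

# Label both rounds in one call.
add_constant_col!(:round, nadcp_1 => "NADCP 1", nadcp_2 => "NADCP 2")
```

Passing several pairs at once lets each dataset receive its own value for the same column, which is convenient to do before the datasets are combined.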