Processing Functions

The processing stage follows the cleaning pipeline and prepares cleaned data for analysis. It adds contextual metadata, infers missing values, and combines multiple datasets into unified structures.

Data Inference

Functions for inferring missing or incomplete data values based on patterns in the dataset:

FMDData.infer_later_year_values Function
julia
infer_later_year_values(
    cumulative_later_df::T,
    initial_df::T;
    year_column = :sample_year,
    statename_column = :states_ut,
    allowed_serotypes = vcat("all", default_allowed_serotypes),
    reg::Regex,
    atol = 0.0,
    digits = 1
) where {T <: AbstractDataFrame}

Infer single-year values by subtracting initial year data from cumulative data.

Use this function when ICAR reports provide cumulative data (e.g., 2021+2022 combined) and only the later year's data (e.g., 2022 only) is needed.

Arguments

  • cumulative_later_df: DataFrame with cumulative data for multiple years

  • initial_df: DataFrame with data for the initial year only

  • year_column: Column containing sample year information

  • statename_column: Column containing state/UT names

  • allowed_serotypes: Vector of serotypes to process

  • reg: Regex pattern to identify count columns for processing

  • atol: Absolute tolerance for floating-point comparisons

  • digits: Number of decimal places for calculated percentages

Returns

  • Try.Ok(inferred_df) with single-year data for the later period

  • Try.Err(message) if inference fails

Process

  1. Subtracts initial year counts from cumulative counts

  2. Handles rounding errors and missing values

  3. Recalculates totals and seroprevalence percentages

  4. Removes states with no data

  5. Validates final totals
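
The subtraction in step 1 and the filtering in step 4 can be sketched with plain DataFrames.jl. This is a minimal illustration, not FMDData's implementation: the count column names, the regex, and the example numbers are all hypothetical, and the rounding, percentage, and validation steps are left out.

```julia
using DataFrames

# Hypothetical inputs: counts for the initial year alone, and cumulative
# counts covering the initial and later year together.
initial_df = DataFrame(
    states_ut = ["Assam", "Bihar"],
    serotype_all_count_pre = [100, 50],
    serotype_all_count_post = [120, 60],
)
cumulative_df = DataFrame(
    states_ut = ["Assam", "Bihar"],
    serotype_all_count_pre = [250, 50],
    serotype_all_count_post = [300, 60],
)

# Pick the count columns with a regex, in the spirit of the `reg` keyword.
count_cols = names(cumulative_df, r"_count_")

# Align rows by state; suffixes keep the two sources apart after the join.
joined = innerjoin(
    cumulative_df, initial_df;
    on = :states_ut, renamecols = "_cum" => "_init",
)

# Later-year counts = cumulative counts minus initial-year counts.
later_df = DataFrame(states_ut = joined.states_ut)
for col in count_cols
    later_df[!, col] = joined[!, col * "_cum"] .- joined[!, col * "_init"]
end

# Drop states that contribute no data in the later year.
filter!(row -> any(c -> row[c] > 0, count_cols), later_df)
```

In this made-up example, Bihar's cumulative counts equal its initial-year counts, so it has no later-year data and is dropped.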

DataFrame Operations

Functions for combining and manipulating cleaned datasets from different sources or time periods:

FMDData.combine_round_dfs Function
julia
combine_round_dfs(dfs::DataFrame...)

Combines multiple DataFrames into a single DataFrame with permissive column handling.
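
One plausible reading of "permissive column handling" is DataFrames.jl's `vcat` with `cols = :union`, which keeps every column and fills gaps with `missing`. The sketch below shows that behaviour on two hypothetical round tables. Whether `combine_round_dfs` does exactly this is an assumption.

```julia
using DataFrames

# Two hypothetical rounds that report different serotype columns.
round1 = DataFrame(states_ut = ["Assam"], serotype_o_pct = [55.0])
round2 = DataFrame(states_ut = ["Bihar"], serotype_a_pct = [40.0])

# cols = :union keeps all columns; cells absent from a source become missing.
combined = vcat(round1, round2; cols = :union)
```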

FMDData.combine_all_processed_data Function
julia
combine_all_processed_data(selection::String = "all"; cols = :union)

Load and combine processed CSV files from the processed directory into a single DataFrame.

Selection Options

  • "all": All processed files in the directory

  • "2019": Files matching pattern "*_2019.csv" (state-specific 2019 data)

  • "nadcp": Files matching pattern "nadcp_[123].csv" (NADCP rounds 1, 2, 3)

  • "organized_farms": Files containing "organized_farms" (farm survey data)

Arguments

  • selection: String specifying which files to combine

  • cols: Column handling strategy for vcat (:union allows different columns)

Returns

Combined DataFrame with all selected processed data

Errors

Throws an error if the selection is invalid or no matching files are found.

Examples

julia
# Combine all data
all_data = combine_all_processed_data("all")

# Only NADCP surveillance data
nadcp_data = combine_all_processed_data("nadcp")

These functions combine multiple cleaned datasets. They are typically used when processing data from different testing rounds (NADCP 1, NADCP 2, NADCP 3) or organized farm surveys across different time periods.
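
The selection options documented for combine_all_processed_data can be pictured as filename filters. The helper below is a hypothetical sketch in base Julia: its name, the regexes, and the example filenames are assumptions chosen to mirror the documented patterns and error cases, not the package's code.

```julia
# Hypothetical helper: map a selection keyword to a filename filter.
function select_processed_files(filenames::Vector{String}, selection::String = "all")
    patterns = Dict(
        "all" => r"\.csv$",
        "2019" => r"_2019\.csv$",
        "nadcp" => r"^nadcp_[123]\.csv$",
        "organized_farms" => r"organized_farms",
    )
    # Reject selections that are not in the table above.
    haskey(patterns, selection) || error("Invalid selection: $selection")
    matches = filter(f -> occursin(patterns[selection], f), filenames)
    # Mirror the documented error when nothing matches.
    isempty(matches) && error("No files found for selection: $selection")
    return matches
end

files = ["nadcp_1.csv", "nadcp_2.csv", "assam_2019.csv", "organized_farms_2021.csv"]
nadcp_files = select_processed_files(files, "nadcp")
```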

Metadata Operations

Main Metadata Function

The primary function for adding comprehensive metadata to processed datasets:

FMDData.add_all_metadata! Function
julia
add_all_metadata!(
    df_pair::Pair{T, D}
) where {T <: DataFrame, D <: OrderedDict{<:Symbol, <:Any}}

Adds multiple metadata columns to a DataFrame based on a dictionary of metadata.

Arguments

  • df_pair: A Pair where the key is the DataFrame to modify and the value is an OrderedDict of metadata. The keys of the dictionary should be the names of the metadata columns to add, and the values should be the values to populate those columns with.
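
The `DataFrame => OrderedDict` calling pattern can be illustrated with a small stand-in. `add_constant_columns!` below is a hypothetical function, not the package's implementation, and it assumes `OrderedDict` comes from OrderedCollections.jl. It fills one constant column per dictionary entry, in insertion order.

```julia
using DataFrames
using OrderedCollections: OrderedDict

# Hypothetical stand-in for the documented calling pattern.
function add_constant_columns!(df_pair::Pair{<:AbstractDataFrame, <:OrderedDict})
    df, metadata = df_pair
    for (column, value) in metadata
        df[!, column] .= value   # broadcast one value down a new column
    end
    return df
end

df = DataFrame(states_ut = ["Assam", "Bihar"])
add_constant_columns!(df => OrderedDict(:round => "NADCP 1", :report_year => 2021))
```

Because the dictionary is ordered, the new columns appear in the order the metadata was declared.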

Specific Metadata Functions

Individual functions for adding specific types of metadata. These are typically called through add_all_metadata! but can also be called directly when only one kind of metadata is needed:

FMDData.add_test_threshold! Function
julia
add_test_threshold!(
    df_round_pairs::Pair{T, S}...;
    threshold_column = :test_threshold
) where {T <: AbstractDataFrame, S <: AbstractString}

Adds a test threshold column to one or more DataFrames.

FMDData.add_test_type! Function
julia
add_test_type!(
    df_round_pairs::Pair{T, S}...;
    test_column = :test_type
) where {T <: AbstractDataFrame, S <: AbstractString}

Adds a test type column to one or more DataFrames.

FMDData.add_round_name! Function
julia
add_round_name!(
    df_round_pairs::Pair{T, S}...;
    round_column = :round
) where {T <: AbstractDataFrame, S <: AbstractString}

Adds a round name column to one or more DataFrames.

FMDData.add_report_year! Function
julia
add_report_year!(
    df_year_pairs::Pair{T, I}...;
    year_column = :report_year
) where {T <: AbstractDataFrame, I <: Integer}

Adds a report year column to one or more DataFrames.

FMDData.add_sample_year! Function
julia
add_sample_year!(
    df_year_pairs...;
    year_column = :sample_year
)

Adds a sample year column to one or more DataFrames.

FMDData.add_metadata_col! Function
julia
add_metadata_col!(metadata_column, df_metadata_pairs...)

Adds a metadata column to one or more DataFrames. This is a generic function that can be used to add any metadata column.

julia
add_metadata_col!(
    metadata_column::Symbol,
    df_metadata_pair::Pair{T, I},
) where {T <: AbstractDataFrame, I <: Union{<:Integer, <:AbstractFloat, <:AbstractString}}

Adds a metadata column to a single DataFrame.
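
The varargs-of-pairs pattern shared by these functions can be sketched as follows. `sketch_add_col!` is a hypothetical name and the body is an assumption about the general idea: each DataFrame receives its own constant value in the same column.

```julia
using DataFrames

# Hypothetical sketch: accept any number of `df => value` pairs and
# fill the named column of each DataFrame with its paired value.
function sketch_add_col!(column::Symbol, df_value_pairs::Pair{<:AbstractDataFrame}...)
    for (df, value) in df_value_pairs
        df[!, column] .= value
    end
    return nothing
end

round1 = DataFrame(states_ut = ["Assam"])
round2 = DataFrame(states_ut = ["Bihar"])
sketch_add_col!(:round, round1 => "NADCP 1", round2 => "NADCP 2")
```

Passing pairs keeps each value next to the DataFrame it describes, which makes it harder to mislabel a round when several tables are tagged in one call.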

The metadata functions enrich the cleaned data with contextual information including:

  • Test thresholds: Diagnostic test cutoff values used for seroprevalence determination

  • Test types: The specific diagnostic assay used (e.g., ELISA, VNT)

  • Round names: Testing campaign identifiers (NADCP 1, NADCP 2, etc.)

  • Report year: The year the ICAR annual report was published

  • Sample year: The year when samples were collected (may differ from report year)

  • Custom metadata: Additional contextual information as needed