Cleaning Functions

FMDData.all_2019_cleaning_steps Method
julia
all_2019_cleaning_steps(
    input_filename::T1,
    input_dir::T1;
    output_filename::T1 = "clean_$(input_filename)",
    output_dir::T1 = icar_cleaned_dir(),
    load_format = DataFrame
) where {T1 <: AbstractString}

A wrapper function that runs all the cleaning steps for seroprevalence tables from the 2019 annual report that share the common format of states in each row and columns relating to serotype seroprevalence. For tables from later reports, use all_cleaning_steps().

source
FMDData.all_cleaning_steps Method
julia
all_cleaning_steps(
    input_filename::T1,
    input_dir::T1;
    output_filename::T1 = "clean_$(input_filename)",
    output_dir::T1 = icar_cleaned_dir(),
    load_format = DataFrame
) where {T1 <: AbstractString}

A wrapper function that runs all the cleaning steps for seroprevalence tables that share the common format of states in each row and columns relating to serotype counts/seroprevalence. For tables that contain multiple rows for each state (e.g., 2019 report tables, which cover multiple years for a single state), use the alternative wrapper function all_2019_cleaning_steps().

source
FMDData.calculate_state_counts Function
julia
calculate_state_counts(df::DataFrame, allowed_serotypes = default_allowed_serotypes)

A wrapper function around the internal _calculate_state_counts() function to calculate the state/serotype specific counts based upon the state/serotype seroprevalence values and total state counts. See the documentation of _calculate_state_counts() for more details on the implementation.

source
FMDData.calculate_state_seroprevalence Method
julia
calculate_state_seroprevalence(df::DataFrame, allowed_serotypes = default_allowed_serotypes)

A wrapper function around the internal _calculate_state_seroprevalence() function to calculate the state/serotype specific seroprevalence values based upon the state/serotype counts and total state counts. See the documentation of _calculate_state_seroprevalence() for more details on the implementation.
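
As a rough illustration of how counts and seroprevalence relate, the sketch below converts a percentage into a count and a count back into a percentage. The helper names and the rounding choices are assumptions made for this example; they are not taken from the internal _calculate_state_counts() or _calculate_state_seroprevalence() functions.

```julia
# Hypothetical helpers for illustration only; not part of FMDData.

# Number of seropositive samples implied by a percentage and a total sample count.
# Rounding to the nearest integer is an assumption.
sketch_count(pct::Real, total::Integer) = round(Int, pct / 100 * total)

# Seroprevalence (as a percentage) implied by a count and a total sample count.
# Rounding to one decimal place by default is an assumption.
sketch_pct(count::Integer, total::Integer; digits = 1) =
    round(count / total * 100; digits = digits)

sketch_count(25.0, 200)   # 50
sketch_pct(50, 150)       # 33.3
```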

source
FMDData.check_calculated_values_match_existing Method
julia
check_calculated_values_match_existing(
    df::DataFrame,
    allowed_serotypes::T = default_allowed_serotypes;
    digits = 1
) where {T <: AbstractVector{<:AbstractString}}

Check whether the provided counts and seroprevalence values match the corresponding calculated values.

source
FMDData.check_seroprevalence_as_pct Function
julia
check_seroprevalence_as_pct(df::DataFrame, reg::Regex)

Check if all seroprevalence columns are reported as a percentage, and not as a proportion.

source
FMDData.clean_colnames Function
julia
clean_colnames(df::DataFrame, allowed_chars_reg::Regex)

Replace spaces and / with underscores, and (n) and (%) with "count" and "pct" respectively. allowed_chars_reg should be a negative match: the default r"[^\w]" matches every character that is not alphanumeric or an underscore.
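
The substitutions described above could look roughly like the following when applied to a single column name. This is a minimal sketch in plain Julia: the function name is hypothetical, and the order of the substitutions and the removal of characters matched by allowed_chars_reg are assumptions, not a description of FMDData's implementation.

```julia
# Hypothetical helper for illustration only; not part of FMDData.
function sketch_clean_colname(name::AbstractString, allowed_chars_reg::Regex = r"[^\w]")
    cleaned = String(strip(name))
    # Swap the "(n)" and "(%)" markers for readable suffixes.
    cleaned = replace(cleaned, "(n)" => "count")
    cleaned = replace(cleaned, "(%)" => "pct")
    # Collapse runs of whitespace and slashes into a single underscore.
    cleaned = replace(cleaned, r"[\s/]+" => "_")
    # Assumption: anything still matched by the negative regex is dropped.
    cleaned = replace(cleaned, allowed_chars_reg => "")
    return cleaned
end

sketch_clean_colname("states/ut")             # "states_ut"
sketch_clean_colname("serotype O (%) pre")    # "serotype_O_pct_pre"
```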

source
FMDData.rename_aggregated_pre_post_counts Function
julia
rename_aggregated_pre_post_counts(
    df::DataFrame,
    original_regex::Regex,
    substitution_string::SubstitutionString
)

Rename the aggregated pre/post counts to use the same format as the serotype-specific values.

source
FMDData.check_duplicated_column_names Method
julia
check_duplicated_column_names(df::DataFrame)

Wrapper function around the two internal functions _check_identical_column_names() and _check_similar_column_names(). Checks for both identical and very similar column names in the DataFrame.

source
FMDData.check_duplicated_columns Method
julia
check_duplicated_columns(df::DataFrame)

Check if any columns have identical values.
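
One way to picture this check is a pairwise comparison of column contents. The sketch below uses a plain vector of name => values pairs instead of a DataFrame so that it runs without any packages; the function name and return format are assumptions for illustration only.

```julia
# Hypothetical helper for illustration only; not part of FMDData.
# Returns every pair of column names whose values are identical.
function sketch_duplicated_columns(columns::Vector{<:Pair})
    duplicates = Tuple{String, String}[]
    for i in 1:length(columns), j in (i + 1):length(columns)
        # isequal treats `missing` entries as equal to each other.
        if isequal(columns[i].second, columns[j].second)
            push!(duplicates, (columns[i].first, columns[j].first))
        end
    end
    return duplicates
end

cols = ["a" => [1, 2], "b" => [1, 2], "c" => [3, 4]]
sketch_duplicated_columns(cols)   # [("a", "b")]
```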

source
FMDData.load_csv Method
julia
load_csv(
    filename::T1,
    dir::T1,
    output_format = DataFrame
) where {T1 <: AbstractString}

A helper function to check if a CSV input file and directory exist, and if so, load the file (as a DataFrame by default).

source
FMDData.write_csv Method
julia
write_csv(
    filename::T1,
    dir::T1,
    data::DataFrame
) where {T1 <: AbstractString}

A helper function to check if the specified name and directory exist and are valid, and if so, write the CSV to disk.

source
FMDData.check_aggregated_pre_post_counts_exist Function
julia
check_aggregated_pre_post_counts_exist(
    df::DataFrame,
    columns = ["serotype_all_count_pre", "serotype_all_count_post"]
)

Check if the data contains aggregated counts of pre- and post-vaccination individuals. Should only be used on dataframes that have renamed these columns to meet the standard pattern of "serotype_all_count_pre".

source
FMDData.check_pre_post_exists Function
julia
check_pre_post_exists(df::DataFrame, reg::Regex)

Confirms each combination of serotype and result type (N/%) has both a pre- and post-vaccination results column, but nothing else.
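
The pairing rule can be sketched on a plain vector of column names. The function name, the regex (with three capture groups for serotype, result type, and timing), and the return format are all assumptions made for this example; the actual function takes a DataFrame and a caller-supplied regex.

```julia
# Hypothetical helper for illustration only; not part of FMDData.
# Returns the (serotype, result type) combinations that lack a pre or a post column.
function sketch_check_pre_post(
        colnames::Vector{String},
        reg::Regex = r"^serotype_(.+)_(count|pct)_(pre|post)$"
    )
    seen = Dict{Tuple{String, String}, Set{String}}()
    for name in colnames
        m = match(reg, name)
        m === nothing && continue   # not a serotype results column
        key = (String(m.captures[1]), String(m.captures[2]))
        push!(get!(seen, key, Set{String}()), String(m.captures[3]))
    end
    incomplete = [key for (key, timings) in seen if timings != Set(["pre", "post"])]
    return sort(incomplete)
end

sketch_check_pre_post(["serotype_O_pct_pre", "serotype_O_pct_post", "serotype_A_pct_pre"])
# [("A", "pct")]
```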

source
FMDData.select_calculated_cols! Method
julia
select_calculated_cols!(
    df::DataFrame,
    allowed_serotypes = vcat("all", default_allowed_serotypes),
    reg::Regex
)

Checks if the data contains both provided and calculated columns that refer to the same variables. If the calculated column is a % seroprevalence, keep the calculated values. If the calculated column is a column of counts, keep the provided values, as they are deemed to be more accurate (counts require no calculation and should be a direct recording/reporting of the underlying data). The cleaning function check_calculated_values_match_existing() should be run beforehand to ensure there are no surprises during this processing step, i.e., accidentally deleting columns that should be retained.

source
FMDData.check_allowed_serotypes Function
julia
check_allowed_serotypes(
    df::DataFrame,
    allowed_serotypes::Vector{String} = vcat("all", default_allowed_serotypes),
    reg::Regex = r"serotype_(.*)_(?|count|pct)_(pre|post)"
)

Function to confirm that all required and no disallowed serotypes are provided in the data.
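
A minimal sketch of this kind of check on a plain vector of column names is shown below. The function name, the return format, and the assumed serotype list ("all" plus "O", "A", "Asia1") are illustrative assumptions; the regex is a simplified variant of the default shown in the signature.

```julia
# Hypothetical helper for illustration only; not part of FMDData.
# Reports allowed serotypes that are absent and found serotypes that are not allowed.
function sketch_serotype_report(
        colnames::Vector{String},
        allowed::Vector{String} = ["all", "O", "A", "Asia1"];   # assumed defaults
        reg::Regex = r"^serotype_(.+)_(?:count|pct)_(?:pre|post)$"
    )
    found = Set{String}()
    for name in colnames
        m = match(reg, name)
        m === nothing || push!(found, String(m.captures[1]))
    end
    absent = sort(collect(setdiff(Set(allowed), found)))
    disallowed = sort(collect(setdiff(found, Set(allowed))))
    return (absent = absent, disallowed = disallowed)
end

report = sketch_serotype_report(
    ["states_ut", "serotype_all_count_pre", "serotype_O_pct_pre", "serotype_C_pct_pre"]
)
report.absent       # ["A", "Asia1"]
report.disallowed   # ["C"]
```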

source
FMDData.sort_columns! Method
julia
sort_columns!(
    df::DataFrame;
    statename_column = :states_ut,
    allowed_serotypes = vcat("all", default_allowed_serotypes),
    prefix = "serotype_",
    suffix_order = [
        "_count_pre",
        "_pct_pre",
        "_count_post",
        "_pct_post",
    ]
)

Sort the columns of the cleaned dataframe to have a consistent order. Follows the pattern:

  • state column name

  • serotype all counts (pre then post)

  • serotype specific columns in the order "O", "A", "Asia1"

Within each serotype, the serotype-specific columns are presented in the following order:

  • serotype X pre count

  • serotype X pre pct

  • serotype X post count

  • serotype X post pct
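
The ordering described above can be sketched by building the full list of expected names from the prefix, serotypes, and suffixes, then keeping only the names that are present. The function name and the assumed serotype list are hypothetical, and the sketch works on a vector of names, not on a DataFrame.

```julia
# Hypothetical helper for illustration only; not part of FMDData.
function sketch_column_order(
        present::Vector{String};
        statename_column = "states_ut",
        allowed_serotypes = ["all", "O", "A", "Asia1"],   # assumed defaults
        prefix = "serotype_",
        suffix_order = ["_count_pre", "_pct_pre", "_count_post", "_pct_post"]
    )
    expected = [statename_column]
    # Outer loop over serotypes, inner loop over suffixes.
    for serotype in allowed_serotypes, suffix in suffix_order
        push!(expected, prefix * serotype * suffix)
    end
    # Keep only the expected names that actually occur in the data.
    return [name for name in expected if name in present]
end

sketch_column_order([
    "serotype_O_pct_pre", "states_ut", "serotype_all_count_post",
    "serotype_O_count_pre", "serotype_all_count_pre",
])
# ["states_ut", "serotype_all_count_pre", "serotype_all_count_post",
#  "serotype_O_count_pre", "serotype_O_pct_pre"]
```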

source
FMDData.sort_states! Method
julia
sort_states!(
    df::DataFrame;
    statename_column = :states_ut,
    totals_key = "total"
)

Sort the dataframe by alphabetical order of the states and list the totals row at the bottom. Preserves the original order of rows if there are duplicates.
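
A stable sort with a two-part key is one way to obtain this behaviour. The sketch below returns a row permutation for a plain vector of state names; the function name and the case-insensitive comparison are assumptions for illustration.

```julia
# Hypothetical helper for illustration only; not part of FMDData.
# Returns the permutation that orders states alphabetically with the totals row last.
function sketch_state_order(states::Vector{String}; totals_key = "total")
    # The Bool in the key sorts `false` before `true`, pushing the totals row down.
    # MergeSort is stable, so duplicated states keep their original relative order.
    return sortperm(
        states;
        by = s -> (lowercase(s) == totals_key, lowercase(s)),
        alg = MergeSort
    )
end

states = ["kerala", "total", "assam", "bihar"]
states[sketch_state_order(states)]   # ["assam", "bihar", "kerala", "total"]
```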

source
FMDData.check_duplicated_states Function
julia
check_duplicated_states(
    df::DataFrame,
    column::Symbol = :states_ut,
)

Check if there are duplicated states in the data.

source
FMDData.check_missing_states Function
julia
check_missing_states(
    df::DataFrame,
    column::Symbol = :states_ut,
)

Check if the states column of the data contains missing values.

source
FMDData.correct_all_state_names Function
julia
correct_all_state_names(
    df::DataFrame,
    column::Symbol = :states_ut,
    states_dict::Dict = FMDData.states_dict
)

Correct all state name values in the data.

source
FMDData.all_totals_check Method
julia
all_totals_check(
    df::DataFrame;
    column::Symbol = :states_ut,
    totals_key = "total",
    allowed_serotypes = vcat("all", default_allowed_serotypes),
    reg::Regex, # Note: This is a positional argument in the function signature.
    atol = 0.0,
    digits = 1
)

Checks if the totals row in a DataFrame is accurate.

This function has two main methods:

  1. all_totals_check(df::DataFrame; ...): This is the main method, which calculates the totals and then compares them to the existing totals row in the DataFrame.

  2. all_totals_check(totals_dict::OrderedDict, df::DataFrame; ...): This method is used when the totals have already been calculated and are passed in as a dictionary.

The function calculates the totals for both counts and seroprevalence. For counts, it calculates a simple sum. For seroprevalence, it calculates a weighted sum based on the relevant counts (pre- or post-vaccination).

Arguments

  • df: The DataFrame to check.

  • column: The column containing the state/UT names. Defaults to :states_ut.

  • totals_key: The key used to identify the totals row. Defaults to "total".

  • allowed_serotypes: A vector of allowed serotypes.

  • reg: A regular expression used to select the columns to check.

  • atol: The absolute tolerance to use when comparing floating-point numbers. Defaults to 0.0.

  • digits: The number of digits to round to. Defaults to 1.
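
The two kinds of total described above can be sketched as follows. The helper names are hypothetical and the code works on plain vectors; it illustrates a simple sum and a count-weighted average under the assumption that the weights are the per-state sample counts, and is not the package's internal implementation.

```julia
# Hypothetical helpers for illustration only; not part of FMDData.

# Total for a count column: a simple sum over the states.
sketch_total_count(counts) = sum(counts)

# Total for a seroprevalence column: each state's percentage weighted by its sample count.
function sketch_weighted_pct(pcts, weights; digits = 1)
    return round(sum(pcts .* weights) / sum(weights); digits = digits)
end

sketch_total_count([100, 300])                   # 400
sketch_weighted_pct([50.0, 80.0], [100, 300])    # 72.5
```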

source
FMDData.calculate_all_totals Method
julia
calculate_all_totals(
    df::DataFrame;
    column::Symbol = :states_ut,
    totals_key = "total",
    allowed_serotypes = vcat("all", default_allowed_serotypes),
    reg::Regex, # Note: This is a positional argument in the function signature.
    digits = 1
)

Calculate all totals using the appropriate method of the internal function _calculate_totals!(), depending on whether the column is a Float (seroprevalence) or an Integer (count). The internal function _collect_totals_check_args() identifies which arguments need to be passed to _calculate_totals!(), and the internal function _totals_row_selectors() extracts the totals row from the dataframe for use when calculating the weighted total seroprevalence for each serotype.

Arguments

  • df::DataFrame: The input DataFrame.

  • column::Symbol: Symbol of the column containing state/UT names (default: :states_ut).

  • totals_key::String: String key used to identify the totals row (default: "total").

  • allowed_serotypes::Vector{String}: Vector of allowed serotype strings.

  • reg::Regex: A positional Regex argument to select columns for totals calculation.

  • digits::Int: Number of digits for rounding (default: 1).

source
FMDData.has_totals_row Function
julia
has_totals_row(
    df::DataFrame,
    column::Symbol = :states_ut,
    possible_keys = ["total", "totals"]
)

Check if the table has a totals row.

df should have, at the very least, cleaned column names using the clean_colnames() function.

source
FMDData.select_calculated_totals! Function
julia
select_calculated_totals!(
    df::DataFrame,
    column::Symbol = :states_ut,
    totals_key = "total",
    calculated_totals_key = "total calculated"
)

If the cleaned data contains both a provided and a calculated totals row, then strip the provided one and rename the calculated one.

source
FMDData.totals_check Function
julia
totals_check(
    totals::DataFrameRow,
    calculated_totals::OrderedDict,
    column::Symbol = :states_ut;
    atol = 0.0
)

Check if the totals provided in a DataFrameRow match the calculated totals.

Arguments

  • totals::DataFrameRow: A row from a DataFrame, typically the 'total' row.

  • calculated_totals::OrderedDict: An OrderedDict where keys are column names and values are the calculated totals for these columns.

  • column::Symbol: The symbol for the column containing state/UT names, used for error messaging if totals don't match (default: :states_ut).

  • atol::Float64: Absolute tolerance used for comparing floating-point numbers (default: 0.0).

Returns Try.Ok(nothing) if all totals match, or Try.Err with a descriptive message if discrepancies are found.
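
The comparison itself can be pictured with Base's isapprox and an absolute tolerance. The sketch below uses plain dictionaries and returns the mismatched column names; it does not use Try.Ok or Try.Err, and the function name and return format are assumptions for illustration.

```julia
# Hypothetical helper for illustration only; not part of FMDData.
# Returns the names of columns whose provided total differs from the calculated total.
function sketch_totals_mismatches(provided::AbstractDict, calculated::AbstractDict; atol = 0.0)
    mismatched = String[]
    for (colname, calculated_value) in calculated
        # With atol = 0.0, isapprox only accepts values that agree to floating-point precision.
        isapprox(provided[colname], calculated_value; atol = atol) || push!(mismatched, colname)
    end
    return sort(mismatched)
end

provided = Dict("serotype_all_count_pre" => 400, "serotype_O_pct_pre" => 72.4)
calculated = Dict("serotype_all_count_pre" => 400, "serotype_O_pct_pre" => 72.5)

sketch_totals_mismatches(provided, calculated)               # ["serotype_O_pct_pre"]
sketch_totals_mismatches(provided, calculated; atol = 0.1)   # String[]
```

A non-zero atol is useful when the reported totals were rounded before publication, which is the situation the digits and atol arguments appear designed for.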

source