Cleaning Functions

FMDData.all_2019_cleaning_steps Method

julia

all_2019_cleaning_steps(
    input_filename::T1,
    input_dir::T1;
    output_filename::T1 = "clean_$(input_filename)",
    output_dir::T1 = icar_cleaned_dir(),
    load_format = DataFrame
) where {T1 <: AbstractString}

A wrapper function that runs all the cleaning steps for seroprevalence tables from the 2019 annual report that share the common format of states in each row and columns relating to serotype seroprevalence. For tables from later reports, use all_cleaning_steps().

FMDData.all_cleaning_steps Method

julia

all_cleaning_steps(
    input_filename::T1,
    input_dir::T1;
    output_filename::T1 = "clean_$(input_filename)",
    output_dir::T1 = icar_cleaned_dir(),
    load_format = DataFrame
) where {T1 <: AbstractString}

A wrapper function that runs all the cleaning steps for seroprevalence tables that share the common format of states in each row and columns relating to serotype counts/seroprevalence. For tables that contain multiple rows for each state (e.g., 2019 report tables, which cover multiple years for a single state), use the alternative wrapper function all_2019_cleaning_steps().

FMDData.calculate_state_counts Function

julia

calculate_state_counts(df::DataFrame, allowed_serotypes = default_allowed_serotypes)

A wrapper function around the internal _calculate_state_counts() function to calculate the state/serotype specific counts based upon the state/serotype seroprevalence values and total state counts. See the documentation of _calculate_state_counts() for more details on the implementation.
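The conversion described above can be sketched in plain Julia. This is an illustration only, not the package's implementation: `sketch_count_from_pct` is a hypothetical helper, and it assumes seroprevalence is reported as a percentage (0 to 100) and that the count is rounded to the nearest integer.

```julia
# Hypothetical helper, not part of FMDData.
# Assumes `pct` is a percentage (0-100) and `total` is the number of
# samples tested for that state.
function sketch_count_from_pct(pct::Real, total::Integer)
    # Convert the percentage to a proportion, scale by the total,
    # and round to the nearest whole individual.
    return round(Int, pct / 100 * total)
end

sketch_count_from_pct(25.0, 200)  # returns 50
```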

FMDData.calculate_state_seroprevalence Method

julia

calculate_state_seroprevalence(df::DataFrame, allowed_serotypes = default_allowed_serotypes)

A wrapper function around the internal _calculate_state_seroprevalence() function to calculate the state/serotype specific seroprevalence values based upon the state/serotype counts and total state counts. See the documentation of _calculate_state_seroprevalence() for more details on the implementation.

FMDData.check_calculated_values_match_existing Method

julia

check_calculated_values_match_existing(
    df::DataFrame,
    allowed_serotypes::T = default_allowed_serotypes;
    digits = 1
) where {T <: AbstractVector{<:AbstractString}}

Check whether the provided values of counts and seroprevalence values match the corresponding values calculated.

FMDData.check_seroprevalence_as_pct Function

julia

check_seroprevalence_as_pct(df::DataFrame, reg::Regex)

Check if all seroprevalence columns are reported as a percentage, and not as a proportion.

FMDData.clean_colnames Function

julia

clean_colnames(df::DataFrame, allowed_chars_reg::Regex)

Replace spaces and "/" with underscores, and "(n)" and "(%)" with "count" and "pct" respectively. allowed_chars_reg should be a negated match: the default, r"[^\w]", matches every character that is not alphanumeric or an underscore.
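As a rough illustration of these substitutions on a single column name, the sketch below uses only Base string functions. `sketch_clean_name` is a hypothetical helper, and removing every character matched by `allowed_chars_reg` is an assumption about how the negated match is used; the package's actual behavior may differ.

```julia
# Hypothetical helper, not part of FMDData.
function sketch_clean_name(name::AbstractString, allowed_chars_reg::Regex = r"[^\w]")
    cleaned = replace(name, "(n)" => "count")   # "(n)" marks a count column
    cleaned = replace(cleaned, "(%)" => "pct")  # "(%)" marks a percentage column
    cleaned = replace(cleaned, r"[ /]" => "_")  # spaces and slashes become underscores
    # Assumption: anything still matched by the negated pattern is dropped.
    return replace(cleaned, allowed_chars_reg => "")
end

sketch_clean_name("Serotype O (%) Pre")  # returns "Serotype_O_pct_Pre"
sketch_clean_name("States/UT")           # returns "States_UT"
```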

FMDData.rename_aggregated_pre_post_counts Function

julia

rename_aggregated_pre_post_counts(
    df::DataFrame,
    original_regex::Regex,
    substitution_string::SubstitutionString
)

Rename the aggregated pre/post counts to use the same format as the serotype-specific values.

FMDData.check_duplicated_column_names Method

julia

check_duplicated_column_names(df::DataFrame)

Wrapper function around the two internal functions _check_identical_column_names() and _check_similar_column_names(). Checks for both identical and very similar column names in the DataFrame.

FMDData.check_duplicated_columns Method

julia

check_duplicated_columns(df::DataFrame)

Check if any columns have identical values.

FMDData.load_csv Method

julia

load_csv(
    filename::T1,
    dir::T1,
    output_format = DataFrame
) where {T1 <: AbstractString}

A helper function to check if a CSV input file and directory exist, and if so, load the file (as a DataFrame by default).

FMDData.write_csv Method

julia

write_csv(
    filename::T1,
    dir::T1,
    data::DataFrame
) where {T1 <: AbstractString}

A helper function to check if the specified name and directory exist and are valid, and if so, write the CSV to disk.

FMDData.check_aggregated_pre_post_counts_exist Function

julia

check_aggregated_pre_post_counts_exist(
    df::DataFrame,
    columns = ["serotype_all_count_pre", "serotype_all_count_post"]
)

Check if the data contains aggregated counts of pre- and post-vaccination individuals. Should only be used on dataframes in which these columns have already been renamed to match the standard pattern, e.g., "serotype_all_count_pre".

FMDData.check_pre_post_exists Function

julia

check_pre_post_exists(df::DataFrame, reg::Regex)

Confirms each combination of serotype and result type (N/%) has both a pre- and post-vaccination results column, but nothing else.

FMDData.select_calculated_cols! Method

julia

select_calculated_cols!(
    df::DataFrame,
    allowed_serotypes = vcat("all", default_allowed_serotypes),
    reg::Regex
)

Checks if the data contains both provided and calculated columns that refer to the same variables. If the calculated column is a % seroprevalence, keep the calculated values. If the calculated column is a column of counts, keep the provided values, as they are deemed to be more accurate (counts require no calculation and should be a direct recording/reporting of the underlying data). The cleaning function check_calculated_values_match_existing() should be run beforehand to ensure there are no surprises during this processing step, i.e., accidentally deleting columns that should be retained.

FMDData.check_allowed_serotypes Function

julia

check_allowed_serotypes(
    df::DataFrame,
    allowed_serotypes::Vector{String} = vcat("all", default_allowed_serotypes),
    reg::Regex = r"serotype_(.*)_(?|count|pct)_(pre|post)"
)

Function to confirm that all required and no disallowed serotypes are provided in the data.
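The check can be pictured as two set differences over the serotype names captured from the column names. The sketch below is an assumption-laden illustration rather than the package's code: `sketch_serotype_check` is hypothetical, and it uses a simplified stand-in pattern with non-capturing groups instead of the default `reg` shown in the signature.

```julia
# Hypothetical helper, not part of FMDData.
function sketch_serotype_check(colnames, allowed)
    # Simplified stand-in pattern: capture only the serotype name.
    reg = r"serotype_(.*)_(?:count|pct)_(?:pre|post)"
    found = Set{String}()
    for name in colnames
        m = match(reg, name)
        # Columns that do not match (e.g. the state column) are skipped.
        m === nothing || push!(found, String(m.captures[1]))
    end
    return (
        missing_serotypes = setdiff(allowed, found),          # required but absent
        disallowed = sort(collect(setdiff(found, allowed))),  # present but not allowed
    )
end

cols = ["states_ut", "serotype_O_pct_pre", "serotype_A_count_post", "serotype_C_pct_pre"]
result = sketch_serotype_check(cols, ["O", "A", "Asia1"])
# result.missing_serotypes == ["Asia1"]; result.disallowed == ["C"]
```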

FMDData.sort_columns! Method

julia

sort_columns!(
    df::DataFrame;
    statename_column = :states_ut,
    allowed_serotypes = vcat("all", default_allowed_serotypes),
    prefix = "serotype_",
    suffix_order = [
        "_count_pre",
        "_pct_pre",
        "_count_post",
        "_pct_post",
    ]
)

Sort the columns of the cleaned dataframe to have a consistent order. Follows the pattern:

state column name
serotype all counts (pre then post)
serotype specific columns in the order "O", "A", "Asia1"

The serotype specific columns have their data presented in the following order.

serotype X pre count
serotype X pre pct
serotype X post count
serotype X post pct
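The ordering above can be reproduced as a list of target column names. This is a minimal sketch under stated assumptions: `sketch_column_order` is a hypothetical helper, the serotype list "O", "A", "Asia1" is taken from the order described above, and the aggregated "all" columns are assumed to exist only as counts.

```julia
# Hypothetical helper, not part of FMDData.
function sketch_column_order(;
        statename_column = "states_ut",
        serotypes = ["O", "A", "Asia1"],
        prefix = "serotype_",
        suffix_order = ["_count_pre", "_pct_pre", "_count_post", "_pct_post"],
    )
    # Aggregated columns: counts only, pre before post.
    all_counts = [prefix * "all" * suffix for suffix in suffix_order if occursin("count", suffix)]
    # Serotype-specific columns: outer loop over serotypes, inner loop over suffixes.
    specific = [prefix * s * suffix for s in serotypes for suffix in suffix_order]
    return vcat(statename_column, all_counts, specific)
end

order = sketch_column_order()
# order[1:4] == ["states_ut", "serotype_all_count_pre",
#                "serotype_all_count_post", "serotype_O_count_pre"]
```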

FMDData.sort_states! Method

julia

sort_states!(
    df::DataFrame;
    statename_column = :states_ut,
    totals_key = "total"
)

Sort the dataframe by alphabetical order of the states and list the totals row at the bottom. Preserves the original order of rows if there are duplicates.
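On a plain vector of state names, the same idea can be sketched with a tuple sort key and a stable sort. `sketch_sort_states` is a hypothetical helper, and the case-insensitive comparison with `totals_key` is an assumption, not documented behavior.

```julia
# Hypothetical helper, not part of FMDData.
function sketch_sort_states(states::AbstractVector{<:AbstractString}; totals_key = "total")
    # The key is (is_totals_row, name): `false` sorts before `true`, so the
    # totals row lands last, and the remaining names sort alphabetically.
    # MergeSort is stable, so rows with equal keys keep their original order.
    # Assumption: the totals key is compared case-insensitively.
    return sort(states; by = s -> (lowercase(s) == totals_key, s), alg = MergeSort)
end

sketch_sort_states(["Kerala", "Total", "Assam", "Bihar"])
# returns ["Assam", "Bihar", "Kerala", "Total"]
```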

FMDData.check_duplicated_states Function

julia

check_duplicated_states(
    df::DataFrame,
    column::Symbol = :states_ut,
)

Check if there are duplicated states in the data.

FMDData.check_missing_states Function

julia

check_missing_states(
    df::DataFrame,
    column::Symbol = :states_ut,
)

Check if the states column of the data contains missing values.

FMDData.correct_all_state_names Function

julia

correct_all_state_names(
    df::DataFrame,
    column::Symbol = :states_ut,
    states_dict::Dict = FMDData.states_dict
)

Correct all state name values in the data.

FMDData.all_totals_check Method

julia

all_totals_check(
    df::DataFrame;
    column::Symbol = :states_ut,
    totals_key = "total",
    allowed_serotypes = vcat("all", default_allowed_serotypes),
    reg::Regex, # Note: This is a positional argument in the function signature.
    atol = 0.0,
    digits = 1
)

Checks if the totals row in a DataFrame is accurate.

This function has two main methods:

all_totals_check(df::DataFrame; ...): This is the main method, which calculates the totals and then compares them to the existing totals row in the DataFrame.
all_totals_check(totals_dict::OrderedDict, df::DataFrame; ...): This method is used when the totals have already been calculated and are passed in as a dictionary.

The function calculates the totals for both counts and seroprevalence. For counts, it calculates a simple sum. For seroprevalence, it calculates a weighted sum based on the relevant counts (pre- or post-vaccination).
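One way to read the weighted calculation is as a count-weighted average of the state-level percentages. The sketch below illustrates that reading only; `sketch_weighted_pct` is a hypothetical helper, and the exact weighting used by the package should be confirmed against its source.

```julia
# Hypothetical helper, not part of FMDData.
# `pcts` are state-level seroprevalence percentages and `counts` are the
# matching state-level sample sizes (pre- or post-vaccination).
function sketch_weighted_pct(
        pcts::AbstractVector{<:Real},
        counts::AbstractVector{<:Integer};
        digits = 1,
    )
    length(pcts) == length(counts) ||
        throw(ArgumentError("pcts and counts must have the same length"))
    # Weight each percentage by its sample size, then normalize by the
    # total sample size (a simple sum, as used for the count totals).
    return round(sum(pcts .* counts) / sum(counts); digits = digits)
end

sketch_weighted_pct([50.0, 80.0], [100, 300])  # returns 72.5
```

In this example the unweighted mean would be 65.0; the larger state pulls the weighted total toward its own value.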

Arguments

df: The DataFrame to check.
column: The column containing the state/UT names. Defaults to :states_ut.
totals_key: The key used to identify the totals row. Defaults to "total".
allowed_serotypes: A vector of allowed serotypes.
reg: A regular expression used to select the columns to check.
atol: The absolute tolerance to use when comparing floating-point numbers. Defaults to 0.0.
digits: The number of digits to round to. Defaults to 1.

FMDData.calculate_all_totals Method

julia

calculate_all_totals(
    df::DataFrame;
    column::Symbol = :states_ut,
    totals_key = "total",
    allowed_serotypes = vcat("all", default_allowed_serotypes),
    reg::Regex, # Note: This is a positional argument in the function signature.
    digits = 1
)

Calculate all totals using the appropriate method of the internal function _calculate_totals!(), depending on whether the column is a Float (seroprevalence) or an Integer (count). Uses the internal function _collect_totals_check_args() to identify which arguments need to be passed to the _calculate_totals!() function. Uses the internal function _totals_row_selectors() to extract the totals row from the dataframe, for use when calculating the weighted total seroprevalence for each serotype.

Arguments

df::DataFrame: The input DataFrame.
column::Symbol: Symbol of the column containing state/UT names (default: :states_ut).
totals_key::String: String key used to identify the totals row (default: "total").
allowed_serotypes::Vector{String}: Vector of allowed serotype strings.
reg::Regex: A positional Regex argument to select columns for totals calculation.
digits::Int: Number of digits for rounding (default: 1).

FMDData.has_totals_row Function

julia

has_totals_row(
    df::DataFrame,
    column::Symbol = :states_ut,
    possible_keys = ["total", "totals"]
)

Check if the table has a totals row.

df should have, at the very least, cleaned column names using the clean_colnames() function.

FMDData.select_calculated_totals! Function

julia

select_calculated_totals!(
    df::DataFrame,
    column::Symbol = :states_ut,
    totals_key = "total",
    calculated_totals_key = "total calculated"
)

If the cleaned data contains both a provided and a calculated totals row, strip the provided row and rename the calculated row.

FMDData.totals_check Function

julia

totals_check(
    totals::DataFrameRow,
    calculated_totals::OrderedDict,
    column::Symbol = :states_ut;
    atol = 0.0
)

Check if the totals provided in a DataFrameRow match the calculated totals.

Arguments

totals::DataFrameRow: A row from a DataFrame, typically the 'total' row.
calculated_totals::OrderedDict: An OrderedDict where keys are column names and values are the calculated totals for these columns.
column::Symbol: The symbol for the column containing state/UT names, used for error messaging if totals don't match (default: :states_ut).
atol::Float64: Absolute tolerance used for comparing floating-point numbers (default: 0.0).

Returns Try.Ok(nothing) if all totals match, or Try.Err with a descriptive message if discrepancies are found.
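The role of atol can be illustrated with Base's isapprox on plain dictionaries. This sketch is not the package's implementation: `sketch_totals_mismatches` is a hypothetical helper that returns the names of mismatched columns instead of a Try.Ok / Try.Err value, and it uses Dict in place of a DataFrameRow and an OrderedDict.

```julia
# Hypothetical helper, not part of FMDData.
function sketch_totals_mismatches(provided::AbstractDict, calculated::AbstractDict; atol = 0.0)
    # Note: when atol is 0, isapprox falls back to its default relative
    # tolerance for floating-point values, so only tiny rounding noise passes.
    mismatched = [
        name for (name, value) in calculated
            if !isapprox(provided[name], value; atol = atol)
    ]
    return sort(mismatched)
end

provided = Dict("serotype_all_count_pre" => 400, "serotype_O_pct_pre" => 72.4)
calculated = Dict("serotype_all_count_pre" => 400, "serotype_O_pct_pre" => 72.5)

sketch_totals_mismatches(provided, calculated)              # ["serotype_O_pct_pre"]
sketch_totals_mismatches(provided, calculated; atol = 0.2)  # empty: within tolerance
```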