tab_err.error_type#

Classes#

`AddDelta`	Adds a delta to values in a column.
`CategorySwap`	Simulate incorrect labels in a column that contains categorical values.
`ErrorType`	Error Type Abstract Base Class.
`ErrorTypeConfig`	Parameters that describe the error type.
`Extraneous`	Adds extraneous strings around the values in a column.
`MissingValue`	Insert missing values into a column.
`Mistype`	Insert incorrectly typed values into a column. Note that the dtype of the column is changed by this operation.
`Mojibake`	Inserts mojibake into a column containing strings.
`Outlier`	Inserts outliers into a column by pushing data points outside the interquartile range (IQR) boundaries.
`Permutate`	Permutates the parts of a compound value in a column.
`Replace`	Replace a part of strings within a column.
`Typo`	Inserts realistic typos into a column containing strings.
`WrongUnit`	Simulate a column containing values that are scaled because they are not stored in the same unit.

Package Contents#

class tab_err.error_type.AddDelta(config: tab_err.error_type._config.ErrorTypeConfig | dict | None = None, seed: int | None = None)#

Bases: tab_err.error_type._error_type.ErrorType

Adds a delta to values in a column.

apply(data: pandas.DataFrame, error_mask: pandas.DataFrame, column: str | int) → pandas.Series#

Applies an ErrorType to a column of ‘data’. Does type and shape checking and creates a random number generator.

Parameters:

data (pd.DataFrame) – The Pandas DataFrame containing the column where errors are to be introduced.
error_mask (pd.DataFrame) – The Pandas DataFrame containing the error mask for ‘column’.
column (str | int) – The index in the ‘data’ and ‘error_mask’ DataFrames where errors are to be introduced.

Returns:

The data column, ‘column’, after errors of ErrorType at the locations specified by ‘error_mask’ are introduced.

Return type:

pd.Series

classmethod from_dict(data: dict[str, Any]) → ErrorType#

Deserialize an ErrorType object from a dictionary.

Parameters:: data (dict[str, Any]) – A dictionary representation of the ErrorType object.
Returns:: An ErrorType object deserialized from the dictionary.
Return type:: ErrorType

get_valid_columns(data: pandas.DataFrame) → list[str | int]#: Finds the valid columns to which the error type can be applied. Wrapper around _get_valid_columns.

to_dict() → dict[str, Any]#

Serializes the ErrorType object into a dictionary.

Returns:: A dictionary representation of the ErrorType object.
Return type:: dict[str, Any]
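
As a rough illustration of what adding a delta under an error mask amounts to, here is a plain-Python sketch. It does not use tab_err: the helper `add_delta` and the list-based mask are assumptions made for this example only.

```python
def add_delta(values, mask, delta):
    """Return a copy of `values` with `delta` added wherever `mask` is True.

    Hypothetical helper for illustration; not part of tab_err.
    """
    if len(values) != len(mask):
        raise ValueError("values and mask must have the same length")
    return [v + delta if m else v for v, m in zip(values, mask)]


prices = [10.0, 12.5, 9.0, 11.0]
error_mask = [False, True, False, True]

# Only the cells flagged in the mask are shifted by the delta.
corrupted = add_delta(prices, error_mask, delta=5)
```

In the library itself, the delta would presumably come from `add_delta_value` in the config rather than from a function argument.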

class tab_err.error_type.CategorySwap(config: tab_err.error_type._config.ErrorTypeConfig | dict | None = None, seed: int | None = None)#

Bases: tab_err.error_type._error_type.ErrorType

Simulate incorrect labels in a column that contains categorical values.

apply(data: pandas.DataFrame, error_mask: pandas.DataFrame, column: str | int) → pandas.Series#

Applies an ErrorType to a column of ‘data’. Does type and shape checking and creates a random number generator.

Parameters:

data (pd.DataFrame) – The Pandas DataFrame containing the column where errors are to be introduced.
error_mask (pd.DataFrame) – The Pandas DataFrame containing the error mask for ‘column’.
column (str | int) – The index in the ‘data’ and ‘error_mask’ DataFrames where errors are to be introduced.

Returns:

The data column, ‘column’, after errors of ErrorType at the locations specified by ‘error_mask’ are introduced.

Return type:

pd.Series

classmethod from_dict(data: dict[str, Any]) → ErrorType#

Deserialize an ErrorType object from a dictionary.

Parameters:: data (dict[str, Any]) – A dictionary representation of the ErrorType object.
Returns:: An ErrorType object deserialized from the dictionary.
Return type:: ErrorType

get_valid_columns(data: pandas.DataFrame) → list[str | int]#: Finds the valid columns to which the error type can be applied. Wrapper around _get_valid_columns.

to_dict() → dict[str, Any]#

Serializes the ErrorType object into a dictionary.

Returns:: A dictionary representation of the ErrorType object.
Return type:: dict[str, Any]
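
The idea of swapping a label for a different category, drawn either uniformly or in proportion to how often each category occurs, might be sketched as follows. This is a standalone sketch, not tab_err's implementation: the function `swap_categories` is hypothetical, and excluding the original label from the candidates is an assumption.

```python
import random
from collections import Counter


def swap_categories(values, mask, weighing="uniform", seed=None):
    """Replace masked labels with a different category (illustrative only)."""
    rng = random.Random(seed)
    counts = Counter(values)
    result = []
    for value, masked in zip(values, mask):
        candidates = [c for c in counts if c != value]
        if not masked or not candidates:
            # Unmasked cell, or only one category exists: keep the value.
            result.append(value)
            continue
        if weighing == "frequency":
            # Frequent categories are more likely to be chosen.
            weights = [counts[c] for c in candidates]
        else:
            # "uniform": every other category is equally likely.
            weights = [1] * len(candidates)
        result.append(rng.choices(candidates, weights=weights, k=1)[0])
    return result


labels = ["cat", "dog", "dog", "bird", "cat", "dog"]
mask = [True, False, True, False, False, True]
swapped = swap_categories(labels, mask, weighing="frequency", seed=0)
```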

class tab_err.error_type.ErrorType(config: tab_err.error_type._config.ErrorTypeConfig | dict | None = None, seed: int | None = None)#

Bases: abc.ABC

Error Type Abstract Base Class.

apply(data: pandas.DataFrame, error_mask: pandas.DataFrame, column: str | int) → pandas.Series#

Applies an ErrorType to a column of ‘data’. Does type and shape checking and creates a random number generator.

Parameters:

data (pd.DataFrame) – The Pandas DataFrame containing the column where errors are to be introduced.
error_mask (pd.DataFrame) – The Pandas DataFrame containing the error mask for ‘column’.
column (str | int) – The index in the ‘data’ and ‘error_mask’ DataFrames where errors are to be introduced.

Returns:

The data column, ‘column’, after errors of ErrorType at the locations specified by ‘error_mask’ are introduced.

Return type:

pd.Series

classmethod from_dict(data: dict[str, Any]) → ErrorType#

Deserialize an ErrorType object from a dictionary.

Parameters:: data (dict[str, Any]) – A dictionary representation of the ErrorType object.
Returns:: An ErrorType object deserialized from the dictionary.
Return type:: ErrorType

get_valid_columns(data: pandas.DataFrame) → list[str | int]#: Finds the valid columns to which the error type can be applied. Wrapper around _get_valid_columns.

to_dict() → dict[str, Any]#

Serializes the ErrorType object into a dictionary.

Returns:: A dictionary representation of the ErrorType object.
Return type:: dict[str, Any]
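
The division of labour described for `apply` (shared checks and random number generator setup in the base class, error-specific logic in the subclass) resembles a template-method pattern. The sketch below shows that pattern in plain Python with dictionaries of lists standing in for DataFrames. The names `SketchErrorType`, `_apply`, and `Negate` are hypothetical and are not taken from tab_err.

```python
import random
from abc import ABC, abstractmethod


class SketchErrorType(ABC):
    """Illustrative base class; not the tab_err ErrorType."""

    def __init__(self, seed=None):
        self.seed = seed

    def apply(self, data, error_mask, column):
        # Shared validation lives in the base class.
        if column not in data or column not in error_mask:
            raise KeyError(f"column {column!r} missing from data or mask")
        if len(data[column]) != len(error_mask[column]):
            raise ValueError("data and mask columns differ in length")
        # A fresh generator per call keeps results reproducible for a seed.
        rng = random.Random(self.seed)
        return self._apply(data[column], error_mask[column], rng)

    @abstractmethod
    def _apply(self, values, mask, rng):
        """Error-specific logic, supplied by each subclass."""


class Negate(SketchErrorType):
    """Toy error type that flips the sign of masked values."""

    def _apply(self, values, mask, rng):
        return [-v if m else v for v, m in zip(values, mask)]


data = {"x": [1, 2, 3]}
error_mask = {"x": [True, False, True]}
result = Negate(seed=42).apply(data, error_mask, "x")
```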

class tab_err.error_type.ErrorTypeConfig#

Parameters that describe the error type.

Holds the arguments that are specific to each error type. Because most error types do not share arguments, this dataclass has many attributes, most of which keep their default values for any given error type.

encoding_sender#

When creating Mojibake, used to encode strings to bytes. Defaults to None.

Type:: str | None, optional

encoding_receiver#

When creating Mojibake, used to decode bytes back to strings. Defaults to None.

Type:: str | None, optional

typo_keyboard_layout#

When using Typo, the keyboard layout used by the typer. Defaults to “ansi-qwerty”.

Type:: str

typo_error_period#

When using Typo, the period, counted in words, at which typos occur; a value of 10 inserts a typo into every 10th word. Defaults to 10.

Type:: int

missing_value#

Token used to indicate missing values in Pandas. Defaults to None.

Type:: str | None, optional

mislabel_weighing#

Weighting of the distribution that mislabels are drawn from. Either “uniform” or “frequency”. Defaults to “uniform”.

Type:: str

mislabel_weights#

_description_. Defaults to None.

Type:: dict[Any, float] | None

mistype_dtype#

dtype of the column that is incorrectly typed. One of “object”, “string”, “int64”, “Int64”, “float64”, “Float64”. Defaults to None.

Type:: str | None

wrong_unit_scaling#

Function that scales a value from one unit to another. Defaults to None.

Type:: Callable | None
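
A scaling function of this kind can be any callable that maps a value in one unit to the same quantity in another. The sketch below applies such a callable to masked cells only. It is a standalone illustration: `apply_wrong_unit` and `metres_to_centimetres` are hypothetical names, and the list-based mask stands in for the DataFrame mask.

```python
def metres_to_centimetres(value):
    """Example scaling callable: 1 m equals 100 cm."""
    return value * 100


def apply_wrong_unit(values, mask, scaling):
    """Scale only the masked cells (illustrative helper, not tab_err)."""
    return [scaling(v) if m else v for v, m in zip(values, mask)]


heights_m = [1.5, 2.0, 1.75]
error_mask = [False, True, False]

# The second entry now looks like it was recorded in centimetres.
corrupted = apply_wrong_unit(heights_m, error_mask, metres_to_centimetres)
```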

permutation_separator#

A character that separates the parts of structured text, e.g. ‘ ’ in an address or ‘-’ in a date. Defaults to “ ”.

Type:: str

permutation_automation_pattern#

Permutations either all follow the same pattern (fixed) or not (random). Defaults to “random”.

Type:: str

permutation_pattern#

Manually specifies the pattern that the permutations follow. Overrides the automation pattern if set. Defaults to None.

Type:: list[int] | None
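
To make the three permutation settings concrete, here is a minimal sketch that splits a compound value on a separator and reorders the parts. It is not tab_err code: `permute_parts` is a hypothetical helper, and reading the pattern as a list of source indices (so `[2, 1, 0]` reverses three parts) is an assumption about the semantics.

```python
import random


def permute_parts(value, separator=" ", pattern=None, rng=None):
    """Reorder the separator-delimited parts of `value` (illustrative)."""
    parts = value.split(separator)
    if pattern is None:
        # No manual pattern: draw a random order that differs from the
        # original one whenever there is more than one part.
        rng = rng or random.Random()
        pattern = list(range(len(parts)))
        while len(parts) > 1 and pattern == sorted(pattern):
            rng.shuffle(pattern)
    elif sorted(pattern) != list(range(len(parts))):
        raise ValueError("pattern must be a permutation of the part indices")
    return separator.join(parts[i] for i in pattern)


# A manually specified pattern turns an ISO date into day-month-year.
fixed = permute_parts("2024-01-31", separator="-", pattern=[2, 1, 0])

# Without a pattern, the order is drawn at random.
shuffled = permute_parts("a b c", rng=random.Random(3))
```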

extraneous_value_template#

Template string used to add extraneous data to the value. The position of the original value in the template is marked by the placeholder ‘{value}’. Defaults to None.

Type:: str | None
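
A template with a ‘{value}’ placeholder can be filled in with a simple substitution. The sketch below is only an illustration of that idea; the helper `add_extraneous` is hypothetical, and using `str.replace` rather than `str.format` is a choice made here so that other braces in the template are left alone.

```python
def add_extraneous(value, template):
    """Wrap `value` in extra text given by `template` (illustrative)."""
    if "{value}" not in template:
        raise ValueError("template must contain the '{value}' placeholder")
    return template.replace("{value}", str(value))


# Surround a clean city name with leftover form text.
noisy = add_extraneous("Berlin", "City: {value}.")
```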

replace_what#

Substring that the Replace error type looks for and replaces with replace_with. Defaults to None.

Type:: str | None

replace_with#

String that the Replace error type substitutes for replace_what. Defaults to “”.

Type:: str
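
With the default empty `replace_with`, a replacement simply deletes the matched substring. A minimal standalone sketch of that behaviour follows; `replace_part` is a hypothetical helper and does not call tab_err.

```python
def replace_part(values, mask, what, with_=""):
    """Replace `what` by `with_` in the masked strings (illustrative)."""
    return [v.replace(what, with_) if m else v for v, m in zip(values, mask)]


phone_numbers = ["030-1234", "040-5678"]
error_mask = [True, False]

# The dash disappears from the first entry only.
corrupted = replace_part(phone_numbers, error_mask, what="-")
```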

add_delta_value#

Value that is added to the value by the AddDelta Error Type. Defaults to None.

Type:: float | int | None

outlier_coin_flip_threshold#

Threshold of the coin flip that determines the direction (positive or negative) of the outlier. Defaults to 0.5.

Type:: float

outlier_coefficient#

Coefficient that determines the magnitude of the outliers for the Outlier Error Type. Defaults to 1.0.

Type:: float

outlier_noise_coeff#

Coefficient that influences the standard deviation of the noise added to the outliers for the Outlier Error Type. Defaults to 0.1.

Type:: float

static from_dict(data: dict[str, Any]) → ErrorTypeConfig#: Deserializes the ErrorTypeConfig from a dict.

to_dict() → dict[str, Any]#: Serializes the ErrorTypeConfig to a dict.

add_delta_value: float | int | None = None#

encoding_receiver: str | None = None#

encoding_sender: str | None = None#

extraneous_value_template: str | None = None#

mislabel_weighing: str = 'uniform'#

mislabel_weights: dict[Any, float] | None = None#

missing_value: str | None = None#

mistype_dtype: str | None = None#

outlier_coefficient: float = 1.0#

outlier_coin_flip_threshold: float = 0.5#

outlier_noise_coeff: float = 0.1#

permutation_automation_pattern: str = 'random'#

permutation_pattern: list[int] | None = None#

permutation_separator: str = ' '#

replace_what: str | None = None#

replace_with: str = ''#

typo_error_period: int = 10#

typo_keyboard_layout: str = 'ansi-qwerty'#

wrong_unit_scaling: Callable | None = None#
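
The `to_dict` / `from_dict` pair suggests a plain dictionary round trip. The following sketch shows one common way to get that behaviour from a dataclass. `SketchConfig` is a hypothetical stand-in with only four of the fields listed above, and rejecting unknown keys is an assumption, not documented tab_err behaviour.

```python
from dataclasses import asdict, dataclass, fields
from typing import Any, Optional


@dataclass
class SketchConfig:
    """Cut-down stand-in for ErrorTypeConfig (illustrative only)."""

    typo_error_period: int = 10
    permutation_separator: str = " "
    replace_what: Optional[str] = None
    replace_with: str = ""

    def to_dict(self) -> dict[str, Any]:
        # asdict walks the dataclass fields and builds a plain dict.
        return asdict(self)

    @staticmethod
    def from_dict(data: dict[str, Any]) -> "SketchConfig":
        known = {f.name for f in fields(SketchConfig)}
        unknown = set(data) - known
        if unknown:
            raise ValueError(f"unknown config keys: {sorted(unknown)}")
        return SketchConfig(**data)


config = SketchConfig(replace_what="-")
restored = SketchConfig.from_dict(config.to_dict())
```

A field such as `wrong_unit_scaling` holds a callable, which a plain dictionary can carry but a JSON file cannot; how tab_err deals with that case is not shown here.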

class tab_err.error_type.Extraneous(config: tab_err.error_type._config.ErrorTypeConfig | dict | None = None, seed: int | None = None)#

Bases: tab_err.error_type._error_type.ErrorType

Adds extraneous strings around the values in a column.

apply(data: pandas.DataFrame, error_mask: pandas.DataFrame, column: str | int) → pandas.Series#

Applies an ErrorType to a column of ‘data’. Does type and shape checking and creates a random number generator.

Parameters:

data (pd.DataFrame) – The Pandas DataFrame containing the column where errors are to be introduced.
error_mask (pd.DataFrame) – The Pandas DataFrame containing the error mask for ‘column’.
column (str | int) – The index in the ‘data’ and ‘error_mask’ DataFrames where errors are to be introduced.

Returns:

The data column, ‘column’, after errors of ErrorType at the locations specified by ‘error_mask’ are introduced.

Return type:

pd.Series

classmethod from_dict(data: dict[str, Any]) → ErrorType#

Deserialize an ErrorType object from a dictionary.

Parameters:: data (dict[str, Any]) – A dictionary representation of the ErrorType object.
Returns:: An ErrorType object deserialized from the dictionary.
Return type:: ErrorType

get_valid_columns(data: pandas.DataFrame) → list[str | int]#: Finds the valid columns to which the error type can be applied. Wrapper around _get_valid_columns.

to_dict() → dict[str, Any]#

Serializes the ErrorType object into a dictionary.

Returns:: A dictionary representation of the ErrorType object.
Return type:: dict[str, Any]

class tab_err.error_type.MissingValue(config: tab_err.error_type._config.ErrorTypeConfig | dict | None = None, seed: int | None = None)#

Bases: tab_err.error_type._error_type.ErrorType

Insert missing values into a column.

Missing value handling is not a solved problem in pandas and is under active development. Today, the best heuristic for inserting missing values is to assign None to the value. Pandas will then choose the missing value sentinel based on the column dtype (https://pandas.pydata.org/docs/user_guide/missing_data.html#inserting-missing-data).

apply(data: pandas.DataFrame, error_mask: pandas.DataFrame, column: str | int) → pandas.Series#

Applies an ErrorType to a column of ‘data’. Does type and shape checking and creates a random number generator.

Parameters:

data (pd.DataFrame) – The Pandas DataFrame containing the column where errors are to be introduced.
error_mask (pd.DataFrame) – The Pandas DataFrame containing the error mask for ‘column’.
column (str | int) – The index in the ‘data’ and ‘error_mask’ DataFrames where errors are to be introduced.

Returns:

The data column, ‘column’, after errors of ErrorType at the locations specified by ‘error_mask’ are introduced.

Return type:

pd.Series

classmethod from_dict(data: dict[str, Any]) → ErrorType#

Deserialize an ErrorType object from a dictionary.

Parameters:: data (dict[str, Any]) – A dictionary representation of the ErrorType object.
Returns:: An ErrorType object deserialized from the dictionary.
Return type:: ErrorType

get_valid_columns(data: pandas.DataFrame) → list[str | int]#: Finds the valid columns to which the error type can be applied. Wrapper around _get_valid_columns.

to_dict() → dict[str, Any]#

Serializes the ErrorType object into a dictionary.

Returns:: A dictionary representation of the ErrorType object.
Return type:: dict[str, Any]

class tab_err.error_type.Mistype(config: tab_err.error_type._config.ErrorTypeConfig | dict | None = None, seed: int | None = None)#

Bases: tab_err.error_type._error_type.ErrorType

Insert incorrectly typed values into a column. Note that the dtype of the column is changed by this operation.

String / Object is the dead end of typing

In an effort to keep the code relatively simple, we cast the corrupted column to an Object Dtype.

apply(data: pandas.DataFrame, error_mask: pandas.DataFrame, column: str | int) → pandas.Series#

Applies an ErrorType to a column of ‘data’. Does type and shape checking and creates a random number generator.

Parameters:

data (pd.DataFrame) – The Pandas DataFrame containing the column where errors are to be introduced.
error_mask (pd.DataFrame) – The Pandas DataFrame containing the error mask for ‘column’.
column (str | int) – The index in the ‘data’ and ‘error_mask’ DataFrames where errors are to be introduced.

Returns:

The data column, ‘column’, after errors of ErrorType at the locations specified by ‘error_mask’ are introduced.

Return type:

pd.Series

classmethod from_dict(data: dict[str, Any]) → ErrorType#

Deserialize an ErrorType object from a dictionary.

Parameters:: data (dict[str, Any]) – A dictionary representation of the ErrorType object.
Returns:: An ErrorType object deserialized from the dictionary.
Return type:: ErrorType

get_valid_columns(data: pandas.DataFrame) → list[str | int]#: Finds the valid columns to which the error type can be applied. Wrapper around _get_valid_columns.

to_dict() → dict[str, Any]#

Serializes the ErrorType object into a dictionary.

Returns:: A dictionary representation of the ErrorType object.
Return type:: dict[str, Any]

class tab_err.error_type.Mojibake#

Bases: tab_err.error_type._error_type.ErrorType

Inserts mojibake into a column containing strings.
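
Mojibake arises when text is encoded to bytes with one encoding and decoded with another, which is the role the `encoding_sender` and `encoding_receiver` settings play. The sketch below reproduces the effect with the standard library only. `make_mojibake` is a hypothetical helper, and the UTF-8 / Latin-1 pair is just one well-known combination chosen for the example.

```python
def make_mojibake(text, encoding_sender="utf-8", encoding_receiver="latin-1"):
    """Encode with the sender's codec, decode with the receiver's."""
    raw = text.encode(encoding_sender)
    # errors="replace" keeps the sketch from failing on byte sequences
    # that the receiver's codec cannot decode.
    return raw.decode(encoding_receiver, errors="replace")


# The two UTF-8 bytes of "é" are read as two separate Latin-1 characters.
garbled = make_mojibake("café")  # -> 'cafÃ©'

# Pure ASCII text is identical in both encodings and stays intact.
unchanged = make_mojibake("cafe")
```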

class tab_err.error_type.Outlier(config: tab_err.error_type._config.ErrorTypeConfig | dict | None = None, seed: int | None = None)#

Bases: tab_err.error_type._error_type.ErrorType

Inserts outliers into a column by pushing data points outside the interquartile range (IQR) boundaries.

Data points below the mean are pushed towards lower outliers, while those above the mean are pushed towards upper outliers. The outlier_coefficient controls how far values are pushed relative to the IQR. An outlier_coefficient of 1.0 means the push is equal to half of the IQR, shifting the mean value exactly to the edge of the IQR. Values that deviate more from the mean will be pushed beyond the IQR boundary. When outlier_coefficient is less than 1.0, values, including the mean, are pushed less drastically, potentially keeping them within the IQR.

The push is calculated as:

push = outlier_coefficient * |upper_boundary - mean_value|

Values above the mean are pushed towards the upper boundary, and values below the mean are pushed towards the lower boundary. If a value equals the mean, a coin flip decides whether it is pushed towards the upper or lower boundary.

After this process, Gaussian noise is added to simulate measurement errors and make the outliers appear more realistic. The amount of noise can be controlled via the outlier_noise_coeff parameter and is scaled with the IQR to ensure it is proportional to the data’s spread.
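
The push-then-noise procedure can be sketched with the standard library. This is an interpretation of the description, not tab_err's code: `push_outliers` is a hypothetical function, the upper boundary is assumed to be the third quartile, the noise standard deviation is assumed to be `noise_coeff * IQR`, and the coin flip is assumed to push upwards when the draw falls below the threshold.

```python
import random
import statistics


def push_outliers(values, mask, coefficient=1.0, noise_coeff=0.1,
                  coin_flip_threshold=0.5, seed=None):
    """Push masked values away from the mean and add Gaussian noise."""
    rng = random.Random(seed)
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    mean = statistics.fmean(values)
    # push = outlier_coefficient * |upper_boundary - mean_value|
    push = coefficient * abs(q3 - mean)

    result = []
    for value, masked in zip(values, mask):
        if not masked:
            result.append(value)
            continue
        if value > mean:
            direction = 1
        elif value < mean:
            direction = -1
        else:
            # A value exactly at the mean goes either way by coin flip.
            direction = 1 if rng.random() < coin_flip_threshold else -1
        noise = rng.gauss(0.0, noise_coeff * iqr)
        result.append(value + direction * push + noise)
    return result


readings = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
error_mask = [True] + [False] * 8 + [True]

# With the noise switched off, the smallest and largest readings move
# outwards by the same distance.
pushed = push_outliers(readings, error_mask, noise_coeff=0.0, seed=0)
```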

apply(data: pandas.DataFrame, error_mask: pandas.DataFrame, column: str | int) → pandas.Series#

Applies an ErrorType to a column of ‘data’. Does type and shape checking and creates a random number generator.

Parameters:

data (pd.DataFrame) – The Pandas DataFrame containing the column where errors are to be introduced.
error_mask (pd.DataFrame) – The Pandas DataFrame containing the error mask for ‘column’.
column (str | int) – The index in the ‘data’ and ‘error_mask’ DataFrames where errors are to be introduced.

Returns:

The data column, ‘column’, after errors of ErrorType at the locations specified by ‘error_mask’ are introduced.

Return type:

pd.Series

classmethod from_dict(data: dict[str, Any]) → ErrorType#

Deserialize an ErrorType object from a dictionary.

Parameters:: data (dict[str, Any]) – A dictionary representation of the ErrorType object.
Returns:: An ErrorType object deserialized from the dictionary.
Return type:: ErrorType

get_valid_columns(data: pandas.DataFrame) → list[str | int]#: Finds the valid columns to which the error type can be applied. Wrapper around _get_valid_columns.

to_dict() → dict[str, Any]#

Serializes the ErrorType object into a dictionary.

Returns:: A dictionary representation of the ErrorType object.
Return type:: dict[str, Any]

class tab_err.error_type.Permutate(config: tab_err.error_type._config.ErrorTypeConfig | dict | None = None, seed: int | None = None)#

Bases: tab_err.error_type._error_type.ErrorType

Permutates the parts of a compound value in a column.

apply(data: pandas.DataFrame, error_mask: pandas.DataFrame, column: str | int) → pandas.Series#

Applies an ErrorType to a column of ‘data’. Does type and shape checking and creates a random number generator.

Parameters:

data (pd.DataFrame) – The Pandas DataFrame containing the column where errors are to be introduced.
error_mask (pd.DataFrame) – The Pandas DataFrame containing the error mask for ‘column’.
column (str | int) – The index in the ‘data’ and ‘error_mask’ DataFrames where errors are to be introduced.

Returns:

The data column, ‘column’, after errors of ErrorType at the locations specified by ‘error_mask’ are introduced.

Return type:

pd.Series

classmethod from_dict(data: dict[str, Any]) → ErrorType#

Deserialize an ErrorType object from a dictionary.

Parameters:: data (dict[str, Any]) – A dictionary representation of the ErrorType object.
Returns:: An ErrorType object deserialized from the dictionary.
Return type:: ErrorType

get_valid_columns(data: pandas.DataFrame) → list[str | int]#: Finds the valid columns to which the error type can be applied. Wrapper around _get_valid_columns.

to_dict() → dict[str, Any]#

Serializes the ErrorType object into a dictionary.

Returns:: A dictionary representation of the ErrorType object.
Return type:: dict[str, Any]

class tab_err.error_type.Replace#

Bases: tab_err.error_type._error_type.ErrorType

Replace a part of strings within a column.

class tab_err.error_type.Typo#

Bases: tab_err.error_type._error_type.ErrorType

Inserts realistic typos into a column containing strings.

Typo imitates a typist who misses the correct key. For a given keyboard-layout and key, Typo maps all keys that physically border the given key on the given layout. It assumes that all bordering keys are equally likely to be hit by the typist.

Typo assumes that words are separated by whitespaces. Applied to a cell, the period with which Typo will corrupt words in that cell is controlled by the parameter typo_error_period. By default, Typo will insert a typo into every 10th word. Typo will always insert at least one typo into an affected cell.
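
The behaviour described above (neighbouring keys, a word period, and at least one typo per affected cell) could look roughly like the sketch below. It is not the library's implementation: `NEIGHBORS` is a small, hand-written excerpt of a QWERTY layout, `insert_typos` is a hypothetical function, and corrupting exactly the last word of each period is an assumption.

```python
import random

# Hypothetical, partial neighbour map for a QWERTY layout.
NEIGHBORS = {
    "a": "qwsz",
    "s": "awedxz",
    "d": "serfcx",
    "e": "wsdr",
    "o": "iklp",
    "t": "rfgy",
}


def insert_typos(text, period=10, seed=None):
    """Hit a neighbouring key in every `period`-th word (illustrative)."""
    rng = random.Random(seed)
    words = text.split(" ")
    # Every period-th word; if the cell is shorter than one period,
    # fall back to a single randomly chosen word.
    targets = list(range(period - 1, len(words), period))
    if not targets:
        targets = [rng.randrange(len(words))]
    for i in targets:
        positions = [j for j, ch in enumerate(words[i]) if ch in NEIGHBORS]
        if not positions:
            continue  # no character of this word is in the partial map
        j = rng.choice(positions)
        wrong_key = rng.choice(NEIGHBORS[words[i][j]])
        words[i] = words[i][:j] + wrong_key + words[i][j + 1:]
    return " ".join(words)


# With a period of 2, the 2nd and 4th words receive a typo.
corrupted = insert_typos("the data set looks fine", period=2, seed=7)
```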

class tab_err.error_type.WrongUnit(config: tab_err.error_type._config.ErrorTypeConfig | dict | None = None, seed: int | None = None)#

Bases: tab_err.error_type._error_type.ErrorType

Simulate a column containing values that are scaled because they are not stored in the same unit.

apply(data: pandas.DataFrame, error_mask: pandas.DataFrame, column: str | int) → pandas.Series#

Applies an ErrorType to a column of ‘data’. Does type and shape checking and creates a random number generator.

Parameters:

data (pd.DataFrame) – The Pandas DataFrame containing the column where errors are to be introduced.
error_mask (pd.DataFrame) – The Pandas DataFrame containing the error mask for ‘column’.
column (str | int) – The index in the ‘data’ and ‘error_mask’ DataFrames where errors are to be introduced.

Returns:

The data column, ‘column’, after errors of ErrorType at the locations specified by ‘error_mask’ are introduced.

Return type:

pd.Series

classmethod from_dict(data: dict[str, Any]) → ErrorType#

Deserialize an ErrorType object from a dictionary.

Parameters:: data (dict[str, Any]) – A dictionary representation of the ErrorType object.
Returns:: An ErrorType object deserialized from the dictionary.
Return type:: ErrorType

get_valid_columns(data: pandas.DataFrame) → list[str | int]#: Finds the valid columns to which the error type can be applied. Wrapper around _get_valid_columns.

to_dict() → dict[str, Any]#

Serializes the ErrorType object into a dictionary.

Returns:: A dictionary representation of the ErrorType object.
Return type:: dict[str, Any]