tab_err.error_type
Classes
AddDelta | Adds a delta to values in a column.
CategorySwap | Simulate incorrect labels in a column that contains categorical values.
ErrorType | Error Type Abstract Base Class.
ErrorTypeConfig | Parameters that describe the error type.
Extraneous | Adds extraneous strings around the values in a column.
MissingValue | Insert missing values into a column.
Mistype | Insert incorrectly typed values into a column. Note that the dtype of the column is changed by this operation.
Mojibake | Inserts mojibake into a column containing strings.
Outlier | Inserts outliers into a column by pushing data points outside the interquartile range (IQR) boundaries.
Permutate | Permutes the parts of a compound value in a column.
Replace | Replace a part of strings within a column.
Typo | Inserts realistic typos into a column containing strings.
WrongUnit | Simulate a column containing values that are scaled because they are not stored in the same unit.
Package Contents
- class tab_err.error_type.AddDelta(config: tab_err.error_type._config.ErrorTypeConfig | dict | None = None, seed: int | None = None)#
Bases:
tab_err.error_type._error_type.ErrorType
Adds a delta to values in a column.
- apply(data: pandas.DataFrame, error_mask: pandas.DataFrame, column: str | int) pandas.Series#
Applies an ErrorType to a column of ‘data’. Does type and shape checking and creates a random number generator.
- Parameters:
data (pd.DataFrame) – The Pandas DataFrame containing the column where errors are to be introduced.
error_mask (pd.DataFrame) – The Pandas DataFrame containing the error mask for ‘column’.
column (str | int) – The index in the ‘data’ and ‘error_mask’ DataFrames where errors are to be introduced.
- Returns:
The data column, ‘column’, after errors of ErrorType at the locations specified by ‘error_mask’ are introduced.
- Return type:
pd.Series
- classmethod from_dict(data: dict[str, Any]) ErrorType#
Deserialize an ErrorType object from a dictionary.
- Parameters:
data (dict[str, Any]) – A dictionary representation of the ErrorType object.
- Returns:
An ErrorType object deserialized from the dictionary.
- Return type:
ErrorType
- get_valid_columns(data: pandas.DataFrame) list[str | int]#
Finds the valid columns to which the error type can be applied. Wrapper around _get_valid_columns.
- to_dict() dict[str, Any]#
Serializes the ErrorType object into a dictionary.
- Returns:
A dictionary representation of the ErrorType object.
- Return type:
dict[str, Any]
- class tab_err.error_type.CategorySwap(config: tab_err.error_type._config.ErrorTypeConfig | dict | None = None, seed: int | None = None)#
Bases:
tab_err.error_type._error_type.ErrorType
Simulate incorrect labels in a column that contains categorical values.
- apply(data: pandas.DataFrame, error_mask: pandas.DataFrame, column: str | int) pandas.Series#
Applies an ErrorType to a column of ‘data’. Does type and shape checking and creates a random number generator.
- Parameters:
data (pd.DataFrame) – The Pandas DataFrame containing the column where errors are to be introduced.
error_mask (pd.DataFrame) – The Pandas DataFrame containing the error mask for ‘column’.
column (str | int) – The index in the ‘data’ and ‘error_mask’ DataFrames where errors are to be introduced.
- Returns:
The data column, ‘column’, after errors of ErrorType at the locations specified by ‘error_mask’ are introduced.
- Return type:
pd.Series
- classmethod from_dict(data: dict[str, Any]) ErrorType#
Deserialize an ErrorType object from a dictionary.
- Parameters:
data (dict[str, Any]) – A dictionary representation of the ErrorType object.
- Returns:
An ErrorType object deserialized from the dictionary.
- Return type:
ErrorType
- get_valid_columns(data: pandas.DataFrame) list[str | int]#
Finds the valid columns to which the error type can be applied. Wrapper around _get_valid_columns.
- to_dict() dict[str, Any]#
Serializes the ErrorType object into a dictionary.
- Returns:
A dictionary representation of the ErrorType object.
- Return type:
dict[str, Any]
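The two mislabel_weighing options listed under ErrorTypeConfig (“uniform” and “frequency”) can be pictured with a small pure-Python sketch. This is not tab_err's implementation: the helper name swap_label, the exclusion of the original label from the candidates, and the use of raw counts as weights are all assumptions made for illustration.

```python
import random
from collections import Counter


def swap_label(value, column, weighing="uniform", rng=None):
    """Return a label from `column` that differs from `value` (hypothetical helper)."""
    rng = rng or random.Random(0)
    counts = Counter(column)
    # Assumption: the wrong label is drawn from the other observed categories.
    candidates = [label for label in counts if label != value]
    if not candidates:
        return value  # a single-category column offers nothing to swap to
    if weighing == "frequency":
        # Frequent categories are proportionally more likely to be chosen.
        weights = [counts[label] for label in candidates]
    else:  # "uniform"
        weights = [1] * len(candidates)
    return rng.choices(candidates, weights=weights, k=1)[0]


labels = ["cat", "dog", "dog", "bird", "dog"]
print(swap_label("cat", labels, weighing="frequency"))
```

Under “frequency”, "dog" would be picked three times as often as "bird" in this toy column; under “uniform”, both are equally likely.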
- class tab_err.error_type.ErrorType(config: tab_err.error_type._config.ErrorTypeConfig | dict | None = None, seed: int | None = None)#
Bases:
abc.ABC
Error Type Abstract Base Class.
- apply(data: pandas.DataFrame, error_mask: pandas.DataFrame, column: str | int) pandas.Series#
Applies an ErrorType to a column of ‘data’. Does type and shape checking and creates a random number generator.
- Parameters:
data (pd.DataFrame) – The Pandas DataFrame containing the column where errors are to be introduced.
error_mask (pd.DataFrame) – The Pandas DataFrame containing the error mask for ‘column’.
column (str | int) – The index in the ‘data’ and ‘error_mask’ DataFrames where errors are to be introduced.
- Returns:
The data column, ‘column’, after errors of ErrorType at the locations specified by ‘error_mask’ are introduced.
- Return type:
pd.Series
- classmethod from_dict(data: dict[str, Any]) ErrorType#
Deserialize an ErrorType object from a dictionary.
- Parameters:
data (dict[str, Any]) – A dictionary representation of the ErrorType object.
- Returns:
An ErrorType object deserialized from the dictionary.
- Return type:
ErrorType
- get_valid_columns(data: pandas.DataFrame) list[str | int]#
Finds the valid columns to which the error type can be applied. Wrapper around _get_valid_columns.
- to_dict() dict[str, Any]#
Serializes the ErrorType object into a dictionary.
- Returns:
A dictionary representation of the ErrorType object.
- Return type:
dict[str, Any]
- class tab_err.error_type.ErrorTypeConfig#
Parameters that describe the error type.
Holds the arguments that are specific to each error type. Because most error types do not share arguments, this dataclass has many attributes, most of which keep their default values.
- encoding_sender#
When creating Mojibake, used to encode strings to bytes. Defaults to None.
- Type:
str | None, optional
- encoding_receiver#
When creating Mojibake, used to decode bytes back to strings. Defaults to None.
- Type:
str | None, optional
- typo_keyboard_layout#
When using Typo, the keyboard layout used by the typer. Defaults to “ansi-qwerty”.
- Type:
str
- typo_error_period#
When using Typo, the period at which the error occurs. Defaults to 10.
- Type:
int
- missing_value#
Token used to indicate missing values in Pandas. Defaults to None.
- Type:
str | None, optional
- mislabel_weighing#
Weighting of the distribution that mislabels are drawn from. Either “uniform” or “frequency”. Defaults to “uniform”.
- Type:
str
- mislabel_weights#
_description_. Defaults to None.
- Type:
dict[Any, float] | None
- mistype_dtype#
dtype of the column that is incorrectly typed. One of “object”, “string”, “int64”, “Int64”, “float64”, “Float64”. Defaults to None.
- Type:
str | None
- wrong_unit_scaling#
Function that scales a value from one unit to another. Defaults to None.
- Type:
Callable | None
- permutation_separator#
A character that separates structured text, e.g. ‘ ’ in an address or ‘-’ in a date. Defaults to “ ”.
- Type:
str
- permutation_automation_pattern#
Permutations either all follow the same pattern (fixed) or not (random). Defaults to “random”.
- Type:
str
- permutation_pattern#
Manually specifies the pattern that the permutations follow. Overrides the automation pattern if set. Defaults to None.
- Type:
list[int] | None
- extraneous_value_template#
Template string used to add extraneous data to the value. The position of the value is indicated by the template string ‘{value}’. Defaults to None.
- Type:
str | None
- replace_what#
String that the Replace Error Type replaces with replace_with. Defaults to None.
- Type:
str | None
- replace_with#
String that the Replace Error Type substitutes for replace_what. Defaults to “”.
- Type:
str
- add_delta_value#
Value that is added to the value by the AddDelta Error Type. Defaults to None.
- Type:
Any | None
- outlier_coin_flip_threshold#
Coin flip determines the direction (positive, negative) of the outlier. Defaults to 0.5.
- Type:
float
- outlier_coefficient#
Coefficient that determines the magnitude of the outliers for the Outlier Error Type. Defaults to 1.0.
- Type:
float
- outlier_noise_coeff#
Coefficient that influences the standard deviation of the noise added to the outliers for the Outlier Error Type. Defaults to 0.1.
- Type:
float
- static from_dict(data: dict[str, Any]) ErrorTypeConfig#
Deserializes the ErrorTypeConfig from a dict.
- to_dict() dict[str, Any]#
Serializes the ErrorTypeConfig to a dict.
- add_delta_value: float | int | None = None#
- encoding_receiver: str | None = None#
- encoding_sender: str | None = None#
- extraneous_value_template: str | None = None#
- mislabel_weighing: str = 'uniform'#
- mislabel_weights: dict[Any, float] | None = None#
- missing_value: str | None = None#
- mistype_dtype: str | None = None#
- outlier_coefficient: float = 1.0#
- outlier_coin_flip_threshold: float = 0.5#
- outlier_noise_coeff: float = 0.1#
- permutation_automation_pattern: str = 'random'#
- permutation_pattern: list[int] | None = None#
- permutation_separator: str = ' '#
- replace_what: str | None = None#
- replace_with: str = ''#
- typo_error_period: int = 10#
- typo_keyboard_layout: str = 'ansi-qwerty'#
- wrong_unit_scaling: Callable | None = None#
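ErrorTypeConfig is described above as a dataclass with to_dict and from_dict. A minimal sketch of such a round trip follows. It uses a hypothetical three-field stand-in (MiniConfig) rather than the real class, and it assumes a flat dictionary from field name to value, which the source does not spell out.

```python
from __future__ import annotations

from dataclasses import asdict, dataclass
from typing import Any


@dataclass
class MiniConfig:
    """Hypothetical stand-in; field names and defaults copied from the listing above."""

    typo_error_period: int = 10
    permutation_separator: str = " "
    replace_what: str | None = None

    def to_dict(self) -> dict[str, Any]:
        # Assumption: serialization is a plain field-name -> value mapping.
        return asdict(self)

    @staticmethod
    def from_dict(data: dict[str, Any]) -> MiniConfig:
        # Keys missing from `data` fall back to the dataclass defaults.
        return MiniConfig(**data)


config = MiniConfig(replace_what="St.")
restored = MiniConfig.from_dict(config.to_dict())
print(restored == config)  # prints: True
```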
- class tab_err.error_type.Extraneous(config: tab_err.error_type._config.ErrorTypeConfig | dict | None = None, seed: int | None = None)#
Bases:
tab_err.error_type._error_type.ErrorType
Adds extraneous strings around the values in a column.
- apply(data: pandas.DataFrame, error_mask: pandas.DataFrame, column: str | int) pandas.Series#
Applies an ErrorType to a column of ‘data’. Does type and shape checking and creates a random number generator.
- Parameters:
data (pd.DataFrame) – The Pandas DataFrame containing the column where errors are to be introduced.
error_mask (pd.DataFrame) – The Pandas DataFrame containing the error mask for ‘column’.
column (str | int) – The index in the ‘data’ and ‘error_mask’ DataFrames where errors are to be introduced.
- Returns:
The data column, ‘column’, after errors of ErrorType at the locations specified by ‘error_mask’ are introduced.
- Return type:
pd.Series
- classmethod from_dict(data: dict[str, Any]) ErrorType#
Deserialize an ErrorType object from a dictionary.
- Parameters:
data (dict[str, Any]) – A dictionary representation of the ErrorType object.
- Returns:
An ErrorType object deserialized from the dictionary.
- Return type:
ErrorType
- get_valid_columns(data: pandas.DataFrame) list[str | int]#
Finds the valid columns to which the error type can be applied. Wrapper around _get_valid_columns.
- to_dict() dict[str, Any]#
Serializes the ErrorType object into a dictionary.
- Returns:
A dictionary representation of the ErrorType object.
- Return type:
dict[str, Any]
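The extraneous_value_template attribute of ErrorTypeConfig marks the position of the original value with ‘{value}’. One way to picture this is Python's str.format. Whether tab_err fills the template this way is an assumption, and add_extraneous is a hypothetical helper, not part of the package.

```python
def add_extraneous(value, template):
    """Wrap `value` in the surrounding text given by `template` (hypothetical helper)."""
    # Assumption: the template is filled via str.format with a `value` field.
    return template.format(value=value)


print(add_extraneous("Berlin", "City: {value} (unverified)"))
# prints: City: Berlin (unverified)
```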
- class tab_err.error_type.MissingValue(config: tab_err.error_type._config.ErrorTypeConfig | dict | None = None, seed: int | None = None)#
Bases:
tab_err.error_type._error_type.ErrorType
Insert missing values into a column.
Missing value handling is not a solved problem in pandas and is under active development. Today, the best heuristic for inserting missing values is to assign None to the value. Pandas will choose the missing value sentinel based on the column dtype (https://pandas.pydata.org/docs/user_guide/missing_data.html#inserting-missing-data).
- apply(data: pandas.DataFrame, error_mask: pandas.DataFrame, column: str | int) pandas.Series#
Applies an ErrorType to a column of ‘data’. Does type and shape checking and creates a random number generator.
- Parameters:
data (pd.DataFrame) – The Pandas DataFrame containing the column where errors are to be introduced.
error_mask (pd.DataFrame) – The Pandas DataFrame containing the error mask for ‘column’.
column (str | int) – The index in the ‘data’ and ‘error_mask’ DataFrames where errors are to be introduced.
- Returns:
The data column, ‘column’, after errors of ErrorType at the locations specified by ‘error_mask’ are introduced.
- Return type:
pd.Series
- classmethod from_dict(data: dict[str, Any]) ErrorType#
Deserialize an ErrorType object from a dictionary.
- Parameters:
data (dict[str, Any]) – A dictionary representation of the ErrorType object.
- Returns:
An ErrorType object deserialized from the dictionary.
- Return type:
ErrorType
- get_valid_columns(data: pandas.DataFrame) list[str | int]#
Finds the valid columns to which the error type can be applied. Wrapper around _get_valid_columns.
- to_dict() dict[str, Any]#
Serializes the ErrorType object into a dictionary.
- Returns:
A dictionary representation of the ErrorType object.
- Return type:
dict[str, Any]
- class tab_err.error_type.Mistype(config: tab_err.error_type._config.ErrorTypeConfig | dict | None = None, seed: int | None = None)#
Bases:
tab_err.error_type._error_type.ErrorType
Insert incorrectly typed values into a column. Note that the dtype of the column is changed by this operation.
String / Object is the dead end of typing
In an effort to keep the code relatively simple, we cast the corrupted column to an Object Dtype.
- apply(data: pandas.DataFrame, error_mask: pandas.DataFrame, column: str | int) pandas.Series#
Applies an ErrorType to a column of ‘data’. Does type and shape checking and creates a random number generator.
- Parameters:
data (pd.DataFrame) – The Pandas DataFrame containing the column where errors are to be introduced.
error_mask (pd.DataFrame) – The Pandas DataFrame containing the error mask for ‘column’.
column (str | int) – The index in the ‘data’ and ‘error_mask’ DataFrames where errors are to be introduced.
- Returns:
The data column, ‘column’, after errors of ErrorType at the locations specified by ‘error_mask’ are introduced.
- Return type:
pd.Series
- classmethod from_dict(data: dict[str, Any]) ErrorType#
Deserialize an ErrorType object from a dictionary.
- Parameters:
data (dict[str, Any]) – A dictionary representation of the ErrorType object.
- Returns:
An ErrorType object deserialized from the dictionary.
- Return type:
ErrorType
- get_valid_columns(data: pandas.DataFrame) list[str | int]#
Finds the valid columns to which the error type can be applied. Wrapper around _get_valid_columns.
- to_dict() dict[str, Any]#
Serializes the ErrorType object into a dictionary.
- Returns:
A dictionary representation of the ErrorType object.
- Return type:
dict[str, Any]
- class tab_err.error_type.Mojibake#
Bases:
tab_err.error_type._error_type.ErrorType
Inserts mojibake into a column containing strings.
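The encoding_sender and encoding_receiver attributes of ErrorTypeConfig describe the mechanism behind mojibake: text is encoded to bytes with one codec and decoded with another. A minimal standard-library sketch of that idea follows. The helper name make_mojibake, the default codecs, and the errors="replace" fallback are assumptions for illustration, not tab_err behavior.

```python
def make_mojibake(text, encoding_sender="utf-8", encoding_receiver="latin-1"):
    """Encode with the sender's codec, then decode with the receiver's (hypothetical helper)."""
    raw = text.encode(encoding_sender)
    # Assumption: bytes the receiver cannot decode become replacement characters.
    return raw.decode(encoding_receiver, errors="replace")


print(make_mojibake("café"))  # prints: cafÃ©
```

Pure ASCII text passes through this particular codec pair unchanged, so only cells containing non-ASCII characters would show visible corruption here.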
- class tab_err.error_type.Outlier(config: tab_err.error_type._config.ErrorTypeConfig | dict | None = None, seed: int | None = None)#
Bases:
tab_err.error_type._error_type.ErrorType
Inserts outliers into a column by pushing data points outside the interquartile range (IQR) boundaries.
Data points below the mean are pushed towards lower outliers, while those above the mean are pushed towards upper outliers.
The outlier_coefficient controls how far values are pushed relative to the IQR. An outlier_coefficient of 1.0 means the push is equal to half of the IQR, shifting the mean value exactly to the edge of the IQR. Values that deviate more from the mean will be pushed beyond the IQR boundary. When outlier_coefficient is less than 1.0, values, including the mean, are pushed less drastically, potentially keeping them within the IQR.
The push is calculated as:
push = outlier_coefficient * |upper_boundary - mean_value|
Values above the mean are pushed towards the upper boundary, and values below the mean are pushed towards the lower boundary. If a value equals the mean, a coin flip decides whether it is pushed towards the upper or lower boundary.
After this process, Gaussian noise is added to simulate measurement errors and make the outliers appear more realistic. The amount of noise can be controlled via the outlier_noise_coeff parameter and is scaled with the IQR to ensure it is proportional to the data’s spread.
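The push described for Outlier can be traced on a toy column with the standard library. This is a sketch under stated assumptions, not the package's code: upper_boundary and the lower boundary are taken to be the third and first quartiles, the downward push mirrors the documented upward formula, the coin flip goes upward when the draw falls below the threshold, and the noise standard deviation is outlier_noise_coeff times the IQR. The helper name push_outlier is hypothetical.

```python
import random
import statistics


def push_outlier(value, values, coefficient=1.0, noise_coeff=0.1,
                 coin_flip_threshold=0.5, rng=None):
    """Push `value` away from the mean of `values`, then add Gaussian noise."""
    rng = rng or random.Random(0)
    q1, _, q3 = statistics.quantiles(values, n=4)  # assumed IQR boundaries
    iqr = q3 - q1
    mean = statistics.fmean(values)
    if value > mean:
        upward = True
    elif value < mean:
        upward = False
    else:
        # Assumption: a draw below the threshold means "push upward".
        upward = rng.random() < coin_flip_threshold
    if upward:
        pushed = value + coefficient * abs(q3 - mean)
    else:
        pushed = value - coefficient * abs(mean - q1)
    # Assumption: noise standard deviation is noise_coeff * IQR.
    return pushed + rng.gauss(0.0, noise_coeff * iqr)


column = [1, 2, 3, 4, 5, 6, 7]  # mean 4, quartiles 2 and 6 (default method)
print(push_outlier(5, column, noise_coeff=0.0))  # prints: 7.0
print(push_outlier(3, column, noise_coeff=0.0))  # prints: 1.0
```

With the noise switched off and a coefficient of 1.0, the mean value 4 lands on 2.0 or 6.0, which matches the statement that the mean is shifted exactly to the edge of the IQR.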
- apply(data: pandas.DataFrame, error_mask: pandas.DataFrame, column: str | int) pandas.Series#
Applies an ErrorType to a column of ‘data’. Does type and shape checking and creates a random number generator.
- Parameters:
data (pd.DataFrame) – The Pandas DataFrame containing the column where errors are to be introduced.
error_mask (pd.DataFrame) – The Pandas DataFrame containing the error mask for ‘column’.
column (str | int) – The index in the ‘data’ and ‘error_mask’ DataFrames where errors are to be introduced.
- Returns:
The data column, ‘column’, after errors of ErrorType at the locations specified by ‘error_mask’ are introduced.
- Return type:
pd.Series
- classmethod from_dict(data: dict[str, Any]) ErrorType#
Deserialize an ErrorType object from a dictionary.
- Parameters:
data (dict[str, Any]) – A dictionary representation of the ErrorType object.
- Returns:
An ErrorType object deserialized from the dictionary.
- Return type:
ErrorType
- get_valid_columns(data: pandas.DataFrame) list[str | int]#
Finds the valid columns to which the error type can be applied. Wrapper around _get_valid_columns.
- to_dict() dict[str, Any]#
Serializes the ErrorType object into a dictionary.
- Returns:
A dictionary representation of the ErrorType object.
- Return type:
dict[str, Any]
- class tab_err.error_type.Permutate(config: tab_err.error_type._config.ErrorTypeConfig | dict | None = None, seed: int | None = None)#
Bases:
tab_err.error_type._error_type.ErrorType
Permutes the parts of a compound value in a column.
- apply(data: pandas.DataFrame, error_mask: pandas.DataFrame, column: str | int) pandas.Series#
Applies an ErrorType to a column of ‘data’. Does type and shape checking and creates a random number generator.
- Parameters:
data (pd.DataFrame) – The Pandas DataFrame containing the column where errors are to be introduced.
error_mask (pd.DataFrame) – The Pandas DataFrame containing the error mask for ‘column’.
column (str | int) – The index in the ‘data’ and ‘error_mask’ DataFrames where errors are to be introduced.
- Returns:
The data column, ‘column’, after errors of ErrorType at the locations specified by ‘error_mask’ are introduced.
- Return type:
pd.Series
- classmethod from_dict(data: dict[str, Any]) ErrorType#
Deserialize an ErrorType object from a dictionary.
- Parameters:
data (dict[str, Any]) – A dictionary representation of the ErrorType object.
- Returns:
An ErrorType object deserialized from the dictionary.
- Return type:
ErrorType
- get_valid_columns(data: pandas.DataFrame) list[str | int]#
Finds the valid columns to which the error type can be applied. Wrapper around _get_valid_columns.
- to_dict() dict[str, Any]#
Serializes the ErrorType object into a dictionary.
- Returns:
A dictionary representation of the ErrorType object.
- Return type:
dict[str, Any]
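Permutate works on compound values that are split by permutation_separator and reordered according to permutation_pattern. A minimal sketch of that idea follows. It assumes the pattern is a list of indices into the split parts, which the source does not confirm, and permute_parts is a hypothetical helper.

```python
import random


def permute_parts(value, separator=" ", pattern=None, rng=None):
    """Reorder the separator-delimited parts of `value` (hypothetical helper)."""
    rng = rng or random.Random(0)
    parts = value.split(separator)
    if pattern is None:
        # No manual pattern: draw a random ordering of the parts.
        pattern = list(range(len(parts)))
        rng.shuffle(pattern)
    # Assumption: pattern[i] is the index of the part placed at position i.
    return separator.join(parts[i] for i in pattern)


print(permute_parts("2024-01-31", separator="-", pattern=[2, 1, 0]))
# prints: 31-01-2024
```

A random shuffle can return the original order, so a real implementation might redraw in that case; this sketch does not.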
- class tab_err.error_type.Replace#
Bases:
tab_err.error_type._error_type.ErrorType
Replace a part of strings within a column.
- class tab_err.error_type.Typo#
Bases:
tab_err.error_type._error_type.ErrorType
Inserts realistic typos into a column containing strings.
Typo imitates a typist who misses the correct key. For a given keyboard layout and key, Typo maps all keys that physically border the given key on the given layout. It assumes that all bordering keys are equally likely to be hit by the typist.
Typo assumes that words are separated by whitespace. Applied to a cell, the period with which Typo will corrupt words in that cell is controlled by the parameter typo_error_period. By default, Typo will insert a typo into every 10th word. Typo will always insert at least one typo into an affected cell.
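The behavior described for Typo (bordering keys, an error period, at least one typo per cell) can be sketched in pure Python. The neighbor map below is a small hypothetical excerpt, not the package's layout data, and the sketch assumes the typo substitutes a bordering key for the intended one. The helper name add_typos is also an assumption.

```python
import random

# Hypothetical excerpt of a neighbor map; a full layout would cover every key.
NEIGHBORS = {"a": "qwsz", "s": "awedxz", "e": "wsdr", "t": "rfgy", "o": "iklp"}


def add_typos(cell, error_period=10, rng=None):
    """Corrupt every `error_period`-th word; short cells get one corrupted word."""
    rng = rng or random.Random(0)
    words = cell.split()
    if not words:
        return cell
    targets = list(range(error_period - 1, len(words), error_period))
    if not targets:
        # Fewer words than the period: still pick one word to corrupt.
        targets = [rng.randrange(len(words))]
    for i in targets:
        word = words[i]
        positions = [p for p, ch in enumerate(word) if ch in NEIGHBORS]
        if not positions:
            continue  # no key of this word is in the excerpted map
        p = rng.choice(positions)
        # All bordering keys are equally likely, as the description states.
        words[i] = word[:p] + rng.choice(NEIGHBORS[word[p]]) + word[p + 1:]
    return " ".join(words)


print(add_typos("test"))
```

Joining with single spaces normalizes the original whitespace, a simplification that a real implementation may avoid.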
- class tab_err.error_type.WrongUnit(config: tab_err.error_type._config.ErrorTypeConfig | dict | None = None, seed: int | None = None)#
Bases:
tab_err.error_type._error_type.ErrorType
Simulate a column containing values that are scaled because they are not stored in the same unit.
- apply(data: pandas.DataFrame, error_mask: pandas.DataFrame, column: str | int) pandas.Series#
Applies an ErrorType to a column of ‘data’. Does type and shape checking and creates a random number generator.
- Parameters:
data (pd.DataFrame) – The Pandas DataFrame containing the column where errors are to be introduced.
error_mask (pd.DataFrame) – The Pandas DataFrame containing the error mask for ‘column’.
column (str | int) – The index in the ‘data’ and ‘error_mask’ DataFrames where errors are to be introduced.
- Returns:
The data column, ‘column’, after errors of ErrorType at the locations specified by ‘error_mask’ are introduced.
- Return type:
pd.Series
- classmethod from_dict(data: dict[str, Any]) ErrorType#
Deserialize an ErrorType object from a dictionary.
- Parameters:
data (dict[str, Any]) – A dictionary representation of the ErrorType object.
- Returns:
An ErrorType object deserialized from the dictionary.
- Return type:
ErrorType
- get_valid_columns(data: pandas.DataFrame) list[str | int]#
Finds the valid columns to which the error type can be applied. Wrapper around _get_valid_columns.
- to_dict() dict[str, Any]#
Serializes the ErrorType object into a dictionary.
- Returns:
A dictionary representation of the ErrorType object.
- Return type:
dict[str, Any]
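WrongUnit relies on the wrong_unit_scaling callable from ErrorTypeConfig, and apply introduces errors only at the locations given by error_mask. The sketch below shows that combination with plain lists standing in for a pandas column and its mask. It assumes that True marks a cell to corrupt; apply_where_masked is a hypothetical helper, not the package's apply method.

```python
def apply_where_masked(values, mask, scaling):
    """Apply `scaling` to the entries of `values` flagged in `mask` (hypothetical helper)."""
    if len(values) != len(mask):
        raise ValueError("values and mask must have the same length")
    # Assumption: True in the mask marks a cell that receives the error.
    return [scaling(v) if flagged else v for v, flagged in zip(values, mask)]


heights_cm = [170.0, 180.0, 165.0]
error_mask = [False, True, False]
# Store one height in metres instead of centimetres.
print(apply_where_masked(heights_cm, error_mask, lambda cm: cm / 100))
# prints: [170.0, 1.8, 165.0]
```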