Feature Engineering

Check Significant Association with Chi-Square

This function checks for a significant association between a specified categorical feature and a binary target in a pandas DataFrame using the Chi-Square test. It returns True if there is a significant association and False if no significant association is detected. The significance level for the Chi-Square test is controlled by the alpha parameter.

significant_association_chi_square(df, column_category, column_target, alpha=0.05)

Check for significant association between a categorical feature and a binary target using the Chi-Square test.

Parameters:

df (pandas.DataFrame) – The pandas DataFrame containing the data.
column_category (str) – The name of the categorical feature column.
column_target (str) – The name of the binary target column.
alpha (float, optional, default: 0.05) – The significance level for the Chi-Square test.

Returns:

True if the test finds a significant association at level alpha, False otherwise (the hypothesis that the variables are independent is not rejected).

Return type:

bool

Raises:

ValueError – If one or more specified columns are not found in the DataFrame.

Example:

from df_csv_excel.fe_functions import significant_association_chi_square

# Assuming 'df' is your DataFrame
result = significant_association_chi_square(df, column_category='category', column_target='target')
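For readers who want to see what such a check involves, the following is a minimal sketch and not the library's actual implementation. It assumes the test is run on a contingency table built with pandas.crosstab and evaluated with scipy.stats.chi2_contingency; the helper name chi_square_association_sketch and the toy data are hypothetical.

```python
import pandas as pd
from scipy.stats import chi2_contingency


def chi_square_association_sketch(df, column_category, column_target, alpha=0.05):
    """Illustrative only: True when the Chi-Square p-value is below alpha."""
    missing = [c for c in (column_category, column_target) if c not in df.columns]
    if missing:
        raise ValueError(f"Columns not found in DataFrame: {missing}")
    # Contingency table: one row per category level, one column per target value
    contingency = pd.crosstab(df[column_category], df[column_target])
    chi2, p_value, dof, expected = chi2_contingency(contingency)
    return bool(p_value < alpha)


# Toy data (made up): category 'a' is mostly target 1, category 'b' mostly target 0
df = pd.DataFrame({
    "category": ["a"] * 50 + ["b"] * 50,
    "target": [1] * 45 + [0] * 5 + [1] * 5 + [0] * 45,
})
result = chi_square_association_sketch(df, "category", "target")
```

A result of False means only that the test did not reject independence at the chosen alpha. It does not prove that the variables are independent.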

Calculate Jaccard similarity

get_similarity(df, column_str_1, column_str_2)

This function calculates Jaccard similarity scores between two specified columns in a DataFrame.

Parameters:

df (pandas.DataFrame) – The input DataFrame.
column_str_1 (str) – The name of the first column.
column_str_2 (str) – The name of the second column.

Returns:

An array of Jaccard similarity scores.

Return type:

numpy.ndarray

Raises:

ValueError – If the DataFrame is empty or if one or more specified columns are not found.

Example:

from df_csv_excel.fe_functions import get_similarity

score = get_similarity(df, column_str_1='column1', column_str_2='column2')

Note

The Jaccard similarity is calculated by comparing the unique values in the specified columns. It is a measure of the similarity between the sets of values in the two columns. A score of 0 indicates no similarity, and a score of 1 indicates complete similarity.
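As an illustration of the measure itself, the sketch below computes the Jaccard similarity of two sets as the size of their intersection divided by the size of their union. How get_similarity turns column values into sets is not specified here, so the row-wise whitespace tokenization, the helper names jaccard and row_wise_jaccard_sketch, and the sample data are all assumptions made for this example.

```python
import numpy as np
import pandas as pd


def jaccard(set_a, set_b):
    """Size of the intersection divided by size of the union (0.0 if both are empty)."""
    union = set_a | set_b
    if not union:
        return 0.0
    return len(set_a & set_b) / len(union)


def row_wise_jaccard_sketch(df, column_str_1, column_str_2):
    """Hypothetical row-wise variant: splits each string on whitespace into a token set."""
    scores = [
        jaccard(set(str(a).lower().split()), set(str(b).lower().split()))
        for a, b in zip(df[column_str_1], df[column_str_2])
    ]
    return np.array(scores)


# Made-up sample data
df = pd.DataFrame({
    "column1": ["red green blue", "apple pie", "cat"],
    "column2": ["red green", "banana split", "cat"],
})
scores = row_wise_jaccard_sketch(df, "column1", "column2")
```

In this sketch the first row shares two of three distinct tokens, the second row shares none, and the third row is identical on both sides.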

Calculate Information Value (IV)

calculate_iv(data, feature, target, custom_bins=None)

This function calculates the Information Value (IV) for a given feature in a binary classification task.

Parameters:

data (pandas.DataFrame) – DataFrame containing the feature and target columns.
feature (str) – Name of the feature column.
target (str) – Name of the target column.
custom_bins (list or array, optional) – Custom bin edges for discretizing the feature. If None, automatic binning is used.

Returns:

Information Value (IV) for the feature, pivot table, and a bar chart.

Return type:

Tuple[float, pandas.DataFrame, matplotlib.pyplot]

Raises:

ValueError – If the DataFrame is empty or if one or more specified columns are not found.

This function discretizes the continuous feature into bins, calculates Weight of Evidence (WoE) and IV, and returns the overall IV.

The formula for IV is:

\[IV = \sum_{i} \left( \text{WoE}_i \cdot (\text{good\_percentage}_i - \text{bad\_percentage}_i) \right)\]

where WoE (Weight of Evidence) is given by:

\[\text{WoE}_i = \ln\left(\frac{\text{good\_percentage}_i + \epsilon}{\text{bad\_percentage}_i + \epsilon}\right)\]

and \(\epsilon\) is a small constant added to avoid division by zero.
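The two formulas can be sketched in a few lines of pandas. This is an illustration under stated assumptions and not the code of calculate_iv: it assumes target value 0 marks "good" and 1 marks "bad", that good_percentage and bad_percentage are each bin's share of all goods and all bads, and that epsilon is 1e-10. The name iv_sketch and the toy data are hypothetical.

```python
import numpy as np
import pandas as pd


def iv_sketch(data, feature, target, bins, epsilon=1e-10):
    """Illustrative WoE/IV computation; assumes target 0 = 'good' and 1 = 'bad'."""
    binned = pd.cut(data[feature], bins=bins, include_lowest=True)
    # Rows: bins, columns: target labels (0 and 1), values: counts
    counts = pd.crosstab(binned, data[target])
    good_pct = counts[0] / counts[0].sum()  # share of all goods falling in each bin
    bad_pct = counts[1] / counts[1].sum()   # share of all bads falling in each bin
    woe = np.log((good_pct + epsilon) / (bad_pct + epsilon))
    iv = ((good_pct - bad_pct) * woe).sum()
    return float(iv), woe


# Made-up data: the younger bin holds 30 goods and 10 bads, the older bin 10 goods and 30 bads
data = pd.DataFrame({
    "age": [30] * 40 + [60] * 40,
    "label": [0] * 30 + [1] * 10 + [0] * 10 + [1] * 30,
})
iv, woe = iv_sketch(data, "age", "label", bins=[0, 40, 80])
```

With these counts the good and bad shares are 0.75 and 0.25 in the first bin and reversed in the second, so each bin contributes 0.5 · ln 3 and the total IV is approximately ln 3.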

The IV indicates the predictive power of the feature:

IV < 0.02: Weak predictor

0.02 <= IV < 0.1: Medium predictor

IV >= 0.1: Strong predictor
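The thresholds listed above can be expressed as a small helper. This is only an illustration: the name iv_strength is hypothetical and not part of df_csv_excel, and the cutoffs are copied from the list above.

```python
def iv_strength(iv):
    """Map an IV value to the labels used in the list above."""
    if iv < 0.02:
        return "Weak predictor"
    if iv < 0.1:
        return "Medium predictor"
    return "Strong predictor"


labels = [iv_strength(v) for v in (0.01, 0.05, 0.25)]
```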

Example:

from df_csv_excel.fe_functions import calculate_iv

total_iv, pivot_table, plt = calculate_iv(data, 'age', 'label', custom_bins=[0, 20, 40, 60, 80, 100])

Note

Adjust the bin edges passed in custom_bins based on your specific requirements and the characteristics of your data.

Warning

Ensure that the specified feature and target columns exist in the DataFrame to avoid errors.

The function also returns a pivot table containing counts and percentages for each bin and target label, as well as a bar chart visualizing the counts for each bin and label.