Process Dataframe Column

Process Date Column

parse_dates(df, date_column_name, format=None)

The parse_dates() function parses a date column in a Pandas DataFrame. You can pass an explicit date format; if you omit it, the function attempts to extract the format automatically from an error message.

Parameters:

df (pandas.DataFrame) – The DataFrame that includes the date column.
date_column_name (str) – The name of the date column.
format (str, optional) – The format of the date column, for example, %d/%m/%Y %H:%M:%S. Defaults to None.

Returns:

Values of the date column with datetime datatype.

Return type:

numpy.ndarray

Example 1: Parsing dates with default settings

from df_csv_excel import read_data

# Parsing dates with default settings
df['date_column'] = read_data.parse_dates(df, 'date_column')

Example 2: Parsing dates with a specified format

from df_csv_excel import read_data

# Parsing dates with a specified format
df['date_column'] = read_data.parse_dates(df, 'date_column', format='%d/%m/%Y %H:%M:%S')
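
For orientation, the effect of the format argument can be approximated with plain pandas. The sketch below is not the library's implementation: it assumes the result is comparable to pandas.to_datetime applied to the column, and the helper name parse_dates_sketch and the sample data are hypothetical.

```python
import pandas as pd


def parse_dates_sketch(df, date_column_name, format=None):
    """Hypothetical stand-in: convert a column to datetime64 values."""
    # With format=None pandas infers the format; otherwise the given
    # format string is applied to every value in the column.
    parsed = pd.to_datetime(df[date_column_name], format=format)
    # Return a NumPy array, matching the documented return type.
    return parsed.to_numpy()


df = pd.DataFrame({"date_column": ["15/01/2022 08:30:00", "16/01/2022 17:45:10"]})
df["date_column"] = parse_dates_sketch(df, "date_column", format="%d/%m/%Y %H:%M:%S")
```

Passing the format explicitly avoids ambiguity between day-first and month-first dates, which is the main reason to prefer it over inference when the layout is known.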

Process JSON Data in DataFrame

get_feature_from_json(df, json_column_name, key_names)

The get_feature_from_json() function extracts the value of a nested key from a JSON string column of a Pandas DataFrame.

Parameters:

df (pandas.DataFrame) – The DataFrame containing the JSON column.
json_column_name (str) – The name of the column containing JSON strings.
key_names (list) – A list of keys to navigate through the JSON structure.

Returns:

If successful, the function adds a new column (‘json_feature’) to the DataFrame, containing the extracted values. If an error occurs during processing (JSONDecodeError, TypeError, KeyError), it returns None.

Return type:

numpy.ndarray

Example: Extracting features from a JSON column

import pandas as pd
from df_csv_excel import read_data

# Example DataFrame: each row holds a JSON string with nested keys
data = {'json_column': ['{"a": {"b": {"c": 42}}}', '{"a": {"b": {"c": 24}}}']}
df = pd.DataFrame(data)

# Extract the value stored under a -> b -> c
df['c_value'] = read_data.get_feature_from_json(df, 'json_column', ['a', 'b', 'c'])

Note

This function uses the json module to handle JSON parsing. If an error occurs during processing, the corresponding value in the result column is set to None.
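
The per-row behavior described in the note can be illustrated with the standard library alone. This is a minimal sketch under assumptions, not the library's code: the helper name extract_nested and the sample rows are hypothetical, and the set of caught exceptions is taken from the description above.

```python
import json


def extract_nested(json_string, key_names):
    """Hypothetical helper: follow key_names through a parsed JSON string."""
    try:
        value = json.loads(json_string)
        for key in key_names:
            value = value[key]
        return value
    except (json.JSONDecodeError, TypeError, KeyError):
        # Malformed JSON, a non-string input, or a missing key yields None.
        return None


rows = ['{"a": {"b": {"c": 42}}}', '{"a": {"b": {}}}', 'not json', None]
values = [extract_nested(row, ["a", "b", "c"]) for row in rows]
```

Here only the first row produces a value; the other three fall back to None. With a DataFrame, a helper like this could be applied to the JSON column with Series.map.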

Get Latest Row by Column

get_latest_row_by_column(df, date_column, duplicate_column)

This function retrieves the latest row for each unique value in a specified column based on the values in another date column. It is useful when you have duplicate entries in a DataFrame and want to keep only the rows with the latest date.

Parameters:

df (pandas.DataFrame) – The input DataFrame.
date_column (str) – The name of the date column used for sorting and determining the latest row.
duplicate_column (str) – The name of the column containing duplicate values, for which the latest rows will be retained.

Returns:

A DataFrame containing the latest row for each unique value in the specified duplicate column.

Return type:

pandas.DataFrame

Example

import pandas as pd
from df_csv_excel import read_data

# Example DataFrame
data = {'Email': ['john@example.com', 'alice@example.com', 'john@example.com'],
        'created_at': ['2022-01-15', '2022-01-14', '2022-01-16']}

df = pd.DataFrame(data)

# Get the latest row for each unique 'Email'
result = read_data.get_latest_row_by_column(df, 'created_at', 'Email')

Note

The input DataFrame is modified in-place during the process.
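
The same result can be approximated in plain pandas by sorting on the date column and keeping the last duplicate. The sketch below is an assumption about the general technique, not the library's implementation; the name latest_row_sketch is hypothetical, and unlike the function documented here it works on a copy rather than modifying the input.

```python
import pandas as pd


def latest_row_sketch(df, date_column, duplicate_column):
    """Hypothetical stand-in: keep the row with the latest date per value."""
    # Work on a copy so the caller's DataFrame is left untouched.
    out = df.copy()
    out[date_column] = pd.to_datetime(out[date_column])
    out = out.sort_values(date_column)
    # After an ascending sort, the last duplicate is the most recent one.
    return out.drop_duplicates(subset=duplicate_column, keep="last")


df = pd.DataFrame({
    "Email": ["john@example.com", "alice@example.com", "john@example.com"],
    "created_at": ["2022-01-15", "2022-01-14", "2022-01-16"],
})
result = latest_row_sketch(df, "created_at", "Email")
```

Copying first is a deliberate choice in this sketch: it costs extra memory but avoids the surprise of a string column silently becoming a datetime column in the caller's data.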

Get Email Domain and Prefix from Email Column

get_email_host(df, email_column='email')

Extract email prefixes and domains from a DataFrame column.

Parameters:

df (pandas.DataFrame) – The pandas DataFrame containing the email column.
email_column (str, optional, default: ‘email’) – The name of the email column in the DataFrame.

Returns:

Tuple of NumPy arrays containing email prefixes and domains.

Example

from df_csv_excel.read_data import get_email_host

# Assuming 'df' is your DataFrame
prefixes, domains = get_email_host(df, email_column='user_email')

Note

This function handles empty email cases by setting the ‘email_domain’ and ‘email_prefix’ to empty strings when the email is empty.
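
The split itself, including the empty-string fallback mentioned in the note, can be sketched for a single address as follows. The helper name split_email and the sample addresses are hypothetical, and how the library treats malformed addresses is not documented here, so that case is left out.

```python
def split_email(email):
    """Hypothetical helper: return (prefix, domain) for one address."""
    if not email:
        # Empty or missing addresses map to empty strings.
        return "", ""
    # Split at the first "@": text before it is the prefix, after it the domain.
    prefix, _, domain = email.partition("@")
    return prefix, domain


emails = ["john@example.com", "", "alice@mail.example.org"]
pairs = [split_email(e) for e in emails]
prefixes = [p for p, _ in pairs]
domains = [d for _, d in pairs]
```

The two lists line up index by index with the input, so they can be assigned back to a DataFrame as two new columns.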

Calculate Age

calculate_age(df, birthdate_column)

Calculate ages based on birthdates in a DataFrame.

Parameters:

df (pandas.DataFrame) – The pandas DataFrame containing the birthdate column.
birthdate_column (str) – The name of the birthdate column in the DataFrame.

Returns:

NumPy array containing age values.

Example

from df_csv_excel.read_data import calculate_age

ages = calculate_age(df, birthdate_column='birthdate')

Birthdates that lie in the future are also handled; the age is adjusted accordingly.

Note

Ensure that the ‘datetime’ module is imported before using this function.
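
A common way to compute an age in whole years with the datetime module is shown below. It is a sketch of the general technique, not the library's code: the helper name age_on is hypothetical, the reference date is fixed for reproducibility, and the library's adjustment for future birthdates is not modeled.

```python
from datetime import date


def age_on(birthdate, today):
    """Hypothetical helper: whole years between birthdate and today."""
    years = today.year - birthdate.year
    # Subtract one year if this year's birthday has not happened yet.
    if (today.month, today.day) < (birthdate.month, birthdate.day):
        years -= 1
    return years


today = date(2024, 6, 1)  # fixed reference date so the result is reproducible
birthdates = [date(1990, 5, 31), date(1990, 6, 2), date(2000, 6, 1)]
ages = [age_on(b, today) for b in birthdates]
```

Comparing (month, day) tuples avoids dividing a day count by 365.25, which can be off by one around birthdays and leap years.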