Process DataFrame Column
Process Date Column
- parse_dates(df, date_column_name, format=None)
The parse_dates() function parses a date column in a pandas DataFrame. You can either specify a date format explicitly or let the function extract the format automatically from an error message.
- Parameters:
df (pandas.DataFrame) – The pandas DataFrame containing the date column.
date_column_name (str) – The name of the date column in the DataFrame.
format (str, optional, default: None) – The date format used for parsing, for example '%d/%m/%Y %H:%M:%S'. If omitted, the function tries to determine the format automatically.
- Returns:
Values of the date column with datetime datatype.
- Return type:
numpy.ndarray
Example 1: Parsing dates with default settings
from df_csv_excel import read_data
# Parsing dates with default settings
df['date_column'] = read_data.parse_dates(df, 'date_column')
Example 2: Parsing dates with a specified format
from df_csv_excel import read_data
# Parsing dates with a specified format
df['date_column'] = read_data.parse_dates(df, 'date_column', format='%d/%m/%Y %H:%M:%S')
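The difference between an explicit format and a fallback can be illustrated on a single value with the standard library. This is only a sketch: `parse_value` and `CANDIDATE_FORMATS` are hypothetical names, and trying a fixed list of formats is a stand-in for whatever detection parse_dates() actually performs, which is not documented here.

```python
from datetime import datetime

# Hypothetical fallback formats; the library's own detection logic may differ.
CANDIDATE_FORMATS = (
    "%Y-%m-%d %H:%M:%S",
    "%Y-%m-%d",
    "%d/%m/%Y %H:%M:%S",
    "%d/%m/%Y",
)


def parse_value(value, fmt=None):
    """Parse one date string, trying the fallback formats when fmt is None."""
    formats = (fmt,) if fmt is not None else CANDIDATE_FORMATS
    for candidate in formats:
        try:
            return datetime.strptime(value, candidate)
        except ValueError:
            # This format did not match; try the next one.
            continue
    raise ValueError(f"no matching format for {value!r}")


# Explicit format, as in Example 2
explicit = parse_value("15/01/2022 13:45:00", "%d/%m/%Y %H:%M:%S")
# No format given: the first matching candidate is used
inferred = parse_value("2022-01-15")
```

Passing an explicit format is the safer choice for ambiguous inputs such as 01/02/2022, where day-first and month-first readings are both valid.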
Process JSON Data in DataFrame
- get_feature_from_json(df, json_column_name, key_names)
The get_feature_from_json() function extracts the value of a nested key in a JSON string column of a pandas DataFrame.
- Parameters:
df (pandas.DataFrame) – The pandas DataFrame containing the JSON column.
json_column_name (str) – The name of the column that holds the JSON strings.
key_names (list of str) – The keys leading to the nested value, ordered from outermost to innermost, for example ['a', 'b', 'c'].
- Returns:
If successful, the function adds a new column (‘json_feature’) to the DataFrame, containing the extracted values. If an error occurs during processing (JSONDecodeError, TypeError, KeyError), it returns None.
- Return type:
numpy.ndarray
Example: Extracting features from a JSON column
import pandas as pd
from df_csv_excel import read_data
# Example DataFrame
data = {'json_column': ['{"a": {"b": {"c": 42}}}', '{"a": {"b": {"c": 24}}}']}
df = pd.DataFrame(data)
# Extract the value stored under a -> b -> c
df['c_value'] = read_data.get_feature_from_json(df, 'json_column', ['a', 'b', 'c'])
Note
The function uses the json module to handle JSON parsing.
If an error occurs during processing, the corresponding value in the result column is set to None.
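The per-row behaviour described in the note can be sketched with the json module alone. `extract_nested` is a hypothetical helper written for illustration, not the library's implementation; it walks the key path and falls back to None on the three error types listed above.

```python
import json


def extract_nested(json_string, key_names):
    """Return the value at the nested key path, or None if it cannot be read."""
    try:
        value = json.loads(json_string)
        # Descend one level per key, outermost first.
        for key in key_names:
            value = value[key]
        return value
    except (json.JSONDecodeError, TypeError, KeyError):
        return None


rows = ['{"a": {"b": {"c": 42}}}', '{"a": {"b": {}}}', 'not json', None]
values = [extract_nested(row, ["a", "b", "c"]) for row in rows]
```

Here a missing key, a string that is not valid JSON, and a missing cell all produce None instead of stopping the whole extraction.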
Get Latest Row by Column
- get_latest_row_by_column(df, date_column, duplicate_column)
This function retrieves the latest row for each unique value in a specified column based on the values in another date column. It is useful when you have duplicate entries in a DataFrame and want to keep only the rows with the latest date.
- Parameters:
df (pandas.DataFrame) – The pandas DataFrame containing the duplicate entries.
date_column (str) – The name of the date column used to decide which row is the latest.
duplicate_column (str) – The name of the column whose values identify duplicate entries.
- Returns:
A DataFrame containing the latest row for each unique value in the specified duplicate column.
- Return type:
pandas.DataFrame
Example
import pandas as pd
from df_csv_excel import read_data
# Example DataFrame
data = {'Email': ['john@example.com', 'alice@example.com', 'john@example.com'],
        'created_at': ['2022-01-15', '2022-01-14', '2022-01-16']}
df = pd.DataFrame(data)
# Get the latest row for each unique 'Email'
result = read_data.get_latest_row_by_column(df, 'created_at', 'Email')
Note
The input DataFrame is modified in-place during the process.
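The selection rule itself (keep the newest entry per key) can be shown on plain records without pandas. The sketch below uses a hypothetical `latest_by_key` helper and assumes the dates are ISO 8601 strings; it illustrates the rule only and says nothing about how the library implements it.

```python
def latest_by_key(records, date_key, duplicate_key):
    """Keep, for each duplicate_key value, the record with the greatest date."""
    latest = {}
    for record in records:
        key = record[duplicate_key]
        # ISO 8601 date strings (YYYY-MM-DD) sort correctly as plain strings.
        if key not in latest or record[date_key] > latest[key][date_key]:
            latest[key] = record
    return list(latest.values())


records = [
    {"Email": "john@example.com", "created_at": "2022-01-15"},
    {"Email": "alice@example.com", "created_at": "2022-01-14"},
    {"Email": "john@example.com", "created_at": "2022-01-16"},
]
result = latest_by_key(records, "created_at", "Email")
```

For the sample data, only the 2022-01-16 entry survives for john@example.com, and the single alice@example.com entry is kept unchanged. Unlike the library function, this sketch does not modify its input.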
Get Email Domain and Prefix from Email Column
- get_email_host(df, email_column='email')
Extract email prefixes and domains from a DataFrame column.
- Parameters:
df (pandas.DataFrame) – The pandas DataFrame containing the email column.
email_column (str, optional, default: ‘email’) – The name of the email column in the DataFrame.
- Returns:
Tuple of NumPy arrays containing email prefixes and domains.
Example
from df_csv_excel.read_data import get_email_host
# Assuming 'df' is your DataFrame
prefixes, domains = get_email_host(df, email_column='user_email')
Note
This function handles empty email cases by setting the ‘email_domain’ and ‘email_prefix’ to empty strings when the email is empty.
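As a rough illustration of the split for a single address, the sketch below assumes the prefix is the text before the first '@' and the domain is the text after it. `split_email` is a hypothetical helper, not part of the library.

```python
def split_email(email):
    """Return (prefix, domain) for one address; empty strings for empty input."""
    if not email:
        # Covers both None and the empty string.
        return "", ""
    prefix, _, domain = email.partition("@")
    return prefix, domain


emails = ["john@example.com", "", None]
pairs = [split_email(email) for email in emails]
prefixes = [prefix for prefix, _ in pairs]
domains = [domain for _, domain in pairs]
```

Returning empty strings rather than raising keeps the two result sequences the same length as the input column.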
Calculate Age
- calculate_age(df, birthdate_column)
Calculate ages based on birthdates in a DataFrame.
- Parameters:
df (pandas.DataFrame) – The pandas DataFrame containing the birthdate column.
birthdate_column (str) – The name of the birthdate column in the DataFrame.
- Returns:
NumPy array containing age values.
Example
from df_csv_excel.read_data import calculate_age
ages = calculate_age(df, birthdate_column='birthdate')
This function calculates ages based on birthdates from a specified column in a pandas DataFrame. It handles cases where birthdates are in the future, adjusting the age accordingly.
Note
Ensure that the ‘datetime’ module is imported before using this function.
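The usual age arithmetic for one birthdate can be sketched with the datetime module. `age_on` is a hypothetical helper, and clamping future birthdates to 0 is an assumption made for this sketch; the adjustment that calculate_age() applies in that case is not specified here.

```python
from datetime import date


def age_on(birthdate, today):
    """Return the age in whole years on the given day."""
    years = today.year - birthdate.year
    # Subtract one year if this year's birthday has not happened yet.
    if (today.month, today.day) < (birthdate.month, birthdate.day):
        years -= 1
    # Assumption: a birthdate in the future yields 0 rather than a negative age.
    return max(years, 0)


today = date(2024, 6, 14)
day_before_birthday = age_on(date(1990, 6, 15), today)
on_birthday = age_on(date(1990, 6, 14), today)
future_birthdate = age_on(date(2030, 1, 1), today)
```

Passing the reference date explicitly, instead of calling date.today() inside the function, keeps the result reproducible.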