How to use pandas to compute rolling cumulative distinct count over past 24 hrs?

To compute a rolling distinct count over the past 24 hours using pandas (that is, for each row, the number of distinct values whose timestamps fall within the preceding 24 hours), use the rolling method with a time-based window in combination with a custom aggregation function. Here’s a step-by-step guide on how to achieve this:

  1. Import Pandas and Create Sample Data:
    Ensure you have Pandas installed and then create a DataFrame with a timestamp column and the values you want to count distinctly.
  2. Set the Timestamp Column as the Index:
    Convert the timestamp column to a Pandas datetime type and set it as the index.
  3. Define a Custom Function for Rolling Distinct Count:
    Create a custom aggregation function that counts the number of distinct elements in a given window.
  4. Apply the Rolling Window:
    Use the rolling method with a 24-hour window and apply your custom function.

Here is a complete example:

import pandas as pd
import numpy as np

# Sample data
data = {
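    # Lowercase 'h' means hourly; the uppercase 'H' alias is deprecated in recent pandas versions
    'timestamp': pd.date_range(start='2023-01-01', periods=100, freq='h'),
    'value': np.random.randint(1, 20, 100)  # Random integers from 1 to 19 (upper bound is exclusive)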
}
df = pd.DataFrame(data)

# Convert timestamp to datetime and set it as the index
df['timestamp'] = pd.to_datetime(df['timestamp'])
df.set_index('timestamp', inplace=True)

# Define a custom aggregation function for counting distinct values.
# With raw=False, each window is passed to the function as a pandas Series.
def distinct_count(window):
    return window.nunique()

# Compute the rolling distinct count over the past 24 hours.
# Note: '24h' denotes a 24-hour time-based window ending at each row's timestamp
result = df['value'].rolling('24h').apply(distinct_count, raw=False)

print(result)

Explanation:

  1. Sample Data: We create a DataFrame with a timestamp and some random values.
  2. Datetime Conversion: We ensure the timestamp column is in datetime format and set it as the index to enable time-based rolling operations.
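  3. Custom Aggregation Function: We pass a function to the apply method that counts the distinct values within each window. With raw=False, each window arrives as a pandas Series, so nunique() is available.
  4. Rolling Window: We call the rolling method with a 24-hour offset string, so each window covers the 24 hours ending at the current row's timestamp. By default the window is closed on the right only, meaning a row exactly 24 hours older than the current one is excluded. The apply method then calculates the number of distinct values within each window.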
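
To make the window boundaries in step 4 concrete, here is a minimal sketch on a small, hand-checkable dataset. The events frame, its timestamps, and its values are hypothetical and chosen only for illustration. It compares the default window with closed='both', which also includes the row exactly 24 hours back.

```python
import pandas as pd

# Hypothetical toy data with irregular timestamps (not from the example above)
events = pd.DataFrame(
    {
        "timestamp": pd.to_datetime(
            [
                "2023-01-01 00:00",
                "2023-01-01 06:00",
                "2023-01-01 12:00",
                "2023-01-02 00:00",
                "2023-01-02 05:00",
            ]
        ),
        "value": [5, 2, 2, 3, 2],
    }
).set_index("timestamp")


def distinct_count(window):
    # With raw=False each window is a pandas Series
    return window.nunique()


# Default: window is (t - 24h, t], so the row exactly 24 hours back is excluded
distinct_24h = events["value"].rolling("24h").apply(distinct_count, raw=False)

# closed='both': window is [t - 24h, t], so that boundary row is included
distinct_24h_both = (
    events["value"].rolling("24h", closed="both").apply(distinct_count, raw=False)
)

print(distinct_24h.tolist())       # [1.0, 2.0, 2.0, 2.0, 2.0]
print(distinct_24h_both.tolist())  # [1.0, 2.0, 2.0, 3.0, 2.0]
```

The two results differ only at 2023-01-02 00:00, where the value 5 recorded exactly 24 hours earlier is counted only when the window is closed on both sides.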

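This approach computes a rolling distinct count for each row, considering the 24 hours of data up to and including that row's timestamp. Because the window is time-based, it also works when timestamps are irregularly spaced, provided the index is sorted in ascending order. Adjust the freq parameter in pd.date_range and the rolling window offset as needed for your specific use case.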
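
One adjustment that often comes up is counting distinct non-numeric values such as user names. The rolling apply method operates on numeric data, so one possible workaround is to encode the values as integer codes first. The sketch below does this with pd.factorize. The logins frame and its column names are hypothetical, and it assumes the column has no missing values (factorize assigns them the code -1, which would be counted as a value of its own).

```python
import pandas as pd

# Hypothetical login events keyed by user name (illustrative only)
logins = pd.DataFrame(
    {
        "timestamp": pd.to_datetime(
            [
                "2023-01-01 00:00",
                "2023-01-01 10:00",
                "2023-01-01 20:00",
                "2023-01-02 12:00",
            ]
        ),
        "user": ["alice", "bob", "alice", "carol"],
    }
).set_index("timestamp")

# Map each distinct user name to an integer code so rolling can process it
codes, uniques = pd.factorize(logins["user"])
logins["user_code"] = codes

# Count distinct codes in the 24 hours ending at each row's timestamp
distinct_users = (
    logins["user_code"]
    .rolling("24h")
    .apply(lambda window: window.nunique(), raw=False)
    .astype(int)
)

print(distinct_users.tolist())  # [1, 2, 2, 2]
```

The final count is 2 rather than 3 because by 2023-01-02 12:00 only the 20:00 login by alice and the login by carol fall inside the window.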
