Baseline Model for Batch Monitoring Example

Q1. Prepare the Dataset

Start with baseline_model_nyc_taxi_data.ipynb. Download the March 2024 Green Taxi data. We will use this data to simulate production usage of a taxi trip duration prediction service.

The code below downloads the March 2024 Green Taxi data and loads it into a DataFrame. The shape of the downloaded data is (57457, 20), that is, 57457 rows and 20 columns.


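import os

import requests
import pandas as pd
from tqdm import tqdm

# Download data
files = [("green_tripdata_2024-03.parquet", "./data")]

for file, path in files:
    os.makedirs(path, exist_ok=True)  # create the target folder if it is missing
    url = f"https://d37ci6vzurychx.cloudfront.net/trip-data/{file}"
    resp = requests.get(url, stream=True)
    resp.raise_for_status()  # stop early if the download failed
    save_path = f"{path}/{file}"
    total = int(resp.headers["Content-Length"])
    with open(save_path, "wb") as handle, tqdm(desc=file, total=total, unit="B", unit_scale=True) as bar:
        # Read in 8 KiB chunks; the default chunk size of 1 byte is very slow
        for data in resp.iter_content(chunk_size=8192):
            handle.write(data)
            bar.update(len(data))

# Load data and check its shape (rows, columns)
march_data_2024 = pd.read_parquet("data/green_tripdata_2024-03.parquet")
print(march_data_2024.shape)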

Solution: The shape is (57457, 20), so there are 57457 rows in the downloaded data.

Q2. Metric

Let's expand the number of data quality metrics we’d like to monitor! Please add one metric of your choice and a quantile value for the "fare_amount" column (quantile=0.5).

Hint: Explore Evidently metric ColumnQuantileMetric (from evidently.metrics import ColumnQuantileMetric).

We track the 0.5 quantile (the median) of fare_amount with ColumnQuantileMetric and, as the additional metric of our choice, a summary of trip_distance with ColumnSummaryMetric.


from evidently import ColumnMapping
from evidently.report import Report
from evidently.metrics import (
    ColumnDriftMetric,
    DatasetDriftMetric,
    DatasetMissingValuesMetric,
    ColumnQuantileMetric,
    ColumnSummaryMetric,
)

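# Define column mapping (num_features, cat_features, train_data and val_data
# are assumed to be defined earlier in the baseline notebook)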
column_mapping = ColumnMapping(
    target=None,
    prediction="prediction",
    numerical_features=num_features,
    categorical_features=cat_features,
)

# Create and run report
report = Report(
    metrics=[
        ColumnDriftMetric(column_name="prediction"),
        DatasetDriftMetric(),
        DatasetMissingValuesMetric(),
        ColumnQuantileMetric(column_name="fare_amount", quantile=0.5),
        ColumnSummaryMetric(column_name="trip_distance"),
    ]
)
report.run(reference_data=train_data, current_data=val_data, column_mapping=column_mapping)
report.show(mode="inline")
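
To use the quantile value in code rather than read it off the rendered report, the report can be converted to a dictionary with report.as_dict(). The sketch below is a minimal example under stated assumptions: it assumes the dictionary holds a "metrics" list whose entries carry a "metric" name and a "result" with "column_name" and "current" -> "value" keys, a layout that may differ between Evidently versions. The helper name extract_quantile and the sample dictionary are hypothetical.

```python
from typing import Any, Optional


def extract_quantile(report_dict: dict, column_name: str) -> Optional[Any]:
    """Return the current quantile value for a column, or None if it is absent."""
    for entry in report_dict.get("metrics", []):
        if entry.get("metric") != "ColumnQuantileMetric":
            continue
        result = entry.get("result", {})
        if result.get("column_name") == column_name:
            return result.get("current", {}).get("value")
    return None


# Hypothetical dictionary in the assumed shape of report.as_dict();
# the value 13.5 is made up for illustration.
sample = {
    "metrics": [
        {"metric": "DatasetDriftMetric", "result": {"dataset_drift": False}},
        {
            "metric": "ColumnQuantileMetric",
            "result": {
                "column_name": "fare_amount",
                "quantile": 0.5,
                "current": {"value": 13.5},
            },
        },
    ]
}

print(extract_quantile(sample, "fare_amount"))
```

With a real report, the call would be extract_quantile(report.as_dict(), "fare_amount"), provided the assumed layout matches the installed Evidently version.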
        

Q3. Monitoring

Let’s start monitoring. Run expanded monitoring for a new batch of data (March 2024).

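We now run the expanded monitoring on the March 2024 batch. The goal is to find the maximum daily value of the 0.5 quantile of "fare_amount" during March 2024. The report below computes the value for a single day (2024-03-30); changing the date range gives the value for each of the other days.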


# Load the full March 2024 dataset
march_data_2024 = pd.read_parquet("data/green_tripdata_2024-03.parquet")

# Report with the 0.5 quantile (median) of fare_amount
report2 = Report(
    metrics=[
        ColumnQuantileMetric(column_name="fare_amount", quantile=0.5),
    ]
)

# Run the report on one day: from 2024-03-30 (inclusive) to 2024-03-31 (exclusive)
report2.run(
    reference_data=None,
    current_data=march_data_2024.loc[
        march_data_2024.lpep_pickup_datetime.between(
            "2024-03-30", "2024-03-31", inclusive="left"
        )
    ],
    column_mapping=column_mapping,
)

# Show the report
report2.show(mode="inline")
        

Solution: The maximum value is 14.2 (option C).
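
Because the 0.5 quantile is the median, the daily values can be cross-checked in plain pandas by grouping on the pickup date. The following is a minimal sketch on a small synthetic DataFrame, not the real March 2024 data. Only the column names lpep_pickup_datetime and fare_amount come from the dataset; the function name daily_median_fare and all values are hypothetical. This is pandas, not Evidently, so results may differ in edge cases such as missing values.

```python
import pandas as pd


def daily_median_fare(df: pd.DataFrame) -> pd.Series:
    """Return the 0.5 quantile (median) of fare_amount per pickup day."""
    days = df["lpep_pickup_datetime"].dt.date
    return df.groupby(days)["fare_amount"].quantile(0.5)


# Synthetic stand-in for the taxi data (values are made up)
trips = pd.DataFrame(
    {
        "lpep_pickup_datetime": pd.to_datetime(
            [
                "2024-03-01 08:00",
                "2024-03-01 09:30",
                "2024-03-01 18:15",
                "2024-03-02 07:45",
                "2024-03-02 12:00",
            ]
        ),
        "fare_amount": [10.0, 12.0, 20.0, 8.0, 18.0],
    }
)

daily = daily_median_fare(trips)
print(daily)
print("max daily median:", daily.max(), "on", daily.idxmax())
```

On the real data, the same function could be applied to march_data_2024 after restricting lpep_pickup_datetime to March 2024.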

Q4. Dashboard

Finally, let’s add panels for the newly added metrics to the dashboard. After customizing the dashboard, let's save the dashboard config so that we can access it later.

Hint: Click on “Save dashboard” to access JSON configuration of the dashboard. This configuration should be saved locally.

We set up the dashboard with the code below. The saved dashboard configuration goes in project_folder/dashboards, which is the solution to this question.


from evidently.metric_preset import DataDriftPreset, DataQualityPreset
from evidently.ui.workspace import Workspace
from evidently.ui.dashboards import (
    DashboardPanelCounter,
    DashboardPanelPlot,
    CounterAgg,
    PanelValue,
    PlotType,
    ReportFilter,
)
from evidently.renderers.html_widgets import WidgetSize

# Create workspace and project
ws = Workspace("workspace")
project = ws.create_project("NYC Taxi Data Quality Project")
project.description = "My project description"
project.save()

# Add report to workspace
ws.add_report(project.id, report)

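# Configure dashboard panels. The two plots below read DatasetSummaryMetric,
# so they are expected to stay empty unless the reports added to the
# workspace include that metric.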
project.dashboard.add_panel(
    DashboardPanelCounter(
        filter=ReportFilter(metadata_values={}, tag_values=[]),
        agg=CounterAgg.NONE,
        title="NYC taxi data dashboard",
    )
)
project.dashboard.add_panel(
    DashboardPanelPlot(
        filter=ReportFilter(metadata_values={}, tag_values=[]),
        title="Inference Count",
        values=[
            PanelValue(
                metric_id="DatasetSummaryMetric",
                field_path="current.number_of_rows",
                legend="count",
            ),
        ],
        plot_type=PlotType.BAR,
        size=WidgetSize.HALF,
    )
)
project.dashboard.add_panel(
    DashboardPanelPlot(
        filter=ReportFilter(metadata_values={}, tag_values=[]),
        title="Number of Missing Values",
        values=[
            PanelValue(
                metric_id="DatasetSummaryMetric",
                field_path="current.number_of_missing_values",
                legend="count",
            ),
        ],
        plot_type=PlotType.LINE,
        size=WidgetSize.HALF,
    )
)

project.save()
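
The hint asks for the dashboard's JSON configuration to be saved locally. A minimal sketch of writing such a configuration into a dashboards folder follows. The helper save_dashboard_config, the file name, and the contents of the config dictionary are hypothetical placeholders, since the real JSON comes from the “Save dashboard” dialog.

```python
import json
import tempfile
from pathlib import Path


def save_dashboard_config(config: dict, project_folder: Path, name: str) -> Path:
    """Write a dashboard config as JSON under <project_folder>/dashboards."""
    target_dir = project_folder / "dashboards"
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / f"{name}.json"
    target.write_text(json.dumps(config, indent=2), encoding="utf-8")
    return target


# Placeholder config; the real JSON is copied from the "Save dashboard" dialog
example_config = {"title": "NYC taxi data dashboard", "panels": []}

# A temporary directory stands in for the project folder in this sketch
with tempfile.TemporaryDirectory() as tmp:
    saved = save_dashboard_config(example_config, Path(tmp), "nyc_taxi_dashboard")
    print(saved.name, json.loads(saved.read_text(encoding="utf-8"))["title"])
```

In a real project, the second argument would be the path of the project folder instead of a temporary directory.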
        

Conclusion

This example shows how to download the March 2024 Green Taxi data, extend an Evidently report with additional data quality metrics, run the monitoring on a new batch of data, and add panels to a dashboard. Monitoring of this kind helps detect data quality problems and drift before they degrade the model's predictions.