Predicting Taxi Trip Durations: A Dockerized ML Pipeline

Introduction

In this blog post, we will walk through a machine learning pipeline that predicts taxi trip durations. The pipeline covers loading a pre-trained model, scoring a month of trip data, converting the notebook into a parameterized script, setting up a virtual environment with pipenv, and packaging everything into a Docker container for deployment.

Step 1: Checking Installed Packages

import os
import sys

if os.name == "nt":  # For Windows
    !pip freeze | findstr scikit-learn
else:  # For Linux/macOS
    !pip freeze | grep scikit-learn

Output:

scikit-learn==1.5.0

Step 2: Checking Python Version

!python -V

Output:

Python 3.9.0

Step 3: Importing Necessary Libraries

import pickle
import pandas as pd

Step 4: Loading the Model

with open("model.bin", "rb") as f_in:
    dv, model = pickle.load(f_in)

Warning:

InconsistentVersionWarning: Trying to unpickle estimator DictVectorizer from version 1.5.0 when using version 1.4.2. This might lead to breaking code or invalid results.

This warning means the model was pickled with scikit-learn 1.5.0 but is being unpickled with 1.4.2. Aligning the installed version with the one used for training (here, 1.5.0) makes it go away and avoids subtle incompatibilities.

Step 5: Defining a Function to Read Data

categorical = ["PULocationID", "DOLocationID"]

def read_data(filename):
    df = pd.read_parquet(filename)
    df["duration"] = df.tpep_dropoff_datetime - df.tpep_pickup_datetime
    df["duration"] = df.duration.dt.total_seconds() / 60
    df = df[(df.duration >= 1) & (df.duration <= 60)].copy()
    df[categorical] = df[categorical].fillna(-1).astype("int").astype("str")
    return df
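The duration logic above is worth sanity-checking in isolation. Here is a minimal sketch on a tiny synthetic frame (the timestamps are made up for illustration): one trip lasts 15 minutes and survives the 1–60 minute filter, while a 150-minute trip is dropped.

```python
import pandas as pd

# Two hypothetical trips: 15 minutes and 150 minutes
df = pd.DataFrame({
    "tpep_pickup_datetime": pd.to_datetime(["2023-03-01 10:00", "2023-03-01 11:00"]),
    "tpep_dropoff_datetime": pd.to_datetime(["2023-03-01 10:15", "2023-03-01 13:30"]),
})

# Same duration computation as in read_data
df["duration"] = (df.tpep_dropoff_datetime - df.tpep_pickup_datetime).dt.total_seconds() / 60
print(df.duration.tolist())  # [15.0, 150.0]

# Keep only trips between 1 and 60 minutes
df = df[(df.duration >= 1) & (df.duration <= 60)]
print(len(df))  # 1 -- the 150-minute trip is filtered out
```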

Step 6: Reading Data for March 2023

url = "https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2023-03.parquet"
df = read_data(url)

Step 7: Making Predictions

dicts = df[categorical].to_dict(orient="records")
X_val = dv.transform(dicts)
y_pred = model.predict(X_val)

Q1. Notebook

Run this notebook for the March 2023 data.

What's the standard deviation of the predicted duration for this dataset?

# Find the standard deviation of the predicted duration
predicted_duration_std = y_pred.std()
print(f"The standard deviation of the predicted duration is {predicted_duration_std:.2f} minutes")

Output:

The standard deviation of the predicted duration is 6.25 minutes

Q2. Preparing the Output

We want to prepare the dataframe with the output. Let's create an artificial ride_id column:

df['ride_id'] = f'{year:04d}/{month:02d}_' + df.index.astype('str')

Next, write the ride id and the predictions to a dataframe with results.

df_result.to_parquet(output_file, engine='pyarrow', compression=None, index=False)

What's the size of the output file?

# Create an artificial ride_id column
year = 2023
month = 3
df["ride_id"] = f"{year:04d}/{month:02d}_" + df.index.astype("str")
# Write the ride id and the predictions to a dataframe with results.
results = pd.DataFrame({"ride_id": df["ride_id"], "predicted_duration": y_pred})
print(results.head())

Output:

     ride_id  predicted_duration
0  2023/03_0           16.245906
1  2023/03_1           26.134796
2  2023/03_2           11.884264
3  2023/03_3           11.997720
4  2023/03_4           10.234486
# Save it as parquet file
output_file = f"{year:04d}-{month:02d}-predictions.parquet"
results.to_parquet(output_file, engine="pyarrow", compression=None, index=False)
print(f"Predictions saved to {output_file}")

Output:

Predictions saved to 2023-03-predictions.parquet
# Check the size of the saved file
file_size = os.path.getsize(output_file) / (1024 * 1024)
print(f"Size of the saved file is {file_size:.2f} MB")

Output:

Size of the saved file is 65.46 MB

Q3. Creating the Scoring Script

Now let's turn the notebook into a script.

# Convert the notebook to a script using jupyter nbconvert
!jupyter nbconvert --to script --output-dir . starter.ipynb

# List the files in the current directory
if os.name == "nt":
    !dir
else:
    !ls -l

Output:

[NbConvertApp] Converting notebook starter.ipynb to script
[NbConvertApp] Writing 3572 bytes to starter.py

Q4. Virtual Environment

Let's put everything into a virtual environment using pipenv.

Install all the required libraries. Pay attention to the Scikit-Learn version: it should be the same as in the starter notebook.

After installing the libraries, pipenv creates two files: Pipfile and Pipfile.lock. The Pipfile.lock file keeps the hashes of the dependencies we use for the virtual env.

Question: What's the first hash for the Scikit-Learn dependency?

# The first hash is "sha256:057b991ac64b3e75c9c04b5f9395eaf19a6179244c089afdebaad98264bff37c"
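Pipfile.lock is plain JSON, so the hash can be located programmatically. Below is a sketch that parses a trimmed, hypothetical excerpt of the lock-file structure; in practice you would read the real file with `open("Pipfile.lock")`.

```python
import json

# Trimmed, illustrative excerpt of a Pipfile.lock; the second hash is elided.
lock_content = """
{
  "default": {
    "scikit-learn": {
      "hashes": [
        "sha256:057b991ac64b3e75c9c04b5f9395eaf19a6179244c089afdebaad98264bff37c",
        "sha256:..."
      ],
      "version": "==1.5.0"
    }
  }
}
"""

lock = json.loads(lock_content)
first_hash = lock["default"]["scikit-learn"]["hashes"][0]
print(first_hash)
```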

Q5. Parametrize the Script

Let's make the script configurable via CLI. We'll create two parameters: year and month.

Run the script for April 2023.

What's the mean predicted duration?

Hint: just add a print statement to your script.

# For that, I switch to the starter.py script. Refer to it to see the changes.
# The mean predicted duration for 2023-04 is 14.29 minutes
# Predictions saved to 2023-04-predictions.parquet

# So, the answer is option B, 14.29
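The CLI parameterization itself can be sketched with argparse. This is a hypothetical structure; the actual starter.py may name its arguments differently.

```python
import argparse

def parse_args(argv=None):
    # Hypothetical CLI: python starter.py --year 2023 --month 4
    parser = argparse.ArgumentParser(description="Score taxi trips for a given month")
    parser.add_argument("--year", type=int, required=True)
    parser.add_argument("--month", type=int, required=True)
    return parser.parse_args(argv)

# Passing an explicit argv list here just for demonstration;
# in the real script, parse_args() reads sys.argv.
args = parse_args(["--year", "2023", "--month", "4"])
print(f"Scoring {args.year:04d}-{args.month:02d}")  # Scoring 2023-04
```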

Q6. Docker Container

Finally, we'll package the script in a Docker container. For that, you'll need to use a base image that we prepared.

This is the content of that image's Dockerfile:

FROM python:3.10.13-slim
WORKDIR /app
COPY ["model2.bin", "model.bin"]

Note: you don't need to run it. We have already done it.

It is pushed to agrigorev/zoomcamp-model:mlops-2024-3.10.13-slim, which you need to use as your base image.

That is, your Dockerfile should start with:

FROM agrigorev/zoomcamp-model:mlops-2024-3.10.13-slim

# do stuff here

This image already has a pickle file with a dictionary vectorizer and a model. You will need to use them. Important: don't copy the model to the docker image. You will need to use the pickle file already in the image.
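One possible Dockerfile is sketched below. It is illustrative, not necessarily identical to the one used for this run; note that the model is not copied in, since the base image already contains it.

```dockerfile
FROM agrigorev/zoomcamp-model:mlops-2024-3.10.13-slim

# Install dependencies from the pipenv lock file into the system Python
RUN pip install -U pip && pip install pipenv
COPY ["Pipfile", "Pipfile.lock", "./"]
RUN pipenv install --system --deploy

# Copy only the scoring script; model.bin is already in the image
COPY ["starter.py", "./"]

ENTRYPOINT ["python", "starter.py"]
```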

Now run the script with Docker. What's the mean predicted duration for May 2023?

Bonus: Upload the Result to the Cloud (Not Graded)

Just printing the mean duration inside the Docker image doesn't seem very practical. Typically, after creating the output file, we upload it to the cloud storage. Modify your code to upload the parquet file to S3/GCS/etc.

Publishing the Image to Docker Hub

This is how we published the image to Docker Hub:

docker build -t mlops-zoomcamp-model:2024-3.10.13-slim .
docker tag mlops-zoomcamp-model:2024-3.10.13-slim agrigorev/zoomcamp-model:mlops-2024-3.10.13-slim

docker login --username USERNAME
docker push agrigorev/zoomcamp-model:mlops-2024-3.10.13-slim

Inference with Docker

# I created the inference.dockerfile file.
# Then, I built the docker image using the following command:
# docker build -t ride-duration-pred-service:v1 -f week4/inference.dockerfile .

# I ran the docker container using the following command:
# docker run -it --rm ride-duration-pred-service:v1 --year 2023 --month 5

# (mlops-zoomcamp-2024-ZOLEji97) C:\Users\kaslou\Desktop\code\mlops-zoomcamp-2024>docker run -it --rm ride-duration-pred-service:v1 --year 2023 --month 5
# The mean predicted duration for 2023-05 is 0.19 minutes
# Predictions saved to 2023-05-predictions.parquet

Thank you for following along with this walkthrough. Stay tuned for more posts on machine learning and MLOps!