This report details the steps taken to complete Homework 3, which involves creating a data pipeline using Mage, preparing data, and training a machine learning model.
I started Docker Desktop and then, in PowerShell (Windows terminal), I ran the following command to start Mage:
docker run -it -p 6789:6789 -v ${PWD}:/home/src mageai/mageai /app/run_app.sh mage start mlops-hw3
Mage version used: v0.9.71
Screenshot:
localhost:6789
and opened the Text Editor.homework_03
.Screenshots:
I created a pipeline for the project homework_03
.
Screenshot:
I created a data loader block named Ingest
to read the March 2023 Yellow taxi trips data.
import requests
from io import BytesIO
from typing import List
import pandas as pd
if 'data_loader' not in globals():
from mage_ai.data_preparation.decorators import data_loader
@data_loader
def ingest_files(**kwargs) -> pd.DataFrame:
dataset_trips_2023_march = "https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2023-03.parquet"
response = requests.get(dataset_trips_2023_march)
if response.status_code != 200:
raise Exception(response.text)
df = pd.read_parquet(BytesIO(response.content))
return df
Screenshots:
I created a transformer block for data preparation with the previous block as its parent. The transformation includes calculating the trip duration and converting categorical columns to string type.
import pandas as pd
@transformer
def transform_dataframe(df):
df.tpep_dropoff_datetime = pd.to_datetime(df.tpep_dropoff_datetime)
df.tpep_pickup_datetime = pd.to_datetime(df.tpep_pickup_datetime)
df['duration'] = (df.tpep_dropoff_datetime - df.tpep_pickup_datetime).dt.total_seconds() / 60
df = df[(df.duration >= 1) & (df.duration < 60)]
categorical = ['PULocationID', 'DOLocationID']
df[categorical] = df[categorical].astype(str)
return df
Screenshots:
I created another transformer block to train a linear regression model. The model uses pickup and dropoff locations as features and trip duration as the target.
import pandas as pd
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LinearRegression
@transformer
def transform(df):
categorical_columns = ['PULocationID', 'DOLocationID']
target = 'duration'
for col in categorical_columns:
df[col] = df[col].astype('category')
X = df[categorical_columns]
y = df[target]
dv = DictVectorizer(sparse=True)
X_train = dv.fit_transform(X.to_dict(orient='records'))
lr = LinearRegression()
lr.fit(X_train, y)
print("Intercept:", lr.intercept_)
return dv, lr
Screenshots:
The intercept of the linear regression model is 24.77.
This completes the steps and outputs for Homework 3. The pipeline successfully ingests, transforms the data, and trains a linear regression model.