This report details the steps taken to complete Homework 3, which involves creating a data pipeline using Mage, preparing data, and training a machine learning model.
I started Docker Desktop and then, in PowerShell (Windows terminal), I ran the following command to start Mage:
docker run -it -p 6789:6789 -v ${PWD}:/home/src mageai/mageai /app/ mage start mlops-hw3
Mage version used: v0.9.71
I then opened the Text Editor.
I created a pipeline for the project homework_03.
I created a data loader block named Ingest to read the March 2023 Yellow taxi trips data.
import requests
from io import BytesIO
from typing import List

import pandas as pd

if 'data_loader' not in globals():
    from mage_ai.data_preparation.decorators import data_loader


@data_loader
def ingest_files(**kwargs) -> pd.DataFrame:
    # URL of the March 2023 Yellow taxi trips parquet file
    dataset_trips_2023_march = ""

    response = requests.get(dataset_trips_2023_march)
    # Fail loudly if the download did not succeed
    if response.status_code != 200:
        raise Exception(response.text)

    # Parse the downloaded parquet bytes into a DataFrame
    df = pd.read_parquet(BytesIO(response.content))
    return df
I created a transformer block for data preparation with the previous block as its parent. The transformation includes calculating the trip duration and converting categorical columns to string type.
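import pandas as pd

if 'transformer' not in globals():
    from mage_ai.data_preparation.decorators import transformer

@transformer
def transform_dataframe(df):
    # Parse timestamps and compute the trip duration in minutes
    df.tpep_dropoff_datetime = pd.to_datetime(df.tpep_dropoff_datetime)
    df.tpep_pickup_datetime = pd.to_datetime(df.tpep_pickup_datetime)
    df['duration'] = (df.tpep_dropoff_datetime - df.tpep_pickup_datetime).dt.total_seconds() / 60
    # Keep trips of at least 1 and under 60 minutes; copy to avoid SettingWithCopyWarning
    df = df[(df.duration >= 1) & (df.duration < 60)].copy()
    # Cast location IDs to strings so they are treated as categories
    categorical = ['PULocationID', 'DOLocationID']
    df[categorical] = df[categorical].astype(str)
    return df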
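To make the duration filter concrete, here is a minimal sketch that applies the same steps to three made-up trips. The sample rows and the helper name `prepare_trips` are assumptions for illustration only; they are not part of the homework pipeline.

```python
import pandas as pd


def prepare_trips(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper mirroring the preparation steps on a copy of the input."""
    df = df.copy()
    df["tpep_pickup_datetime"] = pd.to_datetime(df["tpep_pickup_datetime"])
    df["tpep_dropoff_datetime"] = pd.to_datetime(df["tpep_dropoff_datetime"])

    # Trip duration in minutes
    delta = df["tpep_dropoff_datetime"] - df["tpep_pickup_datetime"]
    df["duration"] = delta.dt.total_seconds() / 60

    # Same bounds as the transformer block: at least 1 minute, under 60 minutes
    df = df[(df["duration"] >= 1) & (df["duration"] < 60)].copy()

    # Location IDs become strings so they can be one-hot encoded later
    categorical = ["PULocationID", "DOLocationID"]
    df[categorical] = df[categorical].astype(str)
    return df


# Invented sample: a 30-second trip, a 15-minute trip, and a 2-hour trip
trips = pd.DataFrame({
    "tpep_pickup_datetime": ["2023-03-01 00:00:00"] * 3,
    "tpep_dropoff_datetime": [
        "2023-03-01 00:00:30",
        "2023-03-01 00:15:00",
        "2023-03-01 02:00:00",
    ],
    "PULocationID": [1, 2, 3],
    "DOLocationID": [4, 5, 6],
})

prepared = prepare_trips(trips)
print(prepared[["PULocationID", "DOLocationID", "duration"]])
```

Only the 15-minute trip falls inside the bounds, so the 30-second and 2-hour rows are dropped before training.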
I created another transformer block to train a linear regression model. The model uses pickup and dropoff locations as features and trip duration as the target.
import pandas as pd
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LinearRegression

if 'transformer' not in globals():
    from mage_ai.data_preparation.decorators import transformer

@transformer
def transform(df):
    categorical_columns = ['PULocationID', 'DOLocationID']
    target = 'duration'

    for col in categorical_columns:
        df[col] = df[col].astype('category')

    X = df[categorical_columns]
    y = df[target]

    # One-hot encode the location IDs
    dv = DictVectorizer(sparse=True)
    X_train = dv.fit_transform(X.to_dict(orient='records'))

    # Fit the model on the encoded features
    lr = LinearRegression()
    lr.fit(X_train, y)

    print("Intercept:", lr.intercept_)
    return dv, lr
The intercept of the linear regression model is 24.77.
This completes the steps and outputs for Homework 3. The pipeline successfully ingests and transforms the data, then trains a linear regression model.