The Question
The question and the first few rows of the data are shown below.
# Calculate the f1 for Thursdays… this is not a trick question!
dates|y|yhat
1999-11-01|0|0
1999-11-02|0|1
1999-11-03|1|1
1999-11-04|1|0
1999-11-05|0|1
1999-11-06|0|0
Columns y and yhat conventionally hold the true and predicted labels in machine learning, and the F1 score is a standard metric for classification problems, so I break the solution down into two parts.
- Filter the data to keep only the rows that fall on a Thursday.
- Compute the F1 score.
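To make the metric concrete, here is a minimal sketch that computes F1 by hand for the six sample rows shown above. The helper name f1_from_labels is hypothetical and not part of the original solution, and it assumes binary labels where 1 is the positive class.

```python
def f1_from_labels(y_true, y_pred):
    """Compute the binary F1 score by hand (hypothetical helper, positive class = 1)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    if tp == 0:
        # Without true positives the score is zero or undefined; return 0.0 by convention
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# The six sample rows from the question (all days, not only Thursdays)
y = [0, 0, 1, 1, 0, 0]
yhat = [0, 1, 1, 0, 1, 0]
print(round(f1_from_labels(y, yhat), 3))  # 0.4
```

With one true positive, two false positives and one false negative, precision is 1/3 and recall is 1/2, which gives an F1 of 0.4 for this tiny sample.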
Step 1: Read the Data into a DataFrame
Thanks to the rich set of parameters Pandas offers for reading data sources, we can easily read this zipped tarball into a DataFrame.
def read_file(file_path):
    """Read the file into DataFrame"""
    df = pd.read_csv(file_path, sep='|', skiprows=1, index_col=0,
                     parse_dates=True)
    return df
The above code passes a few extra parameters to handle the data format of the file.
- Specify the delimiter with sep='|'.
- Skip the first row, which holds the question rather than data (skiprows=1).
- Use the first column as the index (index_col=0).
- Parse the index values as dates (parse_dates=True).
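The effect of these parameters can be checked on a small in-memory sample. This is only a sketch under assumptions: the sample string copies a few rows from the question, and io.StringIO stands in for the real file.

```python
import io

import pandas as pd

# In-memory stand-in for the real file: question line, header, then data rows
sample = """# Calculate the f1 for Thursdays
dates|y|yhat
1999-11-01|0|0
1999-11-02|0|1
1999-11-03|1|1
1999-11-04|1|0
"""

df = pd.read_csv(io.StringIO(sample), sep="|", skiprows=1, index_col=0,
                 parse_dates=True)

print(type(df.index).__name__)  # DatetimeIndex, so date accessors are available
print(list(df.columns))         # ['y', 'yhat']
```

Having the dates as the index, already parsed, is what makes the weekday filter a one-liner later on.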
After this, we keep only the rows whose day is Thursday by filtering on the weekday name of the DataFrame's index.
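A minimal sketch of that filter, assuming a reasonably recent pandas: DatetimeIndex.day_name() returns the weekday name for each row and can be compared against "Thursday". The older weekday_name attribute was removed in pandas 1.0, so day_name() is used here. The toy DataFrame is an assumption built from the six sample rows.

```python
import pandas as pd

# Toy frame built from the six sample rows (1999-11-01 was a Monday)
df = pd.DataFrame(
    {"y": [0, 0, 1, 1, 0, 0], "yhat": [0, 1, 1, 0, 1, 0]},
    index=pd.date_range("1999-11-01", periods=6, freq="D"),
)

# Boolean mask: True where the index falls on a Thursday
is_thursday = df.index.day_name() == "Thursday"
df_thur = df[is_thursday]

print(df_thur)  # only the 1999-11-04 row remains
```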
Step 2: Compute F1 score
We can use the existing f1_score function in the scikit-learn package to get the result.
def compute_f1_score(df):
    """Compute the F1 score"""
    return f1_score(df["y"].values, df["yhat"].values)
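Because F1 combines precision and recall, it can help to look at both when interpreting the score. The following is a minimal sketch on the six sample rows, assuming scikit-learn is installed; precision_score and recall_score are additional scikit-learn functions that the original solution does not use.

```python
from sklearn.metrics import f1_score, precision_score, recall_score

# The six sample rows from the question (all days, not only Thursdays)
y = [0, 0, 1, 1, 0, 0]
yhat = [0, 1, 1, 0, 1, 0]

precision = precision_score(y, yhat)  # TP / (TP + FP)
recall = recall_score(y, yhat)        # TP / (TP + FN)
f1 = f1_score(y, yhat)                # harmonic mean of precision and recall

print("precision={:.3f} recall={:.3f} f1={:.3f}".format(precision, recall, f1))
```

A low F1 driven mostly by low precision points to too many false alarms, while low recall points to missed positives, which is a useful first split when investigating a weak score.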
Result
We then get the F1 score for Thursdays. The result is not that promising, though, so further investigation is needed to find ways to improve the model's performance.
$ python f1.py
F1 score for Thursdays is 0.308
Full Code
The complete code is shown below.
import pandas as pd
from sklearn.metrics import f1_score


def read_file(file_path):
    """Read the file into DataFrame"""
    df = pd.read_csv(file_path, sep='|', skiprows=1, index_col=0,
                     parse_dates=True)
    return df


def compute_f1_score(df):
    """Compute the F1 score"""
    return f1_score(df["y"].values, df["yhat"].values)


if __name__ == "__main__":
    file_path = "./data/test.tar.gz"
    df = read_file(file_path)
    # Filter the DataFrame which contains Thursday's data only
    # (day_name() replaces weekday_name, which was removed in pandas 1.0)
    df_thur = df[df.index.day_name() == "Thursday"]
    # Compute F1 score
    f1 = compute_f1_score(df_thur)
    # Print the result
    print("F1 score for Thursdays is {0:.3f}".format(f1))