A Test

Posted by Jason Feng on August 31, 2019

The Question

The question and the first few rows of the data are shown below.

# Calculate the f1 for Thursdays… this is not a trick question!
dates|y|yhat
1999-11-01|0|0
1999-11-02|0|1
1999-11-03|1|1
1999-11-04|1|0
1999-11-05|0|1
1999-11-06|0|0

Columns y and yhat conventionally hold the true and predicted labels in machine learning, and the F1 score is a standard metric for classification problems, so I break the solution down into two parts.

  1. Filter the data down to Thursdays only.
  2. Compute the F1 score.

Step 1: Read the Data into a DataFrame

Thanks to the rich set of parameters pandas offers for reading data sources, we can easily read this gzipped tarball into a DataFrame.

import pandas as pd


def read_file(file_path):
    """Read the file into DataFrame"""
    df = pd.read_csv(file_path, sep='|', skiprows=1, index_col=0,
                     parse_dates=True)
    return df

The code above sets a few extra parameters to match the format of the file.

  1. Set the delimiter with sep='|'.
  2. Skip the first row, which holds the question text rather than data, with skiprows=1.
  3. Use the first column as the index with index_col=0.
  4. Parse the index values as dates with parse_dates=True.

After this, we keep only the Thursday rows by filtering on the weekday name of the DataFrame's index.
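As a minimal sketch of that filtering step, the snippet below loads the six sample rows from the question out of an in-memory string instead of the real file. The SAMPLE constant and the is_thursday variable are illustrative names that are not part of the original script. The sketch uses the index's day_name() method, which is how current pandas versions expose weekday names.

```python
import io

import pandas as pd

# The six sample rows from the question, including its leading comment line.
SAMPLE = """# Calculate the f1 for Thursdays
dates|y|yhat
1999-11-01|0|0
1999-11-02|0|1
1999-11-03|1|1
1999-11-04|1|0
1999-11-05|0|1
1999-11-06|0|0
"""

# Same parameters as read_file, applied to the in-memory sample.
df = pd.read_csv(io.StringIO(SAMPLE), sep="|", skiprows=1, index_col=0,
                 parse_dates=True)

# day_name() returns the English weekday name for each date in the index.
is_thursday = df.index.day_name() == "Thursday"
df_thur = df[is_thursday]

print(df_thur)
```

Only 1999-11-04 falls on a Thursday in this sample, so the filtered frame holds a single row.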

Step 2: Compute F1 score

We can use the existing f1_score function from the scikit-learn package to get the result.

from sklearn.metrics import f1_score


def compute_f1_score(df):
    """Compute the F1 score"""
    return f1_score(df["y"].values, df["yhat"].values)
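To make it clearer what the score measures, here is a rough hand-rolled version for binary labels. For binary labels, the F1 score is the harmonic mean of precision and recall, which simplifies to 2·TP / (2·TP + FP + FN). The helper f1_by_hand is a hypothetical name used only for illustration, and returning 0.0 when there are no positive labels or predictions is an assumption made for this sketch. It runs on all six sample rows from the question, not on the Thursday subset, so it does not reproduce the final result.

```python
def f1_by_hand(y_true, y_pred):
    """Binary F1 from raw counts (illustrative helper, not a library API)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

    denominator = 2 * tp + fp + fn
    if denominator == 0:
        # Assumption for this sketch: define F1 as 0.0 when it is undefined.
        return 0.0
    return 2 * tp / denominator


# The y and yhat columns of the six sample rows.
y = [0, 0, 1, 1, 0, 0]
yhat = [0, 1, 1, 0, 1, 0]

# TP = 1, FP = 2, FN = 1, so F1 = 2 / (2 + 2 + 1)
print(round(f1_by_hand(y, yhat), 3))  # 0.4
```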

Result

Running the script prints the F1 score for Thursdays. The score is not that promising, though, so the model's performance would need further investigation to improve.

$ python f1.py
F1 score for Thursdays is 0.308

Full Code

The complete code is shown below.

import pandas as pd
from sklearn.metrics import f1_score


def read_file(file_path):
    """Read the file into DataFrame"""
    df = pd.read_csv(file_path, sep='|', skiprows=1, index_col=0,
                     parse_dates=True)
    return df


def compute_f1_score(df):
    """Compute the F1 score"""
    return f1_score(df["y"].values, df["yhat"].values)


if __name__ == "__main__":
    file_path = "./data/test.tar.gz"
    df = read_file(file_path)

    # Keep only Thursday's rows (day_name() replaces the weekday_name
    # attribute, which was removed in pandas 1.0)
    df_thur = df[df.index.day_name() == "Thursday"]

    # Compute F1 score
    f1 = compute_f1_score(df_thur)

    # Print the result
    print("F1 score for Thursdays is {0:.3F}".format(f1))