Data

What data scientists spend the most time doing

Gil Press - Forbes, Cleaning Big Data: Most Time-Consuming, Least Enjoyable Data Science Task, Survey Says

What is Machine Learning

Machine learning is an automated process that extracts patterns from data. To illustrate the basic concepts of machine learning, we will use supervised machine learning, which automatically learns and builds a model of the relationships between a set of descriptive features and a target feature (the labels) based on a set of historical examples, or instances. We can then use this model to make predictions for new instances unseen during the learning period.

The following schema illustrates the common machine learning process.

What is machine learning

Define the Problem

We are going to predict whether it rains tomorrow from today's data at locations all over Australia.

It is a binary classification problem; we also want some insight into which features and their values signal rain tomorrow (if any). We got the data from the Australian Weather Bureau for education purposes and downloaded the dataset from the Kaggle Weather dataset.

Python

We will use the following libraries in this lecture:

pandas, scikit-learn, numpy

# import some basic libraries to do analysis and ML
import pandas as pd
import numpy as np
import sklearn as sk
import scipy.stats as st

import warnings
warnings.filterwarnings('ignore')

# Visualization details
import plotly.io as pio
import plotly.express as px
import plotly.offline as py
import plotly.graph_objects as go
import machine_learning_python as mlp

png_renderer = pio.renderers['png']
png_renderer.height = 400
png_renderer.width = 800
pio.renderers.default = 'png'

Load and Get to Know Your Data

We load the data and take a first look at what is available to us: what features we have, their types, missing values, what might be wrong, what we can fix on our side, what has to be fixed with the data provider, etc.

It is worth validating the basic assumptions about the data:

  • Do we have enough samples?

  • Are basic axioms valid, e.g. uniqueness of IDs, having one measurement per day, …

  • What labels do we have? Is the target binary?

Some of the features available to us:

  1. RainTomorrow is our binary target (dependent) feature, which we will be predicting or explaining using the descriptive (independent) features.

  2. Date, Location are nominal (categorical) features spanning Australia and about 9 years of measurements. We will need to convert them to numerical features. Some locations are close together, some are very distant; we could probably use location to bring in information about adjacent rain and combine it with wind direction.

  3. WindDirs are nominal features describing wind direction. We will have to convert them to numerical features.

  4. RainToday is nominal information about rain today. We will have to convert it to a numerical feature. It will probably be the feature with the most signal, as it is likely that when it rains today, it will rain tomorrow.

  5. RISK_MM stands out; it looks like some technical measure, we will have to check it later.

  6. There is a group of numerical features describing morning and afternoon measurements. We have to check their distributions and fill in missing values.

# Read data from local .csv file
df = pd.read_csv('../../data/weatherAUS.csv')
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 142193 entries, 0 to 142192
Data columns (total 24 columns):
 #   Column         Non-Null Count   Dtype  
---  ------         --------------   -----  
 0   Date           142193 non-null  object 
 1   Location       142193 non-null  object 
 2   MinTemp        141556 non-null  float64
 3   MaxTemp        141871 non-null  float64
 4   Rainfall       140787 non-null  float64
 5   Evaporation    81350 non-null   float64
 6   Sunshine       74377 non-null   float64
 7   WindGustDir    132863 non-null  object 
 8   WindGustSpeed  132923 non-null  float64
 9   WindDir9am     132180 non-null  object 
 10  WindDir3pm     138415 non-null  object 
 11  WindSpeed9am   140845 non-null  float64
 12  WindSpeed3pm   139563 non-null  float64
 13  Humidity9am    140419 non-null  float64
 14  Humidity3pm    138583 non-null  float64
 15  Pressure9am    128179 non-null  float64
 16  Pressure3pm    128212 non-null  float64
 17  Cloud9am       88536 non-null   float64
 18  Cloud3pm       85099 non-null   float64
 19  Temp9am        141289 non-null  float64
 20  Temp3pm        139467 non-null  float64
 21  RainToday      140787 non-null  object 
 22  RISK_MM        142193 non-null  float64
 23  RainTomorrow   142193 non-null  object 
dtypes: float64(17), object(7)
memory usage: 26.0+ MB
df.sample(5)[df.columns[:10]]
Date Location MinTemp MaxTemp Rainfall Evaporation Sunshine WindGustDir WindGustSpeed WindDir9am
30483 2010-10-16 Sydney 11.7 16.1 0.8 3.4 9.3 WNW 85.0 W
97563 2010-08-10 MountGambier 2.2 14.8 0.0 1.2 0.7 S 61.0 SE
70928 2016-10-10 Mildura 11.4 17.5 0.2 NaN 8.4 W 46.0 W
107169 2012-04-15 Albany 12.9 24.7 0.0 3.0 10.1 NaN NaN NW
107729 2013-12-28 Albany 13.8 21.5 0.0 5.2 13.3 NaN NaN E
df.sample(5)[df.columns[10:20]]
WindDir3pm WindSpeed9am WindSpeed3pm Humidity9am Humidity3pm Pressure9am Pressure3pm Cloud9am Cloud3pm Temp9am
77540 WSW 13.0 19.0 71.0 53.0 1024.1 1023.9 7.0 7.0 14.9
84382 SE 9.0 15.0 65.0 62.0 1020.8 1018.3 7.0 7.0 25.9
51550 ESE 6.0 13.0 97.0 93.0 NaN NaN NaN NaN 1.1
120428 SW 13.0 20.0 55.0 49.0 1015.9 1013.1 1.0 1.0 23.2
6241 W 13.0 4.0 46.0 24.0 1021.7 1018.6 2.0 3.0 25.3
df.sample(5)[df.columns[20:]]
Temp3pm RainToday RISK_MM RainTomorrow
103697 33.4 No 0.0 No
123453 18.2 No 0.0 No
103916 30.8 No 0.0 No
55750 18.6 No 18.4 Yes
18195 15.3 No 1.8 Yes
d = df.groupby('Location').agg(Count=('Date', 'count')).sort_values('Count', ascending=False)
print(f'We have {df.Location.nunique()} unique locations.')
print(f'We have {df.Date.nunique()} days or {df.Date.nunique() // 365}+ years of measurements.')
We have 49 unique locations.
We have 3436 days or 9+ years of measurements.

Check for duplicated data.

df.groupby(['Location', 'Date']).size().sort_values(ascending=False).head(5)
Location      Date      
Adelaide      2008-07-01    1
PerthAirport  2013-11-07    1
              2013-11-01    1
              2013-11-02    1
              2013-11-03    1
dtype: int64

For some locations, we have only about half as many samples as for others.

px.bar(d, x=d.index, y='Count', title='Measurements by Location', labels={'Count': 'Measurements', 'Location': ''})
../_images/imlp_1_data_19_0.png
# average humidity by location
d = df.groupby('Location')[['Humidity3pm']].agg(
    cnt=('Humidity3pm', 'count'),
    m=('Humidity3pm', 'mean'),
    se=('Humidity3pm', 'sem')).reset_index()                           

d['ci'] = d.se * st.t.ppf(0.975, d.cnt - 1)
fig = px.bar(d, x='Location', y='m', error_y='ci', title='Average Humidity3pm by Location', labels={'m': 'Humidity3pm', 'Location': ''})
fig.update_layout(barmode='stack', xaxis={'categoryorder':'total descending'})
../_images/imlp_1_data_20_0.png

Complete, Correct, Create, and Convert

df.isnull().sum().sort_values(ascending=False)
Sunshine         67816
Evaporation      60843
Cloud3pm         57094
Cloud9am         53657
Pressure9am      14014
Pressure3pm      13981
WindDir9am       10013
WindGustDir       9330
WindGustSpeed     9270
WindDir3pm        3778
Humidity3pm       3610
Temp3pm           2726
WindSpeed3pm      2630
Humidity9am       1774
RainToday         1406
Rainfall          1406
WindSpeed9am      1348
Temp9am            904
MinTemp            637
MaxTemp            322
RISK_MM              0
Date                 0
Location             0
RainTomorrow         0
dtype: int64

There are locations completely missing values for Pressure9am, WindGustSpeed, and Pressure3pm. A measurement device is probably not present at these locations at all.

Dropping does not make sense here; we would lose whole locations.

Bear in mind that we will have to implement this selective imputation everywhere we prepare data for this model. We will check later if this imputation is worth it.

g = df.groupby('Location').agg(Pressure9amCount=('Pressure9am', 'count'),
                               Pressure3pmCount=('Pressure3pm', 'count'),
                               WindGustSpeedCount=('WindGustSpeed', 'count'))
g[g.Pressure9amCount == 0]
Pressure9amCount Pressure3pmCount WindGustSpeedCount
Location
MountGinini 0 0 2715
Newcastle 0 0 0
Penrith 0 0 2957
SalmonGums 0 0 2909
print(f'If we drop rows with at least one value missing, we get {df.dropna().shape[0]:,} \
out of {df.shape[0]:,} instances.')
If we drop rows with at least one value missing, we get 56,420 out of 142,193 instances.

Complete

A common strategy is to drop rows with missing values. This could mean dropping a substantial amount of data and not being able to predict for many data samples in production. It is always best to check why data are missing and fix the problem in the data collection phase.

Strategies to populate missing values are:

  1. Use models that are tolerant of missing values, like naive Bayes or decision trees.

  2. Populate categorical data with the mode.

  3. Populate numerical data with the mean/median, or using a group average (see the sketch after this list).

  4. Predict missing values using regression from other columns.
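
For illustration, strategy 3 with a group average can be done with a groupby transform. This is a minimal sketch only, assuming that grouping Temp3pm by Location is a meaningful grouping; the imputation we actually apply follows below.

# Sketch: fill missing Temp3pm with the median of the same Location
# instead of the global median (illustrative only, not applied to df)
group_median = df.groupby('Location')['Temp3pm'].transform('median')
temp3pm_filled = df['Temp3pm'].fillna(group_median)
print(f'Missing before: {df.Temp3pm.isnull().sum():,}, after: {temp3pm_filled.isnull().sum():,}')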

We have features with many missing values, e.g. Sunshine, and features with only some missing values, e.g. Temp3pm, probably caused by a malfunction of a measuring unit.

The columns Evaporation, Sunshine, Cloud9am, and Cloud3pm have many missing values. We fill in the values using iterative regression on a few selected features. When many values are missing in some feature, it is worth adding an indicator feature informing the ML algorithm that the particular value was imputed.

from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer

df['RainToday'] = df['RainToday'].replace({'Yes': 1, 'No': 0})
df['RainTomorrow'] = df['RainTomorrow'].replace({'Yes': 1, 'No': 0})
cols = ['Evaporation', 'Sunshine', 'Cloud9am', 'Cloud3pm', 'Humidity9am', 'Humidity3pm', 'RainToday']
imp = IterativeImputer(initial_strategy='median', add_indicator=True).fit_transform(df[cols].values)

d = pd.DataFrame(imp, columns=cols + [f'{c}_missing' for c in cols])
d.head()[['Evaporation', 'Evaporation_missing', 'Sunshine', 'Sunshine_missing']]
Evaporation Evaporation_missing Sunshine Sunshine_missing
0 6.251375 1.0 6.960355 1.0
1 8.201791 1.0 10.762692 1.0
2 8.671174 1.0 11.255441 1.0
3 8.275212 1.0 11.644286 1.0
4 4.537052 1.0 4.875672 1.0
d[['Evaporation', 'Evaporation_missing', 'Sunshine', 'Sunshine_missing']].describe()
Evaporation Evaporation_missing Sunshine Sunshine_missing
count 142193.000000 142193.000000 142193.000000 142193.000000
mean 5.244790 0.427890 7.430390 0.476929
std 3.475205 0.494775 3.374164 0.499469
min -0.401327 0.000000 0.000000 0.000000
25% 2.979168 0.000000 5.014699 0.000000
50% 4.800000 0.000000 7.717306 0.000000
75% 6.800000 1.000000 10.100000 1.000000
max 145.000000 1.000000 37.475535 1.000000
# replace imputed columns including indicators
# cols_wo_rt = [c for c in cols if 'RainToday' not in c]
# df.drop(cols_wo_rt, axis=1, inplace=True)
# cols_wo_rt = cols_wo_rt + [f'{c}_missing' for c in cols if 'RainToday' not in c]
# df = pd.concat([df, d[cols_wo_rt]], axis=1)
df['RainToday'] = df['RainToday'].fillna(0)

Impute the rest with the median for numerical features and the mode for categorical features.

from sklearn.impute import SimpleImputer

impute_columns_numerical = ['WindGustSpeed', 'Pressure9am', 'Pressure3pm', 'Temp3pm', 'WindSpeed3pm',
                            'Rainfall', 'WindSpeed9am', 'Temp9am', 'MinTemp', 'MaxTemp', 'Evaporation', 
                            'Sunshine', 'Cloud9am', 'Cloud3pm', 'Humidity9am', 'Humidity3pm']

impute_columns_categorical = ['WindGustDir', 'WindDir9am', 'WindDir3pm']    

for c in impute_columns_numerical:
    df[c] = SimpleImputer(strategy='median').fit_transform(df[c].values.reshape(-1, 1))
    
for c in impute_columns_categorical:
    df[c] = SimpleImputer(strategy='most_frequent').fit_transform(df[c].values.reshape(-1, 1))

Check that we handled everything.

df.isnull().sum().sort_values(ascending=False).head()
Date         0
Location     0
RISK_MM      0
RainToday    0
Temp3pm      0
dtype: int64

Correct

We review the data for striking errors caused by e.g. transformations of data at the source, manual input errors, or numerical feature values stored as nominal (because 1 is not the same as "1"), etc.

# basic overview about distribution of values in numerical columns
cols = ['Cloud9am', 'Humidity9am']
df[cols].describe()
Cloud9am Humidity9am
count 142193.000000 142193.000000
mean 4.649568 68.858235
std 2.294357 18.932512
min 0.000000 0.000000
25% 3.000000 57.000000
50% 5.000000 70.000000
75% 6.000000 83.000000
max 9.000000 100.000000
# boxplots
m = pd.melt(df[cols + ['Date', 'Location']], id_vars=['Date', 'Location'], value_vars=cols)
fig = px.box(m, y='value', facet_col='variable', points='outliers', height=400)
fig.update_yaxes(dict(matches=None, showticklabels=True), row=1, col=2)
../_images/imlp_1_data_40_0.png

Outliers

Outliers in the data can be real, e.g. flooding showing up in Rainfall, or can come from faulty measurements. Outliers skew the mean of the feature, and ML algorithms that calculate distances, like linear models for both classification and regression, are sensitive to them.

There are different techniques for handling outliers:

  1. Cap values to stay inside the interquartile range.

  2. Bin the numerical feature to create categorical values.

  3. Apply a function transformation that penalizes long-tail outliers, e.g. taking the \(\log\) of values.

We cap numerical feature values \(v\) to stay within

\[q_1 - m * \text{IQR} \le v \le q_3 + m * \text{IQR}\]

Where

  1. \(q_1\) is the 25% quartile

  2. \(q_3\) is the 75% quartile

  3. \(\text{IQR} = q_3 - q_1\) is the inter-quartile range

  4. \(m\) is a constant

IQR is defined as the difference between the value at the 75% quartile (\(q_3\)) and the value at the 25% quartile (\(q_1\)). We can use different multiples of IQR for different features; common practice is \(m = 1.5\) or \(m = 1.8\).

q1 = df[cols].quantile(0.25)
q3 = df[cols].quantile(0.75)
iqr = q3 - q1

df[cols] = df[cols].clip(q1 - 1.5*iqr, q3 + 1.5*iqr, axis=1)
df[cols].describe()
Cloud9am Humidity9am
count 142193.000000 142193.000000
mean 4.649568 68.911015
std 2.294357 18.779092
min 0.000000 18.000000
25% 3.000000 57.000000
50% 5.000000 70.000000
75% 6.000000 83.000000
max 9.000000 100.000000
# fixed distributions
cols = ['MinTemp', 'Rainfall', 'WindGustSpeed']
m = pd.melt(df[cols + ['Date', 'Location']], id_vars=['Date', 'Location'])
fig = px.box(m, y='value', facet_col='variable', height=400)
fig.update_yaxes(dict(matches=None, showticklabels=True), row=1, col=2)
fig.update_yaxes(dict(matches=None, showticklabels=True), row=1, col=3)
../_images/imlp_1_data_47_0.png

Irregular Cardinality

Some numerical features might be better represented as categorical features, because they are coded as ordinal numbers (e.g. sex coded as 0 and 1) or because they can only take certain specified values.

All our numerical columns seem to have reasonable cardinality, and there is probably no candidate for a categorical type among the numerical features.

df.nunique().sort_values().head(15)
RainTomorrow       2
RainToday          2
Cloud3pm          10
Cloud9am          10
WindDir3pm        16
WindDir9am        16
WindGustDir       16
WindSpeed9am      43
WindSpeed3pm      44
Location          49
WindGustSpeed     67
Humidity9am       83
Humidity3pm      101
Sunshine         145
Evaporation      356
dtype: int64

Create

We can use our domain knowledge and the available features to create new features and determine whether they carry signal for the predicted outcome. We can parse string features and combine features to discover good interactions and amplification effects. We can also crawl data to enhance data points with historical trends, local context, and additional data sources that are available to us (e.g. satellite images) and are joinable to the existing data.

We check if there was a drop or rise between morning and afternoon pressure.

df['PressureDiff'] = df['Pressure3pm'] - df['Pressure9am']
df['PressureDrop'] = df['PressureDiff'].map(lambda l: 1 if l < 0 else 0)

We include it both as an indicator (PressureDrop) and as the difference value itself (PressureDiff) to check which works better in the following exploratory data analysis.

Date/Time Features

Date and time data usually provide valuable contextual information, but we have to transform the data to make it accessible to ML algorithms. Some ML algorithms can work on time series data and learn trends like periodicity.

Common strategies:

  1. Extract year, month, day, and weekday into separate columns to give the model some information about possible periodicity. This is useful e.g. when we process transactional data, as weekends are always different from weekdays.

  2. Extract the time period between events in the data, e.g. how long it has been from the last event to the data point.

  3. Use dates to connect data from additional datasets like holidays and weekends, as done e.g. by Facebook Prophet.

The downside is that when using the model in production to predict on one measurement, we have to load the preceding measurements to calculate values for these features. In a real-time setting, this requires database lookups.

We extract Year and Month features here. We also add DaysSinceRainWeek to capture how many days have passed since the last rainy day in a rolling week, and RainDaysWeek to capture how many rainy days there were in a rolling week.

# Extract year and month
df['DateD'] = pd.to_datetime(df.Date, format='%Y-%m-%d')

df['Year'] = df['DateD'].dt.year.astype('category')
df['Month'] = df['DateD'].dt.month.astype('category')
df['YearMonth'] = df['DateD'].dt.strftime('%Y-%m')
df[['Date', 'Year', 'Month', 'YearMonth']].head(5)
Date Year Month YearMonth
0 2008-12-01 2008 12 2008-12
1 2008-12-02 2008 12 2008-12
2 2008-12-03 2008 12 2008-12
3 2008-12-04 2008 12 2008-12
4 2008-12-05 2008 12 2008-12
def agg_days_since_rain(x):
    # x holds RainToday flags in the rolling window; drop the current day,
    # then count back to the most recent rainy day.
    # 20 is a sentinel meaning "no rain within the window".
    x = x[:-1]
    r = np.argmax(np.flip(x)) + 1 if x.any() else 20
    return r

def agg_rain_days(x):
    # Number of rainy days in the window, excluding the current day
    return np.sum(x[:-1])

def rolling_feature(d, by, on, val, window, name, agg_fce):
    # Apply agg_fce over a rolling window of `window` rows per `by` group
    # (ordered by `on`) and return the result as a frame mergeable on Id
    cs = [by, on]
    r = d.sort_values(cs).groupby(by).rolling(window, on=on, min_periods=1)
    
    agg = r.apply(agg_fce, raw=True)
    agg[name] = agg[val]  # agg has the aggregated values in column `val`
    agg.drop([val, on], axis=1, inplace=True)
    agg.rename_axis(index=(by, 'Id'), inplace=True)
    agg.reset_index(inplace=True)
    return agg

Contextual Features

We can look beyond a single row of data to calculate contextual features. Keep in mind that in production mode, we would need to load the context for every row we want to predict on from somewhere. In the current setting, we can cheaply evaluate whether adding such features helps the performance of the model.

We use the last 7 days of data per Location to calculate the features

  1. RainDaysWeek capturing how many times it rained in the last week (excluding the given day),

  2. DaysSinceRainWeek capturing how long it has been since the last rainy day within the last week.

# Rolling_feature is custom rolling window function that takes different aggregation functions
dsrw = rolling_feature(df[['DateD', 'Location', 'RainToday']], 'Location', 'DateD', 'RainToday', 8, 'DaysSinceRainWeek', agg_days_since_rain)
rdw = rolling_feature(df[['DateD', 'Location', 'RainToday']], 'Location', 'DateD', 'RainToday', 8, 'RainDaysWeek', agg_rain_days)

# Prepare original dataset to have the same index as rv
d = df.copy()
d.set_index(['Location'], append=True, inplace=True)
d.rename_axis(('Id', 'Location'), inplace=True)
d.reset_index(inplace=True)

# Merge new features and drop technical columns
d = d.merge(dsrw, on=['Id', 'Location'])
d = d.merge(rdw, on=['Id', 'Location'])

d['NoRainWeek'] = d['DaysSinceRainWeek'].map(lambda l: 1 if l == 20 else 0)
d['DaysSinceRainWeek'] = d['DaysSinceRainWeek'].replace({20: 0})

d[(d.Location == 'Melbourne')][['Date', 'Location', 'RainToday', 'DaysSinceRainWeek', 'NoRainWeek', 'RainDaysWeek']].head(10)
Date Location RainToday DaysSinceRainWeek NoRainWeek RainDaysWeek
65745 2008-07-01 Melbourne 1.0 0.0 1 0.0
65746 2008-07-02 Melbourne 0.0 1.0 0 1.0
65747 2008-07-03 Melbourne 1.0 2.0 0 1.0
65748 2008-07-04 Melbourne 0.0 1.0 0 2.0
65749 2008-07-05 Melbourne 0.0 2.0 0 2.0
65750 2008-07-06 Melbourne 0.0 3.0 0 2.0
65751 2008-07-07 Melbourne 0.0 4.0 0 2.0
65752 2008-07-08 Melbourne 1.0 5.0 0 2.0
65753 2008-07-09 Melbourne 1.0 1.0 0 2.0
65754 2008-07-10 Melbourne 1.0 1.0 0 3.0
df = d

Convert

Some ML algorithms, like linear models or neural nets, cannot handle non-numeric values in features. We have to encode categorical features using dummy values, or better, one-hot encoding, which helps with collinear features. It is good practice to encode categorical values even for ML algorithms that can handle them, like decision trees and forests of trees, to be able to test, use, and combine multiple algorithms.

Categorical Features to Numerical

df[['WindGustDir', 'Location', 'Year']].head(5)
WindGustDir Location Year
0 W Albury 2008
1 WNW Albury 2008
2 WSW Albury 2008
3 NE Albury 2008
4 W Albury 2008

We need to convert categorical values into numerical ones, because many ML algorithms can handle only floating-point numbers on input. We cannot simply assign natural numbers to categories, because numbers implicitly encode an ordering, and this is then picked up by models. For example, if we encode North as 0 and South as 1, South is greater than North.
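
A tiny illustration of the implicit ordering problem; pd.factorize simply enumerates categories in order of appearance:

# Naive integer encoding imposes an arbitrary ordering on wind directions
codes, uniques = pd.factorize(df['WindGustDir'])
print(dict(zip(uniques[:4], range(4))))  # {'W': 0, 'WNW': 1, 'WSW': 2, 'NE': 3}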

One-hot Encoding

One-hot encoding encodes categorical or ordinal features as nominal using “flag columns” or “dummy values” of zeros and ones.

When a feature has \(N\) distinct values, we need \(N-1\) flag columns to represent it. The \(N\)th value is represented by all flag columns being zero. This helps with ML algorithms that are sensitive to collinearity, e.g. linear and logistic regression without regularization, and neural nets.

from sklearn.preprocessing import OneHotEncoder

v = df['WindGustDir'].values.reshape(-1, 1)
e = OneHotEncoder(drop='first', sparse=False).fit(v)
p = pd.DataFrame(e.transform(v), columns=e.categories_[0][1:])
d = pd.concat([df['WindGustDir'], p], axis=1)
d.head(5)
WindGustDir ENE ESE N NE NNE NNW NW S SE SSE SSW SW W WNW WSW
0 W 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0
1 WNW 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0
2 WSW 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0
3 NE 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
4 W 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0

An East WindGustDir is encoded with all zeros in the remaining flag columns.

d[d.WindGustDir == 'E'].head(3)
WindGustDir ENE ESE N NE NNE NNW NW S SE SSE SSW SW W WNW WSW
76 E 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
167 E 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
169 E 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0

Exploratory Data Analysis

We explore the values of the available features to learn their distributions, evaluate their impact on the target feature, learn how they interact, and study them in detail. The outputs are learnings, detailed questions for the data providers, and knowledge of what transformations we have to apply to the data before passing it to an ML algorithm.

Categorical Feature

We can check the proportion of the positive level of the target feature within the different levels of a categorical feature.

Plotting confidence intervals gives us some idea about the absolute number and variance of samples in every level of the categorical feature.
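
The helper mlp.eda_categorical used below comes from the course utilities. A minimal sketch of the per-level statistics it presumably computes, mirroring the Humidity3pm confidence-interval calculation above (the helper's exact signature and behavior are assumptions):

# Sketch: mean of the target and a 95% t-interval per level of a feature,
# assumed to approximate what mlp.eda_categorical computes
def eda_categorical_sketch(data, feature, target):
    g = data.groupby(feature)[target].agg(['count', 'mean', 'sem']).reset_index()
    g['ci'] = g['sem'] * st.t.ppf(0.975, g['count'] - 1)  # 95% confidence interval
    return g

eda_categorical_sketch(df, 'RainToday', 'RainTomorrow')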

categorical_columns = list(df.select_dtypes(['object']).columns) + ['PressureDrop', 'RainToday']
categorical_columns
['Location',
 'Date',
 'WindGustDir',
 'WindDir9am',
 'WindDir3pm',
 'YearMonth',
 'PressureDrop',
 'RainToday']
png_renderer.height = 800
png_renderer.width = 1000
d = mlp.eda_categorical(df, 'RainTomorrow', max_categorical_values=12)
fig = px.bar(d, x='value', y='mean', error_y='ci', facet_col='variable',
       facet_col_wrap=3,
       labels={'value': '', 'mean': 'RainTomorrow'})
fig.update_xaxes(dict(matches=None, showticklabels=True))
../_images/imlp_1_data_81_0.png
# Annual and monthly rainy days
png_renderer.height = 400
fig = px.bar(d[d.variable.isin(['Year', 'Month'])], x='value', y='mean', error_y='ci', facet_col='variable',
       facet_col_wrap=2,
       labels={'value': '', 'mean': 'RainTomorrow'})
fig.update_xaxes(dict(matches=None, showticklabels=True))
../_images/imlp_1_data_82_0.png
# Annual and monthly rainy days
d = mlp.eda_categorical(df[['RainTomorrow', 'YearMonth']], 'RainTomorrow', max_categorical_values=0)

fig = go.Figure([
    go.Scatter(name='RainTomorrow', x=d['value'], y=d['mean'], mode='lines', showlegend=False),
    go.Scatter(x=d['value'], y=d['mean']+d['ci'], mode='lines', line=dict(width=0), showlegend=False),
    go.Scatter(x=d['value'], y=d['mean']-d['ci'], mode='lines', line=dict(width=0), showlegend=False,
               fillcolor='rgba(68, 68, 68, 0.3)', fill='tonexty')
])
fig.update_layout(title='', yaxis_title='RainTomorrow')
../_images/imlp_1_data_83_0.png

Numerical Features

We check the univariate distribution of a numerical feature against the target variable using histograms or e.g. kernel density plots. If the distributions are distinct, we have a good predictive feature.

# Does Humidity3pm correlate with RainTomorrow?
import plotly.figure_factory as ff
fig = ff.create_distplot(
    [df[df.RainTomorrow == 0]['Humidity3pm'], df[df.RainTomorrow == 1]['Humidity3pm']],
    ['0', '1'],
    bin_size=1,
    show_hist=False,
    show_rug=False,
    show_curve=True,
)
fig.update_layout(xaxis_title='Humidity3pm', legend_title='RainTomorrow')
../_images/imlp_1_data_86_0.png
# Does Rainfall correlate with RainTomorrow?
import plotly.figure_factory as ff
fig = ff.create_distplot(
    [df[df.RainTomorrow == 0]['Rainfall'], df[df.RainTomorrow == 1]['Rainfall']],
    ['0', '1'],
    bin_size=1,
    show_hist=False,
    show_rug=False,
    show_curve=True,
)
fig.update_layout(xaxis_title='Rainfall', legend_title='RainTomorrow', xaxis_range=[-5,50])
../_images/imlp_1_data_87_0.png

Categorical vs. Categorical

We can use grouped bar charts to check how combinations of two categorical features are represented in the target feature.

# Do RainToday and DaysSinceRainWeek or RainDaysWeek influence RainTomorrow?
d = mlp.eda_grouped(df, 'RainTomorrow', ['RainToday', 'DaysSinceRainWeek'])

d.DaysSinceRainWeek = d.DaysSinceRainWeek.astype('category')
px.bar(d, x='RainToday', y='mean', color='DaysSinceRainWeek', error_y='ci', barmode='group', labels={'mean': 'RainTomorrow'})
../_images/imlp_1_data_89_0.png
# Do RainToday and DaysSinceRainWeek or RainDaysWeek influence RainTomorrow?
d = mlp.eda_grouped(df, 'RainTomorrow', ['RainToday', 'RainDaysWeek'])
d.RainDaysWeek = d.RainDaysWeek.astype('category')
px.bar(d, x='RainToday', y='mean', color='RainDaysWeek', error_y='ci', barmode='group', labels={'mean': 'RainTomorrow'})
../_images/imlp_1_data_90_0.png

Categorical vs. Numerical

We can use bar charts, box plots, or violin plots grouped by the target feature to check the distribution of a numerical feature against the values of a categorical feature.

  1. Bars give us a point estimate of the mean value.

  2. Box plots give us the spread, quartiles, and outliers.

  3. Violin plots give us the probability density function.

# More details about numerical feature distribution

from plotly.subplots import make_subplots
fig = make_subplots(rows=1, cols=3, shared_yaxes=False)

d = mlp.eda_categorical(df[['RainToday', 'Cloud3pm']], 'Cloud3pm')
fig.add_trace(
    go.Bar(x=d['value'], y=d['mean'], error_y=dict(type='data', array=d['ci'], color='grey')),
    row=1, col=1
)

fig.add_trace(
    go.Box(x=df['RainToday'], y=df['Cloud3pm']),
    row=1, col=2
)
fig.add_trace(
    go.Violin(x=df['RainToday'], y=df['Cloud3pm'], meanline_visible=True),
    row=1, col=3
)

fig.update_xaxes(title_text="RainToday", row=1, col=1)
fig.update_xaxes(title_text="RainToday", row=1, col=2)
fig.update_xaxes(title_text="RainToday", row=1, col=3)
fig.update_yaxes(title_text="Cloud3pm", row=1, col=1)
fig.update_traces(meanline_visible=True, row=1, col=3)
fig.update_layout(showlegend=False)
fig.show()
../_images/imlp_1_data_93_0.png

To check the joint impact of a categorical and a numerical feature on the target feature, we can color by the target feature.

# Going deeper and using color

d = mlp.eda_grouped(df[['RainToday', 'RainTomorrow', 'Cloud3pm']], 'Cloud3pm', ['RainToday', 'RainTomorrow'])

fig = make_subplots(rows=1, cols=3, shared_yaxes=False)

fig.add_trace(
    go.Bar(name='0', 
           legendgroup='a',
           marker={'color': '#2D364C'},
           x=d.RainToday.unique(), y=d[d.RainTomorrow == 0]['mean'], error_y={'array': d[d.RainTomorrow == 0]['ci']}
    ),
    1, 1
)
fig.add_trace(
    go.Bar(name='1',
           legendgroup='a',
           marker={'color': '#FF7800'},
           x=d.RainToday.unique(), y=d[d.RainTomorrow == 1]['mean'], error_y={'array': d[d.RainTomorrow == 1]['ci']}),
    1, 1
)
fig.add_trace(
    go.Box(name='0', 
           legendgroup='a', showlegend=False,
           marker={'color': '#2D364C'},
           x=df[df.RainTomorrow == 0]['RainToday'], y=df[df.RainTomorrow == 0]['Cloud3pm']
    ),
    1, 2
)
fig.add_trace(
    go.Box(name='1',
           legendgroup='a', showlegend=False,
           marker={'color': '#FF7800'},
           x=df[df.RainTomorrow == 1]['RainToday'], y=df[df.RainTomorrow == 1]['Cloud3pm']
    ),
    1, 2
)
fig.add_trace(
    go.Violin(name='0',
              legendgroup='a', showlegend=False,
              marker={'color': '#2D364C'},
              x=df[df.RainTomorrow == 0]['RainToday'], y=df[df.RainTomorrow == 0]['Cloud3pm'], meanline_visible=True),
    row=1, col=3
)
fig.add_trace(
    go.Violin(name='1',
              legendgroup='a', showlegend=False,
              marker={'color': '#FF7800'},
              x=df[df.RainTomorrow == 1]['RainToday'], y=df[df.RainTomorrow == 1]['Cloud3pm'], meanline_visible=True),
    row=1, col=3
)
fig.update_layout(
    boxmode='group', barmode='group', violinmode='group', legend={'title': 'RainTomorrow'}
)
fig.update_xaxes(title_text="RainToday", row=1, col=1)
fig.update_xaxes(title_text="RainToday", row=1, col=2)
fig.update_xaxes(title_text="RainToday", row=1, col=3)
fig.update_yaxes(title_text="Cloud3pm", row=1, col=1)
fig.update_traces(meanline_visible=True, row=1, col=3)

fig
../_images/imlp_1_data_95_0.png

Facets

We can facet plots by the values of categorical features to investigate interactions of up to 4 features against the target feature. We generally put the target variable on the y axis, and use x, columns, rows, and hue for up to 4 explanatory features.

# facets
d = mlp.eda_grouped(df[['RainDaysWeek', 'RainTomorrow', 'RainToday', 'Year']], 'RainTomorrow', ['RainDaysWeek', 'RainToday', 'Year'])
d = d[(d.RainDaysWeek <= 3)]
d.RainDaysWeek = d.RainDaysWeek.astype('category')
d.RainToday = d.RainToday.astype('category')

px.line(d, x='Year', y='mean', color='RainToday', facet_col='RainDaysWeek', labels={'mean': 'RainTomorrow', 'Year': ''})
../_images/imlp_1_data_98_0.png

Interactions of Pairs of Features Using Pair Plots or Scatter Plots

We can use a Seaborn pairplot or a Plotly scatter matrix to analyze interactions and distributions of many features at once. These plots take time to render and are best used printed :)

# beware this takes long minutes to calculate and display
vars = ['RainTomorrow', 'RISK_MM', 'Rainfall', 'Humidity3pm']
png_renderer.height = 800
px.scatter_matrix(df.sample(10000), color='RainTomorrow', dimensions=vars, opacity=0.3)
../_images/imlp_1_data_100_0.png

Covariance

It is sometimes not straightforward to inspect relationships between features visually when there are hundreds or thousands of features. Covariance measures how similarly the values of two features \(a\) and \(b\) behave in relation to their respective means.

\[\text{cov}(a, b) = \frac{1}{n - 1}\sum_{i=1}^{n}((a_i - \bar{a}) \times (b_i - \bar{b}))\]

Positive values of covariance signal a positive relationship, negative values signal a negative relationship, and values around \(0\) signal no relationship. Covariance values are in the units of the features, so it makes no sense to compare different pairs of features using them.
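
A quick check of the covariance formula against pandas' built-in sample covariance; the feature pair is chosen arbitrarily for illustration:

# Covariance per the formula above vs. pandas' built-in .cov()
a, b = df['Humidity3pm'], df['Temp3pm']
cov_manual = ((a - a.mean()) * (b - b.mean())).sum() / (len(a) - 1)
cov_pandas = df[['Humidity3pm', 'Temp3pm']].cov().iloc[0, 1]
print(cov_manual, cov_pandas)  # the two values agree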

Correlation

Normalized covariance is called correlation, and it lies in the interval \([-1,1]\).

\[\text{corr}(a, b) = \frac{\text{cov}(a, b)}{\sigma_a \times \sigma_b}\]

Correlation captures only linear relationships between features; there could be more complex relationships, like quadratic ones, that we do not discover using correlation. When we calculate correlations for all features, we get a correlation matrix suitable for visual inspection.
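
A tiny illustration of the linearity caveat: \(y = x^2\) is perfectly determined by \(x\), yet their linear correlation is near zero.

# Correlation misses non-linear relationships
x = np.linspace(-1, 1, 1001)
print(np.corrcoef(x, x ** 2)[0, 1])  # ~0 despite an exact quadratic relationship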

# correlations

corr = df.corr()
top_corrs = corr['RainTomorrow'].abs().sort_values(ascending=False)[:10].index
d = corr.loc[corr.index.isin(top_corrs)][top_corrs]

png_renderer.height = 400
png_renderer.width = 400
px.imshow(d, title='Correlations')
../_images/imlp_1_data_106_0.png

Correlation is not causation:

  1. Swallows cause hot weather, spinning windmills cause wind, playing basketball causes people to be tall.

  2. A confounding feature (a hidden 3rd factor), e.g. young children sleeping with a night-light turned on developing short-sightedness.

Causation can only be proved by a randomized experiment.

Features with a strong correlation with the target are suspicious and worth investigating. A strong correlation could hint at hidden leaks, like time travel in features, etc.

Data Preparation

Normalization

Features usually vary a lot in their ranges. This is a problem for any ML algorithm using distances, like k-nearest neighbors; for PCA, which works with variance that is high for high-magnitude features; and for deep learning, where it slows down gradient descent or even prevents convergence completely. For example, if feature \(x \in [0, 5]\) and feature \(z \in [0, 1000]\), then an increase of \(1\) in \(x\) and in \(z\) means something different, and we cannot compare them.

Normalize into \([0, 1]\) by

\[ x_i = \frac{x_i - \min(x)}{\max(x) - \min(x)}\]

Standardize by

\[ x_i = \frac{x_i - \bar{x}}{\sigma_x}\]
# examples of normalization

from sklearn.preprocessing import StandardScaler, normalize
d = df['Pressure9am'].dropna()
ds = StandardScaler().fit_transform(d.values.reshape(-1, 1))
dn = (d - d.min()) / (d.max() - d.min())

png_renderer.width = 800
fig = make_subplots(rows=1, cols=3, shared_xaxes=False, shared_yaxes=False)
fx1 = ff.create_distplot(
    [d, dn, ds[:,0]],
    ['0', '1', '2'],
    curve_type='kde',
    bin_size=1,
    show_hist=False,
    show_rug=False,
    show_curve=True
)
fig.add_trace(go.Scatter(name='Original', x=fx1.data[0]['x'], y=fx1.data[0]['y']), 1,1)
fig.add_trace(go.Scatter(name='Normalize', x=fx1.data[1]['x'], y=fx1.data[1]['y']), 1,2)
fig.add_trace(go.Scatter(name='Standard Scaler', x=fx1.data[2]['x'], y=fx1.data[2]['y']), 1,3)
fig.update_xaxes(title_text="Pressure9am", row=1, col=1)
fig.update_xaxes(title_text="Pressure9am", row=1, col=2)
fig.update_xaxes(title_text="Pressure9am", row=1, col=3)
../_images/imlp_1_data_113_0.png

Binning

Binning converts a continuous feature into a categorical one for better interpretation of the model, better generalization, and prevention of overfitting. E.g. latitude has an impact on housing price, but not a linear one; we could bin latitude to make it categorical.

Machine Learning Crash Course, Google

We can:

  1. Create equal-width bins - this does not work well for uni-modal distributions, because a few bins will have many samples and many bins will cover the long tail with few samples each.

  2. Create equal-frequency bins - more appropriate for uni-modal distributions, but less explainable.

Binning handles outliers automatically.

For categorical data, it might be beneficial to squash low-frequency values into ‘Other’; you do not necessarily want to end up with sparse feature matrices. Both binning strategies and the squashing are sketched below.
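
A minimal sketch of both binning strategies and the ‘Other’ squashing; the columns, bin counts, and frequency threshold here are arbitrary illustrations, not choices we use later:

# Equal-width bins vs. equal-frequency bins on Humidity3pm
equal_width = pd.cut(df['Humidity3pm'], bins=5)
equal_freq = pd.qcut(df['Humidity3pm'], q=5, duplicates='drop')
print(equal_width.value_counts().sort_index())
print(equal_freq.value_counts().sort_index())

# Squash low-frequency locations into 'Other' (threshold chosen arbitrarily)
counts = df['Location'].value_counts()
rare = counts[counts < 1500].index
location_squashed = df['Location'].where(~df['Location'].isin(rare), 'Other')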

Fixing Skewness

Taking the log of values makes exponential or “long-tail” distributions more normal.

  1. Logarithm makes comparisons of values across many orders of magnitude possible, e.g. a change from 1 to 10 is the same as a change from 10 000 to 100 000 (consider e.g. the value of money).

  2. In many datasets, the order of magnitude of the data changes within the range of the data.

  3. It handles outliers.

Box-cox

The Box-Cox (power) transformation is a generalization of taking the \(\log\) of features. The Box-Cox transform first fits \(\lambda\) to the feature values using the maximum log-likelihood method.

\[\begin{split} y_i^{(\lambda)} = \begin{cases} \frac{y_i^\lambda - 1}{\lambda} & \text{if } \lambda \ne 0 \\ \ln(y_i) & \text{if } \lambda = 0 \end{cases} \end{split}\]
# example of long-tailed distribution

d = df['WindSpeed9am'].dropna()
l, lmbda = st.boxcox(d+1)

fig = make_subplots(rows=2, cols=2, shared_xaxes=False, shared_yaxes=False)
fx1 = ff.create_distplot(
    [d, np.log(d+1), np.sqrt(d), st.boxcox(d+1, lmbda=lmbda)],
    ['0', '1', '2', '3'],
    curve_type='kde',
    bin_size=1,
    show_hist=False,
    show_rug=False,
    show_curve=True
)
fig.add_trace(go.Scatter(name='WindSpeed9am', x=fx1.data[0]['x'], y=fx1.data[0]['y']), 1,1)
fig.add_trace(go.Scatter(name='Log', x=fx1.data[1]['x'], y=fx1.data[1]['y']), 1,2)
fig.add_trace(go.Scatter(name='Sqrt', x=fx1.data[2]['x'], y=fx1.data[2]['y']), 2,1)
fig.add_trace(go.Scatter(name='Box-Cox', x=fx1.data[3]['x'], y=fx1.data[3]['y']), 2,2)
../_images/imlp_1_data_119_0.png

Dimensionality Reduction

High-dimensional datasets can be very difficult to visualize. While data in two or three dimensions can be plotted to show the inherent structure of the data, equivalent high-dimensional plots are much less intuitive. To aid visualization of the structure of a dataset, the dimension must be reduced in some way.

The simplest way to accomplish this dimensionality reduction is by taking a random projection of the data. Though this allows some degree of visualization of the data structure, the randomness of the choice leaves much to be desired. In a random projection, it is likely that the more interesting structure within the data will be lost.

Scikit-learn Manifold Learning
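
For comparison, a random projection itself is a one-liner in scikit-learn. A minimal sketch on a few of our numerical columns (the column choice is arbitrary, and we do not use this projection further):

# Random projection of a few numerical features into 2 dimensions
from sklearn.random_projection import GaussianRandomProjection

rp = GaussianRandomProjection(n_components=2, random_state=0)
projected = rp.fit_transform(df[['Humidity3pm', 'Temp3pm', 'Pressure9am', 'WindSpeed3pm']])
print(projected.shape)  # one 2D point per instance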

Linear decomposition frameworks like principal component analysis (PCA) and non-linear manifold learning (e.g. t-SNE) can make these projections possible and informative. Manifold learning is not as intuitive as linear decomposition and needs good parameter tuning (see e.g. the nice visualizations that build intuition in How to Use t-SNE Effectively).

Distill.pub, How to Use t-SNE Effectively

# beware, this takes about 30 minutes to compute
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.manifold import TSNE
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

features = []
features += [('Location', OneHotEncoder(drop='first', sparse=False), ['Location'])]
features += [('YearMonth', OneHotEncoder(drop='first', sparse=False, categories='auto'), ['YearMonth'])]
features += [('Humidity3pm', SimpleImputer(strategy='mean'), ['Humidity3pm'])]

pipeline_pca = Pipeline([('ct', ColumnTransformer(features)),
                         ('sca', StandardScaler()),
                         ('pca', PCA(n_components=2))
                        ])

pipeline_tsne = Pipeline([('ct', ColumnTransformer(features)),
                          ('sca', StandardScaler()),
                          ('tsne', TSNE(n_components=2, perplexity=50)),
                         ])

o_pca = np.hstack((pipeline_pca.fit_transform(df), df['RainTomorrow'].values.reshape(-1, 1)))
d_pca = pd.DataFrame(o_pca, columns=['Component 1', 'Component 2', 'RainTomorrow'])

o_tsne = np.hstack((pipeline_tsne.fit_transform(df), df['RainTomorrow'].values.reshape(-1, 1)))
d_tsne = pd.DataFrame(o_tsne, columns=['Component 1', 'Component 2', 'RainTomorrow'])
d_pca.RainTomorrow = d_pca.RainTomorrow.astype('category')
fig = px.scatter(d_pca, x='Component 1', y='Component 2', color='RainTomorrow', title='PCA Transformation')
fig.update_traces(marker_opacity=0.3)
../_images/imlp_1_data_126_0.png
d_tsne.RainTomorrow = d_tsne.RainTomorrow.astype('category')
fig = px.scatter(x=d_tsne['Component 1'], y=d_tsne['Component 2'], color=d_tsne['RainTomorrow'], title='t-SNE Transformation')
fig.update_traces(marker_opacity=0.3)
../_images/imlp_1_data_127_0.png

Handling Class Imbalance

In a case where we have a 2% to 98% imbalanced dataset, always predicting the majority class gives us a model with 98% accuracy.

Imbalanced datasets are common. The accuracy paradox tells us that accuracy, defined as the number of correctly classified instances divided by the number of all instances, is not a good metric here.
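
On our dataset, the majority-class baseline already looks deceptively good:

# Accuracy of a "model" that always predicts 'no rain tomorrow'
majority_accuracy = (df['RainTomorrow'] == 0).mean()
print(f'Majority-class baseline accuracy: {majority_accuracy:.1%}')  # roughly 78%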

Strategies:

  1. Use penalized models - the model pays more when misclassifying the minority class.

  2. Under-sample the majority class if you have a lot of data.

  3. Over-sample the minority class if you do not have enough data.

  4. Generate synthetic samples using e.g. imblearn’s SMOTE (Synthetic Minority Oversampling Technique).

  5. Use precision, recall, and f1-score to evaluate the model; they do not suffer from the accuracy paradox.

  6. Change the model; e.g. random forests are not that sensitive to imbalanced datasets.

When doing over-sampling, always sample after splitting the dataset into training, testing (and validation) datasets.
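
A minimal, self-contained sketch of strategies 1 and 4 on a toy dataset (SMOTE comes from the optional imbalanced-learn package; the toy data only stands in for a real training split):

# Strategy 1: penalized model; strategy 4: synthetic over-sampling with SMOTE
from collections import Counter
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier
from imblearn.over_sampling import SMOTE

x_toy, y_toy = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)

weighted_model = DecisionTreeClassifier(class_weight='balanced')  # minority errors cost more

x_res, y_res = SMOTE(random_state=0).fit_resample(x_toy, y_toy)
print(Counter(y_toy), Counter(y_res))  # minority class over-sampled to parity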

Over-sampling Using SMOTE and ADASYN Techniques

Decision boundaries.

Imbalanced-learn: A Python Toolbox to Tackle the Curse of Imbalanced Datasets in Machine Learning.

Our dataset is slightly imbalanced; we will see how different algorithms handle this in the next lesson.

df['RainTomorrow'].value_counts()
0    110316
1     31877
Name: RainTomorrow, dtype: int64

The First Model

Let’s try to combine everything we learned into one dataset and use a simple model on it.

We first encode the categorical features using OHE.

numeric_columns = list(df.select_dtypes(['float64']).columns) + list(df.select_dtypes(['int64']).columns)
numeric_columns.remove('RainTomorrow')
numeric_columns.remove('Id')

ohe = OneHotEncoder(drop='first', sparse=False).fit_transform(df[['WindDir9am', 'WindDir3pm', 'WindGustDir', 'Location', 'YearMonth']])
train = np.hstack([ohe, df[numeric_columns + ['RainTomorrow']]])

We keep 30% of the samples out of sight of the model to have a good way of evaluating model performance.

from sklearn.model_selection import train_test_split
train, test = train_test_split(train, test_size=0.3)
x_train = train[:, :-1]
y_train = train[:, -1]
x_test = test[:, :-1]
y_test = test[:, -1]
from sklearn.tree import DecisionTreeClassifier
model = DecisionTreeClassifier()
model.fit(x_train, y_train);

from sklearn.metrics import classification_report
print('Classification report on testing dataset')
print(classification_report(y_test, model.predict(x_test)))
Classification report on testing dataset
              precision    recall  f1-score   support

         0.0       1.00      1.00      1.00     33133
         1.0       1.00      1.00      1.00      9525

    accuracy                           1.00     42658
   macro avg       1.00      1.00      1.00     42658
weighted avg       1.00      1.00      1.00     42658

It looks too good to be true!

Do you know what the problem could be?