Lesson 5

Linear model and neuralnet from scratch

Tabular model from scratch
Review Titanic dataset and the two models in Excel
From Excel to Python
Clean version of notebook
- What does a clean version of the notebook look like?
Get comfortable in Paperspace Gradient
- How to work in JupyterLab mode instead of the default mode?
- How to switch between JupyterLab mode and Jupyter Notebook mode?
- Learn some useful keyboard shortcuts
Things to do with clean notebook
- What are the steps or things we should do with the clean version of a course notebook?
- Where is the non-clean version?
Same notebook runs on Kaggle and everywhere
- How to check whether the notebook is running on Kaggle or elsewhere?
- How to get the data and its path right accordingly?
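
One way to handle the two questions above is to check an environment variable that Kaggle kernels set. A minimal sketch, assuming the `KAGGLE_KERNEL_RUN_TYPE` variable and a competition folder named `titanic`; `data_path` is a hypothetical helper, not something from the notebook:

```python
import os
from pathlib import Path

def data_path(env=os.environ):
    """Return the Titanic data folder for the current environment."""
    # Assumption: Kaggle kernels set KAGGLE_KERNEL_RUN_TYPE; it is absent elsewhere.
    on_kaggle = bool(env.get("KAGGLE_KERNEL_RUN_TYPE", ""))
    if on_kaggle:
        # On Kaggle, competition data is mounted under /kaggle/input.
        return Path("/kaggle/input/titanic")
    # Elsewhere, assume the data was downloaded into a local "titanic" folder.
    return Path("titanic")

path = data_path()
print(path / "train.csv")
```
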
Libraries and format setup
- How much should we know about these libraries before starting?
- How to make the printed result look nicer in cells?
Read train.csv as Dataframe
- How to read and display a csv file in pandas dataframe format?
Find and count missing data with pandas
- How to check missing data in each cell/row?
- How to sum up missing data in each column?
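
A small sketch of both questions above, using a made-up stand-in for train.csv (the values are illustrative, not real Titanic rows):

```python
import numpy as np
import pandas as pd

# Tiny stand-in for train.csv (made-up values).
df = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Fare": [7.25, 71.28, np.nan, 8.05],
    "Sex": ["male", "female", "female", "male"],
})

missing_mask = df.isna()                 # True wherever a cell is missing
missing_per_column = missing_mask.sum()  # True counts as 1, so this counts NaNs per column
print(missing_per_column)
```
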
Choose mode value for the missing data
- What is the most common choice for replacing missing data, whether the column is categorical or continuous? mode
- How to select the first mode value if there are two modes available for a column?
Be proactively curious
- Why is it impractical for Jeremy to explain every common function of every library used?
- What should you do about it?
Replace missing data with mode values
- How to fill in the missing data with mode values of those columns with or without creating a new dataframe?
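
A minimal sketch of filling with modes, on made-up data. It also shows how a tie between two modes can be resolved by taking the first row of the `mode()` result:

```python
import numpy as np
import pandas as pd

# Made-up data: Age has two modes (24 and 30), Embarked has one ("S").
df = pd.DataFrame({
    "Age": [24.0, np.nan, 24.0, 30.0, 30.0],
    "Embarked": ["S", "C", None, "S", "Q"],
})

# mode() returns a DataFrame because a column can have ties;
# iloc[0] keeps the first (smallest) mode of each column.
modes = df.mode().iloc[0]

filled = df.fillna(modes)        # returns a new DataFrame
df.fillna(modes, inplace=True)   # or modify the existing one in place
```
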
Keep things simple where we can
- Why use the world’s simplest way of filling missing data?
- Does this simplest way work most of the time?
- Do we always know the complicated way would help?
Don’t throw out rows or columns
- Do those filled columns sometimes turn out to matter much for the model?
- How does the fastai library help to find out about it?
Describe your data or columns
- How to get a quick overview/description of your data?
- What do we look for in the descriptions?
See your columns in histogram
- What to do with interesting columns?
- What can you find out with histogram?
- What is a long-tailed distribution? What does it look like?
Log transformation on long-tailed columns
- Which models don’t like long-tailed distributions in the data?
- What is the easiest way to turn a long-tailed distribution into a centered one?
- Where to find more about the log and log curve?
- What does log do in one sentence? 17:11
- How to avoid the problem of log(0)? adding 1
- What does the column data (histogram) look like after the log transformation?
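
A short sketch of the log transformation with the "add 1" trick, on made-up fares (the numbers are illustrative only):

```python
import numpy as np

# Made-up fares with a long right tail; note the 0, where log alone would fail.
fare = np.array([0.0, 7.25, 8.05, 13.0, 26.0, 512.33])

log_fare = np.log(fare + 1)   # same result as np.log1p(fare)
print(log_fare.round(2))
```

After the transformation the largest value sits only a few units from the rest instead of hundreds.
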
Most likely long-tailed data
- What kinds of data are most likely to be long-tailed and need a log transformation?
Describe non-numerical columns
- How to describe seemingly numerical but actually categorical columns?
- How to describe all non-numeric columns together?
- What does this description look like? (How does it differ from that of numeric data?)
Apply coefficients on categorical columns
- How to apply coefficients to categorical columns?
- What does it mean to apply dummy variables to categorical columns?
- What are the two ways of getting dummy variables and what’s Jeremy’s view on them?
- What does the dummy variable transformation of categorical variables look like?
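
One of the ways to get dummy variables is `pd.get_dummies`. A small sketch on made-up rows, treating `Pclass` as a category even though it looks numeric:

```python
import pandas as pd

df = pd.DataFrame({
    "Sex": ["male", "female", "female"],
    "Pclass": [3, 1, 2],          # looks numeric, but is really a category
})

# One 0/1 column per category level; the original columns are replaced.
dummies = pd.get_dummies(df, columns=["Sex", "Pclass"], dtype=float)
print(list(dummies.columns))
```
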
The secret power of name column
- Can a model built only on name column score No.1 in Titanic competition?
- Where to find more about it?
- This technique is not covered in this lecture
Tensor
- Why focus on PyTorch rather than NumPy?
- What data format does PyTorch require? How to do this data format conversion?
- What is a tensor? Where did it come from?
- How to turn all independent columns into a single large tensor?
- What number type does a tensor need? float
- How to check the shape of a tensor? (num of rows and columns)
- How to check the rank/dimensions/axis of a tensor? What is rank?
- What is the rank of a vector, a table/matrix, or a scalar?
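
A minimal sketch of the conversion, shape and rank checks asked about above, using made-up rows rather than the real dataframe:

```python
import torch

# Made-up rows: two passengers, three independent columns.
rows = [[22.0, 1.0, 0.0],
        [38.0, 0.0, 1.0]]

t_indep = torch.tensor(rows, dtype=torch.float)  # floats, since we multiply by float coefficients
print(t_indep.shape)       # number of rows and columns
print(len(t_indep.shape))  # rank: the number of axes

scalar = torch.tensor(3.0)         # rank 0
vector = torch.tensor([1.0, 2.0])  # rank 1
```
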
Create random coefficients
- Why don’t we need a constant here as we did in Excel?
- How many coefficients do we need? How do we figure it out?
- How to create a vector of randomized numbers for the coefficients?
- How to center the coefficient values on zero? Why is this important? (answered later)
Reproducibility of coefficients
- How to create the same set of random numbers for your coefficients each time the cell runs?
- When to and when not to use a random seed to make your results reproducible?
- How can not using a random seed help you understand your model intuitively?
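
A sketch of centered, reproducible random coefficients. The seed value 442 and `n_coeff = 3` are arbitrary choices for illustration:

```python
import torch

torch.manual_seed(442)              # same seed -> same "random" numbers on every run
n_coeff = 3                         # one coefficient per independent column
coeffs = torch.rand(n_coeff) - 0.5  # rand is uniform on [0, 1); subtracting 0.5 centers it on 0
print(coeffs)
```

Commenting out the `manual_seed` line and rerunning the cell a few times shows how much the results vary from run to run.
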
Broadcasting: data * coefficients operation on GPU
- What is broadcasting? Isn’t it just matrix and vector multiplication?
- Where did it come from?
- What are the benefits of using broadcasting?
- simple code vs lots of boilerplate
- coded and optimized in C language for GPU computation
- What’s the rule of broadcasting and where to find more about it?
- a great blog post that helps explain broadcasting
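
A small illustration of broadcasting a coefficient vector across the rows of a data matrix (made-up numbers):

```python
import torch

data = torch.tensor([[1.0, 2.0, 3.0],
                     [4.0, 5.0, 6.0]])    # shape (2, 3)
coeffs = torch.tensor([10.0, 0.0, -1.0])  # shape (3,)

# Shapes are compared from the last axis backwards: (2, 3) vs (3,).
# The vector behaves as if it were copied onto every row, without actually being copied.
product = data * coeffs
print(product)
```

Note that this is element-wise multiplication, so the result keeps the shape of `data`; it is not yet a matrix-vector product.
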
Normalization: make the same range of values for each column
- What would happen when the values of one column are much larger than the values of other columns?
- Why make every column have the same range of values?
- How to achieve the same range for all column values?
- What are the two major ways of doing normalization?
- Does Jeremy favor one over the other?
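
A sketch of one of the two common normalization approaches, dividing each column by its maximum (the other common one is subtracting the mean and dividing by the standard deviation). The data is made up:

```python
import torch

t_indep = torch.tensor([[22.0, 7.25, 1.0],
                        [38.0, 71.28, 0.0],
                        [80.0, 8.05, 1.0]])

vals, indices = t_indep.max(dim=0)  # largest value in each column
t_indep = t_indep / vals            # broadcasting divides every row by the column maxima
print(t_indep)
```
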
Sum up to get predictions
- How to sum up the multiplication of each row with the coefficients, and do it for all rows?
- Is the summed-up number the prediction for each person/row of data?
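
The multiply-then-sum step can be sketched like this, with made-up data and coefficients:

```python
import torch

t_indep = torch.tensor([[0.5, 1.0, 0.0],
                        [1.0, 0.0, 1.0]])
coeffs = torch.tensor([0.2, -0.4, 0.1])

# Multiply each row by the coefficients, then add up along the columns:
# the result is one number (one prediction) per row.
preds = (t_indep * coeffs).sum(axis=1)
print(preds)
```
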
A default choice for loss function
- How to make the model better? Gradient descent
- What is needed to do gradient descent? loss function
- What does a loss function do? measure the performance of coefficients
- What is Jeremy’s default/favorite choice for loss function?
- Why does Jeremy always write the loss function out manually when experimenting?
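
A sketch of writing a loss out by hand. It assumes the loss meant here is mean absolute error (the notes above do not name it), on made-up predictions:

```python
import torch

preds = torch.tensor([0.9, 0.2, 0.4])
t_dep = torch.tensor([1.0, 0.0, 1.0])   # 1 = survived, 0 = did not

# Assumption: mean absolute error, the average distance between prediction and target.
loss = torch.abs(preds - t_dep).mean()
print(loss)
```
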
Make notebook readable/understandable in the future
- When to encapsulate all exploratory steps into a few functions?
- Why keep all these exploratory steps available (Don’t delete them)?
Update coefficients with gradient descent in Pytorch
- How to ask PyTorch to track gradients for the coefficients?
- How to ask PyTorch to update values in the same coefficients tensor (not create a new one)?
- What does loss function do besides giving us a loss value? What does it store?
- What function to run with loss to calculate gradients for coefficients?
- How to access the gradients of the coefficients, and how to interpret them?
- How to decide whether to subtract the gradients from the coefficients or add them?
- How to choose the learning rate?
- How to calculate the updated loss with the new coefficients?
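
The steps above can be sketched as one gradient descent step. The data, starting coefficients and the mean-absolute-error loss are assumptions chosen for illustration; the learning rate 0.1 matches the value mentioned later in these notes:

```python
import torch

t_indep = torch.tensor([[0.5, 1.0, 0.0],
                        [1.0, 0.0, 1.0],
                        [0.2, 1.0, 0.0]])
t_dep = torch.tensor([0.0, 1.0, 0.0])

def calc_preds(coeffs, indeps):
    return (indeps * coeffs).sum(axis=1)

def calc_loss(coeffs, indeps, deps):
    return torch.abs(calc_preds(coeffs, indeps) - deps).mean()

coeffs = torch.tensor([0.3, 0.2, -0.4])
coeffs.requires_grad_()             # trailing underscore: change this tensor in place

loss = calc_loss(coeffs, t_indep, t_dep)
loss.backward()                     # fills coeffs.grad with d(loss)/d(coeffs)

with torch.no_grad():               # the update itself must not be tracked
    coeffs.sub_(coeffs.grad * 0.1)  # step against the gradient to lower the loss
    coeffs.grad.zero_()             # gradients accumulate, so reset them
    new_loss = calc_loss(coeffs, t_indep, t_dep)

print(loss.item(), new_loss.item())
```

A positive gradient means the loss goes up when that coefficient goes up, which is why the gradient is subtracted.
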
Split the dataset
- Why did Jeremy randomly split the Titanic dataset into training and validation sets?
- Why use fastai’s random splitter function?
- How to create the training and validation set with the splitter function?
Encapsulate functions for model training
- How does Jeremy create functions like init_coeffs, update_coeffs, one_epoch, train_model from the exploratory steps above?
- How to use the train_model function to see how well the model works?
Titanic dataset is a good playground
- Why so?
Display coefficients
- How to display the final coefficients?
- How to interpret it?
- Can we make some sense of the values inside?
Accuracy as a metric
- Why not use accuracy as the loss function?
- What can we use accuracy function for?
- What threshold did Jeremy use for survival?
- How to calculate accuracy and put it into a function?
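
A sketch of an accuracy function, assuming a 0.5 threshold for "survived" (the notes above do not state the value). The name `acc` and the data are illustrative:

```python
import torch

def acc(preds, deps, threshold=0.5):
    """Fraction of rows where the thresholded prediction matches the label."""
    return (deps.bool() == (preds > threshold)).float().mean()

preds = torch.tensor([0.9, 0.2, 0.4, 0.7])
deps = torch.tensor([1.0, 0.0, 1.0, 1.0])
print(acc(preds, deps))
```

Because thresholding makes accuracy jump in steps, tiny changes to the coefficients usually do not change it, which is why it works as a metric but not as a loss.
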
Sigmoid function: ease coefficients optimization
- What do you see in the predictions that makes you think of using the sigmoid function?
- Why can the sigmoid function really make optimization easier for the model?
- Why are the plateaus at both ends of the function good for optimization? (they tolerate very large and very small predictions rather than forcing every prediction to get closer to 1 or 0)
- Why is the straight-line middle part of the function plot also what we want? 48:58
- How to plot any function with just one line of code? What library is this? sympy
- How to update the calc_preds function with the sigmoid function easily in Jupyter? 50:52
- Why center the predictions on 0 before applying the sigmoid function? (a reply by Jeremy)
- Do you remember what Jeremy did to center the predictions on 0? (see how the initial coefficients are defined, a cell link on Kaggle)
- Why can allowing predictions to be large or small make optimizing the weights easier? (Jeremy’s reply)
- How do Python and Jupyter make exploratory work so easy?
- How come the learning rate jumped from 0.1 before sigmoid to 2 after using sigmoid? 51:57
- When, or how often (as a rule), should you apply the sigmoid function to your predictions? 52:23
- Does the HF library specify whether it uses sigmoid or not? (probably not, and neither do the others)
- For now, what you need to watch out for in optimization is the input and output, not the middle. Why is that? 53:13
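
A sketch of wrapping the weighted sum in a sigmoid, with made-up inputs picked to show the two plateaus and the midpoint:

```python
import torch

def calc_preds(coeffs, indeps):
    # Same weighted sum as in the linear model, squashed into the range (0, 1).
    return torch.sigmoid((indeps * coeffs).sum(axis=1))

indeps = torch.tensor([[1.0, 0.0],
                       [0.0, 1.0],
                       [0.0, 0.0]])
coeffs = torch.tensor([8.0, -8.0])

# Weighted sums are 8, -8 and 0: close to 1, close to 0, and the midpoint.
preds = calc_preds(coeffs, indeps)
print(preds)
```
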
What if test dataset has extra columns?
- What would the normal consequences be?
- How does fastai deal with it for good?
Submit to Kaggle
- How and why did Jeremy replace a missing value of Fare with 0?
- How to apply the above data cleaning steps to the test set?
- How to prepare the output column expected by Kaggle?
- How to create the submit file expected by Kaggle?

Key steps from linear model to neuralnet

val_indep * coeffs vs val_indep @ coeffs
- What does val_indep * coeffs mean? Is it element-wise? Is it matrix and vector multiplication?
- What do we know about val_indep @ coeffs? Is it matrix-matrix multiplication?
- Is (val_indep * coeffs).sum(axis=1) equal to val_indep @ coeffs?
- What should we know about them to distinguish them properly?
- In val_indep @ coeffs, when coeffs is a matrix, do we need to specify its shape? 59:50
- How to initialize coefficients as a one-column matrix rather than a vector?
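
A small check of the questions above on made-up numbers: element-wise multiply plus row sum gives the same values as the matrix product, and indexing with `None` turns the vector into a one-column matrix:

```python
import torch

val_indep = torch.tensor([[1.0, 2.0, 3.0],
                          [4.0, 5.0, 6.0]])
coeffs = torch.tensor([0.5, -1.0, 2.0])

elementwise = (val_indep * coeffs).sum(axis=1)  # broadcast multiply, then add up each row
matmul = val_indep @ coeffs                     # matrix-vector product does both at once

coeffs_col = coeffs[:, None]                    # shape (3, 1): a one-column matrix
matmul_col = val_indep @ coeffs_col             # shape (2, 1) instead of (2,)
print(elementwise, matmul, matmul_col.shape)
```
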
Transform existing vectors into matrix
- How to turn both trns_dep and vald_dep from existing vectors into one-column matrices that correspond to the coeffs matrix?

Building a neural net

Keep linear model and neuralnet comparable in output
- How to create a hidden layer with multiple hidden activations (a matrix of coefficients rather than a vector of coefficients)?
- Why divide the coefficients of the hidden layer by the number of hidden activations (the matrix of coefficients by its number of columns)?
- Is it to make sure the outputs of the neuralnet and the previous linear model are comparable?
Build the output layer
- How to build the coeffs of the output layer with correct shape which connects with the previous layer?
- How to decide the number of output of this output layer?
Try to get the training started
- Why did Jeremy subtract 0.3 from the coefficients of the output layer?
- What does it mean that subtracting 0.3 can get the training started?
- (I guess Jeremy may have tried -0.5 first and experimented to find this value)
Adding Constant or not
- Why don’t we need a constant for layer 1 (think of the constant of the linear model)?
- Why must we have a constant for layer 2?
- Do the coefficients of layer 1, layer 2 and the constant all need their own gradients initialized?
Building the model
- What is a tuple and how is it used for grouping and separating the three sets of coefficients?
- How to construct the prediction function by sending data through layer 1 and layer 2 and finally adding the constant?
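
A sketch of a one-hidden-layer net built from the pieces described above: a layer 1 matrix divided by the number of hidden activations, a layer 2 matrix shifted by 0.3, and a single constant, returned as a tuple. `n_hidden=20`, the ReLU between the layers, the seed and the random stand-in data are assumptions for illustration:

```python
import torch
import torch.nn.functional as F

def init_coeffs(n_coeff, n_hidden=20):
    layer1 = (torch.rand(n_coeff, n_hidden) - 0.5) / n_hidden  # shape (n_coeff, n_hidden)
    layer2 = torch.rand(n_hidden, 1) - 0.3                      # shape (n_hidden, 1): one output
    const = torch.rand(1)[0]                                    # a single scalar constant
    # Each piece needs gradient tracking switched on separately.
    return layer1.requires_grad_(), layer2.requires_grad_(), const.requires_grad_()

def calc_preds(coeffs, indeps):
    l1, l2, const = coeffs         # unpack the tuple into its three parts
    res = F.relu(indeps @ l1)      # assumption: ReLU as the hidden activation
    res = res @ l2 + const
    return torch.sigmoid(res)

torch.manual_seed(442)
indeps = torch.rand(5, 12)         # made-up data: 5 rows, 12 independent columns
coeffs = init_coeffs(n_coeff=12)
preds = calc_preds(coeffs, indeps)
print(preds.shape)
```
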
A neuralnet done but super fiddly
- How to update all three coefficients in a loop?
- Did you notice that the learning rate changed again? (1.4, last time was 2, earlier was 0.1)
- Why did Jeremy say getting this model to work was super fiddly?

From neuralnet (1 hidden layer) to deep learning with 2 hidden layers

Initialize coefficients of all hidden layers and constants
- How to initialize coefficients of 2 hidden layers and 1 output layer and constants, and get gradients ready for all of them, in one compact function?
- What is the shape of each coefficient matrix?
Building the 2 hidden layer model
- What are activation functions?
- What are the activation functions for 2 hidden layers?
- What is the activation function for the output layer?
- What is the most common mistake when applying an activation function to the final layer?
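
A sketch of a two-hidden-layer version. The hidden sizes (10, 10), the scaling constants, and the choice of ReLU for hidden layers with sigmoid only on the output are assumptions for illustration, not confirmed details of the lesson notebook:

```python
import torch
import torch.nn.functional as F

def init_coeffs(n_coeff, hiddens=(10, 10)):
    sizes = [n_coeff, *hiddens, 1]   # e.g. 12 -> 10 -> 10 -> 1
    n = len(sizes)
    # One weight matrix per pair of neighbouring sizes. The -0.3, /sizes[i+1] and *4
    # are hand-tuned scaling choices (assumed values).
    layers = [(torch.rand(sizes[i], sizes[i + 1]) - 0.3) / sizes[i + 1] * 4
              for i in range(n - 1)]
    consts = [(torch.rand(1)[0] - 0.5) * 0.1 for _ in range(n - 1)]
    for t in layers + consts:
        t.requires_grad_()
    return layers, consts

def calc_preds(coeffs, indeps):
    layers, consts = coeffs
    n = len(layers)
    res = indeps
    for i, layer in enumerate(layers):
        res = res @ layer + consts[i]
        if i != n - 1:
            res = F.relu(res)        # hidden layers only; not applied to the last layer
    return torch.sigmoid(res)        # the output layer gets sigmoid instead

torch.manual_seed(442)
indeps = torch.rand(5, 12)           # made-up data
layers, consts = init_coeffs(n_coeff=12)
preds = calc_preds((layers, consts), indeps)
print([tuple(l.shape) for l in layers], preds.shape)
```
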
Train the model
- Don’t forget to update gradients
- Which numbers does Jeremy still have to fiddle with to get right?
- Did this deep learning model improve on the loss and accuracy?
Dissect and experiment with large functions
- How to experiment on a large function like init_coeffs by breaking it into small pieces and running them?
Tabular datasets: where deep learning does not shine
- How should we think about the fact that neither the neuralnet nor the deep learning model is better?
- What does it mean that a carefully designed algorithm beat all deep learning models in the Titanic competition?
- On what kinds of datasets does deep learning generally perform better?

Framework: DL without super fiddling notebook

Why use framework rather than from scratch
- Why should you use a library framework in real life rather than building everything yourself as above?
- When to do it from scratch?
- What can a framework do for us?
- Can it automate the obvious things like initialization, learning rate, dummy variables, normalization, etc.?
- Can I still make choices on the not-so-obvious things?
Feature engineering with pandas
- What does feature engineering with pandas look like?
- How and where does Jeremy suggest digging into pandas?
Automate the obvious when preparing dataset
- How does the framework make categorifying data, filling missing data, and normalization automatic?
- How to specify the dependent column to be a category?
Build multi-hidden layers with one line of code
- How to specify the shapes of two hidden layers with just two numbers?
- Do you only need to specify accuracy without worrying about loss and activations?
Automate the search of best learning rate
- How does fastai help you find the range where the best learning rate is located?
- How should you pick the learning rate from the range?

Predict and Submit with ease

Automate transformation of test set in one line of code
- How to automatically apply all transformations done to training and validation sets to test set?

Experiment with Ensembling

Ensemble is easy with fastai
- Does the framework save so much fiddling that experimenting with advanced ideas becomes easier?
- What is ensembling?
- Is it to build multiple models and combine their predictions?
The simplest ensemble
- What does a simple ensemble look like?
- How to build, run and predict with 5 models with ease?
- How different are those 5 models? (only initial coefficients are different)
- How to combine their predictions?
- How much improvement does this simple ensemble get us to?
Ways to combine the predictions
- Why use mean rather than mode?
- What are 3 ways of combining the predictions?
- Is one better than the others?
- What’s Jeremy’s suggestion?
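
A sketch of two ways to combine predictions, averaging the probabilities versus taking a majority vote (the mode of the 0/1 predictions). The probabilities are made up and the 0.5 threshold is an assumption:

```python
import numpy as np

# Made-up probabilities from 5 models (rows) for 4 passengers (columns).
all_preds = np.array([
    [0.9, 0.4, 0.2, 0.6],
    [0.8, 0.6, 0.1, 0.4],
    [0.7, 0.3, 0.3, 0.7],
    [0.9, 0.6, 0.2, 0.3],
    [0.6, 0.4, 0.1, 0.8],
])

# Option 1: average the probabilities across models, then threshold.
mean_vote = (all_preds.mean(axis=0) > 0.5).astype(int)

# Option 2: threshold each model first, then take the majority of the 0/1 votes.
binary = (all_preds > 0.5).astype(int)
majority_vote = (binary.sum(axis=0) > len(all_preds) / 2).astype(int)

print(mean_vote, majority_vote)
```

The two options agree on this toy data, but they can differ when a few models are very confident and the rest are mildly on the other side.
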

How Random Forest really works

Is this a good place to also learn pandas and numpy?
Why Random Forest
- What is the history of Random Forest and Jeremy?
- What does Jeremy think of random forest?
- Why is random forest so much easier and better?
- Why is the seemingly simple logistic regression so easy to get wrong?
Pandas categorical function
- How to import all the libraries you need at once?
- How to do fillna with pandas and log with numpy?
- What does the pandas categorical function do for us?
- What does the friendly display look like after the function is applied?
- What’s the actual data transformation under the hood?
- Key points to make: no dummy variables, Pclass no longer needs to be viewed as a category
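
A small sketch of what `pd.Categorical` does, on a made-up column: the display keeps the labels while the stored data is integer codes.

```python
import pandas as pd

df = pd.DataFrame({"Embarked": ["S", "C", "Q", "S"]})
df["Embarked"] = pd.Categorical(df["Embarked"])

print(df["Embarked"].head())       # still displays the friendly labels
codes = df["Embarked"].cat.codes   # under the hood, each label is an integer code
print(list(df["Embarked"].cat.categories), codes.tolist())
```
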
Binary splits: basis of random forest
A binary split on gender
Build a binary split model on gender with sklearn
Build a binary split model on ticket prices with sklearn
Build a score machine for binary splits, whether categorical or continuous
- What is a good split?
- Is it good that within each group their dependent values are similar?
- How to measure the similarity of values within a group? std
- How to compare standard deviations between two groups appropriately? (multiply by size)
- How to calculate the score for evaluating a split from the combined std of the two groups?
Automate the score machine on all columns
- How to find the best binary splits by trying out all possible split points of a column?
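
The scoring idea above (std within each side, weighted by group size, lower is better) can be sketched with NumPy. The function names and toy data are illustrative, and NumPy's `std` is the population standard deviation, which may differ slightly from what pandas computes:

```python
import numpy as np

def _side_score(side, y):
    """Std of the dependent values on one side of a split, weighted by its size."""
    tot = side.sum()
    if tot <= 1:
        return 0
    return y[side].std() * tot

def score(col, y, split):
    """Size-weighted average std of the two groups created by col <= split."""
    lhs = col <= split
    return (_side_score(lhs, y) + _side_score(~lhs, y)) / len(y)

def min_col(col, y):
    """Try every distinct value of the column as a split point; return the best."""
    unq = np.unique(col)
    scores = np.array([score(col, y, o) for o in unq])
    idx = scores.argmin()
    return unq[idx], scores[idx]

# Made-up data: a 0/1 column and a 0/1 dependent variable.
sex = np.array([0, 0, 0, 1, 1, 1])
survived = np.array([1, 1, 1, 0, 0, 1])
split, best = min_col(sex, survived)
print(split, best)
```

Running `min_col` over every column and keeping the column with the lowest score gives a single-split model of the kind the next section uses as a baseline.
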
1R model as the baseline
- What is a random forest?
- What is a 1R model?
- How good was it in the ML world of the 90s?
- Should we always go for complicated models?
- Should we always start with a 1r model as a baseline model for our problem?