Training

Batch Gradient Descent

Uses the full training set to compute the gradients at every Gradient Descent step

Pros

  • Actually converges to the minimum
  • Scales well with a large number of features

Cons

  • Slow on very large training sets
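
A minimal NumPy sketch of batch gradient descent for linear regression; the synthetic data, learning rate `eta`, and `n_epochs` are illustrative assumptions, not values from the book.

```python
import numpy as np

# Illustrative synthetic data: y = 4 + 3x + Gaussian noise
np.random.seed(42)
m = 100
X = 2 * np.random.rand(m, 1)
y = 4 + 3 * X + np.random.randn(m, 1)
X_b = np.c_[np.ones((m, 1)), X]  # add the bias feature x0 = 1

eta = 0.1        # learning rate (assumed)
n_epochs = 1000  # number of full passes over the data (assumed)
theta = np.random.randn(2, 1)

for epoch in range(n_epochs):
    # gradient of the MSE cost, computed over the FULL training set
    gradients = 2 / m * X_b.T @ (X_b @ theta - y)
    theta -= eta * gradients
```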

Stochastic Gradient Descent

Picks a random instance in the training set at every step and computes the gradients based only on that single instance.

Alternatively, shuffle the training set, go through it instance by instance, then shuffle it again, and so on; this approach generally converges more slowly.

Pros

  • Possible to train on huge training sets
  • Has a better chance of finding the global minimum, since its randomness helps it jump out of local minima

Cons

  • Final parameter values are good but not optimal; the cost keeps bouncing around the minimum instead of settling
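
A sketch of the per-instance update described above, using the same assumed synthetic data as the batch sketch and an assumed learning schedule that gradually shrinks the step size.

```python
import numpy as np

# Illustrative synthetic data: y = 4 + 3x + Gaussian noise
np.random.seed(42)
m = 100
X = 2 * np.random.rand(m, 1)
y = 4 + 3 * X + np.random.randn(m, 1)
X_b = np.c_[np.ones((m, 1)), X]  # add the bias feature x0 = 1

n_epochs = 50
t0, t1 = 5, 50  # learning-schedule hyperparameters (assumed)

def learning_schedule(t):
    # shrink the learning rate over time so the parameters can settle
    return t0 / (t + t1)

theta = np.random.randn(2, 1)
for epoch in range(n_epochs):
    for i in range(m):
        # pick one random instance and compute the gradient on it alone
        idx = np.random.randint(m)
        xi, yi = X_b[idx:idx + 1], y[idx:idx + 1]
        gradients = 2 * xi.T @ (xi @ theta - yi)
        theta -= learning_schedule(epoch * m + i) * gradients
```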

Mini-batch Gradient Descent

Computes the gradients on small random sets of instances called mini-batches.

Pros

  • Less erratic than SGD

Cons

  • Harder to escape from local minima than SGD
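
A sketch of mini-batch gradient descent under the same assumed data setup; `batch_size` and `eta` are illustrative choices.

```python
import numpy as np

# Illustrative synthetic data: y = 4 + 3x + Gaussian noise
np.random.seed(42)
m = 100
X = 2 * np.random.rand(m, 1)
y = 4 + 3 * X + np.random.randn(m, 1)
X_b = np.c_[np.ones((m, 1)), X]  # add the bias feature x0 = 1

n_epochs = 50
batch_size = 20  # mini-batch size (assumed)
eta = 0.1        # learning rate (assumed)
theta = np.random.randn(2, 1)

for epoch in range(n_epochs):
    shuffled = np.random.permutation(m)
    for start in range(0, m, batch_size):
        batch = shuffled[start:start + batch_size]
        xb, yb = X_b[batch], y[batch]
        # gradient computed on a small random set of instances
        gradients = 2 / len(batch) * xb.T @ (xb @ theta - yb)
        theta -= eta * gradients
```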

Avoid Overfitting

  • Reduce degrees of freedom
  • Regularization
  • Increase the size of the training set

Regularization

Ridge Regression

Add a regularization term equal to $\alpha \sum_{i=1}^{n}\theta_{i}^{2}$ to the cost function

  • A model with regularization typically performs better than a model without any regularization
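
A minimal Scikit-Learn sketch of Ridge Regression; the data and the regularization strength `alpha` are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import Ridge

# Illustrative data: a noisy linear relationship
np.random.seed(42)
X = 3 * np.random.rand(100, 1)
y = 1 + 0.5 * X[:, 0] + np.random.randn(100) / 1.5

ridge_reg = Ridge(alpha=1.0)  # alpha controls how much the weights are shrunk (assumed value)
ridge_reg.fit(X, y)
print(ridge_reg.intercept_, ridge_reg.coef_)
```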

Lasso Regression

Add the regularization term $\alpha \sum_{i=1}^{n} |\theta_{i}|$ to the cost function

  • Automatically performs feature selection and outputs a sparse model; useful when only a few features actually matter. When you are not sure, prefer Ridge Regression
  • Can behave erratically when the number of features is greater than the number of training instances or when several features are strongly correlated
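
A sketch showing the sparsity effect on assumed data where only two of ten features matter; `alpha` is an illustrative value.

```python
import numpy as np
from sklearn.linear_model import Lasso

# Illustrative data: only the first two of ten features actually matter
np.random.seed(42)
X = np.random.randn(100, 10)
y = 3 * X[:, 0] - 2 * X[:, 1] + 0.1 * np.random.randn(100)

lasso_reg = Lasso(alpha=0.1)  # regularization strength (assumed)
lasso_reg.fit(X, y)
print(lasso_reg.coef_)  # most weights are driven to exactly 0 (sparse model)
```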

Elastic Net

Add the regularization term $r\alpha\sum_{i=1}^{n}|\theta_{i}|+\frac{1-r}{2}\alpha\sum_{i=1}^{n}\theta_{i}^{2}$ to the cost function, where the mix ratio $r$ interpolates between Ridge ($r=0$) and Lasso ($r=1$)

  • Generally preferred over Lasso
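
A sketch under the same assumed data as the Lasso example; Scikit-Learn's `l1_ratio` plays the role of the mix ratio $r$, and both hyperparameter values are illustrative.

```python
import numpy as np
from sklearn.linear_model import ElasticNet

# Illustrative data: only the first two of ten features actually matter
np.random.seed(42)
X = np.random.randn(100, 10)
y = 3 * X[:, 0] - 2 * X[:, 1] + 0.1 * np.random.randn(100)

# l1_ratio corresponds to the mix ratio r; alpha and l1_ratio are assumed values
elastic_net = ElasticNet(alpha=0.1, l1_ratio=0.5)
elastic_net.fit(X, y)
print(elastic_net.coef_)
```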

Early Stopping

Stop training as soon as the validation error reaches a minimum. A simple and efficient regularization technique that Geoffrey Hinton called a "beautiful free lunch."
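
A sketch of early stopping with Scikit-Learn's SGDRegressor; the data, split, and hyperparameters are assumptions, and `warm_start=True` makes each call to `fit()` continue from the previous weights, so one call corresponds to one extra epoch.

```python
import numpy as np
from copy import deepcopy
from sklearn.linear_model import SGDRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Illustrative noisy data, split into training and validation sets
np.random.seed(42)
X = 6 * np.random.rand(200, 1) - 3
y = 0.5 * X[:, 0] ** 2 + X[:, 0] + 2 + np.random.randn(200)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.5, random_state=42)

# hyperparameters are assumed; warm_start=True resumes from the previous weights
sgd_reg = SGDRegressor(max_iter=1, tol=None, warm_start=True,
                       learning_rate="constant", eta0=0.0005, random_state=42)

best_val_error = float("inf")
best_model = None
for epoch in range(500):
    sgd_reg.fit(X_train, y_train)  # one more epoch of training
    val_error = mean_squared_error(y_val, sgd_reg.predict(X_val))
    if val_error < best_val_error:
        # keep a copy of the model with the lowest validation error so far
        best_val_error = val_error
        best_model = deepcopy(sgd_reg)
```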

Reference

Hands-On Machine Learning with Scikit-Learn & TensorFlow