These are my notes for Andrej Karpathy's Zero to Hero playlist on YouTube

The spelled-out intro to neural networks and backpropagation: building micrograd

Need to know what a derivative is

  • A derivative measures how sensitively a function's output responds to a slight change in its input; that is the slope (numerical sketch after this list)
  • Backpropagation: for every single value (node) in the computation graph, compute the derivative of the output with respect to that node, by recursive application of the chain rule, backwards through the graph (see the backprop sketch after this list)
  • Need to review some basic calculus
  • chain rule
  • Backpropagation gives us the gradients we need to tune a neural network's weights
  • Review topological sort
  • Review PyTorch
    • very efficient with tensor objects; its autograd does the same backward pass automatically (see the PyTorch sketch after this list)
  • A layer of neurons is a set of neurons evaluated independently
  • Gradients with respect to the inputs are not very useful because the inputs are fixed; we only update the weights and biases
  • For very large datasets we train on mini-batches, each of which is a small random subset of the data (batching sketch after this list)
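
A minimal numerical sketch of the "slope" view of a derivative: nudge the input by a tiny h and measure how much the output responds (the function f below is just an example).

```python
def f(x):
    return 3 * x**2 - 4 * x + 5  # an arbitrary example function

x = 3.0
h = 0.0001
slope = (f(x + h) - f(x)) / h  # rise over run
print(slope)  # ~14.0, matching the analytic derivative 6x - 4 at x = 3
```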
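
A hand-worked backward pass in the micrograd spirit, on the tiny graph d = a*b + c (the variable names here are mine, for illustration only):

```python
a, b, c = 2.0, -3.0, 10.0
e = a * b      # intermediate node
d = e + c      # output node

# Chain rule, applied backwards from the output:
dd_dd = 1.0            # base case: d(d)/d(d) = 1
dd_de = 1.0 * dd_dd    # d = e + c  =>  local derivative dd/de = 1
dd_dc = 1.0 * dd_dd    # dd/dc = 1
dd_da = b * dd_de      # e = a * b  =>  de/da = b, then multiply by dd/de
dd_db = a * dd_de      # de/db = a

print(dd_da, dd_db, dd_dc)  # -3.0  2.0  1.0
```

micrograd automates exactly this: each operation records its local derivative, and backward() visits the nodes in reverse topological order, multiplying local derivatives along the way (this is where the topological sort comes in).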
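
The same computation in PyTorch (a minimal sketch): tensors created with requires_grad=True get their gradients filled in by .backward().

```python
import torch

a = torch.tensor(2.0, requires_grad=True)
b = torch.tensor(-3.0, requires_grad=True)
c = torch.tensor(10.0, requires_grad=True)

d = a * b + c   # the forward pass builds the computation graph
d.backward()    # the backward pass fills in .grad for every leaf tensor

print(a.grad, b.grad, c.grad)  # tensor(-3.) tensor(2.) tensor(1.)
```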
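
A sketch of mini-batching: each training step uses a random subset of the data, so the gradient is a cheap, noisy estimate of the full-data gradient. X and Y below are placeholder tensors standing in for a real dataset.

```python
import torch

X = torch.randn(1000, 10)         # placeholder inputs
Y = torch.randint(0, 2, (1000,))  # placeholder targets

batch_size = 32
ix = torch.randint(0, X.shape[0], (batch_size,))  # random row indices
Xb, Yb = X[ix], Y[ix]             # the mini-batch used for this training step
```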

The spelled-out intro to language modeling: building makemore

  • Bigram language model: only looking at two characters at a time, i.e. predict the next character from the current one (counting sketch at the end of this list)
  • https://pytorch.org/tutorials/beginner/basics/tensorqs_tutorial.html
  • https://pytorch.org/tutorials/beginner/nlp/pytorch_tutorial.html
  • https://cs231n.github.io/python-numpy-tutorial/
  • Goal: maximize likelihood of the data w.r.t. model parameters (statistical modeling)
    • equivalent to maximizing the log likelihood (because log is monotonic)
    • equivalent to minimizing the negative log likelihood
    • equivalent to minimizing the average negative log likelihood
    • log(abc) = log(a) + log(b) + log(c), which is what turns the product of probabilities into a sum (worked sketch at the end of this list)
  • Be cautious with tensor APIs: read the docs carefully, especially around broadcasting
  • a common way to encode integers is one-hot encoding
  • the outputs of the neural net are log-counts (logits); see the forward-pass sketch at the end of this list
  • softmax activation function
  • the loss is the average negative log likelihood
  • how to optimize a neural net
    • start with a random guess
    • now we have a loss function (made up of differentiable functions)
    • minimize the loss by tuning the w's: compute the gradient of the loss with respect to w and nudge w in the opposite direction (training-loop sketch at the end of this list)
    • one-hot encoding really just selects a row of the next Linear layer's weight matrix (demonstrated at the end of the final sketch below)
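
A minimal sketch of the counting version of the bigram model; the word list here is a tiny placeholder for the real names dataset.

```python
import torch

words = ["emma", "olivia", "ava"]          # placeholder data
chars = sorted(set("".join(words)))
stoi = {s: i + 1 for i, s in enumerate(chars)}
stoi["."] = 0                              # special start/end token
itos = {i: s for s, i in stoi.items()}

V = len(stoi)
N = torch.zeros((V, V), dtype=torch.int32)
for w in words:
    chs = ["."] + list(w) + ["."]
    for ch1, ch2 in zip(chs, chs[1:]):
        N[stoi[ch1], stoi[ch2]] += 1       # count how often ch2 follows ch1

P = (N + 1).float()                        # add-one smoothing so no bigram has zero probability
P /= P.sum(1, keepdim=True)                # each row is now a next-character distribution
```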
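
The chain of equivalences above, written out on made-up numbers: the probabilities below stand in for what the model assigns to the observed bigrams (the values are illustrative only).

```python
import torch

probs = torch.tensor([0.3, 0.05, 0.2])   # model probabilities for the observed data

likelihood = probs.prod()                # product of probabilities: tiny, hard to work with
log_likelihood = probs.log().sum()       # log turns the product into a sum
nll = -log_likelihood                    # negative log likelihood: something to minimize
avg_nll = nll / probs.numel()            # average NLL: the loss reported per example

print(likelihood.item(), log_likelihood.item(), nll.item(), avg_nll.item())
```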
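
A sketch of the neural-net version's forward pass: one-hot encode the input characters, multiply by a weight matrix to get logits (log-counts), exponentiate and normalize (softmax) to get probabilities, then take the average negative log likelihood as the loss. The 27-character vocabulary follows the bigram setup; the specific indices are illustrative.

```python
import torch
import torch.nn.functional as F

W = torch.randn((27, 27), requires_grad=True)   # 26 letters plus the '.' token

xs = torch.tensor([0, 5, 13])                   # input character indices (illustrative)
ys = torch.tensor([5, 13, 0])                   # target next-character indices

xenc = F.one_hot(xs, num_classes=27).float()    # one-hot encoding of the integer inputs
logits = xenc @ W                               # log-counts
counts = logits.exp()                           # analogous to the counts matrix N
probs = counts / counts.sum(1, keepdim=True)    # softmax: each row sums to 1
loss = -probs[torch.arange(len(xs)), ys].log().mean()   # average negative log likelihood
print(loss.item())
```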
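
A sketch of the optimization loop on top of that forward pass, plus a check of the "one-hot selects a row" point: multiplying a one-hot vector by W is the same as indexing a row of W. The learning rate and step count here are arbitrary.

```python
import torch
import torch.nn.functional as F

W = torch.randn((27, 27), requires_grad=True)   # start with a random guess
xs = torch.tensor([0, 5, 13])
ys = torch.tensor([5, 13, 0])
xenc = F.one_hot(xs, num_classes=27).float()

for step in range(100):
    # forward pass
    logits = xenc @ W
    probs = logits.exp()
    probs = probs / probs.sum(1, keepdim=True)
    loss = -probs[torch.arange(len(xs)), ys].log().mean()

    # backward pass
    W.grad = None            # reset gradients
    loss.backward()

    # update: nudge the weights against the gradient
    W.data += -10 * W.grad

# one-hot encoding just selects a row of the weight matrix
assert torch.allclose(xenc @ W, W[xs])
```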