2. Neural Networks
In part two of the course notes we change both our goal and our function class: the goal to sequence learning and the function class to deep neural networks. We start off by training a language model in PyTorch using a feedforward neural network with gradient descent. After that, we build on picograd's Tensor, Op, and Device developed in part one and abstract two tasks for the research scientist:
- differentiation and optimization with Tensor.backward() and optim.sgd/optim.adam, and
- GPU kernel acceleration of BLAS and DNN operations in CUDA C,
which we will then use to train other language models with different inductive biases (invariances) such as RNNs, LSTMs, BERT, GPT, and diffusion models. Finally, we will accelerate nanogpt specifically with FlashAttention kernels. This will set us up for part three of the course notes, where we modify the language implementation of the deep learning framework to support distributed compilation, in order to support the training and inference of nanochat.
Contents
- 2.1 Learning Representations, Learning Sequences
- 2.2 ERM via Automatic Differentiation and Stochastic Gradient Descent
- 2.3 Accelerating BLAS operations on GPU
- 2.3.1 Throughput-Oriented Processor Architecture
- 2.3.2 Throughput-Oriented Memory Hierarchies
- 2.3.3 Accelerating GEMM with Better Data Reuse: Coalescing, Caching, Tiling
- 2.3.4 Accelerating GEMM with Better Scheduling:
- 2.3.5 Accelerating GEMM with Tensor Cores
- 2.3.6 Productively Programming Tensor Cores: ThunderKittens, CUTLASS
- 2.4 Learning Sequences with Different Inductive Biases
- 2.5 Accelerating nanogpt with DNN operations on GPU
2.1 Learning Representations, Learning Sequences
2.1.1 Sentiment Learning with Feedforward Neural Networks
In part 1 of the course notes we trained generalized linear models of the form $f(x) = \sigma(w^\top x + b)$. In part 2 we increase the expressivity of the function class by including non-linearities. The feedforward neural network, simply put, is a series of alternating linear and non-linear layers of the form
$h^{(l)} = \sigma(W^{(l)} h^{(l-1)} + b^{(l)})$ for $l = 1, \dots, L$, with $h^{(0)} = x$ and output $f(x) = h^{(L)}$,
where $W^{(l)}$ and $b^{(l)}$ are the layer's weights and biases, $\sigma$ is an elementwise nonlinearity such as $\tanh$, and $x \in \mathbb{R}^n$. Conceptually, the linear layers perform transformations that rotate, reflect, shear, and scale space, whereas the nonlinear transformations squash and twist space.
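Before reusing this model, a quick numerical sketch (my example, not from the original notes) of why the nonlinearity matters for expressivity: composing two linear layers without a nonlinearity collapses back into a single linear map, so stacking linear layers alone does not grow the function class.

# without a nonlinearity, two linear layers collapse into one: W2(W1 x) = (W2 W1) x
import torch

torch.manual_seed(0)
W1, W2 = torch.randn(4, 3), torch.randn(2, 4)
x = torch.randn(3)
two_linear = W2 @ (W1 @ x)                      # linear . linear
one_linear = (W2 @ W1) @ x                      # a single equivalent linear layer
print(torch.allclose(two_linear, one_linear))   # True
with_nonlinearity = W2 @ torch.tanh(W1 @ x)     # linear . tanh . linear does not collapse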
We will now use this same feedforward neural network to accomplish two other goals: representation learning and sequence learning.
2.1.2 Representation Learning with Feedforward Neural Networks
2.1.3 Language Modeling with Feedforward Neural Networks
A sequence model, simply put, is the conditional probability distribution $p(x_t \mid x_{<t})$ of an output token given the tokens that precede it. A sequence of tokens can be a sentence of words in the domain of language, a series of pixels in the domain of vision, or a stream of waveform samples in the domain of audio.
Since we are modeling language as a stochastic phenomenon, we use the formal language of probability theory, where a probability space $(\Omega, \mathcal{F}, \mathbb{P})$ is a measurable space $(\Omega, \mathcal{F})$ equipped with a measure $\mathbb{P}: \mathcal{F} \to [0,1]$. In the domain of language, the measurable space consists of a sample space $\Omega$, the set of all tokens modelling a vocabulary, and an event space $\mathcal{F}$, the set of all token combinations, which models a language. The measure $\mathbb{P}$ weighs a particular token combination (a sentence, really) as an event with respect to the set of all possible token combinations (sentences) that make up the event space. Once we use a random variable $X: \Omega \to \mathbb{R}$ to map outcomes to real numbers, we can forget about the probability space and focus our attention on language models, which are joint probability distributions over all sequences of tokens.
Language modeling with n-grams.
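As a warm-up before the neural version, a minimal count-based sketch (my example, assuming the same names.txt corpus used in the dataset code below): an n-gram model estimates $p(x_t \mid x_{t-n+1}, \dots, x_{t-1})$ with normalized counts. A bigram (n=2) character-level version:

# bigram language model sketch: p(next char | prev char) from normalized counts
import torch

words = open('./data/names.txt', 'r').read().splitlines()
chars = ['.'] + sorted(set(''.join(words)))     # '.' marks the start/end of a word
stoi = {c: i for i, c in enumerate(chars)}

N = torch.zeros((len(chars), len(chars)))       # N[i, j]: count of char j following char i
for w in words:
    toks = ['.'] + list(w) + ['.']
    for c1, c2 in zip(toks, toks[1:]):
        N[stoi[c1], stoi[c2]] += 1
P = N / N.sum(1, keepdim=True)                  # rows are conditional distributions p(. | prev)

g = torch.Generator().manual_seed(42)           # sample a new name one character at a time
ix, out = 0, []
while True:
    ix = torch.multinomial(P[ix], num_samples=1, generator=g).item()
    if ix == 0: break
    out.append(chars[ix])
print(''.join(out))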
The model below is the neural language model of Bengio et al. (2003): each of the previous t tokens is mapped to a learned e-dimensional embedding, the t embeddings are concatenated into a single vector, and that vector is passed through two tanh hidden layers and a final linear layer that produces logits over the vocabulary. Training the embeddings jointly with the MLP lets the model share statistical strength between similar tokens, which a count-based n-gram cannot do.
# FFN MODEL f: V^t -> R^v
import torch
import torch.nn as nn
import torch.nn.functional as F

class MLP(nn.Module):
    """
    model: Neural Language Models (Bengio et al. 2003)
    key:
    b: batch size, t: sequence length (context size)
    v: vocabulary size, e: dimension of embedding, d: dimension of model
    """
    def __init__(self, cfg):
        super().__init__()
        self.t, self.v, self.e = cfg.t, cfg.v, cfg.e
        self.wte = nn.Embedding(cfg.v, cfg.e)            # token embeddings table
        self.l1 = nn.Linear(cfg.t * cfg.e, cfg.d, bias=False)
        self.l2 = nn.Linear(cfg.d, cfg.d, bias=False)
        self.l3 = nn.Linear(cfg.d, cfg.v, bias=False)

    def forward(self, i, targets=None):
        # i: (b, t) indices of the previous t tokens
        tok_emb = self.wte(i)                            # token embeddings of shape (b, t, e)
        x = tok_emb.view(i.shape[0], self.t * self.e)    # concat the t embeddings together: (b, t*e)
        x = self.l1(x).tanh()
        x = self.l2(x).tanh()
        yhat = self.l3(x)                                # logits over the vocabulary: (b, v)
        # if we are given some desired targets also calculate the loss
        loss = None
        if targets is not None:
            loss = F.cross_entropy(yhat, targets, ignore_index=-1)
        return yhat, loss
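A quick shape check of the model above (the SimpleNamespace config here is just for illustration; the training loop below sets the same hyperparameters):

# sanity-check the shapes of the MLP defined above
from types import SimpleNamespace

cfg = SimpleNamespace(b=32, t=3, v=27, e=10, d=200)
model = MLP(cfg)
i = torch.randint(0, cfg.v, (cfg.b, cfg.t))              # a batch of t-token contexts
yhat, loss = model(i, targets=torch.randint(0, cfg.v, (cfg.b,)))
print(yhat.shape, loss.item())                           # torch.Size([32, 27]), loss near ln(27) ~ 3.3 at init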
The dataset is a list of names (a names.txt file, one name per line). Each name is unrolled into (context, target) pairs: the context is the previous t characters encoded as integer indices, the target is the next character, and '.' (index 0) serves as the start/end padding token. The names are shuffled and split 80/10/10 into train/dev/test.
# FFN DATA d={(x^i,y^i)}
import torch
import torch.nn.functional as F
import matplotlib.pyplot as plt
def build_dataset(t):
    import random
    words = open('./data/names.txt', 'r').read().splitlines()
    chars = sorted(list(set(''.join(words))))
    encode = { c:i+1 for i,c in enumerate(chars) }
    encode['.'] = 0
    decode = { i:c for c,i in encode.items() }

    def gen_dataset(words, t):
        X, Y = [], []
        for w in words:
            context = [0] * t
            for c in w + '.':
                X.append(context)
                Y.append(encode[c])
                # print(''.join(decode[i] for i in context), '-->', decode[encode[c]])
                context = context[1:] + [encode[c]]
        X, Y = torch.tensor(X), torch.tensor(Y) # X:(N,t) Y:(N,)
        return X, Y

    random.seed(42)
    random.shuffle(words)
    n1, n2 = int(0.8*len(words)), int(0.9*len(words))
    Xtraining, Ytraining = gen_dataset(words[:n1], t)
    Xdev, Ydev = gen_dataset(words[n1:n2], t)
    Xte, Yte = gen_dataset(words[n2:], t)
    return Xtraining, Ytraining
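A quick look at what build_dataset produces (shapes and the first few pairs; the commented-out print inside gen_dataset shows the same pairs in decoded form):

# inspect the dataset: contexts of t character indices and next-character targets
X, Y = build_dataset(t=3)
print(X.shape, Y.shape)              # (N, 3) contexts, (N,) targets
print(X[:3], Y[:3])                  # the first name unrolled from the all-'.' context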
The training loop is vanilla minibatch stochastic gradient descent: sample a minibatch of (context, target) pairs, run the forward pass to get the loss, zero the gradients, run the backward pass to materialize fresh gradients, and nudge every parameter against its gradient with a small learning rate. We log the loss every 10,000 steps and plot its trajectory at the end.
# FFN TRAINING LOOP: theta^(t+1) := theta^t - alpha*grad(L)
if __name__ == "__main__":
    from types import SimpleNamespace
    cfg = SimpleNamespace(b=32, t=3, v=27, e=10, d=200)  # init hyperparameters
    X, Y = build_dataset(cfg.t)                          # init data
    model = MLP(cfg)                                     # init model (token embedding table included as model.wte)
    params = list(model.parameters())

    N, losses, steps = X.shape[0], [], []                # train
    for step in range(200000):
        i_b = torch.randint(0, N, (cfg.b,))              # 0. sample a minibatch
        X_b, Y_b = X[i_b], Y[i_b]
        yhat, loss = model(X_b, Y_b)                     # 1. forward
        for p in params: p.grad = None
        loss.backward()                                  # 2. backward
        for p in params: p.data += -0.01 * p.grad        # 3. update (optimizer.step() once we have optim.sgd in 2.2.2)
        steps.append(step)
        losses.append(loss.log10().item())
        if step % 10000 == 0: print(f"step: {step}/200000, loss {loss.item()}")
    plt.plot(steps, losses)
    plt.show()
In the next two sections, 2.2 and 2.3, we will implement in picograd
the two primary tasks that PyTorch abstracts away from research scientists:
the backward pass via automatic differentiation, and the device acceleration of the forward pass.
2.2 ERM via Automatic Differentiation and Stochastic Gradient Descent
So far we've been training feedforward neural networks using the magical
Tensor.backward() function in order to materialize the gradient of the loss with respect to our
parameter weights, which gradient descent uses in its update rule
$\theta^{(t+1)} := \theta^{(t)} - \alpha \nabla_\theta \mathcal{L}(\theta^{(t)})$.
We will now dive deeper into how .backward() is implemented.
2.2.1 Automatic Differentiation via Tensor.backward()
Consider the function $f(x_1, x_2) = e^{x_1} \sin^2(x_2)$ where $x_1, x_2 \in \mathbb{R}$,
and translate it to its computational counterpart in Python with one-dimensional Tensors:
import picograd as pg
def f(x1: pg.Tensor, x2: pg.Tensor) -> pg.Tensor:
    a = pg.exp(x1)
    b = pg.sin(x2)
    c = b**2
    d = a*c
    return d
Figure 1. Python source for the function $f(x_1, x_2) = e^{x_1} \sin^2(x_2)$.
Here we've broken up the function to render the subexpressions more clearly,
but this isn't necessary: automatic differentiation works even if the function is expressed in one line.
In part one, the development of picograd followed that of numpy,
an array programming language similar to MATLAB but embedded in the host language of Python,
that could evaluate functions of the form $f: \mathbb{R}^n \to \mathbb{R}^m$,
where Tensor objects stored their values in the value: field
and the type of function that produced those values in the Op field.
For instance, evaluating the specified function f from above with 9 and 10

if __name__ == "__main__":
    print(f(pg.Tensor(9.0), pg.Tensor(10.0)))

populates the Tensor.value fields.
In part one of the book we verified this with a REPL-interface,
but we can also represent the entire expression being evaluated with a
graph of vertices and edges where the vertices are Tensors
(along with their Ops and values) and the edges are their data dependencies:
Here you can see that even if the function was specified in one line,
the graph of the expression always parses into Tensor vertices, and data dependency edges.
You may have noticed the Tensor.grad fields, which are meant to store the values of the derivatives $\frac{\partial d}{\partial x_1}$ and $\frac{\partial d}{\partial x_2}$ (and those of the intermediate Tensors).
The question that remains is how to populate these fields.
Taking a step back to differential calculus, deriving the derivative of $f(x_1, x_2) = e^{x_1}\sin^2(x_2)$ involves the application of the chain rule $\frac{\partial f}{\partial x} = \frac{\partial f}{\partial u}\frac{\partial u}{\partial x}$. Evaluating the derivative of the function with respect to its inputs $x_1$ and $x_2$ results in
$\frac{\partial d}{\partial x_1} = e^{x_1}\sin^2(x_2)$ and $\frac{\partial d}{\partial x_2} = 2\,e^{x_1}\sin(x_2)\cos(x_2)$.
Symbolic and numeric differentiation. Symbolic differentiation has performance issues, since a large unrolled expression must be constructed in order to differentiate[1], whereas numerical differentiation has correctness issues, since evaluating finite differences requires subtracting nearly equal function values at finite precision, resulting in numerical instability.
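To see the numerical issue concretely, a minimal sketch (my example, not from the notes): the forward difference $\frac{f(x+h) - f(x)}{h}$ has truncation error that shrinks with $h$ but rounding error that grows as $h \to 0$, so the total error bottoms out and then climbs back up.

# forward-difference error for f(x)=sin(x) at x=1.0: truncation error vs rounding error
import math

f, x = math.sin, 1.0
exact = math.cos(x)                                  # the true derivative
for k in range(1, 13):
    h = 10.0**-k
    approx = (f(x + h) - f(x)) / h
    print(f"h=1e-{k:02d}  error={abs(approx - exact):.2e}")   # error shrinks until ~1e-8, then grows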
To populate the Tensor.grad fields, the simplest idea would be
to literally translate the manual derivation of the derivative into code.
The translation from math to code involves a design decision:
should we evaluate from outputs to inputs (symbolically outside-in, graphically right-to-left) or from inputs to outputs (symbolically inside-out, graphically left-to-right)?
Although the former order seems more natural with symbolic expressions,
there's nothing illegal about the latter.
import picograd as pg

# forward evaluation of f at (x1, x2), keeping the subexpressions as locals
x1, x2 = pg.Tensor(9.0), pg.Tensor(10.0)
a = pg.exp(x1)
b = pg.sin(x2)
c = b**2
d = a*c

# dict[f(x), f'(x)] of local derivatives (adjoints)
dd_da, dd_dc = c, a        # d(a,c):=a*c    ==> d'(a)=c, d'(c)=a
da_dx1 = pg.exp(x1)        # a(x1):=exp(x1) ==> a'(x1)=exp(x1)
dc_db = 2*b                # c(b):=b^2      ==> c'(b)=2b
db_dx2 = pg.cos(x2)        # b(x2):=sin(x2) ==> b'(x2)=cos(x2)

# outputs to inputs: outside-in symbolically, right-to-left graphically
dd_dd = pg.Tensor(1)                            # base case
dd_da, dd_dc = dd_dd*dd_da, dd_dd*dd_dc
dd_dx1 = dd_da*da_dx1                           # DONE for the x1->d path
dd_db = dd_dc*dc_db
dd_dx2 = dd_db*db_dx2                           # DONE for the x2->d path

# inputs to outputs: inside-out symbolically, left-to-right graphically
dx1_dx1, dx2_dx2 = pg.Tensor(1), pg.Tensor(1)   # base case
da_dx1 = da_dx1*dx1_dx1
dd_dx1 = dd_da*da_dx1                           # DONE for the x1->d path
db_dx2 = db_dx2*dx2_dx2
dc_dx2 = dc_db*db_dx2
dd_dx2 = dd_dc*dc_dx2                           # DONE for the x2->d path
Do you notice any difference in the number of evaluations between the two orderings?
The outputs-to-inputs ordering takes 6 arithmetic operations (including the destructuring),
whereas the inputs-to-outputs ordering takes 7 arithmetic operations.
This is because the former can reuse dd_dd (and the adjoints derived from it)
as a dynamic programming solution to a subproblem shared by the two inputs, whereas the latter cannot.
Taking a step back, we only get to reuse work on the output side because the shape of the function
is $f: \mathbb{R}^n \to \mathbb{R}$, with many inputs and one output. Alternatively, if $f$ had type $\mathbb{R} \to \mathbb{R}^m$,
then the inputs-to-outputs ordering would be the one able to reuse results.
This distinction is referred to as "forward-mode" vs "reverse-mode" differentiation, and reflects
the fact that for some function $f: \mathbb{R}^n \to \mathbb{R}^m$
the time complexity of forward-mode differentiation is proportional to $n$, the number of inputs,
whereas that of reverse-mode differentiation is proportional to $m$, the number of outputs.
If the expression graph fans in so that $n \gg m$, reverse-mode is preferred.
If the expression graph fans out so that $m \gg n$, forward-mode is preferred.
However, if we take a step back and look through a graph-theory lens, we can see that the derivative is a sum over paths,
where each path is a product of local derivatives from the input source to the output sink.
From a combinatorics perspective, we are calculating all the possible ways in which the inputs perturb the output (the ors are the paths, the ands are the local derivatives multiplied along each path). That is,
$\frac{\partial y}{\partial x} = \sum_{\text{paths } p:\, x \to y} \ \prod_{(u \to v) \in p} \frac{\partial v}{\partial u}$,
and as long as the products along these paths are associative, $(ab)c = a(bc)$, we can choose the order in which we perform these path products to minimize the number of operations. Finding the optimal ordering is NP-hard in general. For instance, if the expression graph is diamond-shaped, evaluating the derivative with forward-mode for one half and reverse-mode for the other half can be more performant than either mode alone. In practice, we use reverse-mode as a heuristic, since most of the functions that are differentiated (so they can be optimized) in the field of machine learning are neural networks of the form $f: \mathbb{R}^n \to \mathbb{R}$, mapping many parameters to a scalar loss.
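To make the path-sum concrete on the running example (a worked restatement, not new material): in $f(x_1, x_2) = e^{x_1}\sin^2(x_2)$ each input has exactly one path to the output, so each sum collapses to a single product of local derivatives:

$\frac{\partial d}{\partial x_1} = \frac{\partial d}{\partial a}\frac{\partial a}{\partial x_1} = c \cdot e^{x_1}$   (path $x_1 \to a \to d$)

$\frac{\partial d}{\partial x_2} = \frac{\partial d}{\partial c}\frac{\partial c}{\partial b}\frac{\partial b}{\partial x_2} = a \cdot 2b \cdot \cos(x_2)$   (path $x_2 \to b \to c \to d$)

Fan-out, where an intermediate value feeds multiple consumers, is what introduces multiple paths whose products must be summed.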
How can we generalize this into an algorithm?
All we need are 1. a mapping from each op to its local derivative rule (the chain rules), and 2. a topological sort of the expression graph.
For the derivative rules, the same way that optimizing compilers implement an optimization "manually" once so that it gets reused many times, the authors of deep learning frameworks implement derivatives manually once, and these then get reused many times through automatic differentiation. In theory, we could differentiate any expression with only a few derivative rules for addition and multiplication, but in practice most frameworks provide sugar for more complex derivatives.
For the topological sort, we can simply reverse the ordering produced by a depth-first search:
def toposort(self):
    order: list[Op] = []
    visited: set[Op] = set()
    def dfs(node: Op) -> None:
        if node in visited: return
        visited.add(node)
        for src in node.src: dfs(src)
        order.append(node)
    dfs(self)
    return order

class Tensor():
    def backward(self):
        topo = self.uop.toposort()
        for t in reversed(topo):   # walk from outputs back to inputs
            ...                    # apply that op's chain rule (fleshed out below)
We will now use this idea to modify the interpretation of our deep learning framework to not only evaluate $f(x)$, but $\frac{\partial f}{\partial x}$ as well. This is done by dynamically overloading the operators at runtime[1] to trace the expression graph:
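As a minimal sketch of what "overloading operators to trace the graph" means (a toy stand-in, not picograd's real Tensor/Op API): each overloaded operator returns a new node that records its opcode and its source nodes, so evaluating the function builds the expression graph as a side effect. Picograd's actual chain-rule table follows below.

# toy tracer: overloaded operators record (opcode, sources) while they evaluate
class Node:
    def __init__(self, value, opcode="CONST", src=()):
        self.value, self.opcode, self.src = value, opcode, src
    def __add__(self, other):
        return Node(self.value + other.value, "ADD", (self, other))
    def __mul__(self, other):
        return Node(self.value * other.value, "MUL", (self, other))

x1, x2 = Node(3.0), Node(4.0)
d = x1 * x2 + x1                                        # evaluates to 15.0 AND records the graph
print(d.value, d.opcode, [s.opcode for s in d.src])     # 15.0 ADD ['MUL', 'CONST']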
chain_rules = PatternMatcher([
    (Pattern(OpCode.MATMUL, name="input"), lambda output_grad, input: (_____,)),
    (Pattern(OpCode.MATVEC, name="input"), lambda output_grad, input: (_____,)),
    (Pattern(OpCode.RECIPROCAL, name="input"), lambda output_grad, input: (-output_grad * input * input,)),
    (Pattern(OpCode.SIN, name="input"), lambda output_grad, input: ((math.pi/2 - input.src[0]).sin() * output_grad,)),
    (Pattern(OpCode.LOG2, name="input"), lambda output_grad, input: (output_grad / (input.src[0] * math.log(2)),)),
    (Pattern(OpCode.EXP2, name="input"), lambda output_grad, input: (input * output_grad * math.log(2),)),
    (Pattern(OpCode.SQRT, name="input"), lambda output_grad, input: (output_grad / (input*2),)),
    (Pattern(OpCode.ADD), lambda output_grad: (1.0*output_grad, 1.0*output_grad)),
    (Pattern(OpCode.MUL, name="input"), lambda output_grad, input: (input.src[1]*output_grad, input.src[0]*output_grad)),
])
class Tensor:
    def _forward(self, f:Callable, *other:Tensor) -> Tensor: # extra_args=(), **kwargs
        ...                                                  # build out_uop by applying f's OpCode to (self, *other), then:
        out_tensor = evaluator.eval_uop([self, *other], out_uop)
        return out_tensor

    def backward(self, grad:Tensor|None=None) -> Tensor:
        """
        backward collects the tensors that require gradients, computes their partials with automatic differentiation,
        and updates their .grad fields.
        """
        # 1. collect all tensors that require grad by topologically sorting the graph of uops and filtering
        #    (all_tensors is the module-level registry of weakrefs to live Tensors)
        all_uops = self.uop.toposort()
        tensors_require_grad: list[Tensor] = [t for tref in all_tensors if (t:=tref()) is not None and t.uop in all_uops and t.requires_grad]
        uops_require_grad = [t.uop for t in tensors_require_grad]
        assert grad is not None or self.shape == tuple(), "when no gradient is provided, backward must be called on a scalar tensor"
        if not (self.is_floating_point() and all(t.is_floating_point() for t in tensors_require_grad)): raise RuntimeError("only float Tensors have gradient")

        # 2. compute the gradient as a map from uops to their partials
        if grad is None: grad = Tensor(1.0, dtype=self.dtype, device=self.device, requires_grad=False) # base case is 1.0
        tens2grads = Tensor._automatically_differentiate(self.uop, grad.uop, set(uops_require_grad)) # skipping materializing zeroed grads for now
        grads = [Tensor(tens2grads[uop], device=t.device) for t,uop in zip(tensors_require_grad, uops_require_grad)] # initialize tensor grads on device

        # 3. update the tensors that require grad with the gradient's partials
        for t,g in zip(tensors_require_grad, grads):
            assert g.shape == t.shape, f"grad shape must match tensor shape, {g.shape!r} != {t.shape!r}"
            t.grad = g if t.grad is None else (t.grad + g) # accumulate across backward() calls if t.grad already exists
        return self

    @staticmethod
    def _automatically_differentiate(root:Op, root_grad:Op, targets:set[Op]) -> dict[Op, Op]:
        """
        _automatically_differentiate backpropagates partials over a topologically sorted expression graph with the chain rule
        and produces the gradient in the form of a map of ops to their partials (which, in turn, are ops)
        """
        tens2grads = {root: root_grad}
        # 1. topological sort (tracking which uops lie on a path to the targets; DETACH/ASSIGN and off-path nodes could be filtered here)
        in_target_path: dict[Op, bool] = {}
        for u in root.toposort(): in_target_path[u] = any(x in targets or in_target_path[x] for x in u.src)
        dfs = list(root.toposort())
        # 2. backpropagation with the chain rule
        for tensor in reversed(dfs):
            if tensor not in tens2grads: continue
            local_grads: tuple[Op|None, ...]|None = cast(tuple[Op, ...]|None, chain_rules.rewrite(tensor, ctx=tens2grads[tensor]))
            if local_grads is None: raise RuntimeError(f"failed to compute gradient for {tensor.op}\n\nin {str(tensor)[0:1000]}...")
            assert len(local_grads) == len(tensor.src), f"got {len(local_grads)} gradients, expected {len(tensor.src)}"
            # accumulate here because a uop can fan out to multiple consumers within one graph
            # (the accumulation in backward() above handles repeated backward() calls instead)
            for src, local_grad in zip(tensor.src, local_grads):
                if local_grad is None: continue
                if src in tens2grads: tens2grads[src] = tens2grads[src] + local_grad # accumulate if a partial already exists
                else: tens2grads[src] = local_grad # otherwise initialize
        return tens2grads
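Putting it together, the intended usage on the running example looks like this (a sketch against the picograd API as used in this section; the exact constructor arguments are an assumption):

# intended usage of Tensor.backward() on f(x1, x2) = exp(x1) * sin(x2)^2
import picograd as pg

x1 = pg.Tensor(9.0, requires_grad=True)
x2 = pg.Tensor(10.0, requires_grad=True)
d = pg.exp(x1) * pg.sin(x2)**2
d.backward()                      # traces the graph, toposorts it, and backpropagates the chain rules
print(x1.grad, x2.grad)           # exp(x1)*sin(x2)^2 and 2*exp(x1)*sin(x2)*cos(x2)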
To implement automatic differentiation with Tensor.backward(),
there is a design decision to be made:
implement it dynamically, by tracing operator overloads at runtime as above, or just-in-time[2],
similar to the decision of when to check types for general programming languages[3].
The dynamic approach stands in contrast to the alternative of performing a just-in-time, source-to-source transformation of the function being differentiated.
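For intuition, the same function differentiated in two real frameworks that sit on either side of this decision (a small sketch; both libraries are of course richer than this): torch traces dynamically through overloaded tensor operators, while jax transforms a traced, staged-out function.

# dynamic: torch records the graph as the overloaded operators execute
import torch
x1 = torch.tensor(9.0, requires_grad=True)
x2 = torch.tensor(10.0, requires_grad=True)
d = torch.exp(x1) * torch.sin(x2)**2
d.backward()
print(x1.grad, x2.grad)

# transformation: jax.grad returns a new function computed from a trace of the original
import jax, jax.numpy as jnp
grad_f = jax.grad(lambda a, b: jnp.exp(a) * jnp.sin(b)**2, argnums=(0, 1))
print(grad_f(9.0, 10.0))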
Let's now move on to automatically differentiating the functions of neural networks, specifically the FFN language model from earlier.
2.2.2 Stochastic Gradient Descent via optim.sgd
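This section is still a stub; as a placeholder, a minimal sketch of what an optim.sgd-style optimizer could look like (the class name and the step()/zero_grad() interface are assumptions mirroring the PyTorch convention):

# minimal SGD sketch: theta^(t+1) := theta^t - alpha * grad(L)
class SGD:
    def __init__(self, params, lr=0.01):
        self.params, self.lr = list(params), lr
    def zero_grad(self):
        for p in self.params: p.grad = None
    def step(self):
        for p in self.params:
            if p.grad is not None:
                p.data += -self.lr * p.grad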
2.2.3 Adam via optim.adam
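Likewise, a minimal sketch of an optim.adam-style optimizer (Kingma & Ba, 2015), again with an assumed interface; the defaults below are the commonly used hyperparameters:

# minimal Adam sketch: per-parameter first/second moment estimates with bias correction
class Adam:
    def __init__(self, params, lr=1e-3, betas=(0.9, 0.999), eps=1e-8):
        self.params, self.lr, self.eps = list(params), lr, eps
        self.b1, self.b2, self.t = betas[0], betas[1], 0
        self.m = [0.0 * p.data for p in self.params]    # first moment (running mean of grads)
        self.v = [0.0 * p.data for p in self.params]    # second moment (running mean of squared grads)
    def zero_grad(self):
        for p in self.params: p.grad = None
    def step(self):
        self.t += 1
        for i, p in enumerate(self.params):
            if p.grad is None: continue
            self.m[i] = self.b1 * self.m[i] + (1 - self.b1) * p.grad
            self.v[i] = self.b2 * self.v[i] + (1 - self.b2) * p.grad * p.grad
            m_hat = self.m[i] / (1 - self.b1**self.t)   # bias correction for zero-initialized moments
            v_hat = self.v[i] / (1 - self.b2**self.t)
            p.data += -self.lr * m_hat / (v_hat**0.5 + self.eps)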
2.3 High Dimensional Linear Transformations on GPU
2.3.1 Throughput-Oriented Processor Architecture
Figure x: https://modal.com/gpu-glossary
The latency-oriented vs throughput-oriented distinction above is really a continuum: the CPUs we saw in the last part also invest some chip area in multiple cores, just not many of them.
Four optimizations for throughput (each asks: are you willing to change the PDP-11-style, single-threaded SISD programming model?):
- Simplify the hardware by removing caches, branch predictors, and memory prefetchers, and spend the saved area on more cores (MIMD): multicore parallelism rather than instruction-level parallelism.
- Amortize control overhead (fetch/decode) with SIMD: a single control unit drives many ALUs, and scalar registers become vector registers.
  - SIMD is explicit in CPU programming models (AVX/NEON)
  - SIMD is implicit in GPU programming models
  The fundamental tradeoff: with very wide SIMD, divergence in the worst case degrades throughput to 1/n of peak (lower MFU), so at some point you scale cores (control and data) rather than ALUs (data only). Common SIMD widths are 8-64 lanes.
- Hide latency by multiplexing many concurrent threads on a single physical core, rather than by speculation. This requires storage for per-thread context, so the fundamental tradeoff is per-thread state vs. latency hidden (whether this holds for DNN workloads is an open question; see https://www2.eecs.berkeley.edu/Pubs/TechRpts/2016/EECS-2016-143.pdf).
- Amortize instruction overhead with more complex instructions (AES, video encode/decode, 8x8 GEMM):
  - amortizes instruction control overhead (like SIMD does): more ALU work per instruction
  - amortizes the overhead of reads/writes to the register file by feeding operands from one ALU to another without a big general-purpose register file ("an ASIC inside an instruction")
Figure x: https://modal.com/gpu-glossary
One GPU has 144 SMs (H100), and one SM is a cluster of four quadrants, each with its own warp scheduler. Where a CPU runs a thread over vector lanes, a GPU runs a warp of 32 CUDA threads. NVIDIA marketing calls the individual lanes "CUDA cores" and the matrix units "Tensor Cores", which is a misnomer: they don't have separate control, so they are execution units rather than cores.
144 SMs x 4 quadrants x 32 execution units per quadrant = 18,432 execution units.
Each quadrant's warp scheduler can issue 1 warp instruction of 32 threads per clock: no out-of-order execution, no superscalar issue, a maximum ILP of 1 in a given cycle, but that single instruction runs one operation across 32 vector lanes (ALUs) or one complex block operation such as an 8x8 matrix multiply-accumulate.
Figure x: https://modal.com/gpu-glossary
2.3.2 Throughput-Oriented Memory Hierarchies
Figure x: Hazy Research
The higher-order bit from part 1 still applies: the fundamental tradeoff between memories that are fast and close but small, and memories that are large but slow and far away.
The tradeoff is between capacity (amount), latency (speed), and throughput (bandwidth), and the underlying constraint is physical (and energetic): larger memories have longer wires on average, and longer wires mean longer latencies and lower bandwidth. The design response is a hierarchy that compensates with progressively larger, slower, farther levels.
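To make the tradeoff concrete, a back-of-the-envelope sketch (my numbers, and rough ones: ~1e15 FLOP/s of tensor-core throughput and ~3e12 B/s of HBM bandwidth are assumed as ballpark figures for an H100-class GPU): a GEMM only escapes the memory wall if each byte fetched from HBM is reused enough times, which is exactly what the coalescing/caching/tiling of 2.3.3 buys.

# arithmetic intensity of an n x n x n GEMM in fp16, assuming perfect reuse of each operand
def gemm_intensity(n, bytes_per_elem=2):
    flops = 2 * n**3                                  # one multiply + one add per inner-product term
    bytes_moved = 3 * n * n * bytes_per_elem          # read A, read B, write C once each
    return flops / bytes_moved

peak_flops, peak_bw = 1e15, 3e12                      # rough H100-class numbers (assumption)
ridge = peak_flops / peak_bw                          # FLOP/byte needed to be compute-bound
for n in [128, 512, 2048, 8192]:
    ai = gemm_intensity(n)
    print(n, round(ai), ">= ridge" if ai >= ridge else "< ridge", round(ridge))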
Programming a Naive GEMM
From CUDA C to PTX with Nsight Compute
2.3.3 Accelerating GEMM with Better Data Reuse: Coalescing, Caching, Tiling
2.3.4 Accelerating GEMM with Better Scheduling:
2.3.5 Accelerating GEMM with Tensor Cores:
2.3.6 Productively Programming Tensor Cores: ThunderKittens, CUTLASS
2.4 Learning Sequences with Different Inductive Biases
2.4.1 Convolutional Neural Network (CNN)
2.4.2 Recurrent Neural Networks (RNN)
2.4.3 Long Short-Term Memory Networks (LSTM)
2.4.4 Bidirectional Encoder Representations from Transformers (BERT)
2.4.5 Generative Pretrained Transformers (GPT)
2.4.6 Diffusion
2.5 Accelerating nanogpt
In fact, symbolic differentiation also has the limitation that it cannot directly differentiate numerical functions embedded in a Turing-complete programming language with control flow, i.e., loops, branches, and recursion.
The slogan of automatic differentiation is the accuracy of symbolic differentiation with the performance of numeric differentiation.
There is another comparison to be made with traditional systems: in the same way that all data operations (functions) of all programs can be reduced or desugared down to the elementary arithmetic operations found in the arithmetic logic units of virtual or physical interpreters, so too can derivatives. That is, theoretically we could derive the derivative of any neural network with only the derivatives of addition and multiplication. But for convenience, we provide sugar derivatives for more complex operations.
Just-in-time, and not statically, since deep learning frameworks such as torch and jax are embedded domain-specific languages.
For instance, type errors surface at compile time with Rust's rustc language implementation, and at run time with Python's cpython implementation.