
Structure and Interpretation of Tensor Programs
The Hacker's Guide to the Internals of Machine Learning Systems
Train nanogpt and nanochat with picograd: the bridge from micrograd to tinygrad
Made with 🖤 by j4orz with GPU MODE
First Edition.
Dedication
This course is dedicated to my father Dr. Thomas Zhang ND., R.TCMP, R.Ac.
I owe everything to him.
In 2023, he guided me as my plans for graduate school were changed to helping my sister recover from mismanaged schizophrenia. That same year, he had two hemorrhagic strokes, and passed away shortly after. My personal relationship to the psychiatric mind and neurological brain led to a deep reflection on my life's work up ahead. The result is the following course notes and lectures.
I will make you proud dad. May you rest in pure land. We'll meet again.
We’ll Meet Again by Vera Lynn 1939. Cover by Johnny Cash 2002.

Preface
As a compiler writer for domain-specific cloud languages, I became frustrated with the non-constructive and disjointed nature of my learning experience in the discipline of machine learning systems, particularly with domain-specific tensor languages — I needed a book which provided the necessary foundation at the intersection of mathematics, compilers, and computer architecture in order to contribute meaningfully to MLSYS. You can think of these course notes, their accompanying lectures, and the capstone project as a whirlwind-tour form of SICP/PAPL applied to the substance of neural networks, deep learning frameworks, and massively parallel processors.
If you have similar frustrations, you may benefit from it too.
Good luck on your journey.
Are you ready to begin?
Introduction
Throughout our rich three-thousand-year history of mathematics, there has existed a time-immemorial tension between the finitely discrete and the infinitely continuous descriptions of reality. That same beautiful tension is repeating itself with computation. And it's our opportunity as programmers to experience that beauty. Let's take a deeper look into what exactly we mean1.
A Discrete Hope
Taking a look at where mathematics all began, we find ourselves in the world of Ancient Greece. One way to conceptualize the Ancient Greek zeitgeist is to imagine that of Silicon Valley. Although the average programmer might draw more monetary benefit from their profession, both have access to the beauty of reality. In the same way a beginner programmer entering the profession possesses a childlike awe that somehow the code they express on their keyboard will lay down the bricks of the intergalactic network, so too does the novice mathematician wonder, with inspiration, how the symbols they inscribe on paper reflect the astronomical stars of space.
In order to build the cathedrals and bazaars of digital reality, programmers orchestrate the tools of languages and runtimes — created for them by toolsmiths we call compiler writers and kernel engineers — in concert with one another to write programs. While the Ancient Greek mathematician did not have access to mechanical computation to execute the instructions of said programs, they nonetheless had access to formal language — created for them by the toolsmiths we call logicians — in which they were able to express proofs. The programmer uses the compiler writer's types, and the mathematician uses the logician's propositions.2
The programmer and mathematician share a similar essence not only in the tools of type and proposition, but also in the finitely discrete nature of their intellectual creations, proof and program. That is, the programmers of the valley and the mathematicians of Ancient Greece share the same dream: to describe, and thus reduce, reality with discrete language. The only minor difference is that the programmer is somewhat more extreme in wanting to reduce everything down to two specific integers: 0 and 1.
The Continuous Strikes Back
- Incommensurability of square root 2
- Sunrise Problem (of Induction)
- Probability
- Bayes Rule
Discrete vs Continuous in Computation
- Inductive Bias: Model
- Inductive Principle: Loss Function
- Empirical Risk Minimization: Maximum Likelihood Estimation
- SW1.0 and SW2.0: https://colah.github.io/posts/2015-09-NN-Types-FP/
Onward!
You have a journey ahead of you.
Are you ready to begin?
The introduction waxes philosophical with big-picture metaphysics of what we can know (epistemology) and what exists (ontology) — what is colloquially known in the valley as being a wordcel. Feel free to skip to chapter 1 to get the shape-rotation show on the road!
This is formally known as the Curry-Howard Correspondence.
1. Elements of Learning
In part 1 of the course notes, we transition from deterministic types to stochastic tensors
by training a house price prediction model with numpy, understanding
the mechanics of the statistical learning approach to artificial intelligence.
After that, we start building our own deep learning framework picograd, and
develop the Tensor data structure with linear transformations
encoded via .matmul()s which are accelerated on CPU.
This will prepare us for the next part of the course notes where we change the
learning task to sequence learning and the function class to deep neural networks.
Contents
- 1.1 Learning Price Directly with Regression
- 1.2 High Dimensional Arrays
- 1.3 Accelerating Basic Linear Algebra Subroutines on CPU
- 1.4 Learning Sentiment Iteratively with Classification
1.1 Learning Price Directly with Regression
The primary goal of the statistical learning approach to artificial intelligence is induction: to use historical data to make accurate predictions in new situations. More formally, the goal is to recover some general function $f$ from particular data $\mathcal{D} = \{(x^{(i)}, y^{(i)})\}_{i=1}^{n}$, which can then be evaluated to predict some future value $\hat{y} = f(x)$ given some future observation $x$. That is, the goal is generalization. Machine learning has its roots in statistical inference, and so we also refer to the data as observations or evidence, and the functions as hypotheses or models. For instance, consider the task of house price prediction1 where the input space $\mathcal{X}$ is the size in square feet and the output space $\mathcal{Y}$ is the price, so the function we are interested in recovering has type $f: \mathcal{X} \to \mathcal{Y}$. Since enumerating through every house size and price pair in $\mathcal{X} \times \mathcal{Y}$ is intractable (otherwise all pairs could be inserted into a lookup table or database), an assumption about the structure of the data needs to be made, referred to as the inductive bias.
<SHOW_DATA>
1.1.1 Fitting a Line
The simplest assumption to make is that there exists a linear relationship in the data, so that a function of the form $f(x) = wx + b$, parameterized by slope $w$ and intercept $b$, fits the data well. After the inductive bias on the family of functions has been made, the learning algorithm must find the function with the best fit by selecting parameters $w$ and $b$. Since artificial learning algorithms don't have a visual cortex like biological humans2, the notion of "good fit" needs to be quantified with a second function of the form $\ell: \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}$, referred to as the loss function, which can be used to systematically find the function of best fit and thus the best predictor3. For now, we'll make the unjustified decision (justifying it later4) to use the least squares loss $\ell(\hat{y}, y) = (\hat{y} - y)^2$, and denote the loss function with $L(w, b)$ to reflect the fact that the input-output pairs are fixed and that the parameters determine the goodness of fit. Evaluating the least squares loss over the entire dataset results in $L(w, b) = \sum_{i=1}^{n} (f(x^{(i)}) - y^{(i)})^2$, and minimizing this quantity is the final step, denoted by $\min_{w, b} L(w, b)$.
To summarize, we have selected
- an inductive bias with the family of linear functions which describes the relationship between the input and output space
- an inductive principle with the least squared loss which measures accuracy with an objective function
- and will find the line of best fit by estimating the parameters which minimize the empirical risk, denoted as $\hat{w}, \hat{b} = \arg\min_{w,b} L(w, b)$
With the linear regression model, we can solve this optimization with a direct method, as in the sketch below.
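To make the direct method concrete, here is a minimal numpy sketch that fits the line by solving the least squares problem. The dataset values and variable names are illustrative assumptions, not the course's dataset.

import numpy as np

# toy dataset: house sizes (sqft) and prices (in $1000s); the values are illustrative
x = np.array([850.0, 900.0, 1200.0, 1500.0, 2000.0])
y = np.array([200.0, 215.0, 290.0, 355.0, 470.0])

# design matrix with a column of ones, so the intercept b is learned jointly with w
X = np.stack([x, np.ones_like(x)], axis=1)      # shape (n, 2)

# direct method: solve the least squares problem min_theta ||X theta - y||^2
theta, *_ = np.linalg.lstsq(X, y, rcond=None)   # theta = [w, b]
w, b = theta
print(f"w={w:.3f}, b={b:.3f}")

# predict the price of an unseen 1750 sqft house
print(f"prediction for 1750 sqft: {w*1750 + b:.1f}")

Solving the normal equations $X^\top X \theta = X^\top y$ (here via np.linalg.lstsq, which uses a more numerically stable factorization) is what we mean by a direct method: no iteration is required.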
todo: parameter, linear/affine, matrices as data
1.1.2 Fitting a Line, in Higher Dimensions
Vector Space
1.1.3 Fitting a Line, Probabilistically
The most widely adopted mathematical language for formalizing our intuitions around non-deterministic, stochastic phenomena is probability theory (as opposed to alternative frameworks such as probabilistic logic or uncertainty quantification). In probability theory, statements are neither true nor false; rather, truth is distributed across a weighted set of random events $\omega \in \Omega$. Similarly, random variables and their probability distributions do not take on definite values but rather sets of values. Probability theory is the generalization of Aristotelian logic.
Probability Space
1.1.4 Fitting a Line, with Non-Linear Features
Feature Space
1.2 High Dimensional Arrays
1.2.1 Tensor Language
class Tensor():
1.2.2 Op Intermediate Representation
class Op():
1.2.3 Device Runtime
class Device():
1.3 Accelerating Basic Linear Algebra Subroutines on CPU
1.3.1 Latency-Oriented SISD Fantasy: From IBM 360 to Macbook
Throughout the last 60 years of computer architecture, we've seen the number of transistors double roughly every two years, enabling the transition across said different computer classes from room-sized mainframe computers, to refrigerator-sized minicomputers, to personal-sized microcomputers, and even to embedded ones. Although the form factors have drastically changed, the computer architectures have roughly remained the same.
Across the computer classes of IBM's System/360s, to DEC's PDP11s, to Apple's Macbooks, all architectures in essence follow the original stored-program sequential instruction processing architecture laid out by Burks, Goldstine, and von Neumann in their 1946 paper5 Preliminary Discussion of the Logical Design of an Electronic Computing Instrument, dubbed the von Neumann Architecture, which roughly speaking consists of five elements placed into three categories:
- Processor and Memory
- Input and Output
- Control
<FIGURE_2>
In order to execute a program's specified sequence of instructions, the control unit orchestrates all components in concert with one another in a fetch, decode, execute cycle, interpreting each instruction and carrying out its computation on the processor against the memory's state. These instructions are specified by the instruction set architecture and can be encoded in a human-readable language referred to as assembly code, or in the direct binary encoding referred to as machine code. Although these days programs are specified in higher-level languages with a larger semantic gap from the machine, understanding assembly is crucial for any low-level machine learning systems programmer in order to obtain an understanding as close as possible to the series of instructions the processor actually6 executes.

$ time = instructions \times \frac{clocks}{instruction} \times \frac{seconds}{clock} $
Optimization 1: Increase clock frequency (fewer seconds per clock)
Optimization 2: Increase instruction-level parallelism (more instructions per clock)
Optimization 3: Avoid stalls with speculation (caches, branch predictors, memory prefetchers)
What to do with twice the silicon?? (Orange line)
1.3.2 Bottlenecks and Rooflines
Is the throughput performance of matrix multiplication satisfactory? Well, if the hardware is theoretically capable of delivering more performance throughput measured in FLOP/S, why leave any compute on the table? This intuition is captured in the roofline cost model where ...
Figure x: https://modal.com/gpu-glossary
One of the key operations used in scientific computing (and machine learning) is matrix multiplication $C = AB$, where $A \in \mathbb{R}^{n \times m}$, $B \in \mathbb{R}^{m \times p}$, and $C_{ij} = \sum_{k=0}^{m-1} A_{ik} B_{kj}$ for each $i \in [0,n)$, $j \in [0,p)$. Translating the mathematical definition directly to a computational algorithm results in the following $O(n^3)$ implementation that computes $n \times p$ dot products:
def matmul(A, B):
    n, m, p = len(A), len(A[0]), len(B[0])
    C = [[0.0 for _ in range(p)] for _ in range(n)]
    for i in range(n):
        for j in range(p):
            sum = 0.0
            for k in range(m):
                sum += A[i][k] * B[k][j]
            C[i][j] = sum
    return C

if __name__ == "__main__":
    A = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
    B = [[7.0, 8.0], [9.0, 10.0], [11.0, 12.0]]
    C = matmul(A, B)
    print(C)
How long do you expect the wall clock time to take for a square matrix of size 4096? On the order of nanoseconds, microseconds, milliseconds, seconds, minutes, hours, days?
Since the algorithm performs a cubic amount of work, the number of floating point operations is roughly $2n^3 \approx 1.4 \times 10^{11}$ for $n = 4096$, and the measured running time on an Apple M3 takes X hours, meaning the throughput is 10 MFLOP/S, which is Z% of the W peak theoretical TFLOP/S of the M3 machine.
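To see where a number like this comes from, here is a small timing harness for the pure-Python matmul above. The helper name bench and the choice of n=256 are illustrative assumptions; a full n=4096 run in pure Python would take far too long, so one extrapolates from smaller sizes.

import random, time

def bench(n=256):
    # random n x n matrices as plain Python lists, matching the triple-loop matmul above
    A = [[random.random() for _ in range(n)] for _ in range(n)]
    B = [[random.random() for _ in range(n)] for _ in range(n)]
    t0 = time.perf_counter()
    matmul(A, B)                      # the implementation defined above
    elapsed = time.perf_counter() - t0
    flops = 2 * n**3                  # one multiply and one add per innermost iteration
    print(f"n={n}: {elapsed:.2f}s, {flops / elapsed / 1e6:.1f} MFLOP/S")

if __name__ == "__main__":
    bench(256)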
function matmul(A, B) {
    const [n, m, p] = [A.length, A[0].length, B[0].length];
    const C = Array.from({ length: n }, () => Array(p).fill(0));
    for (let i = 0; i < n; i++) {
        for (let j = 0; j < p; j++) {
            let sum = 0;
            for (let k = 0; k < m; k++) {
                sum += A[i][k] * B[k][j];
            }
            C[i][j] = sum;
        }
    }
    return C;
}
50 MFLOP/S
fn matmul(A: &[Vec<f64>], B: &[Vec<f64>]) -> Vec<Vec<f64>> {
    let (n, m, p) = (A.len(), A[0].len(), B[0].len());
    let mut C = vec![vec![0.0; p]; n];
    for i in 0..n {
        for j in 0..p {
            let mut sum = 0.0;
            for k in 0..m {
                sum += A[i][k] * B[k][j];
            }
            C[i][j] = sum;
        }
    }
    C
}

fn main() {
    let A = vec![vec![1.0, 2.0, 3.0], vec![4.0, 5.0, 6.0]];
    let B = vec![vec![7.0, 8.0], vec![9.0, 10.0], vec![11.0, 12.0]];
    let C = matmul(&A, &B);
    println!("{:?}", C);
}
1 GFLOP/S
1.3.3 M1 Processor Architecture
1.3.4 Macbook's Memory Hierarchy
- vectorized, cache blocked 40GFLOP/S (0.2% peak)
- multicore 200GFLOP/S (40% CPU vector peak). (1% of SoC peak)
- bf16, matrix units 1.5TFLOP/S (7.5% of SoC peak)
- metal 3.2TFLOP/S (15% of SoC peak)
- npu (int16) 16 TOP/S (80% of SoC peak) 19TFLOP/S A100 4PFLOP/S H100 14PFLOP/S B100 20PFLOP/S B200 6 orders of magnitude speedup (1,000,000) speedup <-- kills proebsting's law!
1.4 Learning Sentiment Iteratively with Classification
We'll now tackle our first learning problem in the domain of language with sentiment analysis — classifying whether a given piece of text has a positive or negative connotation — by modifying
- our model to support discrete categorical targets
- our optimization method to an iterative algorithm, namely gradient descent
Before diving into the model that supports discrete categorical targets, we will take a closer look at the optimization method of gradient descent, because there are two reasons why we want to move from direct to iterative methods: the first is memory bottlenecks, and the second is non-linear function classes.
1.4.1 Linear Regression, and the Limitations of Direct Methods
The first reason why we want to switch from direct to iterative optimization methods is that direct methods are memory-bound on materializing the matrix $X^\top X$ and solving the resulting normal equations.
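For a back-of-the-envelope sense of scale (the feature count here is an illustrative assumption, not a benchmark):

$$ d = 10^5 \text{ features} \implies X^\top X \in \mathbb{R}^{10^5 \times 10^5} \implies 10^{10} \text{ entries} \times 8 \text{ bytes} \approx 80\ \text{GB in float64} $$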
1.4.2 Inductive Bias and Principle with Logistic Regression and Cross Entropy
The second reason why we want to switch from direct to iterative optimization methods is that even if the number of dimensions is small enough that we are not memory-bound, direct methods simply will not work for function classes other than linear regression. Consider ___.
Inductive Bias
In order to train a model which learns the sentiment of text, we will collect data where the input space consists of feature vectors $\mathbf{x} \in \mathbb{R}^d$ and the output space is $\{-1, +1\}$, which encodes the negative and positive cases of binary classification7 (we will expand the number of categorical responses in the next section with multi-class classification). The function of interest we'd like to recover from the data is then of the form $f: \mathbb{R}^d \to \{-1, +1\}$. Recall that machine learning consists of selecting a family of functions as the inductive bias, a loss function as the inductive principle, and estimating the parameters by empirically minimizing the risk.
For the inputs,
For the model class, we will continue to use a weighted sum of the input vector with weights $\mathbf{w} \in \mathbb{R}^d$ so that $z = \mathbf{w}^\top\mathbf{x}$, but add the function $\sigma : \mathbb{R} \to [0,1]$ where $\sigma(z) = \frac{1}{1+e^{-z}}$, referred to as the logistic or sigmoid8 function, so that the output is a valid probability. We'll interpret $\mathbf{w}^\top\mathbf{x}$ as the log odds, $\sigma(\mathbf{w}^\top\mathbf{x})$ as $p(Y=1|X=\mathbf{x})$, and the complement $1 - \sigma(\mathbf{w}^\top\mathbf{x})$ as $p(Y=-1|X=\mathbf{x})$. However, the current type of $f$ is $f:\mathbb{R}^d \to [0,1]$, whereas the task is to assign a negative ($-1$) or positive ($+1$) sentiment to the provided input feature vector $\mathbf{x}$. To ameliorate this we will add a final decision boundary where $$ \text{sentiment}(\mathbf{x}) := \begin{cases} 1 &\text{if } p(Y=1|X=\mathbf{x}) > 0.5 \\ -1 &\text{otherwise} \end{cases} $$
(FIGURE.MODEL ARCHITECTURE)
We can vectorize the evaluation of $f$ on each example $i \in [0..n)$ with a data matrix $X \in \mathbb{R}^{n \times d}$ so that $\hat{\mathbf{y}} = \sigma(X\mathbf{w})$:
import torch

def f(xi: torch.Tensor, w: torch.Tensor) -> torch.Tensor:
    """Compute sigmoid(w^T x) for a single example xi."""
    logits = torch.matmul(w, xi)
    return torch.sigmoid(logits)

if __name__ == "__main__":
    w = torch.randn(3)
    X = torch.randn(5, 3)
    for xi in X:  # does this work?
        yi_hat = f(xi, w)
        print(yi_hat.item())
For example, let's trace through the evaluation of our model with the following input example $\mathbf{x}$:
Inductive Principle
For the loss function
1.4.3 Iterative Optimization via Gradient Descent
For estimating the parameters that minimize the loss $L$, we will optimize iteratively using gradient descent, for the two reasons already mentioned: being memory-bottlenecked on materializing $X^\top X$ with the linear regression model class, and, with other classes of functions (which is currently the case), not having a closed-form solution to optimize directly.
Gradient descent, simply put, is an optimization method that uses differential calculus, namely, the fact that the gradient points in the direction of steepest ascent, so stepping against it provides a hint on how to iteratively move the parameters closer to an optimum. The gradient descent algorithm can be expressed in a one-line update rule so that the goal of $\min_\theta L(\theta)$ is implemented by:
$$ \theta^{(t+1)} := \theta^{(t)} - \alpha \nabla_\theta L(\theta^{(t)}) $$
where $\alpha$ is referred to as the learning rate or step size. We now dive deeper into differential calculus and generalize the derivative to higher-dimensional spaces9 to better understand what's happening under the hood.
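Here is a minimal sketch of this update rule applied to logistic regression in torch. The synthetic data, the {0,1} target encoding (convenient for binary cross entropy), the learning rate, and the step count are all illustrative assumptions.

import torch
import torch.nn.functional as F

n, d, alpha = 100, 3, 0.1
X = torch.randn(n, d)                       # synthetic feature vectors
y = (torch.rand(n) > 0.5).float()           # synthetic targets encoded as {0, 1}
w = torch.zeros(d, requires_grad=True)      # parameters to learn

for step in range(500):
    p = torch.sigmoid(X @ w)                # forward: p(Y=1|X=x) for every example
    loss = F.binary_cross_entropy(p, y)     # inductive principle (loss over the batch)
    loss.backward()                         # populate w.grad with dL/dw
    with torch.no_grad():
        w -= alpha * w.grad                 # theta^(t+1) := theta^t - alpha * grad(L)
        w.grad.zero_()                      # reset the gradient for the next iteration

Each iteration only touches the current gradient, so memory stays proportional to the number of parameters rather than to $X^\top X$.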
1.4.4 Multinomial Logistic Regression
1.4.5 Generalized Linear Models
While predicting house prices is not AGI, we follow this hello world in order to provide an easy example while introducing the foundational mechanics and workhorses of machine learning, which are crucial in part two of the course notes where we cover more industrial machine learning. In software 1.0 you should know how to reverse a linked list. In software 2.0 you should know how to fit a line.
Our visual cortex won't work in input-output spaces with dimensions higher than 3 anyways.
When loss functions are convex, we can find the best predictor which minimizes the loss. This used to be very important before the empirical success of deep neural networks, a function class which is highly non-convex, as we will see in part two of the course notes.
With the inductive principle of maximum likelihood estimation.
In that paper, they refer to components of the computer architecture as "organs". With each new class of computer unlocking new applications, humans find grand-inspiring names such as the digital computer, intergalactic network, the cloud, and artificial intelligence. The more things change, the more things stay the same.
Even in the case of x86-64 based instruction set architectures, which decode the instructions into microinstructions.
The encodings are arbitrary. The output space can alternatively be the set $\{0, 1\}$, which we avoid to prevent confusion between the categorical encodings and the probabilities $p \in [0,1]$ assigned to them.
Perhaps a more suitable name for the model is "sigmoidal classification" rather than "logistic regression".
This is known as the Fréchet derivative, which is defined on any vector space equipped with a norm.
2. Neural Networks
In part two of the course notes we change both our goal and function class to
sequence learning and deep neural networks, respectively. We start off by
training a language model in pytorch using a feedforward neural network with gradient descent.
After that, we build off from picograd's Tensor, Op, and Device developed in part one
and abstract two tasks for the research scientist:
- differentiation and optimization with Tensor.backward() and optim.sgd/optim.adam, and
- GPU kernel acceleration of BLAS and DNN operations in CUDA C
which we will then use to train other language models with different inductive biases (invariances) such as RNNs, LSTMs, BERTs, GPTs, and Diffusion Models. Finally, we will accelerate the performance of the specifically targeted nanogpt with FlashAttention kernels. This will set us up for part three of the course notes where we modify the language implementation of the deep learning framework to support distributed compilation in order to support the training and inference of nanochat.
Contents
- 2.1 Learning Representations, Learning Sequences
- 2.2 ERM via Automatic Differentiation and Stochastic Gradient Descent
- 2.3 Accelerating BLAS operations on GPU
- 2.3.1 Throughput-Oriented Processor Architecture
- 2.3.2 Throughput-Oriented Memory Hierarchies
- 2.3.3 Accelerating GEMM with Better Data Reuse: Coalescing, Caching, Tiling
- 2.3.4 Accelerating GEMM with Better Scheduling:
- 2.3.5 Accelerating GEMM with Tensor Cores
- 2.3.6 Productively Programming Tensor Cores: ThunderKittens, CUTLASS
- 2.4 Learning Sequences with Different Inductive Biases
- 2.5 Accelerating nanogpt with DNN operations on GPU
2.1 Learning Sequences
2.1.1 Sentiment Learning with Feedforward Neural Networks
In part 1 of the course notes we trained generalized linear models. In part 2 we increase the expressivity of the function class by composing linear maps with non-linearities. The feedforward neural network, simply put, is a series of linear and non-linear layers of the form $h^{(l+1)} = \phi(W^{(l)} h^{(l)} + b^{(l)})$,
where $W^{(l)}$ is a weight matrix, $\phi$ is an elementwise nonlinearity, and $b^{(l)}$ is a bias vector. Conceptually, the linear layers perform linear transformations that rotate, reflect, shear, and scale space, whereas the nonlinear transformations squash and twist space.
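As a minimal sketch in torch (the layer widths and batch size are arbitrary illustrative choices, not picograd's API), the alternation of linear and nonlinear layers looks like this:

import torch
import torch.nn as nn

# a two-layer feedforward network: linear transforms interleaved with an elementwise tanh
ffn = nn.Sequential(
    nn.Linear(10, 64),   # W1 x + b1: rotate/shear/scale a 10-d input into 64-d
    nn.Tanh(),           # elementwise nonlinearity: squash each coordinate into (-1, 1)
    nn.Linear(64, 1),    # W2 h + b2: project back down to a scalar output
)

x = torch.randn(4, 10)   # a batch of 4 examples
print(ffn(x).shape)      # torch.Size([4, 1])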
We will now use the same model of the feedforward neural network to accomplish two other goals. Namely, representation learning, and sequence learning.
2.1.2 Representation Learning with Feedforward Neural Networks
2.1.3 Language Modeling with Feedforward Neural Networks
A sequence model, simply put, is the conditional probability distribution $p(x_t \mid x_{<t})$ of an output token given the input tokens that precede it. A sequence of tokens can be a sentence of words in the domain of language, a series of pixels in the domain of vision, or a stream of waves in the domain of audio.
Since we are modeling language as a stochastic phenomenon, we use the formal language of probability theory, where a probability space $(\Omega, \mathcal{F}, \mathbb{P})$ is a measurable space $(\Omega, \mathcal{F})$ equipped with a measure $\mathbb{P}: \mathcal{F} \to [0,1]$. In the domain of language, the measurable space consists of a sample space $\Omega$, the set of all tokens modelling a vocabulary, and an event space $\mathcal{F}$, the set of all token combinations modelling a language. The measure weighs a particular token combination (a sentence, really) as an event with respect to the set of all possible token combinations (sentences) as the entire event space. Once we use a random variable to map events to real numbers, we can forget about the probability space and focus our attention on language models, which are joint probability distributions over all sequences of tokens.
Language modeling with ngrams.
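As a minimal count-based sketch of the ngram idea (the toy corpus, the bigram order, and the Laplace smoothing are illustrative assumptions; the course's dataset is names.txt):

import torch

# count-based bigram language model over characters; the corpus is a toy stand-in
corpus = ["emma", "olivia", "ava", "isabella", "sophia"]
vocab = sorted(set("".join(corpus) + "."))          # '.' marks the start/end of a word
stoi = {c: i for i, c in enumerate(vocab)}

counts = torch.zeros(len(vocab), len(vocab))
for w in corpus:
    chars = ["."] + list(w) + ["."]
    for a, b in zip(chars, chars[1:]):
        counts[stoi[a], stoi[b]] += 1               # tally each (current, next) pair

# normalize rows into conditional distributions p(next char | current char)
P = (counts + 1) / (counts + 1).sum(dim=1, keepdim=True)   # +1 for Laplace smoothing
print(P[stoi["a"]])   # distribution over the character following 'a'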
(1. EXPLAIN MODEL).
# FFN MODEL f: R^n -> R
import torch
import torch.nn.functional as F   # for F.cross_entropy below
# note: `layers` (Embedding/Linear) is assumed to be in scope, e.g. picograd's layers module

class MLP():
    """
    model: Neural Language Models (Bengio et al. 2003)
    key:
      b: batch size, t: sequence length
      v: vocabulary size, e: dimension of embedding, d: dimension of model
    """
    def __init__(self, cfg):
        super().__init__()
        self.b, self.t, self.v, self.e, self.d = cfg.b, cfg.t, cfg.v, cfg.e, cfg.d
        self.wte = layers.Embedding(self.v+1, self.e) # token embeddings table (+1 for <BLANK>)
        self.l1 = layers.Linear(self.t*self.e, self.d, b=False)
        self.l2 = layers.Linear(self.d, self.d, b=False)
        self.l3 = layers.Linear(self.d, self.v, b=False)

    def forward(self, i, targets=None):
        embs = [] # gather the word embeddings of the previous t words
        for k in range(self.t):
            tok_emb = self.wte(i) # token embeddings of shape (b, t, e)
            i = torch.roll(i, 1, 1)
            i[:, 0] = self.v # special <BLANK> token
            embs.append(tok_emb)
        # concat all of the embeddings together and pass through an MLP
        x = torch.cat(embs, -1) # (b, t, e*t)
        x = self.l1(x).tanh()
        x = self.l2(x).tanh()
        x = self.l3(x)
        yhat = x
        # if we are given some desired targets also calculate the loss
        loss = None
        if targets is not None:
            loss = F.cross_entropy(yhat.view(-1, yhat.size(-1)), targets.view(-1), ignore_index=-1)
        return yhat, loss
(2. EXPLAIN DATASET).
# FFN DATA d={(x^i,y^i)}
import random
import torch
import torch.nn.functional as F
import matplotlib.pyplot as plt

def build_dataset(t):
    words = open('./data/names.txt', 'r').read().splitlines()
    v = sorted(list(set(''.join(words))))
    encode = { c:i+1 for i,c in enumerate(v) }
    encode['.'] = 0
    decode = { i:c for c,i in encode.items() }

    def gen_dataset(words, t):
        X, Y = [], []
        for w in words:
            context = [0] * t
            for c in w + '.':
                X.append(context)
                Y.append(encode[c])
                # print(''.join(decode[i] for i in context), '-->', decode[encode[c]])
                context = context[1:] + [encode[c]]
        X, Y = torch.tensor(X), torch.tensor(Y) # X:(N,t) Y:(N)
        return X, Y

    random.seed(42)
    random.shuffle(words)
    n1, n2 = int(0.8*len(words)), int(0.9*len(words))
    Xtraining, Ytraining = gen_dataset(words[:n1], t)
    Xdev, Ydev = gen_dataset(words[n1:n2], t)
    Xte, Yte = gen_dataset(words[n2:], t)
    return Xtraining, Ytraining
(3. EXPLAIN TRAINING LOOP).
# FFN TRAINING LOOP: theta^(t+1) := theta^t - alpha*grad(L)
if __name__ == "__main__":
    b, t, v, e, d = 32, 3, 27, 10, 200 # init hyperparameters
    g = torch.Generator().manual_seed(42) # rng for reproducible initialization
    X, Y = build_dataset(t) # init data
    C = torch.randn((v,e), generator=g) # init embedding
    model = MLP() # init model
    params = [C] + [p for l in model for p in l.parameters()]
    for p in params: p.requires_grad = True

    N, losses, steps = X.shape[0], [], [] # train
    for step in range(200000):
        i_b = torch.randint(0, N, (b,))
        X_b, Y_b = X[i_b], Y[i_b]
        X_bd = C[X_b].view(-1, t * e) # 0. embed
        for layer in model: X_bd = layer(X_bd) # 1. forward
        loss = X_bd.cross_entropy(Y_b)
        for layer in model: layer.out.retain_grad()
        for p in params: p.grad = None
        loss.backward() # 2. backward
        for p in params: p.data += -0.01 * p.grad # 3. update
        # optimizer.step()?
        steps.append(step)
        losses.append(loss.log10().item())
        if step % 10000 == 0: print(f"step: {step}/{200000}, loss {loss.item()}")
    plt.plot(steps, losses)
In the next two chapters, 2.2 and 2.3, we will implement two features
in picograd corresponding to the two primary tasks that pytorch abstracts away from research scientists:
the backward pass with automatic differentiation, and device acceleration of the specified forward pass.
2.2 ERM via Automatic Differentiation and Gradient Descent
So far we've been training feedforward neural networks using the magical
Tensor.backward() function in order to materialize the gradient of the loss with respect to our
parameter weights, which gradient descent uses in its update rule
$\theta^{(t+1)} := \theta^{(t)} - \alpha \nabla_\theta L(\theta^{(t)})$.
We will now dive deeper into how .backward() is implemented.
2.2.1 Automatic Differentiation via Tensor.backward()
Consider the function $f(x_1, x_2) = e^{x_1}\sin^2(x_2)$ where $f: \mathbb{R}^2 \to \mathbb{R}$,
and translate it to its computational counterpart in Python with one-dimensional Tensors:
import picograd as pg
def f(x1: pg.Tensor, x2: pg.Tensor) -> pg.Tensor:
a = pg.exp(x1)
b = pg.sin(x2)
c = b**2
d = a*c
return d
Figure 1. Python source for the function $f(x_1, x_2) = e^{x_1}\sin^2(x_2)$.
Here we've broken up the function to render the subexpressions more clearly.
But this isn't necessary — automatic differentiation would work even if the function were expressed in one line.
In part one, the development of picograd followed that of numpy —
an array programming language similar to Matlab but embedded in the host language of Python —
that could evaluate functions of the form $f: \mathbb{R}^n \to \mathbb{R}^m$,
where Tensor objects stored their values in the value field
and the operations that produced those values with Op.
For instance, evaluating the specified function f from above with 9 and 10
if __name__ == "__main__":
    print(f(9, 10))
populates the Tensor.value fields.
In part one of the book we verified this with a REPL-interface,
but we can also represent the entire expression being evaluated with a
graph of vertices and edges where the vertices are Tensors
(along with their Ops and values) and the edges are their data dependencies:
Here you can see that even if the function was specified in one line,
the graph of the expression always parses into Tensor vertices, and data dependency edges.
You may have noticed the Tensor.grad fields, which will store the values of the derivatives $\frac{\partial d}{\partial x_1}$ and $\frac{\partial d}{\partial x_2}$.
The question that remains is how to populate these fields.
Taking a step back to differential calculus, deriving the derivative of $f$ involves the application of the chain rule, $\frac{\partial f}{\partial x} = \frac{\partial f}{\partial u}\frac{\partial u}{\partial x}$. Evaluating the derivative of the function with respect to its inputs $x_1$ and $x_2$ results in the following, where $a = e^{x_1}$, $b = \sin(x_2)$, $c = b^2$, and $d = a\,c$ refer to the subexpressions in the code above:
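$$ \frac{\partial d}{\partial x_1} = \frac{\partial d}{\partial a}\frac{\partial a}{\partial x_1} = c \cdot e^{x_1} = e^{x_1}\sin^2(x_2), \qquad \frac{\partial d}{\partial x_2} = \frac{\partial d}{\partial c}\frac{\partial c}{\partial b}\frac{\partial b}{\partial x_2} = a \cdot 2b \cdot \cos(x_2) = 2\,e^{x_1}\sin(x_2)\cos(x_2). $$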
Symbolic and numeric differentiation: symbolic differentiation has performance issues since a large unrolled expression must be constructed in order to differentiate1, whereas numerical differentiation has correctness issues since evaluating finite differences requires evaluating functions at finite precision, resulting in numerical instability. (trace through EXAMPLE for both. talking nets widrow)
To populate the Tensor.grad fields, the simplest idea would be
to literally translate the manual derivation of the derivative into code.
The translation from math to code involves a design decision:
should we evaluate from outputs to inputs (symbolically outside-in, graphically right-to-left) or from inputs to outputs (symbolically inside-out, graphically left-to-right)?
Although the former order seems more natural with symbolic expressions,
there's nothing illegal about the latter.
import picograd as pg
def f(x1: pg.Tensor, x2: pg.Tensor) -> pg.Tensor:
a = pg.exp(x1)
b = pg.sin(x2)
c = b**2
d = a*c
return d
# dict[f(x), f'(x)] of local derivatives (adjoints)
dd_da, dd_dc = [c, a] # d(a,c):=a*c ==> d'(a)=c, d'(c)=a
da_dx1 = pg.exp(x1) # a(x1):=exp(x1) ==> a'(x1)=exp(x1)
dc_db = 2*b # c(b):=b^2 ==> c'(b)=2b
db_dx2 = pg.cos(x2) # b(x2):=sin(x2) ==> b'(x2)=cos(x2)
# outputs to inputs: outside-in symbolically, right-to-left graphically
dd_dd = pg.Tensor(1) # base case
dd_da, dd_dc = [dd_dd*dd_da, dd_dd*dd_dc]
dd_dx1 = dd_da*da_dx1 # DONE for the x1->d path
dd_db = dd_dc*dc_db
dd_dx2 = dd_db*db_dx2 # DONE for the x2->d path
# inputs to outputs: inside-out symbolically, left-to-right graphically
dx1_dx1, dx2_dx2 = [pg.Tensor(1), pg.Tensor(1)] # base case
da_dx1 = da_dx1*dx1_dx1
dd_dx1 = dd_da*da_dx1 # DONE for the x1->d path
db_dx2 = db_dx2*dx2_dx2
dc_dx2 = dc_db*db_dx2
dd_dx2 = dd_dc*dc_dx2 # DONE for the x2->d path
Do you notice any difference in the number of evaluations between the two orders?
The outputs-to-inputs ordering takes 6 arithmetic operations (including the destructuring),
whereas the inputs-to-outputs ordering takes 7 arithmetic operations.
This is because the former can reuse dd_dd
as a dynamic programming solution to a subproblem for the two inputs, whereas the latter cannot.
And taking a step back, we only want to reuse the output-side partials because the shape of the function
is $f: \mathbb{R}^n \to \mathbb{R}$, with many inputs and a single output. Alternatively, if $f$ had type $f: \mathbb{R} \to \mathbb{R}^m$,
then the input-to-output ordering would be able to reuse results.
This distinction is referred to as "forward-mode" vs "reverse-mode", and reflects
the fact that for some function $f: \mathbb{R}^n \to \mathbb{R}^m$
the time complexity of forward-mode differentiation is proportional to the number of inputs $n$,
whereas that of reverse-mode differentiation is proportional to the number of outputs $m$.
If the expression graph fans in so that $n \gg m$, reverse-mode is preferred.
If the expression graph fans out so that $n \ll m$, forward-mode is preferred.
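To make forward-mode concrete, here is a minimal dual-number sketch (illustrative only; this is not picograd's implementation): each value carries a (primal, tangent) pair, and seeding one input's tangent with 1 yields the derivative with respect to that input after a single forward evaluation.

import math
from dataclasses import dataclass

@dataclass
class Dual:
    val: float   # the primal value f(x)
    dot: float   # the tangent: derivative with respect to the seeded input

    def __mul__(self, other):
        # product rule propagated alongside the primal multiply
        return Dual(self.val * other.val, self.val * other.dot + self.dot * other.val)

def dexp(x: Dual) -> Dual:
    return Dual(math.exp(x.val), math.exp(x.val) * x.dot)

def dsin(x: Dual) -> Dual:
    return Dual(math.sin(x.val), math.cos(x.val) * x.dot)

def f(x1: Dual, x2: Dual) -> Dual:
    a = dexp(x1)
    b = dsin(x2)
    c = b * b
    return a * c

# seed x1's tangent with 1 to obtain dd/dx1 in one pass
x1, x2 = Dual(9.0, 1.0), Dual(10.0, 0.0)
print(f(x1, x2).dot)   # e^9 * sin^2(10), the derivative with respect to x1

Notice that obtaining the derivative with respect to x2 requires a second pass with x2 seeded instead, which is exactly why forward-mode scales with the number of inputs.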
However, if we take a step back with a graph-theoretic lens, we can see that the derivative is a sum over paths,
where each path is a product of local derivatives from the input source to the output sink.
From a combinatorics perspective, we are calculating all the possible ways (the ors across paths, the ands along a path)
in which the inputs perturb the output. That is, $\frac{\partial\, \text{out}}{\partial\, \text{in}} = \sum_{\text{paths } p\,:\,\text{in} \to \text{out}} \; \prod_{(u \to v) \in p} \frac{\partial v}{\partial u}$,
and as long as the operations along these paths are associative — $(ab)c = a(bc)$ — then we can choose the order in which we perform these path products to minimize the number of operations. Finding the optimal ordering is an NP-hard problem because ____. For instance, if the expression graph is diamond-shaped, evaluating the derivative with forward-mode for the left half and reverse-mode for the right half could be more performant. In practice, we use reverse-mode as a heuristic, since most of the functions that are differentiated (so they can be optimized) in the field of machine learning are neural networks composed with scalar loss functions of the form $L: \mathbb{R}^n \to \mathbb{R}$.
How can we generalize this into an algorithm?
All we need are 1. mappings from each primitive Op to its local derivative rule and 2. a topological sort of the expression graph
For the derivative rules, in the same way that optimizing compilers implement an optimization "manually" once so it can be reused many times, the authors of deep learning frameworks implement derivatives manually which then get reused many times through automatic differentiation. In theory, we could differentiate any expression with only a few derivative rules for addition and multiplication, but in practice most frameworks provide sugar for more complex derivatives.
For the topological sort, we can simply reverse the ordering produced by a depth-first search:
def toposort(self):
    order: list[Op] = []
    visited: set[Op] = set()
    def dfs(node: Op) -> None:
        if node in visited: return
        visited.add(node)
        for src in node.src: dfs(src)
        order.append(node)
    dfs(self)
    return order

class Tensor():
    def backward(self):
        topo = self.toposort()       # sort the expression graph that ends at this tensor
        for t in reversed(topo):     # then backpropagate from outputs back to inputs
            t.backward()
We will now use this idea to modify the interpretation of our deep learning framework to evaluate not only $f(x)$ but $f'(x)$ as well. This is done by dynamically overloading the operators at runtime1 to trace the expression graph:
chain_rules = PatternMatcher([
    (Pattern(OpCode.MATMUL, name="input"), lambda output_grad, input: (_____,)),
    (Pattern(OpCode.MATVEC, name="input"), lambda output_grad, input: (_____,)),
    (Pattern(OpCode.RECIPROCAL, name="input"), lambda output_grad, input: (-output_grad * input * input,)),
    (Pattern(OpCode.SIN, name="input"), lambda output_grad, input: ((math.pi/2 - input.src[0]).sin() * output_grad,)),
    (Pattern(OpCode.LOG2, name="input"), lambda output_grad, input: (output_grad / (input.src[0] * math.log(2)),)),
    (Pattern(OpCode.EXP2, name="input"), lambda output_grad, input: (input * output_grad * math.log(2),)),
    (Pattern(OpCode.SQRT, name="input"), lambda output_grad, input: (output_grad / (input*2),)),
    (Pattern(OpCode.ADD), lambda output_grad: (1.0*output_grad, 1.0*output_grad)),
    (Pattern(OpCode.MUL, name="input"), lambda output_grad, input: (input.src[1]*output_grad, input.src[0]*output_grad)),
])
class Tensor:
    def _forward(self, f:Callable, *other:Tensor) -> Tensor: #extra_args=(), **kwargs)
        out_tensor = evaluator.eval_uop([self, other], out_uop)

    def backward(self, grad:Tensor|None=None) -> Tensor:
        """
        backward performs by collecting tensors, computing gradients with automatic differentiation, and updating said tensors.
        """
        # 1. collect all tensors that require grad by topologically sorting the graph of uops and filtering
        all_uops = self.uop.toposort()
        tensors_require_grad: list[Tensor] = [t for tref in all_tensors if (t:=tref()) is not None and t.uop in all_uops and t.requires_grad]
        uops_require_grad = [t.uop for t in tensors_require_grad]
        assert grad is not None or self.shape == tuple(), "when no gradient is provided, backward must be called on a scalar tensor"
        if not (self.is_floating_point() and all(t.is_floating_point() for t in tensors_require_grad)): raise RuntimeError("only float Tensors have gradient")

        # 2. compute the gradient with a map of tensors to partials
        if grad is None: grad = Tensor(1.0, dtype=self.dtype, device=self.device, requires_grad=False) # base case is 1.0
        tens2grads = Tensor._automatically_differentiate(self.uop, grad.uop, set(uops_require_grad)) # skipping materializing zeroed grads for now
        grads = [Tensor(g, device=t.device) for t,g in zip(tens2grads.keys(), tens2grads.values())] # initialize tensor grads on device

        # 3. update the tensors that require grad with the gradient's partials
        for t,g in zip(tensors_require_grad, grads):
            assert g.shape == t.shape, f"grad shape must match tensor shape, {g.shape!r} != {t.shape!r}"
            t.grad = g if t.grad is None else (t.grad + g) # accumulate if t.grad exists
        return self

    @staticmethod
    def _automatically_differentiate(root:Op, root_grad:Op, targets:set[Op]) -> dict[Op, Op]:
        """
        _automatically_differentiate backpropagates partials over a topologically sorted expression graph with the chain rule
        and produces the gradient in the form of a map of ops to their partials (which, in turn, are ops)
        """
        tens2grads = {root: root_grad}
        # 1. topological sort
        in_target_path: dict[Op, bool] = {}
        for u in root.toposort(): in_target_path[u] = any(x in targets or in_target_path[x] for x in u.src)
        dfs = list(root.toposort()) # lambda node: node.op not in {OpCode.DETACH, OpCode.ASSIGN} and in_target_path[node])) # don't flow through DETACH/ASSIGN or anything not in target path
        # 2. backpropagation with the chain rule
        for tensor in reversed(dfs):
            if tensor not in tens2grads: continue
            local_grads: tuple[Op|None, ...]|None = cast(tuple[Op, ...]|None, chain_rules.rewrite(tensor, ctx=tens2grads[tensor]))
            if local_grads is None: raise RuntimeError(f"failed to compute gradient for {tensor.op}\n\nin {str(tensor)[0:1000]}...")
            assert len(local_grads) == len(tensor.src), f"got {len(local_grads)} gradients, expected {len(tensor.src)}"
            for src,local_grad in zip(tensor.src, local_grads): # merge multiple paths to the same op here; Tensor.backward() above accumulates into t.grad
                if local_grad is None: continue
                if src in tens2grads: tens2grads[src] = tens2grads[src] + local_grad # accumulate if an entry exists
                else: tens2grads[src] = local_grad # o/w initialize
        return tens2grads
To implement automatic differentiation with Tensor.backward(),
there is a design decision to be made
— the choice of implementing it dynamically at runtime or just-in-time2,
similar to the decision of how to implement types for general programming languages3.
Picograd traces dynamically; this stands in contrast to the alternative of performing a just-in-time, source-to-source transformation.
Let's now move on to automatically differentiating the functions of neural networks, specifically the FFN language model from earlier. (johnson/ryan adams ordering) n^2 vs n^3
2.2.2 Stochastic Gradient Descent via optim.sgd
2.2.3 Adam via optim.adam
2.3 High Dimensional Linear Transformations on GPU
2.3.1 Throughput-Oriented Processor Architecture
Figure x: https://modal.com/gpu-glossary
The above is really a continuum. For instance, the CPUs we saw in the last part invest some chip area in multiple cores, but not many of them.
4 optimizations for throughput (are you willing to change the PDP11 SISD programming model?):
- Simplify the hardware by removing caches, branch predictors, and memory prefetchers (many simple cores rather than instruction-level parallelism); use the savings to place more cores (MIMD)
- Amortize control overhead (fetch/decode) with SIMD (a single control unit; registers turn into vector registers)
- SIMD is explicit in the CPU programming models (AVX/NEON)
- SIMD is implicit in GPU programming models
the fundamental tradeoff is that if you have 1,000,000 ALUs then, in the worst case, divergence (lower MFU) degrades throughput to 1/nth of peak -> at some point, you start scaling cores (control and data) instead of ALUs (data) -> common SIMD widths are 8-64
- Latency hiding (multiplex tasks on a single physical core): hide latency with concurrent/multiplexed threads, not speculation -> thread context storage -> not true for dnn workloads?? https://www2.eecs.berkeley.edu/Pubs/TechRpts/2016/EECS-2016-143.pdf fundamental tradeoff: per-thread state vs latency
- Amortize instruction overheads with more complex instructions (AES, encode/decode, GEMM(8x8))
- amortizes instruction control overhead (like SIMD): more ALUs per instruction
- amortizes the overhead of reads/writes to the register file (feed operands from one ALU to another without a big general-purpose register file) -> "ASIC inside an instruction"
Figure x: https://modal.com/gpu-glossary
one GPU has 144 SMs (H100); one SM is a cluster of 4 cores (quadrants). CPU: a thread of vector lanes; GPU: a warp of CUDA threads. NVIDIA advertising calls these "cores", a misnomer, because they don't have separate control: CUDA "cores" and tensor "cores" (not a core; it's ___)
144 SMs × 4 cores × 32 exec units = Y exec units.
(one quadrant/warp-scheduler can execute 1 warp instruction of 32 threads per clock) (no OOO, no superscalar: max ILP of 1 in a given cycle, but it can run 1 operation over 32 vector lanes (ALUs) or 1 complex operation such as a block 8x8 matmul-accumulate)
Figure x: https://modal.com/gpu-glossary
2.3.2 Throughput-Oriented Memory Hierarchies
Hazy Research
Hazy Research
the higher-order bit from part 1: the fundamental tradeoff between fast/close-but-small and slow/far-but-large
tradeoff: capacity (amount) <-> latency (speed) <-> throughput (bw). fundamental constraint in hardware (energy): large memories are inherently farther away on average; larger memories imply longer wires; longer wires imply longer latency and lower bandwidth. fundamental design: a narrow interface to compensate for longer wires???
Programming a Naive GEMM
2.3.2 From CUDA C to PTX with NSight Compute
2.3.3 Accelerating GEMM with Better Data Reuse: Coalescing, Caching, Tiling
2.3.4 Accelerating GEMM with Better Scheduling:
2.3.5 Accelerating GEMM with Tensor Cores:
2.3.6 Productively Programming Tensor Cores: ThunderKittens, CUTLASS
2.4 Learning Sequences with Different Inductive Biases
2.4.1 Convolutional Neural Network (CNN)
2.4.2 Recurrent Neural Networks (RNN)
2.4.3 Long Short-Term Memory Networks (LSTM)
2.4.4 Bidirectional Encoder Representations from Transformers (BERT)
2.4.5 Generative Pretrained Transformers (GPT)
2.4.6 Diffusion
2.5 Accelerating nanogpt
In fact, symbolic differentiation also has the limitation that it cannot differentiate numerical functions embedded in a Turing-complete programming language with control flow, i.e., loops, branches, and recursion.
The slogan of automatic differentiation is: the accuracy of symbolic differentiation with the performance of numerical differentiation.
There is another comparison to be made with traditional systems: in the same way that all data operations — functions — of all programs can be reduced or desugared down to the elementary arithmetic operations found in the arithmetic logic units of virtual or physical interpreters, so too can derivatives. That is, theoretically we could derive the derivative of all neural networks with only the derivatives of addition and multiplication. But for convenience, we'll provide sugar for more complex derivatives.
just-in-time, and not statically, since deep learning frameworks such as torch and jax are embedded domain-specific languages
for instance, type errors happen at compile-time with Rust's rustc language implementation, and at run-time with Python's cpython implementation.
Afterword
To continue deepening your knowledge, the following courses are also a good next step. You might find this book complementary to your reading, since all three streams were woven into a single presentation for each concept presented throughout the book. Once you feel comfortable, you should graduate towards contributing to larger machine learning systems.
Good luck on your journey. I'll see you at work.
Tensor Programming
- Cambridge: Information Theory, Pattern Recognition and Neural Networks by David Mackay
- Tubingen ML4202: Probabilistic Machine Learning by Philipp Hennig
- Stanford CS124: From Languages to Information by Dan Jurafsky
- Stanford CS229: Machine Learning by Andrew Ng
- Stanford CS230: Deep Learning by Andrew Ng
- Stanford CS224N: NLP with Deep Learning by Christopher Manning
- Eureka LLM101N: Neural Networks Zero to Hero by Andrej Karpathy
- Stanford CS336: Language Modeling from Scratch by Percy Liang
Tensor Interpretation
- UPenn STAT 4830: Numerical Optimization for Machine Learning by Damek Davis
- MIT 18.S096: Matrix Calculus by Alan Edelman and Steven Johnson
- MIT 6.172: Performance Engineering by Charles Leiserson and Julian Shun
- MIT 6.S894: Accelerated Computing by Jonathan Ragan-Kelley
- Berkeley CS267: Applications of Parallel Computers by Kathy Yelick
- UIUC ECE408: Programming Massively Parallel Processors by Wen-mei Hwu
- Stanford CS149: Parallel Computing by Kayvon Fatahalian
- Stanford CS217: Hardware Accelerators for Machine Learning by Ardavan Pedram and Kunle Olukotun
- Carnegie Mellon 18-447: Computer Architecture by Onur Mutlu
- Carnegie Mellon 18-742: Parallel Computer Architecture by Onur Mutlu
- ETH 227: Programming Heterogeneous Computing Systems with GPUs by Onur Mutlu
Tensor Compilation
- Google DeepMind: How to Scale Your Model: A Systems View of LLMs on TPUs
- HuggingFace: Ultra-Scale Playbook: Training LLMs on GPU Clusters
- Berkeley CS265: Compiler Optimization by Max Willsey
- Cornell CS4120: Compilers by Andrew Myers
- Cornell CS6120: Advanced Compilers by Adrian Sampson
- Cornell CS4787: Principles of Large-Scale Machine Learning by Chris De Sa
- Cornell CS6787: Advanced Machine Learning Systems by Chris De Sa
- Carnegie Mellon 15-411: Compiler Design by Frank Pfenning
- Carnegie Mellon 15-745: Optimizing Compilers by Phil Gibbons
- Rice COMP412: Compiler Construction by Keith Cooper
- Rice COMP512: Advanced Compiler Construction by Keith Cooper
Bibliography
- Anthropic. n.d. “Transformer Circuits Thread.”
- Austin et al., "How to Scale Your Model", Google DeepMind, online, 2025.
- Ansel, Jason, Edward Yang, Horace He, Natalia Gimelshein, Animesh Jain, Michael Voznesensky, Bin Bao, et al. 2024. “PyTorch 2: Faster Machine Learning through Dynamic Python Bytecode Transformation and Graph Compilation.” ACM, April.
- Abelson, Harold. 1996. Structure and Interpretation of Computer Programs, Second Edition. MIT Press.
- Aho, Alfred V, Monica S Lam, Ravi Sethi, and Jeffrey D Ullman. 2015. Compilers: Principles, Techniques, & Tools. Pearson.
- Baydin, Atilim Gunes, Barak A Pearlmutter, Radul, Alexey Andreyevich, and Jeffrey Mark Siskind. 2015. “Automatic Differentiation in Machine Learning: A Survey.” ArXiv.org
- Bright, Paige, Alan Edelman, and Steven G Johnson. 2025. “Matrix Calculus (for Machine Learning and Beyond).” ArXiv.org
- Bryant, Randal E, and David R O’hallaron. 2016. Computer Systems: A Programmer’s Perspective. Boston: Pearson Education.
- Blondel, Mathieu, and Vincent Roulet. 2024. “The Elements of Differentiable Programming.” ArXiv.org
- Boehm, Simon. 2022. “How to Optimize a CUDA Matmul Kernel for CuBLAS-like Performance: A Worklog.” Siboehm.com
- Chen, Tianqi, Thierry Moreau, Ziheng Jiang, Lianmin Zheng, Eddie Yan, Meghan Cowan, Haichen Shen, et al. 2018. “TVM: An Automated End-To-End Optimizing Compiler for Deep Learning.” ArXiv, February.
- Cho, Kyunghyun. 2025. “Machine Learning: A Lecture Note.” ArXiv.org
- Cooper, Keith D, and Linda Torczon. 2022. Engineering a Compiler. Morgan Kaufmann.
- Cormen, Thomas H, Charles Eric Leiserson, Ronald L Rivest, and Clifford Stein. 2009. Introduction to Algorithms. MIT Press.
- Goodfellow, Ian, Yoshua Bengio, and Aaron Courville. 2016. Deep Learning. Cambridge, Massachusetts: The MIT Press.
- Griewank, Andreas, and Andrea Walther. 2009. Evaluating Derivatives: Principles and Techniques of Algorithmic Differentiation. Philadelphia, Pa.: Society For Industrial & Applied Mathematics ; Cambridge.
- Hack, Sebastian. 2007. Register Allocation for Programs in SSA Form.
- Harris, Charles R., K. Jarrod Millman, Stéfan J. van der Walt, Ralf Gommers, Pauli Virtanen, David Cournapeau, Eric Wieser, et al. 2020. “Array Programming with NumPy.” Nature 585 (7825): 357–62.
- Harris, Sarah. 2021. Digital Design and Computer Architecture: RISC-V Edition. S.L.: Morgan Kaufmann Publisher.
- Hennessy, John L, and David A Patterson. 2025. Computer Architecture: A Quantitative Approach. Cambridge, Ma: Morgan Kaufmann.
- Hwu, Wen-Mei W, David B. Kirk, and Izzat El Hajj. 2022. Programming Massively Parallel Processors: A Hands-on Approach. S.L.: Morgan Kaufmann.
- Jurafsky, Dan , and James H. Martin. 2025. “Speech and Language Processing.” Stanford.edu. 2025.
- Kang, Wanmo, and Kyunghyun Cho. 2025. “Linear Algebra for Data Science.” 2025.
- Klein, Philip N. 2013. Coding the Matrix: Linear Algebra through Applications to Computer Science. Newton, Mass: Newtonian Press.
- Krishnamurthi, Shriram. 2025. “Programming Languages: Application and Interpretation.” 2025.
- Mackay, David J C. 2003. Information Theory, Inference, and Learning Algorithms. Cambridge: Cambridge University Press.
- Møller, Anders, and Michael I Schwartzbach. 2024. “Static Program Analysis.” Cs.au.dk. 2024.
- Murphy, Kevin P. 2023. Probabilistic Machine Learning: Advanced Concepts. MIT Press.
- Murphy, Kevin P. 2022. Probabilistic Machine Learning: An Introduction. Cambridge: MIT Press.
- Ng, Andrew, and Tengyu Ma. 2023. CS229 Lecture Notes.
- Paszke, Adam, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, et al. 2019. “PyTorch: An Imperative Style, High-Performance Deep Learning Library.” ArXiv.org
- Ragan-Kelley, Jonathan, Connelly Barnes, Andrew Adams, Sylvain Paris, Frédo Durand, and Saman Amarasinghe. 2013. “Halide.” Proceedings of the 34th ACM SIGPLAN Conference on Programming Language Design and Implementation.
- Rastello, Fabrice, and Florent Bouchez Tichadou. 2022. SSA-Based Compiler Design. Springer Nature.
- Shankhdhar, Pranjal. 2025. “Outperforming CuBLAS on H100: A Worklog.” 2025.
- Suhan, Alex, Davide Libenzi, Ailing Zhang, Parker Schuh, Brennan Saeta, Jie Young Sohn, and Denys Shabalin. 2021. “LazyTensor: Combining Eager Execution with Domain-Specific Compilers.” ArXiv.org
- Spector, Benjamin F, Simran Arora, Aaryan Singhal, Daniel Y Fu, and Christopher Ré. 2024. “ThunderKittens: Simple, Fast, and Adorable AI Kernels.” ArXiv.org
- Stepanov, Alexander A, and Daniel E Rose. 2015. From Mathematics to Generic Programming. Upper Saddle River, Nj: Addison-Wesley.
- Tarjan, Robert E. 1988. Data Structures and Network Algorithms. Philadelphia: Society For Industrial And Applied Mathematics.
- Tazi et al., "The Ultra-Scale Playbook: Training LLMs on GPU Clusters", 2025.
- Trefethen, Lloyd N, and David Bau. 1997. Numerical Linear Algebra. SIAM.
- Uwe Naumann. 2012. The Art of Differentiating Computer Programs: An Introduction to Algorithmic Differentiation. Philadelphia: Society For Industrial And Applied Mathematics.