Singularity Systems: Zero to Hero
Dragon Book 2.0: The Hacker's Guide to Tensor Programs and Compilers
Singularity Systems: Zero to Hero is a SystemsML course designed for the age of heterogeneous compute that powers software 2.0. This textbook spans the entire semantic chasm from the transistor to the tensor by building 4 compilers (yes, four) from scratch, starting from the bottom up and ending with a stack of programming languages capable of compiling the llama family of models.
Roadmap
v1.0: pretraining llama
v1.5: distributed pretraining llama
v2.0: post-training (r1) and inference systems (vllm/sglang)
Contributing
The textbook and its 4 compilers are all open source.
For discussion come join us at the #singularity-systems
work group in the GPU Mode Discord.
Citation
@article{j4orz2025singsys,
author = "j4orz",
title = "Singularity Systems: Zero to Hero",
year = "2025",
url = "https://j4orz.ai/zero-to-hero/"
}
Dedication
I dedicate this textbook to my father, Thomas Zhang.
In 2023, his wisdom guided me as my plans to pursue a PhD changed to leading my sister's recovery program after her schizophrenia diagnosis. Not long after, he showed an even deeper courage of his own, meeting a severe stroke with strength and honor.
Singularity Systems Overview
Contents
Singularity Overview — Software 2.0
Systems Overview — "Golden Age" Infrastructure Buildout
Implementing 4 compilers might be intimidating. But in the same way artists paint over and over, and mathematicians rederive over and over, language implementors should represent over and over.
,--. ,--.
((O ))--((O ))
,'_`--'____`--'_`.
_|---------------------------------|_ to the tensor
| |Tensor Compiler: Torch -> Triton| | ^
| |Tiling Compiler: Triton -> PTX | | ^
| |Vector Compiler: CUDA -> PTX | | ^
| |Scalar Compiler: C -> RISC-V| | ^
| |---------------------------------| | passing assembly
| |:::::::::::::::::::::::::::::::::| | ^
| |::::::::::::::µarch::::::::::::::| | ^
|_|:::::::::::::::::::::::::::::::::|_| ^
|---------------------------------| from the transistor
__..-' `-..__
.-| : .----------------. : |-.
,\ || | |\______________/| | || /.
/`.\:| | || __ __ __ || | |;/,'\
:`-._\;.| || '--''--''--' || |,:/_.-':
| : | || .----------. || | : |
| | | || '----SSt---' || | | |
| | | || _ _ _ || | | |
:,--.; | || (_) (_) (_) || | :,--.;
(`-'|) | ||______________|| | (|`-')
`--' | |/______________\| | `--'
|____________________|
`.________________,'
(_______)(_______)
(_______)(_______)
(_______)(_______)
(_______)(_______)
| || |
'--------''--------'
Course Information
Singularity Systems: Zero to Hero follows up from Neural Networks: Zero to Hero. We convert micrograd into picograd.
While micrograd helps research scientists to understand the leaky abstraction of backpropagation, picograd is intended for systems programmers and performance engineers looking to further understand the compilers and chips of deep learning.
Try it out with:
pip install picograd
Prerequisites
- solid deep learning (llama)
- solid systems programming (C || C++ || Rust)
Syllabus
Core: Deep Learning Compilers
- dfdx(nd): implements an interpreter for neural networks (HIPS/autograd)
- brrr: accelerates the interpreter with vector processing (pytorch1)
- pt2: constructs a compiler for neural networks (pytorch2)
- az1use1: 3d parallelism
Throughout the past decade, modern AI infrastructure has rapidly evolved to meet the needs of deep neural networks — most notably with the throughput performance of GPUs moving from TFLOPS to PFLOPS. Datacenter computing now has the goal of machines with EFLOP speeds, now that the throughput of the fastest (non-distributed) supercomputers on TOP500 LINPACK workloads is just reaching EFLOP levels.
Although the brain is an existence proof of physics powering 20 PFLOP machines with 20 W, the problem with the semiconductor physics of today is two-fold:
- instruction-level parallelism from out-of-order superscalar pipelines hits diminishing returns
- frequency scaling is hitting against Dennard scaling's power wall
and so the free-single-thread-performance-lunch aspect of Moore's law that transitioned us across computer classes, from minis to micros and from micros to mobile, is "over".
As a result, computer architects are moving from homogeneous general-purpose hardware to heterogeneous specialized hardware, which means that the complexity of extracting program performance leaks upwards from the hardware — these days, to unlock the full performance of hardware, it's the programmer's responsibility to program the vector processors in multi-core/many-core machines.
The problem with the vector processing of multi-core/many-core machines is two-fold:
- programming model: compilers were never sufficiently smart to autovectorize reliably
- execution model: program speedups were bound by Amdahl's law
But the industry sidestepped these problems by changing the programming model to SIMT on SIMD (CUDA) and finding domains whose execution models had more parallelism (deep neural networks).
The challenge that compiler engineers and chip architects face (producing a golden age) is to find the optimal mapping from intelligence to energy. This means creating new programming languages and machines while minimizing the accidental complexity that naturally builds up along the way:
- The Golden Age of Compiler Design (Lattner)
- A New Golden Age for Computer Architecture (Hennessy and Patterson)
Singularity Systems
The Singularity Systems: Zero to Hero course follows up where Neural Networks: Zero to Hero left off: we will convert micrograd into picograd, where the main difference is that:
- micrograd is a backprop engine with scalar support, helping researchers understand that backpropagation as an abstraction is in fact leaky (activations, gradients, normalizations)
- picograd leans closer towards modern-day deep learning frameworks with tensor support for both pytorch1 interpretation and pytorch2 compilation.
While picograd is more oriented towards low-level systems programmers and performance engineers, the framework remains pedagogical and remains a point-wise compiler. This means that we will "only" support:
- 1 model: llama
- 1 programming model: eager
- 2 execution models: eager, graph
- 2 hardware architectures: amd cpu, nvidia gpu
- 2 precisions: fp32, tf32
Welcome to the golden age of Systems ML!
8. Tensor Compilers
This chapter implements an interpreter for neural networks. By the end of this chapter, you will have a working implementation of the multidimensional array abstraction pioneered by NumPy, and the autodifferentiation pioneered by HIPS autograd.
Contents
- nd: forward pass with loss = model(x)
- dfdx(nd): backward pass with loss.backward(), opt.step()
- ffn to gpt: lstm, rnn, gpt
- beyond nanogpt: attention variants, kv cache, speculative decoding
References
Part 1 — nd: forward pass with loss = model(x)
The following is the model definition and inference loop for an FFN language model following (Bengio et al. 2003).
The API is a 1-1 match with PyTorch — take a second to convince yourself by replacing import picograd with import torch and sampling from the inference loop.
"""
Dimension key:
B: batch size
T: sequence length
V: vocabulary size
E: embedding dimension (E != D)
D: model dimension
"""
import picograd
import picograd.nn.functional as F  # assumed: picograd mirrors torch.nn.functional for softmax
# from jaxtyping import
# *********************MODEL*********************
B, T = 32, 3
V, E, D = 27, 10, 200
class Linear:
    def __init__(self, D_in, D_out, bias=True):
        self.W_DiDo = picograd.randn((D_in, D_out)) * 0.01
        self.b_Do = picograd.zeros(D_out) if bias else None

    def __call__(self, X_Di):
        self.X_Do = X_Di @ self.W_DiDo
        if self.b_Do is not None: self.X_Do += self.b_Do
        self.out = self.X_Do
        return self.X_Do

    def parameters(self):
        return [self.W_DiDo] + ([] if self.b_Do is None else [self.b_Do])

class Tanh:
    def __call__(self, X_BD):
        self.X_BD = picograd.tanh(X_BD)
        self.out = self.X_BD
        return self.X_BD

    def parameters(self):
        return []
model = [
Linear(T * E, D, bias=False), Tanh(),
Linear(D, D, bias=False), Tanh(),
Linear(D, V, bias=False)
]
C_VE = picograd.randn((V,E)) #, generator=g)
params = [C_VE] + [p for l in model for p in l.parameters()]
for p in params:
    p.requires_grad = True
print("model loaded to cpu")
# *********************INFERENCE LOOP*********************
decode = {i: s for i, s in enumerate(['.'] + list('abcdefghijklmnopqrstuvwxyz'))}  # assumed vocab (V=27): '.' boundary token + 'a'..'z'
for _ in range(20): # 20 samples
    output, context = [], [0] * T
    while True:
        X_1T = picograd.tensor([context]) # B=1 for inference, T=3, in [0..27] (set to 0 for init)
        X_1TE = C_VE[X_1T] # using 0..27 as indices into C_VE for each B=1 example of context length T
        X_1cTE = X_1TE.view(-1, T*E) # B=1, TE
        X = X_1cTE
        for h in model:
            X = h(X)
        y_hat = F.softmax(X, dim=1)

        # sample and autoregressively update context
        token = picograd.multinomial(y_hat, num_samples=1, replacement=True).item() #, generator=g).item()
        context = context[1:] + [token]
        output.append(decode[token])
        if token == 0:
            break
    print(''.join(output))
At the end of part 1 we will be able to autoregressively sample from the
untrained model with the inference loop.
Contents
- Prediction, Posteriors, Parameters
- Pre-training: llama2, Post-training: r1
PREDICTION
- dnns
- theory
- posteriors: information-theoretic loss functions (cross entropy, relative entropy, compression)
- params: MLE from statistics; parameter optimization is the argmin (optimization) of the gradient (matrix calculus)
A.1 Prediction
The primary goal of supervised machine learning is to leverage patterns from stochastic phenomena to predict quantities of interest without enumerating an entire population. That is, to recover the underlying distribution from a random sample. This is often the case with machine learning, where given a dataset $\mathcal{D} = \{(x^{(i)}, y^{(i)})\}_{i=1}^{n}$ with $(x^{(i)}, y^{(i)}) \overset{iid}{\sim} p$, the distribution $p$ needs to be recovered. The assumption of independence is statistics' questionably axiomatic counterpart to probability's assumption of totality. While the assumption is philosophically and theoretically interesting to question with alternative probabilistic frameworks and statistical modeling, the iid assumption is a useful one to make progress in modeling. Returning to the problem setup, when $\mathcal{Y}$ is discrete the task is classification, and when $\mathcal{Y} = \mathbb{R}$ the task is regression.
There are a variety of classes of functions that can be used to approximate $f: \mathcal{X} \to \mathcal{Y}$, but the primary distinction is between parametric and non-parametric models. The former have a fixed number of parameters with respect to the size of the dataset, where differentiation is used to compute gradients in order to optimize loss functions. This stands in contrast to the latter, which have a variable number of parameters, where integration is used to compute expectations in order to update posteriors. Before the start of the deep learning "revolution" in 2012 with the ImageNet "moment", the parametric/non-parametric distinction was more well known. There were primarily three "schools of thought":
- the "kernelists": proponents of support vector machines (SVMs)
- the "bayesians": proponents of gaussian processes (GPs)
- the "connectionists": proponents of neural networks (NNs)
where SVMs and GPs are non-parametric, and NNs are parametric. This chapter will cover all three, highlighting their differences and similarities, and — as usual in the artificial sciences — how this taxonomization is in fact falsely trichotomous. That is, by the end of this chapter the reader should understand how SVMs/GPs are effectively shallow networks whereas NNs are deep networks, and how the former are much better understood theoretically while the latter are more successful experimentally.
todo: murphy's global/local latent distinction (with cho EBMs?)
1.1 Non-Parametric Inference with Posterior Updates
Gaussian Processes
Bayesian Logistic Regression
Bayesian Neural Networks
1.2 Parametric Inference with Parameter Estimation
Generalized Linear Models (GLMs): Linear and Logistic Regression
Linear Regression
Recall that the supervised learning problem is considered regression when $\mathcal{Y} = \mathbb{R}$ (running example: a Spotify dataset).
The linear regression model recovers an $\hat{f}$ that assumes $y$ is affine with respect to $x$. That is,

$$y = \theta^\top x + \epsilon$$

where the error $\epsilon$ is unbiased ($\mathbb{E}[\epsilon] = 0$) and captures both model uncertainty (unmodelled features in $x$) and data (measurement) uncertainty; the loss is then the likelihood.
With the model now defined, the parameter $\theta$ needs to be estimated from the data $\mathcal{D}$. This is done by using the negative log likelihood as the loss function to minimize, where the Gaussian likelihood is fixed with respect to the data and known variance $\sigma^2$:

$$\hat{\theta} = \arg\min_\theta \, -\sum_{i=1}^{n} \log p(y^{(i)} \mid x^{(i)}; \theta) = \arg\min_\theta \, \frac{1}{2\sigma^2} \sum_{i=1}^{n} (y^{(i)} - \theta^\top x^{(i)})^2$$

where the $\arg\min$ is implemented by first evaluating the gradient $\nabla_\theta \mathcal{L}$ and then iteratively applying gradient descent for each time step $t$: $\theta^{(t+1)} := \theta^{(t)} - \alpha \nabla_\theta \mathcal{L}(\theta^{(t)})$.
First, evaluating the gradient for a single parameter $\theta_j$ (dropping the constant $\frac{1}{\sigma^2}$) looks like:

$$\frac{\partial \mathcal{L}}{\partial \theta_j} = \sum_{i=1}^{n} (\theta^\top x^{(i)} - y^{(i)}) \, x_j^{(i)}$$

And so swapping indices for the entire gradient gives $\nabla_\theta \mathcal{L} = X^\top (X\theta - y)$. Recall now that the second step in implementing $\arg\min$ after taking $\nabla_\theta \mathcal{L}$ is to iteratively apply gradient descent for each time step $t$: $\theta^{(t+1)} := \theta^{(t)} - \alpha \nabla_\theta \mathcal{L}(\theta^{(t)})$.
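To make the two-step recipe concrete, the following is a minimal numpy sketch on synthetic data (the shapes, names, and learning rate are illustrative assumptions, not picograd API):

```python
import numpy as np

# synthetic regression data: y = X @ theta_true + unbiased noise e, E[e] = 0
rng = np.random.default_rng(0)
N, D = 100, 3
X_ND = rng.normal(size=(N, D))
theta_true_D = np.array([2.0, -1.0, 0.5])
y_N = X_ND @ theta_true_D + 0.01 * rng.normal(size=N)

theta_D = np.zeros(D)
alpha = 0.1
for t in range(500):
    grad_D = X_ND.T @ (X_ND @ theta_D - y_N) / N  # step 1: evaluate grad = X^T(X theta - y)
    theta_D -= alpha * grad_D                     # step 2: theta^(t+1) := theta^t - alpha * grad

print(theta_D)  # ~[2.0, -1.0, 0.5]
```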
Logistic Regression
Recall that the supervised learning problem is considered classification when $\mathcal{Y}$ is discrete. The simple case when the output space is $\mathcal{Y} = \{0, 1\}$ is referred to as binary classification, and the case when the output space is $\mathcal{Y} = \{1, \ldots, K\}$ is referred to as multi-class classification.
Many predictive values of interest that can be posed as answers to yes/no questions are suitable to be modeled with binary classifiers. For instance: benign or malignant in health care, innocent or guilty in law, real or spam in engineering. The running example used for the next two models will be the domain of music — classifying a set of rap lyrics as a praise or a diss. The set of lyrics of particular interest is Kendrick's featured verse on Kanye's "No More Parties in LA":
Hey baby you forgot your Ray Bans
And my sheets still orange from your spray tan
It was more than soft porn for the K-man
She remember my Sprinter, said "I was in the grape van"
Uhm, well cutie, I like your bougie booty
Come Erykah Badu me, well, let's make a movie
Hell, you know my repertoire is like a wrestler
I show you the ropes, connect the dots
A country girl that love Hollywood
Mama used to cook red beans and rice
Now it's Denny's, 4 in the morning, spoil your appetite
Liquor pouring and niggas swarming your section with erection
Smoke in every direction, middle finger pedestrians
R&B singers and lesbians, rappers and managers
Music and iPhone cameras
This shit unanimous for you, it's damaging for you, I think
That pussy should only be holding exclusive rights to me, I mean
He flew you in this motherfucker on first class
Even went out his way so you could check in an extra bag
Now you wanna divide the yam like it equate the math?
That shit don't add up, you're making him mad as fuck
She said she came out here to find an A-list rapper
I said baby, spin that round and say the alphabet backwards
You're dealing with malpractice, don't kill a good nigga's confidence
Just cause he a nerd and you don't know what a condom is
The head still good though, the head still good though
Make me say "Nam Myoho Renge Kyo"
Make a nigga say big words and act lyrical
Make me get spiritual
Make me believe in miracles, Buddhist monks and Cap'n Crunch cereal
Lord have mercy, thou will not hurt me
Five buddies all herded up on a Thursday
Bottle service, head service, I came in first place
The opportunity, the proper top of breast and booty cheek
The pop community, I mean these bitches come with union fee
And I want two of these, moving units through consumer streets
Then my shoe released, she was kicking in gratuity
And yeah G, I was all for it
She said K Lamar, you kind of dumb to be a poet
I'mma put you on game for the lames that don't know they're a rookie
Instagram is the best way to promote some pussy
The sentiment of "No More Parties in LA" is extremely difficult for a machine to classify correctly due to Kendrick's wordplay, double entendres, and insinuations. Consider the positive/negative dictionaries where positive={baby, cutie, love, good, spiritual, miracles, buddhist, buddies, opportunity, promote} and negative={shit, spoil, damaging, motherfucker, mad, fuck, malpractice, kill, mercy, hurt, fee, dumb}. Then a possible feature extractor $\phi: \mathcal{X} \to \mathbb{R}^3$ which maps raw input to a dimensioned feature representation is defined by:
| Dim | Feature | Value |
|---|---|---|
| 1 | count of positive-dictionary words in $x$ | 10 |
| 2 | count of negative-dictionary words in $x$ | 12 |
| 3 | bias | 1 |
Now that the input space $\mathcal{X}$, feature extractor $\phi$, and output space $\mathcal{Y}$ are defined, the goal is to select some function $h: \mathcal{X} \to \mathcal{Y}$.
todo: use regression. motivate the need for the range to be a p \in [0,1]
Logistic Regression
The logistic regression model is a discriminative model with a decision boundary (in the case of binary classification, the threshold is usually set to $0.5$ so as to divide the probability mass in two) that assumes the logit is affine with respect to $x$. That is,

$$p(y = 1 \mid x; \theta) = \sigma(\theta^\top x)$$

where $\sigma(z) = \frac{1}{1 + e^{-z}}$ is a non-linear squashing function $\mathbb{R} \to (0, 1)$, and where $z = \theta^\top x$ is referred to as the logit since the inverse $\sigma^{-1}(p) = \log \frac{p}{1-p}$ is defined as the log odds ratio.
With the model now defined, the parameter $\theta$ needs to be estimated from the data $\mathcal{D}$. This is done by using the negative log likelihood as the loss function to minimize, which is fixed with respect to the data:

$$\hat{\theta} = \arg\min_\theta \, -\sum_{i=1}^{n} \log p(y^{(i)} \mid x^{(i)}; \theta)$$

(todo: kl -> ce) where the $\arg\min$ is implemented by first evaluating the gradient and then iteratively applying gradient descent for each time step $t$: $\theta^{(t+1)} := \theta^{(t)} - \alpha \nabla_\theta \mathcal{L}(\theta^{(t)})$.
First, to evaluate the gradient, the negative log likelihood as loss function is simplified by defining $\hat{y} = \sigma(\theta^\top x)$ so that $p(y \mid x; \theta) = \hat{y}^{\,y}(1 - \hat{y})^{1-y}$. Note that $\hat{y}$ is not the target label but the probability of the target label. Then, since the derivative is linear and the derivative of the sum is the sum of the derivatives, taking the derivative for a single example for a single parameter $\theta_j$ looks like

$$\frac{\partial}{\partial \theta_j} \left[ -y \log \hat{y} - (1 - y) \log (1 - \hat{y}) \right] = (\hat{y} - y) \, x_j$$

and so evaluating the derivative over all examples looks like

$$\frac{\partial \mathcal{L}}{\partial \theta_j} = \sum_{i=1}^{n} (\hat{y}^{(i)} - y^{(i)}) \, x_j^{(i)}$$

And so swapping indices for the entire gradient gives $\nabla_\theta \mathcal{L} = X^\top (\hat{y} - y)$. Recall now that the second step in implementing $\arg\min$ after taking $\nabla_\theta \mathcal{L}$ is to iteratively apply gradient descent for each time step $t$: $\theta^{(t+1)} := \theta^{(t)} - \alpha \nabla_\theta \mathcal{L}(\theta^{(t)})$.
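The same two-step recipe in a minimal numpy sketch, using the gradient $X^\top(\hat{y} - y)$ derived above (synthetic data; names and hyperparameters are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# synthetic binary classification data
rng = np.random.default_rng(0)
N, D = 200, 3
X_ND = rng.normal(size=(N, D))
theta_true_D = np.array([1.5, -2.0, 0.5])
y_N = (rng.uniform(size=N) < sigmoid(X_ND @ theta_true_D)).astype(float)

theta_D = np.zeros(D)
alpha = 0.1
for t in range(1000):
    y_hat_N = sigmoid(X_ND @ theta_D)      # y_hat: probability of the target label
    grad_D = X_ND.T @ (y_hat_N - y_N) / N  # grad of NLL: X^T(y_hat - y)
    theta_D -= alpha * grad_D              # gradient descent step

print(theta_D)  # approaches theta_true_D as N grows
```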
Multinomial Logistic Regression
GLM: logistic regression is the GLM whose response distribution is Bernoulli and whose link function is the logit.
sigmoid justification
- nice derivative: $\frac{d\sigma}{dz} = \sigma(z)(1 - \sigma(z))$ (which helps with calculus -> optimization -> parameter estimation); spot-checked numerically below
- GLMs justify the choice of the logistic/sigmoid non-linear squashing function $\sigma(z) = \frac{1}{1 + e^{-z}}$.
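A quick numerical spot-check of that derivative identity (values illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z, eps = 0.3, 1e-6
fd = (sigmoid(z + eps) - sigmoid(z)) / eps  # finite-difference approximation
print(fd, sigmoid(z) * (1 - sigmoid(z)))    # both ~0.2445
```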
Deep Neural Networks (DNNs)
| Dim | Feature | Value |
|---|---|---|
| 1 | count of positive-dictionary words in $x$ | 12 |
| 2 | count of negative-dictionary words in $x$ | 11 |
| 3 | bias | 1 |
which a priori seems difficult to learn a useful mapping for, given that the first and second dimensions of the feature representation, which respectively encode positive and negative sentiment, are roughly tied. For instance, what is the feature extractor $\phi$ that maps the following set of lyrics (with subtle insinuations) into a useful feature representation for a sentiment classifier?
A country girl that love Hollywood
Mama used to cook red beans and rice
Now it's Denny's, 4 in the morning, spoil your appetite
Liquor pouring and niggas swarming your section with erection
The short answer is that it's difficult because the negative sentiment becomes apparent only after reading the entire passage, rather than from any specific word. This is the motivation for representation learning.
A.2 Posterior Updates
Probability Theory
By default, mathematical reasoning is understood to be deterministic, where a statement is either true (holds) or false (does not hold), and any variable can only take on one specific value at a time. However, there are times when what's desirable is describing phenomena that are in fact non-deterministic (while still remaining precise). Even if base reality is fundamentally deterministic, in practice there are many scenarios where carrying out the calculation is intractable; e.g., predicting tomorrow's weather with the position of every water molecule.
The most widely adopted mathematical language for formalizing our intuitions around non-deterministic stochastic phenomena is probability theory (as opposed to alternative frameworks such as probabilistic logic or uncertainty quantification). In probability theory, statements are neither true nor false. Rather, truth is distributed across a weighted set of random events. Similarly, random variables and their probability distributions do not take on definite values but rather sets of values. Probability theory is in this sense a generalization of Aristotelian logic.
Probability Space: Measurable Space + Measure
These three concrete implementations create the toolbox which serves as the workhorse for probability theory, and they require an abstract interface. Kolmogorov's 1933 formulation of probability theory in Grundbegriffe der Wahrscheinlichkeitsrechnung is the universally accepted model for random events and random variables. The guiding light in the axiomatization is the first two out of three properties that $\mathbb{P}$ should satisfy: non-negativity and normalization, which formalize impossible (surely false) and guaranteed (surely true) statements. They ensure that probabilities are never less than 0 nor greater than 1.
The foundations start with an ambient sample space $\Omega$, which is the set of all possible outcomes in an experiment. It's natural to expect the probability function $\mathbb{P}$ to be defined on $\Omega$, which is acceptable when the sample space is countable. That is, when $|\Omega| \leq \aleph_0$, the first two properties are possible to satisfy:
- non-negativity: $\mathbb{P}(\omega) \geq 0 \;\; \forall \omega \in \Omega$
- normalization: $\sum_{\omega \in \Omega} \mathbb{P}(\omega) = 1$
but when the sample space is uncountably infinite then normalization is not possible. That is, when $|\Omega| > \aleph_0$, any assignment with uncountably many positive masses diverges. The next approach then is to define probabilities for the powerset $\mathcal{F} = 2^{\Omega}$ with $\mathbb{P}: \mathcal{F} \to [0, 1]$. The powerset is the set of all subsets, and is referred to as the event space. The event space must be a field. That is, it must be closed under union, intersection, and complement, satisfying:
- $\Omega \in \mathcal{F}$
- $A \in \mathcal{F} \implies A^{c} \in \mathcal{F}$
- $A, B \in \mathcal{F} \implies A \cup B \in \mathcal{F}$ (and, by De Morgan, $A \cap B \in \mathcal{F}$)
This formulation of $\mathbb{P}$ is now defined on $\mathcal{F}$ but still does not support sample spaces where $|\Omega| = \aleph_0$, let alone $|\Omega| > \aleph_0$. The property of closure under union and intersection needs to be extended to support sample spaces that are countably and uncountably infinite. Starting with the former, a field that satisfies
- $A_1, A_2, \ldots \in \mathcal{F} \implies \bigcup_{i=1}^{\infty} A_i \in \mathcal{F}$
is then a $\sigma$-field and is defined for $|\Omega| = \aleph_0$. This $\sigma$-field together with the sample space forms a measurable space $(\Omega, \mathcal{F})$. Finally, the original goal that motivated redefining $\mathbb{P}$'s domain from $\Omega$ to $\mathcal{F}$ was to support uncountable infinities where $|\Omega| > \aleph_0$. This motivates the additional properties the event space must satisfy which is... this is a Borel $\sigma$-field.
Returning to the probability function $\mathbb{P}: \mathcal{F} \to [0, 1]$, the original first two properties are satisfied, in addition to a third and final one:
- non-negativity: $\mathbb{P}(A) \geq 0 \;\; \forall A \in \mathcal{F}$
- normalization: $\mathbb{P}(\Omega) = 1$
- (countable) additivity: for disjoint $A_1, A_2, \ldots \in \mathcal{F}$, $\mathbb{P}\left(\bigcup_{i=1}^{\infty} A_i\right) = \sum_{i=1}^{\infty} \mathbb{P}(A_i)$
A measurable space $(\Omega, \mathcal{F})$ with a measure $\mathbb{P}$ is the abstract model for a probability space $(\Omega, \mathcal{F}, \mathbb{P})$.
Example 1: finite probability space
In the notebook's running example, consider describing a rapper's vocabulary as some sample space $\Omega$ of words, together with a probability assignment over them (take a moment to confirm that the probabilities are all non-negative and normalize to 1).
Let $X$ be the r.v. that maps words to their lengths, and let $F$ be the r.v. that maps a word to the ordinal-based unicode number of its first letter.
- the r.v. $X$ taking on value $x$ constructs the event modeled by the subset $\{\omega \in \Omega : X(\omega) = x\}$ with probability $\mathbb{P}(X = x)$
- the r.v. $F$ taking on value $f$ constructs the event modeled by the subset $\{\omega \in \Omega : F(\omega) = f\}$
The event where $X = x$ and $F = f$ constructs the subset $\{X = x\} \cap \{F = f\}$. The event where $X = x$ or $F = f$ constructs the subset $\{X = x\} \cup \{F = f\}$.
Example 2: continuous probability space
Discrete Random Variables
Continuous Random Variables
Rulez Rule: Product Rule, Sum Rule, Bayes' Rule
With an axiomatization in place where random events and random variables are modeled as measurable subsets and measurable functions of a measurable space $(\Omega, \mathcal{F})$, the focus is now on the core — invariant to sample space modifications — probabilistic concepts that comprise the notion of updating beliefs.
It all starts with conditional probability, which is the probability of some event $A$ given that event $B$ was observed. If the Kolmogorov axiomatization admits modeling stochastic events $A$ and $B$ as a venn diagram within an ambient sample space, then conditional probability changes the fidelity of description so that $B$ is the new (and narrowed) "sample space" being considered. That is, the probability of $A$ given $B$ is denoted with $\mathbb{P}(A \mid B)$ and is defined as

$$\mathbb{P}(A \mid B) = \frac{\mathbb{P}(A \cap B)}{\mathbb{P}(B)}$$
and $(\Omega, \mathcal{F}, \mathbb{P}(\cdot \mid B))$ can itself be proved a valid probability space. In other scenarios $\mathbb{P}(A \mid B)$ is given, but $\mathbb{P}(A \cap B)$ is not. This is amenable by rearranging the definition and solving for the latter with $\mathbb{P}(A \cap B) = \mathbb{P}(A \mid B)\,\mathbb{P}(B)$, which is referred to as the product rule (emphasis on the product) or chain rule (emphasis on the factors).
In other scenarios $\mathbb{P}(A \mid B_i)$ is given (and now $\mathbb{P}(A \cap B_i)$ by the product rule), but $\mathbb{P}(A)$ is not. This too is amenable by using the events $B_i$ as "background events", where assuming the $B_i$ are mutually exclusive and exhaustive, then $\mathbb{P}(A) = \sum_i \mathbb{P}(A \cap B_i) = \sum_i \mathbb{P}(A \mid B_i)\,\mathbb{P}(B_i)$. This is referred to as the sum rule, and can be composed with the product rule to further decompose the terms of products into terms of conditionals.
Finally, in other scenarios where $\mathbb{P}(B \mid A)$ is given but $\mathbb{P}(A \mid B)$ is not, this is amenable by rewriting the numerator and denominator of conditional probability using the product rule and sum rule:

$$\mathbb{P}(A \mid B) = \frac{\mathbb{P}(B \mid A)\,\mathbb{P}(A)}{\sum_{A'} \mathbb{P}(B \mid A')\,\mathbb{P}(A')}$$

and is referred to as bayes' rule. The scenario where $\mathbb{P}(B \mid A)$ is accessible but $\mathbb{P}(A \mid B)$ is not comes up many times in practice, where $A$ is some latent event and $B$ is an observable event. This happens in science (why bayes' rule is also referred to as the logic of science), where bayes' rule is used to model hypothesis $H$ and evidence $E$: the posterior $\mathbb{P}(H \mid E)$ (latent given observed) is updated by evaluating the product of the prior $\mathbb{P}(H)$ and the likelihood $\mathbb{P}(E \mid H)$ (observed given latent) normalized by the evidence $\mathbb{P}(E)$.
Example 3: conditional probability
Let $\Omega$ be the sample space used from the previous examples, let $F$ once again be the r.v. that maps a word to the ordinal-based unicode number of its first letter, and let $W$ be a new identity r.v. that maps a word to its hash. What's $\mathbb{P}(W = w \mid F = f)$? The answer by definition of conditional probability is $\mathbb{P}(W = w \mid F = f) = \frac{\mathbb{P}(\{W = w\} \cap \{F = f\})}{\mathbb{P}(F = f)}$, where the denominator expands into a sum over outcomes because of additivity.
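A toy finite probability space in this spirit, with a made-up vocabulary and probability assignment (the words and masses are illustrative):

```python
# made-up finite probability space over a rapper's vocabulary
P = {"baby": 0.3, "cutie": 0.2, "bougie": 0.2, "booty": 0.3}  # non-negative, sums to 1

X5 = {w for w in P if len(w) == 5}       # event {X = 5}: words of length 5
Fb = {w for w in P if w[0] == "b"}       # event {F = ord('b')}: words starting with 'b'

p_and = sum(P[w] for w in X5 & Fb)       # P({X=5} and {F=ord('b')}) by additivity
p_given = p_and / sum(P[w] for w in Fb)  # P(X=5 | F=ord('b')) by definition
print(p_and, p_given)                    # 0.3, 0.375
```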
todo: examples 4/5/6: product rule/sum rule/bayes rule
Inference: Joint Random Variables
from jax import numpy as jnp
from jaxtyping import Array, Float

def bayes(joint: Float[Array, "dx dy"]):
    """Decompose a joint p(x, y) into (prior, likelihood, evidence, posterior)."""
    prior = joint.sum(axis=1)              # p(x): sum rule, marginalizing out y
    likelihood = joint / prior[:, None]    # p(y|x): product rule rearranged
    evidence = joint.sum(axis=0)           # p(y): sum rule, marginalizing out x
    posterior = joint / evidence[None, :]  # p(x|y): bayes' rule
    return prior, likelihood, evidence, posterior
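A usage sketch with a made-up 2x2 joint (entries illustrative, summing to 1):

```python
from jax import numpy as jnp

joint = jnp.array([[0.3, 0.1],
                   [0.2, 0.4]])  # rows index x, columns index y
prior, likelihood, evidence, posterior = bayes(joint)
print(prior)      # p(x) = [0.4, 0.6]
print(evidence)   # p(y) = [0.5, 0.5]
print(posterior)  # p(x|y); each column normalizes to 1
```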
Information Theory
Kullback-Leibler Divergence
Entropy
Compression
Algorithmic Information Theory
A.3 Parameter Estimation
Statistics
In probabilistic models, distributions of latent hypotheses are usually unknown and their posteriors are inferred with bayes' rule by taking the product of the prior $p(\theta)$ and the likelihood $p(\mathcal{D} \mid \theta)$, normalized by the evidence $p(\mathcal{D})$. When these latter distributions are unknown (more specifically, the parameters that characterize them), the parameters need to be recovered from data via parameter estimation.
Empirical Risk Minimization (ERM)
Risk is a quantity to be minimized. Reward is a quantity to be maximized.
loss functions and risk.
- theta_likelihood from L_likelihood (likelihood from the probabilistic perspective)
- theta_0-1 from L_01
- theta_absolute from L_abs
Unbiased Estimators
Maximum Likelihood Estimation (MLE)
Maximizing the likelihood uses the probability model itself as the loss function. Following the notion of minimizing risk, the maximization corresponds to minimizing the negative likelihood. One more transformation is applied by mapping the likelihood to log space, which simplifies the routine by decomposing the joint probability of the data into a product of likelihoods with the chain rule (assuming iid data), a product which the log then turns into a sum. The parameters which minimize negative log likelihood are the same that minimize negative likelihood since logarithms preserve monotonicity. All together, the optimization problem is:

$$\hat{\theta} = \arg\min_\theta \, -\sum_{i=1}^{n} \log p(x^{(i)}; \theta)$$

In the following examples the $\arg\min$ is implemented symbolically by solving for $\frac{\partial \mathcal{L}}{\partial \theta} = 0$.
Example 1: maximizing likelihood
Assume a parametric distribution for the data. Then using negative log likelihood as the loss function corresponds to the following optimization:
and solving for the arguments which minimize the expression corresponds to taking the derivative with respect to $\theta$ and solving for where it equals 0:
Example 2: maximizing likelihood
Assume a parametric distribution for the data. Then using negative log likelihood as the loss function corresponds to the following optimization:
and solving for the arguments which minimize the expression corresponds to taking the derivative with respect to $\theta$ and solving for where it equals 0:
Example 3: maximizing likelihood
Assume a parametric distribution for the data. Then using negative log likelihood as the loss function corresponds to the following optimization:
and solving for the arguments which minimize the expression corresponds to taking the derivative with respect to $\theta$ and solving for where it equals 0:
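Since the examples above elide their distributions, here is a numerical sanity check assuming, as one concrete instance, a Gaussian with known unit variance: the argmin of the NLL found by gradient descent matches the closed-form MLE (the sample mean):

```python
import numpy as np

rng = np.random.default_rng(0)
x_N = rng.normal(loc=3.0, scale=1.0, size=1000)

mu_closed_form = x_N.mean()  # symbolic argmin of the Gaussian NLL

mu = 0.0                     # numeric argmin via gradient descent
for t in range(200):
    grad = len(x_N) * mu - x_N.sum()  # d/dmu of sum_i (x_i - mu)^2 / 2
    mu -= 1e-3 * grad

print(mu_closed_form, mu)    # both ~3.0
```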
Maximum a Posterior Estimation (MAP)
Matrix Calculus
Recall that for some scalar function $f: \mathbb{R} \to \mathbb{R}$ the derivative is usually defined as $f'(x) \triangleq \frac{df}{dx}$, the ratio between the differentials (infinitesimal differences) of output $df$ and input $dx$. While division is defined on $\mathbb{R}$, this is not always the case for higher dimensional euclidean spaces.
Moreover, the problem with the formulation of derivatives as ratios is not limited to denotational semantic lawyering — there's also a problem with the derivative rules themselves. Consider the function $f(X) := X^2$, which squares a matrix. The power rule of scalar functions does not generalize here, because $d(X^2) = X\,dX + dX\,X \neq 2X\,dX$ when matrices do not commute.
Defining the derivative as a ratio does not generalize. Instead, the correct approach is linearization, which defines the derivative as the linear operator you apply to a change in input to receive a change in output. That is, $df = f(x + dx) - f(x) = f'(x)[dx]$. This formulation of the derivative generalizes scalar and vector differential calculus over $\mathbb{R}$ and $\mathbb{R}^n$ into higher dimensional vector spaces for matrix and tensor calculus. In essence, linearization is locally approximating complex surfaces in vector spaces (more specifically, Banach spaces equipped with a norm $\lVert \cdot \rVert$) with linear operators $f'(x)$ so that $f(x + dx) = f(x) + f'(x)[dx] + o(\lVert dx \rVert)$.
Derivatives, Gradients, and Jacobians
- for some function $f: \mathbb{R} \to \mathbb{R}$, $f'(x)$ is the term such that $df = f'(x)\,dx$. For this expression to be defined, $f'(x) \in \mathbb{R}$ since $df, dx \in \mathbb{R}$.
- for some function $f: \mathbb{R}^n \to \mathbb{R}$, $f'(x)$ is the term such that $df = f'(x)\,dx$. For this expression to be defined, $f'(x) \in \mathbb{R}^{1 \times n}$ since $dx \in \mathbb{R}^n$ and $df \in \mathbb{R}$. In this case, the transpose of $f'(x)$ is often referred to as the gradient of $f$ and denoted by $\nabla f$
- for some function $f: \mathbb{R}^m \to \mathbb{R}^n$, $f'(x)$ is the term such that $df = f'(x)\,dx$. For this expression to be defined, $f'(x) \in \mathbb{R}^{n \times m}$ since $dx \in \mathbb{R}^m$ and $df \in \mathbb{R}^n$. In this case, $f'(x)$ is often referred to as the jacobian of $f$ and denoted by $J_f$
numerical example.
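For instance, a sketch with a made-up $f: \mathbb{R}^3 \to \mathbb{R}^2$ and its hand-written jacobian, checking $f(x + dx) - f(x) \approx J\,dx$:

```python
import numpy as np

def f(x):
    return np.array([x[0] * x[1], np.sin(x[2])])

def jacobian(x):  # J_ij = df_i/dx_j, written out by hand
    return np.array([[x[1], x[0], 0.0],
                     [0.0,  0.0,  np.cos(x[2])]])

x = np.array([1.0, 2.0, 0.5])
dx = 1e-6 * np.array([1.0, -1.0, 2.0])
print(f(x + dx) - f(x))  # change in output
print(jacobian(x) @ dx)  # linear operator applied to change in input; agrees to ~1e-12
```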
Automatic Differentiation
as opposed to symbolic or numerical.
Optimization
Gradient Descent
Stochastic Gradient Descent
Adam
Shampoo
Muon
and more generally we define the derivative as the linear operator L on vector space V which you apply to a change in input in order to obtain the change in output. That is, Δout = (lin.op)[Δin], which generalizes to higher dimensions. While defining f'(x) ≜ df/dx is legal when f: ℝ -> ℝ, it's usually not clear what df/dx means in higher dimensions, since /: ℝ^n, ℝ^m -> ? is not defined.
f: ℝ^d -> ℝ
Consider f'(x) when f: ℝ^d -> ℝ. Then, given that dx ∈ ℝ^d and df ∈ ℝ, the linear operator f'(x) must (formalized by the Riesz–Fréchet Representation Theorem) be the dot product with some column vector (multiplication with the column vector transposed). Let's call this column vector the "gradient" of f, where the component-wise definition with indices is:
∇f ∈ ℝ^d, f'(x)dx = ∇fᵀdx
∇f_i ≜ ∂f/∂x_i ∀ i ∈ [0..d]
and all together, we use f'(x) to linearize df:
f(x+dx)-f(x) = df
= f'(x)dx
= ∇fᵀdx
f: ℝ^m -> ℝ^n
Moving on, consider f'(x) where f: ℝ^m -> ℝ^n. Then given that dx ∈ ℝ^m and df ∈ ℝ^n, the linear operator f'(x) must (formalized by the Riesz–Fréchet Representation Theorem once again) be expressible as a matrix multiplication with some matrix A ∈ ℝ^(nxm). Let's call this matrix the "jacobian", where the component-wise definition with indices is the following:
Jf ∈ ℝ^(nxm), f'(x)dx = Jdx
Jf_ij ≜ ∂fi/∂xj ∀ i ∈ [0..n], j ∈ [0..m]
and all together, we use f'(x) to linearize df:
f(x+dx)-f(x) = df
= f'(x)dx
= Jdx
f: ℝ^m, ℝ^n -> ℝ^p [f: ℝ^(nxm), ℝ^(mxp) -> ℝ^(nxp)]
For example, given f: ℝ^m -> ℝ^n, f(x) := Ax
==> df = f(x+dx)-f(x)
= A(x+dx)-Ax
= Adx
==> f'(x) = A = J
While it's very elegant that the derivative of Ax turns out to be A, training neural networks requires that A is not just a constant, but a variable of f itself. To make the deep learning context clearer, we will use W instead of A:
fi(W,x) ≜ ΣWij*xj
f(x) := [f1(x), f2(x), ..., fn(x)]
but we can just promote x to be a matrix too. Finally, consider f'(x) when
f: ℝ^(nxm), ℝ^(mxp) -> ℝ^(nxp)
ℒ: ℝ^d0 -> ℝ ∋ ℒ(Z) where Z := X @ Y
==> dℒ = dℒ/dZ : dZ [by Riesz representation; ':' is the Frobenius inner product]
= dℒ/dZ : (dX@Y + X@dY) [by product rule on dZ]
= dℒ/dZ : (dX@Y) + dℒ/dZ : (X@dY) [: linear]
= (dℒ/dZ@Yᵀ) : dX + (Xᵀ@dℒ/dZ) : dY [by circularity of trace]
==> dℒ/dX = dℒ/dZ@Yᵀ ∧ dℒ/dY = Xᵀ@dℒ/dZ
Stepping back, defining the derivative this way generalizes the rise over run with scalars, the gradient with vectors, and the jacobian with matrices into the same notion: they are the linear operators defined on vector spaces that you apply to a change in input in order to obtain the change in output, and the types/shapes are defined the way they are out of necessity, rather than by fiat. Moreover, this also sharpens the semantics of the denotation in higher dimensions given that df/dx is often ill-defined. i.e. while df/dx makes sense when df, dx ∈ ℝ, dividing tensors with mismatched shapes becomes nonsensical.
This also means that this definition extends to any vector space where a linear operator is defined, for instance, the space of matrices (vectorization of matrices), and the space of functions (calculus of variations).
While training deep neural networks does not involve functions that map matrices to matrices or functions to functions, defined this way it's clear that many of the "grads" we take in pytorch are actually jacobians that all compose into one single gradient for the entire expression graph, which represents some function f: ℝ^n -> ℝ.
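A finite-difference check of the matmul backward rule derived above, with an arbitrary made-up scalar loss L(Z) = sum(W * Z):

```python
import numpy as np

rng = np.random.default_rng(0)
X, Y = rng.normal(size=(2, 3)), rng.normal(size=(3, 4))
W = rng.normal(size=(2, 4))  # fixed so that L(Z) = sum(W * Z) and dL/dZ = W

def L(X_):
    return (W * (X_ @ Y)).sum()

dLdX = W @ Y.T                # claimed rule: dL/dX = dL/dZ @ Yᵀ
fd = np.zeros_like(X)         # finite-difference estimate of dL/dX
eps = 1e-6
for i in range(X.shape[0]):
    for j in range(X.shape[1]):
        Xp = X.copy()
        Xp[i, j] += eps
        fd[i, j] = (L(Xp) - L(X)) / eps

print(np.abs(dLdX - fd).max())  # ~1e-9
```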
2b — Derivative rules: backward methods
defining rules for any f: V->V
Sum rule
--------
f: V -> V
f(x) := g(x) + h(x)
==> df = dg + dh
f'(x)dx = g'(x)dx + h'(x)dx [by def]
f'(x) = g'(x) + h'(x) [/ dx]
Product rule
------------
f: V -> V
f(x) := g(x)h(x)
==> df = f(x+dx)-f(x)
= g(x+dx)h(x+dx)-g(x)h(x) [by def of f]
= [g(x)+g'(x)dx][h(x)+h'(x)dx]-g(x)h(x) [by def of i(x+dx)=i(x)+i'(x)dx]
= g(x)h(x) [expand product]
+ g(x)h'(x)dx
+ h(x)g'(x)dx
+ g'(x)h'(x)(dx)^2
- g(x)h(x)
= g(x)h'(x)dx + h(x)g'(x)dx + g'(x)h'(x)(dx)^2 [g(x)h(x) cancels]
= g(x)h'(x)dx + h(x)g'(x)dx [g'(x)h'(x)(dx)^2 -> 0]
= [g(x)h'(x) + h(x)g'(x)]dx
= f'(x)dx
Chain rule
----------
f: V -> V
f(x) := g[h(x)]
==> df = f(x+dx)-f(x)
= g[h(x+dx)]-g[h(x)]
= g[h(x)+h'(x)dx]-g[h(x)] [by def of h(x+dx)=h(x)+h'(x)dx]
= g'[h(x)]h'(x)dx [g(u+du)-g(u) = dg = g'(u)du
with u=h(x), du=h'(x)dx]
2c — Automatic differentiation: calculus on computational graph
Recall that we need to evaluate the gradient of a neural network's loss function in order to train it with parameter estimation. The method used to compute derivatives is automatic differentiation — an algorithmic technique as opposed to symbolic techniques that are inefficient or numeric ones that are inaccurate. Given some expression graph that represents a neural network f: ℝ^d_0 -> ℝ, autodiff will recursively apply the chain rule for each subexpression h: ℝ^d_i -> ℝ^d_j.
c = a*b (1)
d = e+f (2)
g = c*d (3)
.-.
( a <---+
`-' | * .-. (1)
.-. +----( c <---+
( b )<--+ `-' |
`-' | * .-. (3)
.-. +-----( g )
( e )<--+ + .-. (2)| `-'
`-' +----( d )<--+
.-. | `-'
( f )<--+
`-'
Fig 4. expression graph
Notice in figure 4 that even though evaluation of the expression is denoted from left to right, the edges of the graph form a directed acyclic graph (tree) rooted at the output of the expression which simplifies graph modification.
Interpreting each step of computation (1, 2, 3) is referred to as the "forward pass" and constructs the graph in memory which either creates a new tree amongst a forest (2) or roots the forest into a single tree (3).
Evaluating the derivative of the output with respect to all inputs is referred to as the "backward pass" — as opposed to "second forward pass" because deep learning frameworks implement reverse-mode differentiation as opposed to forward-mode.
The difference between forward-mode and reverse-mode differentiation is the direction you step through the graph which can end up influencing speed depending on the shape of the function. Stepping from inputs to outputs is considered a forward traversal, and the opposite is considered to be a backward traversal.
For instance, consider the application of the chain rule when evaluating f'(x) for f(x) := cos(x^2). Which subexpression do you derive first? Evaluating d/du[cos(u)] = -sin(u) first is reverse mode, and evaluating d/dx[x^2] = 2x first is forward mode. With neural networks, most expression graphs represent functions of the form f: ℝ^n -> ℝ, and so the default traversal for all deep learning frameworks is reverse mode.
However, regardless of the order of traversal, the effective computation is the same: a summation over all possible products of paths. The base case starts with ∂f/∂f = 1, propagating the local dout/din, and recursively applies the chain rule for each subexpression. This is why reverse-mode differentiation is also referred to as "backpropagation". Stepping through the backward for the same function f as above where d denotes the global derivative and ∂ denotes the local derivative (overloading ∂'s classic usage for partial derivative)
---forward
c = a*b (1)
d = e+f (2)
g = c*d (3)
---backward
∂g/∂g = 1 [base case] (4)
dg/dc = ∂g/∂g*∂g/∂c [chain rule]
= ∂g/∂g*d [product rule]
= d [∂g/∂g cached] (5)
dg/dd = ∂g/∂g*∂g/∂d [chain rule]
= ∂g/∂g*c [product rule]
= c [∂g/∂g cached] (6)
dg/da = ∂g/∂g*∂g/∂c*∂c/∂a [chain rule]
= ∂g/∂g*∂g/∂c*b [product rule]
= d*b [∂g/∂g*∂g/∂c cached] (7)
dg/db = ∂g/∂g*∂g/∂c*∂c/∂b [chain rule]
= ∂g/∂g*∂g/∂c*a [product rule]
= d*a [∂g/∂g*∂g/∂c cached] (8)
dg/de = ∂g/∂g*∂g/∂d*∂d/∂e [chain rule]
= ∂g/∂g*∂g/∂d*1 [sum rule]
= c [∂g/∂g*∂g/∂d cached] (9)
dg/df = ∂g/∂g*∂g/∂d*∂d/∂f [chain rule]
= ∂g/∂g*∂g/∂d*1 [sum rule]
= c [∂g/∂g*∂g/∂d cached] (10)
.-. (7)
( a <---+
`-' | * .-. (1) (5)
.-. +----( c <---+
( b )<--+ `-' |
`-' (8) | * .-. (4)
.-. (9) +-----( g )
( e )<--+ + .-. (2)|(6) `-'
`-' +----( d )<--+
.-. | `-'
( f )<--+
`-'(10)
What's delightful from a user experience point of view is that autodifferentiation allows the user to implicitly construct the expression graph by specifying the forward pass, while the backward pass is abstracted away with an output.backward().
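A micrograd-style sketch of that user experience on the worked example above (a minimal Value class; names and structure are illustrative, not picograd's implementation):

```python
class Value:
    def __init__(self, data, children=()):
        self.data, self.grad = data, 0.0
        self._children, self._backward = children, lambda: None

    def __add__(self, other):
        out = Value(self.data + other.data, (self, other))
        def _backward():  # sum rule: dout/din = 1
            self.grad += out.grad
            other.grad += out.grad
        out._backward = _backward
        return out

    def __mul__(self, other):
        out = Value(self.data * other.data, (self, other))
        def _backward():  # product rule: dout/dself = other.data
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad
        out._backward = _backward
        return out

    def backward(self):
        topo, seen = [], set()
        def build(v):  # topological order over the DAG rooted at self
            if v not in seen:
                seen.add(v)
                for c in v._children:
                    build(c)
                topo.append(v)
        build(self)
        self.grad = 1.0  # base case (4): dg/dg = 1
        for v in reversed(topo):
            v._backward()

a, b, e, f = Value(2.0), Value(3.0), Value(4.0), Value(5.0)
c, d = a * b, e + f                    # forward (1), (2)
g = c * d                              # forward (3)
g.backward()                           # backward (4)-(10)
print(a.grad, b.grad, e.grad, f.grad)  # d*b=27, d*a=18, c=6, c=6
```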
TODO: show expression graph for a neural network. TODO: let's now revisit the definition of the derivative (2b) and rederive a few derivative rules (2c)
- +(tensor, tensor)
- @(tensor, tensor)
- tanh(tensor)
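Hypothetical numpy sketches of those three backward methods (each maps the upstream gradient dout to gradients for its inputs; signatures are illustrative):

```python
import numpy as np

def add_backward(dout, a, b):
    return dout, dout                      # d(a+b)/da = d(a+b)/db = identity

def matmul_backward(dout, a, b):
    return dout @ b.T, a.T @ dout          # the dL/dX, dL/dY rule derived earlier

def tanh_backward(dout, a):
    return dout * (1.0 - np.tanh(a) ** 2)  # d tanh(a)/da = 1 - tanh^2(a)
```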
2d — Advanced AD: jacobian-vector and vector-jacobian products
2e — Gradient descent: non-convex optimization
Finally, recall that evaluating the derivative of the computational graph's output with respect to all inputs is used to implement argmin ℒ(Θ) by iteratively adjusting θ̂ via gradient descent: θ^(t+1) := θ^t - α∇ℒ(Θ) = θ^t - α∇[-Σ log P(y^i|x^i;Θ)]
Bibliography
Topological Order
(Iterative Deepening Depth-First Search)
- Linear Algebra Kang/Cho
- Matrix Calculus Scardapane -> Bright/Edelman
- Probability Theory Piech -> Chan -> Tao -> Varadhan
- Machine Learning Jurafsky/Martin -> Ng/Ma -> MacKay -> Murphy -> Cho
- Deep Learning Cho -> Goodfellow -> Zhang et al -> Ermon/Grover
- Compilers Pfenning et al. -> Myers -> Appel -> Sampson -> Muchnick
- Chips Harris -> Hennessy -> Herlihy -> Kirk/Hwu
References
Aho, Alfred V., et al. Compilers: Principles, Techniques, and Tools. Pearson Education, 2006.
Appel, A. W., & Ginsburg, M. (2004). Modern Compiler Implementation in C. Cambridge University Press.
Bakhvalov, D. (2024). Performance Analysis and Tuning on Modern CPUs. Independently Published.
Bright, P., Edelman, A., & Johnson, S. G. (2025). Matrix Calculus (for Machine Learning and Beyond). ArXiv.org https://arxiv.org/abs/2501.14787.
Chan S. H. (2021). Introduction to Probability for Data Science. Michigan Publishing https://probability4datascience.com/.
Cho, K. (2015, November 24). Natural Language Understanding with Distributed Representation. ArXiv.org https://doi.org/10.48550/arXiv.1511.07916.
Cho, K. (2025). Machine Learning: a Lecture Note. ArXiv.org https://arxiv.org/abs/2505.03861.
Cooper, Keith D. “COMP 512 Lecture Notes: Advanced Compiler Construction.” Rice, 2015, https://www.clear.rice.edu/comp512/Lectures/.
Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning. The MIT Press https://www.deeplearningbook.org/.
Grover, A. (2018). CS236 Deep Generative Models Course Notes. https://deepgenerativemodels.github.io/notes/.
Harris, S. L., & Harris, D. (2021). Digital Design and Computer Architecture, RISC-V Edition. Morgan Kaufmann https://pages.hmc.edu/harris/ddca/.
Hennessy, J. L., & Patterson, D. A. (2019). Computer Architecture: a Quantitative Approach. Morgan Kaufmann.
Herlihy, M., Nir Shavit, Luchangco, V., & Spear, M. (2020). The Art of Multiprocessor Programming. Newnes.
Kirk, D., & Wen-Mei Hwu. (2022). Programming Massively Parallel Processors: a Hands-on Approach. Morgan Kaufmann Elsevier.
Krishnamurthi, Shriram. “Programming Languages: Application and Interpretation.” Plai, 2022, https://www.plai.org/.
Jurafsky, D., & Martin, J. H. (2014). Speech and Language Processing: an Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. Dorling Kindersley Pvt, Ltd. https://web.stanford.edu/~jurafsky/slp3/ed3book.pdf.
Kang, W., & Cho, K. (2025). Linear Algebra for Data Science. kyunghyuncho.me https://kyunghyuncho.me/linear-algebra-for-data-science/.
Mackay, D. J. C. (2003). Information theory, inference, and learning algorithms. Cambridge University Press https://www.inference.org.uk/itprnn/book.pdf.
Møller, Anders, and Michael I. Schwartzbach. Static Program Analysis. 2024, https://cs.au.dk/~amoeller/spa/.
Muchnick, S. S. (2014). Advanced Compiler Design and Implementation. Morgan Kaufmann.
Murphy, K. P. (2022). Probabilistic machine learning : an introduction. MIT Press https://probml.github.io/pml-book/book1.html.
Murphy, K. P. (2023). Probabilistic Machine Learning: Advanced Topics. MIT Press https://probml.github.io/pml-book/book2.html.
Myers, Andrew. “CS 4120 Lecture Notes.” Cornell, 2023, https://www.cs.cornell.edu/courses/cs4120/2023sp/notes/.
Ng, A., & Ma, T. (2023). CS229 Lecture Notes https://cs229.stanford.edu/main_notes.pdf.
Piech, C. (2023). Probability for Computer Scientists. https://chrispiech.github.io/probabilityForComputerScientists/en.
Pfenning, Frank, et al. “15-411/611 Lecture Notes.” CMU, 2024, https://www.cs.cmu.edu/~janh/courses/411/24/schedule.html.
Varadhan, S. R. S. (2001). Probability Theory. Courant Institute of Mathematical Sciences; Providence, R.I. https://www.ams.org/books/cln/007/cln007-endmatter.pdf.
Sampson, Adrian. “CS 6120 Lectures: Advanced Compilers” Cornell, 2025, https://www.cs.cornell.edu/courses/cs6120/2025sp/lesson/.
Scardapane, S. (2024). Alice’s Adventures in a Differentiable Wonderland -- Volume I, A Tour of the Land. ArXiv.org https://arxiv.org/abs/2404.17625.
Siek, Jeremy G. Essentials of Compilation. MIT Press, 1 Aug. 2023.
Tao, T. (2015). 275A Probability Theory Course Notes. https://terrytao.wordpress.com/category/teaching/275a-probability-theory/.
Zhang, A., Lipton, Z. C., Li, M., & Smola, A. J. (2023). Dive into Deep Learning. Cambridge University Press https://d2l.ai/.