Flammarion's L'atmosphère: météorologie populaire, 1888.

The Structure and Interpretation of Tensor Programs

The Hacker's Guide to Differentiable, Parallel, Machine Learning Systems.

by j4orz.
First Edition draft.
Comments and feedback are welcome.

Dedication

I dedicate this textbook to my father, Dr. Thomas Zhang ND., R.TCMP, R.Ac.

When I was 23 years old, in 2023, he guided me as my plans for graduate school gave way to helping my sister recover from a diagnosis of schizophrenia. That same year, he suffered two hemorrhagic strokes and passed away shortly after. My personal relationship to the psychiatric mind and the neurological brain led to a deep reflection on my life's work ahead. The result is the following textbook on compilers for artificial intelligence, an intersection of programming languages and machine learning.

I hope to make you proud, dad. May you rest in pure land. We'll meet again.

We’ll Meet Again by Vera Lynn 1939. Cover by Johnny Cash 2002.

Foreword

Thomas Cole's The Course of Empire: Destruction. Oil on canvas. 1836.

This textbook is created for all intelligent life forms, whether artificial or biological. However, each line of the book has been handcrafted with care to resonate with and be understood by fellow humans. As a result, the reading experience of the textbook should remain uniquely complementary to that of a context-limited large language model. For now.

Contents

greeks -> göttingen -> google -> gpt
the sciences of the artificial: https://monoskop.org/images/9/9c/Simon_Herbert_A_The_Sciences_of_the_Artificial_3rd_ed.pdf
declarative knowledge: math; procedural knowledge: computation

programmers are the original speedrunners.

  • sicp/fics
  • htdp/dcic: ap.b and prelims will represent data before using the data (bottom up: space, elements), in reverse order of discovery (time). mathematics forms a unity; programming, as a mathematical discipline, also shares that unity.
  • sicp/fics
  • htdp/dcic -> will never be dated, because they are universal.

arithmetic -> algebra -> analysis

yin and yang: data <-> function. data can be modelled by functions (lazily).

Civilizations and their Canons

The foundation of civilization is shared stories. For society to function, we need to be able to relate to one another via shared context. This shared context consists of truths that are taken to be self-evident. In other words, they are the axioms of humanity.

The stories that provide this shared context for civilization are referred to as the canon. A canon is a body, standard, or code against which things are compared. The word canon derives from the Greek kanon (κανών), a straight rod used by architects and artificers as a measuring stick for making straight lines.

If canons serve as the foundation for civilization, then there are two ways for a civilization to collapse: burn the canon so that none remains, or expand the canon to hundreds and thousands of books so that everyone has their own story, resulting in the inability to relate to one another.

This is why Greek civilization has the Iliad and the Odyssey, Roman civilization has the Bible, and Indian civilization has the Mahabharata and the Ramayana. An individual only has to read and deeply understand a few stories in order to become civilized and coexist in harmony with the rest of society.

So canons are the solution to organizing society at scale, and the book that became the most predominant in Western civilization was the Bible, arising from Jewish civilization. This is because the Jewish religion connected their experience as a nation (the exodus) and provided a prehistory back to the beginning of the universe via the book of Genesis, something lacking from earlier religions such as those of the Greeks or the Egyptians.

Thomas Aquinas, following his rational explanation of how the Bible as canon provided the foundation for civilization, defined policies of individual agents, referred to as the practical virtues:

  • temperance is to optimize internal organization
  • justice is to optimize external organization
  • prudence is to justify goals
  • courage is to balance exploration vs exploitation

These individual policies, however, are necessary but not sufficient for a civilization's canon. For that, you need policies of collective agents, referred to as the divine virtues:

  • faith is the commitment to the optimal collective agent
  • love is to serve a sacred purpose in the other
  • hope is to invest before a shared system exists

This optimal collective agent is defined as the best possible, most top-level collective agent that can be, and is, continually discovered through rational inference. That collective agent is what Thomas Aquinas defines as God. These axiomatic virtues of God are also referred to as the divine will, and they served as the canon (universal morality) for Western civilization until the "death of God" during the Enlightenment. Since then, other attempts have been made to redefine the foundations of ethics (universal morality) without reference to collective agents, utilitarianism for example.

However, this textbook is not about ethics, but rather programming. More specifically, programming tensor compilers for artificial intelligence. While civilizations have their canons, disciplines have theirs as well. Philosophers have Plato. Lawyers have Blackstone. Doctors have Gray. Programmers have Knuth. But what do Artificial Scientists have? We'll return to this question.

What can a human being know? Intelligence = compression = the ability to make models.

  • intelligence is a multi-generational property. we can't even figure out turing complete languages by ourselves.

  • individuals have more intelligence than generations

  • civilizations have more intelligence than individuals -> what does a civilizational intellect look like? global optimum of the modeling function -> a civilizational tradition that lasts a few hundred years. a canon.

need 1000 year unbroken intellectual tradition (canon) failing due to the scaling problems of human minds via natural language- Assignment 5

Disciplines and their Canons

We finally return to the discussion on disciplines and their canons.

golden age of the greeks, the germans, and the googlers. second golden age of compilers and chips. we'll know in a few decades.

Programming. SICP, HTDP, DCIC. Wirth, Knuth, Tarjan, CLRS, EoP. in the dawn of the llm, who is teaching an introductory computer science course with graph theory and tensor calculus? today's introductory cs: the study of problems in P. in the same way philosophy shifted to the sciences, introductory cs shifts graduate courses down to undergraduate courses.

you can teach this bottom up or top down. the presentation of the material is bottom up.

Bringing us back to the present, the year 2023 was seminal in the history of artificial intelligence, as the world woke up to the advancements in one of the greatest philosophical projects, begun at the Dartmouth workshop in 1956. The deep learning approach to growing intelligent machinery flew against the consensus view held amongst experts that intelligence required some master algorithm to be divined from the laws of nature, just as physics had been. While deep neural networks that learn representations end to end do in fact employ the precise tooling of mathematics, the act of training these systems is more akin to evolutionary biology than it is to traditional software.

Slowly but surely, watershed moments within academia filtered their way down into mainstream consciousness in the form of consumer products, the best example being the breakthroughs in speech and vision recognition of the early 2010s making their way into smartphone assistants like Alexa, Siri, and Google Assistant. Finally, in 2017, language modelling had its big breakthrough with the transformer-based neural networks presented in (Vaswani et al. 2017), which moved away from the infinite look-back design choice made with RNNs in (Mikolov et al. 2010) and back to autoregressive causal masks like the NADE network in (Larochelle, Murray 2011) and the MADE network in (Germain et al. 2015). The difference was that instead of using convolutions as the causal mask, they used the attention mechanism, turning neural networks into general set-processing machines.

While academia was excited by the new attention mechanism presented in the transformer-based neural network, it was OpenAI who noticed signs of life in scale (Kaplan et al. 2020) and took the industry bet with GPT3 (Brown et al. 2020) and GPT4 (OpenAI 2023).

  • 2018 (GPT1 117M params): grammar
  • 2019 (GPT2 1.5B params): prose, poetry, metaphor
  • 2020 (GPT3 175B params): long stories
  • 2023 (GPT4 1.76T params): college-level exams

When OpenAI took the pretrained GPT3/GPT4 and built ChatGPT by following LeCun's cake philosophy (IFT + RLHF), the reaction from the mainstream was visceral. All of a sudden, the big questions about the fundamental nature of our reality came screeching back into our lives like distant relatives during the holidays. Before we get into these big questions, we will lay down some principled groundwork for why a system like ChatGPT is possible to build in the first place.

  • reasoning and memory https://x.com/jxmnop/status/1938705379711430938

all the way to gpt-oss

Chapter 1: Preliminaries
Chapter 2: Serial Compilation (C on CPU)
Chapter 3: Parallel Compilation (CUDA C on GPU)
Chapter 4: Differentiable Interpretation (PyTorch using CUDA)
Chapter 5: Tiled Compilation (Triton on GPU)
Chapter 6: Differentiable Compilation (PyTorch compiling to Triton)

looking back after a decade, i claim karpathy's course will be deemed seminal. -> influenced stanford to go "line by line from scratch" -> gen z calls it "line by line from scratch" -> procedural epistemology: "best way to teach it to a computer". <add_minsky_quote>

the art of computer programming -> the art of multiprocessor programming
elements of euclid -> elements of X -> elements of programming
neural networks: zero to hero -> singularity systems: zero to hero

sicp bridges the entire semantic gap from scheme to register machine. like sicp, you create gpt2. and then you create the compilers for gpt2.

humanity has discovered the use of a new language: tensors. just like karpathy translated these papers into textbook form... the attention paper, growing neural networks, the internet as fossil fuel... now reasoners. continual learning is the next frontier. i want to do the same for pt2, triton, thunderkittens, cutile. -> programming is a mathematical discipline. you need models, which is mathema. -> it used to be just discrete, but now you need continuous.

this textbook's goal is to improve the trajectory of civilization.

Le Moyen Âge et la Renaissance. Paris, 1848-1859.

A computational process is indeed much like a sorcerer's idea of a spirit. It cannot be seen or touched. It is not composed of matter at all. However, it is very real. It can perform intellectual work. It can answer questions.

Elements of Learning

In chapter 1, we cover the elements of learning machines. In chapter 1.1, we explain how programmers interested in building learning machines must make the same jump from Aristotle's deterministic, propositional logic to Laplace's stochastic, probabilistic logic. In chapter 1.2, we construct probabilities with tensors following torch.Tensor, accelerate them with vector processors, and build probability distributions on top of those tensors following torch.distributions. In chapters 1.3 and 1.4, we cover inference with a linear modeling assumption, and the linear algebra computation needed to carry out such inference, following torch.linalg and cuBLAS. The elements of learning machines covered in chapter 1 will provide the foundation needed to train deep neural networks in chapter 2, following torch.nn, torch.optim, and cuDNN.

The School of Athens. Raffaello Sanzio da Urbino. 1509-1511

1.1. Types to Tensors

This chapter provides a historical introduction to the tension between continuous and discrete descriptions of reality, giving context for programmers who need to change gears in how they model reality: from the types of software 1.0 to the tensors of software 2.0.

Contents

Aristotle and Euclid to Kolmogorov and Laplace

Programming as a discipline shares the same dream as the philosophers of Ancient Greece: to describe continuous, fuzzy reality with a discrete language called mathematics. Mathematics is like a universal code library that has been maintained for millennia, without version management, a unified namespace, central maintainers, or a code of conduct! Programmers are heirs to a great tradition of describing reality with discrete utterances.

To better understand the zeitgeist of these Greek mathematicians1, a good first approximation is to picture the excitement they felt upon realizing that the geometry of the planets and the stars could be reduced to number, no different from the programmer's thrill when diving into the source of kernels and compilers and understanding how these complex systems all reduce down to two specific numbers: 0 and 1.

The canonical path taken by the Greeks to learn the mathematics that describes various fidelities of reality2 was to attend schools of thought dedicated to educating civilization. Of the many mathematicians and philosophers depicted above in Raffaello's School of Athens, the two that concern computation are Aristotle and Euclid. Aristotle, the compiler writer, who developed logic. And Euclid, the algorithm designer, whose Elements axiomatized geometry.

-> merge stepanov and harper/wadler. aristotle's species, genus, and differentia. church/turing's tape (turing machine) / copy-paster (lambda calculus).

PROBLEM OF INDUCTION

H: "All ravens are black" O: "I've only seen black ravens" H=>O: "if all ravens are black, then we will only see black ravens"

H O

T T F T F F T F

enumerate the truth states of H and O, in order to see that O is T does not uniquely determine H.
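A minimal sketch in plain Python (names are illustrative) that enumerates the truth table above and confirms that the worlds consistent with H => O in which O is true include both H = true and H = false:

from itertools import product

# Enumerate all truth assignments to H ("all ravens are black") and
# O ("I've only seen black ravens"), keeping those consistent with the
# implication H => O, i.e. (not H) or O.
consistent = [(h, o) for h, o in product([True, False], repeat=2) if (not h) or o]

# Among the worlds where O is true, H can still be either true or false:
print({h for h, o in consistent if o})   # {True, False}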

#![allow(unused)]
fn main() {
// The natural numbers, defined inductively: a Nat is either zero or the
// successor of another Nat.
enum Nat {
    Zero,
    Succ(Box<Self>)
}

// Euclid's algorithm over Nat; the implementation is left as an exercise.
fn gcd(m: Nat, n: Nat) -> Nat {
    todo!()
}

// The predecessor is partial: zero has none.
fn pred(n: Nat) -> Option<Nat> {
    match n {
        Nat::Zero => None,
        Nat::Succ(n) => Some(*n),
    }
}
}
#![allow(unused)]
fn main() {
use std::collections::HashSet;

// A linked list defined inductively, in the same style as Nat; a text could
// be represented as a List of tokens, each token itself a List of Nats.
enum List<T> {
    None,
    Some((T, Box<Self>))
}

// A hand-written, rule-based sentiment classifier over a token stream. The
// point of the sketch: enumerating rules by hand does not scale, since every
// new negator, intensifier, or idiom demands yet another special case.
fn classify_text(tokens: &[&str]) -> bool {
    let hs = |ws: &[&'static str]| -> HashSet<&'static str> { ws.iter().copied().collect() };
    let pos = hs(&["good","great","love","amazing","awesome","nice","cool","fantastic","wonderful","happy"]);
    let neg = hs(&["bad","terrible","hate","awful","horrible","worse","worst","angry","sad","disappointing"]);
    let negators = hs(&["not","never","no"]);
    let intens = hs(&["very","really","extremely"]);
    let soften = hs(&["slightly","somewhat","barely"]);

    let mut score: f32 = 0.0;
    let mut negate_window = 0;
    let mut intensity = 1.0;

    for &t in tokens {
        // simple state machine: handle negators/intensifiers/softeners
        if negators.contains(t) { negate_window = 3; continue; }
        if intens.contains(t)   { intensity *= 1.5; continue; }
        if soften.contains(t)   { intensity *= 0.5; continue; }

        let mut val = 0.0;
        if pos.contains(t) { val += 1.0; }
        if neg.contains(t) { val -= 1.0; }

        if negate_window > 0 {
            val = -val;
            negate_window -= 1;
        }

        score += val * intensity;
        intensity = 1.0; // one-shot boost
    }

    score > 0.0
}
}

Bayes and Laplace
  • Binomial Inference of the Sunrise
  • Gaussian Inference of Height
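For the sunrise example, Laplace's rule of succession (a standard result, stated here only to anchor the note above) gives the posterior predictive probability of another sunrise after $n$ consecutive observed sunrises, under a uniform prior on the Bernoulli parameter $\theta$:

$$ p(\text{sunrise tomorrow} \mid n \text{ sunrises}) = \int_0^1 \theta \, p(\theta \mid \text{data}) \, d\theta = \frac{n+1}{n+2}. $$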

one of the first algorithms was euclid's algorithm. an algorithm is a terminating procedure -> today algorithms are evaluated by a turing machine or the lambda calculus, using a computer -> we compute algorithms
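A sketch of Euclid's algorithm as a terminating procedure, in Python for brevity (the recursive form mirrors the inductive style of the Nat examples in this chapter):

def gcd(m: int, n: int) -> int:
    # Euclid's algorithm: the gcd is preserved by replacing (m, n) with
    # (n, m mod n); the second argument strictly decreases, so it terminates.
    return m if n == 0 else gcd(n, m % n)

assert gcd(48, 36) == 12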

what's the algorithm for computing sentiment analysis, speech recognition, or face detection? impossible to enumerate everything.

  • aristotle -> laplace
  • wittgenstein: LT -> PI
  • logic program (minsky) -> connectionist program (rosenblatt)
  • types -> tensors (rosenblatt)

Minsky to Rosenblatt

(early wittgenstein to late wittgenstein)

1

the math practiced in Babylon, Egypt, Athens, and Alexandria.

2

physics, chemistry, biology, psychology, phenomenology, sociology, policy

Nat

#![allow(unused)]
fn main() {
// Peano naturals: zero, or the successor of another Nat.
#[derive(Debug)]
enum Nat {
    Z,
    S(Box<Self>)
}

// successor: wrap the given Nat one more time
fn succ(x: Nat) -> Nat {
    Nat::S(Box::new(x))
}

// predecessor is partial: Z has none
fn pred(x: Nat) -> Option<Nat> {
    match x {
        Nat::Z => None,
        Nat::S(n) => Some(*n),
    }
}

// addition by structural recursion on the first argument
fn plus(x: Nat, y: Nat) -> Nat {
    match x {
        Nat::Z => y,
        Nat::S(n) => Nat::S(Box::new(plus(*n, y))),
    }
}
}

Correctness

#![allow(unused)]
fn main() {
#[cfg(test)]
mod test_nat {
    use super::*;
    use proptest::prelude::*;
    
    #[test]
    fn test_succ() -> () {
        let two = Nat::S(Box::new(Nat::S(Box::new(Nat::Z))));
        let output = succ(two);
        println!("{:?}", output);
    }

    #[test]
    fn test_pred() -> () {
        let two = Nat::S(Box::new(Nat::S(Box::new(Nat::Z))));
        let output = pred(two);
        println!("{:?}", output);
    }

    #[test]
    fn test_plus() -> () {
        let two = Nat::S(Box::new(Nat::S(Box::new(Nat::Z))));
        let one = Nat::S(Box::new(Nat::Z));
        let output = plus(one, two);
        println!("{:?}", output);
    }
}
}

Performance

1.2. Probability

In the previous chapter we covered how programmers need to make the transition from the types of software 1.0 to the tensors of software 2.0. In this chapter we go over how random variables and their probability distributions are constructed with tensors, following the design of torch.Tensor and torch.distributions.

Contents

Probabilities

  • data
  • circuit
  • math
  • code

Random Variables, Distributions

  • data
  • circuit
  • math
  • code
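As a preview of the APIs this chapter builds toward, here is a minimal sketch that constructs random variables on top of torch.Tensor with torch.distributions (the parameters and shapes are illustrative, not prescribed by the text):

import torch
from torch.distributions import Bernoulli, Normal

# A Bernoulli random variable parameterized by a tensor of probabilities.
coin = Bernoulli(probs=torch.tensor(0.7))
flips = coin.sample((1000,))                 # draw 1000 samples
print(flips.mean())                          # roughly 0.7

# A Gaussian random variable; log_prob evaluates the log density.
height = Normal(loc=torch.tensor(170.0), scale=torch.tensor(8.0))
print(height.log_prob(torch.tensor(180.0)))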

1.3. Linear Models and their Algebra

Thus far we have covered the probabilistic and stochastic nature of software 2.0, which requires computations on distributed truth values such as torch.Tensor and its various torch.distributions. In this chapter we cover the linear workhorses of statistical inference, which include linear models and their linear algebra solvers, following torch.linalg accelerated by cuBLAS.

Contents

1.3.1 Prediction

TIKZ: FUNCTION/WEIGHT SPACE AND DATA SPACE

The primary goal of supervised machine learning is to recover an underlying probability distribution by inverting a function. In other words, the process of learning maps a data space of observable inputs and outputs to a function space. Elements of the data space are referred to as data, evidence, or observables, while elements of the function space are referred to as hypotheses, or models. Once a function has been learned, evaluating it is used to predict future quantities of interest.

More precisely, given a dataset $\mathcal{D} = \{(\mathbf{x}_i, y_i)\}_{i=1}^{N}$ with the data space comprised of high-dimensional input and output vector spaces $\mathcal{X}$ and $\mathcal{Y}$, the underlying probability distribution $p(y \mid \mathbf{x})$ needs to be recovered, and evaluated for future predictions.

The primary distinction in predictive models is that of parametricity: whether the models have a fixed or variable number of parameters with respect to the size $N$ of the dataset $\mathcal{D}$. This chapter will use language modeling as a running example to introduce the workhorses of both parametric and non-parametric models1.

1.3.2 Parametric Models

Linearly Parametric

The hello world of learning machines is curve fitting. Consider the data above. The linear modeling assumption is that the underlying function generating the observables is linear in the latent weights $\mathbf{w}$, i.e. $y = \mathbf{w}^\top \mathbf{x} + b + \epsilon$.

Model

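As a statement of the model and objective that the code below solves (the notation here is an assumption of mine, not taken from the text), the linear modeling assumption with an intercept and Gaussian noise, and its least-squares estimate, are

$$ y = X\mathbf{w} + b\mathbf{1} + \boldsymbol{\epsilon}, \qquad \boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \sigma^2 I), \qquad \hat{\boldsymbol{\beta}} = \arg\min_{\boldsymbol{\beta}} \lVert A\boldsymbol{\beta} - y \rVert_2^2 \quad \text{with } A = [\, X \;\; \mathbf{1} \,]. $$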
import torch

# ---- toy data ----
torch.manual_seed(0)
N, D = 1000, 3
X = torch.randn(N, D)
w_true = torch.tensor([2.0, -3.0, 0.5])
b_true = 0.7
y = X @ w_true + b_true + 0.1*torch.randn(N)

# ---- closed-form least squares: beta = argmin ||A beta - y||_2 ----
A = torch.hstack([X, torch.ones(N, 1)])      # add intercept column
beta = torch.linalg.lstsq(A, y.unsqueeze(-1)).solution  # shape [D+1, 1]
w_hat, b_hat = beta[:-1, 0], beta[-1, 0]

print("ŵ:", w_hat)
print("b̂:", b_hat.item())

# quick check
y_hat = X @ w_hat + b_hat
mse = torch.mean((y_hat - y)**2)
print("MSE:", mse.item())
  • model
  • inference
  • eval

Host

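The Host subsection is currently a stub. As a hedged placeholder sketch (my assumption about its intent, not the author's plan), here is the same least-squares problem solved on CPU tensors via the normal equations and a Cholesky factorization:

import torch

def lstsq_host(A: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    # Normal equations: (A^T A) beta = A^T y, solved on host (CPU) tensors
    # with a Cholesky factorization of the Gram matrix.
    gram = A.T @ A
    rhs = A.T @ y
    L = torch.linalg.cholesky(gram)
    return torch.cholesky_solve(rhs.unsqueeze(-1), L).squeeze(-1)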

Device

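The Device subsection is likewise a stub. As a hedged sketch under the assumption that it will mirror the Host version on the GPU, the same solve can be dispatched to CUDA tensors, where torch.linalg routes to GPU-backed kernels (assuming A has full column rank and a CUDA device is available):

import torch

def lstsq_device(A: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    # Move the operands to the GPU if one is present; otherwise fall back to CPU.
    device = "cuda" if torch.cuda.is_available() else "cpu"
    A, y = A.to(device), y.to(device)
    return torch.linalg.lstsq(A, y.unsqueeze(-1)).solution.squeeze(-1).cpu()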

Non-linearly Parametric

  • model
  • solve: sgd/adam
  • eval
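A minimal sketch of this model / solve / eval loop for a non-linearly parametric model, using torch.nn and torch.optim (the architecture, data, and hyperparameters here are illustrative assumptions, not the author's):

import torch
from torch import nn

torch.manual_seed(0)
X = torch.linspace(-3, 3, 256).unsqueeze(-1)
y = torch.sin(X) + 0.1 * torch.randn_like(X)

# model: a small multilayer perceptron, non-linear in its parameters
model = nn.Sequential(nn.Linear(1, 32), nn.Tanh(), nn.Linear(32, 1))

# solve: minimize mean squared error with Adam
opt = torch.optim.Adam(model.parameters(), lr=1e-2)
for step in range(2000):
    loss = ((model(X) - y) ** 2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

# eval: report the final training loss
print(loss.item())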

1.3.3 Non-parametric Models

Gaussian Processes

  • visualization
  • circuit
  • math
  • code
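As a hedged sketch for the code item above, here is Gaussian-process regression with a squared-exponential kernel written directly against torch and torch.linalg (the kernel, data, and noise level are my illustrative assumptions):

import torch

def rbf(a: torch.Tensor, b: torch.Tensor, lengthscale: float = 1.0) -> torch.Tensor:
    # Squared-exponential kernel k(a, b) = exp(-||a - b||^2 / (2 * l^2)).
    d2 = ((a.unsqueeze(1) - b.unsqueeze(0)) ** 2).sum(-1)
    return torch.exp(-0.5 * d2 / lengthscale**2)

# training data and test inputs
X = torch.tensor([[-2.0], [-1.0], [0.0], [1.5]])
y = torch.sin(X).squeeze(-1)
Xs = torch.linspace(-3, 3, 50).unsqueeze(-1)

noise = 1e-2
K = rbf(X, X) + noise * torch.eye(len(X))
Ks = rbf(Xs, X)

# GP posterior mean and covariance: m = Ks K^{-1} y, S = Kss - Ks K^{-1} Ks^T
L = torch.linalg.cholesky(K)
alpha = torch.cholesky_solve(y.unsqueeze(-1), L)
mean = Ks @ alpha
cov = rbf(Xs, Xs) - Ks @ torch.cholesky_solve(Ks.T, L)
print(mean.squeeze(-1)[:5], cov.diagonal()[:5])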

1

Although parametrically non-linear deep neural networks have had major success in generative language modeling (covered in the next section), non-parametric models exhibit properties that industrial large language models lack. Thus, they are covered in this book to serve as a foundation for the discipline of machine learning systems.

The logistic regression model is a discriminative model with a decision boundary (in the case of $y \in \{0, 1\}$, the threshold is usually set to $\tfrac{1}{2}$ so as to divide the mass by 2) that assumes the parameter $\mu$ is affine with respect to $\mathbf{x}$. That is,

$$ \mu = \sigma(z), \qquad z = \mathbf{w}^\top \mathbf{x}, $$

where $\sigma$ is a non-linear function $\sigma: \mathbb{R} \to (0, 1)$, $\sigma(z) = \frac{1}{1 + e^{-z}}$, and where $z$ is referred to as the logit, since the inverse $\sigma^{-1}(\mu) = \log \frac{\mu}{1 - \mu}$ is defined as the log odds ratio.

With the model now defined, the parameter $\mathbf{w}$ needs to be estimated from the data $\mathcal{D} = \{(\mathbf{x}_i, y_i)\}_{i=1}^{N}$. This is done by using the negative log likelihood as the loss function to minimize, so that $\mathbf{w}^*$ is fixed with respect to the data:

$$ \mathbf{w}^* = \arg\min_{\mathbf{w}} \mathcal{L}(\mathbf{w}), \qquad \mathcal{L}(\mathbf{w}) = -\sum_{i=1}^{N} \Big[ y_i \log \mu_i + (1 - y_i) \log (1 - \mu_i) \Big], $$

where (todo: kl -> ce), and where the $\arg\min$ is implemented by first evaluating the gradient $\nabla_{\mathbf{w}} \mathcal{L}(\mathbf{w})$ and then iteratively applying gradient descent for each time step $t$,

$$ \mathbf{w}_{t+1} = \mathbf{w}_t - \alpha \nabla_{\mathbf{w}} \mathcal{L}(\mathbf{w}_t). $$

First, to evaluate the gradient $\nabla_{\mathbf{w}} \mathcal{L}(\mathbf{w})$, the negative log likelihood as loss function is simplified by defining $\mu_i = \sigma(\mathbf{w}^\top \mathbf{x}_i)$, so that $\mathcal{L}(\mathbf{w}) = \sum_{i=1}^{N} \ell_i(\mathbf{w})$ with $\ell_i(\mathbf{w}) = -\big[ y_i \log \mu_i + (1 - y_i) \log(1 - \mu_i) \big]$. Note that $\mu_i$ is not the target label but the probability of the target label. Then, since the derivative is linear, the derivative of the sum is the sum of the derivatives, and taking the derivative for a single example $i$ with respect to a single parameter $w_j$ looks like

$$ \frac{\partial \ell_i}{\partial w_j} = (\mu_i - y_i)\, x_{ij}, $$

and so evaluating the derivative over all examples $i = 1, \dots, N$ looks like

$$ \frac{\partial \mathcal{L}}{\partial w_j} = \sum_{i=1}^{N} (\mu_i - y_i)\, x_{ij}. $$

And so swapping indices for the entire gradient gives $\nabla_{\mathbf{w}} \mathcal{L}(\mathbf{w}) = X^\top (\boldsymbol{\mu} - \mathbf{y})$. Recall now that the second step in implementing the $\arg\min$, after evaluating the gradient, is to iteratively apply gradient descent for each time step $t$, $\mathbf{w}_{t+1} = \mathbf{w}_t - \alpha \nabla_{\mathbf{w}} \mathcal{L}(\mathbf{w}_t)$.
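To connect the derivation back to code, a small sketch (with synthetic, illustrative data) that fits logistic regression by iterating the gradient-descent update, with the gradient written as $X^\top(\boldsymbol{\mu} - \mathbf{y})$:

import torch

torch.manual_seed(0)
N, D = 500, 2
X = torch.randn(N, D)
w_true = torch.tensor([1.5, -2.0])
y = (torch.sigmoid(X @ w_true) > torch.rand(N)).float()   # Bernoulli labels

w = torch.zeros(D)
alpha = 0.1
for t in range(500):
    mu = torch.sigmoid(X @ w)          # mu_i = sigma(w^T x_i)
    grad = X.T @ (mu - y) / N          # gradient of the (mean) NLL
    w = w - alpha * grad               # gradient-descent step

print(w)   # should point in roughly the same direction as w_true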

1.4. Non-Linear Models and their Stochastic Optimizers

With torch.Tensor, torch.distributions, and torch.linalg in hand, we are ready to construct the primary rockstars of deep learning frameworks, following torch.nn and torch.optim. We will see how difficult it is to evaluate the gradient of the loss function of a deep neural network in closed form, motivating the need for automatic differentiation, which we will cover in chapter two.
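As a minimal illustration of that motivation (the sizes and architecture are illustrative), torch's autograd computes every parameter gradient of a small network's loss with a single loss.backward() call; here one coordinate is checked against a finite-difference estimate instead of a hand-derived closed form:

import torch
from torch import nn

torch.manual_seed(0)
x = torch.randn(8, 4)
y = torch.randn(8, 1)
net = nn.Sequential(nn.Linear(4, 16), nn.Tanh(), nn.Linear(16, 1))

def loss_fn():
    return ((net(x) - y) ** 2).mean()

# reverse-mode autodiff: one backward pass yields all parameter gradients
loss = loss_fn()
loss.backward()
w = net[0].weight
autograd_g = w.grad[0, 0].item()

# finite-difference check of that single coordinate
eps = 1e-4
with torch.no_grad():
    w[0, 0] += eps; lp = loss_fn()
    w[0, 0] -= 2 * eps; lm = loss_fn()
    w[0, 0] += eps
print(autograd_g, ((lp - lm) / (2 * eps)).item())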

Anatomy of an Autograd

cuDNN!

Level 1: DNN

Resources

Differentiable Compilation

Resources

Serial Compilation

Parsing llm.c (gpt2) to AST

CFG-SSA + Linear Scan

SoN + Graph Coloring

Parallel Compilation

Differentiable Compilation

Afterword

This textbook focused on prediction and generation. There is also discovery and action.

Prerequisites

Familiarity with the design, training, and inference of machine learning models is required in order to interpret and compile them. This is no different from learning how to program before implementing a programming language itself.

  • Cambridge ITPRNN: Information Theory, Pattern Recognition and Neural Networks by David Mackay
  • Tubingen ML4202: Probabilistic Machine Learning
  • Stanford CS109: Probability for Computer Scientists by Chris Piech
  • Stanford CS229: Machine Learning by Andrew Ng
  • Stanford CS224N: NLP with Deep Learning by Christopher Manning
  • Eureka: Neural Networks Zero to Hero by Andrej Karpathy
  • Stanford CS336: Language Modeling from Scratch by Percy Liang

Corequisites

SCTP follows the breadth-first spirit of SICP's accelerated introduction to computation. SICP iteratively deepens the meaning and semantics of computation by providing concise coverage of substitution, the stack/heap, an operational interpreter, and ends with a register machine. SCTP follows suit with blas/dnn operations, autograd, learning, and compilation. Along your adventure with tensor programs, you may find some of the following specialized courses useful to consult.

  • Brown CS053: Coding the Matrix by Philip Klein
  • MIT 18.S096: Matrix Calculus by Alan Edelman and Steven Johnson
  • Stanford CS149: Parallel Computing by Kayvon Fatahalian
  • Berkeley CS267: Applications of Parallel Computers by Kathy Yelick
  • Berkeley CS265: Compiler Optimization by Max Willsey
  • Cornell CS4120: Compilers by Andrew Myers
  • Cornell CS6120: Advanced Compilers by Adrian Sampson
  • Cornell CS4787: Principles of Large-Scale Machine Learning by Chris De Sa
  • Cornell CS6787: Advanced Machine Learning Systems by Chris De Sa
  • Carnegie Mellon 18-447: Computer Architecture by Onur Mutlu
  • Carnegie Mellon 15-411: Compiler Design by Frank Pfenning
  • Carnegie Mellon 15-745: Optimizing Compilers by Phil Gibbons
  • Carnegie Mellon 10-414: Deep Learning Systems by Tianqi Chen
  • Rice COMP412: Compiler Construction by Keith Cooper
  • Rice COMP512: Advanced Compiler Construction by Keith Cooper

Bibliography

Abelson, Harold. 1996. Structure and Interpretation of Computer Programs, Second Edition. MIT Press.

Aho, Alfred V, Monica S Lam, Ravi Sethi, and Jeffrey D Ullman. 2015. Compilers: Principles, Techniques, & Tools. Pearson.

Bright, Paige, Alan Edelman, and Steven G Johnson. 2025. “Matrix Calculus (for Machine Learning and Beyond).” ArXiv.org. 2025. https://arxiv.org/abs/2501.14787.

Cho, Kyunghyun. 2025. “Machine Learning: A Lecture Note.” ArXiv.org. 2025. https://arxiv.org/abs/2505.03861.

Cooper, Keith D, and Linda Torczon. 2022. Engineering a Compiler. Morgan Kaufmann.

Cormen, Thomas H, Charles Eric Leiserson, Ronald L Rivest, and Clifford Stein. 2009. Introduction to Algorithms. MIT Press.

Goodfellow, Ian, Yoshua Bengio, and Aaron Courville. 2016. Deep Learning. Cambridge, Massachusetts: The MIT Press. https://www.deeplearningbook.org/.

Hack, Sebastian. 2007. Register Allocation for Programs in SSA Form.

Harris, Sarah. 2021. Digital Design and Computer Architecture: RISC-V Edition. S.L.: Morgan Kaufmann Publisher.

Hennessy, John L, and David A Patterson. 2019. Computer Architecture: A Quantitative Approach. Cambridge, Ma: Morgan Kaufmann.

Hwu, Wen-Mei W, David B. Kirk, and Izzat El Hajj. 2022. Programming Massively Parallel Processors: A Hands-on Approach. S.L.: Morgan Kaufmann.

Jurafsky, Dan, and James H. Martin. 2025. “Speech and Language Processing.” Stanford.edu. 2025. https://web.stanford.edu/~jurafsky/slp3/.

Kang, Wanmo, and Kyunghyun Cho. 2025. “Linear Algebra for Data Science.” 2025. https://drive.google.com/file/d/1rQKTjknuHE3HC_9Gyovn4DYZLFe7nBQZ/view.

Klein, Philip N. 2013. Coding the Matrix : Linear Algebra through Applications to Computer Science. Newton, Mass.: Newtonian Press.

Krishnamurthi, Shriram. 2025. “Programming Languages: Application and Interpretation.” Plai. 2025. https://www.plai.org/.

Mackay, David J C. 2003. Information Theory, Inference, and Learning Algorithms. Cambridge: Cambridge University Press.

Møller, Anders, and Michael I Schwartzbach. 2024. “Static Program Analysis.” Cs.au.dk. 2024. https://cs.au.dk/~amoeller/spa/.

Murphy, Kevin P. 2023. Probabilistic Machine Learning: Advanced Concepts. MIT Press.

Murphy, Kevin P. 2022. Probabilistic Machine Learning: An Introduction. Cambridge: MIT Press.

Ng, Andrew, and Tengyu Ma. 2023. “CS229 Lecture Notes.” https://cs229.stanford.edu/main_notes.pdf.

Patt, Yale N, and Sanjay J Patel. 2020. Introduction to Computing Systems : From Bits and Gates to C/C++ & Beyond. New York, Ny: Mcgraw-Hill.

Rastello, Fabrice, and Florent Bouchez Tichadou. 2022. SSA-Based Compiler Design. Springer Nature.

Scardapane, Simone. 2025. “Alice’s Adventures in a Differentiable Wonderland.” ArXiv.org. 2025. https://arxiv.org/abs/2404.17625v3.

Stepanov, Alexander, and Paul McJones. 2019. Elements of Programming. Semigroup Press.

Tarjan, Robert E. 1988. Data Structures and Network Algorithms. Philadelphia: Society For Industrial And Applied Mathematics.

Trefethen, Lloyd N, and David Bau. 1997. Numerical Linear Algebra. SIAM.

Index