Keyboard shortcuts

Press or to navigate between chapters

Press S or / to search in the book

Press ? to show this help

Press Esc to hide this help

1. Elements of Learning

In part one of the book, we transition from the deterministic types of software 1.0 to the stochastic tensors of software 2.0 by training a house price prediction model with numpy's np.ndarray in order to understand the mechanics of the statistical learning approach to artificial intelligence. After that, we will implement our own multidimensional array for teenygrad with

  1. teenygrad.Tensor, providing a multidimensional array
  2. BLAS-like kernels providing acceleration on multi-core processors like CPUs which we will then use to train a language sentiment model.

This will prepare us for the second part of the course where we change 1. the goal from learning price/sentiment to learning sequences and 2. the function class from generalized linear models to deep neural networks

Contents

1.1 Continuously Stochastic Software 2.0

Although separated by 2000 years, the art of programming practiced in Silicon Valley shares the same dream as that of mathematics in Ancient Greece — that of discovering discrete descriptions of reality. While programmers prefer reducing all descriptions down to number (two natural numbers specifically, that of 0 and 1) while the mathematicians prefered reducing down to shape, what is shared in common is the preference towards the deterministically finite. Both programmer and mathematician use tools provided by toolsmiths referred to as interpreter/compiler writers and logicians: the programmer uses the values, types of a programming language, whereas the mathematician uses the proofs, and propositions of a mathematical logic.

Although the discipline of mathematics as originally started in Ancient Greece shared unanimous discomfort in founding descriptions of reality in the continuously infinite the usefulness of said descriptions became apparent as centuries passed, with one prominent example being the so-called sunrise problem in which many mathematicians switched from the deterministic tools of the proposition to the stochastic ones of the probability. History is repeating itself for programmers are faced with the same challenge with a new method of programming software into artificial intelligence. This is the primary thread which will be pulled in this whirlwind tour of the Structure and Interpretation of Tensor Programs. where train deep neural networks like nanogpt with our very own deep learning framework called teenygrad. But to prepare ourselves, let's learn more about the sunrise problem which although appears estoric, may provide valuable insight into how history is rhyming with itself.

1.1.1 Probabilities on Events with Set Theory

1.1.2 Propositions to Beliefs with Conditional Probability

(GOFAI -> machine learning)

1.1.3 Random Variables and their Distributions

  • continous: gaussian (for regression) and exponential (for GLMs)
  • discrete: bernouilli (for classification)

1.1.4 Statistical Inference: Probabilities from Data

  • MLE of gaussian, exponential, and bernouilli
  • MAP of gaussian, exponential, and bernouilli

1.2 Learning Price Directly with Regression in numpy

Although thus far nothing like ChatGPT has been discussed nor implemented, the generative nature of evaluating a probabilistic model and the inferring nature of inverting a function from data serves as the core foundation to the statistical learning approach to artificial intelligence.

Moving forwards, this essence of software 2.0's recovering functions from data continues to hold (in contrast to software 1.0's specifying functions with instructions) throughout the book, while the kinds of data and functions gradually become more complex. After part one, describing reality's fuzzy, continuous, and stochastic phenomena will become second nature, which is exactly what is needed to reproduce human-like capabilities of vision and language albeit with machines. In particular, the second part of the book 2. Neural Networks dives deep into the latter with large language models.

To obtain said foundation, the remaining chapters of part one cover regression and classification and the implementation of high dimensional arrays. In chapter 1.2 Learning Price Directly with Regression in numpy a function of the form is learned from housing data which can then be evaluated to predict the regressor of house prices with numpy's high dimensional vector abstration np.ndarray. Then, in chapter 1.3 Programming High Dimensional Data and Functions with teenygrad and 1.4 Accelerating High Dimensional Functions with the BLAS an accelerated high dimensional vector teenygrad.Tensor is programmed, which is subsequently used in the last chapter 1.5 Learning Sentiment Iteratively with Classification in teenygrad where a function of the form is learned from data which can then be evaluated to predict the classifcation of language sentiment.

1.2.1 The Simplest Inductive Bias

The primary goal of the statistical learning approach to artificial intelligence is induction, which is to use data in order to make accurate predictions in new situations with some function . When there's access to historical data of the expected output for some input , we refer to this regime as supervised learning, for the learning algorithm is being "supervised" with the right answer.

In particular, the dataset is split into a training set , and a testing set . Once the function is recovered from the training set, it's ability to generalize is evaluated on the test set. that is, predict a value given some unseen observation . Machine learning has its roots in statistical inference, and so we also refer to the data as observations or evidence, and the function as hypotheses or models.

For instance, consider the task of house price prediction1 where the collected dataset are input-output pairs of square feet, and dollars, respectively. Let's take a look at this data.

import matplotlib.pyplot as plt
X, Y = [1500, 2100, 800], [500000, 800000, 250000]
plt.scatter(X,Y)
plt.title("Price vs Size")
plt.xlabel('ft^2')
plt.ylabel('$')
plt.show()

In the case of house price prediction, the input space is the size in square feet, and the output space is the price, which means the function to recover has type . Since enumerating through every house size and price pair in is intractable (otherwise all pairs could be inserted inside a lookup), an assumption about the structure of the data needs to be made, referred to as the inductive bias. The simplest assumption to make is to assume that there exists a linear relationship between the data so that is determined by two parameters of slope and intercept . That is,

where is used to denote the differentiation between input and parameter2. Implementing computable looks like the following, where each evaluation of fhat() passes in a random number for the slope parameter of x. Feel free to modify the range for and run the code a few times to see the different values fhat() produces as predictions.

def fhat(x: float, m: float, b: float) -> float:
  yhat = x*m+b
  return yhat

if __name__ == "__main__":
  import random
  X, Y = [1500, 2100, 800], [500000, 800000, 250000]
  m, b = random.uniform(-3.0, 3.0), random.uniform(-3.0, 3.0)

  for (xi,yi) in zip(X,Y):
    yihat = fhat(xi,m,b)
    print(f"expected: ${yi}, actual: ${yihat:.2f}")

And instead of evaluating fhat() for each input xi sequentially in a loop, we can change the type of the input from a scalar to a vector so that the predictions can be evaluated in a single vector-scalar multiplication (and subsequently vector-scalar addition) where

Some languages are adding support for multidimensional arrays in their standard libraries (todo, footenote with cpp), but all numerical computing in python which involve the evaluation of high dimensional functions are conducted in the numpy package which provides the numpy.ndarray data structure and operations accelerated by vector instructions in hardware.

import numpy as np

def fhatbatched(X_n: np.ndarray, m: float, b: float) -> np.ndarray:
  print(f"X_n.shape:{X_n.shape}")
  yhat_n = X_n*m+b # vector-scalar multiply and add uses vectorized hardware instructions
  return yhat_n

if __name__ == "__main__":
  import random
  X, Y = np.array([1500, 2100, 800]), [500000, 800000, 250000]
  m, b = random.uniform(-3.0, 3.0), random.uniform(-3.0, 3.0)

  yhats = fhatbatched(X,m,b) # evaluating fhat on all inputs in one function call
  for y, yhat in zip(Y, yhats):
    print(f"expected: ${y}, actual: ${yhat:.2f}")

Because the two code snippets seem to have run in relatively the same amount of time, a reasonable question at this point is to ask how much faster is the batched implementation with hardware accelerated vector instructions compared to the vanilla-python sequential implementation?

The answer becomes apparent when the size of vectors is increased. For instance, when the number of house data in our dataset increases to , , , the difference in wallclock time between the two implementations becomes more drastic. Run the code below which updates n from 3 to 10,000,000 (which will take a few seconds) in order to measure the difference in wallclock time. (Note that the construction of dataset Y is omitted to simply compare the evaluation of the sequential and vectorized implementations of .)

import numpy as np

def fhat(x: float, m: float, b: float) -> float: return x*m+b
def fhatbatched(X_n: np.ndarray, m: float, b: float) -> np.ndarray: return X_n*m+b

if __name__ == "__main__":
  import random, timeit
  X = np.random.rand(10_000_000)
  m, b = random.uniform(-3.0, 3.0), random.uniform(-3.0, 3.0)

  def sequential(): return [fhat(x,m,b) for x in X]
  def vectorized(): return fhatbatched(X,m,b)
  scalar_time, vector_time = timeit.timeit(sequential, number=5), timeit.timeit(vectorized, number=5)

  print(f"sequential time: {scalar_time:.2f} sec")
  print(f"vectorized time: {vector_time:.2f} sec")

Analyzing the two times printed to stdout, the majority if not all of the execution time was spent waiting for the sequential implementation to complete. This is because np.ndarray's implementations of * and + are accelerated using vectorized hardware instructions, which we will implement for ourselves in sections 1.3 Programming High Dimensional Data and Functions and 1.4 Accelerating High Dimensional Functions with the BLAS. For now, we will continue to use the np.ndarray to carry out the task of price prediction (even though the specific housing dataset we are working with only has 516 entries.)

1.2.2 Fitting a Line in Higher Dimensions

Before we recover the parameters and which characterize the function that best fits the data, let's increase the expressivity of our function in order to model price not simply as a relation of size but also of the other variables available from the dataset. This means modifying the domain of our function from to so that house price is modeled as a function of form

where is the vector of parameters and so that is the bias. Note that we are now using to denote the size of the feature space rather than slope so that accepts not a scalar but a vector , with the codomain remaining . So for instance, is size, is age, is the number of bedrooms, is the location, each weighted by a parameter indicating how important that feature is relative to the final price. We can still define the batched evaluation over the entire dataset with

where is a data/design matrix which stores vectors in . In Python, this looks like the following:

import numpy as np

def fhatbatched(X_n: np.ndarray, theta: np.ndarray) -> np.ndarray:
  print(f"X_n.shape:{X_n.shape}")
  yhat_n = X_n@theta # vector-scalar multiply and add uses vectorized hardware instructions
  return yhat_n

if __name__ == "__main__":
  import random
  X, Y = np.array([[1500], [2100], [800]]), [500000, 800000, 250000]  # X is (n, m) where n=3, m=1
  theta = np.random.uniform(-3.0, 3.0, size=(1,))  # theta is shape (m,)

  yhats = fhatbatched(X,theta) # evaluating fhat on all inputs in one function call
  for y, yhat in zip(Y, yhats):
    print(f"expected: ${y}, actual: ${yhat:.2f}")

1.2.3 The Inductive Principle of Maximizing Likelihood

After the inductive bias on the family of functions has been made, the learning algorithm must find the function with a good fit. Since artificial learning algorithms don't have visual cortex like biological humans2, the notion of "good fit" needs to defined in a systematic fashion. This is done by selecting the parameter which maximizes the likelihood of the data . Returning to the linear regression inductive bias we've selected to model the house price data, we assume there exists noise in both our model (epistemic uncertainty) and data (aleatoric uncertainty), so that where

prices are normally distributed conditioned on seeing the features with the mean being the equation of the line where , then we have that

Returning to the linear regression model, we can solve this optimization with a direct method using normal equations.

def fhatbatched(X_n: np.ndarray, m: float, b: float) -> np.ndarray: return X_n*m+b

if __name__ == "__main__":
  X, Y = np.array([1500, 2100, 800]), np.array([500000, 800000, 250000]) #  data

  X_b = np.column_stack((np.ones_like(X), X))              # [1, x]
  bhat, mhat = np.linalg.solve(X_b.T @ X_b, X_b.T @ Y)   # w = [b, m]

  yhats = fhatbatched(X, mhat, bhat) # yhat
  for y, yhat in zip(Y, yhats):
    print(f"expected: ${y:.0f}, actual: ${yhat:.2f}")

To summarize, we have selected and computed

  1. an inductive bias with the family of linear functions
  2. an inductive principle with the least squared loss
  3. the parameters which minimze the empirical risk, denoted as

Together, the inductive bias describes the relationship between the input and output spaces, the inductive principle is the loss function that measures prediction accuracy, and the minimization of the empirical risk finds the parameters for the best predictor.

1.2.4 Fitting a Line Directly with Normal Equations

1.2.5 Fitting a Line in Non-Linear Feature Space

Feature Space: polynomials.

1.2.6 Generalization, Bias-Variance Tradeoff and Double Descent

1.2 Programming High Dimensional Data and Functions

Now that we have an operational understanding of how statistical learning employ's the power of the multidimensional array abstraction in practice with linear regression using the numpy.ndarray, we will now implement our very own version of the multidimensional array which we'll define under teenygrad.Tensor.

1.3.1 The Simplest Tensor Implementation with Nested lists

Let's start with the simplest implementation idea for teenygrad.Tensor, which is a nested array implementation with Python's built-in list data structure. For some we will use a float, for , we will use a list[float], and for arbitary we will use a list[list[float]]. In fact, this is simply what we started doing in the previous chapter when implementing our linear regression model before switching to numpy's accelerated nd.array.

# teenygrad/__init__.py

def tensor(data) -> Tensor: raise NotImplementedError("")
# teenygrad/tensor.py
class Tensor():
  def __init__(self, data): raise NotImplementedError("")
import tinygrad

if __name__ == "__main__":
  x = teenygrad.tensor([[1.1, 2.2],
                        [3.3, 4.4],
                        [5.5, 6.6]])

1.3.2 Tensors with Strides

strides Mapping Virtual Shapes to Physical Storage

class Tensor():

1.3.2 Tensor.matmul(): Naive Vector/Matrix Operations

class Tensor():
  # ...shape, stride, offset, storage

  def __matmul__(self):
    raise NotImplementedError("")

...

1.3.3 Accelerating Interpreted Language Implementations

1.3.4 CPython Extension Modules via PyO3 and Rust

CPython with Rust Extension Modules via PyO3 <!-- jit: pypy. aot: cython/mypyc. bindings: two guides

  • https://docs.python.org/3/extending/index.html
  • https://docs.python.org/3/c-api/index.html

https://www.youtube.com/watch?v=umLZphwA-dw

python c api used for 2 purposes

  1. extension modules
  2. embedding cpython cpython native extension modules

pyo3. recommended tools https://docs.python.org/3/c-api/intro.html#c-api-tools

history

  • https://www.boost.org/doc/libs/1_58_0/libs/python/doc/
  • https://github.com/pybind/pybind11
  • https://nanobind.readthedocs.io/en/latest/
  • https://github.com/dgrunwald/rust-cpython

https://github.com/PyO3/pyo3/blob/main/Architecture.md

https://github.com/openai/tiktoken https://github.com/huggingface/tokenizers/tree/main/bindings/python https://github.com/pola-rs/polars https://www.youtube.com/watch?v=z_Eiy2W0APU

1.3 Accelerating Basic Linear Algebra Subroutines

... talk about BLAS

  • 1971 handbook for automatic computation (ALGOL60)
  • 1971 EISPACK (FORTRAN66)
  • 1973,1974,1975 draft bLAS
  • 1979 LINPACK users guide (dongarra, moler) linear systems. linear least squares
  • 1979 BLAS1 paper (LINPACK benchmark in appendix)

moore's law -> bell's law mainframes: vector-vector BLAS1 number of loops and the edition minis: matrix-vector BLAS2 number of loops and the edition BLAS2 paper micros: matrix-matrix BLAS3 number of loops and the edition

  • http://www.openmathlib.org/OpenBLAS/docs/faq/
  • https://dl.acm.org/doi/epdf/10.1145/355841.355847
  • https://www.netlib.org/utk/people/JackDongarra/PAPERS/An-Extended-Set-of-BLAS-Model-Implementation-and-Test-Programs.pdf
  • https://www.netlib.org/utk/people/JackDongarra/PAPERS/Set-of-Level-3-Basic-Linear-Algebra-Subprograms.pdf
  • https://www.cs.utexas.edu/~flame/pubs/GotoTOMS_revision.pdf
  • https://www.youtube.com/watch?v=JzNpKDW07rw

1.4.1 The SISD Fantasy of Latency-Oriented CPUs

von neumann architecture... ... talk about processor microarchitecture and memory hierarchies

1.4.2 Bottlenecks, Rooflines, and Benchmarks

... talk about bottlenecks (intensity, and rooflines)

ROOFLINES (for computer architectures which look like level 3) compute got cheaper memory hierarchies got deeper minimize surface/volume ratio compute/memory flops are free, bandwidth is money, latency is physics

level3 blas designed for these architectures. thus matmul because arithmetic intensity. is there a blas4? not really.

1.4.3 Accelerating GEMM from Rust to ASM

1.4.4 Accelerating GEMV and GEMM on CPU with Locality

#![allow(unused)]
fn main() {
use std::ops::{Add, Mul};

/// Textbook matrix multiply: C = A @ B
/// A is m×k, B is k×n, result C is m×n.
///
/// Matrices are row-major Vec<Vec<T>> for readability.
pub fn matmul<T>(a: &[Vec<T>], b: &[Vec<T>]) -> Result<Vec<Vec<T>>, String>
where T: Copy + Add<Output = T> + Mul<Output = T> + Default,
{
  if a.is_empty() || b.is_empty() { return Err("matmul: empty matrix".into()); }
  let m = a.len();
  let k = a[0].len();

  if a.iter().any(|row| row.len() != k) { return Err("matmul: A is not rectangular".into()); } // Ensure B is rectangular
  let k2 = b.len();
  let n = b[0].len();
  
  if b.iter().any(|row| row.len() != n) { return Err("matmul: B is not rectangular".into()); } // Ensure B is rectangular
  if k != k2 { return Err(format!( "matmul: shape mismatch: A is {}x{}, B is {}x{}", m, k, k2, n )) }

  let mut c = vec![vec![T::default(); n]; m]; // Allocate C (m×n) filled with zeros (Default::default)

  // Triple loop: i, j, p
  for i in 0..m {
    for j in 0..n {
      let mut acc = T::default();
      for p in 0..k { acc = acc + a[i][p] * b[p][j]; }
      c[i][j] = acc;
    }
  }

  Ok(c)
}
}
; matmul_f32(C, A, B, M, N, K)
; SysV ABI args:
;   rdi = C
;   rsi = A
;   rdx = B
;   rcx = M
;   r8  = N
;   r9  = K
;
; row-major:
;   A[i*K + k], B[k*N + j], C[i*N + j]

global matmul_f32
section .text

matmul_f32:
  push rbx
  push r12
  push r13
  push r14
  push r15

1.4.5 Accelerating GEMV and GEMM on CPU with Parallelism

1.5 Learning Sentiment Iteratively with Classification

We'll now tackle our first learning problem in the domain of language with sentiment analysis — classifying whether a given piece of text has a positive or negative connotation — by modifying

  1. our model to support discrete categorical targets
  2. our optimization method to an iterative algorithm, namely gradient descent

Before diving into the model to support discrete categorical targets in , we will take a closer look into the optimization method of gradient descent because there are two reasons for why we want to move from direct to iterative methods, the first being memory bottlenecks, and the second being non-linear function classes.

1.5.1 Linear Regression, and the Limitations of Direct Methods

The first reason why we want to switch from direct to iterative optimization methods is because direct methods are memory-bound on materializing the matrix .

1.5.2 Inductive Bias and Principle with Logistic Regression and Cross Entropy

The second reason why we want to switch from direct to iterative optimization methods is because even if the number of dimensions is small enough so that we are not memory-bound, direct methods simply will not work for other function classes besides linear regression. Consider ___.

Inductive Bias

In order to train a model which learns the sentiment of text, we will collect data where the input space are feature vectors and the output space are and which encode the negative and positive cases of binary classification[^6] (we will expand the number of categorical responses in the next section with multi-class classification). Then the function of interest we'd like to recover from the data is of form . Recall that machine learning consists of selecting a family of functions as the inductive bias, a loss function as the inductive principle, and estimating the parameters by empirically minimizing the risk.

For the inputs,

For the model class , we will continue to use a a weighted sum of the input vector with where so that , but add the function where referred to as the logistic or sigmoid[^7] function so that the output is a valid probability. We'll interpret as the log odds and as and the complement as . However, the current type of is , whereas the task is to assign a negative () or positive () sentiment to the provided input feature vector . To ammeliorate this we will add a final decision boundary where

(FIGURE.MODEL ARCHITECTURE)

We can vectorize the evaluation of on each with a data matrix so that

import torch

def f(xi: torch.Tensor, w: torch.Tensor) -> torch.Tensor:
  """Compute sigmoid(w^T x) for a single example xi."""
  logits = torch.matmul(w, xi)
  return torch.sigmoid(logits)


if __name__ == "__main__":
  w = torch.randn(3)
  X = torch.randn(5, 3)

  for xi in X: # does this work?
    yi_hat = f(xi, w)
    print(yi_hat.item())

For example, let's trace through the evaluation of our model with the following input example :

Inductive Principle

For the loss function

1.5.3 Iterative Optimization via Gradient Descent

For estimating the parameters that minimize the loss , we will optimize iteratively using gradient descent, for the two reasons already mentioned of being memory-bottlenecked on materializing the with the linear regression model class, and, with other classes of functions (which is currently the case), we don't have ___.

Gradient descent, simply put, is an optimization method that uses differential calculus, namely, the fact that the gradient provides hints — the direction of steepest descent — on how to iteratively modify the parameters closer to an optimum. The gradient descent algorithm can be expressed in a one line update rule so that the goal of is implemented by:

where is referred to as the learning rate or step size. We now dive deeper into differential calculus and generalize the derivative to higher dimensional spaces[^8] to better understand what's happening under the hood.

1.5.4 Multinomial Logistic Regression

1.5.5 Generalized Linear Models



  1. While predicting house prices is not AGI, we follow the canonical hello world of machine learnign in order to provide an easy example while introducing the foundational mechanics of statistical learning, which are crucial to understanding part two and part three of the book where we cover more industrial deep learning. In software 1.0 you should know how to reverse a linked list. In software 2.0 you should know how to fit a line.

  2. If you are familiar with functional programming, this is the mathematical equivalent of currying via closure. ↩2