1. Elements of Learning
In part one of the book, we transition from the deterministic types of software 1.0 to the stochastic tensors of software 2.0 by training a house price prediction model with numpy's np.ndarray in order to understand the mechanics of the statistical learning approach to artificial intelligence. After that, we will implement our own multidimensional array, teenygrad.Tensor, along with BLAS-like kernels that provide acceleration on multi-core processors such as CPUs, which we will then use to train a language sentiment model. This will prepare us for the second part of the book, where we change (1) the goal from learning price/sentiment to learning sequences and (2) the function class from generalized linear models to deep neural networks.
Contents
- 1.1 Continuously Stochastic Software 2.0 with scipy
- 1.2 Regression: Learning Price Directly with numpy
- 1.3 Programming High Dimensional Data and Functions with teenygrad
- 1.4 Accelerating Basic Linear Algebra Subroutines with teenygrad
- 1.5 Classification: Learning Sentiment Iteratively with teenygrad
1.1 Continuously Stochastic Software 2.0
Although separated by two thousand years, the art of programming practiced in Silicon Valley shares the same dream as that of mathematics in Ancient Greece: discovering discrete descriptions of reality. Programmers prefer reducing all descriptions down to number (two numerals specifically, 0 and 1), while the mathematicians preferred reducing down to shape, but what they share is a preference for the deterministically finite. Both programmer and mathematician use tools provided by toolsmiths, referred to as interpreter/compiler writers and logicians: the programmer uses the values and types of a programming language, whereas the mathematician uses the proofs and propositions of a mathematical logic.
Although the discipline of mathematics as originally practiced in Ancient Greece shared a unanimous discomfort in founding descriptions of reality on the continuously infinite, the usefulness of such descriptions became apparent as the centuries passed, with one prominent example being the so-called sunrise problem, in which many mathematicians switched from the deterministic tools of the proposition to the stochastic ones of the probability.
History is repeating itself: programmers are faced with the same challenge with a new method of programming software into artificial intelligence. This is the primary thread that will be pulled in this whirlwind tour of the Structure and Interpretation of Tensor Programs, where we train deep neural networks like nanogpt with our very own deep learning framework called teenygrad. But to prepare ourselves, let's learn more about the sunrise problem which, although it appears esoteric, provides valuable insight into how history is rhyming with itself.
1.1.1 Probabilities on Events with Set Theory
1.1.2 Propositions to Beliefs with Conditional Probability
(GOFAI -> machine learning)
1.1.3 Random Variables and their Distributions
- continuous: Gaussian (for regression) and exponential (for GLMs)
- discrete: Bernoulli (for classification); both are sketched with scipy below
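Since this chapter is still in outline form, here is a minimal sketch of evaluating and sampling these distributions with scipy.stats (using scipy here is an assumption based on the table of contents; the parameter values are illustrative):

from scipy import stats

gaussian = stats.norm(loc=0.0, scale=1.0)    # continuous: regression noise model
exponential = stats.expon(scale=1.0)         # continuous: GLM example
bernoulli = stats.bernoulli(p=0.7)           # discrete: binary classification

print(gaussian.pdf(0.0))      # density at x = 0
print(exponential.pdf(1.0))   # density at x = 1
print(bernoulli.pmf(1))       # probability of the positive class
print(gaussian.rvs(size=3))   # draw 3 samples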
1.1.4 Statistical Inference: Probabilities from Data
- MLE of Gaussian, exponential, and Bernoulli (sketched below)
- MAP of Gaussian, exponential, and Bernoulli
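And a minimal sketch of the closed-form maximum likelihood estimators for the same three distributions (the data values are illustrative; the MAP variants would add a prior before maximizing):

import numpy as np

data_gauss = np.array([2.1, 1.9, 2.4, 2.0])
mu_mle = data_gauss.mean()                        # Gaussian: MLE mean is the sample mean
sigma2_mle = ((data_gauss - mu_mle) ** 2).mean()  # Gaussian: MLE variance divides by n, not n-1

data_expon = np.array([0.5, 1.2, 0.7])
lambda_mle = 1.0 / data_expon.mean()              # exponential: MLE rate is 1 / sample mean

data_bern = np.array([1, 0, 1, 1])
p_mle = data_bern.mean()                          # Bernoulli: MLE p is the fraction of ones

print(mu_mle, sigma2_mle, lambda_mle, p_mle)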
1.2 Learning Price Directly with Regression in numpy
Although thus far nothing like ChatGPT has been discussed nor implemented, the generative nature of evaluating a probabilistic model and the inferential nature of recovering a function from data serve as the core foundation of the statistical learning approach to artificial intelligence.
Moving forwards, this essence of software 2.0's recovering functions from data continues to hold (in contrast to software 1.0's specifying functions with instructions) throughout the book, while the kinds of data and functions gradually become more complex. After part one, describing reality's fuzzy, continuous, and stochastic phenomena will become second nature, which is exactly what is needed to reproduce human-like capabilities of vision and language albeit with machines. In particular, the second part of the book 2. Neural Networks dives deep into the latter with large language models.
To obtain said foundation, the remaining chapters of part one cover regression and classification and the implementation of high dimensional arrays. In chapter 1.2 Learning Price Directly with Regression in numpy, a function of the form $\hat{f}(x) = mx + b$ is learned from housing data, which can then be evaluated to predict the regression target of house prices with numpy's high dimensional array abstraction np.ndarray. Then, in chapters 1.3 Programming High Dimensional Data and Functions with teenygrad and 1.4 Accelerating High Dimensional Functions with the BLAS, an accelerated high dimensional array teenygrad.Tensor is programmed, which is subsequently used in the last chapter 1.5 Learning Sentiment Iteratively with Classification in teenygrad, where a function of the form $\hat{f}(\mathbf{x}) = \sigma(\mathbf{w}^\top \mathbf{x})$ is learned from data, which can then be evaluated to predict the classification of language sentiment.
1.2.1 The Simplest Inductive Bias
The primary goal of the statistical learning approach to artificial intelligence is induction, which is to use data in order to make accurate predictions in new situations with some function $\hat{f}$. When there's access to historical data of the expected output $y$ for some input $x$, we refer to this regime as supervised learning, for the learning algorithm is being "supervised" with the right answer.
In particular, the dataset $\mathcal{D}$ is split into a training set $\mathcal{D}_{\text{train}}$ and a testing set $\mathcal{D}_{\text{test}}$. Once the function $\hat{f}$ is recovered from the training set, its ability to generalize is evaluated on the test set; that is, to predict a value $\hat{y}$ given some unseen observation $x$. Machine learning has its roots in statistical inference, and so we also refer to the data as observations or evidence, and the function as hypotheses or models.
For instance, consider the task of house price prediction[^1] where the collected dataset consists of input-output pairs $(x_i, y_i)$ of square feet and dollars, respectively. Let's take a look at this data.
import matplotlib.pyplot as plt
X, Y = [1500, 2100, 800], [500000, 800000, 250000]
plt.scatter(X,Y)
plt.title("Price vs Size")
plt.xlabel('ft^2')
plt.ylabel('$')
plt.show()
In the case of house price prediction, the input space $\mathcal{X} = \mathbb{R}$ is the size in square feet, and the output space $\mathcal{Y} = \mathbb{R}$ is the price, which means the function to recover has type $f: \mathcal{X} \to \mathcal{Y}$. Since enumerating through every house size and price pair in $\mathcal{X} \times \mathcal{Y}$ is intractable (otherwise all pairs could be inserted inside a lookup table), an assumption about the structure of the data needs to be made, referred to as the inductive bias. The simplest assumption to make is that there exists a linear relationship in the data, so that $\hat{f}$ is determined by two parameters, the slope $m$ and intercept $b$. That is,

$\hat{f}(x; m, b) = mx + b$

where the semicolon is used to denote the differentiation between input and parameters[^2].
Implementing a computable $\hat{f}$ looks like the following, where each evaluation of fhat() passes in randomly sampled values for the slope $m$ and intercept $b$.
Feel free to modify the range passed to random.uniform() and run the code a few times to see the different values fhat() produces as predictions.
def fhat(x: float, m: float, b: float) -> float:
    yhat = x*m+b
    return yhat

if __name__ == "__main__":
    import random
    X, Y = [1500, 2100, 800], [500000, 800000, 250000]
    m, b = random.uniform(-3.0, 3.0), random.uniform(-3.0, 3.0)
    for (xi, yi) in zip(X, Y):
        yihat = fhat(xi, m, b)
        print(f"expected: ${yi}, predicted: ${yihat:.2f}")
And instead of evaluating fhat() for each input xi sequentially in a loop, we can change the type of the input from a scalar to a vector so that the predictions can be evaluated in a single vector-scalar multiplication (and subsequently vector-scalar addition) where

$\hat{\mathbf{y}} = m\mathbf{x} + b$

Some languages are adding support for multidimensional arrays in their standard libraries (todo: footnote on C++), but virtually all numerical computing in Python that involves the evaluation of high dimensional functions is conducted with the numpy package, which provides the numpy.ndarray data structure and operations accelerated by vector instructions in hardware.
import numpy as np

def fhatbatched(X_n: np.ndarray, m: float, b: float) -> np.ndarray:
    print(f"X_n.shape:{X_n.shape}")
    yhat_n = X_n*m+b # vector-scalar multiply and add use vectorized hardware instructions
    return yhat_n

if __name__ == "__main__":
    import random
    X, Y = np.array([1500, 2100, 800]), [500000, 800000, 250000]
    m, b = random.uniform(-3.0, 3.0), random.uniform(-3.0, 3.0)
    yhats = fhatbatched(X, m, b) # evaluating fhat on all inputs in one function call
    for y, yhat in zip(Y, yhats):
        print(f"expected: ${y}, predicted: ${yhat:.2f}")
Because the two code snippets seem to have run in roughly the same amount of time, a reasonable question at this point is to ask how much faster the batched implementation with hardware accelerated vector instructions is compared to the vanilla-Python sequential implementation.
The answer becomes apparent when the size of the vectors is increased.
For instance, when the number of houses in our dataset increases from 3 into the millions, the difference in wallclock time between the two implementations becomes drastic.
Run the code below, which updates n from 3 to 10,000,000 (and will take a few seconds), in order to measure the difference in wallclock time.
(Note that the construction of the dataset Y is omitted to simply compare the evaluation of the sequential and vectorized implementations of $\hat{f}$.)
import numpy as np

def fhat(x: float, m: float, b: float) -> float: return x*m+b
def fhatbatched(X_n: np.ndarray, m: float, b: float) -> np.ndarray: return X_n*m+b

if __name__ == "__main__":
    import random, timeit
    X = np.random.rand(10_000_000)
    m, b = random.uniform(-3.0, 3.0), random.uniform(-3.0, 3.0)
    def sequential(): return [fhat(x, m, b) for x in X]
    def vectorized(): return fhatbatched(X, m, b)
    scalar_time, vector_time = timeit.timeit(sequential, number=5), timeit.timeit(vectorized, number=5)
    print(f"sequential time: {scalar_time:.2f} sec")
    print(f"vectorized time: {vector_time:.2f} sec")
Analyzing the two times printed to stdout, the majority if not all of the execution time
was spent waiting for the sequential implementation to complete.
This is because np.ndarray's implementations of * and + are accelerated using
vectorized hardware instructions, which we will implement for ourselves in sections
1.3 Programming High Dimensional Data and Functions and
1.4 Accelerating High Dimensional Functions with the BLAS.
For now, we will continue to use the np.ndarray to carry out the task of price prediction
(even though the specific housing dataset we are working with only has 516 entries.)
1.2.2 Fitting a Line in Higher Dimensions
Before we recover the parameters $m$ and $b$ which characterize the function $\hat{f}$ that best fits the data, let's increase the expressivity of our function in order to model price not simply as a relation of size but also of the other variables available from the dataset. This means modifying the domain of our function from $\mathbb{R}$ to $\mathbb{R}^m$ so that house price is modeled as a function of the form

$\hat{f}(\mathbf{x}; \boldsymbol{\theta}) = \boldsymbol{\theta}^\top \mathbf{x}$

where $\boldsymbol{\theta} \in \mathbb{R}^m$ is the vector of parameters and $x_0 = 1$ so that $\theta_0$ is the bias. Note that we are now using $m$ to denote the size of the feature space rather than the slope, so that $\hat{f}$ accepts not a scalar $x \in \mathbb{R}$ but a vector $\mathbf{x} \in \mathbb{R}^m$, with the codomain remaining $\mathbb{R}$. So for instance, $x_1$ is the size, $x_2$ is the age, $x_3$ is the number of bedrooms, $x_4$ is the location, each weighted by a parameter $\theta_i$ indicating how important that feature is relative to the final price. We can still define the batched evaluation over the entire dataset with

$\hat{\mathbf{y}} = X\boldsymbol{\theta}$

where $X \in \mathbb{R}^{n \times m}$ is a data/design matrix which stores $n$ vectors in $\mathbb{R}^m$. In Python, this looks like the following:
import numpy as np

def fhatbatched(X_n: np.ndarray, theta: np.ndarray) -> np.ndarray:
    print(f"X_n.shape:{X_n.shape}")
    yhat_n = X_n@theta # matrix-vector product uses vectorized hardware instructions
    return yhat_n

if __name__ == "__main__":
    X, Y = np.array([[1500], [2100], [800]]), [500000, 800000, 250000] # X is (n, m) where n=3, m=1
    theta = np.random.uniform(-3.0, 3.0, size=(1,)) # theta is shape (m,)
    yhats = fhatbatched(X, theta) # evaluating fhat on all inputs in one function call
    for y, yhat in zip(Y, yhats):
        print(f"expected: ${y}, predicted: ${yhat:.2f}")
1.2.3 The Inductive Principle of Maximizing Likelihood
After the inductive bias on the family of functions has been made, the learning algorithm must find the function with a good fit. Since artificial learning algorithms don't have a visual cortex like biological humans[^2], the notion of "good fit" needs to be defined in a systematic fashion. This is done by selecting the parameters $\boldsymbol{\theta}$ which maximize the likelihood of the data $p(\mathcal{D} \mid \boldsymbol{\theta})$. Returning to the linear regression inductive bias we've selected to model the house price data, we assume there exists noise in both our model (epistemic uncertainty) and data (aleatoric uncertainty), so that $y = \boldsymbol{\theta}^\top \mathbf{x} + \epsilon$ with $\epsilon \sim \mathcal{N}(0, \sigma^2)$; that is, prices are normally distributed conditioned on seeing the features, with the mean being the equation of the line, so that $p(y \mid \mathbf{x}; \boldsymbol{\theta}) = \mathcal{N}(y \mid \boldsymbol{\theta}^\top \mathbf{x}, \sigma^2)$. Maximizing this likelihood over the dataset turns out to be equivalent to minimizing the sum of squared errors.
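Spelled out (a sketch of the standard derivation, consistent with the Gaussian noise model above), the log likelihood of the dataset is

$\log p(\mathcal{D} \mid \boldsymbol{\theta}) = \sum_{i=1}^{n} \log \mathcal{N}\left(y_i \mid \boldsymbol{\theta}^\top \mathbf{x}_i, \sigma^2\right) = -\frac{1}{2\sigma^2} \sum_{i=1}^{n} \left(y_i - \boldsymbol{\theta}^\top \mathbf{x}_i\right)^2 + \text{const}$

so maximizing the likelihood is the same as minimizing the sum of squared errors, and setting the gradient with respect to $\boldsymbol{\theta}$ to zero yields the normal equations $X^\top X \hat{\boldsymbol{\theta}} = X^\top \mathbf{y}$.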
Returning to the linear regression model, we can solve this optimization with a direct method using normal equations.
import numpy as np

def fhatbatched(X_n: np.ndarray, m: float, b: float) -> np.ndarray: return X_n*m+b

if __name__ == "__main__":
    X, Y = np.array([1500, 2100, 800]), np.array([500000, 800000, 250000]) # data
    X_b = np.column_stack((np.ones_like(X), X)) # [1, x]
    bhat, mhat = np.linalg.solve(X_b.T @ X_b, X_b.T @ Y) # w = [b, m]
    yhats = fhatbatched(X, mhat, bhat) # yhat
    for y, yhat in zip(Y, yhats):
        print(f"expected: ${y:.0f}, predicted: ${yhat:.2f}")
To summarize, we have selected and computed
- an inductive bias with the family of linear functions
- an inductive principle with the least squares loss
- the parameters which minimize the empirical risk, denoted as $\hat{\boldsymbol{\theta}}$
Together, the inductive bias describes the relationship between the input and output spaces, the inductive principle is the loss function that measures prediction accuracy, and the minimization of the empirical risk finds the parameters for the best predictor.
1.2.4 Fitting a Line Directly with Normal Equations
1.2.5 Fitting a Line in Non-Linear Feature Space
Feature Space: polynomials.
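This section is still a stub; as a placeholder, here is a minimal sketch (the degree and the rescaled data are illustrative assumptions) of how a polynomial feature map plugs into the same normal-equations fit from chapter 1.2:

import numpy as np

def poly_features(x: np.ndarray, degree: int) -> np.ndarray:
    """Map a 1-D input to [1, x, x^2, ..., x^degree] so the model stays linear in theta."""
    return np.stack([x**d for d in range(degree + 1)], axis=1)

if __name__ == "__main__":
    x = np.array([1.5, 2.1, 0.8])   # square feet in thousands, to keep the design matrix well conditioned
    y = np.array([500000.0, 800000.0, 250000.0])
    Phi = poly_features(x, degree=2)                  # (n, degree+1) design matrix
    theta = np.linalg.solve(Phi.T @ Phi, Phi.T @ y)   # same normal equations as before
    print(Phi @ theta)                                # predictions on the training inputs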
1.2.6 Generalization, Bias-Variance Tradeoff and Double Descent
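Also a stub; below is a minimal sketch of the kind of experiment this section will presumably walk through, comparing training and test error as the polynomial degree grows (the synthetic data and degrees are illustrative assumptions):

import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=30)
y = np.sin(3 * x) + rng.normal(0, 0.1, size=30)          # noisy synthetic target
x_train, y_train, x_test, y_test = x[:20], y[:20], x[20:], y[20:]

def fit_poly(x, y, degree):
    Phi = np.stack([x**d for d in range(degree + 1)], axis=1)
    return np.linalg.lstsq(Phi, y, rcond=None)[0]         # least squares fit of theta

def mse(theta, x, y):
    Phi = np.stack([x**d for d in range(len(theta))], axis=1)
    return np.mean((Phi @ theta - y) ** 2)

for degree in [1, 3, 9, 15]:                              # low degree underfits, high degree overfits
    theta = fit_poly(x_train, y_train, degree)
    print(degree, mse(theta, x_train, y_train), mse(theta, x_test, y_test))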
1.3 Programming High Dimensional Data and Functions
Now that we have an operational understanding of how statistical learning employs the power of the multidimensional array abstraction in practice, having used the numpy.ndarray for linear regression, we will implement our very own version of the multidimensional array, which we'll define as teenygrad.Tensor.
1.3.1 The Simplest Tensor Implementation with Nested Lists
Let's start with the simplest implementation idea for teenygrad.Tensor, which is a nested array implementation with Python's built-in list data structure. For a scalar $x \in \mathbb{R}$ we will use a float, for a vector $\mathbf{x} \in \mathbb{R}^n$ we will use a list[float], and for a matrix $X \in \mathbb{R}^{n \times m}$ we will use a list[list[float]]. In fact, this is simply what we started doing in the previous chapter when implementing our linear regression model before switching to numpy's accelerated np.ndarray.
# teenygrad/__init__.py
from teenygrad.tensor import Tensor
def tensor(data) -> Tensor: raise NotImplementedError("")

# teenygrad/tensor.py
class Tensor:
    def __init__(self, data): raise NotImplementedError("")

# main.py
import teenygrad

if __name__ == "__main__":
    x = teenygrad.tensor([[1.1, 2.2],
                          [3.3, 4.4],
                          [5.5, 6.6]])
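A minimal sketch of where the nested-list idea leads (the shape-inference helper and its name are assumptions on my part, not teenygrad's actual API):

# Hypothetical nested-list sketch, not the final teenygrad implementation.
class Tensor:
    def __init__(self, data):
        self.data = data                        # float, list[float], or list[list[float]]
        self.shape = self._infer_shape(data)

    @staticmethod
    def _infer_shape(data) -> tuple:
        if not isinstance(data, list):
            return ()                           # scalar
        return (len(data),) + Tensor._infer_shape(data[0])

if __name__ == "__main__":
    x = Tensor([[1.1, 2.2], [3.3, 4.4], [5.5, 6.6]])
    print(x.shape)                              # (3, 2)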
1.3.2 Tensors with Strides
Strides: Mapping Virtual Shapes to Physical Storage
class Tensor():
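A minimal sketch of the idea, assuming a flat row-major storage buffer plus shape/strides/offset metadata (the exact fields and indexing API are assumptions, not necessarily teenygrad's final interface):

# Hypothetical sketch: a 2-D Tensor whose virtual shape is a view over flat row-major storage.
class Tensor:
    def __init__(self, storage: list, shape: tuple, offset: int = 0):
        self.storage = storage                      # flat physical buffer
        self.shape = shape                          # virtual (rows, cols)
        self.strides = (shape[1], 1)                # row-major strides
        self.offset = offset

    def __getitem__(self, idx: tuple) -> float:
        i, j = idx
        # translate the virtual index (i, j) into a physical position in the flat buffer
        return self.storage[self.offset + i * self.strides[0] + j * self.strides[1]]

if __name__ == "__main__":
    x = Tensor([1.1, 2.2, 3.3, 4.4, 5.5, 6.6], shape=(3, 2))
    print(x[2, 1])   # 6.6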
1.3.3 Tensor.matmul(): Naive Vector/Matrix Operations
class Tensor:
    # ...shape, stride, offset, storage
    def __matmul__(self, other):
        raise NotImplementedError("")
    ...
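To make the stub concrete, here is a naive pure-Python triple-loop matmul over nested lists (a sketch only; teenygrad's __matmul__ would operate on its own storage layout):

# Naive O(m*n*k) matrix multiply over nested lists.
def matmul(a: list[list[float]], b: list[list[float]]) -> list[list[float]]:
    m, k = len(a), len(a[0])
    k2, n = len(b), len(b[0])
    assert k == k2, "inner dimensions must match"
    c = [[0.0] * n for _ in range(m)]
    for i in range(m):
        for j in range(n):
            for p in range(k):
                c[i][j] += a[i][p] * b[p][j]
    return c

if __name__ == "__main__":
    print(matmul([[1.0, 2.0], [3.0, 4.0]], [[5.0, 6.0], [7.0, 8.0]]))  # [[19.0, 22.0], [43.0, 50.0]]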
1.3.4 Accelerating Interpreted Language Implementations
1.3.5 CPython Extension Modules via PyO3 and Rust
Notes:
- Approaches to accelerating interpreted Python: JIT compilation (PyPy), AOT compilation (Cython/mypyc), and native extension modules (bindings).
- Two guides: https://docs.python.org/3/extending/index.html and https://docs.python.org/3/c-api/index.html
- https://www.youtube.com/watch?v=umLZphwA-dw
- The Python C API is used for two purposes: writing extension modules, and embedding CPython.
- Recommended C API tools: https://docs.python.org/3/c-api/intro.html#c-api-tools
- History of binding libraries: https://www.boost.org/doc/libs/1_58_0/libs/python/doc/, https://github.com/pybind/pybind11, https://nanobind.readthedocs.io/en/latest/, https://github.com/dgrunwald/rust-cpython
- PyO3: https://github.com/PyO3/pyo3/blob/main/Architecture.md
- Rust extension modules in the wild: https://github.com/openai/tiktoken, https://github.com/huggingface/tokenizers/tree/main/bindings/python, https://github.com/pola-rs/polars, https://www.youtube.com/watch?v=z_Eiy2W0APU
1.4 Accelerating Basic Linear Algebra Subroutines
... talk about BLAS
- 1971 Handbook for Automatic Computation (ALGOL 60)
- 1971 EISPACK (FORTRAN 66)
- 1973, 1974, 1975 draft BLAS
- 1979 LINPACK Users' Guide (Dongarra, Moler): linear systems, linear least squares
- 1979 BLAS1 paper (LINPACK benchmark in appendix)
- Moore's law -> Bell's law: mainframes used vector-vector BLAS1, minis matrix-vector BLAS2 (BLAS2 paper), micros matrix-matrix BLAS3; the BLAS level tracks the number of nested loops.
- http://www.openmathlib.org/OpenBLAS/docs/faq/
- https://dl.acm.org/doi/epdf/10.1145/355841.355847
- https://www.netlib.org/utk/people/JackDongarra/PAPERS/An-Extended-Set-of-BLAS-Model-Implementation-and-Test-Programs.pdf
- https://www.netlib.org/utk/people/JackDongarra/PAPERS/Set-of-Level-3-Basic-Linear-Algebra-Subprograms.pdf
- https://www.cs.utexas.edu/~flame/pubs/GotoTOMS_revision.pdf
- https://www.youtube.com/watch?v=JzNpKDW07rw
1.4.1 The SISD Fantasy of Latency-Oriented CPUs
von Neumann architecture... talk about processor microarchitecture and memory hierarchies
1.4.2 Bottlenecks, Rooflines, and Benchmarks
... talk about bottlenecks (intensity, and rooflines)
ROOFLINES (for computer architectures which look like level 3): compute got cheaper, memory hierarchies got deeper; minimize the surface/volume ratio (memory traffic relative to compute); flops are free, bandwidth is money, latency is physics.
Level-3 BLAS was designed for these architectures, hence matmul, because of its arithmetic intensity. Is there a BLAS4? Not really.
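To make the arithmetic-intensity argument concrete, a rough back-of-the-envelope sketch (idealized FLOP and byte counts that ignore caches; sizes are illustrative):

# Arithmetic intensity (FLOPs per byte moved) of matrix-vector vs matrix-matrix multiply,
# assuming 4-byte floats and that every operand is read or written exactly once.
def gemv_intensity(n: int, bytes_per_elem: int = 4) -> float:
    flops = 2 * n * n                                   # n^2 multiply-adds
    bytes_moved = (n * n + 2 * n) * bytes_per_elem      # matrix plus input and output vectors
    return flops / bytes_moved

def gemm_intensity(n: int, bytes_per_elem: int = 4) -> float:
    flops = 2 * n**3                                    # n^3 multiply-adds
    bytes_moved = 3 * n * n * bytes_per_elem            # three n-by-n matrices
    return flops / bytes_moved

for n in [256, 1024, 4096]:
    # GEMV stays near a constant ~0.5 flops/byte; GEMM grows roughly like n/6
    print(n, round(gemv_intensity(n), 2), round(gemm_intensity(n), 2))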
1.4.3 Accelerating GEMM from Rust to ASM
1.4.4 Accelerating GEMV and GEMM on CPU with Locality
use std::ops::{Add, Mul};

/// Textbook matrix multiply: C = A @ B
/// A is m×k, B is k×n, result C is m×n.
///
/// Matrices are row-major Vec<Vec<T>> for readability.
pub fn matmul<T>(a: &[Vec<T>], b: &[Vec<T>]) -> Result<Vec<Vec<T>>, String>
where
    T: Copy + Add<Output = T> + Mul<Output = T> + Default,
{
    if a.is_empty() || b.is_empty() {
        return Err("matmul: empty matrix".into());
    }
    // Ensure A is rectangular
    let m = a.len();
    let k = a[0].len();
    if a.iter().any(|row| row.len() != k) {
        return Err("matmul: A is not rectangular".into());
    }
    // Ensure B is rectangular
    let k2 = b.len();
    let n = b[0].len();
    if b.iter().any(|row| row.len() != n) {
        return Err("matmul: B is not rectangular".into());
    }
    // Ensure A's columns match B's rows
    if k != k2 {
        return Err(format!(
            "matmul: shape mismatch: A is {}x{}, B is {}x{}",
            m, k, k2, n
        ));
    }
    // Allocate C (m×n) filled with zeros (Default::default)
    let mut c = vec![vec![T::default(); n]; m];
    // Triple loop: i, j, p
    for i in 0..m {
        for j in 0..n {
            let mut acc = T::default();
            for p in 0..k {
                acc = acc + a[i][p] * b[p][j];
            }
            c[i][j] = acc;
        }
    }
    Ok(c)
}
; matmul_f32(C, A, B, M, N, K)
; SysV ABI args:
; rdi = C
; rsi = A
; rdx = B
; rcx = M
; r8 = N
; r9 = K
;
; row-major:
; A[i*K + k], B[k*N + j], C[i*N + j]
global matmul_f32
section .text
matmul_f32:
push rbx
push r12
push r13
push r14
push r15
1.4.5 Accelerating GEMV and GEMM on CPU with Parallelism
1.5 Learning Sentiment Iteratively with Classification
We'll now tackle our first learning problem in the domain of language with sentiment analysis — classifying whether a given piece of text has a positive or negative connotation — by modifying
- our model to support discrete categorical targets
- our optimization method to an iterative algorithm, namely gradient descent
Before diving into the model that supports discrete categorical targets in $\{0, 1\}$, we will take a closer look at the optimization method of gradient descent, because there are two reasons why we want to move from direct to iterative methods: the first being memory bottlenecks, and the second being non-linear function classes.
1.5.1 Linear Regression, and the Limitations of Direct Methods
The first reason why we want to switch from direct to iterative optimization methods is that direct methods are memory-bound on materializing the matrix $X^\top X$ from the normal equations (and require the full design matrix $X$ in memory at once).
1.5.2 Inductive Bias and Principle with Logistic Regression and Cross Entropy
The second reason why we want to switch from direct to iterative optimization methods is because even if the number of dimensions is small enough so that we are not memory-bound, direct methods simply will not work for other function classes besides linear regression. Consider ___.
Inductive Bias
In order to train a model which learns the sentiment of text, we will collect data where the input space $\mathcal{X} = \mathbb{R}^m$ consists of feature vectors and the output space is $\mathcal{Y} = \{0, 1\}$, where $0$ and $1$ encode the negative and positive cases of binary classification[^6] (we will expand the number of categorical responses in the next section with multi-class classification). Then the function of interest we'd like to recover from the data is of the form $\hat{f}: \mathbb{R}^m \to \{0, 1\}$. Recall that machine learning consists of selecting a family of functions as the inductive bias, a loss function as the inductive principle, and estimating the parameters by empirically minimizing the risk.
For the inputs,
For the model class, we will continue to use a weighted sum of the input vector, $z = \mathbf{w}^\top \mathbf{x}$ with $\mathbf{w} \in \mathbb{R}^m$ (where $x_0 = 1$ so that $w_0$ is the bias), but add the function $\sigma(z) = \frac{1}{1 + e^{-z}}$, referred to as the logistic or sigmoid[^7] function, so that the output is a valid probability. We'll interpret $\mathbf{w}^\top \mathbf{x}$ as the log odds, $\sigma(\mathbf{w}^\top \mathbf{x})$ as $p(y = 1 \mid \mathbf{x})$, and the complement $1 - \sigma(\mathbf{w}^\top \mathbf{x})$ as $p(y = 0 \mid \mathbf{x})$. However, the current type of $\hat{f}$ is $\mathbb{R}^m \to (0, 1)$, whereas the task is to assign a negative ($y = 0$) or positive ($y = 1$) sentiment to the provided input feature vector $\mathbf{x}$. To ameliorate this we will add a final decision boundary where $\hat{y} = 1$ if $\sigma(\mathbf{w}^\top \mathbf{x}) \geq 0.5$ and $\hat{y} = 0$ otherwise.
(FIGURE.MODEL ARCHITECTURE)
We can vectorize the evaluation of $\hat{f}$ on each $\mathbf{x}_i$ with a data matrix $X \in \mathbb{R}^{n \times m}$ so that $\hat{\mathbf{y}} = \sigma(X\mathbf{w})$:
import torch

def f(xi: torch.Tensor, w: torch.Tensor) -> torch.Tensor:
    """Compute sigmoid(w^T x) for a single example xi."""
    logits = torch.matmul(w, xi)
    return torch.sigmoid(logits)

if __name__ == "__main__":
    w = torch.randn(3)
    X = torch.randn(5, 3)
    for xi in X: # iterating over a 2-D tensor yields its rows
        yi_hat = f(xi, w)
        print(yi_hat.item())
    yhats = torch.sigmoid(X @ w) # the same five probabilities in one batched matrix-vector product
    print(yhats)
For example, let's trace through the evaluation of our model with the following input example $\mathbf{x}$:
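As a placeholder, the values below are illustrative; the trace follows the three stages of the model: weighted sum (log odds), sigmoid, decision boundary.

import torch

w = torch.tensor([0.5, -1.0, 2.0])     # illustrative parameters
xi = torch.tensor([1.0, 0.3, 0.8])     # illustrative input features
z = torch.dot(w, xi)                   # weighted sum (log odds): 0.5 - 0.3 + 1.6 = 1.8
p = torch.sigmoid(z)                   # probability of positive sentiment, sigmoid(1.8) ~= 0.86
yhat = int(p >= 0.5)                   # decision boundary at 0.5
print(z.item(), p.item(), yhat)        # 1.8, ~0.858, 1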
Inductive Principle
For the loss function, we will use the binary cross entropy, which is the negative log likelihood of the Bernoulli distribution over the targets: $\mathcal{L}(\mathbf{w}) = -\frac{1}{n}\sum_{i=1}^{n}\left[y_i \log \hat{y}_i + (1 - y_i)\log(1 - \hat{y}_i)\right]$ where $\hat{y}_i = \sigma(\mathbf{w}^\top \mathbf{x}_i)$.
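A minimal sketch of this loss (assuming predictions are probabilities strictly between 0 and 1; it should agree with torch.nn.functional.binary_cross_entropy):

import torch

def binary_cross_entropy(yhat: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    """Mean negative log likelihood of Bernoulli targets y given predicted probabilities yhat."""
    return -(y * torch.log(yhat) + (1 - y) * torch.log(1 - yhat)).mean()

if __name__ == "__main__":
    yhat = torch.tensor([0.9, 0.2, 0.7])
    y = torch.tensor([1.0, 0.0, 1.0])
    print(binary_cross_entropy(yhat, y).item())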
1.5.3 Iterative Optimization via Gradient Descent
For estimating the parameters that minimize the loss $\mathcal{L}$, we will optimize iteratively using gradient descent, for the two reasons already mentioned: being memory-bottlenecked on materializing $X^\top X$ with the linear regression model class, and, with other classes of functions (which is currently the case), we don't have ___.
Gradient descent, simply put, is an optimization method that uses differential calculus, namely, the fact that the negative gradient points in the direction of steepest descent and so provides hints on how to iteratively modify the parameters closer to an optimum. The gradient descent algorithm can be expressed in a one line update rule, so that the goal of $\min_{\mathbf{w}} \mathcal{L}(\mathbf{w})$ is implemented by:

$\mathbf{w} \leftarrow \mathbf{w} - \eta \nabla_{\mathbf{w}} \mathcal{L}(\mathbf{w})$

where $\eta$ is referred to as the learning rate or step size. We now dive deeper into differential calculus and generalize the derivative to higher dimensional spaces[^8] to better understand what's happening under the hood.
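A minimal sketch of this update rule applied to the logistic regression model (synthetic data; the gradient used is $\nabla_{\mathbf{w}} \mathcal{L} = \frac{1}{n} X^\top(\sigma(X\mathbf{w}) - \mathbf{y})$):

import torch

torch.manual_seed(0)
X = torch.randn(100, 3)                               # synthetic features
y = (X @ torch.tensor([1.0, -2.0, 0.5]) > 0).float()  # synthetic binary labels
w = torch.zeros(3)
lr = 0.1                                              # learning rate (step size)

for step in range(500):
    p = torch.sigmoid(X @ w)                          # forward pass: predicted probabilities
    grad = X.T @ (p - y) / len(y)                     # gradient of the mean cross entropy w.r.t. w
    w = w - lr * grad                                 # the one-line gradient descent update
print(w)                                              # should point roughly along [1, -2, 0.5]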
1.5.4 Multinomial Logistic Regression
1.5.5 Generalized Linear Models
[^1]: While predicting house prices is not AGI, we follow the canonical hello world of machine learning in order to provide an easy example while introducing the foundational mechanics of statistical learning, which are crucial to understanding part two and part three of the book where we cover more industrial deep learning. In software 1.0 you should know how to reverse a linked list. In software 2.0 you should know how to fit a line.
[^2]: If you are familiar with functional programming, this is the mathematical equivalent of currying via closure.