1. Elements of Learning
In part 1 of the course notes, we transition from deterministic types to stochastic tensors
by training a house price prediction model with numpy, understanding
the mechanics of the statistical learning approach to artificial intelligence.
After that, we start building our own deep learning framework, picograd, and
develop the Tensor data structure with linear transformations
encoded via .matmul()s which are accelerated on CPU.
This will prepare us for the next part of the course notes where we change the
learning task to sequence learning and the function class to deep neural networks.
Contents
- 1.1 Learning Price Directly with Regression
- 1.2 High Dimensional Arrays
- 1.3 Accelerating Basic Linear Algebra Subroutines on CPU
- 1.4 Learning Sentiment Iteratively with Classification
1.1 Learning Price Directly with Regression
The primary goal of the statistical learning approach to artificial intelligence is induction, which is to use historical data to make accurate predictions in new situations. More formally, the goal is to recover some general function $f$ from particular data $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^n$, which can then be evaluated to predict some future value $\hat{y} = f(x)$ given some future observation $x$. That is, the goal is generalization. Machine learning has its roots in statistical inference, and so we also refer to the data as observations or evidence, and the functions as hypotheses or models. For instance, consider the task of house price prediction1 where the input space is the size in square feet, and the output space is the price, so the function we are interested in recovering has type $f : \mathbb{R} \to \mathbb{R}$. Since enumerating through every house size and price pair in $\mathbb{R} \times \mathbb{R}$ is intractable (otherwise all pairs could be inserted into a lookup table or database), an assumption about the structure of the data needs to be made, referred to as the inductive bias.
<SHOW_DATA>
1.1.1 Fitting a Line
The simplest assumption to make is that there exists a linear relationship in the data, so that a function of the form $f(x) = wx + b$ parameterized by slope $w$ and intercept $b$ fits the data well. After the inductive bias on the family of functions has been made, the learning algorithm must find the function with the best fit by selecting parameters $w$ and $b$. Since artificial learning algorithms don't have a visual cortex like biological humans2, the notion of "good fit" needs to be quantified with a second function of the form $\ell : \mathbb{R} \times \mathbb{R} \to \mathbb{R}$, referred to as the loss function, which can be used to systematically find the function of best fit and thus the best predictor3. For now, we'll make the unjustified decision (justifying it later4) to use the least squares loss so that $\ell(\hat{y}, y) = (\hat{y} - y)^2$, and denote the loss function with $\ell(w, b)$ to reflect the fact that the input-output pairs are fixed, and that the parameters determine the goodness of fit. Evaluating the least squares loss over the entire dataset results in $L(w, b) = \frac{1}{n}\sum_{i=1}^{n} (wx_i + b - y_i)^2$, and minimizing this quantity is the final step, denoted by $\min_{w, b} L(w, b)$.
To summarize, we have selected
- an inductive bias with the family of linear functions $f(x) = wx + b$ which describes the relationship between the input and output space
- an inductive principle with the least squares loss $\ell(\hat{y}, y) = (\hat{y} - y)^2$ which measures accuracy with an objective function
- and will find the line of best fit by estimating the parameters which minimize the empirical risk, denoted as $\hat{w}, \hat{b} = \arg\min_{w, b} \frac{1}{n}\sum_{i=1}^{n} (wx_i + b - y_i)^2$
With the linear regression model, we can solve this optimization with a direct method.
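As a minimal sketch of the direct method with numpy (the sizes and prices below are made-up, illustrative numbers):

import numpy as np

# hypothetical data: house sizes in sqft, prices in $1000s
x = np.array([600.0, 800.0, 1000.0, 1200.0, 1500.0])
y = np.array([150.0, 200.0, 240.0, 290.0, 350.0])

# design matrix with a column of ones so the intercept b is learned too
X = np.stack([x, np.ones_like(x)], axis=1)

# direct method: least squares solve of X theta ~= y via the normal equations
theta, *_ = np.linalg.lstsq(X, y, rcond=None)
w, b = theta
print(f"slope w = {w:.4f}, intercept b = {b:.4f}")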
todo: parameter, linear/affine, matrices as data
1.1.2 Fitting a Line, in Higher Dimensions
Vector Space
1.1.3 Fitting a Line, Probabilistically
The most widely adopted mathematical language for formalizing our intuitions around stochastic, non-deterministic phenomena is probability theory (as opposed to alternative frameworks such as probabilistic logic or uncertainty quantification). In probability theory statements are neither true nor false; rather, truth is distributed across a weighted set of random events $\Omega$. Similarly, random variables and their probability distributions do not take on definite values but rather sets of values. In this sense, probability theory is a generalization of Aristotelian logic.
Probability Space
1.1.4 Fitting a Line, with Non-Linear Features
Feature Space
1.2 High Dimensional Arrays
1.2.1 Tensor Language
class Tensor():
1.2.2 Op Intermediate Representation
class Op():
1.2.3 Device Runtime
class Device():
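picograd's actual interfaces are developed over the rest of the notes; as a rough, hypothetical sketch of how the three pieces relate (the names and signatures below are illustrative, not picograd's final API):

class Op:
    """IR node: an operator name plus the input tensors it was applied to."""
    def __init__(self, name, srcs=()):
        self.name, self.srcs = name, srcs

class Device:
    """Runtime that knows how to execute ops against concrete buffers."""
    def run(self, op, *buffers):
        raise NotImplementedError  # e.g., a CPU device would dispatch to BLAS

class Tensor:
    """User-facing n-d array that records the op which produced it."""
    def __init__(self, data, op=None):
        self.data, self.op = data, op
    def matmul(self, other):
        # record the linear transformation as an Op for the device to execute
        return Tensor(None, op=Op("matmul", (self, other)))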
1.3 Accelerating Basic Linear Algebra Subroutines on CPU
1.3.1 Latency-Oriented SISD Fantasy: From IBM 360 to Macbook
Throughout the last 60 years of computer architecture, we've seen the number of transistors double roughly every two years, enabling the transition across computer classes from room-sized mainframes, to refrigerator-sized minicomputers, to personal-sized microcomputers, and even to embedded computers. Although the form factors have drastically changed, the computer architectures have roughly remained the same.
Across the computer classes of IBM's System/360, DEC's PDP-11, and Apple's MacBook, all architectures in essence follow the original stored-program, sequential instruction processing architecture laid out by Burks, Goldstine, and von Neumann in their 1946 paper5 Preliminary Discussion of the Logical Design of an Electronic Computing Instrument, dubbed the von Neumann architecture, which roughly speaking consists of five elements placed into three categories:
- Processor and Memory
- Input and Output
- Control
<FIGURE_2>
In order to execute a program's specified sequence of instructions, the control unit orchestrates all components in concert with one another in a fetch, decode, execute cycle, interpreting each instruction and carrying out its computation on the processor against the memory's state. These instructions are specified by the instruction set architecture and can be encoded in a human-readable language referred to as assembly code, or in the direct binary encoding referred to as machine code. Although these days programs are specified in higher-level languages with a larger semantic gap from the machine, understanding assembly is crucial for any low-level machine learning systems programmer in order to obtain an understanding as close as possible to the series of instructions the processor actually6 executes.

$ time = instructions \times \frac{clocks}{instruction} \times \frac{seconds}{clock} $
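For example, a hypothetical program of $10^{10}$ instructions on a core sustaining 4 instructions per clock at a 4 GHz clock takes $10^{10} \times \frac{1}{4} \times \frac{1}{4 \times 10^9} \approx 0.625$ seconds.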
Optimization 1: Increase $\frac{clocks}{second}$ (frequency)
Optimization 2: Increase $\frac{instructions}{clock}$ (instruction-level parallelism)
Optimization 3: Avoid stalls (speculation: caches, branch predictors, memory prefetchers)
What to do with twice the silicon?? (Orange line)
1.3.2 Bottlenecks and Rooflines
Is the throughput performance of matrix multiplication satisfactory? Well, if the hardware is theoretically capable of delivering more throughput measured in FLOP/S, why leave any compute on the table? This intuition is captured in the roofline cost model where ...
Figure x: https://modal.com/gpu-glossary
One of the key operations used in scientific computing (and machine learning) is matrix multiplication, which takes $A \in \mathbb{R}^{n \times m}$ and $B \in \mathbb{R}^{m \times p}$ and produces $C = AB$ where $C_{ij} = \sum_{k=0}^{m-1} A_{ik} B_{kj}$ for each $i \in [0,n)$, $j \in [0,p)$. Translating the mathematical definition directly to a computational algorithm results in the following $O(n^3)$ implementation that computes $n \cdot p$ dot products:
def matmul(A, B):
    n, m, p = len(A), len(A[0]), len(B[0])
    C = [[0.0 for _ in range(p)] for _ in range(n)]
    for i in range(n):
        for j in range(p):
            sum = 0.0
            for k in range(m):
                sum += A[i][k] * B[k][j]
            C[i][j] = sum
    return C

if __name__ == "__main__":
    A = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
    B = [[7.0, 8.0], [9.0, 10.0], [11.0, 12.0]]
    C = matmul(A, B)
    print(C)
How long do you expect the wall clock time to take for a square matrix of size 4096? On the order of nanoseconds, microseconds, milliseconds, seconds, minutes, hours, days?
Since the algorithm has a cubic amount of work, the number of floating point operations is roughly $2n^3 = 2 \cdot 4096^3 \approx 1.4 \times 10^{11}$, and the measured running time on an Apple M3 takes X hours, meaning the throughput is 10 MFLOP/S, which is Z% of the W peak theoretical TFLOP/S of the M3 machine.
function matmul(A, B) {
  const [n, m, p] = [A.length, A[0].length, B[0].length];
  const C = Array.from({ length: n }, () => Array(p).fill(0));
  for (let i = 0; i < n; i++) {
    for (let j = 0; j < p; j++) {
      let sum = 0;
      for (let k = 0; k < m; k++) {
        sum += A[i][k] * B[k][j];
      }
      C[i][j] = sum;
    }
  }
  return C;
}
50 MFLOP/S
#[allow(non_snake_case)]
fn matmul(A: &[Vec<f64>], B: &[Vec<f64>]) -> Vec<Vec<f64>> {
    let (n, m, p) = (A.len(), A[0].len(), B[0].len());
    let mut C = vec![vec![0.0; p]; n];
    for i in 0..n {
        for j in 0..p {
            let mut sum = 0.0;
            for k in 0..m {
                sum += A[i][k] * B[k][j];
            }
            C[i][j] = sum;
        }
    }
    C
}

#[allow(non_snake_case)]
fn main() {
    let A = vec![vec![1.0, 2.0, 3.0], vec![4.0, 5.0, 6.0]];
    let B = vec![vec![7.0, 8.0], vec![9.0, 10.0], vec![11.0, 12.0]];
    let C = matmul(&A, &B);
    println!("{:?}", C);
}
1 GFLOP/S
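For comparison with what an optimized library achieves, here is a minimal timing harness using numpy, whose matmul dispatches to the platform BLAS (the measured number will vary by machine; treat it as illustrative):

import time
import numpy as np

n = 4096
A = np.random.rand(n, n).astype(np.float32)
B = np.random.rand(n, n).astype(np.float32)

start = time.perf_counter()
C = A @ B                 # dispatches to an optimized BLAS sgemm
elapsed = time.perf_counter() - start

flops = 2 * n**3          # n^2 entries, each a length-n dot product (mul + add)
print(f"{flops / elapsed / 1e9:.1f} GFLOP/S in {elapsed:.3f}s")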
1.3.3 M1 Processor Architecture
1.3.4 Macbook's Memory Hierarchy
- vectorized, cache blocked 40GFLOP/S (0.2% peak)
- multicore 200GFLOP/S (40% CPU vector peak). (1% of SoC peak)
- bf16, matrix units 1.5TFLOP/S (7.5% of SoC peak)
- metal 3.2TFLOP/S (15% of SoC peak)
- npu (int16) 16 TOP/S (80% of SoC peak)
- datacenter GPUs: 19 TFLOP/S (A100), 4 PFLOP/S (H100), 14 PFLOP/S (B100), 20 PFLOP/S (B200): six orders of magnitude (1,000,000x) of speedup <-- kills Proebsting's law!
1.4 Learning Sentiment Iteratively with Classification
We'll now tackle our first learning problem in the domain of language with sentiment analysis — classifying whether a given piece of text has a positive or negative connotation — by modifying
- our model to support discrete categorical targets
- our optimization method to an iterative algorithm, namely gradient descent
Before diving into the model that supports discrete categorical targets in $\{-1, +1\}$, we will take a closer look at the optimization method of gradient descent, because there are two reasons to move from direct to iterative methods: the first is memory bottlenecks, and the second is non-linear function classes.
1.4.1 Linear Regression, and the Limitations of Direct Methods
The first reason why we want to switch from direct to iterative optimization methods is that direct methods are memory-bound on materializing the matrix $X^\top X$.
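To make this concrete with illustrative numbers: with $d = 10^5$ features, $X^\top X$ alone has $10^{10}$ entries, roughly 40 GB in float32, before any factorization or inversion work even begins.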
1.4.2 Inductive Bias and Principle with Logistic Regression and Cross Entropy
The second reason why we want to switch from direct to iterative optimization methods is because even if the number of dimensions is small enough so that we are not memory-bound, direct methods simply will not work for other function classes besides linear regression. Consider ___.
Inductive Bias
In order to train a model which learns the sentiment of text, we will collect data where the input space $\mathcal{X} = \mathbb{R}^d$ consists of feature vectors $\mathbf{x}$ and the output space is $\mathcal{Y} = \{-1, +1\}$, which encodes the negative and positive cases of binary classification7 (we will expand the number of categorical responses in the next section with multi-class classification). Then the function of interest we'd like to recover from the data is of the form $f : \mathbb{R}^d \to \{-1, +1\}$. Recall that machine learning consists of selecting a family of functions as the inductive bias, a loss function as the inductive principle, and estimating the parameters by empirically minimizing the risk.
For the inputs,
For the model class, we will continue to use a weighted sum of the input vector $\mathbf{x} \in \mathbb{R}^d$ with weights $\mathbf{w} \in \mathbb{R}^d$, but add the function $\sigma : \mathbb{R} \to [0,1]$ where $\sigma(z) = \frac{1}{1 + e^{-z}}$, referred to as the logistic or sigmoid8 function, so that the output $f(\mathbf{x}) = \sigma(\mathbf{w}^\top \mathbf{x})$ is a valid probability. We'll interpret $\mathbf{w}^\top \mathbf{x}$ as the log odds, $\sigma(\mathbf{w}^\top \mathbf{x})$ as $p(Y=1|X=\mathbf{x})$, and the complement $1 - \sigma(\mathbf{w}^\top \mathbf{x})$ as $p(Y=-1|X=\mathbf{x})$. However, the current type of $f$ is $f : \mathbb{R}^d \to [0,1]$, whereas the task is to assign a negative ($-1$) or positive ($+1$) sentiment to the provided input feature vector $\mathbf{x}$. To ameliorate this we will add a final decision boundary where $$ sentiment(\mathbf{x}) \coloneqq \begin{cases} 1 &\text{if } p(Y=1|X=\mathbf{x}) > 0.5 \\ -1 &\text{otherwise} \end{cases} $$
(FIGURE.MODEL ARCHITECTURE)
We can vectorize the evaluation of $f$ on each example $i \in [0..n)$ with a data matrix $X \in \mathbb{R}^{n \times d}$ so that $\hat{\mathbf{y}} = \sigma(X\mathbf{w})$:
import torch

def f(xi: torch.Tensor, w: torch.Tensor) -> torch.Tensor:
    """Compute sigmoid(w^T x) for a single example xi."""
    logits = torch.matmul(w, xi)
    return torch.sigmoid(logits)

if __name__ == "__main__":
    w = torch.randn(3)
    X = torch.randn(5, 3)
    for xi in X:  # does this work?
        yi_hat = f(xi, w)
        print(yi_hat.item())
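The vectorized evaluation promised above replaces the loop entirely; a one-line sketch over the same randomly generated X and w:

y_hat = torch.sigmoid(X @ w)  # shape (5,): one probability per row of X
print(y_hat)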
For example, let's trace through the evaluation of our model with the following input example:
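As a hypothetical trace (made-up weights and features for illustration), take $\mathbf{w} = (0.5, -1.0, 2.0)$ and $\mathbf{x} = (1.0, 2.0, 1.5)$. The log odds are $\mathbf{w}^\top \mathbf{x} = 0.5 - 2.0 + 3.0 = 1.5$, the probability is $\sigma(1.5) \approx 0.82 > 0.5$, and so the decision boundary assigns positive sentiment ($+1$).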
Inductive Principle
For the loss function, we will use the cross entropy loss.
1.4.3 Iterative Optimization via Gradient Descent
For estimating the parameters $\mathbf{w}$ that minimize the loss $L(\mathbf{w})$, we will optimize iteratively using gradient descent, for the two reasons already mentioned: with the linear regression model class we are memory-bottlenecked on materializing $X^\top X$, and with other classes of functions (which is currently the case) we don't have a closed-form solution.
Gradient descent, simply put, is an optimization method that uses differential calculus, namely, the fact that the negative gradient points in the direction of steepest descent, providing a hint for how to iteratively move the parameters closer to an optimum. The gradient descent algorithm can be expressed in a one-line update rule, so that the goal of $\arg\min_{\mathbf{w}} L(\mathbf{w})$ is implemented by:
$$ \mathbf{w}_{t+1} \coloneqq \mathbf{w}_t - \eta \nabla_{\mathbf{w}} L(\mathbf{w}_t) $$
where $\eta$ is referred to as the learning rate or step size. We now dive deeper into differential calculus and generalize the derivative to higher dimensional spaces9 to better understand what's happening under the hood.
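As a minimal sketch of this loop for the logistic model (synthetic data and a hypothetical learning rate; note it uses the $\{0, 1\}$ label encoding7 that binary cross entropy expects):

import torch

torch.manual_seed(0)
X = torch.randn(100, 3)                                   # synthetic feature vectors
y = (X @ torch.tensor([1.0, -2.0, 0.5]) > 0).float()      # synthetic 0/1 labels

w = torch.zeros(3, requires_grad=True)
lr = 0.1                                                  # learning rate eta

for step in range(100):
    y_hat = torch.sigmoid(X @ w)                          # forward pass
    loss = torch.nn.functional.binary_cross_entropy(y_hat, y)
    loss.backward()                                       # gradient of loss w.r.t. w
    with torch.no_grad():
        w -= lr * w.grad                                  # the one-line update rule
        w.grad.zero_()

print(w)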
1.4.4 Multinomial Logistic Regression
1.4.5 Generalized Linear Models
1. While predicting house prices is not AGI, we follow this hello world in order to provide an easy example while introducing the foundational mechanics and workhorses of machine learning, which are crucial in part two of the course notes where we cover more industrial machine learning. In software 1.0 you should know how to reverse a linked list. In software 2.0 you should know how to fit a line.
2. Our visual cortex won't work in input-output spaces with dimensions higher than 3 anyway.
3. When loss functions are convex we can find the best predictor which minimizes the loss. This used to be very important before the empirical success of deep neural networks, a function class which is highly non-convex, as we will see in part two of the course notes.
4. With the inductive principle of maximum likelihood estimation.
5. In that paper, they refer to components of the computer architecture as "organs". With each new class of computer unlocking new applications, humans find grand, inspiring names such as the digital computer, the intergalactic network, the cloud, and artificial intelligence. The more things change, the more they stay the same.
6. Even in the case of x86-64 based instruction set architectures, which decode instructions into microinstructions.
7. The encodings are arbitrary. The output space can alternatively be the set $\{0, 1\}$, which we avoid to prevent confusion between the categorical encodings and the probabilities $p \in [0,1]$ assigned to them.
8. Perhaps a more suitable name for the model is "sigmoidal classification" rather than "logistic regression".
9. What is known as the Fréchet derivative, which is defined on any vector space equipped with a norm.