1. Elements of Learning
In part 1 of the course notes, we transition from deterministic types to stochastic tensors
by training a house price prediction model with numpy, understanding
the mechanics of the statistical learning approach to artificial intelligence.
After that, we start building our own deep learning framework, picograd, and
develop the Tensor data structure with linear transformations
encoded via .matmul()s which are accelerated on CPU.
This will prepare us for the next part of the course notes where we change the
learning task to sequence learning and the function class to deep neural networks.
Contents
- 1.1 Learning Price Directly with Regression
- 1.2 High Dimensional Arrays
- 1.3 Accelerating Basic Linear Algebra Subroutines on CPU
- 1.4 Learning Sentiment Iteratively with Classification
1.1 Learning Price Directly with Regression
The primary goal of the statistical learning approach to artificial intelligence is induction, which is to use historical data to make accurate predictions in new situations. More formally, the goal is to recover some general function $f$ from particular data $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^n$, which can then be evaluated to predict some future value $\hat{y} = f(x)$ given some future observation $x$. That is, the goal is generalization. Machine learning has its roots in statistical inference, and so we also refer to the data as observations or evidence, and the functions as hypotheses or models. For instance, consider the task of house price prediction1 where the input space is the size in square feet, and the output space is the price, so the function we are interested in recovering has type $f : \mathbb{R} \to \mathbb{R}$. Since enumerating through every house size and price pair in $\mathbb{R} \times \mathbb{R}$ is intractable (otherwise all pairs could be inserted into a lookup table or database), an assumption about the structure of the data needs to be made, referred to as the inductive bias.
<SHOW_DATA>
1.1.1 Fitting a Line
The simplest assumption to make is that there exists a linear relationship in the data, so that a function of the form $f(x) = wx + b$ parameterized by slope $w$ and intercept $b$ fits the data well. After the inductive bias on the family of functions has been made, the learning algorithm must find the function with the best fit by selecting parameters $w$ and $b$. Since artificial learning algorithms don't have a visual cortex like biological humans2, the notion of "good fit" needs to be quantified with a second function of the form $\ell : \mathbb{R} \times \mathbb{R} \to \mathbb{R}$, referred to as the loss function, which can be used to systematically find the function of best fit and thus the best predictor3. For now, we'll make the unjustified decision (justifying it later4) to use the least squares loss so that $\ell(\hat{y}, y) = (\hat{y} - y)^2$, and denote the loss function with $\ell(w, b)$ to reflect the fact that the input-output pairs are fixed, and that the parameters determine the goodness of fit. Evaluating the least squares loss over the entire dataset results in $L(w, b) = \frac{1}{n}\sum_{i=1}^{n} (wx_i + b - y_i)^2$, and minimizing this quantity is the final step, denoted by $\min_{w, b} L(w, b)$.
To summarize, we have selected
- an inductive bias with the family of linear functions $f(x) = wx + b$ which describes the relationship between the input and output space
- an inductive principle with the least squares loss $\ell(\hat{y}, y) = (\hat{y} - y)^2$ which measures accuracy with an objective function
- and will find the line of best fit by estimating the parameters which minimize the empirical risk, denoted as $\hat{w}, \hat{b} = \arg\min_{w, b} \frac{1}{n}\sum_{i=1}^{n} (wx_i + b - y_i)^2$
With the linear regression model, we can solve this optimization with a direct method.
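As a minimal sketch of the direct method with numpy (the sizes and prices below are made-up, illustrative numbers):

import numpy as np

# hypothetical data: house sizes in sqft, prices in $1000s
x = np.array([600.0, 800.0, 1000.0, 1200.0, 1500.0])
y = np.array([150.0, 200.0, 240.0, 290.0, 350.0])

# design matrix with a column of ones so the intercept b is learned too
X = np.stack([x, np.ones_like(x)], axis=1)

# direct method: least squares solve of X theta ~= y via the normal equations
theta, *_ = np.linalg.lstsq(X, y, rcond=None)
w, b = theta
print(f"slope w = {w:.4f}, intercept b = {b:.4f}")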
todo: parameter, linear/affine, matrices as data
1.1.2 Fitting a Line, in Higher Dimensions
Vector Space
1.1.3 Fitting a Line, Probabilistically
The most widely adopted mathematical language for formalizing our intuitions around stochastic, non-deterministic phenomena is probability theory (as opposed to alternative frameworks such as probabilistic logic or uncertainty quantification). In probability theory statements are neither true nor false; rather, truth is distributed across a weighted set of random events $\Omega$. Similarly, random variables and their probability distributions do not take on definite values but rather sets of values. In this sense, probability theory is a generalization of Aristotelian logic.
Probability Space
1.1.4 Fitting a Line, with Non-Linear Features
Feature Space
1.2 High Dimensional Arrays
1.2.1 Tensor Language
class Tensor():
1.2.2 Op Intermediate Representation
class Op():
1.2.3 Device Runtime
class Device():
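picograd's actual interfaces are developed over the rest of the notes; as a rough, hypothetical sketch of how the three pieces relate (the names and signatures below are illustrative, not picograd's final API):

class Op:
    """IR node: an operator name plus the input tensors it was applied to."""
    def __init__(self, name, srcs=()):
        self.name, self.srcs = name, srcs

class Device:
    """Runtime that knows how to execute ops against concrete buffers."""
    def run(self, op, *buffers):
        raise NotImplementedError  # e.g., a CPU device would dispatch to BLAS

class Tensor:
    """User-facing n-d array that records the op which produced it."""
    def __init__(self, data, op=None):
        self.data, self.op = data, op
    def matmul(self, other):
        # record the linear transformation as an Op for the device to execute
        return Tensor(None, op=Op("matmul", (self, other)))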
1.3 Accelerating Basic Linear Algebra Subroutines on CPU
1.3.1 Latency-Oriented SISD Fantasy: From IBM 360 to Macbook
Throughout the last 60 years of computer architecture, we've seen the number of transistors double roughly every two years, enabling the transition across computer classes from room-sized mainframes, to refrigerator-sized minicomputers, to personal-sized microcomputers, and even to embedded computers. Although the form factors have drastically changed, the computer architectures have roughly remained the same.
Across the computer classes of IBM's System/360, DEC's PDP-11, and Apple's MacBook, all architectures in essence follow the original stored-program, sequential instruction processing architecture laid out by Burks, Goldstine, and von Neumann in their 1946 paper5 Preliminary Discussion of the Logical Design of an Electronic Computing Instrument, dubbed the von Neumann architecture, which roughly speaking consists of five elements placed into three categories:
- Processor and Memory
- Input and Output
- Control
<FIGURE_2>
In order to execute a program's specified sequence of instructions, the control unit orchestrates all components in concert with one another in a fetch, decode, execute cycle, interpreting each instruction and carrying out its computation on the processor against the memory's state. These instructions are specified by the instruction set architecture and can be encoded in a human-readable language referred to as assembly code, or in the direct binary encoding referred to as machine code. Although these days programs are specified in higher-level languages with a larger semantic gap from the machine, understanding assembly is crucial for any low-level machine learning systems programmer in order to obtain an understanding as close as possible to the series of instructions the processor actually6 executes.

$ time = instructions \times \frac{clocks}{instruction} \times \frac{seconds}{clock} $
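For example, a hypothetical program of $10^{10}$ instructions on a core sustaining 4 instructions per clock at a 4 GHz clock takes $10^{10} \times \frac{1}{4} \times \frac{1}{4 \times 10^9} \approx 0.625$ seconds.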
Optimization 1: Increase $\frac{clocks}{second}$ (frequency)
Optimization 2: Increase $\frac{instructions}{clock}$ (instruction-level parallelism)
Optimization 3: Avoid stalls (speculation: caches, branch predictors, memory prefetchers)
What to do with twice the silicon?? (Orange line)
1.3.2 Bottlenecks and Rooflines
Is the throughput performance of matrix multiplication satisfactory? Well, if the hardware is theoretically capable of delivering more throughput measured in FLOP/S, why leave any compute on the table? This intuition is captured in the roofline cost model where ...
Figure x: https://modal.com/gpu-glossary
One of the key operations used in scientific computing (and machine learning) is matrix multiplication, which takes $A \in \mathbb{R}^{n \times m}$ and $B \in \mathbb{R}^{m \times p}$ and produces $C = AB$ where $C_{ij} = \sum_{k=0}^{m-1} A_{ik} B_{kj}$ for each $i \in [0,n)$, $j \in [0,p)$. Translating the mathematical definition directly to a computational algorithm results in the following $O(n^3)$ implementation that computes $n \cdot p$ dot products:
def matmul(A, B):
    n, m, p = len(A), len(A[0]), len(B[0])
    C = [[0.0 for _ in range(p)] for _ in range(n)]
    for i in range(n):
        for j in range(p):
            sum = 0.0
            for k in range(m):
                sum += A[i][k] * B[k][j]
            C[i][j] = sum
    return C

if __name__ == "__main__":
    A = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
    B = [[7.0, 8.0], [9.0, 10.0], [11.0, 12.0]]
    C = matmul(A, B)
    print(C)
How long do you expect the wall clock time to take for a square matrix of size 4096? On the order of nanoseconds, microseconds, milliseconds, seconds, minutes, hours, days?
Since the algorithm has a cubic amount of work, the number of floating point operations is roughly $2n^3 = 2 \cdot 4096^3 \approx 1.4 \times 10^{11}$, and the measured running time on an Apple M3 takes X hours, meaning the throughput is 10 MFLOP/S, which is Z% of the W peak theoretical TFLOP/S of the M3 machine.
function matmul(A, B) {
  const [n, m, p] = [A.length, A[0].length, B[0].length];
  const C = Array.from({ length: n }, () => Array(p).fill(0));
  for (let i = 0; i < n; i++) {
    for (let j = 0; j < p; j++) {
      let sum = 0;
      for (let k = 0; k < m; k++) {
        sum += A[i][k] * B[k][j];
      }
      C[i][j] = sum;
    }
  }
  return C;
}
50 MFLOP/S
#[allow(non_snake_case)]
fn matmul(A: &[Vec<f64>], B: &[Vec<f64>]) -> Vec<Vec<f64>> {
    let (n, m, p) = (A.len(), A[0].len(), B[0].len());
    let mut C = vec![vec![0.0; p]; n];
    for i in 0..n {
        for j in 0..p {
            let mut sum = 0.0;
            for k in 0..m {
                sum += A[i][k] * B[k][j];
            }
            C[i][j] = sum;
        }
    }
    C
}

#[allow(non_snake_case)]
fn main() {
    let A = vec![vec![1.0, 2.0, 3.0], vec![4.0, 5.0, 6.0]];
    let B = vec![vec![7.0, 8.0], vec![9.0, 10.0], vec![11.0, 12.0]];
    let C = matmul(&A, &B);
    println!("{:?}", C);
}
1 GFLOP/S
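For comparison with what an optimized library achieves, here is a minimal timing harness using numpy, whose matmul dispatches to the platform BLAS (the measured number will vary by machine; treat it as illustrative):

import time
import numpy as np

n = 4096
A = np.random.rand(n, n).astype(np.float32)
B = np.random.rand(n, n).astype(np.float32)

start = time.perf_counter()
C = A @ B                 # dispatches to an optimized BLAS sgemm
elapsed = time.perf_counter() - start

flops = 2 * n**3          # n^2 entries, each a length-n dot product (mul + add)
print(f"{flops / elapsed / 1e9:.1f} GFLOP/S in {elapsed:.3f}s")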
1.3.3 M1 Processor Architecture
1.3.4 Macbook's Memory Hierarchy
- vectorized, cache blocked 40GFLOP/S (0.2% peak)
- multicore 200GFLOP/S (40% CPU vector peak). (1% of SoC peak)
- bf16, matrix units 1.5TFLOP/S (7.5% of SoC peak)
- metal 3.2TFLOP/S (15% of SoC peak)
- npu (int16) 16 TOP/S (80% of SoC peak)
- datacenter GPUs: 19 TFLOP/S (A100), 4 PFLOP/S (H100), 14 PFLOP/S (B100), 20 PFLOP/S (B200): six orders of magnitude (1,000,000x) of speedup <-- kills Proebsting's law!
1.4 Learning Sentiment Iteratively with Classification
We'll now tackle our first learning problem in the domain of language with sentiment analysis — classifying whether a given piece of text has a positive or negative connotation — by modifying
- our model to support discrete categorical targets
- our optimization method to an iterative algorithm, namely gradient descent
Before diving into the model that supports discrete categorical targets in $\{-1, +1\}$, we will take a closer look at the optimization method of gradient descent, because there are two reasons to move from direct to iterative methods: the first is memory bottlenecks, and the second is non-linear function classes.
1.4.1 Linear Regression, and the Limitations of Direct Methods
The first reason why we want to switch from direct to iterative optimization methods is that direct methods are memory-bound on materializing the matrix $X^\top X$.
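To make this concrete with illustrative numbers: with $d = 10^5$ features, $X^\top X$ alone has $10^{10}$ entries, roughly 40 GB in float32, before any factorization or inversion work even begins.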
1.4.2 Inductive Bias and Principle with Logistic Regression and Cross Entropy
The second reason why we want to switch from direct to iterative optimization methods is because even if the number of dimensions is small enough so that we are not memory-bound, direct methods simply will not work for other function classes besides linear regression. Consider ___.
Inductive Bias
In order to train a model which learns the sentiment of text, we will collect data where the input space $\mathcal{X} = \mathbb{R}^d$ consists of feature vectors $\mathbf{x}$ and the output space is $\mathcal{Y} = \{-1, +1\}$, which encodes the negative and positive cases of binary classification7 (we will expand the number of categorical responses in the next section with multi-class classification). Then the function of interest we'd like to recover from the data is of the form $f : \mathbb{R}^d \to \{-1, +1\}$. Recall that machine learning consists of selecting a family of functions as the inductive bias, a loss function as the inductive principle, and estimating the parameters by empirically minimizing the risk.
For the inputs,
For the model class, we will continue to use a weighted sum of the input vector $\mathbf{x} \in \mathbb{R}^d$ with weights $\mathbf{w} \in \mathbb{R}^d$, but add the function $\sigma : \mathbb{R} \to [0,1]$ where $\sigma(z) = \frac{1}{1 + e^{-z}}$, referred to as the logistic or sigmoid8 function, so that the output $f(\mathbf{x}) = \sigma(\mathbf{w}^\top \mathbf{x})$ is a valid probability. We'll interpret $\mathbf{w}^\top \mathbf{x}$ as the log odds, $\sigma(\mathbf{w}^\top \mathbf{x})$ as $p(Y=1|X=\mathbf{x})$, and the complement $1 - \sigma(\mathbf{w}^\top \mathbf{x})$ as $p(Y=-1|X=\mathbf{x})$. However, the current type of $f$ is $f : \mathbb{R}^d \to [0,1]$, whereas the task is to assign a negative ($-1$) or positive ($+1$) sentiment to the provided input feature vector $\mathbf{x}$. To ameliorate this we will add a final decision boundary where $$ sentiment(\mathbf{x}) \coloneqq \begin{cases} 1 &\text{if } p(Y=1|X=\mathbf{x}) > 0.5 \\ -1 &\text{otherwise} \end{cases} $$
(FIGURE.MODEL ARCHITECTURE)
We can vectorize the evaluation of $f$ on each example $i \in [0..n)$ with a data matrix $X \in \mathbb{R}^{n \times d}$ so that $\hat{\mathbf{y}} = \sigma(X\mathbf{w})$:
import torch

def f(xi: torch.Tensor, w: torch.Tensor) -> torch.Tensor:
    """Compute sigmoid(w^T x) for a single example xi."""
    logits = torch.matmul(w, xi)
    return torch.sigmoid(logits)

if __name__ == "__main__":
    w = torch.randn(3)
    X = torch.randn(5, 3)
    for xi in X:  # does this work?
        yi_hat = f(xi, w)
        print(yi_hat.item())
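The vectorized evaluation promised above replaces the loop entirely; a one-line sketch over the same randomly generated X and w:

y_hat = torch.sigmoid(X @ w)  # shape (5,): one probability per row of X
print(y_hat)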
For example, let's trace through the evaluation of our model with the following input example:
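As a hypothetical trace (made-up weights and features for illustration), take $\mathbf{w} = (0.5, -1.0, 2.0)$ and $\mathbf{x} = (1.0, 2.0, 1.5)$. The log odds are $\mathbf{w}^\top \mathbf{x} = 0.5 - 2.0 + 3.0 = 1.5$, the probability is $\sigma(1.5) \approx 0.82 > 0.5$, and so the decision boundary assigns positive sentiment ($+1$).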
Inductive Principle
For the loss function, we will use the cross entropy loss.
1.4.3 Iterative Optimization via Gradient Descent
For estimating the parameters $\mathbf{w}$ that minimize the loss $L(\mathbf{w})$, we will optimize iteratively using gradient descent, for the two reasons already mentioned: with the linear regression model class we are memory-bottlenecked on materializing $X^\top X$, and with other classes of functions (which is currently the case) we don't have a closed-form solution.
Gradient descent, simply put, is an optimization method that uses differential calculus, namely, the fact that the negative gradient points in the direction of steepest descent, providing a hint for how to iteratively move the parameters closer to an optimum. The gradient descent algorithm can be expressed in a one-line update rule, so that the goal of $\arg\min_{\mathbf{w}} L(\mathbf{w})$ is implemented by:
$$ \mathbf{w}_{t+1} \coloneqq \mathbf{w}_t - \eta \nabla_{\mathbf{w}} L(\mathbf{w}_t) $$
where $\eta$ is referred to as the learning rate or step size. We now dive deeper into differential calculus and generalize the derivative to higher dimensional spaces9 to better understand what's happening under the hood.
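As a minimal sketch of this loop for the logistic model (synthetic data and a hypothetical learning rate; note it uses the $\{0, 1\}$ label encoding7 that binary cross entropy expects):

import torch

torch.manual_seed(0)
X = torch.randn(100, 3)                                   # synthetic feature vectors
y = (X @ torch.tensor([1.0, -2.0, 0.5]) > 0).float()      # synthetic 0/1 labels

w = torch.zeros(3, requires_grad=True)
lr = 0.1                                                  # learning rate eta

for step in range(100):
    y_hat = torch.sigmoid(X @ w)                          # forward pass
    loss = torch.nn.functional.binary_cross_entropy(y_hat, y)
    loss.backward()                                       # gradient of loss w.r.t. w
    with torch.no_grad():
        w -= lr * w.grad                                  # the one-line update rule
        w.grad.zero_()

print(w)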
1.4.4 Multinomial Logistic Regression
1.4.5 Generalized Linear Models
1. While predicting house prices is not AGI, we follow this hello world in order to provide an easy example while introducing the foundational mechanics and workhorses of machine learning, which are crucial in part two of the course notes where we cover more industrial machine learning. In software 1.0 you should know how to reverse a linked list. In software 2.0 you should know how to fit a line.
2. Our visual cortex won't work in input-output spaces with dimensions higher than 3 anyway.
3. When loss functions are convex we can find the best predictor which minimizes the loss. This used to be very important before the empirical success of deep neural networks, a function class which is highly non-convex, as we will see in part two of the course notes.
4. With the inductive principle of maximum likelihood estimation.
5. In that paper, they refer to components of the computer architecture as "organs". With each new class of computer unlocking new applications, humans find grand, inspiring names such as the digital computer, the intergalactic network, the cloud, and artificial intelligence. The more things change, the more they stay the same.
6. Even in the case of x86-64 based instruction set architectures, which decode instructions into microinstructions.
7. The encodings are arbitrary. The output space can alternatively be the set $\{0, 1\}$, which we avoid to prevent confusion between the categorical encodings and the probabilities $p \in [0,1]$ assigned to them.
8. Perhaps a more suitable name for the model is "sigmoidal classification" rather than "logistic regression".
9. What is known as the Fréchet derivative, which is defined on any vector space equipped with a norm.