8. Tensor Compilers
This chapter implements an interpreter for neural networks. By the end of this chapter, you will have a working implementation of the multidimensional array abstraction pioneered by NumPy and the automatic differentiation pioneered by HIPS autograd.
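To make the target concrete, here is a small preview of the kind of program the finished implementation is meant to run. This is a sketch, not part of the chapter's code: it assumes the PyTorch-parity surface described in Part 1 (randn, tanh, @, sum, backward, .grad are assumed names).

import picograd

# A tiny end-to-end example: build tensors, run a forward pass, differentiate.
# Assumes picograd mirrors PyTorch's API (randn, tanh, sum, .backward(), .grad).
x = picograd.randn((4, 3))   # multidimensional array (ndarray) abstraction
W = picograd.randn((3, 2))
W.requires_grad = True

y = picograd.tanh(x @ W)     # forward pass records the compute graph
loss = y.sum()
loss.backward()              # reverse-mode autodiff, as in HIPS autograd

print(W.grad.shape)          # (3, 2): dL/dW has the same shape as W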
Contents
- nd: forward pass with loss = model(x)
- dfdx(nd): backward pass with loss.backward(), opt.step()
- ffn to gpt: lstm, rnn, gpt
- beyond nanogpt: attention variants, kv cache, speculative decoding
- References
Part 1 — nd: forward pass with loss = model(x)
The following is the model definition and inference loop for an FFN language model following (Bengio et al. 2003). The API is a 1:1 match with PyTorch's: take a second to convince yourself by replacing import picograd with import torch and sampling from the inference loop (a sketch of this check follows the listing below).
"""
Dimension key:
B: batch size
T: sequence length
V: vocabulary size
E: embedding dimension (E != D)
D: model dimension
"""
import picograd
import picograd.nn.functional as F  # assumed to mirror torch.nn.functional, per the 1:1 PyTorch API claim
# from jaxtyping import
# *********************MODEL*********************
B, T = 32, 3
V, E, D = 27, 10, 200
class Linear:
    # y = x @ W (+ b): an affine map from D_in to D_out features.
    def __init__(self, D_in, D_out, bias=True):
        self.W_DiDo = picograd.randn((D_in, D_out)) * 0.01
        self.b_Do = picograd.zeros(D_out) if bias else None

    def __call__(self, X_Di):
        self.X_Do = X_Di @ self.W_DiDo
        if self.b_Do is not None: self.X_Do += self.b_Do
        self.out = self.X_Do
        return self.X_Do

    def parameters(self):
        return [self.W_DiDo] + ([] if self.b_Do is None else [self.b_Do])
class Tanh:
    def __call__(self, X_BD):
        self.X_BD = picograd.tanh(X_BD)
        self.out = self.X_BD
        return self.X_BD

    def parameters(self):
        return []
model = [
    Linear(T * E, D, bias=False), Tanh(),
    Linear(D, D, bias=False), Tanh(),
    Linear(D, V, bias=False)
]

C_VE = picograd.randn((V, E))  # token embedding table #, generator=g)
params = [C_VE] + [p for l in model for p in l.parameters()]
for p in params:
    p.requires_grad = True
print("model loaded to cpu")
# *********************INFERENCE LOOP*********************
# NOTE: `decode` (itos) is assumed here: '.' as token 0 followed by 'a'..'z' for the
# 27-token vocabulary; substitute the dataset's own mapping.
decode = {i: ch for i, ch in enumerate('.abcdefghijklmnopqrstuvwxyz')}

for _ in range(20):  # 20 samples
    output, context = [], [0] * T
    while True:
        X_1T = picograd.tensor([context])  # B=1 for inference, T=3, indices in [0, V) (all 0 at init)
        X_1TE = C_VE[X_1T]                 # embed: index into C_VE for each of the T context tokens
        X_1cTE = X_1TE.view(-1, T*E)       # flatten to (B=1, T*E)
        X = X_1cTE

        for h in model:
            X = h(X)
        y_hat = F.softmax(X, dim=1)

        # sample and autoregressively update the context
        token = picograd.multinomial(y_hat, num_samples=1, replacement=True).item()  # , generator=g
        context = context[1:] + [token]
        output.append(decode[token])
        if token == 0:
            break
    print(''.join(output))
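To make the PyTorch-parity claim concrete, here is one way to run the swap test suggested above, written directly against torch. This is a sketch under the assumption that the two libraries are drop-in replacements for the ops used here (randn, tanh, view, @, softmax, multinomial); the three Linear layers are unrolled into raw matmuls (W1, W2, W3 are hypothetical names) to keep the snippet self-contained.

# Swap test: the same forward pass under torch. If the APIs really are 1:1,
# only the import lines differ from the picograd listing above.
import torch
import torch.nn.functional as F

B, T = 32, 3
V, E, D = 27, 10, 200

C_VE = torch.randn((V, E))
W1 = torch.randn((T * E, D)) * 0.01
W2 = torch.randn((D, D)) * 0.01
W3 = torch.randn((D, V)) * 0.01

context = [0] * T
X_1T = torch.tensor([context])        # (1, T) integer context
X_1TE = C_VE[X_1T]                    # (1, T, E) embeddings
X = X_1TE.view(-1, T * E)             # (1, T*E)
X = torch.tanh(X @ W1)
X = torch.tanh(X @ W2)
X = X @ W3
y_hat = F.softmax(X, dim=1)
token = torch.multinomial(y_hat, num_samples=1, replacement=True).item()
print(token)  # an index in [0, V); decode it with the same itos mapping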
At the end of Part 1, we will be able to autoregressively sample from the untrained model using this inference loop.