Singularity Systems Overview
Contents
Singularity Overview — Software 2.0
Systems Overview — "Golden Age" Infrastructure Buildout
Implementing 4 compilers might be intimidating. But in the same way artists paint over and over, and mathematicians rederive over and over, language implementors should reimplement over and over.
,--. ,--.
((O ))--((O ))
,'_`--'____`--'_`.
_|---------------------------------|_ to the tensor
| |Tensor Compiler: Torch -> Triton| | ^
| |Tiling Compiler: Triton -> PTX | | ^
| |Vector Compiler: CUDA -> PTX | | ^
| |Scalar Compiler: C -> RISC-V    | | ^
| |---------------------------------| | passing assembly
| |:::::::::::::::::::::::::::::::::| | ^
| |::::::::::::::µarch::::::::::::::| | ^
|_|:::::::::::::::::::::::::::::::::|_| ^
|---------------------------------| from the transistor
__..-' `-..__
.-| : .----------------. : |-.
,\ || | |\______________/| | || /.
/`.\:| | || __ __ __ || | |;/,'\
:`-._\;.| || '--''--''--' || |,:/_.-':
| : | || .----------. || | : |
| | | || '----SSt---' || | | |
| | | || _ _ _ || | | |
:,--.; | || (_) (_) (_) || | :,--.;
(`-'|) | ||______________|| | (|`-')
`--' | |/______________\| | `--'
|____________________|
`.________________,'
(_______)(_______)
(_______)(_______)
(_______)(_______)
(_______)(_______)
| || |
'--------''--------'
Course Information
Singularity Systems: Zero to Hero follows up from Neural Networks: Zero to Hero. We convert micrograd into picograd.
While micrograd helps research scientists to understand the leaky abstraction of backpropagation, picograd is intended for systems programmers and performance engineers looking to further understand the compilers and chips of deep learning.
Try it out with:
pip install picograd
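A first session might look like the following sketch. The torch-like names here are assumptions for illustration, mirroring the eager pytorch1 style the course targets; the actual picograd API may differ:

    import picograd as pg  # hypothetical import name, mirroring the pip package

    # assumed torch-like eager API; consult the picograd docs for the real one
    x = pg.randn(4, 8)
    w = pg.randn(8, 2)
    y = (x @ w).relu()
    y.sum().backward()     # reverse-mode autodiff, as in micrograd but on tensors
    print(w.grad.shape)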
Prerequisites
- solid deep learning (llama)
- solid systems programming (C || C++ || Rust)
Syllabus
Core: Deep Learning Compilers
- dfdx(nd): implements an interpreter for neural networks (HIPS/autograd)
- brrr: accelerates the interpreter with vector processing (pytorch1)
- pt2: constructs a compiler for neural networks (pytorch2)
- az1use1: 3d parallelism
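To make the pytorch1/pytorch2 split concrete before building it ourselves, here is the distinction in PyTorch itself (a minimal sketch; picograd's own API will be far simpler):

    import torch

    def f(x):
        return torch.relu(x) * 2.0

    x = torch.randn(4)
    eager = f(x)                   # pytorch1 style: ops interpreted one kernel at a time
    f_compiled = torch.compile(f)  # pytorch2 style: f is traced into a graph and compiled
    assert torch.allclose(eager, f_compiled(x))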
Throughout the past decade, modern-day AI infrastructure has rapidly evolved to meet the needs of deep neural networks, most notably with the throughput of GPUs moving from TFLOPS to PFLOPS. Datacenter computing now aims for machines with EFLOP speeds, now that the throughput of the fastest (non-distributed) supercomputers on TOP500 LINPACK workloads is just reaching EFLOP levels.
Although the brain is an existence proof of physics powering 20 PFLOP machines with 20 W, the problem with the semiconductor physics of today is two-fold:
- instruction-level parallelism from out-of-order superscalar pipelines hits diminishing returns
- frequency scaling is hitting Dennard scaling's power wall
and so the free lunch of single-thread performance from Moore's law, which transitioned us across computer classes from minis to micros and from micros to mobile, is "over".
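To see the power wall in back-of-the-envelope numbers: dynamic CMOS power is roughly P ≈ C·V²·f. Under Dennard scaling, a linear shrink cut capacitance and voltage while raising frequency, keeping power density flat; once leakage pinned the voltage, the same shrink blew the power budget. A rough sketch (the 0.7x shrink factor is the classic textbook value, not a measurement):

    # back-of-the-envelope: dynamic CMOS power P ≈ C · V² · f
    def power(C, V, f):
        return C * V**2 * f

    k = 0.7  # classic ~0.7x linear shrink per process node
    # Dennard era: C and V shrink with the transistor while f rises;
    # per-transistor power drops ~k², matching the area shrink, so density stays flat
    print(power(C=k, V=k, f=1/k))    # ~0.49 of the previous node's power
    # post-Dennard: leakage pins V, so the same shrink plus frequency bump
    # leaves per-transistor power flat while density ~doubles: the power wall
    print(power(C=k, V=1.0, f=1/k))  # ~1.0, but packed into ~half the area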
As a result, computer architects are moving from homogeneous general-purpose hardware to heterogeneous specialized hardware, which means the complexity of extracting program performance leaks upward from the hardware: to unlock a machine's full performance these days, it is the programmer's responsibility to program the vector processors in multi-core/many-core machines.
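As an illustration of that responsibility (in NumPy rather than raw SIMD intrinsics), the programmer chooses between a scalar loop, which may or may not get autovectorized, and an explicitly data-parallel expression that dispatches to vectorized kernels:

    import numpy as np

    a, n = 2.0, 100_000
    x, y = np.random.rand(n), np.random.rand(n)

    # scalar style: one element per iteration; autovectorization is not guaranteed
    out = np.empty_like(y)
    for i in range(n):
        out[i] = a * x[i] + y[i]

    # explicitly data-parallel style: numpy dispatches to vectorized kernels
    out_vec = a * x + y
    assert np.allclose(out, out_vec)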
The problem with the vector processing of multi-core/many-core machines is two-fold:
- programming model: the "sufficiently smart compiler" that could autovectorize arbitrary code never materialized
- execution model: program speedups were bounded by Amdahl's law
But the industry sidestepped these problems by changing the programming model to SIMT on SIMD (CUDA) and finding domains whose execution models had more parallelism (deep neural networks).
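Amdahl's law makes the execution-model problem concrete: if only a fraction p of a program's runtime is parallelizable, even an infinitely fast parallel unit cannot beat 1/(1-p). A minimal sketch:

    # Amdahl's law: speedup when a fraction p of runtime is accelerated by factor s
    def amdahl_speedup(p: float, s: float) -> float:
        return 1.0 / ((1.0 - p) + p / s)

    print(amdahl_speedup(p=0.50, s=1000))  # ~2x: the serial half dominates
    print(amdahl_speedup(p=0.99, s=1000))  # ~91x: only highly parallel workloads pay off

Deep neural networks sit near the p → 1 end of that curve, which is why they pair so well with SIMT hardware.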
The challenge that compiler engineers and chip architects face (and what makes this a golden age) is to find the optimal mapping from intelligence to energy. This means creating new programming languages and machines while minimizing the accidental complexity that naturally builds up along the way:
- The Golden Age of Compiler Design (Lattner)
- A New Golden Age for Computer Architecture (Hennessy and Patterson)
Singularity Systems
The Singularity Systems: Zero to Hero course follows up where Neural Networks: Zero to Hero left off: we will convert micrograd into picograd, where the main difference is that:
- micrograd is a backprop engine with scalar support, helping researchers understand that backpropagation as an abstraction is in fact leaky (gradients, activations, normalizations)
- picograd leans closer to modern-day deep learning frameworks, with tensor support for both pytorch1-style interpretation and pytorch2-style compilation
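For reference, micrograd's scalar engine fits in a few lines (API per karpathy/micrograd; shown here as a sketch):

    from micrograd.engine import Value

    a = Value(2.0)
    b = Value(3.0)
    c = (a * b + 1.0).relu()  # builds the scalar compute graph node by node
    c.backward()              # reverse-mode autodiff over that graph
    print(a.grad, b.grad)     # dc/da = 3.0, dc/db = 2.0

picograd lifts the same idea from scalars to tensors and adds a graph-capturing compile path.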
While picograd is oriented more toward low-level systems programmers and performance engineers, the framework remains pedagogical and point-wise. This means that we will "only" support:
- 1 model: llama
- 1 programming model: eager
- 2 execution models: eager, graph
- 2 hardware architectures: amd cpu, nvidia gpu
- 2 precisions: fp32, tf32
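On the precision pair: tf32 keeps fp32's 8-bit exponent but only 10 of its 23 mantissa bits, which is what lets tensor cores trade precision for throughput. A sketch that simulates the mantissa loss by truncation (real hardware rounds to nearest rather than truncating):

    import struct

    def to_tf32(x: float) -> float:
        # keep fp32's sign + 8 exponent bits, zero the low 13 of 23 mantissa bits
        bits = struct.unpack(">I", struct.pack(">f", x))[0]
        return struct.unpack(">f", struct.pack(">I", bits & ~0x1FFF))[0]

    print(to_tf32(3.14159265))  # 3.140625: ~3 decimal digits of mantissa survive
    print(to_tf32(1.0000001))   # 1.0: the last ulp of fp32 precision is gone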
Welcome to the golden age of Systems ML!