Singularity Systems Overview

Contents

Singularity Overview — Software 2.0

Systems Overview — "Golden Age" Infrastructure Buildout

Implementing 4 compilers might be intimidating. But in the same way that artists paint over and over and mathematicians rederive over and over, language implementors should reimplement over and over.

                  ,--.    ,--.
                 ((O ))--((O ))
               ,'_`--'____`--'_`.
      _|---------------------------------|_             to the tensor
     | |Tensor Compiler: Torch  -> Triton| |                  ^
     | |Tiling Compiler: Triton -> PTX   | |                  ^
     | |Vector Compiler: CUDA   -> PTX   | |                  ^
     | |Scalar Compiler: C      -> RISC-V| |                  ^
     | |---------------------------------| |           passing assembly
     | |:::::::::::::::::::::::::::::::::| |                  ^
     | |::::::::::::::µarch::::::::::::::| |                  ^
     |_|:::::::::::::::::::::::::::::::::|_|                  ^
       |---------------------------------|           from the transistor
            __..-'            `-..__
         .-| : .----------------. : |-.
       ,\ || | |\______________/| | || /.
      /`.\:| | ||  __  __  __  || | |;/,'\
     :`-._\;.| || '--''--''--' || |,:/_.-':
     |    :  | || .----------. || |  :    |
     |    |  | || '----SSt---' || |  |    |
     |    |  | ||   _   _   _  || |  |    |
     :,--.;  | ||  (_) (_) (_) || |  :,--.;
     (`-'|)  | ||______________|| |  (|`-')
      `--'   | |/______________\| |   `--'
             |____________________|
              `.________________,'
               (_______)(_______)
               (_______)(_______)
               (_______)(_______)
               (_______)(_______)
              |        ||        |
              '--------''--------'

Course Information

Singularity Systems: Zero to Hero picks up where Neural Networks: Zero to Hero left off. We convert

  • micrograd: toy backpropagation engine into...
  • picograd: modern deep learning framework

While micrograd helps research scientists understand the leaky abstraction of backpropagation, picograd is intended for systems programmers and performance engineers looking to understand the compilers and chips of deep learning more deeply.
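
To make the starting point concrete, here is a minimal sketch of the kind of scalar autograd engine micrograd implements (illustrative only; micrograd's actual API differs in the details):

class Value:
    """A scalar that remembers how it was computed, so gradients
    can be backpropagated through the computation graph."""
    def __init__(self, data, _children=()):
        self.data = data
        self.grad = 0.0
        self._backward = lambda: None
        self._prev = set(_children)

    def __add__(self, other):
        out = Value(self.data + other.data, (self, other))
        def _backward():
            self.grad += out.grad
            other.grad += out.grad
        out._backward = _backward
        return out

    def __mul__(self, other):
        out = Value(self.data * other.data, (self, other))
        def _backward():
            # chain rule: d(out)/d(self) = other.data
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad
        out._backward = _backward
        return out

    def backward(self):
        # topological sort, then a reverse-mode sweep
        topo, visited = [], set()
        def build(v):
            if v not in visited:
                visited.add(v)
                for child in v._prev:
                    build(child)
                topo.append(v)
        build(self)
        self.grad = 1.0
        for v in reversed(topo):
            v._backward()

a, b = Value(2.0), Value(3.0)
loss = a * b + a
loss.backward()
print(a.grad, b.grad)  # 4.0 2.0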

Try it out with:

pip install picograd

Prerequisites

  • solid deep learning (llama)
  • solid systems programming (C || C++ || Rust)

Syllabus

Core: Deep Learning Compilers

  1. dfdx(nd): implements an interpreter for neural networks (HIPS/autograd)
  2. brrr: accelerates the interpreter with vector processing (pytorch1)
  3. pt2: constructs a compiler for neural networks (pytorch2)
  4. az1use1: scales training with 3d parallelism

Over the past decade, modern AI infrastructure has evolved rapidly to meet the needs of deep neural networks, most notably with GPU throughput moving from TFLOPS to PFLOPS. Datacenter computing now aims for machines with EFLOP speeds, now that the throughput of the fastest (non-distributed) supercomputers on TOP500 LINPACK workloads is just reaching the EFLOP level.

Although the brain is an existence proof that physics permits 20 PFLOP machines running on 20 W, the problem with today's semiconductor physics is two-fold:

  1. instruction-level parallelism from out-of-order, superscalar pipelines has hit diminishing returns
  2. frequency scaling has run into the power wall left by the end of Dennard scaling

and so the free lunch of single-thread performance from Moore's law, the force that carried computing across classes from minis to micros and from micros to mobile, is "over".

As a result, computer architects are moving from homogeneous general-purpose hardware to heterogeneous specialized hardware, which means the complexity of extracting program performance leaks upward from the hardware: these days, unlocking the full performance of a machine is the programmer's responsibility, down to explicitly programming the vector processors of multi-core/many-core machines.
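
For example, the gap between leaving the vector units idle and using them often looks like the following NumPy sketch (the same idea carries down to SIMD intrinsics and GPU kernels):

import numpy as np

a = 2.0
x = np.random.rand(1_000_000).astype(np.float32)
y = np.random.rand(1_000_000).astype(np.float32)

def saxpy_scalar(a, x, y):
    # one element per iteration: the interpreter and the scalar ALU
    # both sit mostly idle
    out = np.empty_like(x)
    for i in range(len(x)):
        out[i] = a * x[i] + y[i]
    return out

def saxpy_vector(a, x, y):
    # whole-array expression: dispatches to vectorized kernels
    return a * x + y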

The problem with the vector processing of multi-core/many-core machines is two-fold:

  1. programming model: the sufficiently smart autovectorizing compiler never materialized
  2. execution model: program speedups were bounded by Amdahl's law (see the sketch below)
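
Concretely, Amdahl's law bounds the speedup of a whole program when only a fraction p of it can be accelerated:

def amdahl_speedup(p, s):
    # upper bound on whole-program speedup when a fraction p of the
    # work is accelerated by a factor s (Amdahl's law)
    return 1.0 / ((1.0 - p) + p / s)

# even with a near-infinite parallel speedup, a 5% serial
# fraction caps the whole program at ~20x
print(amdahl_speedup(p=0.95, s=1e9))  # ~20.0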

But the industry sidestepped these problems by changing the programming model to SIMT on SIMD (CUDA) and finding domains whose execution models had more parallelism (deep neural networks).
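
That SIMT-on-SIMD model is what the tiling compiler in the stack above targets. As a sketch (following Triton's introductory vector-add example), each program instance handles one tile of the data:

import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)                        # which tile am I?
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements                        # guard the ragged last tile
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

n = 1024
x = torch.rand(n, device='cuda')
y = torch.rand(n, device='cuda')
out = torch.empty_like(x)
grid = (triton.cdiv(n, 256),)
add_kernel[grid](x, y, out, n, BLOCK_SIZE=256)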

The challenge compiler engineers and chip architects face (and the opportunity that produces a golden age) is to find the optimal mapping from intelligence to energy. This means creating new programming languages and machines while minimizing the accidental complexity that naturally builds up along the way.

Singularity Systems

The Singularity Systems: Zero to Hero course picks up where Neural Networks: Zero to Hero left off: we will convert micrograd into picograd, where the main difference is that:

  • micrograd is a backprop engine with scalar support, helping researchers understand that backpropagation as an abstraction is in fact leaky (gradients, activations, normalizations)
  • picograd leans closer to modern-day deep learning frameworks, with tensor support for both pytorch1-style interpretation and pytorch2-style compilation.

While picograd is oriented more toward low-level systems programmers and performance engineers, the framework remains a pedagogical, point-wise compiler. This means that we will "only" support:

  • 1 model: llama
  • 1 programming model: eager
  • 2 execution models: eager, graph
  • 2 hardware architectures: amd cpu, nvidia gpu
  • 2 precisions: fp32, tf32
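
In practice, the two execution models would look something like the following (a hypothetical sketch: picograd's actual API may differ, though it aims to mirror pytorch):

import picograd as pg  # hypothetical torch-like surface

# eager (pytorch1-style): each op is interpreted as it executes
x = pg.randn(4, 8)
w = pg.randn(8, 8)
y = (x @ w).relu()

# graph (pytorch2-style): the function is traced and compiled first
f = pg.compile(lambda x: (x @ w).relu())  # pg.compile is assumed, not confirmed
y2 = f(x)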

Welcome to the golden age of Systems ML!