TT Architecture and Metalium Guide

TT Architecture and Metalium Overview

Tenstorrent has built the future of AI architecture and parallel programming. It achieves high performance on current AI models while remaining flexible and programmable, enabling the invention of future AI models and HPC (High-Performance Computing) applications without the constraints of current architectures. It is designed for both inference and training and, from the ground up, for scale-out of AI workloads, while also scaling down to just a couple of cores. Additionally, it is built using cost-effective components: simple packages, GDDR memory, and Ethernet. This document describes the architecture and the Metalium programming model.

All you need is a Tensix core and a mesh

A Tensix Core is:

  • 5 small RISC-V processors (aka "Baby RISCVs") that run C/C++ kernels and dispatch instructions to the compute and data movement engines
  • 1 MB of SRAM (aka L1), a scratchpad accessible by all RISCVs and engines within the core
  • Matrix engine (aka FPU) that performs matrix multiplication, elementwise, and dot product operations on small matrices (or tiles) of shape 32x32 and similar
  • Vector engine (aka SFPU) for vectorized kernels such as Top-k and Sort, and for special functions such as GELU, Exp, and Sqrt
  • Data Movement engine connected to 2 Networks on Chip (NoCs)

A chip is a collection of cores and I/O blocks, connected into a mesh via a NoC:

  • Tensix compute cores (each with local SRAM)
  • DRAM memory banks
  • Ethernet cores for chip-to-chip interconnect
  • PCIe link for host interface
  • ARC core for board management and control

Near Memory Compute and Efficient use of SRAM

The high-bandwidth, large-capacity SRAM in each Tensix core is a form of near-memory compute. A Tensix core operating on its local SRAM achieves the "silicon peak" that the current technology node allows. Tensix cores are connected into a mesh via 2 NoCs, and each Tensix core can communicate with any other Tensix core in the mesh, with off-chip DRAM, and with the Ethernet cores. GPUs fracture SRAM across levels within the chip: large register files, a small L1, and an L2. That SRAM is used primarily for re-use and pre-fetch on the way to off-chip DRAM, not as a primary form of tensor storage, and large parts of it do not run at peak silicon speed. In contrast, in the TT architecture the entire SRAM is a single level, and its significant capacity allows it to hold intermediates between operations, without relying on HBM as the primary storage for handing data off between operations.
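As a hedged illustration of the two storage levels, here is a minimal host-side sketch that places one buffer in off-chip DRAM and one in L1 SRAM. It assumes the tt-metal host API names (InterleavedBufferConfig, CreateBuffer, BufferType) from recent releases; exact headers, types, and fields may differ in your version.

// Host-side sketch; assumes tt-metal host API names from recent releases.
#include <cstdint>
#include "tt_metal/host_api.hpp"

using namespace tt::tt_metal;

void allocate_buffers(Device* device) {
    constexpr uint32_t tile_bytes = 32 * 32 * 2;  // one 32x32 bfloat16 tile
    constexpr uint32_t num_tiles = 64;

    // Off-chip DRAM buffer: large capacity, data must be moved into SRAM before compute.
    InterleavedBufferConfig dram_config{
        .device = device,
        .size = num_tiles * tile_bytes,
        .page_size = tile_bytes,
        .buffer_type = BufferType::DRAM};
    auto dram_buffer = CreateBuffer(dram_config);

    // L1 SRAM buffer: lives in the Tensix cores, accessed at near-memory-compute speed.
    InterleavedBufferConfig l1_config{
        .device = device,
        .size = num_tiles * tile_bytes,
        .page_size = tile_bytes,
        .buffer_type = BufferType::L1};
    auto l1_buffer = CreateBuffer(l1_config);
}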

Distributed Shared Memory and In-Place Compute

The mesh-of-Tensix-cores architecture is the first to efficiently implement distributed shared memory within the chip and to let programmers and compilers optimize both the layout and the movement of data. In many AI and HPC operations, such as elementwise, the tensors can be laid out (i.e. "sharded") across the SRAMs so that compute operates on local data in place, without any data movement. This is elaborated further in the Scalable Architecture section.
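As a purely conceptual sketch (plain C++, hypothetical types, not a tt-metal API), sharding the rows of a tensor across a core grid so that each core's shard sits in its own L1 and an elementwise op touches only local data:

// Conceptual sketch only: maps row-shards of a tensor onto a mesh of cores.
#include <algorithm>
#include <cstdint>
#include <vector>

struct CoreCoord { uint32_t x, y; };   // illustrative, not the library type

struct Shard {
    CoreCoord core;      // Tensix core that holds this shard in its local SRAM
    uint32_t row_start;  // first tensor row of the shard
    uint32_t row_count;  // number of rows in the shard
};

// Split total_rows across a grid_x * grid_y mesh; an elementwise op then runs
// on every core against its local shard, in place, with no data movement.
std::vector<Shard> shard_rows(uint32_t total_rows, uint32_t grid_x, uint32_t grid_y) {
    const uint32_t num_cores = grid_x * grid_y;
    const uint32_t rows_per_core = (total_rows + num_cores - 1) / num_cores;
    std::vector<Shard> shards;
    for (uint32_t i = 0; i < num_cores && i * rows_per_core < total_rows; ++i) {
        const uint32_t start = i * rows_per_core;
        shards.push_back({{i % grid_x, i / grid_x}, start,
                          std::min(rows_per_core, total_rows - start)});
    }
    return shards;
}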

Explicit Data Movement

The performance and efficiency of data movement in AI and HPC applications is as important as the raw compute capacity of the math engines. In Tensix, data movement is explicit and decoupled from compute. Data movement kernels use the data movement engine in each Tensix core to bring data from neighbouring cores or from off-chip DRAM into the core's local SRAM, and then trigger the compute engine to operate on that data. Data movement in the TT architecture can therefore be pre-planned, optimized, and debugged separately from compute. There are no caches, no global crossbars, no memory-access coalescing, and none of the other complex mechanisms that traditional architectures use to hide data movement from the programmer or compiler. For deeper insight see the section User Kernels: Explicit and Decoupled Data Movement and Compute.
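A sketch of such a data movement (reader) kernel is shown below. The dataflow calls (get_arg_val, cb_reserve_back, get_write_ptr, get_noc_addr, noc_async_read, noc_async_read_barrier, cb_push_back) are from the tt-metal dataflow API; the argument layout and circular buffer index are illustrative assumptions.

// Data movement kernel sketch: pulls tiles over the NoC into a local circular buffer.
#include <cstdint>
#include "dataflow_api.h"

void kernel_main() {
    uint32_t src_addr  = get_arg_val<uint32_t>(0);  // source base address
    uint32_t src_noc_x = get_arg_val<uint32_t>(1);  // NoC x of the source (DRAM bank or another core)
    uint32_t src_noc_y = get_arg_val<uint32_t>(2);  // NoC y of the source
    uint32_t num_tiles = get_arg_val<uint32_t>(3);

    constexpr uint32_t cb_id_in0 = 0;               // circular buffer shared with the compute kernel
    const uint32_t tile_bytes = get_tile_size(cb_id_in0);

    for (uint32_t i = 0; i < num_tiles; ++i) {
        cb_reserve_back(cb_id_in0, 1);              // wait for space in local SRAM
        uint32_t l1_write_addr = get_write_ptr(cb_id_in0);
        uint64_t src_noc_addr = get_noc_addr(src_noc_x, src_noc_y, src_addr);
        noc_async_read(src_noc_addr, l1_write_addr, tile_bytes);  // explicit, asynchronous NoC read
        noc_async_read_barrier();                   // wait for the read to land in L1
        cb_push_back(cb_id_in0, 1);                 // hand the tile to the compute kernel
        src_addr += tile_bytes;
    }
}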

Native Tile-Based Compute

In Tensix, compute instructions operate on tiles: 32x32 matrices of scalars. Operating on coarse chunks of data allows simple, single-threaded RISCV processors to dispatch these instructions. Similarly, the data movement RISCVs issue asynchronous tile-sized data movement instructions to bring data into the scratch SRAM, allowing a large number of outstanding transfers to be generated by a single RISCV data movement processor, concurrently with the compute engine.
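For a sense of the granularity, the footprint of one tile in bytes (ignoring the per-tile headers that block-float formats add):

#include <cstdint>

// Bytes in one 32x32 tile for a given element width, e.g. 2 for bfloat16.
constexpr uint32_t tile_size_bytes(uint32_t bytes_per_element) {
    return 32u * 32u * bytes_per_element;
}

static_assert(tile_size_bytes(2) == 2048, "a bfloat16 tile is 2 KB");
static_assert(tile_size_bytes(4) == 4096, "an fp32 tile is 4 KB");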

Think Bare Metal Cores, Not Threads

Each RISCV processor runs single-threaded, and the core-to-thread mapping is 1:1. Parallelization therefore means breaking the work across cores and dispatching kernels directly to those cores, in contrast to schemes where a very large number of threads is time-slice scheduled onto a limited number of cores. As a result, in the TT architecture there is no context switching and no complex thread scheduling. Once a kernel is dispatched to a core it runs to completion without interruption or preemption by another thread. This simplifies reasoning about performance: it boils down to directly counting cycles of sections of a C/C++ kernel running on a bare-metal RISCV core. Equally important, it simplifies direct debug of kernels via gdb step-through, breakpoints, and printf from cores.
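A hedged host-side sketch of this model, assuming the tt-metal host names SetRuntimeArgs, Program, KernelHandle, and CoreCoord (the argument layout and even split are illustrative): work is parallelized by handing each core its own slice of tiles, after which each kernel runs on its core to completion.

// Host-side sketch; assumes tt-metal host API names, argument layout is illustrative.
#include <cstdint>
#include "tt_metal/host_api.hpp"

using namespace tt::tt_metal;

// Give every core its own slice of tiles; there is no thread scheduler, each
// dispatched kernel simply runs to completion on its core.
void dispatch_per_core(Program& program, KernelHandle reader, KernelHandle compute,
                       uint32_t grid_x, uint32_t grid_y, uint32_t total_tiles) {
    const uint32_t num_cores = grid_x * grid_y;
    const uint32_t tiles_per_core = total_tiles / num_cores;  // assume an even split for brevity
    for (uint32_t i = 0; i < num_cores; ++i) {
        CoreCoord core{i % grid_x, i / grid_x};
        const uint32_t start_tile = i * tiles_per_core;
        SetRuntimeArgs(program, reader, core, {start_tile, tiles_per_core});
        SetRuntimeArgs(program, compute, core, {tiles_per_core});
    }
}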

Scalable Architecture

AI workloads operate on tensors (N-dimensional data) and exhibit a high degree of locality and regularity in the fundamental compute operations:

  • Elementwise operations act locally on each element of the tensor (in place) and can be performed without any data movement
  • Matrix multiplication operations have regular communication across the rows and columns of the matrix
  • Reduction operations can be decomposed across dimensions, such as columns, rows, and nearest neighbours in a matrix
  • Window-based (stencil) operations such as convolutions exchange data with their nearest neighbours

These data movement patterns (local, row/column, nearest neighbour) are most efficiently implemented via a regular and scalable mesh architecture.
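As a purely illustrative sketch (hypothetical types, not a library API), the communication partners of a core on the mesh for each of these patterns:

// Illustrative only: which cores a given core talks to on the mesh for each pattern.
#include <cstdint>
#include <vector>

struct Core { uint32_t x, y; };

// Stencil / convolution: exchange with the nearest neighbours (edges clamped).
std::vector<Core> stencil_peers(Core c, uint32_t grid_x, uint32_t grid_y) {
    std::vector<Core> peers;
    if (c.x > 0)          peers.push_back({c.x - 1, c.y});
    if (c.x + 1 < grid_x) peers.push_back({c.x + 1, c.y});
    if (c.y > 0)          peers.push_back({c.x, c.y - 1});
    if (c.y + 1 < grid_y) peers.push_back({c.x, c.y + 1});
    return peers;
}

// Matrix multiplication: communicate along the core's row and column of the mesh.
std::vector<Core> row_col_peers(Core c, uint32_t grid_x, uint32_t grid_y) {
    std::vector<Core> peers;
    for (uint32_t x = 0; x < grid_x; ++x) if (x != c.x) peers.push_back({x, c.y});
    for (uint32_t y = 0; y < grid_y; ++y) if (y != c.y) peers.push_back({c.x, y});
    return peers;
}
// Elementwise ops have no peers: they run in place on each core's local shard.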

The Tenstorrent architecture is a mesh of cores within a chip and a mesh of chips at the cluster level.

TODO: Describe Galaxy, break up the slide into two slides

Two levels of memory

TODO: Describe SRAM and DRAM as two levels of memory hierarchy within the mesh, both distributed and explicit data movement.

Compute Density via Scale-Out

TODO: Describe that TT wins at scale-out, best compute density at the server and cluster level

MIMD and Control of Both Compute and Data

  • "Program cores not threads"

Everything is a RISCV kernel

  • Bare Metal C/C++ kernels on RISCV
  • User Kernels
    • Data Movement Kernels
    • Compute Kernels
    • Ethernet Data Movement Kernels
  • Dispatch Kernels

Efficiency of Tile-Based Compute and Data Movement

Interleaved and Sharded Buffers

  • Interleaved Buffers
  • Sharded Buffers

Fast Kernel Dispatch

FAQ

Where are CUDA, scalar threads, and caches?

Where is HBM?

For GPU, CPU, FPGA, TPU experts

  • GPU
  • CPU
  • FPGA
  • TPU

First Principles Summary

TT Architecture deep dive

TODO: 1) TOC, write it

Learn Metalium in 7 steps

TODO: 1) TOC, 2) write it

Eltwise Binary Compute Kernel

TODO: this is a placeholder
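The kernel below adds blocks of tiles from two input circular buffers (cb_in0, cb_in1) and packs the results into an output circular buffer (cb_out0); the data movement kernels that feed the inputs and drain the output are not shown.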

#include <cstdint>
#include "compute_kernel_api/eltwise_binary.h"
#include "compute_kernel_api/tile_move_copy.h"

namespace NAMESPACE {
void MAIN {
    uint32_t per_core_block_cnt = get_arg_val<uint32_t>(0);
    uint32_t per_core_block_size = get_arg_val<uint32_t>(1); // should be <= 8 in this kernel

    constexpr auto cb_in0 = tt::CB::c_in0;
    constexpr auto cb_in1 = tt::CB::c_in1;
    constexpr auto cb_out0 = tt::CB::c_out0;

    binary_op_init_common(cb_in0, cb_in1, cb_out0);
    add_tiles_init();

    for(uint32_t block = 0; block < per_core_block_cnt; ++block) {

        // wait for a block of tiles in each of input CBs
        cb_wait_front(cb_in0, per_core_block_size);
        cb_wait_front(cb_in1, per_core_block_size);

        tile_regs_acquire(); // acquire 8 tile registers
        // add a block of tiles
        for(uint32_t i = 0; i < per_core_block_size; ++i)
        {
            add_tiles(cb_in0, cb_in1, i, i, i);
        }
        tile_regs_commit(); // signal the packer

        tile_regs_wait(); // packer waits here
        // pack a block of tiles
        for(uint32_t i = 0; i < per_core_block_size; ++i)
        {
            pack_tile(i, cb_out0);
        }
        tile_regs_release(); // packer releases

        // pop a block of tiles from each of input CBs
        cb_pop_front(cb_in0, per_core_block_size);
        cb_pop_front(cb_in1, per_core_block_size);

        // push a block of tiles to output CB
        cb_push_back(cb_out0, per_core_block_size);
    }

}
}