Efficiency¶
All elements within a single layer of a network are parallelizable, since they do not depend on one another
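As a minimal sketch of this (layer sizes are illustrative assumptions), a fully-connected layer can evaluate all of its neurons at once with a single vectorized matrix-vector product instead of a per-neuron loop:

```python
import numpy as np

# Every neuron in one fully-connected layer depends only on the previous
# layer's outputs, so all of them can be computed at once.
rng = np.random.default_rng(0)
W = rng.standard_normal((4, 3))   # 4 neurons, 3 inputs each
x = rng.standard_normal(3)

# One matrix-vector product evaluates all 4 neurons in parallel
parallel = np.maximum(W @ x, 0.0)   # ReLU over the whole layer

# Equivalent sequential loop, one neuron at a time
sequential = np.array([max(W[i] @ x, 0.0) for i in range(4)])

assert np.allclose(parallel, sequential)
```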
CPU Chip Area¶

Hardware Types¶
| Purpose | Hardware | Characteristic |
|---|---|---|
| General | CPU (Central Processing Unit) | Low-latency control flow |
| General | GPU (Graphics Processing Unit) | High-throughput data flow |
| General | TPU (Tensor Processing Unit), NPU (Neural Processing Unit) | |
| Specialized | FPGA (Field-Programmable Gate Array) | Re-programmable logic |
| Specialized | ASIC (Application-Specific Integrated Circuit) | Fixed logic |
Performance Metrics¶
| Metric | Units | Description | Affected by Hardware | Affected by DNN |
|---|---|---|---|---|
| Compute | FLOPs/s | **F**loating-point **op**erations per **s**econd | ✅ | ❌ |
| Compute | OPs/s | Non-floating-point **op**erations per **s**econd | ✅ | ❌ |
| Compute | MACs/s | Multiply-accumulate operations per second; half the FLOPs/s count, since 1 MAC = 2 FLOPs | ✅ | ✅ |
| Latency | s | Number of seconds per operation | ✅ | ✅ |
| Throughput | Ops/s | Number of operations per second | ✅ | ✅ |
| Memory | GB | Capacity | ❌ | ❌ |
| Memory | GB/s | Bandwidth | ❌ | ❌ |
| Workload | Op/B | Operational intensity | ❌ | ✅ |
| Workload | | HW utilization | ✅ | ✅ |
OPs¶
\[ \begin{aligned} \text{Throughput (OPs/s)} = \underbrace{ \dfrac{1}{\text{Cycles/Op}} \times \text{Cycles/sec} }_\text{for a single PE} \times \text{No of PEs} \end{aligned} \]
PE = Processing Element
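The formula can be checked numerically; the device parameters below are hypothetical, not taken from any real chip:

```python
# Hypothetical device parameters (assumptions, not a real chip):
cycles_per_op = 1          # each PE finishes one op per cycle
clock_hz = 1.5e9           # 1.5 GHz clock, i.e. Cycles/sec
num_pes = 256              # number of processing elements

# OPs/s = (1 / Cycles/Op) x Cycles/sec x No of PEs
ops_per_sec = (1 / cycles_per_op) * clock_hz * num_pes
print(f"{ops_per_sec / 1e9:.1f} GOPs/s")   # 384.0 GOPs/s
```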
Roofline Plot¶
Characterizes the performance of a given hardware device across different workloads, helping identify whether a workload is memory-bound or compute-bound

| Bound | Speed-up technique | |
|---|---|---|
| Memory-bound | Algorithmic improvement (e.g., reduced precision raises operational intensity) | ![]() |
| Memory-bound | Faster memory chip | ![]() |
| Compute-bound | Faster PEs (overclocking) | ![]() |
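The roofline itself is just the minimum of peak compute and bandwidth times operational intensity; a sketch with hypothetical device numbers:

```python
# Roofline: attainable performance = min(peak compute, bandwidth * intensity).
# Device numbers are illustrative assumptions, not a real chip.
peak_ops = 10e12        # 10 TOPs/s peak compute
bandwidth = 100e9       # 100 GB/s memory bandwidth

def attainable(intensity_ops_per_byte):
    """Attainable Ops/s for a workload with the given operational intensity."""
    return min(peak_ops, bandwidth * intensity_ops_per_byte)

# Ridge point: the intensity at which a workload stops being memory-bound
ridge = peak_ops / bandwidth   # 100 Op/B

assert attainable(10) == 1e12         # memory-bound: 10 Op/B * 100 GB/s
assert attainable(1000) == peak_ops   # compute-bound: capped at peak
```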
Operational Intensity¶
\[ \begin{aligned} \text{Operational Intensity} &= \dfrac{\text{No of Ops}}{\text{Mem Footprint}} \\ \text{No of Ops} &= \text{Multiplications} + \text{Additions} \\ \text{Mem Footprint} &= \text{Size of parameters} + \text{Size of activations} \end{aligned} \]
Quantifies the ratio of computations to memory footprint of a DNN
The same DNN can have a different operational intensity on different hardware if each device supports a different numerical precision, since the size of the data affects operational intensity
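To make the precision effect concrete, here is a sketch computing the operational intensity of a single fully-connected layer under FP32 (4 B/element) and INT8 (1 B/element); the layer size is an illustrative assumption:

```python
# Operational intensity of one fully-connected layer at two precisions.
in_features, out_features = 1024, 1024   # illustrative layer size

macs = in_features * out_features        # one MAC per weight
ops = 2 * macs                           # multiplications + additions

def intensity(bytes_per_element):
    """Op/B = No of Ops / (size of parameters + size of activations)."""
    params = in_features * out_features * bytes_per_element
    activations = (in_features + out_features) * bytes_per_element
    return ops / (params + activations)

# Same DNN, different intensity: smaller data types raise Op/B
assert intensity(1) > intensity(4)       # INT8 vs FP32
```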

IDK¶

Performance Bottlenecks¶
- Memory access efficiency
    - Uncoalesced reads
- Compute utilization
    - Overhead of control logic
- Complex DNN topologies
    - Control flow and data hazards may stall execution even if the hardware is available
Hardware Efficiency¶
Energy breakdown¶

Hardware Efficiency Approaches¶
| Approach | Technique | Details | |
|---|---|---|---|
| Arithmetic | Specialized instructions | Amortize overhead (reduce the overhead fraction); perform complex/fused operations with the same data fetch: SIMD, matrix-multiply units, HFMA, HDP4A, HMMA | ![]() |
| Arithmetic | Quantization | Lower numerical precision | ![]() |
| Memory | Locality | Move data to inexpensive on-chip memory | ![]() ![]() |
| Memory | Re-use | Avoid expensive memory fetches. Temporal: read once, use the same data multiple times on the same PE (SIMD, SIMT). Spatial: read once, use the data across multiple PEs (dataflow processing; weight-stationary for CNNs, input-stationary for fully-connected layers, output-stationary) | ![]() ![]() ![]() |
| Operations | Sparsity | Skip ineffectual operations: activation sparsity (sparse activation functions, e.g. ReLU); weight sparsity (regularization/pruning); block sparsity (coarse-grained vs fine-grained, the latter with more overhead) | |
| Operations | Interleaving | | |
| Operations | Model storage | CSC (Compressed Sparse Column) representation | |
| Model optimization: change the DNN architecture (and hence the workload) to better fit the hardware | Compression | | |
| Model optimization | Distillation | | |
| Model optimization | AutoML | | |
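As a sketch of the CSC representation mentioned above, the loop below compresses a small pruned weight matrix into nonzero values, row indices, and column pointers (the matrix contents are an illustrative assumption):

```python
import numpy as np

# Compressed Sparse Column (CSC): store only the nonzeros, walking columns,
# plus the row index of each value and a pointer to where each column starts.
dense = np.array([[0.0, 2.0, 0.0],
                  [1.0, 0.0, 0.0],
                  [0.0, 3.0, 4.0]])

values, row_idx, col_ptr = [], [], [0]
for j in range(dense.shape[1]):            # walk columns
    for i in range(dense.shape[0]):
        if dense[i, j] != 0.0:
            values.append(dense[i, j])
            row_idx.append(i)
    col_ptr.append(len(values))            # nonzeros seen so far

assert values == [1.0, 2.0, 3.0, 4.0]
assert row_idx == [1, 0, 2, 2]             # row of each stored value
assert col_ptr == [0, 1, 3, 4]             # column j spans values[col_ptr[j]:col_ptr[j+1]]
```

Storage drops from 9 elements to 4 values plus the two index arrays, and the column pointers let a PE iterate only over the effectual weights of each column.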
Relative to its integer counterpart, floating-point addition is more expensive than floating-point multiplication is, because addition requires shifting operations to align exponents
Guidelines for DSAs¶
Domain-Specific Architectures
- Dedicated memory to minimize distance of data transfer
- Invest resources saved from dropping advanced micro-architectural optimizations into more arithmetic units/larger memories
- Use easiest form of parallelism that matches the domain
- Reduce data size and type to simplest needed for the domain
- Use a domain-specific programming language to port code to the DSA