Efficiency¶
All elements within a single layer of a network are parallelizable, since they do not depend on one another
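As a minimal sketch of this (layer sizes are illustrative assumptions), a fully-connected layer can evaluate all of its neurons at once with a single vectorized matrix-vector product instead of a per-neuron loop:

```python
import numpy as np

# Every neuron in one fully-connected layer depends only on the previous
# layer's outputs, so all of them can be computed at once.
rng = np.random.default_rng(0)
W = rng.standard_normal((4, 3))   # 4 neurons, 3 inputs each
x = rng.standard_normal(3)

# One matrix-vector product evaluates all 4 neurons in parallel
parallel = np.maximum(W @ x, 0.0)   # ReLU over the whole layer

# Equivalent sequential loop, one neuron at a time
sequential = np.array([max(W[i] @ x, 0.0) for i in range(4)])

assert np.allclose(parallel, sequential)
```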
CPU Chip Area¶

Hardware Types¶
| Purpose | Hardware | Characteristic |
|---|---|---|
| General | CPU (Central Processing Unit) | Low-latency control flow |
| General | GPU (Graphics Processing Unit) | High-throughput data flow |
| General | TPU (Tensor Processing Unit), NPU (Neural Processing Unit) | |
| Specialized | FPGA (Field-Programmable Gate Array) | Re-programmable logic |
| Specialized | ASIC (Application-Specific Integrated Circuit) | Fixed logic |
Performance Metrics¶
| Metric | Units | Description | Affected by Hardware | Affected by DNN |
|---|---|---|---|---|
| Compute | FLOPs/s | **F**loating-point **op**erations per **s**econd | ✅ | ❌ |
| Compute | OPs/s | Non-floating-point **op**erations per **s**econd | ✅ | ❌ |
| Compute | MACs/s | Multiply-accumulate operations per second; half the FLOPs/s count, since 1 MAC = 2 FLOPs | ✅ | ✅ |
| Latency | s | Number of seconds per operation | ✅ | ✅ |
| Throughput | Ops/s | Number of operations per second | ✅ | ✅ |
| Memory | GB | Capacity | ❌ | ❌ |
| Memory | GB/s | Bandwidth | ❌ | ❌ |
| Workload | Op/B | Operational intensity | ❌ | ✅ |
| Workload | | HW utilization | ✅ | ✅ |
OPs¶
\[ \begin{aligned} \text{Throughput (OPs/s)} = \underbrace{ \dfrac{1}{\text{Cycles/Op}} \times \text{Cycles/sec} }_\text{for a single PE} \times \text{No of PEs} \end{aligned} \]
PE = Processing Element
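The formula can be checked numerically; the device parameters below are hypothetical, not taken from any real chip:

```python
# Hypothetical device parameters (assumptions, not a real chip):
cycles_per_op = 1          # each PE finishes one op per cycle
clock_hz = 1.5e9           # 1.5 GHz clock, i.e. Cycles/sec
num_pes = 256              # number of processing elements

# OPs/s = (1 / Cycles/Op) x Cycles/sec x No of PEs
ops_per_sec = (1 / cycles_per_op) * clock_hz * num_pes
print(f"{ops_per_sec / 1e9:.1f} GOPs/s")   # 384.0 GOPs/s
```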
Roofline Plot¶
Characterizes the performance of a given hardware device across different workloads, helping identify whether a workload is memory-bound or compute-bound

| Bound | Speed-up technique | |
|---|---|---|
| Memory-bound | Algorithmic improvement (e.g., reduced precision raises operational intensity) | ![]() |
| Memory-bound | Faster memory chip | ![]() |
| Compute-bound | Faster PEs (overclocking) | ![]() |
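The roofline itself is just the minimum of peak compute and bandwidth times operational intensity; a sketch with hypothetical device numbers:

```python
# Roofline: attainable performance = min(peak compute, bandwidth * intensity).
# Device numbers are illustrative assumptions, not a real chip.
peak_ops = 10e12        # 10 TOPs/s peak compute
bandwidth = 100e9       # 100 GB/s memory bandwidth

def attainable(intensity_ops_per_byte):
    """Attainable Ops/s for a workload with the given operational intensity."""
    return min(peak_ops, bandwidth * intensity_ops_per_byte)

# Ridge point: the intensity at which a workload stops being memory-bound
ridge = peak_ops / bandwidth   # 100 Op/B

assert attainable(10) == 1e12         # memory-bound: 10 Op/B * 100 GB/s
assert attainable(1000) == peak_ops   # compute-bound: capped at peak
```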
Operational Intensity¶
\[ \begin{aligned} \text{Operational Intensity} &= \dfrac{\text{No of Ops}}{\text{Mem Footprint}} \\ \text{No of Ops} &= \text{Multiplications} + \text{Additions} \\ \text{Mem Footprint} &= \text{Size of parameters} + \text{Size of activations} \end{aligned} \]
Quantifies the ratio of computations to memory footprint of a DNN
The same DNN can have a different operational intensity on different hardware if each device supports a different numerical precision, since the size of the data affects operational intensity
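To make the precision effect concrete, here is a sketch computing the operational intensity of a single fully-connected layer under FP32 (4 B/element) and INT8 (1 B/element); the layer size is an illustrative assumption:

```python
# Operational intensity of one fully-connected layer at two precisions.
in_features, out_features = 1024, 1024   # illustrative layer size

macs = in_features * out_features        # one MAC per weight
ops = 2 * macs                           # multiplications + additions

def intensity(bytes_per_element):
    """Op/B = No of Ops / (size of parameters + size of activations)."""
    params = in_features * out_features * bytes_per_element
    activations = (in_features + out_features) * bytes_per_element
    return ops / (params + activations)

# Same DNN, different intensity: smaller data types raise Op/B
assert intensity(1) > intensity(4)       # INT8 vs FP32
```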

IDK¶

Performance Bottlenecks¶
- Memory access efficiency
    - Uncoalesced reads
- Compute utilization
    - Overhead of control logic
- Complex DNN topologies
    - Control flow and data hazards may stall execution even if the hardware is available
Hardware Efficiency¶
Energy breakdown¶

Hardware Efficiency Approaches¶
| Approach | Technique | Details | |
|---|---|---|---|
| Arithmetic | Specialized instructions | Amortize overhead (reduce the overhead fraction); perform complex/fused operations with the same data fetch: SIMD, matrix-multiply units, HFMA, HDP4A, HMMA | ![]() |
| Arithmetic | Quantization | Lower numerical precision | ![]() |
| Memory | Locality | Move data to inexpensive on-chip memory | ![]() ![]() |
| Memory | Re-use | Avoid expensive memory fetches. Temporal: read once, use the same data multiple times on the same PE (SIMD, SIMT). Spatial: read once, use the data across multiple PEs (dataflow processing; weight-stationary for CNNs, input-stationary for fully-connected layers, output-stationary) | ![]() ![]() ![]() |
| Operations | Sparsity | Skip ineffectual operations: activation sparsity (sparse activation functions, e.g. ReLU); weight sparsity (regularization/pruning); block sparsity (coarse-grained vs fine-grained, the latter with more overhead) | |
| Operations | Interleaving | | |
| Operations | Model storage | CSC (Compressed Sparse Column) representation | |
| Model optimization: change the DNN architecture (and hence the workload) to better fit the hardware | Compression | | |
| Model optimization | Distillation | | |
| Model optimization | AutoML | | |
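As a sketch of the CSC representation mentioned above, the loop below compresses a small pruned weight matrix into nonzero values, row indices, and column pointers (the matrix contents are an illustrative assumption):

```python
import numpy as np

# Compressed Sparse Column (CSC): store only the nonzeros, walking columns,
# plus the row index of each value and a pointer to where each column starts.
dense = np.array([[0.0, 2.0, 0.0],
                  [1.0, 0.0, 0.0],
                  [0.0, 3.0, 4.0]])

values, row_idx, col_ptr = [], [], [0]
for j in range(dense.shape[1]):            # walk columns
    for i in range(dense.shape[0]):
        if dense[i, j] != 0.0:
            values.append(dense[i, j])
            row_idx.append(i)
    col_ptr.append(len(values))            # nonzeros seen so far

assert values == [1.0, 2.0, 3.0, 4.0]
assert row_idx == [1, 0, 2, 2]             # row of each stored value
assert col_ptr == [0, 1, 3, 4]             # column j spans values[col_ptr[j]:col_ptr[j+1]]
```

Storage drops from 9 elements to 4 values plus the two index arrays, and the column pointers let a PE iterate only over the effectual weights of each column.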
Relative to its integer counterpart, floating-point addition is more expensive than floating-point multiplication is, because addition requires shifting operations to align exponents
Guidelines for DSAs¶
Domain-Specific Architectures
- Dedicated memory to minimize distance of data transfer
- Invest resources saved from dropping advanced micro-architectural optimizations into more arithmetic units/larger memories
- Use easiest form of parallelism that matches the domain
- Reduce data size and type to simplest needed for the domain
- Use a domain-specific programming language to port code to the DSA