Decentralized Learning¶
Types¶
| | Distributed | Offloading | Federated | Collaborative Learning |
|---|---|---|---|---|
| Working | Send data to central server for training; device only used as a sensor | | Data never stored in the data center; encrypt data and only decrypt after averaging 1000 updates | Each device maintains a functional model |
| Model Location | Servers | Servers | Device; aggregated on cloud | Device |
| Data Location | Device → Servers | Device → Servers | Device | Device |
| Design goals | Speed | | Privacy, online learning, security, scale | |
| Device Types | Same | | Different | |
| Device Compute Power | High | | Low | |
| Training | Complex | | Simple; run training when the phone is charging, transmit updates when WiFi is available | |
| Training examples | | | Next-word prediction | |
| Number of devices | 10-1k | | 100k+ | |
| Network speed | Fast | | Slow | |
| Network reliability | Reliable | | Intermittent | |
| Data Distribution | IID | | Non-IID: each device has its own data distribution, not representative of the overall training data | |
| Applications | | | Privacy: personal data from devices, health data from hospitals; Continuous data: smart home/city, autonomous vehicles | |
| Advantages | Saves device battery; no need to support on-device training; better accuracy? | | | Most secure: data never aggregated on a central server that could be compromised; Most scalable: no central server with bandwidth limitations |
| Limitations | Poor privacy; worse scalability | | Not fully private: data can be recovered from model parameters/gradient updates; consumes higher total energy | |
| Challenges | | | Poor network | All challenges of FL |
| Example | Google Photos | | | |
Terms¶
| Term | Definition |
|---|---|
| Straggler | Device that doesn't return its update on time |
| Data Imbalance | One device has 10k samples, while 10k devices have 1 sample each |
Compression¶
- Gradient
- Data
Quantization¶
Apply quantization to gradients before transmission
Communication cost drops linearly with bit width
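A minimal sketch of the idea, assuming NumPy and a uniform symmetric quantizer; the function names and the 8-bit default are illustrative, not from the source:

```python
import numpy as np

def quantize_gradient(grad: np.ndarray, bits: int = 8):
    """Uniformly quantize a gradient tensor to `bits` bits before transmission."""
    levels = 2 ** (bits - 1) - 1              # symmetric signed integer range
    scale = np.max(np.abs(grad)) / levels     # one scale factor per tensor
    if scale == 0:
        scale = 1.0                           # all-zero gradient edge case
    q = np.round(grad / scale).astype(np.int8 if bits <= 8 else np.int32)
    return q, scale                           # transmit low-precision q plus one scale

def dequantize_gradient(q: np.ndarray, scale: float) -> np.ndarray:
    """Receiver reconstructs an approximate full-precision gradient."""
    return q.astype(np.float32) * scale

# Sending 8-bit integers instead of 32-bit floats cuts communication ~4x,
# matching the observation that cost drops linearly with bit width.
grad = np.random.randn(1024).astype(np.float32)
q, s = quantize_gradient(grad, bits=8)
approx = dequantize_gradient(q, s)
```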
Pruning¶
Prune gradients based on magnitude and compress away the zeroes
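A sketch of magnitude-based (top-k) gradient sparsification under the same NumPy assumption; `keep_ratio` and the helper names are illustrative:

```python
import numpy as np

def prune_gradient(grad: np.ndarray, keep_ratio: float = 0.01):
    """Keep only the largest-magnitude gradient entries and drop the rest.

    Instead of sending the dense tensor, transmit (indices, values) of the
    surviving entries; the zeroed-out majority compresses away entirely."""
    k = max(1, int(keep_ratio * grad.size))
    flat = grad.ravel()
    idx = np.argpartition(np.abs(flat), -k)[-k:]    # indices of the top-k magnitudes
    return idx.astype(np.int32), flat[idx]          # sparse representation to transmit

def densify(idx: np.ndarray, vals: np.ndarray, shape) -> np.ndarray:
    """Receiver rebuilds a dense (mostly zero) gradient tensor."""
    dense = np.zeros(int(np.prod(shape)), dtype=np.float32)
    dense[idx] = vals
    return dense.reshape(shape)

grad = np.random.randn(256, 256).astype(np.float32)
idx, vals = prune_gradient(grad, keep_ratio=0.01)   # ~1% of entries survive
restored = densify(idx, vals, grad.shape)
```

In practice the pruned-away values are often accumulated locally and folded into later gradients (error feedback) so that small updates are not lost forever.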
Distributed Training¶
- Model Parallelism: Fully-connected layers (many parameters, relatively little computation per parameter)
- Data Parallelism: Convolutional layers (few parameters, most of the computation)
Single-GPU System¶

Model Parallelism¶
All workers train on the same batch
Workers communicate as frequently as the network allows

Necessary for models that do not fit on a single GPU
No method to hide synchronization latency
- Have to wait for data to be sent from upstream model split
- Need to think about how pipelining would work for model-parallel training
Types
- Inter-layer
- Intra-layer
Limitations
- Overhead due to
    - Moving data from one GPU to another via the CPU
    - Synchronization
- Pipelining not easy
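A minimal inter-layer sketch of this, assuming PyTorch and a machine with two CUDA devices; the layer sizes and the split point are arbitrary:

```python
import torch
import torch.nn as nn

class TwoGPUModel(nn.Module):
    """Inter-layer model parallelism: the first half of the network lives on
    cuda:0, the second half on cuda:1; activations are copied across each pass."""
    def __init__(self):
        super().__init__()
        self.part1 = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU()).to("cuda:0")
        self.part2 = nn.Sequential(nn.Linear(4096, 10)).to("cuda:1")

    def forward(self, x):
        x = self.part1(x.to("cuda:0"))
        x = x.to("cuda:1")    # synchronization point: wait for the upstream split
        return self.part2(x)

model = TwoGPUModel()
out = model(torch.randn(32, 1024))
out.sum().backward()          # gradients flow back across the GPU boundary
```

The explicit `.to("cuda:1")` copy is exactly the overhead listed above: while it runs, the downstream GPU sits idle unless the work is pipelined.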
Data Parallelism¶
Each worker trains the same convolutional layers on a different data batch
Workers communicate as frequently as the network allows
| | Working | Communication Overhead | Advantages | Limitations |
|---|---|---|---|---|
| Single GPU | | | | |
| Multiple GPUs | Average gradients across the mini-batch on all GPUs, over PCIe, Ethernet, or NVLink depending on the system | \(kn(n-1)\) | | High communication overhead |
| Parameter Server | | | | |
| Parallel Parameter Sharing | | \(k\) per worker; \(kn/s\) for the server | | |
| Ring Allreduce | Each GPU has a different chunk of the mini-batch | \(2k\dfrac{n-1}{n}\) | Scalable; communication cost independent of \(n\) | |

where

- \(n =\) number of client GPUs
- \(k =\) number of gradients
- \(s =\) number of server GPUs
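To make the formulas concrete, here is a quick comparison with illustrative values of \(k\) and \(n\) (the numbers are mine, not from the source):

```python
# k = 25,000,000 gradients (roughly a ResNet-50-sized model), n = 8 client GPUs.
k, n = 25_000_000, 8
multi_gpu_cost = k * n * (n - 1)      # naive multi-GPU gradient averaging (table formula)
ring_cost = 2 * k * (n - 1) / n       # ring allreduce (table formula)
print(f"{multi_gpu_cost:.2e} vs {ring_cost:.2e}")   # 1.40e+09 vs 4.38e+07 values
```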
Ring-Allreduce¶

Step 1: Reduce-Scatter¶


Step 2: Allgather¶

Weight Updates Types¶
| | Synchronous | Asynchronous |
|---|---|---|
| Working | - Before forward pass, fetch latest parameters from server - Compute loss on each GPU using these latest parameters - Gradients sent back to server to update model | Same flow, but the server applies each worker's gradients as they arrive, so workers may compute on stale parameters |
| Speed per epoch | Slow | Fast |
| Training convergence | Fast | Slow |
| Accuracy | Better | Worse |
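A toy, single-process sketch of the synchronous pattern described in the table, assuming NumPy; the `ParameterServer` class and the least-squares objective are stand-ins, not from the source:

```python
import numpy as np

class ParameterServer:
    """In-process stand-in for a parameter server that would normally sit
    across the network from the worker GPUs."""
    def __init__(self, dim: int, lr: float = 0.1):
        self.w = np.zeros(dim)
        self.lr = lr

    def pull(self) -> np.ndarray:
        return self.w.copy()                  # workers fetch the latest parameters

    def push_sync(self, grads) -> None:
        # Synchronous update: wait for ALL workers, average, apply once.
        self.w -= self.lr * np.mean(grads, axis=0)

def worker_gradient(w, x_batch, y_batch):
    """Least-squares gradient on one worker's mini-batch (stand-in for a GPU)."""
    return x_batch.T @ (x_batch @ w - y_batch) / len(y_batch)

rng = np.random.default_rng(0)
server = ParameterServer(dim=8)
for _ in range(100):                          # one synchronous round per iteration
    w = server.pull()                         # 1. fetch latest parameters
    grads = []
    for _ in range(4):                        # 2. each worker computes its gradient
        x, y = rng.normal(size=(32, 8)), rng.normal(size=32)
        grads.append(worker_gradient(w, x, y))
    server.push_sync(grads)                   # 3. send gradients back, update model
```

An asynchronous variant would apply each gradient as soon as it arrives, so fast workers never wait for stragglers but may train against stale parameters.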
Pipeline Parallelism¶

Federated Learning¶
“Federated”: Distributed but “report to” one central entity
Conventional learning
- Data collection
- Data Labeling (if supervised)
- Data cleaning
- Model training
But new data is generated very frequently, so the model needs to be updated continuously
Steps¶
- Download model from cloud to devices
- Personalization: Each device trains model on its own local data
- Devices send their model updates back to server
- Update global model
- Repeat steps 1-4
Each iteration of this loop is called a “round” of learning
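A minimal single-process simulation of one such round, assuming NumPy; the linear model, learning rate, and all function names are illustrative stand-ins for the real on-device model:

```python
import numpy as np

def local_update(w_global, x, y, lr=0.05, epochs=1):
    """Step 2: a device trains the downloaded model on its own local data
    (a linear model with squared loss stands in for the real network)."""
    w = w_global.copy()
    for _ in range(epochs):
        w -= lr * x.T @ (x @ w - y) / len(y)
    return w

def federated_round(w_global, device_data):
    """One round: (1) broadcast the model, (2) local training on each device,
    (3) devices upload their updates, (4) server aggregates a new global model."""
    updates = [local_update(w_global, x, y) for x, y in device_data]
    return np.mean(updates, axis=0)               # simple average of device models

rng = np.random.default_rng(1)
w_global = np.zeros(4)
devices = [(rng.normal(size=(32, 4)), rng.normal(size=32)) for _ in range(5)]
for _ in range(20):                               # each pass is one "round"
    w_global = federated_round(w_global, devices)
```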
Algorithms¶
| | Working | Handling Stragglers | Handling Data Imbalance |
|---|---|---|---|
| FedAvg | The more data points a device has, the higher its weight when updating the global model | Drop | Poor |
| FedProx | | Use partial results | Discourage large weight updates through the regularization term \(\lambda \lVert w' - w \rVert^2\) (\(w =\) weights of a single device) |
| q-FedAvg | | | Discourage large weight updates for any single device |
| Per-FedAvg | | | |

Data Labelling¶
How to get labels
- Sometimes explicit labeling not required: Next-word prediction
- Need to incentivize users to label own data: Google Photos
- Use data for unsupervised learning
Types¶
| Type | Description |
|---|---|
| Horizontal | Devices share the same feature space but hold different samples |
| Vertical | Parties hold different features for the same set of samples |
| Transfer | Parties differ in both samples and features; transfer learning bridges the two |










