
Data Intensive Applications

An application is data-intensive when the generation or usage of any of the following increases quickly:

  • Volume of data
  • Complexity of data
  • Speed of change in data

Pillars

| Pillar | Properties |
|---|---|
| Reliable | Fault-tolerance |
| | No unauthorized access |
| | Chaos testing |
| | Robust to full machine failures |
| | Bug-free, automated bug tests |
| | Environments: Dev, Staging/Testing, Prod |
| | Quick roll-backs |
| Scalable | Handle high traffic volume |
| | Traffic load with peak # of reads, writes, simultaneous users |
| | Capacity planning |
| | Response time vs throughput |
| | End user response time |
| | 90th, 95th percentile response times; SLOs/SLAs (service level objectives/agreements) |
| | Vertical scaling: scaling up (more powerful machine) |
| | Horizontal scaling: scaling out (distributed across smaller machines) |
| Maintainable | Easy to add new people to the work |
| | Productivity |
| | Operable: configurable and testable |
| | Simple: easy to understand and ramp up, well-documented |
| | Evolvable: easy to change |
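
The 90th and 95th percentile response times mentioned under scalability can be computed from raw measurements. Below is a minimal sketch using the nearest-rank method; the function name `percentile` and the sample latencies are made up for illustration, and real monitoring systems may use a different interpolation rule.

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample such that at least
    p percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank
    return ordered[max(rank, 1) - 1]

# Hypothetical response times in milliseconds for ten requests.
response_times_ms = [12, 15, 11, 14, 250, 13, 16, 12, 18, 900]

p50 = percentile(response_times_ms, 50)
p90 = percentile(response_times_ms, 90)
p95 = percentile(response_times_ms, 95)
mean = sum(response_times_ms) / len(response_times_ms)
print(p50, p90, p95, mean)
```

In this sample the median is 14 ms while the mean is 126.1 ms, because two slow requests dominate the average. This is one reason SLOs tend to be stated in percentiles rather than means.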

Components

| Component | Purpose | Tools |
|---|---|---|
| Databases | Source of truth | SQL |
| Cache | Temporary storage of the result of an expensive operation | Memcached |
| Full-text index | Quickly searching data by keyword/filter | ESIndex, Apache Lucene |
| Message queues | Message passing between processes | Apache Kafka |
| Stream processing | | Apache Spark, Apache Samza |
| Batch processing | Crunching large amounts of data | Apache Spark, Apache Hadoop |
| Application code | Connective tissue between other components | |
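
To make the cache row concrete, here is a minimal in-process sketch of caching the result of an expensive operation with a time-to-live. The class `TTLCache` and the function `slow_lookup` are hypothetical names for illustration; a shared cache such as Memcached lives in a separate process and is accessed over the network, which this sketch does not model.

```python
import time

class TTLCache:
    """In-process cache whose entries expire after ttl_seconds."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock      # injectable so expiry can be tested
        self._entries = {}       # key -> (expires_at, value)

    def get_or_compute(self, key, compute):
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]      # cache hit: skip the expensive call
        value = compute(key)     # cache miss or expired: recompute
        self._entries[key] = (now + self._ttl, value)
        return value

lookup_calls = []

def slow_lookup(user_id):
    """Stand-in for an expensive operation such as a database query."""
    lookup_calls.append(user_id)
    return {"id": user_id, "name": f"user-{user_id}"}

cache = TTLCache(ttl_seconds=60)
first = cache.get_or_compute(42, slow_lookup)
second = cache.get_or_compute(42, slow_lookup)  # served from the cache
print(len(lookup_calls))
```

The database remains the source of truth. The cache only holds a copy that may be stale for up to the TTL.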

Databases

Data Model

  • Relational model
  • Document-based model
      ◦ Not great for analytics
  • Graph model

Aspects to keep in mind

  • Data storage
  • Data retrieval

| | OLTP | OLAP |
|---|---|---|
| Stands for | Online Transaction Processing Database | Online Analytical Processing Database |
| Storage layout | Row-oriented | Column-oriented |
| Optimized for | Writes | Reads |
| Flexibility | High | Low |
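
The row-oriented vs column-oriented distinction can be sketched with plain Python data structures. The `orders` data and the helper `to_columns` are invented for illustration; real storage engines add compression, indexes, and on-disk layouts that are not shown here.

```python
# Row-oriented: each record is stored together (convenient for OLTP,
# where a transaction reads or writes one whole record at a time).
orders = [
    {"order_id": 1, "customer": "ana", "amount": 30},
    {"order_id": 2, "customer": "bo", "amount": 45},
    {"order_id": 3, "customer": "ana", "amount": 25},
]

def to_columns(rows):
    """Pivot a list of records into one list per column."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns

# Column-oriented: each column is stored together (convenient for OLAP,
# where a query scans a few columns across many records).
columns = to_columns(orders)

order_2 = orders[1]                    # OLTP-style: one whole record
total_amount = sum(columns["amount"])  # OLAP-style: one whole column
print(order_2, total_amount)
```

Appending a new order touches one place in the row layout but every list in the column layout, which is one intuition for why column stores favor reads over writes.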

Structure

  • Shallow dimension tables
  • Dense fact tables

Pipeline

  • ETL: Extract, Transform, Load
  • ELT: Extract, Load, Transform

Code Compatibility

  • Backward compatibility: Newer code can read data written by older code for old and new clients
  • Forward compatibility: Older code can read data written by newer code for old and new clients
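
A minimal sketch of both directions, assuming records are encoded as JSON and that version 2 of the code adds an `email` field. The function names and the field are hypothetical; schema-based formats such as Avro or Protocol Buffers handle this with explicit schema rules instead.

```python
import json

def write_v1(name):
    """Older code: writes only the fields it knows."""
    return json.dumps({"name": name})

def write_v2(name, email):
    """Newer code: adds an `email` field."""
    return json.dumps({"name": name, "email": email})

def read_v1(payload):
    """Older reader. Forward compatible because it ignores unknown fields."""
    record = json.loads(payload)
    return {"name": record["name"]}

def read_v2(payload):
    """Newer reader. Backward compatible because the new field is optional."""
    record = json.loads(payload)
    return {"name": record["name"], "email": record.get("email")}

old_data = write_v1("ana")
new_data = write_v2("bo", "bo@example.com")

print(read_v2(old_data))  # newer code reading older data
print(read_v1(new_data))  # older code reading newer data
```

The design choice that makes this work is that new fields are optional with a default, and readers never fail on fields they do not recognize.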

Rolling Upgrades

Canary

flowchart TB
ut[User Traffic] --> lb[Load Balancer]

lb --> mu[Most Users] --> cv1[Code Version 1]
lb --> fu[Few Users] --> cv2[Code Version 2]
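
The load balancer's split between most users and few users could be sketched as a deterministic hash of the user ID. The function names `bucket` and `route`, the version labels, and the 5% figure are assumptions for illustration; real load balancers typically expose this as weighted routing configuration.

```python
import hashlib

def bucket(user_id, buckets=100):
    """Map a user ID to a stable bucket in [0, buckets)."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % buckets

def route(user_id, canary_percent):
    """Send roughly canary_percent of users to the new code version."""
    if bucket(user_id) < canary_percent:
        return "code-version-2"  # few users: canary
    return "code-version-1"      # most users: current release

users = [f"user-{i}" for i in range(10_000)]
canary_users = [u for u in users if route(u, canary_percent=5) == "code-version-2"]
print(len(canary_users) / len(users))  # close to 0.05
```

Hashing the user ID, rather than picking randomly per request, keeps each user on the same version for the whole rollout. Rolling back amounts to setting `canary_percent` to 0.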

Replication

Replication helps with:

  • Tolerating machine failures
  • Lower latency for a global audience
  • Scaling to a large userbase
  • Working through offline periods/network failures

Types

  • Leader-Replica
      ◦ Writes go to leader
      ◦ Reads may/may not go to leader
      ◦ Reads go to replica
  • Multileader-Replica
  • Leaderless
      ◦ Write to all replicas
      ◦ Read from all replicas
      ◦ Eg: Amazon Dynamo, Voldemort, Cassandra
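
A minimal in-memory sketch of the leader-replica type with asynchronous replication. The classes `Node` and `LeaderReplicaStore` are hypothetical and ignore networking, failover, and ordering guarantees; the point is only to show where a stale read can come from.

```python
class Node:
    """One database node holding a key-value copy of the data."""
    def __init__(self):
        self.data = {}

class LeaderReplicaStore:
    def __init__(self, replica_count):
        self.leader = Node()
        self.replicas = [Node() for _ in range(replica_count)]
        self._pending = []  # changes written to the leader, not yet shipped

    def write(self, key, value):
        """All writes go to the leader."""
        self.leader.data[key] = value
        self._pending.append((key, value))

    def read_from_leader(self, key):
        return self.leader.data.get(key)

    def read_from_replica(self, index, key):
        """May return stale data until replicate() has run."""
        return self.replicas[index].data.get(key)

    def replicate(self):
        """Asynchronous step: apply pending changes to every replica in order."""
        for key, value in self._pending:
            for replica in self.replicas:
                replica.data[key] = value
        self._pending.clear()

store = LeaderReplicaStore(replica_count=2)
store.write("user:1", "ana")
stale = store.read_from_replica(0, "user:1")  # None: replica is lagging
store.replicate()
fresh = store.read_from_replica(0, "user:1")  # "ana" after catching up
print(stale, fresh)
```

With synchronous replication, `write` would not return until the replicas had applied the change, trading write latency for fresher reads.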

Aspects to consider

  • Synchronous vs Asynchronous replication
  • Replication lag
  • Topology
  • Durability vs availability vs latency
  • Leader Failover
  • Conflict resolution between leaders

Database Partitioning

Sharding/Splitting/Horizontal Scaling
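
One common way to split data across shards is hashing the key. The sketch below assumes string keys and a fixed shard count; `shard_for` is a hypothetical helper, and range-based partitioning or consistent hashing are alternatives it does not cover.

```python
import hashlib

def shard_for(key, shard_count):
    """Pick a shard by hashing the key. A stable hash is used because
    Python's built-in hash() of a str varies between processes."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count

SHARD_COUNT = 4
keys = [f"user-{i}" for i in range(1000)]

# Group keys by the shard that would store them.
shards = {n: [] for n in range(SHARD_COUNT)}
for key in keys:
    shards[shard_for(key, SHARD_COUNT)].append(key)

for n, members in shards.items():
    print(n, len(members))
```

A known limitation of plain modulo hashing is that changing `SHARD_COUNT` reassigns most keys, so adding a machine forces a large data move.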

Last Updated: 2025-03-31
