- Fault-tolerance
- No unauthorized access
- Chaos testing
- Robust to full machine failures
- Bug-free, automated bug tests
- Environments: Dev, Staging/Testing, Prod
- Quick roll-backs
Scalable
- Handle high traffic volume
- Traffic load: peak # of reads, writes, simultaneous users
- Capacity planning
- Response time vs throughput
- End-user response time: 90th, 95th percentile
- SLO/SLA: service level objectives/agreements
- Vertical scaling: scaling up (more powerful machine)
- Horizontal scaling: scaling out (distributed across smaller machines)
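
The response-time percentiles mentioned above can be computed from raw measurements. A minimal sketch, assuming the nearest-rank definition of a percentile; the `percentile` helper and the sample latencies are made up for illustration:

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample such that at least
    p percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("samples must be non-empty")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


# Hypothetical end-user response times in milliseconds.
response_times_ms = [12, 15, 11, 250, 14, 13, 16, 900, 12, 15]

p50 = percentile(response_times_ms, 50)
p90 = percentile(response_times_ms, 90)
p95 = percentile(response_times_ms, 95)
print(p50, p90, p95)
```

The median here stays low while the 90th and 95th percentiles are dominated by the two slow requests, which is why SLOs are usually stated on high percentiles rather than on the average.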
Maintainable
- Add new people to work
- Productivity
- Operable: configurable and testable
- Simple: easy to understand and ramp up, well-documented
- Evolvable: easy to change
Types
- Leader-Replica
  - Writes go to leader
  - Reads may/may not go to leader
  - Reads go to replica
- Multileader-Replica
- Leaderless
  - Write to all replicas
  - Read from all replicas
  - Eg: Amazon Dynamo, Voldemort, Cassandra
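
The leaderless case can be sketched as a toy in-memory store. This is an illustration under assumptions, not how Dynamo, Voldemort, or Cassandra are implemented: the class names, the quorum sizes `w` and `r`, and the caller-supplied version numbers are all hypothetical.

```python
class Replica:
    """One in-memory replica holding key -> (version, value)."""

    def __init__(self):
        self.up = True
        self.store = {}

    def write(self, key, version, value):
        if not self.up:
            return False  # no acknowledgement from a failed node
        current = self.store.get(key)
        if current is None or version > current[0]:
            self.store[key] = (version, value)
        return True

    def read(self, key):
        if not self.up:
            return None
        return self.store.get(key, (0, None))


class LeaderlessStore:
    """Sends every write and read to all replicas (assumed quorum rule: w + r > n)."""

    def __init__(self, n=3, w=2, r=2):
        if w + r <= n:
            raise ValueError("w + r must exceed n for quorums to overlap")
        self.replicas = [Replica() for _ in range(n)]
        self.w, self.r = w, r

    def put(self, key, value, version):
        # Write to all replicas; succeed once at least w acknowledge.
        acks = sum(rep.write(key, version, value) for rep in self.replicas)
        return acks >= self.w

    def get(self, key):
        # Read from all replicas; keep the value with the highest version.
        responses = [rep.read(key) for rep in self.replicas]
        responses = [resp for resp in responses if resp is not None]
        if len(responses) < self.r:
            raise RuntimeError("not enough replicas responded")
        return max(responses, key=lambda pair: pair[0])[1]


store = LeaderlessStore()
store.put("x", "a", version=1)
store.replicas[0].up = False           # one node fails
ok = store.put("x", "b", version=2)    # still succeeds with 2 of 3 acks
store.replicas[0].up = True            # node returns holding stale version 1
print(ok, store.get("x"))
```

The read still returns the newer value because at least one of the replicas it contacts saw the latest write; the stale replica is simply outvoted by version number.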
Aspects to consider
- Synchronous vs asynchronous replication
- Replication lag
- Topology
- Durability vs availability vs latency
- Leader failover
- Conflict resolution between leaders
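
Synchronous vs asynchronous replication and the resulting replication lag can be illustrated with a toy leader and two followers. This is a sketch only; the `Leader` and `Follower` classes and the pending queue are assumptions made for illustration, with no networking or failure handling.

```python
from collections import deque


class Follower:
    def __init__(self):
        self.data = {}
        self.pending = deque()  # changes received but not yet applied

    def apply_pending(self):
        # Catch up on everything the leader has sent so far.
        while self.pending:
            key, value = self.pending.popleft()
            self.data[key] = value


class Leader:
    def __init__(self, sync_follower, async_follower):
        self.data = {}
        self.sync_follower = sync_follower
        self.async_follower = async_follower

    def write(self, key, value):
        self.data[key] = value
        # Synchronous: the follower applies the change before we acknowledge.
        self.sync_follower.data[key] = value
        # Asynchronous: only enqueue; the follower applies it some time later.
        self.async_follower.pending.append((key, value))
        return "ok"


sync_f, async_f = Follower(), Follower()
leader = Leader(sync_f, async_f)
leader.write("user:1", "alice")

stale = async_f.data.get("user:1")   # read during the lag window
async_f.apply_pending()              # follower catches up
fresh = async_f.data.get("user:1")
print(sync_f.data.get("user:1"), stale, fresh)
```

The synchronous follower is always current but adds its latency to every write; the asynchronous follower keeps writes fast at the cost of a window in which reads from it are stale.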