A 1 GB file is generated every 24 hours on one of the machines inside a Google data center. This file needs to be copied, exactly once, to all machines across all data centers. Design this system.
Google has millions of machines and data centers across the globe, so it is impossible to use this one machine as a single centralized point of file sharing. The goal is to design a system architecture and flow that can handle such huge files. Design interviews test your ability to analyze and solve big problems quickly. They are hard, but not impossible, especially if you prepare carefully.
Important Design Considerations:
The most important part is to gather requirements, reduce problem ambiguity, and explore all possible hidden constraints.
What clarifying questions can we ask about the original problem?
- Can we compress that file?
- Can we connect any pair of machines?
- Can we be fault-tolerant to network failures?
- How much time will it take to copy that file?
- What are the different types of failure events that can occur?
- How to optimize sync time?
- Not all nodes are equal in compute/storage/network characteristics and current load (a node could be running heavy CPU/GPU tasks).
Capacity Estimation and Constraints
- Let’s assume one data center has on average 100k server nodes, all connected via a (single, local) flat network topology where any node can reach any other node within the same DC.
- Let’s assume each node has roughly 400 Mbit/s (~40 MiB/s) of combined upload+download bandwidth.
If we serve the file from a single machine, a single download takes about 1 GiB / 40 MiB/s ≈ 25 s.
Therefore, distributing it sequentially to 100k machines would take roughly 100,000 × 25 s ≈ 29 days!
Even if we assume the file is some sort of log and compresses to half its size, that still leaves us with 12–15 days of synchronization. So let’s set compression aside for now.
Apparently, synchronization time is the key problem of this design. The good news is that it scales roughly as O(N/M), where N is the number of servers and M is the number of file-serving peers (M = 1 in our back-of-the-napkin math above).
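The estimates above can be checked with a quick script (the bandwidth figure and node count are the assumptions from this section):

```python
# Assumed figures from the capacity estimate above.
FILE_SIZE_MIB = 1024      # 1 GiB file
BANDWIDTH_MIB_S = 40      # ~400 Mbit/s combined up/down per node (assumption)
NODES = 100_000           # servers in one data center

single_download_s = FILE_SIZE_MIB / BANDWIDTH_MIB_S   # 25.6 s per download

def total_sync_days(n_nodes, n_seeders):
    """Naive O(N/M) model: M seeders serving N nodes sequentially."""
    return (n_nodes / n_seeders) * single_download_s / 86_400

# ~29.6 days with a single seeder; under 8 hours with a fixed M = 100.
```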
By the end of this section, we have the answer to:
- (4) How much time will it take to copy that file? About 25 s per download, and O(N/M) for the total sync.
Obviously, this problem would reveal a whole new level of complexity if we changed the initial requirement from 1 GB to 100 GB, but that could be a bonus question.
Approaches and trade-offs
Before moving forward, let’s answer a few questions to set our problem-solving direction:
- Should we divide the file into smaller pieces and serve it to all the machines?
- Should we consider using a peer-to-peer approach to address O(N/M) ?
- Should we just increase file servers M to a fixed number like 100 in O(N/M) ?
- How do we coordinate 100k machines without gridlock and without exhausting all network bandwidth?
Let’s start by exploring the p2p file-sharing idea. In a simplified case, building it requires three things: file seeders, file downloaders, and a seeder index for coordination. The index also needs to be centralized. Seeder (FS) and leecher (FL) nodes are coordinated via the indexer. The indexer itself needs to be stateful, linearly scalable, and fault-tolerant, which could be an actual practical challenge.
The p2p coordination flow is the following.
- FS nodes register and deregister with the indexer
- An FL polls the indexer and connects to any available seeder to download the file
- Once the download finishes, the FL registers with the indexer and becomes an FS itself
These steps repeat until the file has eventually been shared with all nodes.
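The flow above can be sketched as a minimal in-memory indexer (class and method names here are illustrative, not from any real system; a production indexer would be a replicated service):

```python
import random

class Indexer:
    def __init__(self):
        self.seeders = set()      # nodes currently able to serve the file

    def register(self, node):     # an FL that finished downloading becomes FS
        self.seeders.add(node)

    def deregister(self, node):   # FS leaves (shutdown, overload, ...)
        self.seeders.discard(node)

    def pick_seeder(self):        # an FL polls for any available seeder
        return random.choice(tuple(self.seeders)) if self.seeders else None

idx = Indexer()
idx.register("node-0")            # the origin machine seeds first
seeder = idx.pick_seeder()        # a leecher picks it, downloads...
idx.register("node-1")            # ...and registers as a seeder itself
```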
How do we address the cold-start problem, when potentially all 100k machines try to connect to the same seeder?
To address that, we either need a distributed lock (which can easily be implemented on top of KV storage) as an additional synchronization mechanism, or we can handle it in the indexer implementation.
Do we know a classic data structure that could help us represent a p2p indexer?
A naive approach is KV storage or a hash map implementing a lock on a node while it’s downloading. The trade-off of this approach is that a client has to retry connection attempts to a seeder and constantly poll this lock state from the indexer.
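A toy version of that KV-based lock, using an in-memory dict as a stand-in for a real KV store (in practice the atomic compare-and-set would come from something like etcd or ZooKeeper; all names here are illustrative):

```python
class KVLock:
    """Toy lock table; a real deployment needs atomic CAS in the KV store."""
    def __init__(self):
        self.locks = {}                      # seeder -> leecher holding it

    def try_acquire(self, seeder, leecher):
        if seeder in self.locks:
            return False                     # busy: caller must poll and retry
        self.locks[seeder] = leecher
        return True

    def release(self, seeder):
        self.locks.pop(seeder, None)

kv = KVLock()
assert kv.try_acquire("seeder-A", "leecher-1")       # first leecher wins
assert not kv.try_acquire("seeder-A", "leecher-2")   # second has to retry
kv.release("seeder-A")
```

The repeated `try_acquire` failures are exactly the polling overhead mentioned above.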
Another option is to use a stack/queue to store available seeders. After each download, both the FS and the FL enqueue themselves, which speeds up file distribution.
My favorite approach is a distributed queue, for the following reasons. A seeder enqueues the coordinates of its local file-server connection (ip/port) into the queue and waits for a download. An FL just needs to pick up a message and connect to that seeder (which grants it exclusive access to that seeder). Once the download is complete, we have two freshly enqueued seeders: the original seeder and the just-converted leecher.
- The queue is super easy to scale and partition
- This approach is simple and elegant using a classic well-known concept.
- On average, full synchronization only requires O(log N) rounds
- Adaptive load distribution — faster seeders will share faster
- Having 100k messages in the queue indicates that the sync has finished.
- Since at peak we need to handle up to 100k rps, we could spin up 10 queue partitions × 10k rps each, which is at the low end of benchmarks for any queue implementation.
Note: the simple solution is always the better one. That said, there are trade-offs and failure modes to handle:
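To sanity-check the O(log N) claim, here is a simplified single-process simulation of the queue flow, assuming one logical queue and uniform download times (function and node names are illustrative):

```python
from collections import deque
from math import ceil, log2

def sync_rounds(n_nodes):
    """Each round, every queued seeder serves one leecher, then both
    re-enqueue, so the seeder count doubles every round."""
    queue = deque(["origin"])              # seeder coordinates (ip/port)
    pending = n_nodes - 1                  # leechers still without the file
    rounds = 0
    while pending > 0:
        served = min(len(queue), pending)  # each seeder serves one leecher
        for i in range(served):
            seeder = queue.popleft()
            queue.append(seeder)                 # seeder re-enqueues itself
            queue.append(f"new-{rounds}-{i}")    # leecher becomes a seeder
        pending -= served
        rounds += 1
    return rounds

assert sync_rounds(100_000) == ceil(log2(100_000))  # 17 rounds for 100k nodes
```

At ~25 s per round, 17 rounds is on the order of minutes rather than weeks, which is the whole point of the p2p fan-out.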
- The worst-case scenario degrades to O(N) when only one node can seed at a time.
- A slow FL can be a bottleneck for fast distribution (the node could be in the middle of a long-running computational task). As an option, we can exclude such nodes from seeding.
- If a download fails, the seeder needs to re-register in the queue, which could turn into an infinite loop in combination with the same bad FL. This can be handled by storing an additional retry counter / download status globally.
- At-least-once delivery is a common pitfall of queues: the same message could be delivered to two consumers, which leads to a race condition. To handle that, the seeder can accept only a single TCP connection at a time; any connection-refused errors should simply be ignored by the FL, which then picks another message.
- Handle dead seeder nodes in the queue. Similar behavior to (4) from the FL’s perspective.
- Handle dead FLs in progress measurement. This can be resolved by tracking them via additional state.
- What if one of the queue partitions goes down? This can be handled by the partitioning logic/implementation, e.g. consistent hashing on unique node attributes (IP/MAC address, hardware id, etc.).
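The partition-failover idea can be sketched as follows. This is a simplified stand-in for real consistent hashing (which would use a hash ring to minimize key movement on failover), with illustrative names:

```python
import hashlib

PARTITIONS = [f"queue-{i}" for i in range(10)]   # 10 partitions x 10k rps

def partition_for(node_id, live=None):
    """Deterministically map a stable node attribute (e.g. MAC address)
    to one of the currently live queue partitions."""
    live = sorted(live if live is not None else PARTITIONS)
    digest = hashlib.sha256(node_id.encode()).hexdigest()
    return live[int(digest, 16) % len(live)]

p = partition_for("aa:bb:cc:dd:ee:ff")            # pick by MAC address
survivors = [q for q in PARTITIONS if q != p]     # that partition dies...
assert partition_for("aa:bb:cc:dd:ee:ff", live=survivors) != p
```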
Assuming 50 data centers across different regions: this task can actually be simplified if we assume there is a DC gateway through which we can push the file to a single machine in each data center. Thus the problem reduces to the single-DC version we already deconstructed.
- What if the network is not flat?
- What if there is ~ 2% rate of connection errors between nodes due to multi-tenancy issues?
- How to monitor the process?
Enjoyed this article or have a better solution? Let me know in the comments.