As the digital landscape evolves, the demand for robust question-and-answer platforms continues to rise. Understanding the underlying architecture of such platforms, like Quora, is essential for engineers and developers aiming to create scalable and efficient systems. In this exploration, we dissect the functional and non-functional requirements, building blocks, workflow, and potential limitations of a Q&A site.
Functional Requirements
For this design, we consider only the following features, while keeping the architecture open to extension in the future.
Questions and answers functionality
Commenting on questions, with upvote and downvote capabilities
Recommendation system for personalized home feeds and advertising
Ranking mechanism for answers
Non-Functional Requirements
Because the site serves a large number of users all over the world, it should have the following characteristics:
Scalability: The architecture should accommodate additional features and support a growing user base seamlessly.
Consistency: Questions and answers should look the same to every user. This does not mean that a newly added Q&A must be visible to all users right away; a short propagation delay is acceptable.
Availability: The system should be highly available. It should handle a large number of concurrent requests and keep working when a few servers fail.
Performance: The system should serve the user without noticeable delay.
Resource Estimation
Assume 300 million active users, each making 20 requests per day. With these assumptions, we estimate the following:
Number of servers
Storage size that includes database and blob storage
Network bandwidth
Servers
Assume each server can handle 8000 requests per second
Total requests per day, TOTAL_REQUEST_PER_DAY = (3 x 10^8) * 20 = 6 x 10^9
Requests per second, TOTAL_REQUEST_PER_SECOND = TOTAL_REQUEST_PER_DAY / (24 * 60 * 60) = ~69,444
Number of servers required = TOTAL_REQUEST_PER_SECOND / 8000 = ~8.7, so 9 servers
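The server estimate can be reproduced with a few lines of arithmetic. This is a minimal sketch: the variable names are illustrative, and the inputs are the assumptions stated in this section (300 million active users, 20 requests per user per day, 8000 requests per second per server).

```python
import math

ACTIVE_USERS = 300_000_000        # assumption from the estimate
REQUESTS_PER_USER_PER_DAY = 20    # assumption from the estimate
SERVER_CAPACITY_RPS = 8000        # requests per second one server can handle
SECONDS_PER_DAY = 24 * 60 * 60

total_requests_per_day = ACTIVE_USERS * REQUESTS_PER_USER_PER_DAY
total_requests_per_second = total_requests_per_day / SECONDS_PER_DAY
# Round up: a fraction of a server still needs a whole machine.
servers_required = math.ceil(total_requests_per_second / SERVER_CAPACITY_RPS)

print(f"{total_requests_per_day:,} requests/day")
print(f"{total_requests_per_second:,.0f} requests/second")
print(f"{servers_required} servers")
```

Note that this treats traffic as evenly spread over the day; sizing for peak load would need additional headroom.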
Storage
15% of Q&A will include an image
5% of Q&A will include a video
A Q&A contains an image, a video, or neither; it never contains both
Assume 1 question per day from each active user
Assume two responses to each question
Calculate the storage
Let's assume
Each image size is 250 KB
Each video size is 5 MB
Text content and metadata for a Q&A is 100 KB
Image Storage
15% of all Q&A have an image
Size = (15% of 300 million) * 250 KB = ~11 TB
Video Storage
5% of all Q&A have a video
Size = (5% of 300 million) * 5 MB = ~75 TB
Text Content
1 Q&A per active user
300 million active users in total
Each Q&A has an estimated 100 KB of textual content and metadata
Total = 300 million * 100 KB = 30 TB
Each day, we will require 11 + 75 + 30 = ~116 TB of storage
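The daily storage figure can be checked with the sketch below. It uses decimal units (1 TB = 10^12 bytes) and the sizes and percentages assumed in this section; the variable names are illustrative.

```python
KB = 1_000                 # bytes, decimal units
MB = 1_000 * KB
TB = 1_000_000 * MB

QA_PER_DAY = 300_000_000   # one Q&A per active user per day

# Integer arithmetic avoids floating-point noise in the percentages.
image_bytes = QA_PER_DAY * 15 // 100 * 250 * KB   # 15% carry a 250 KB image
video_bytes = QA_PER_DAY * 5 // 100 * 5 * MB      # 5% carry a 5 MB video
text_bytes = QA_PER_DAY * 100 * KB                # 100 KB of text and metadata each

image_tb = image_bytes / TB
video_tb = video_bytes / TB
text_tb = text_bytes / TB
total_tb = image_tb + video_tb + text_tb

print(f"images: {image_tb} TB, videos: {video_tb} TB, text: {text_tb} TB")
print(f"total per day: {total_tb} TB")
```

The exact image figure is 11.25 TB; the estimate rounds it down to 11 TB, which is why the total there is ~116 TB.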
Bandwidth
Incoming Bandwidth
We will receive 116 TB = (116 * 1000 * 8) Gb = 928,000 Gb of data per day
Bandwidth per second = 928,000 / (24 * 60 * 60) = ~11 Gb/s
Outgoing Bandwidth
Consider 300 million active users
Assume each user views 20 Q&A each day
300 million * 20 Q&A, where each Q&A has 100 KB of text and metadata
Total = ~600 TB
Consider that 15% of Q&A have an image
300 million active users
Each user fetches 20 Q&A
15% of these Q&A have an image
Each image has a size of 250 KB
Total = ~225 TB
Consider that 5% of Q&A have a video
300 million active users
Each user fetches 20 Q&A
5% of these Q&A have a video
Each video has a size of 5 MB
Total = ~1500 TB
Total data size: 600 + 225 + 1500 = 2325 TB, rounded up to ~2500 TB = 2,500,000 GB = 20,000,000 Gb
Outgoing bandwidth = 20,000,000 / (24 * 60 * 60) = ~231 Gb/s
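Both bandwidth estimates follow from the same conversion of a daily volume into an average rate. The sketch below assumes decimal units and the figures used in this section; `tb_per_day_to_gbps` is a hypothetical helper name.

```python
SECONDS_PER_DAY = 24 * 60 * 60
KB_PER_TB = 1_000_000_000

def tb_per_day_to_gbps(terabytes_per_day):
    """Convert a daily volume in TB to an average rate in Gb/s."""
    gigabits = terabytes_per_day * 1000 * 8   # TB -> GB -> Gb
    return gigabits / SECONDS_PER_DAY

USERS = 300_000_000
QA_FETCHED_PER_USER = 20
fetches = USERS * QA_FETCHED_PER_USER                  # Q&A served per day

text_tb = fetches * 100 / KB_PER_TB                    # 100 KB per Q&A
image_tb = fetches * 15 // 100 * 250 / KB_PER_TB       # 15% carry a 250 KB image
video_tb = fetches * 5 // 100 * 5_000 / KB_PER_TB      # 5% carry a 5 MB video
outgoing_tb = text_tb + image_tb + video_tb

print(f"incoming: {tb_per_day_to_gbps(116):.1f} Gb/s")
print(f"outgoing: {outgoing_tb:.0f} TB/day, {tb_per_day_to_gbps(outgoing_tb):.0f} Gb/s")
print(f"outgoing, rounded up to 2500 TB: {tb_per_day_to_gbps(2500):.0f} Gb/s")
```

Without rounding 2325 TB up to 2500 TB, the outgoing rate comes to about 215 Gb/s; the 231 Gb/s figure includes that safety margin.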
Building Blocks
Load Balancer: Distribute traffic between servers and services
Database: Store textual content in the DB
Distributed Caching: Cache frequently accessed data to reduce load on the DB and services
Blob Store: Store the images and videos
Web and Application Server
Together, these handle user requests
The web server runs the manager processes
The application server runs the worker processes
Application servers maintain in-memory queues to process different user requests
A router library sits between the web and application servers
The manager process enqueues the tasks
The worker process dequeues the tasks
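The manager and worker roles described above can be sketched with Python's standard `queue` and `threading` modules. This is only an illustration of the enqueue and dequeue pattern inside one process; the task names and the `None` sentinel are assumptions, and a real deployment would run these roles on separate servers.

```python
import queue
import threading

task_queue = queue.Queue()   # stand-in for the in-memory queue
results = []

def manager(requests):
    """Web-server side: turn each incoming request into a queued task."""
    for name, payload in requests:
        task_queue.put((name, payload))
    task_queue.put(None)     # sentinel: no more tasks

def worker():
    """Application-server side: dequeue tasks and process them in order."""
    while True:
        task = task_queue.get()
        if task is None:
            break
        name, payload = task
        results.append(f"{name}:{payload['id']}")

worker_thread = threading.Thread(target=worker)
worker_thread.start()
manager([("post_question", {"id": 1}), ("post_answer", {"id": 2})])
worker_thread.join()
print(results)
```

With a single worker, tasks are processed in the order the manager enqueued them.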
Data Stores
MySQL, a relational database, for storing Q&A and comments, as it offers a high level of consistency
HBase for storing metadata, as it offers very high throughput for storing and retrieving data. These stats are used later for recommendations
Blob storage for the images and videos
Distributed Cache
Memcached for caching critical data
Redis for upvote-type data, as it has a built-in increment operation
CDN for serving images and videos
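One common way a cache reduces database load is the cache-aside pattern: read from the cache first and fall back to the database only on a miss. The source does not say which caching pattern is used, so the sketch below is an assumption; plain dictionaries stand in for Memcached and MySQL, and `CacheAside` is a hypothetical class name.

```python
class CacheAside:
    """Cache-aside lookup: check the cache first, fall back to the database."""

    def __init__(self, database):
        self.database = database   # stand-in for MySQL
        self.cache = {}            # stand-in for Memcached
        self.db_reads = 0          # counts how often the database is hit

    def get_question(self, question_id):
        if question_id in self.cache:
            return self.cache[question_id]
        # Cache miss: read from the database and populate the cache.
        self.db_reads += 1
        value = self.database[question_id]
        self.cache[question_id] = value
        return value

store = CacheAside({42: "What is a CDN?"})
first = store.get_question(42)    # miss, reads the database
second = store.get_question(42)   # hit, served from the cache
print(store.db_reads)
```

A production cache would also need expiry and invalidation when a question is edited, which this sketch leaves out.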
Compute Servers
Used for the recommendation engine and ranking
Processing runs both online and offline
Likely to run ML workloads
May need a lot of memory and processing power
Workflow
Posting Q&A and comments
The web server receives the request and passes it to the application server
The web server also updates the web page, for example to show that the request is in progress
The worker process reads from or writes to the database, for example to fetch or save data
Tasks are prioritized using different queues
User requests are served first
The weekly digest has lower priority
Images and videos will be stored in the blob storage
The answer ranking system ranks answers based on upvotes, views, dates, and other properties, backed by an ML engine.
Extract metadata from answers, comments, images, and videos, store it in HBase, and feed it to the ML engine for offline ranking
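The task prioritization in this workflow, where user requests are served before the weekly digest, can be sketched as follows. The workflow describes separate queues per priority; for brevity this sketch swaps in a single heap-based priority queue instead. The priority values and task names are hypothetical.

```python
import heapq
import itertools

# Hypothetical priority levels: a lower number is served first.
PRIORITY_USER_REQUEST = 0
PRIORITY_WEEKLY_DIGEST = 10

class PriorityTaskQueue:
    def __init__(self):
        self._heap = []
        # Tie-breaker that keeps FIFO order within the same priority.
        self._order = itertools.count()

    def enqueue(self, priority, task):
        heapq.heappush(self._heap, (priority, next(self._order), task))

    def dequeue(self):
        _priority, _order, task = heapq.heappop(self._heap)
        return task

    def __len__(self):
        return len(self._heap)

tasks = PriorityTaskQueue()
tasks.enqueue(PRIORITY_WEEKLY_DIGEST, "send_weekly_digest")
tasks.enqueue(PRIORITY_USER_REQUEST, "save_answer")
tasks.enqueue(PRIORITY_USER_REQUEST, "save_comment")

processing_order = [tasks.dequeue() for _ in range(len(tasks))]
print(processing_order)
```

Even though the digest was enqueued first, both user requests are dequeued ahead of it.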
Recommendation System
Runs both online and offline, and is used for:
Building the user feed
Finding duplicates
Generating ads
Search
Build an index, store it in the DB, and keep frequently used entries in HBase
Build the index by tokenizing Q&A, labels, and comments
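Building a search index by tokenizing text can be illustrated with a small inverted index. This is a minimal sketch under stated assumptions: the tokenizer, the document ids, and the sample texts are all hypothetical, and the index lives in a Python dictionary rather than in the DB or HBase.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(documents):
    """Map each token to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for token in tokenize(text):
            index[token].add(doc_id)
    return index

def search(index, query):
    """Return the ids of documents that contain every token of the query."""
    tokens = tokenize(query)
    if not tokens:
        return set()
    matches = [index.get(token, set()) for token in tokens]
    return set.intersection(*matches)

documents = {
    "q1": "How does a load balancer work?",
    "a1": "A load balancer distributes traffic between servers.",
    "c1": "Great answer about caching!",
}
index = build_index(documents)
print(sorted(search(index, "load balancer")))
```

A real search service would add stemming, stop-word removal, and relevance ranking on top of this lookup.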
Limitations
Latencies of web and application servers: Communication between the web and application servers adds latency to every request.
In-Memory Queue Failure: Tasks wait in in-memory queues. If a queue fails, its pending tasks are lost and recovery requires a lot of manual engineering effort. Replicating the queue is one solution, but it requires extra memory. In addition, low-priority tasks such as view counts should not delay more important tasks such as saving answers or questions.
MySQL QPS: Since we offer a lot of services, the MySQL server may receive a very high number of queries per second. This results in high latency for query results.
HBase Latency: Although HBase has high throughput, it also has high latency. On top of that, because we rely on ML, performance will degrade at some point.
Adjustment for the mentioned limitations
Latencies of web and application servers: Use a service host, a single large, powerful machine that runs all web and application processes together in one place.
In-Memory Queue Failure: Use Kafka instead.
MySQL QPS: Use vertical sharding. If two tables are joined in queries, put them in the same shard.
HBase Latency: Use MyRocks (MySQL with the RocksDB storage engine) instead. It offers lower latency, and tools are available to transfer data between RocksDB and MySQL.
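The vertical sharding rule above, where tables that are joined together stay in the same shard, can be sketched with a small union-find grouping. The table names and join pairs are hypothetical and only illustrate the placement rule; they are not the actual schema.

```python
# Hypothetical table names and join relationships, for illustration only.
TABLES = ["questions", "answers", "comments", "users", "ads"]
JOINED = [("questions", "answers"), ("answers", "comments")]

def assign_shards(tables, joined_pairs):
    """Group tables so that tables joined together share a shard."""
    parent = {table: table for table in tables}

    def find(table):
        # Follow parent links to the group's root, shortening the path as we go.
        while parent[table] != table:
            parent[table] = parent[parent[table]]
            table = parent[table]
        return table

    for left, right in joined_pairs:
        parent[find(left)] = find(right)

    shard_ids = {}
    assignment = {}
    for table in tables:
        root = find(table)
        shard_ids.setdefault(root, len(shard_ids))
        assignment[table] = shard_ids[root]
    return assignment

shards = assign_shards(TABLES, JOINED)
print(shards)
```

Here questions, answers, and comments land in one shard because they are joined, while users and ads each get their own.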
Conclusion
Understanding the intricate system design of a Q/A site like Quora offers invaluable insights into creating scalable, efficient, and reliable platforms in the digital age. By addressing functional and non-functional requirements, designing robust building blocks, establishing efficient workflows, and implementing mitigation strategies for potential limitations, developers can craft high-performance systems capable of meeting the demands of millions of users worldwide.