Exploring the System Design of a Q&A Site

As the digital landscape evolves, the demand for robust question-and-answer platforms continues to rise. Understanding the underlying architecture of such platforms, like Quora, is essential for engineers and developers aiming to create scalable and efficient systems. In this exploration, we dissect the functional and non-functional requirements, building blocks, workflow, and potential limitations of a Q&A site.

Functional Requirements

When designing a Q&A site, we limit the scope to the following features, while making sure the design can be extended with more features in the future.

Questions and answers functionality
Commenting on questions and answers, along with upvote and downvote capabilities
Recommendation system for personalized home feeds and advertising
Ranking mechanism for answers

Non-Functional Requirements

Because of the large number of users all over the world, the site should have the following characteristics:

Scalability: The architecture should accommodate additional features and support a growing user base seamlessly.
Consistency: The questions and answers should be consistent across all users. This does not imply that a newly added Q&A must be visible to every user right away.
Availability: The system should be highly available. It should handle a large number of concurrent requests and keep working when a few servers fail.
Performance: The system should serve the user without noticeable delay.

Resource Estimation

Consider 300 million active users, each making 20 requests per day. With these assumptions, we are going to calculate the following:

Number of servers
Storage size that includes database and blob storage
Network bandwidth

Servers

Each server can handle 8000 requests per second
Total requests per day, TOTAL_REQUEST_PER_DAY = (3 x 10^8) * 20 = 6 x 10^9
Requests per second, TOTAL_REQUEST_PER_SECOND = TOTAL_REQUEST_PER_DAY / (24 * 60 * 60) = ~69,444
Number of servers required = TOTAL_REQUEST_PER_SECOND / 8000 = ~9
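The server estimate can be reproduced with a few lines of Python. This is a minimal sketch using only the assumptions stated in this section (300 million active users, 20 requests per user per day, 8000 requests per second per server); the variable names are illustrative, and the calculation averages traffic evenly over the day, so it ignores peak load.

```python
import math

ACTIVE_USERS = 300_000_000          # 300 million active users (assumption from the text)
REQUESTS_PER_USER_PER_DAY = 20      # assumption from the text
SERVER_CAPACITY_RPS = 8_000         # requests per second a single server can handle
SECONDS_PER_DAY = 24 * 60 * 60

total_requests_per_day = ACTIVE_USERS * REQUESTS_PER_USER_PER_DAY
total_requests_per_second = total_requests_per_day / SECONDS_PER_DAY

# Round up: a fraction of a server still means one more machine.
servers_required = math.ceil(total_requests_per_second / SERVER_CAPACITY_RPS)

print(f"{total_requests_per_day:,} requests/day")
print(f"{total_requests_per_second:,.0f} requests/s")
print(f"{servers_required} servers")
```

Because the figure is a daily average, a real deployment would likely provision more servers than this to absorb traffic spikes and server failures.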

Storage

15% of Q&As will have an image
5% of Q&As will have a video
A Q&A contains an image, a video, or neither. A single Q&A cannot contain both an image and a video
Consider 1 question from each active user per day
Two responses for each question

Calculate the storage

Let's assume

Each image size is 250 KB
Each video size is 5 MB
Text content and metadata for a Q&A is 100 KB

Image Storage

15% of all Q&As have an image
Size: (15% of 300 million) * 250 KB = ~11 TB

Video Storage

5% of all Q&As have a video
Size: (5% of 300 million) * 5 MB = ~75 TB

Text Content

1 Q&A per active user
300 million active users in total
Each Q&A has an estimated 100 KB of textual content and metadata
Total: 300 million * 100 KB = 30 TB

Each day, we will require 11 + 75 + 30 = ~116 TB of storage
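The daily storage figure can be checked with a short calculation. This sketch assumes decimal units (1 KB = 10^3 bytes, 1 TB = 10^12 bytes) and 300 million Q&As per day, as in the estimates above; the constant names are illustrative.

```python
KB = 10**3
MB = 10**6
TB = 10**12

QA_PER_DAY = 300_000_000            # one Q&A per active user per day (assumption from the text)

image_count = QA_PER_DAY * 15 // 100   # 15% of Q&As carry an image
video_count = QA_PER_DAY * 5 // 100    # 5% of Q&As carry a video

image_tb = image_count * 250 * KB / TB   # 250 KB per image
video_tb = video_count * 5 * MB / TB     # 5 MB per video
text_tb = QA_PER_DAY * 100 * KB / TB     # 100 KB of text and metadata per Q&A

total_tb = image_tb + video_tb + text_tb

print(f"images: {image_tb} TB, videos: {video_tb} TB, text: {text_tb} TB")
print(f"total per day: {total_tb} TB")
```

The exact total is 116.25 TB; the text rounds the image share down to 11 TB, which gives the ~116 TB used in the bandwidth estimate.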

Bandwidth

Incoming Bandwidth

We will receive 116 TB = (116 * 1000 * 8) Gb = 928,000 Gb of data per day
Bandwidth per second: 928,000 Gb / (24 hours in seconds) = ~11 Gb/s
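The unit conversion is the easy place to slip here, so it helps to spell it out: 1 TB is 1000 GB, and 1 byte is 8 bits. A minimal sketch of the incoming bandwidth calculation, assuming the ~116 TB of daily uploads estimated in the storage section:

```python
DAILY_UPLOAD_TB = 116               # daily storage estimate from the previous section
GB_PER_TB = 1000
BITS_PER_BYTE = 8
SECONDS_PER_DAY = 24 * 60 * 60

# Terabytes -> gigabytes -> gigabits
daily_upload_gigabits = DAILY_UPLOAD_TB * GB_PER_TB * BITS_PER_BYTE

incoming_gbps = daily_upload_gigabits / SECONDS_PER_DAY

print(f"{daily_upload_gigabits:,} Gb per day")
print(f"{incoming_gbps:.1f} Gb/s incoming")
```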

Outgoing Bandwidth

Consider 300 million active users
Consider that each user views 20 Q&As each day
- 300 million * 20 Q&As, each with 100 KB of text and metadata
- ~600 TB
Consider that 15% of Q&As have an image
- 300 million active users
- Each user fetches 20 Q&As
- 15% of these Q&As have an image
- Each image has a size of 250 KB
- ~225 TB
Consider that 5% of Q&As have a video
- 300 million active users
- Each user fetches 20 Q&As
- 5% of these Q&As have a video
- Each video has a size of 5 MB
- ~1500 TB
Total size of data: 600 + 225 + 1500 = 2325 TB, rounded up to ~2500 TB = 2,500,000 GB = 20,000,000 Gb
Outgoing bandwidth: 20,000,000 Gb / (24 hours in seconds) = ~231 Gb/s
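The outgoing estimate can be reproduced the same way. The sketch below uses the assumptions from this section (300 million users, 20 Q&As viewed per user per day, decimal units) and shows both the exact figure and the figure obtained after rounding the daily total up to 2500 TB, as the text does.

```python
KB = 10**3
MB = 10**6
TB = 10**12
SECONDS_PER_DAY = 24 * 60 * 60

ACTIVE_USERS = 300_000_000
QA_VIEWED_PER_USER = 20

views = ACTIVE_USERS * QA_VIEWED_PER_USER      # Q&As served per day

text_tb = views * 100 * KB // TB               # every Q&A: 100 KB of text and metadata
image_tb = (views * 15 // 100) * 250 * KB // TB  # 15% of Q&As: one 250 KB image
video_tb = (views * 5 // 100) * 5 * MB // TB     # 5% of Q&As: one 5 MB video

total_tb = text_tb + image_tb + video_tb

def tb_per_day_to_gbps(tb):
    """Convert terabytes per day to gigabits per second."""
    return tb * 1000 * 8 / SECONDS_PER_DAY

exact_gbps = tb_per_day_to_gbps(total_tb)
rounded_gbps = tb_per_day_to_gbps(2500)        # the text rounds 2325 TB up to 2500 TB

print(f"text {text_tb} TB + images {image_tb} TB + videos {video_tb} TB = {total_tb} TB")
print(f"exact: {exact_gbps:.0f} Gb/s, rounded up: {rounded_gbps:.0f} Gb/s")
```

Without the rounding step the result is about 215 Gb/s; rounding up to 2500 TB adds some headroom and yields the ~231 Gb/s quoted above.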

Building Blocks

Load Balancer: Distribute traffic between servers and services
Database: Store textual content in the DB
Distributed Caching: Schedule tasks and reduce the load on the database and services
Blob Store: Store the images and videos

Web and Application Server

To handle requests
- The web server runs the manager processes
- The application server runs the worker processes
Application servers maintain an in-memory queue to process different user requests
A router library sits between the web and application servers
The manager process enqueues the tasks
The worker process dequeues the tasks
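The manager/worker hand-off described above can be sketched with Python's standard `queue` and `threading` modules. This is only an illustration under simplifying assumptions: one worker, one in-process queue, a plain list standing in for the database, and no router library. The function names `manager` and `worker` are hypothetical.

```python
import queue
import threading

task_queue = queue.Queue()   # in-memory queue held by the application server
results = []                 # stands in for the database in this sketch

def manager(requests):
    """Web-server side: enqueue each incoming request as a task."""
    for request in requests:
        task_queue.put(request)

def worker():
    """Application-server side: dequeue tasks until a None sentinel arrives."""
    while True:
        task = task_queue.get()
        if task is None:
            task_queue.task_done()
            break
        results.append(f"processed {task}")
        task_queue.task_done()

worker_thread = threading.Thread(target=worker)
worker_thread.start()

manager(["post question", "post answer", "add comment"])
task_queue.put(None)         # tell the worker to stop
task_queue.join()            # wait until every task has been handled
worker_thread.join()

print(results)
```

With a single worker the tasks are processed in the order they were enqueued. Since the queue lives in memory, its contents are lost if the process dies, which is the weakness discussed under Limitations.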

Data Stores

MySQL, a relational database, for storing Q&As and comments, as it offers a high level of consistency
HBase to store metadata and statistics, as it has very high throughput for storing and retrieving data. These stats can be used for recommendations later
Blob storage for the images and videos

Distributed Cache

Memcached for caching critical data
Redis for upvote-type data, as it has a built-in atomic increment
CDN for serving images and videos
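To show why an atomic increment matters for vote counts, here is a small in-process stand-in. It is not Redis: the `VoteCounter` class and the key name are hypothetical, and a lock plays the role that Redis's `INCR` command plays on the server side. The point is that concurrent upvotes must not lose updates.

```python
import threading
from collections import defaultdict

class VoteCounter:
    """In-process stand-in for a counter store with atomic increment."""

    def __init__(self):
        self._counts = defaultdict(int)
        self._lock = threading.Lock()

    def incr(self, key, amount=1):
        # The lock makes read-modify-write a single atomic step.
        with self._lock:
            self._counts[key] += amount
            return self._counts[key]

    def get(self, key):
        with self._lock:
            return self._counts[key]

counter = VoteCounter()

def upvote_many(times):
    for _ in range(times):
        counter.incr("answer:42:upvotes")   # hypothetical key naming scheme

# Four concurrent clients, each upvoting the same answer 1000 times.
threads = [threading.Thread(target=upvote_many, args=(1000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(counter.get("answer:42:upvotes"))
```

If the application instead read the count, added one, and wrote it back in separate steps, two concurrent upvotes could overwrite each other; a server-side increment avoids that round trip.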

Compute Servers

Used for the recommendation engine and answer ranking
The processing will run both online and offline
Probably running some ML operations
May need lots of memory and high processing power

Workflow

Posting Q&A and comments

The web server receives the request and passes it to the application server
The web server also updates the web page, for example to show that the request is in progress
The worker process will manipulate the database, e.g. fetch or save data
Tasks will be prioritized using different queues
- User requests will be served earlier
- The weekly digest will have less priority
Images and videos will be stored in the blob storage
The answer ranking system will rank the answers based on upvotes, views, dates, and some other properties. An ML engine will back it up.
Extract and store metadata of answers, comments, images, and videos in HBase, and feed this metadata to the ML engine for offline ranking
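The prioritization step in this workflow (user requests first, weekly digest later) can be sketched with one queue per priority class. The queue names and the `enqueue`/`dequeue` helpers below are hypothetical; this illustrates the idea and is not a description of any particular platform's scheduler.

```python
from collections import deque

# One FIFO queue per task class, checked in priority order.
PRIORITY_ORDER = ["user_request", "weekly_digest"]
queues = {kind: deque() for kind in PRIORITY_ORDER}

def enqueue(kind, task):
    """Add a task to the queue for its class."""
    queues[kind].append(task)

def dequeue():
    """Return the next task, always draining higher-priority queues first."""
    for kind in PRIORITY_ORDER:
        if queues[kind]:
            return queues[kind].popleft()
    return None   # nothing left to do

enqueue("weekly_digest", "send weekly digest")
enqueue("user_request", "save answer")
enqueue("user_request", "save comment")

order = [dequeue() for _ in range(3)]
print(order)
```

Even though the digest task was enqueued first, both user requests are served before it. A strict scheme like this can starve low-priority work under constant load, so a real system would probably reserve some capacity for the lower-priority queue.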

Recommendation System

Runs both online and offline, and is used to
- Build the user feed
- Find duplicate questions
- Generate ads

Search

Build an index, store it in the DB, and keep the frequently used entries in HBase
Build the index by tokenizing Q&As, labels, and comments
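Building an index by tokenizing content can be illustrated with a small inverted index: each token maps to the set of documents that contain it. This is a minimal sketch with an in-memory dictionary and a naive tokenizer; the document IDs and sample texts are made up for the example.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(documents):
    """Map each token to the set of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for token in tokenize(text):
            index[token].add(doc_id)
    return index

def search(index, query):
    """Return the IDs of documents containing every token in the query."""
    tokens = tokenize(query)
    if not tokens:
        return set()
    result = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        result &= index.get(token, set())
    return result

# Hypothetical sample content: one question, one answer, one comment.
documents = {
    "q1": "How does a load balancer work?",
    "a1": "A load balancer distributes traffic between servers.",
    "c1": "Great answer about traffic!",
}

index = build_index(documents)
print(sorted(search(index, "load balancer")))   # ['a1', 'q1']
print(sorted(search(index, "traffic")))         # ['a1', 'c1']
```

A production search index would add stemming, stop-word removal, and relevance scoring, but the token-to-documents mapping stays the core structure.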

Limitations

Latencies of web and application servers: Communication between the web and application servers adds latency to every request.

In Memory Queue Failure: Tasks wait in an in-memory queue. If a queue fails, recovering the lost tasks requires a lot of manual engineering. Replicating the queue can be a solution, but it requires extra memory. In addition, tasks like updating view counts should not hold up comparatively more important tasks like saving answers or questions.

MySQL QPS: Since we offer a lot of services, the MySQL server can receive a very large number of queries per second. This results in high latency when fetching query results.

HBase Latency: Although HBase has high throughput, its latency is high. On top of that, since we rely on ML, this latency will become a performance bottleneck at some point.

Adjustment for the mentioned limitations

Latencies of web and application servers: Use a service host, a large, powerful machine that runs all web and application processes together in a single place.

In Memory Queue Failure: Use Kafka instead.

MySQL QPS: Use vertical sharding. If there are joins involved between two tables, put them in the same shard.

HBase Latency: Use MyRocks instead. It offers improved latency and comes with tools for transferring data between RocksDB and MySQL.
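The vertical sharding adjustment for MySQL can be pictured as a mapping from tables to shards, where tables that are joined together are placed as a group. The table names, groups, and shard labels below are hypothetical and only serve to illustrate the "joined tables share a shard" rule.

```python
# Hypothetical groups: tables that are joined with each other stay together.
JOIN_GROUPS = [
    {"questions", "answers"},   # answers are joined with their question
    {"comments"},
    {"users"},
]

def assign_shards(join_groups):
    """Give every group of tables its own shard and return table -> shard."""
    table_to_shard = {}
    for shard_id, group in enumerate(join_groups):
        for table in group:
            table_to_shard[table] = f"shard-{shard_id}"
    return table_to_shard

def can_join_locally(table_to_shard, table_a, table_b):
    """A join is cheap only when both tables live on the same shard."""
    return table_to_shard[table_a] == table_to_shard[table_b]

shard_of = assign_shards(JOIN_GROUPS)

print(shard_of["questions"], shard_of["answers"])
print(can_join_locally(shard_of, "questions", "answers"))   # True
print(can_join_locally(shard_of, "questions", "comments"))  # False
```

Splitting by table spreads the query load across several database servers, while keeping joined tables together avoids cross-shard joins.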

Conclusion

Understanding the system design of a Q&A site like Quora offers useful insight into building scalable, efficient, and reliable platforms. By addressing functional and non-functional requirements, choosing robust building blocks, establishing efficient workflows, and mitigating the known limitations, developers can build high-performance systems capable of meeting the demands of millions of users worldwide.

Shams Nahid's Blog