Deep Dive Inside a Distributed Cache Engine
Core components to consider when designing a distributed, scalable, and fault-tolerant cache system.
Caching data helps improve application performance. If the caching service fails, it puts an extreme load on the database, resulting in poor performance and, in the worst case, a crash of the service. When designing a caching service, we should aim for low latency at the lowest possible cost. Depending on the scenario and application requirements, we should choose an appropriate and affordable caching mechanism. A caching mechanism should offer the following (sketched as an interface right after the list):
- Store data
- Define how data will be stored
- Remove/invalidate data
- Define how data will be replaced
- Get data
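As a rough sketch, such a contract could look like the following. This is a minimal Python interface; the class and method names are illustrative, not tied to any particular library:

```python
from abc import ABC, abstractmethod
from typing import Any, Optional

class Cache(ABC):
    """Illustrative contract for a cache store (names are hypothetical)."""

    @abstractmethod
    def get(self, key: str) -> Optional[Any]:
        """Fetch data: return the cached value, or None on a miss."""

    @abstractmethod
    def put(self, key: str, value: Any) -> None:
        """Store data, per the configured storage strategy."""

    @abstractmethod
    def invalidate(self, key: str) -> None:
        """Remove/invalidate an entry."""

    @abstractmethod
    def evict_one(self) -> None:
        """Replace data according to the eviction policy (e.g. LRU)."""
```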
Caching Advantages
- Minimize the computations for complex queries in the database layer
- Minimize network calls
  - API caching can reduce `API Gateway -> Application Service` calls
  - Database caching can reduce `Application Service -> Database` queries
- Storing users' session data. Examples include:
  - Storing users' carts in an e-commerce application
  - Storing the positions of riders or drivers in a ride-sharing app
Types of Caching
Read Through: The application first goes to the cache store to fetch the data. If the data exists in the cache store, it is returned to the application. When the data does not exist in the cache store, the cache itself fetches the data from the database, after which:
- It can save the data in the cache and then return it to the application
- It can first return the data to the application and then save it in the cache store (better)
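A minimal read-through sketch, assuming an in-memory dict as the store and a caller-supplied `load_from_db` function standing in for the real database access:

```python
from typing import Any, Callable, Dict

class ReadThroughCache:
    """On a miss, the cache itself fetches the data from the database."""

    def __init__(self, load_from_db: Callable[[str], Any]):
        self._store: Dict[str, Any] = {}
        self._load_from_db = load_from_db  # hypothetical database loader

    def get(self, key: str) -> Any:
        if key in self._store:           # hit: serve from the cache
            return self._store[key]
        value = self._load_from_db(key)  # miss: the cache loads the data
        self._store[key] = value         # save it, then return it
        # (the "better" variant above would return first, saving asynchronously)
        return value
```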
Write Through: Storing data in the database can be handled by either the cache or the application. When persistence is handled by the cache, the application first writes data to the cache store, and then the cache store writes the data to the database. This can be time-consuming, because we need to verify that both the cache and the database have persisted the data synchronously. If the cache goes down before the data is persisted in the database, the data can be lost.
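A write-through sketch under similar assumptions; the `save_to_db` callback is a hypothetical stand-in for the real database writer:

```python
from typing import Any, Callable, Dict

class WriteThroughCache:
    """The application writes to the cache; the cache persists to the DB."""

    def __init__(self, save_to_db: Callable[[str, Any], None]):
        self._store: Dict[str, Any] = {}
        self._save_to_db = save_to_db  # hypothetical database writer

    def put(self, key: str, value: Any) -> None:
        self._store[key] = value          # write to the cache store first
        try:
            self._save_to_db(key, value)  # then persist synchronously to the DB
        except Exception:
            # keep cache and database consistent; a real system would
            # restore the previous value instead of just dropping the key
            self._store.pop(key, None)
            raise
```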
Cache Aside: With this approach, the application itself handles persisting data to both the cache and the database. In this case, a failure of the caching mechanism will not lose any data.
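A cache-aside sketch, where the application code coordinates both stores itself (plain dicts stand in for the cache and the database):

```python
from typing import Any, Dict

cache: Dict[str, Any] = {}
database: Dict[str, Any] = {}  # stand-in for a real database

def read(key: str) -> Any:
    if key in cache:
        return cache[key]    # cache hit
    value = database[key]    # miss: the application reads the DB itself
    cache[key] = value       # ...and populates the cache
    return value

def write(key: str, value: Any) -> None:
    database[key] = value    # the application writes the DB directly
    cache.pop(key, None)     # invalidate so the next read refreshes
```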
Write Back / Behind Cache: Another hybrid architecture is to store data in the cache initially and, after a certain period or threshold, persist all of it to the database as a bulk insert.
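A write-behind sketch that buffers writes and flushes them once a threshold is reached. The count-based trigger is illustrative; a real system would also flush on a timer and handle flush failures:

```python
from typing import Any, Dict, List, Tuple

class WriteBehindCache:
    """Writes land in the cache first and are bulk-persisted later."""

    def __init__(self, flush_threshold: int = 100):
        self._store: Dict[str, Any] = {}
        self._dirty: List[Tuple[str, Any]] = []  # pending database writes
        self._flush_threshold = flush_threshold
        self.database: Dict[str, Any] = {}       # stand-in for a real DB

    def put(self, key: str, value: Any) -> None:
        self._store[key] = value
        self._dirty.append((key, value))
        if len(self._dirty) >= self._flush_threshold:
            self.flush()

    def flush(self) -> None:
        for key, value in self._dirty:  # one bulk insert in a real DB
            self.database[key] = value
        self._dirty.clear()
```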
Read / Refresh Cache: Data is cached before the user looks for it, using a prediction engine or machine learning model to decide which data should be loaded. For example, if we know the user will look at their followers feed, we load it the moment the user logs in and show the data when asked.
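A pre-warming sketch along those lines; `predict_keys_for` and `load_feed` are hypothetical placeholders for the prediction engine and the feed query:

```python
from typing import Any, Dict, List

cache: Dict[str, Any] = {}

def predict_keys_for(user_id: str) -> List[str]:
    # hypothetical prediction engine: guess what the user will ask for
    return [f"followers_feed:{user_id}"]

def load_feed(key: str) -> Any:
    return f"feed data for {key}"  # stand-in for the real feed query

def on_login(user_id: str) -> None:
    for key in predict_keys_for(user_id):
        cache[key] = load_feed(key)  # warm the cache before it is requested
```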
Placing Cache
Depending on the application requirements, we have to decide how close the cache and application server will be.
It could be:
- A cache server attached to each application server
- A cache server in a global place, in front of the database
If we put the cache along with each application server, responses will be faster. But if a server fails, its cache fails with it. Also, cache data will not be in sync between servers.
On the other hand, if we use a global cache, the cached data remains available even if an application server fails. In this case, responses are comparatively slower, but the data is more consistent, and we can scale the caching mechanism independently.
Cache Replacing / Eviction Policy
Cache memory is limited, so we need a defined policy for invalidating and replacing cached data.
- Least Recently Used (LRU): Replace the data that was used the longest time ago (sketched in code below).
- Least Frequently Used (LFU): Replace the data that has the lowest usage rate.
- Most Recently Used (MRU): Replace the data that was used most recently. This fits cases where data was cached on the prediction that it might be accessed; once it has been delivered to the client, it is no longer required in the cache.
- FIFO: Replace the oldest data in the cache to make room for the latest data.
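As an example, here is a minimal LRU eviction sketch built on Python's `OrderedDict`; the capacity handling is illustrative:

```python
from collections import OrderedDict
from typing import Any, Optional

class LRUCache:
    """Evicts the entry that was used the longest time ago."""

    def __init__(self, capacity: int):
        self._capacity = capacity
        self._store: "OrderedDict[str, Any]" = OrderedDict()

    def get(self, key: str) -> Optional[Any]:
        if key not in self._store:
            return None
        self._store.move_to_end(key)  # mark as most recently used
        return self._store[key]

    def put(self, key: str, value: Any) -> None:
        if key in self._store:
            self._store.move_to_end(key)
        self._store[key] = value
        if len(self._store) > self._capacity:
            self._store.popitem(last=False)  # evict least recently used
```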
Non-Functional Requirements
When it comes to designing a distributed caching mechanism, we have to consider the following.
Scalability: For a scalable system, we have to consider multiple servers. To handle and distribute millions of entries, we should spread the cached data across multiple servers. We can either use key ranges to distribute the data or use consistent hashing, which distributes data uniformly among all the servers.
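A bare-bones consistent-hashing sketch; the virtual-node count and the use of MD5 are illustrative choices:

```python
import bisect
import hashlib
from typing import Dict, List

class ConsistentHashRing:
    """Maps keys to cache servers so that adding or removing a server
    only remaps a small fraction of the keys."""

    def __init__(self, servers: List[str], vnodes: int = 100):
        self._ring: List[int] = []         # sorted hash positions
        self._owners: Dict[int, str] = {}  # position -> server
        for server in servers:
            for i in range(vnodes):        # virtual nodes smooth the spread
                pos = self._hash(f"{server}#{i}")
                self._ring.append(pos)
                self._owners[pos] = server
        self._ring.sort()

    @staticmethod
    def _hash(value: str) -> int:
        return int(hashlib.md5(value.encode()).hexdigest(), 16)

    def server_for(self, key: str) -> str:
        pos = self._hash(key)
        idx = bisect.bisect(self._ring, pos) % len(self._ring)  # wrap around
        return self._owners[self._ring[idx]]
```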
Fault Tolerance: We can use multiple servers and replicate each server's data to other servers. In this case, even if one server is lost, we can use the replicated data. To manage this replication, we can take one of the following approaches:
- Single Master: A single master is responsible for replicating data and for deciding which child server serves read operations (see the sketch after this list).
- Multiple Masters: More than one server accepts writes and replicates them to the others; this approach is used in some deployments.
- No Master: The application layer is responsible for all the data replication work.
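A toy sketch of the single-master approach, fanning each write out to the child servers and routing reads to them; the round-robin read routing is an illustrative choice:

```python
import itertools
from typing import Any, Dict, List

class SingleMasterReplicator:
    """Master applies each write locally and replicates it to children;
    reads are served from the child servers."""

    def __init__(self, num_children: int):
        self._master: Dict[str, Any] = {}
        self._children: List[Dict[str, Any]] = [{} for _ in range(num_children)]
        self._rr = itertools.cycle(range(num_children))  # read routing

    def write(self, key: str, value: Any) -> None:
        self._master[key] = value
        for child in self._children:  # replicate to every child
            child[key] = value

    def read(self, key: str) -> Any:
        child = self._children[next(self._rr)]  # master picks the reader
        return child[key]
```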
System Topology: A topology manager keeps track of the master/child servers and their data, coordinates their read-write operations, and manages a proxy server that handles requests from the Client SDK.
There's a lot that goes into designing a distributed caching mechanism like Redis. These are the core components to consider when we design such a large-scale system. Feel free to reach out with any queries.