Introduction
Hi everyone. In this article, we are going to cover a complete, end-to-end guide to mastering system design. The aim is to be both thorough and clear about what we need to learn in high-level and low-level design. System design is an important skill, especially if we are targeting SDE-2 or higher roles, or aspire to become a tech lead or CTO in the future, because it covers the concepts needed to design highly scalable, robust systems.
What is System Design?
To put it in simple, layman’s terms, system design is the process of designing the tech-based systems we use every day. For instance, most of us use Instagram, WhatsApp, Uber, or Netflix daily. In a system design interview, we are asked to design one of these systems. The interviewer might say, for example, that you must design WhatsApp. As software engineers, there are a few fundamental tools and ideas we should know before designing WhatsApp, and system design teaches us exactly these concepts and tools.

System design is usually taught in two stages. The first is high-level design (HLD), where we create the system’s overall layout. For instance, if we were designing WhatsApp, which database would we choose? How would we use message queues? How would we make use of a cache? This lets us design the system as a whole and describe its various parts; we don’t actually need to write any code here.

Low-level design (LLD) is the second part of system design, where actual machine coding is required: API design, class diagrams, and the data models are all worked out here. This is where our real coding abilities are put to the test. We will cover the concepts of both HLD and LLD in detail, but this article only covers a detailed roadmap for HLD; there will be a separate article for LLD.
Prerequisites for Learning System Design
Before diving into system design, it’s crucial to have hands-on development experience. If we jump into system design without learning development, without building any projects, or without working at least at the SDE-1 level in some company, many system design concepts will seem purely theoretical. Practical experience helps you relate them to real-world scenarios, like knowing when to use a database versus a cache.
High-Level Design (HLD)
Let’s first discuss HLD. Imagine that during an interview, the interviewer asks us to create a system similar to Netflix, or Netflix itself. This means we must design a service that can smoothly deliver video content to millions of viewers across many countries. To understand such a system fully, we need to understand two kinds of requirements: functional and non-functional.
Functional Requirements: Functional requirements describe the exact features of the system that we must build and that users will actually use. For example, the features users will be able to use on Netflix:
- User registration and login
- Subscription management
- Video playback and controls (e.g., play, pause)
Non-functional requirements: Non-functional requirements describe the qualities our system must have rather than its features, i.e., how well it does what it does. For example:
- Security: Only paid users can access premium content, requiring robust authentication and authorization.
- Low Latency: Videos should load quickly to ensure a seamless user experience.
- Scalability: The system must handle millions of concurrent users across different regions.
Once these requirements are clear, the next step is to design the system architecture of Netflix with these factors in mind.
Step 1: Fundamentals
- Serverless vs Serverful: The first step in high-level design (HLD) is understanding architecture choices, like serverless vs serverful. For example, deploying Netflix on AWS Lambda (serverless) means AWS manages scaling and infrastructure, while AWS EC2 (serverful) provides more control but requires more management. The choice depends on the trade-off between scalability, cost, and control, as serverless can become costly for large, high-traffic systems. We have to think about such choices for our system in advance.
- Horizontal vs Vertical Scaling: Initially, our Netflix app may have only 100 or 1,000 users, so a single server is enough. Vertical scaling means increasing the capacity of that single server (e.g., more RAM or storage) to handle more users, which has a physical limit. Horizontal scaling means adding multiple servers and using a load balancer to distribute traffic, providing better scalability for large systems like Netflix.
- OS and networking basics: What are threads? What are pages? How does the internet actually work? How does the request-response cycle work? How does DNS work? We should have basic knowledge of all of these (a quick sketch of the request-response cycle follows this list).
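To make the request-response cycle a bit more concrete, here is a minimal sketch using only Python's standard library: it resolves a hostname with DNS and then issues an HTTPS request to the resolved server. The hostname example.com is just a placeholder; any reachable site would do.

```python
# Minimal request-response sketch: DNS resolution followed by an HTTPS request.
# Uses only the Python standard library; example.com is a placeholder hostname.
import socket
import http.client

host = "example.com"

# Step 1: DNS resolves the human-readable name to one or more IP addresses.
addresses = {info[4][0] for info in socket.getaddrinfo(host, 443, proto=socket.IPPROTO_TCP)}
print("DNS resolved", host, "to", addresses)

# Step 2: the client opens a TCP+TLS connection and sends an HTTP request;
# the server processes it and returns a response (status line, headers, body).
conn = http.client.HTTPSConnection(host, timeout=5)
conn.request("GET", "/")
response = conn.getresponse()
print("HTTP status:", response.status, response.reason)
print("First bytes of body:", response.read(60))
conn.close()
```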
Step 2: Databases
Data is at the core of programming: it is manipulated, transferred, and, above all, stored, and that stored data serves many purposes. Databases are therefore the most crucial component of any system.
Understanding the differences between SQL (e.g., MySQL, PostgreSQL) and NoSQL (e.g., MongoDB, Neo4j) databases, their pros and cons, as well as concepts like in-memory databases, data replication and migration, data partitioning, and sharding (horizontal data partitioning) is critical, as these are common interview topics at top tech companies.
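As a taste of what sharding means in practice, here is a minimal sketch, assuming a hypothetical setup where user records are split across a fixed number of shards by hashing the user ID; real systems (and the consistent hashing covered later) handle resharding far more gracefully.

```python
# A minimal sketch of hash-based data partitioning (sharding).
# The shard count and user IDs are made-up values for illustration.
import hashlib

NUM_SHARDS = 4  # e.g., four separate database instances

def shard_for(user_id: str) -> int:
    """Map a user ID to a shard using a stable hash (not Python's built-in
    hash(), which is randomized between processes)."""
    digest = hashlib.sha256(user_id.encode()).hexdigest()
    return int(digest, 16) % NUM_SHARDS

for uid in ["alice", "bob", "carol", "dave"]:
    print(uid, "-> shard", shard_for(uid))
```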
Step 3: Consistency and Availability
Understanding data consistency and availability is crucial for system design. Consistency ensures every read returns the latest write, while availability guarantees a response, even if it’s not the latest data.
- Data consistency and its levels: Learn about eventual consistency, quorum consistency, causal consistency, and linearizable consistency (a small quorum sketch follows at the end of this step).
- Isolation and its levels: Learn the differences between read uncommitted, read committed, repeatable read, and serializable isolation.
- CAP Theorem: According to the CAP theorem, we essentially have to choose between availability and consistency in the case of a network partition. Depending on the type of system we have, we must decide which of the two to prioritize.
For example, for Netflix we would build a complete subscription-purchase flow for user payments, and there we would prioritize consistency over availability. That’s why payment systems like Paytm or PhonePe prioritize consistency to ensure accurate transaction records, often using SQL databases. In contrast, real-time messaging apps like WhatsApp prioritize availability to deliver messages faster, even if some data is momentarily stale.
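To make quorum consistency concrete, here is a minimal sketch, assuming a made-up cluster of N replicas: as long as the write quorum W and read quorum R satisfy R + W > N, every read overlaps at least one replica that has seen the latest write.

```python
# A minimal quorum sketch: with N replicas, choose read/write quorums R and W
# such that R + W > N, so read and write sets always overlap.
N = 3            # total replicas (made-up cluster size)
W = 2            # a write succeeds once 2 replicas acknowledge it
R = 2            # a read consults 2 replicas and takes the newest version

def is_strongly_consistent(n: int, r: int, w: int) -> bool:
    # Overlap guarantee: any R replicas must intersect any W replicas.
    return r + w > n

print(is_strongly_consistent(N, R, W))   # True: 2 + 2 > 3
print(is_strongly_consistent(3, 1, 1))   # False: a read may miss the latest write
```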
Step 4: Cache
- What is Cache: A cache holds the data that we wish to serve to users rapidly and with minimal latency. It is typically our most frequently used or most popular data. Common caching systems include Redis and Memcached, each with different write and replacement policies.
- Write policies: Learn about how write-back, write-through and write-around determine how data is updated in the cache and main storage.
- Replacement Policies: Replacement policies like Least Frequently Used (LFU), Least Recently Used (LRU), and Segmented LRU decide which existing entries get evicted when new data has to be brought into a full cache (see the LRU sketch after this list).
- Content Delivery Networks(CDNs): CDNs also play a crucial role here, ensuring fast delivery of static content. For example, if a popular show like Black Mirror is releasing a new season, Netflix can pre-store episodes on its CDN servers across different regions, reducing load and minimizing buffering for millions of users.
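Here is a minimal LRU cache sketch in Python using OrderedDict, just to illustrate what a replacement policy does; production systems would rely on Redis or Memcached rather than a hand-rolled cache, and the video keys below are made-up examples.

```python
# A minimal Least Recently Used (LRU) cache sketch using the standard library.
from collections import OrderedDict

class LRUCache:
    def __init__(self, capacity: int):
        self.capacity = capacity
        self.store = OrderedDict()  # preserves recency order

    def get(self, key):
        if key not in self.store:
            return None  # cache miss: caller would fall back to the database
        self.store.move_to_end(key)  # mark as most recently used
        return self.store[key]

    def put(self, key, value):
        if key in self.store:
            self.store.move_to_end(key)
        self.store[key] = value
        if len(self.store) > self.capacity:
            evicted, _ = self.store.popitem(last=False)  # evict least recently used
            print("evicted:", evicted)

cache = LRUCache(capacity=2)
cache.put("video:101", "metadata for Black Mirror")
cache.put("video:102", "metadata for Stranger Things")
cache.get("video:101")                            # touch 101 so 102 becomes LRU
cache.put("video:103", "metadata for The Crown")  # evicts video:102
```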
Step 5: Networking
We also need to cover networking concepts like the differences between TCP and UDP, HTTP vs HTTPS, and the various HTTP versions (1, 2, 3) along with their key improvements. Also learn about WebSockets for real-time, full-duplex communication and WebRTC, which is crucial for video streaming in systems like Google Meet or Zoom.
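To see the TCP vs UDP difference in code, here is a minimal sketch with Python's standard socket module, assuming arbitrary localhost ports 9001 and 9002: UDP fires datagrams with no handshake, while TCP establishes a connection and provides a reliable byte stream.

```python
# Minimal sketch: UDP (connectionless datagrams) vs TCP (connection-oriented stream).
# Ports 9001/9002 are arbitrary localhost choices for this demo.
import socket
import threading
import time

def udp_server():
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)   # UDP: no handshake
    s.bind(("127.0.0.1", 9001))
    data, addr = s.recvfrom(1024)
    s.sendto(data.upper(), addr)

def tcp_server():
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)  # TCP: reliable byte stream
    s.bind(("127.0.0.1", 9002))
    s.listen(1)
    conn, _ = s.accept()                                   # handshake completes here
    conn.sendall(conn.recv(1024).upper())
    conn.close()

threading.Thread(target=udp_server, daemon=True).start()
threading.Thread(target=tcp_server, daemon=True).start()
time.sleep(0.2)  # give both servers a moment to bind

udp = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
udp.sendto(b"hello over udp", ("127.0.0.1", 9001))         # fire-and-forget datagram
print(udp.recvfrom(1024)[0])

tcp = socket.create_connection(("127.0.0.1", 9002))        # blocks until connected
tcp.sendall(b"hello over tcp")
print(tcp.recv(1024))
tcp.close()
```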
Step 6: Load Balancers
We need to learn about:
- Role in horizontal and vertical scaling.
- Types of load balancing: Stateless vs Stateful.
- Common load balancing algorithms: Round Robin, Least Connections, Consistent Hashing (a consistent hashing sketch follows this list).
- Use of Proxy and Reverse Proxy.
- Rate Limiting for DDoS protection and its implementation.
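As an illustration of one of these algorithms, below is a minimal consistent hashing sketch, assuming a hypothetical set of server names and using virtual nodes so that only a small fraction of keys move when a server is added or removed.

```python
# A minimal consistent hashing ring sketch with virtual nodes.
# Server names and the sample key are made-up placeholders.
import bisect
import hashlib

def _hash(value: str) -> int:
    return int(hashlib.md5(value.encode()).hexdigest(), 16)

class ConsistentHashRing:
    def __init__(self, servers, vnodes=100):
        self.ring = {}          # hash position -> server name
        self.sorted_keys = []   # sorted hash positions on the ring
        for server in servers:
            for i in range(vnodes):
                pos = _hash(f"{server}#{i}")
                self.ring[pos] = server
                bisect.insort(self.sorted_keys, pos)

    def get_server(self, key: str) -> str:
        pos = _hash(key)
        idx = bisect.bisect(self.sorted_keys, pos) % len(self.sorted_keys)
        return self.ring[self.sorted_keys[idx]]

ring = ConsistentHashRing(["server-a", "server-b", "server-c"])
print(ring.get_server("user:42"))  # the same key always maps to the same server
```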
Step 7: Message Queues
- Asynchronous Processing: The basic job of message queues is to handle non-critical tasks (tasks that can tolerate slight delays). Example: In WhatsApp, sending a message is critical, but showing the double tick for message delivery is less critical and can be handled through message queues.
- Publisher-Subscriber Model: Kafka and RabbitMQ are popular options. They use the well-known publisher-subscriber model, which is worth reading up on (a minimal in-memory sketch follows this list).
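To get a feel for the publisher-subscriber model, here is a tiny in-memory sketch; the topic name and handlers are made up, and a real deployment would use Kafka or RabbitMQ rather than a Python dict.

```python
# A tiny in-memory publisher-subscriber sketch (real systems use Kafka/RabbitMQ).
from collections import defaultdict

class MessageBroker:
    def __init__(self):
        self.subscribers = defaultdict(list)  # topic -> list of handler callbacks

    def subscribe(self, topic, handler):
        self.subscribers[topic].append(handler)

    def publish(self, topic, message):
        # The publisher does not know (or wait for) the subscribers.
        for handler in self.subscribers[topic]:
            handler(message)

broker = MessageBroker()
broker.subscribe("message.delivered", lambda m: print("update double tick for", m))
broker.subscribe("message.delivered", lambda m: print("record delivery metric for", m))
broker.publish("message.delivered", {"chat_id": 7, "message_id": 1001})
```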
Step 8: Monoliths and Microservices
As per common industry practice, when we build and design systems, we usually start with a monolith. However, as our systems grow in size and our teams get bigger, we eventually need to move from a monolith to microservices.
- Why do we need microservices?
- Concept of single point of failure: A risk in monolithic systems, where a single component failure can bring down the entire system.
- Avoid cascading failures: How small failures can spread across a system, which microservices help avoid.
- Containerization: A crucial part of microservices, allowing services to be packaged independently. Popular tools include Docker.
- How do we migrate from monoliths to microservices: Moving from a monolith to microservices is complex and requires careful planning.
Step 9: Monitoring and Logging
- Logging events and monitoring metrics: Essential for identifying issues. Suppose heavy traffic during a Netflix subscription sale causes system issues; logs help identify these problems and allow better preparation for future sales. Therefore, event logging and real-time metrics monitoring are essential.
- Anomaly detection: Identifies unusual behavior before it impacts the user experience. Popular tools include AWS CloudWatch, Grafana, and Prometheus (a minimal logging and anomaly-detection sketch follows this list).
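Here is a minimal sketch of structured logging plus naive anomaly detection on request latency, using only the standard library; the service name, latency numbers, and z-score threshold are made-up illustration values, and a real setup would ship these metrics to tools like Prometheus or CloudWatch.

```python
# Minimal logging + naive anomaly detection sketch (standard library only).
import logging
import statistics

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
log = logging.getLogger("playback-service")  # made-up service name

def check_latency(history_ms, current_ms, z_threshold=3.0):
    """Flag the current latency as anomalous if it is far outside recent history."""
    mean = statistics.mean(history_ms)
    stdev = statistics.pstdev(history_ms) or 1.0  # avoid division by zero
    z_score = (current_ms - mean) / stdev
    if z_score > z_threshold:
        log.warning("anomalous latency: %.0f ms (z=%.1f)", current_ms, z_score)
        return True
    log.info("latency ok: %.0f ms", current_ms)
    return False

recent = [120, 130, 110, 125, 118, 122]  # made-up recent latencies in ms
check_latency(recent, 121)   # normal request
check_latency(recent, 900)   # spike during a subscription sale, flagged
```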
Step 10: Security
We discussed earlier that Netflix content should only be accessible to subscribers who have paid for access. This requires strong authentication and authorization within our system.
- Tokens for auth (a minimal signed-token sketch follows this list)
- SSO and OAuth
- Access control lists and rule engines: An access control list specifies who can access what, along with the restrictions that apply at each level of system access.
- Encryption
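As a sketch of what token-based auth can look like, here is a minimal JWT-style signed token using only the standard library; the secret key, field names, and expiry are placeholder values, and a real system would use a vetted library and an identity provider rather than hand-rolled tokens.

```python
# A minimal sketch of signed auth tokens (JWT-like), standard library only.
# SECRET and the payload fields are made-up placeholders.
import base64, hashlib, hmac, json, time

SECRET = b"demo-secret-key"

def issue_token(user_id, ttl_seconds=3600):
    payload = {"sub": user_id, "exp": time.time() + ttl_seconds}
    body = base64.urlsafe_b64encode(json.dumps(payload).encode()).decode()
    sig = hmac.new(SECRET, body.encode(), hashlib.sha256).hexdigest()
    return body + "." + sig

def verify_token(token):
    body, sig = token.rsplit(".", 1)
    expected = hmac.new(SECRET, body.encode(), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(sig, expected):
        return None  # tampered token or wrong key
    payload = json.loads(base64.urlsafe_b64decode(body))
    if payload["exp"] < time.time():
        return None  # expired token
    return payload

token = issue_token("user-42")
print(verify_token(token))  # {'sub': 'user-42', 'exp': ...}
```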
Step 11: System Design Tradeoffs
System design is all about making trade-offs. There are no fixed right or wrong answers, just different approaches based on the problem. In interviews, you need to justify your design choices and explain your thought process, as different designers may prioritize different aspects of the system.
- Push vs Pull Architecture: Push sends data proactively without client requests, while pull waits for client requests before sending data.
- Consistency vs Availability (CAP Theorem): Choosing between always returning the latest data (consistency) or ensuring the system is always responsive (availability) during network partitions.
- SQL vs NoSQL: SQL provides strong consistency and structured data, while NoSQL offers scalability and flexibility but may compromise consistency.
- Memory vs Latency vs Throughput vs Accuracy: Balancing fast response times, high data throughput, low memory usage, and data accuracy, depending on the application’s needs.
Step 12: Practice
The final step is practice, practice and practice. The more system design problems you solve, the better you’ll get at designing scalable systems. Analyzing real-world examples helps you understand where to use message queues, when to prefer cache, which database should be preferred where and how to balance consistency and availability based on the use case.
So here are the 10 most popular services that you should practice designing before attending a system design interview.
- YouTube: Teaches us about handling massive video storage, efficient video streaming, recommendation algorithms, and content delivery at scale.
- Twitter: Shows us how to handle real-time data streams, optimize for low-latency, and manage high-read traffic.
- WhatsApp: Provides insights into low-latency messaging, end-to-end encryption, and high availability.
- Uber: Demonstrates how to address the needs of both riders and drivers with different functional requirements.
- Amazon: Teaches us about e-commerce needs such as product search, carts, and order processing.
- Dropbox/Google Drive: Teaches us about file storage, synchronization, and sharing at scale.
- Netflix: Highlights the importance of CDN, adaptive bitrate streaming, global scalability, and personalized content recommendations.
- Instagram: Shows how to handle media-heavy social apps.
- Zoom: Gives insights into real-time video streaming.
- Booking.com/Airbnb: Teaches us about handling marketplace dynamics, booking systems, complex search algorithms, and user verification.
In system design, there is a phrase worth keeping in mind: it is better to satisfy some of the users than to disappoint all of them. We should design our systems with this saying in mind.
Conclusion
In conclusion, it’s completely normal to feel overwhelmed by the vast range of HLD topics, especially when moving from personal projects or service-based roles to large-scale product-based MNCs or startups. While a single software engineer might not directly handle all these components daily, having a strong grasp of these concepts makes it significantly easier to understand, contribute to, and optimize complex, scalable systems in such organizations. This is why mastering system design is so crucial for software engineers.
A detailed roadmap for low-level design will be coming really soon. Till then, keep learning and keep exploring. For more such articles, visit https://upskillltoday.com/