CS Fundamentals · 2026

System Design Interview Questions

System design is the highest-signal round for mid and senior engineering interviews. These are the concepts interviewers actually probe, with concise answers you can reason through out loud.

16 questions with concise, interview-ready answers.

1. How do you approach a system design interview question?

Start by clarifying requirements and scope, separating functional needs (what the system does) from non-functional ones (scale, latency, availability, consistency). Estimate the load — users, requests per second, read/write ratio, and storage — so your design is grounded in numbers. Then sketch a high-level architecture (clients, API layer, services, datastores), and only afterward drill into bottlenecks and trade-offs. Always state your assumptions and explain why you chose one approach over another.
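The load estimate can be worked through as simple arithmetic. The sketch below is one way to do it, not a standard formula set: the function name `estimate_load`, the `peak_factor` of 2, and every input number are assumptions chosen only for illustration.

```python
SECONDS_PER_DAY = 86_400


def estimate_load(daily_users, requests_per_user, reads_per_write,
                  bytes_per_write, retention_days, peak_factor=2.0):
    """Rough capacity numbers for a hypothetical service (all inputs are guesses)."""
    requests_per_day = daily_users * requests_per_user
    avg_rps = requests_per_day / SECONDS_PER_DAY
    # With a reads_per_write:1 ratio, one request in (reads_per_write + 1) is a write.
    writes_per_day = requests_per_day / (reads_per_write + 1)
    reads_per_day = requests_per_day - writes_per_day
    # Only writes add stored data; reads do not grow storage.
    storage_gb = writes_per_day * bytes_per_write * retention_days / 1e9
    return {
        "avg_rps": avg_rps,
        "peak_rps": avg_rps * peak_factor,
        "reads_per_day": reads_per_day,
        "writes_per_day": writes_per_day,
        "storage_gb": storage_gb,
    }


# Assumed inputs: 10M daily users, 10 requests each, 99:1 reads to writes,
# 500 bytes per write, data kept for one year.
estimate = estimate_load(
    daily_users=10_000_000, requests_per_user=10, reads_per_write=99,
    bytes_per_write=500, retention_days=365,
)
for name, value in estimate.items():
    print(f"{name}: {value:,.1f}")
```

With these assumed inputs the service averages roughly 1,157 requests per second and stores about 182.5 GB per year, which is small enough that a single database with a cache is a defensible starting point.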

2. What is the difference between horizontal and vertical scaling?

Vertical scaling (scaling up) means adding more power — CPU, RAM, disk — to a single machine. Horizontal scaling (scaling out) means adding more machines and distributing load across them. Vertical scaling is simpler but has a hard ceiling and a single point of failure, while horizontal scaling is effectively unbounded and more fault-tolerant but requires handling distributed concerns like load balancing and data partitioning. Most large-scale systems favor horizontal scaling.

3. What is load balancing and why is it needed?

A load balancer distributes incoming traffic across multiple backend servers so no single server is overwhelmed. It improves availability and throughput, and routes around unhealthy instances using health checks. Common algorithms include round-robin, least-connections, and consistent hashing. Load balancers can operate at Layer 4 (TCP/UDP) or Layer 7 (HTTP, where they can route by path or header).
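Two of the algorithms above can be sketched in a few lines. This is a minimal in-process illustration, not how a production balancer is built: the class names, the `mark` method standing in for a health check result, and the server labels are all hypothetical.

```python
class RoundRobinBalancer:
    """Hands out servers in turn, skipping any marked unhealthy."""

    def __init__(self, servers):
        self.servers = list(servers)
        self.healthy = set(self.servers)
        self._next = 0

    def mark(self, server, is_healthy):
        # In a real system a periodic health check would call this.
        if is_healthy:
            self.healthy.add(server)
        else:
            self.healthy.discard(server)

    def pick(self):
        # Try each server at most once per call.
        for _ in range(len(self.servers)):
            server = self.servers[self._next]
            self._next = (self._next + 1) % len(self.servers)
            if server in self.healthy:
                return server
        raise RuntimeError("no healthy servers")


class LeastConnectionsBalancer:
    """Sends each new request to the server with the fewest open connections."""

    def __init__(self, servers):
        self.active = {server: 0 for server in servers}

    def acquire(self):
        # Ties go to the server listed first.
        server = min(self.active, key=self.active.get)
        self.active[server] += 1
        return server

    def release(self, server):
        self.active[server] -= 1
```

Round-robin suits requests of similar cost, while least-connections tends to behave better when some requests hold a connection much longer than others.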

4. What is caching and how do you handle cache invalidation?

Caching stores frequently accessed data in a fast layer (in-memory like Redis or Memcached) to reduce latency and database load. Cache invalidation keeps cached data from going stale, and it is famously hard. Common strategies are TTL-based expiry, write-through (update cache and DB together), write-back (write to the cache first and flush to the DB asynchronously, trading durability for write speed), and cache-aside (the app loads on a miss and invalidates on write). You also pick an eviction policy such as LRU to bound memory usage.
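Cache-aside, TTL expiry, and LRU eviction fit together as in the sketch below. It is an in-process stand-in for a real cache such as Redis: the class `LRUCacheWithTTL`, the helpers `read_through` and `write`, and the plain dict used as the "database" are assumptions made for illustration.

```python
import time
from collections import OrderedDict


class LRUCacheWithTTL:
    """Small cache with a size bound (LRU eviction) and per-entry expiry (TTL)."""

    def __init__(self, capacity, ttl_seconds, clock=time.monotonic):
        self._capacity = capacity
        self._ttl = ttl_seconds
        self._clock = clock
        self._data = OrderedDict()  # key -> (value, expires_at)

    def get(self, key):
        entry = self._data.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if self._clock() >= expires_at:
            del self._data[key]  # expired: treat as a miss
            return None
        self._data.move_to_end(key)  # mark as most recently used
        return value

    def set(self, key, value):
        self._data[key] = (value, self._clock() + self._ttl)
        self._data.move_to_end(key)
        if len(self._data) > self._capacity:
            self._data.popitem(last=False)  # evict the least recently used entry

    def invalidate(self, key):
        self._data.pop(key, None)


def read_through(key, cache, db):
    """Cache-aside read: try the cache, fall back to the DB on a miss."""
    value = cache.get(key)
    if value is None:  # note: a stored None would also look like a miss
        value = db[key]
        cache.set(key, value)
    return value


def write(key, value, cache, db):
    """Cache-aside write: update the DB, then drop the cached copy."""
    db[key] = value
    cache.invalidate(key)
```

Invalidating on write rather than updating the cache keeps the write path simple; the next read repopulates the entry from the database.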

5. What is a CDN and when should you use one?

A CDN (Content Delivery Network) is a geographically distributed network of edge servers that cache content close to users. It cuts latency, offloads traffic from your origin servers, and improves availability. CDNs are ideal for static assets like images, video, CSS, and JavaScript, and can also cache cacheable API responses. Examples include Cloudflare, Akamai, and CloudFront.

6. What is the difference between sharding, partitioning, and replication?

Partitioning splits a dataset into smaller pieces; sharding specifically means horizontal partitioning across multiple database servers, where each shard holds a subset of rows (by a shard key). Replication copies the same data to multiple nodes for redundancy and read scaling. In short, sharding spreads data to scale writes and storage, while replication duplicates data to improve availability and read throughput. Large systems often use both together.
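The difference is easy to see in a toy store that does both. In this sketch, which assumes hash-based sharding and synchronous replication purely for simplicity, `shard_for` and `ShardedStore` are hypothetical names and each "node" is just a Python dict.

```python
import hashlib


def shard_for(key, num_shards):
    """Map a shard key to a shard index using a stable hash."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


class ShardedStore:
    """Each shard owns a subset of keys; each shard has several identical copies."""

    def __init__(self, num_shards, replicas_per_shard):
        # shards[i] is a list of copies; index 0 plays the role of the primary.
        self.shards = [
            [{} for _ in range(replicas_per_shard + 1)]
            for _ in range(num_shards)
        ]
        self._reads = 0

    def put(self, key, value):
        copies = self.shards[shard_for(key, len(self.shards))]
        for copy in copies:  # replication: the same data lands on every copy
            copy[key] = value

    def get(self, key):
        copies = self.shards[shard_for(key, len(self.shards))]
        self._reads += 1
        # Spread reads across the copies of the one shard that owns the key.
        return copies[self._reads % len(copies)].get(key)
```

One caveat of the modulo approach used here: changing the number of shards remaps most keys, which is why systems that resize often use consistent hashing or a lookup directory instead.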

7. How do you choose between SQL and NoSQL?

Choose SQL when you need strong consistency, complex queries and joins, transactions, and a well-defined relational schema — it is the default for most applications. Choose NoSQL when you need massive horizontal scale, flexible or evolving schemas, or very high write throughput, accepting weaker consistency or fewer query capabilities. NoSQL covers several models: key-value, document, wide-column, and graph stores. The decision comes down to your access patterns, consistency needs, and scale.

8. What is the CAP theorem?

The CAP theorem states that a distributed system can guarantee at most two of three properties: Consistency, Availability, and Partition tolerance. Because network partitions are unavoidable in distributed systems, partition tolerance is a given, so the real trade-off is between consistency and availability during a partition. A CP system sacrifices availability to stay consistent, while an AP system stays available but may serve stale data. The choice depends on whether your application can tolerate stale reads or downtime.

9. What are the main consistency models?

Strong consistency guarantees every read returns the most recent write, which is simpler to reason about but costs latency and availability. Eventual consistency allows replicas to temporarily diverge and converge over time, favoring availability and performance. In between are models like read-your-writes, monotonic reads, and causal consistency that offer specific guarantees without full strong consistency. You pick the weakest model your use case can tolerate to maximize performance.

10. What are message queues and why use asynchronous processing?

A message queue (like Kafka, RabbitMQ, or SQS) lets services communicate by passing messages instead of calling each other directly. This decouples producers from consumers, smooths out traffic spikes by buffering work, and improves resilience since the queue holds messages until a consumer that was down comes back. Asynchronous processing moves slow or non-critical work (sending emails, generating thumbnails) off the request path so the user gets a fast response. It also enables retries and helps the system scale consumers independently.
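The producer and consumer split can be shown with Python's standard `queue.Queue` standing in for a real broker. This is only an in-process sketch: `handle_signup`, `email_worker`, and `run_demo` are hypothetical names, the `None` sentinel is one common shutdown convention, and a real broker would add persistence and retries that this does not.

```python
import queue
import threading


def handle_signup(email, jobs):
    """Request path: enqueue the slow work and return right away."""
    jobs.put({"type": "welcome_email", "to": email})
    return 202  # accepted; the email is sent later


def email_worker(jobs, sent):
    """Consumer: pulls jobs off the queue at its own pace."""
    while True:
        job = jobs.get()
        try:
            if job is None:  # sentinel: no more work
                return
            sent.append(job["to"])  # stand-in for actually sending an email
        finally:
            jobs.task_done()


def run_demo():
    jobs = queue.Queue()
    sent = []
    consumer = threading.Thread(target=email_worker, args=(jobs, sent))
    consumer.start()
    statuses = [handle_signup(addr, jobs)
                for addr in ("a@example.com", "b@example.com")]
    jobs.put(None)
    consumer.join()
    return statuses, sent
```

The request handler never waits on the email itself, so its response time does not depend on how slow or backed up the consumer is.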

11. How does rate limiting work?

Rate limiting caps how many requests a client can make in a time window to protect a system from abuse, overload, and runaway costs. Common algorithms include token bucket (allows bursts up to a limit), leaky bucket (smooths output to a steady rate), and fixed or sliding window counters. Limits are typically keyed by user, API key, or IP address, and a distributed store like Redis tracks counters across servers. Requests over the limit usually receive an HTTP 429 (Too Many Requests) response, often with a Retry-After header telling the client when to try again.
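A token bucket is short enough to sketch in full. The version below keeps state in one process, whereas a multi-server deployment would hold it in a shared store; the class name `TokenBucket` and the injectable `clock` parameter (used to make the behavior testable) are assumptions of this example.

```python
import time


class TokenBucket:
    """Allows bursts up to `capacity`, refilled at `refill_rate` tokens per second."""

    def __init__(self, capacity, refill_rate, clock=time.monotonic):
        self.capacity = capacity
        self.refill_rate = refill_rate
        self._clock = clock
        self.tokens = float(capacity)  # start full so an initial burst is allowed
        self.last_refill = clock()

    def allow(self, cost=1.0):
        now = self._clock()
        elapsed = now - self.last_refill
        # Add tokens for the time that has passed, never exceeding capacity.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_rate)
        self.last_refill = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False  # caller would typically respond with HTTP 429
```

In practice you would keep one bucket per key (user, API key, or IP address), for example in a dict keyed by client identifier.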

12. What is an API gateway?

An API gateway is a single entry point that sits in front of backend services and routes client requests to the right one. It centralizes cross-cutting concerns like authentication, rate limiting, request routing, SSL termination, logging, and response aggregation. This keeps individual services simpler and gives clients one consistent interface. It is especially common in microservices architectures.

13. Microservices vs monolith — how do you decide?

A monolith packages all functionality into one deployable unit; it is simpler to build, test, and deploy, making it ideal for small teams and early-stage products. Microservices split the system into independently deployable services, enabling teams to scale and ship separately and to scale hot components in isolation. The cost is significant operational complexity — service discovery, network latency, distributed transactions, and harder debugging. A sound rule is to start with a monolith and extract services only once scale or team boundaries justify it.

14. How would you design a URL shortener?

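The core is mapping a short code to a long URL: generate a unique code (a base-62 encoding of an auto-increment ID, or a truncated hash with a collision check), store the mapping in a database, and redirect with an HTTP 301 or 302 on lookup. A 301 is permanent and may be cached by browsers, which lowers load but hides repeat clicks from analytics; a 302 keeps every click flowing through your servers. Reads vastly outnumber writes, so cache hot mappings in something like Redis and, where redirects are cacheable, front the service with a CDN. To scale, shard the datastore by the short code and generate IDs from a distributed counter, or pre-generate a pool of unused keys, so that writers do not contend on a single sequence. Add analytics by logging clicks asynchronously through a queue.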
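The base-62 option can be sketched as follows. The alphabet order, the in-memory dict standing in for the database, and the `Shortener` class are all assumptions of this example; a real service would draw IDs from a database sequence or distributed counter rather than a local integer.

```python
ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"
BASE = len(ALPHABET)  # 62


def encode_base62(n):
    """Turn a non-negative integer ID into a short code."""
    if n < 0:
        raise ValueError("ID must be non-negative")
    if n == 0:
        return ALPHABET[0]
    chars = []
    while n:
        n, remainder = divmod(n, BASE)
        chars.append(ALPHABET[remainder])
    return "".join(reversed(chars))


def decode_base62(code):
    """Inverse of encode_base62."""
    n = 0
    for ch in code:
        n = n * BASE + ALPHABET.index(ch)
    return n


class Shortener:
    """Toy shortener: a local counter and a dict stand in for the datastore."""

    def __init__(self):
        self._next_id = 1
        self._urls = {}

    def shorten(self, long_url):
        code = encode_base62(self._next_id)
        self._urls[code] = long_url
        self._next_id += 1
        return code

    def resolve(self, code):
        return self._urls.get(code)  # None means "not found"
```

Seven base-62 characters cover about 3.5 trillion IDs. Sequential IDs do make codes guessable, so if that matters the ID can be shuffled or a random key pool used instead.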

15. How do you handle a read-heavy vs a write-heavy system?

For read-heavy systems, add caching layers, read replicas, and a CDN, and denormalize data so reads avoid expensive joins. For write-heavy systems, you scale writes with sharding, use write-optimized stores (like LSM-tree databases), batch writes, and buffer them through a message queue. Read-heavy workloads can often tolerate eventual consistency on replicas, while write-heavy ones focus on partitioning the write key well to avoid hotspots. Knowing the read/write ratio up front shapes nearly every other decision.

16. What is the difference between latency and throughput?

Latency is how long a single operation takes — the delay from request to response, measured in milliseconds. Throughput is how many operations the system completes per unit of time, such as requests per second. They are related but distinct: you can have high throughput with high latency (a batched pipeline) or low latency with low throughput. Optimizing one can hurt the other, so you design against the targets your use case actually requires, often reasoning about tail latency like p99 rather than averages.
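A few made-up numbers show why averages hide tail latency and how batching trades latency for throughput. Everything in this sketch is illustrative: the sample latencies, the two hypothetical workers, and the nearest-rank `percentile` helper, which is one of several common percentile definitions.

```python
def percentile(samples_ms, p):
    """Nearest-rank percentile for an integer p between 1 and 100."""
    ordered = sorted(samples_ms)
    rank = (p * len(ordered) + 99) // 100  # integer ceiling of p% of n
    return ordered[rank - 1]


# Assumed sample: 98 fast requests and 2 slow ones.
latencies_ms = [10] * 98 + [1000] * 2
mean_ms = sum(latencies_ms) / len(latencies_ms)
p50_ms = percentile(latencies_ms, 50)
p99_ms = percentile(latencies_ms, 99)
print(f"mean={mean_ms:.1f} ms  p50={p50_ms} ms  p99={p99_ms} ms")


def throughput_per_second(items, seconds):
    return items / seconds


# Hypothetical worker A handles one item at a time in 5 ms.
single_latency_ms = 5
single_throughput = throughput_per_second(items=1, seconds=0.005)

# Hypothetical worker B handles 1000 items per 2-second batch, so each item
# waits for its whole batch before it is done.
batch_latency_ms = 2000
batch_throughput = throughput_per_second(items=1000, seconds=2.0)
print(f"single: {single_latency_ms} ms, {single_throughput:.0f}/s")
print(f"batch:  {batch_latency_ms} ms, {batch_throughput:.0f}/s")
```

Here the mean (29.8 ms) and median (10 ms) both look healthy while p99 is a full second, and the batched worker finishes more items per second despite a far worse per-item delay.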
