Your API works perfectly in development. You test it locally, everything responds in milliseconds, and life is good. Then you launch, traffic starts growing, and things start breaking.
Scaling isn't something you do once. It's a series of decisions you make as your traffic grows from hundreds to thousands to millions of requests. The good news is that each step is well-understood and follows predictable patterns.
In this guide, you'll learn the practical strategies for scaling an API at each stage of growth — from your first users to handling millions of daily requests. No theoretical hand-waving. Just the specific actions to take and when to take them.
When Do You Need to Scale?
Don't scale prematurely. Watch for these concrete signals that it's time to act:
- Response times are climbing — P95 latency goes from 100ms to 500ms+ over weeks
- CPU usage is consistently above 70% — No headroom for traffic spikes
- Memory is always above 85% — Swap usage is increasing, OOM kills are happening
- Error rate is climbing — Timeouts and 503 errors start appearing in your logs
- Database connections are maxing out — Your connection pool is fully utilized during peaks
If none of these are happening, you probably don't need to scale yet. Focus on building features instead.
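If you already log response times, checking the P95 signal above takes only a few lines of standard-library Python. This is a sketch; `p95_latency` is an illustrative helper, not part of any framework:

```python
import statistics

def p95_latency(samples_ms):
    """Return the 95th-percentile latency from a list of response times (ms)."""
    # quantiles(n=20) yields 19 cut points; index 18 is the 95th percentile
    return statistics.quantiles(samples_ms, n=20)[18]

# Example: mostly fast responses with a slow tail
samples = [100] * 95 + [900] * 5
if p95_latency(samples) > 500:
    print("P95 over 500ms, time to investigate")
```

The tail is what matters here: an average hides the slow 5% of requests, while P95 surfaces them.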
The Scaling Architecture
The stages below build toward the full scaling architecture, one piece at a time. You don't need all of it on day one — add pieces as traffic demands it.
Stage 1: Vertical Scaling (Scale Up)
The simplest and often cheapest way to handle more traffic is to upgrade your server. More CPU, more RAM, faster storage. No code changes needed.
| Server Size | Typical Capacity | Best For |
|---|---|---|
| 2 vCPU / 4 GB | ~200 concurrent requests | Side projects, MVPs |
| 4 vCPU / 8 GB | ~1,000 concurrent requests | Growing startups |
| 8 vCPU / 32 GB | ~5,000 concurrent requests | Production SaaS |
| 16+ vCPU / 64+ GB | ~10,000+ concurrent requests | High-traffic APIs |
Vertical scaling works until you hit hardware limits. But it's the right first step because it's fast and requires zero architecture changes.
Stage 2: Add Caching
Caching is the single highest-impact optimization you can make. For read-heavy workloads, a properly configured Redis cache can cut database load by 90% or more:
```python
import redis
import json
import hashlib

cache = redis.Redis(host='localhost', port=6379, db=0)

def make_cache_key(key_prefix, params):
    """Build a stable cache key from the request params."""
    param_hash = hashlib.md5(
        json.dumps(params, sort_keys=True).encode()
    ).hexdigest()
    return f"{key_prefix}:{param_hash}"

def cached_response(key_prefix, params):
    """Look up an API response in the cache; return None on a miss."""
    cached = cache.get(make_cache_key(key_prefix, params))
    if cached:
        return json.loads(cached)  # Cache HIT
    return None  # Cache MISS

def store_in_cache(key_prefix, params, data, ttl=300):
    """Store a response in the cache for `ttl` seconds."""
    cache.setex(make_cache_key(key_prefix, params), ttl, json.dumps(data))
```
What to cache and for how long depends on how often the data changes:
- Static data (configuration, categories): Cache for hours or days
- Semi-dynamic data (product listings, search results): Cache for 5-15 minutes
- User-specific data (dashboards, preferences): Cache for 1-5 minutes or use cache invalidation
- Real-time data (live prices, notifications): Don't cache, or use very short TTLs (seconds)
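The read path these helpers implement is the cache-aside pattern: check the cache, and on a miss do the expensive work and store the result. The sketch below swaps Redis for an in-process dict so it runs standalone; `get_or_compute` and `load_products` are illustrative names, not part of any library:

```python
import time

# In-memory stand-in for Redis, just to illustrate the cache-aside pattern.
# In production you'd call cache.get / cache.setex instead.
_store = {}  # cache_key -> (expires_at, data)

def get_or_compute(cache_key, compute, ttl=300):
    """Return cached data if still fresh, otherwise compute and store it."""
    entry = _store.get(cache_key)
    if entry and entry[0] > time.time():
        return entry[1]                      # cache HIT
    data = compute()                         # cache MISS: do the expensive work
    _store[cache_key] = (time.time() + ttl, data)
    return data

calls = []
def load_products():
    calls.append(1)                          # track how often we hit the "database"
    return [{"id": 1, "name": "widget"}]

get_or_compute("products:all", load_products, ttl=600)
get_or_compute("products:all", load_products, ttl=600)  # served from cache
print(len(calls))  # → 1
```

Only the first call reaches the "database"; every request within the TTL is served from memory, which is exactly where the 90% load reduction comes from.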
Stage 3: Database Optimization
After caching, the database is usually the next bottleneck. Here are the highest-impact optimizations:
```python
import os
from sqlalchemy import create_engine
from sqlalchemy.pool import QueuePool

# Without pooling: opens a new connection for every request (SLOW)
# With pooling: reuses connections from a pre-opened pool (FAST)
DATABASE_URL = os.environ["DATABASE_URL"]

engine = create_engine(
    DATABASE_URL,
    poolclass=QueuePool,
    pool_size=20,       # Base number of connections
    max_overflow=30,    # Extra connections under load
    pool_timeout=30,    # Seconds to wait for a connection
    pool_recycle=1800,  # Recycle connections every 30 min
)
```
Other critical database optimizations:
- Add indexes for every column you query by (WHERE, JOIN, ORDER BY)
- Use EXPLAIN ANALYZE to find slow queries and understand their execution plans
- Avoid SELECT * — Only fetch the columns you actually need
- Use pagination — Never return unbounded result sets
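As a self-contained sketch of those optimizations, here's the pattern in SQLite standing in for a production database (Postgres users would run `EXPLAIN ANALYZE` instead of SQLite's `EXPLAIN QUERY PLAN`). The `orders` table and index name are illustrative:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INT, total REAL)")
conn.executemany("INSERT INTO orders (user_id, total) VALUES (?, ?)",
                 [(i % 10, i * 1.5) for i in range(100)])

# Index the column used in WHERE so lookups don't scan the whole table
conn.execute("CREATE INDEX idx_orders_user_id ON orders (user_id)")

# Inspect the query plan; it should show a search via the index, not a scan
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT id, total FROM orders WHERE user_id = ?", (3,)
).fetchall()
print(plan[0][-1])

# Paginate and fetch only the columns you need, never the whole table
page = conn.execute(
    "SELECT id, total FROM orders WHERE user_id = ? ORDER BY id LIMIT 5 OFFSET 0",
    (3,),
).fetchall()
print(len(page))  # → 5
```

The same habits carry over directly: index your filter columns, read the plan before optimizing, and bound every result set.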
Stage 4: Horizontal Scaling (Scale Out)
When vertical scaling hits its limits, add more servers. This requires a load balancer to distribute traffic across instances:
```nginx
upstream api_servers {
    least_conn;  # Route to the server with the fewest connections
    server 10.0.0.1:8000;
    server 10.0.0.2:8000;
    server 10.0.0.3:8000;
}

server {
    listen 443 ssl http2;
    server_name api.yourdomain.com;

    location / {
        proxy_pass http://api_servers;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
    }
}
```
Important considerations for horizontal scaling:
- Your app must be stateless — Don't store session data in memory. Use Redis or a database instead.
- Use health checks — The load balancer should automatically stop sending traffic to unhealthy servers.
- Deploy identically — All servers should run the same code version. Use CI/CD for consistent deployments.
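The health-check endpoint the load balancer polls can be very small. This sketch uses only the Python standard library; a real check would also ping the database and cache before reporting "ok":

```python
import json
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/health":
            # A real check would verify DB and cache connectivity here
            body = json.dumps({"status": "ok"}).encode()
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_response(404)
            self.end_headers()

    def log_message(self, *args):  # keep the demo quiet
        pass

# Serve on an ephemeral port in a background thread, then poll it once
server = HTTPServer(("127.0.0.1", 0), HealthHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()

port = server.server_address[1]
with urllib.request.urlopen(f"http://127.0.0.1:{port}/health") as resp:
    status = resp.status
    body_text = resp.read().decode()
print(status, body_text)

server.shutdown()
```

Point the load balancer at `/health`; a non-200 response (or a timeout) should pull the instance out of rotation until it recovers.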
Monitoring: Know Before Your Users Do
You can't scale what you can't measure. Set up monitoring for these key metrics from day one:
| Metric | Warning Threshold | Action |
|---|---|---|
| P95 Response Time | > 500ms | Profile slow endpoints, add caching |
| CPU Usage | > 70% sustained | Upgrade plan or add servers |
| Memory Usage | > 85% | Check for memory leaks, upgrade RAM |
| Error Rate | > 1% | Investigate error logs immediately |
| DB Connection Pool | > 80% utilized | Increase pool size or add read replicas |
| Cache Hit Rate | < 80% | Review cache strategy, increase TTLs |
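Two of those metrics, error rate and cache hit rate, can be tracked with simple rolling windows. This sketch keeps counters in-process purely for illustration; in production you would export them to a metrics backend such as Prometheus or Datadog:

```python
from collections import deque

statuses = deque(maxlen=1000)      # recent HTTP status codes
cache_events = deque(maxlen=1000)  # True for a cache hit, False for a miss

def snapshot():
    """Compute current error rate and cache hit rate over the window."""
    errors = sum(1 for s in statuses if s >= 500)
    return {
        "error_rate": errors / len(statuses),
        "cache_hit_rate": sum(cache_events) / len(cache_events),
    }

# Simulate 100 requests: one 5xx failure, an 85% cache hit rate
statuses.extend([200] * 99 + [503])
cache_events.extend([True] * 85 + [False] * 15)

metrics = snapshot()
print(metrics)  # → {'error_rate': 0.01, 'cache_hit_rate': 0.85}
```

With this snapshot on a dashboard, both thresholds from the table (error rate > 1%, hit rate < 80%) become a single comparison.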
What to Do Next
Scaling is a journey, not a destination. Here's a summary of the path:
- Start simple — One server is fine until it isn't. Don't over-engineer prematurely.
- Monitor from day one — You need data to make scaling decisions. Set up dashboards early.
- Cache aggressively — This alone can delay the need for horizontal scaling by months.
- Optimize your database — Indexes and connection pooling are free performance gains.
- Scale horizontally when vertical limits are reached — Multiple small servers behind a load balancer.
- Automate everything — CI/CD, monitoring alerts, automated scaling, automated backups.
Every high-traffic API started as a single server handling a handful of requests. The difference between the ones that scaled and the ones that crashed is that the successful ones planned one step ahead — not ten.