The Playbooks That Actually Move the Needle: Performance Recipes for Monoliths, Microservices, and Serverless

You don’t need another “optimize your queries” blog. You need battle-tested playbooks tied to user metrics and revenue. Here’s what actually works—and how to measure it.

“Performance is a product feature. If you can’t tie it to LCP, p95, and conversion, it’s just an expensive hobby.”

The problem you’ve actually got

If you’ve ever stared at a Grafana dashboard at 2 a.m. wondering why checkout p95 went vertical while Kubernetes looked “green,” you know the feeling. I’ve watched teams spend quarters shaving CPU only to learn their real issue was LCP on mobile and a chatty API gateway. Performance is a product problem first. If it doesn’t move user metrics—p95/p99 latency, Core Web Vitals (LCP/INP), error rates, and ultimately conversion and retention—it’s just busywork.

At GitPlumbers, we ship performance playbooks that start with user journeys and work backward. Below are the recipes we deploy for common architectures. Each one includes the levers that actually work, the configs that matter, and the metrics we use to prove it.

Playbook: Monolith on a single DB (Rails/Django/Spring + Postgres)

You don’t need a microservices rewrite to get real wins. Most monoliths I see are one pg_stat_statements away from a 3–5x latency improvement.

  • Target metrics
    • API p95 < 300 ms, error rate < 0.5%, DB CPU < 70%, cache hit rate > 85%
    • Business: +2–5% conversion, -15–25% infra cost (fewer wasteful DB calls)
  • Steps that work
    1. Turn on query insight and kill the top offenders:
      • Enable pg_stat_statements and slow query log:
        CREATE EXTENSION IF NOT EXISTS pg_stat_statements;
        ALTER SYSTEM SET shared_preload_libraries='pg_stat_statements';  -- needs a Postgres restart to take effect
        ALTER SYSTEM SET log_min_duration_statement=200;  -- log anything slower than 200 ms
        -- then rank the offenders:
        SELECT query, mean_exec_time, calls FROM pg_stat_statements ORDER BY mean_exec_time DESC LIMIT 10;
      • Add missing composite indexes and fix N+1s. In Rails, add the bullet gem and fail CI on new N+1s.
    2. Cache what’s stable:
      • Layer Redis for 5–30s TTL caches on read-heavy endpoints. Measure hit rate in Prometheus (recording-rule sketch after this list).
      • Use ETag/Last-Modified and CDN edge caching for catalog-like pages.
    3. Right-size concurrency:
      • Puma: workers = CPU cores, threads = 5–8; the per-process DB pool must be ≥ threads per worker.
      • Cap the ActiveRecord pool to avoid DB thrash. Use PgBouncer in transaction mode.
    4. I/O wins:
      • Gzip/Brotli and Cache-Control: public, max-age=600, stale-while-revalidate=60 at nginx.
      • Ship HTTP/2; kill server-side template bloat.
  • Measurable outcome
    • Typical result: API p95 from 1.8s → 320ms, Postgres CPU -30%, checkout conversion +3.1% in 3 weeks.
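
To make the cache hit-rate target in step 2 observable, wire it into Prometheus as a recording rule plus an alert. A minimal sketch, assuming you scrape redis_exporter (the metric names below come from that exporter and are an assumption about your setup):

  # prometheus-rules.yml (assumes redis_exporter metrics; adjust names to your exporter)
  groups:
  - name: cache-hit-rate
    rules:
    - record: redis:cache_hit_ratio:5m
      expr: |
        rate(redis_keyspace_hits_total[5m])
          / (rate(redis_keyspace_hits_total[5m]) + rate(redis_keyspace_misses_total[5m]))
    - alert: CacheHitRateLow
      expr: redis:cache_hit_ratio:5m < 0.85   # matches the >85% target above
      for: 15m
      labels:
        severity: warn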

If you’re still fighting the DB at p50, don’t talk sharding. Fix the 10 queries burning 80% of your time first.

Playbook: Microservices behind an API gateway (Envoy/Kong/Istio)

When microservices go slow, it’s almost always death-by-chattiness and missing budgets at the edge. Put the brakes and backoffs at the gateway, not buried in service #14.

  • Target metrics
    • Gateway p95 < 200 ms, p99 < 500 ms, <1% 5xx; per-service SLOs with error budgets
    • Business: +1–3% conversion, -20–40% MTTR via predictable failure modes
  • Controls that matter
    • Timeouts, retries, and circuit breakers at the gateway:
      # Envoy example
      route_config:
        virtual_hosts:
        - name: api
          routes:
          - match: { prefix: "/checkout" }
            route:
              timeout: 1.5s
              retry_policy:
                retry_on: 5xx,reset,connect-failure
                num_retries: 2
                per_try_timeout: 300ms
              max_stream_duration: { max_stream_duration: 3s }
      static_resources:
        clusters:
        - name: payments
          circuit_breakers:
            thresholds:
            - priority: DEFAULT
              max_connections: 2000
              max_requests: 5000
              max_pending_requests: 1000
              max_retries: 3
    • Bulkheads: run high-risk calls in isolated pools/queues; throttle at gateway.
    • Collapse chatty fan-outs: aggregate read paths behind a dedicated read-api service.
    • Async the non-critical: use SQS/Kafka for receipts, emails, fraud signals; return 202 with idempotency keys.
    • Enforce budgets with Istio VirtualService timeouts and DestinationRule outlier detection (sketch after this list).
  • Release safety
    • Canary with Argo Rollouts based on real SLOs (p95 and error rate). Auto-rollback on budget burn.
  • Measurable outcome
    • In a recent cleanup: gateway p95 650ms → 180ms, p99 -60%, checkout 5xx -70%, MTTR -35%, conversion +1.8%.
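
For teams on Istio, the budget enforcement above can be declared as a VirtualService plus a DestinationRule. A minimal sketch; the service names are placeholders and the thresholds should be tuned to your own SLOs:

  # Per-route timeout and retry budget at the mesh edge
  apiVersion: networking.istio.io/v1beta1
  kind: VirtualService
  metadata:
    name: checkout
  spec:
    hosts: ["checkout"]              # placeholder service name
    http:
    - timeout: 1.5s
      retries:
        attempts: 2
        perTryTimeout: 300ms
        retryOn: 5xx,reset,connect-failure
      route:
      - destination:
          host: checkout
  ---
  # Eject flapping upstreams instead of letting them drag p99
  apiVersion: networking.istio.io/v1beta1
  kind: DestinationRule
  metadata:
    name: payments
  spec:
    host: payments                   # placeholder service name
    trafficPolicy:
      outlierDetection:
        consecutive5xxErrors: 5
        interval: 10s
        baseEjectionTime: 30s
        maxEjectionPercent: 50

Same knobs as the Envoy snippet, just declared where GitOps can diff and roll them back.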

Playbook: Event-driven pipelines (Kafka/Flink/Kinesis)

Throughput is pointless if your end-to-end latency blows your SLA. Most “Kafka is slow” pages are just bad partitioning and consumer configs.

  • Target metrics
    • End-to-end latency p95 < 2s, consumer lag ~0 at steady state, checkpoint < 1s
    • Business: faster fraud decisions, stale data complaints down, support tickets -20%
  • What to tune
    • Partitioning: key by high-cardinality id to avoid hot partitions; aim for partitions ≈ max consumer concurrency × 2.
    • Batching: increase fetch.max.bytes and max.poll.records for throughput; cap to meet latency SLO.
    • Compression: lz4 or zstd for network-bound topics.
    • Backpressure: expose consumer lag; autoscale consumers on lag with KEDA (ScaledObject sketch after this list).
    • Flink resiliency: checkpoints every 5–10s, incremental, RocksDB state backend when state > RAM.
  • Useful commands and knobs
    • Lag:
      kafka-consumer-groups --bootstrap-server $BROKER --group orders-etl --describe
    • Kinesis Enhanced Fan-Out to isolate hot consumers; use on-demand for bursty traffic.
  • Measurable outcome
    • After repartition + KEDA autoscale: p95 E2E 4.6s → 1.3s, lag spikes gone, alert volume -50%.
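
The KEDA half of the backpressure bullet is a ScaledObject keyed on consumer-group lag. A minimal sketch, assuming the consumer runs as a Deployment called orders-etl (names, broker address, and thresholds are placeholders):

  apiVersion: keda.sh/v1alpha1
  kind: ScaledObject
  metadata:
    name: orders-etl-scaler
  spec:
    scaleTargetRef:
      name: orders-etl                          # Deployment running the consumer group
    minReplicaCount: 2
    maxReplicaCount: 24                         # keep <= partition count or extra pods sit idle
    cooldownPeriod: 120
    triggers:
    - type: kafka
      metadata:
        bootstrapServers: kafka.kafka.svc:9092  # placeholder broker address
        consumerGroup: orders-etl
        topic: orders
        lagThreshold: "5000"                    # target lag per replica before scaling out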

Playbook: Serverless web/API (AWS Lambda + API Gateway + DynamoDB)

Cold starts and over-slim functions kill tail latency. Memory buys CPU in Lambda; spend it wisely.

  • Target metrics
    • p95 < 250 ms, p99 < 600 ms, init duration < 100 ms on hot paths
    • Business: +1–2% checkout completion, lower abandonment on mobile
  • Concrete steps
    1. Kill cold starts on critical routes with Provisioned Concurrency:
      aws lambda put-provisioned-concurrency-config \
        --function-name checkout-handler \
        --qualifier live \
        --provisioned-concurrent-executions 50
    2. Right-size memory to lower CPU-bound latency; test 256→1024→1536 MB and pick the cheapest p95 (both knobs appear in the SAM sketch after this list).
    3. Keep packages lean: bundle only the deps you need; lazy-load SDKs. Avoid VPC attachments unless you need RDS/ElastiCache.
    4. Use the API Gateway Latency and IntegrationLatency metrics to split backend time from proxy overhead.
    5. DynamoDB: define PK/SK access patterns up front, add DAX for read-heavy paths, and let adaptive capacity (on by default) absorb hotspots.
  • Measurable outcome
    • With provisioned concurrency at 100 and 1024 MB memory: p95 780ms → 210ms, p99 2.1s → 520ms. Cost went up 9%, but the +1.4% conversion lift more than covered it, which the CFO liked.
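
If you deploy with AWS SAM, both knobs from steps 1 and 2 belong in the template rather than ad-hoc CLI calls. A minimal sketch; the function name, runtime, and values are assumptions to adapt:

  # template.yaml (AWS SAM)
  Resources:
    CheckoutHandler:
      Type: AWS::Serverless::Function
      Properties:
        CodeUri: src/                        # placeholder
        Handler: index.handler
        Runtime: nodejs20.x
        MemorySize: 1024                     # memory buys CPU; pick the cheapest p95
        AutoPublishAlias: live               # publishes the "live" qualifier used in the CLI example
        ProvisionedConcurrencyConfig:
          ProvisionedConcurrentExecutions: 50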

Playbook: SPA + Edge (Next.js/React + Cloudflare/Fastly)

If LCP is trash on 4G, nothing else matters. You don’t win Core Web Vitals in your Kubernetes cluster.

  • Target metrics
    • LCP < 2.5s, INP < 200ms, TTFB < 200ms on mobile; >90 Lighthouse on product pages
    • Business: +3–8% mobile conversion, SEO lift, lower ad CAC
  • What moves the needle
    • Server render critical paths (Next.js app router), stream HTML, hydrate only where needed.
    • next/image with AVIF/WebP, responsive sizes, and priority on hero.
    • Split bundles aggressively; React.lazy (or next/dynamic) below the fold.
    • Edge cache HTML for anon users with stale-while-revalidate; cache APIs at the CDN when safe.
    • Preconnect, preload key fonts; inline critical CSS ≤ 14KB.
  • Example headers
    Cache-Control: public, max-age=600, s-maxage=600, stale-while-revalidate=60
  • Measure and enforce
    • Lighthouse CI in PRs (config sketch after this list); WebPageTest for mobile; RUM for real users via the Next.js web-vitals hook.
  • Measurable outcome
    • After edge caching + image optimization: LCP p75 3.4s → 1.9s, bounce -6%, add-to-cart +4.2%.
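
To enforce those budgets in CI, Lighthouse CI reads a lighthouserc file; a minimal YAML sketch with assertion thresholds matching the targets above (the staging URL is a placeholder, and TBT stands in for INP since lab runs can't measure interaction latency):

  # .lighthouserc.yml
  ci:
    collect:
      url:
        - https://staging.example.com/product/example   # placeholder URL
      numberOfRuns: 3
    assert:
      assertions:
        largest-contentful-paint:
          - error
          - maxNumericValue: 2500    # ms, mirrors the LCP < 2.5s target
        total-blocking-time:
          - warn
          - maxNumericValue: 300     # lab proxy for INP
    upload:
      target: temporary-public-storage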

Playbook: LLM/Vector retrieval features (FAISS/Pinecone/Qdrant)

AI features don’t get a pass on latency. A chat assistant with a 2s response is fine; 7s feels broken. Most of the time goes to retrieval and token streaming.

  • Target metrics
    • Retrieval p95 < 200ms, first-token < 700ms, answer < 2.5s for short prompts
    • Business: higher task completion, lower abandonment in support flows
  • What to control
    • ANN index tuned for recall-latency tradeoff: HNSW with M=32, ef_search=128 as a starting point; profile.
    • Cache frequent queries and embeddings in Redis with 5–60s TTL; dedupe by fingerprint(prompt, user_ctx).
    • Batch embeddings; reuse normalized vectors.
    • Stream tokens early; use shorter system prompts and tools over longer context.
  • Example Qdrant collection config (query-time ef is set per request via search params)
    hnsw_config:
      m: 32
      ef_construct: 128
    quantization_config:
      scalar:
        type: int8
        always_ram: true
  • Measurable outcome
    • Tuning ef_search and caching top 5% of queries: retrieval p95 420ms → 160ms, first-token 1.4s → 650ms, self-serve deflection +9%.

Operational discipline: make it stick with SLOs and GitOps

Without guardrails, performance rots. Bake these playbooks into your delivery system.

  • SLOs by surface
    • Public API: p95 < 300ms, error rate <0.5%
    • Web: LCP p75 <2.5s, INP <200ms
    • Stream: E2E p95 <2s
  • Instrumentation
    • OpenTelemetry traces to Jaeger/Tempo; RED metrics in Prometheus; RUM for web vitals.
  • Release gates
    • Argo Rollouts canary with PromQL checks:
      analysis:
        metrics:
        - name: api-latency-p95
          interval: 30s
          successCondition: result[0] < 0.3
          provider:
            prometheus:
              query: histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket{job="api"}[5m])) by (le))
  • Capacity
    • K8s HPA on RPS/CPU for stateless (CPU-based HPA sketch after this list), VPA for memory-heavy; KEDA on Kafka lag; Lambda Provisioned Concurrency for hot paths.
  • Reporting
    • Monthly: show p95, error rate, infra $, and conversion/retention deltas. Finance cares about the last two.
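
For the stateless tier, the capacity bullet boils down to an HPA. A minimal CPU-based sketch (RPS-based scaling needs a custom-metrics adapter such as prometheus-adapter; the Deployment name is a placeholder):

  apiVersion: autoscaling/v2
  kind: HorizontalPodAutoscaler
  metadata:
    name: api
  spec:
    scaleTargetRef:
      apiVersion: apps/v1
      kind: Deployment
      name: api                        # placeholder Deployment name
    minReplicas: 3
    maxReplicas: 30
    metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 60       # leave headroom so p95 doesn't spike before scale-out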

This is where GitPlumbers usually enters: we codify the playbooks, wire SLOs into rollout controllers, and leave you with dashboards leadership can actually use.

Key takeaways

  • Performance work must be driven by user-facing metrics like p95 latency, LCP, and error budgets—not vanity infra graphs.
  • Each architecture has a small set of high-ROI levers; focus on those and instrument them well.
  • Codify your playbooks, wire them into GitOps, and gate releases on SLOs to avoid regression creep.
  • Measure business impact alongside tech metrics; conversion and retention improvements make performance spend obvious to finance.

Implementation checklist

  • Define SLOs by surface: API p95, web LCP/INP, streaming end-to-end latency.
  • Instrument tracing (`OpenTelemetry`), metrics (`Prometheus`), and web vitals (`Lighthouse`, `WebPageTest`).
  • Pick one architecture playbook and run it top-to-bottom in 2 weeks; don’t boil the ocean.
  • Automate: alerts on SLO burn (burn-rate rule sketch below), rollback via `Argo Rollouts`, and capacity via HPA/VPA or Provisioned Concurrency.
  • Report both tech and business deltas: p95, errors, infra $, conversion, churn, and NPS.
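
For the SLO-burn alerts, a single fast-burn rule gets you most of the value. A sketch assuming RED metrics exposed as http_requests_total with a code label (an assumption; adjust to your instrumentation) and the 0.5% API error budget above:

  # Page when the API burns its error budget ~14x faster than sustainable
  groups:
  - name: api-slo-burn
    rules:
    - alert: ApiErrorBudgetFastBurn
      expr: |
        sum(rate(http_requests_total{job="api", code=~"5.."}[5m]))
          / sum(rate(http_requests_total{job="api"}[5m]))
          > 14.4 * 0.005
      for: 2m
      labels:
        severity: page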

Questions we hear from teams

How do I pick which playbook to run first?
Start where user pain and dollar impact intersect. If you have a slow web journey (LCP > 2.5s), start with the SPA + Edge playbook. If backend p95 spikes during load, do gateway + microservices. Tie the effort to a single SLO and a single KPI (e.g., checkout conversion) for 2 weeks.
What if our infra team can’t support all these tools?
You don’t need everything on day one. Start with `OpenTelemetry` traces, `Prometheus` RED metrics, and `Lighthouse`. GitOps the configs you touch (gateway, HPA, rollouts). Add more once SLOs stabilize.
How do we prevent performance regressions after we fix them?
Gate releases on SLOs with `Argo Rollouts` or `Flagger`, add perf budgets in CI (Lighthouse CI, k6 smoke), and alert on error budget burn rates. Make regressions visible in standups with a tiny scorecard.
Will these changes blow up our cloud bill?
Usually the opposite. Killing N+1s, adding caches, and tuning timeouts reduce waste. If you add capacity (e.g., Provisioned Concurrency), measure business lift. We’ve seen +1–4% conversion dwarf single-digit percent cost increases.

Ready to modernize your codebase?

Let GitPlumbers help you transform AI-generated chaos into clean, scalable applications.

Run a 2-week Performance Playbook Sprint
See how we set SLOs that actually stick
