The 5 Stages of Monitoring: From 'Firefighting' to 'Fire Prevention'
From 'Is it up?' to 'We see everything' – 5 monitoring stages to save your sanity (and $50K in outages). Start small, win big!

Welcome, brave soul, to the Five Stages of Monitoring Grief—a journey from "What’s monitoring?" to "Why didn’t we do this years ago?" If you’re reading this, you’re either:
- A developer tired of being woken up at 3 AM because "the website is slow" (with zero context).
- A CTO who just realized your "highly available" system is held together by duct tape and prayers.
- A sysadmin who has given up on life and now just drinks coffee while watching dashboards blink red.
- A business executive who thinks "the cloud handles monitoring" (spoiler: it doesn’t).

Here’s the hard truth: 80% of companies operate in Stage 0—complete monitoring oblivion—while losing $5,600 per minute in downtime (Gartner, 2023). Meanwhile, the other 20% sleep peacefully, sipping margaritas on a beach, because their systems tell them when something’s wrong before users do.
This isn’t just a technical problem—it’s a business suicide pact. Companies like Knight Capital (lost $460M in 45 minutes due to unmonitored deployments) and British Airways ($180M fine for a preventable outage) serve as cautionary tales. Yet, most teams still treat monitoring like an optional extra, somewhere between "nice to have" and "we’ll get to it after we rewrite the frontend in Rust."

So buckle up. We’re about to take you on a sarcastic, brutal, and (hopefully) enlightening tour of monitoring maturity—from blissful ignorance to full observability. And yes, we’ll throw in real config snippets, open-source tools, and bulletproof arguments to convince your boss that monitoring isn’t a cost—it’s the cheapest insurance policy you’ll ever buy.
Stage 0: The Void – "We Don’t Know What We Don’t Know (And We’re Proud of It)"

Symptoms of Stage 0:
✅ "Is the site down?" (First clue: customers are screaming on Twitter.)
✅ "The cloud dashboard says everything’s fine!" (No, AWS doesn’t monitor your app.)
✅ "We’ll know if something’s wrong because… uh… someone will tell us?"
✅ "Monitoring is for paranoid people." (Spoiler: Paranoid people don’t lose millions in outages.)
At Stage 0, your production system is a black box. You have zero visibility into:
- Application errors (500s? Timeouts? Who knows!)
- Database failures (Is PostgreSQL on fire? Maybe!)
- Third-party API meltdowns (Stripe’s down? Guess we’ll find out when payments fail.)
- Server resource exhaustion (OOM killer just murdered your app? Shrug.)
The Cost of Ignorance

- $300K/hour – Average cost of downtime (ITIC, 2024)
- $13.22M/hour – Amazon’s downtime cost (CNBC, 2022)
- Your sanity – Priceless.
Common Excuses
"We don’t have time for monitoring." You don’t have time for firefighting 24/7 either.
"The cloud provider handles monitoring." AWS monitors their servers, not your buggy code.
"It’s too expensive." Not as expensive as losing customers because your checkout page is down.
"We’ve never had a major outage." Yet. (Famous last words.)
How to Escape Stage 0
Start with one tiny step: Monit—a free, lightweight tool that restarts crashed services and sends alerts.
Example: Monitoring a Website with Monit
check host kalvad.com with address kalvad.com
if failed port 80 protocol http then alert
if failed port 443 protocol https then alert
What this does:
- Connects to the host kalvad.com.
- Tests it over HTTP and raises an alert if the check fails.
- Runs the same check over HTTPS.
Stage 1: "Is It Up?" – The Illusion of Monitoring (Spoiler: It’s Not Enough)

Symptoms of Stage 1:
✅ You have Pingdom (or even better, Uptime Kuma) checking if your homepage loads.
✅ You get too many Slack alerts (which you ignore because false positives).
✅ You still don’t know why things break—just that they are broken.
The Problem with Stage 1
- HTTP checks ≠ monitoring. (Congrats, your homepage loads! Too bad your database is on fire.)
- No historical data. (Was this slow yesterday? Who knows!)
- Alert fatigue. (Your team now ignores all Slack notifications.)
A Better Approach: Monit + Basic Logs
While Uptime Kuma tells you "the site is down," Monit can:
- Restart failed services.
- Check database connectivity.
- Monitor disk space, CPU, and memory.
Example: Monitoring PostgreSQL with Monit
check host postgres with address 127.0.0.1
if failed port 5432 protocol pgsql then alert
if failed unixsocket /var/run/postgresql/.s.PGSQL.5432 protocol pgsql then alert
Why this rocks:
- No more "the database is down" surprises.
- Auto-recovery for common failures.
- Still free.
HTTP/HTTPS Endpoint Monitoring with Monit and Slack Alerts
Let’s say you want to monitor a specific HTTP endpoint and get Slack notifications when it goes down or comes back up. Here’s how you can do it:
- Create a Slack Webhook URL: Go to your Slack workspace, create an incoming webhook, and grab the URL.
- Create a Slack Notification Script: Save this as /usr/local/bin/slack.sh:
#!/bin/bash
URL="YOUR_SLACK_WEBHOOK_URL"
COLOR=${MONIT_COLOR:-$([[ $MONIT_EVENT == *"succeeded"* ]] && echo good || echo danger)}
TEXT=$(echo -e "$MONIT_EVENT: $MONIT_DESCRIPTION (https://monit.kalvad.com)" | python3 -c "import json,sys;print(json.dumps(sys.stdin.read()))")
PAYLOAD="{
\"attachments\": [
{
\"text\": $TEXT,
\"color\": \"$COLOR\",
\"mrkdwn_in\": [\"text\"],
\"fields\": [
{ \"title\": \"Date\", \"value\": \"$MONIT_DATE\", \"short\": true },
{ \"title\": \"Host\", \"value\": \"$MONIT_HOST\", \"short\": true }
]
}
]
}"
curl -s -X POST --data-urlencode "payload=$PAYLOAD" $URL
Don't forget to make it executable: chmod +x /usr/local/bin/slack.sh
- Configure Monit to Check the HTTP Endpoint: Add this to your Monit configuration:
check host kalvad.com with address kalvad.com
    if failed
        port 443 protocol https request /up
        with timeout 5 seconds
        certificate valid > 5 days
    then exec "/usr/local/bin/slack.sh"
    else if succeeded then exec "/usr/local/bin/slack.sh"
What this does:
- Checks that https://kalvad.com/up is up.
- Validates that the SSL certificate stays valid for more than 5 days.
- Sends a Slack notification when the status changes.
Why this rocks:
- Proactive monitoring: You know when your endpoint is down.
- Certificate monitoring: Avoid SSL certificate expiration surprises.
- Slack integration: Get alerts where you already communicate.
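One catch: the /up path has to exist on the application side. Here is a minimal sketch of such a health endpoint in FastAPI — the psycopg connection string is a placeholder, so swap in whatever your app actually talks to:
from fastapi import FastAPI, Response
import psycopg  # placeholder driver; use whatever database client you already run

app = FastAPI()

@app.get("/up")
def up():
    try:
        # Touch the database so /up reflects real health, not just "the process is running".
        with psycopg.connect("dbname=app user=app connect_timeout=2") as conn:
            conn.execute("SELECT 1")
        return {"status": "ok"}
    except Exception:
        # Any non-2xx response makes Monit's HTTPS check fail and run the Slack script.
        return Response(status_code=503)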
Stage 2: "Why Did It Break?" – Logs, Errors, and the Art of Not Guessing

Symptoms of Stage 2:
✅ You collect logs (but don’t actually read them).
✅ You know when something breaks (but not why).
✅ Your "monitoring" is just tail -f /var/log/nginx/error.log
in a terminal.
The Info You Should Be Collecting
Source | Why It Matters | Tool to Use |
---|---|---|
Application logs | "Why is the checkout failing?" | Fluent Bit |
Load balancer logs | "Is traffic being dropped?" | Fluent Bit |
Database slow queries | "Why is PostgreSQL at 100% CPU?" | pgBadger |
API error responses | "Why are users getting 500s?" | Sentry |
Fluent Bit + OpenObserve: A Match Made in Logging Heaven
Instead of grepping logs like a caveman, let’s ship them to OpenObserve (a free, scalable alternative to Datadog/ELK) via Fluent Bit, a lightweight log-shipping agent.
Step 1: Install Fluent Bit
# Ubuntu/Debian
sudo apt install fluent-bit
# Config file: /etc/fluent-bit/fluent-bit.conf
(The example uses Ubuntu, but of course we recommend a real production OS like Alpine, Void, Arch, or FreeBSD.)
Step 2: Configure Fluent Bit to Ship Logs to OpenObserve
[INPUT]
    Name   tail
    Path   /var/log/nginx/access.log
    Parser nginx
    Tag    nginx.access

[INPUT]
    Name   tail
    Path   /var/log/myapp/app.log
    Parser json
    Tag    app.logs

[OUTPUT]
    Name   http
    Match  *
    Host   openobserve.yourcompany.com
    Port   5080
    URI    /api/default/_json
    Format json
    Header Authorization Bearer YOUR_API_KEY
Step 3: Query Logs in OpenObserve
Now you can search, filter, and alert on logs without SSHing into servers.
Example Queries:
"status=500"
→ Find all server errors."response_time > 1000"
→ Slow requests."method=POST AND path=/checkout"
→ Failed payments.
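If you’d rather script it than click around, the same kind of query can go through OpenObserve’s search API. A rough sketch, reusing the host and token from the Fluent Bit config above — the endpoint and payload shape are assumptions that may differ between OpenObserve versions, so check your instance’s API docs:
import time
import requests

now_us = int(time.time() * 1_000_000)        # OpenObserve expects microsecond timestamps
hour_ago_us = now_us - 3600 * 1_000_000

resp = requests.post(
    "https://openobserve.yourcompany.com/api/default/_search",
    headers={"Authorization": "Bearer YOUR_API_KEY"},
    json={
        "query": {
            "sql": "SELECT * FROM nginx_access WHERE status = 500",  # stream name is illustrative
            "start_time": hour_ago_us,
            "end_time": now_us,
            "from": 0,
            "size": 50,
        }
    },
    timeout=10,
)
resp.raise_for_status()
for hit in resp.json().get("hits", []):  # response shape assumed; adjust to what your instance returns
    print(hit.get("path"), hit.get("status"))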
Step 4: Install Error Tracking (Self-Hosted First!)
Why Self-Hosting Matters to Us
At Kalvad, we prioritize data privacy and control, which is why we always prefer self-hosted solutions when possible. While Sentry was great for years, its dependence on Kafka, ZooKeeper, and Snuba made it too complex for our infrastructure.
Today, we use:
✅ GlitchTip (open-source, lightweight)
✅ Bugsink (self-hosted, Sentry-compatible)
Both are API-compatible with Sentry; the only thing you change is the Sentry DSN.
Nonetheless, here is how to configure it with FastAPI:
Install sentry-sdk from PyPI with the fastapi extra:
pip install "sentry-sdk[fastapi]"
Why This Beats tail -f
✅ Centralized logs (No more "Which server was that error on?").
✅ Retention policies (Logs older than 30 days? Automatically deleted.)
✅ Alerts on errors (Get Slack/Discord notifications for 500 errors.)
Common Pitfalls
🔥 "Log Hoarding" – Collecting everything but never looking at it.
🔥 "We Use journalctl
!" – (Great, enjoy scrolling through 10GB of logs.)
🔥 "Sentry is Enough!" – (No, Sentry only catches uncaught exceptions, not slow queries or failed API calls.)
Pro Tip: If you’re not collecting load balancer logs, you’re missing half the story. (Is AWS ALB dropping requests? Who knows!)
Stage 3: "Is It Getting Worse?" – Metrics, Trends, and Why Slack Alerts Suck

Symptoms of Stage 3:
✅ You track metrics (CPU, memory, request latency).
✅ You see trends (e.g., "Database load doubles every Friday").
✅ You still use Slack for alerts (and your team hates you for it).
The Problem with Slack Alerts
- Too noisy. (Your #alerts channel is a graveyard of ignored messages.)
- No escalation. (If the on-call engineer misses it, nobody sees it.)
- No context. ("High CPU!" → Which server? Which process?)
A Better Alerting System
Tool | Why It’s Better Than Slack |
---|---|
Alerta | Deduplication, escalation, and silence rules. |
PagerDuty / Opsgenie | Actually wakes people up (unlike Slack). |
OpenObserve Alerts | Integrates with logs & metrics (so you get context). |
Why This Works:
- Only pages the right people (Devs for app errors, DBAs for PostgreSQL issues).
- Escalates if unacknowledged (No more "I didn’t see the Slack message" excuses).
- Groups related alerts (No spam for the same issue).
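As an illustration, feeding an alert into Alerta is a single HTTP call, so any script, cron job, or service can raise one. A sketch assuming a self-hosted Alerta at alerta.yourcompany.com and an API key — the field values are illustrative, so check your Alerta version’s alert model:
import requests

resp = requests.post(
    "https://alerta.yourcompany.com/api/alert",   # placeholder host
    headers={"Authorization": "Key YOUR_ALERTA_API_KEY"},
    json={
        "resource": "web-01",
        "event": "HighErrorRate",
        "environment": "Production",
        "severity": "major",
        "service": ["checkout"],
        "group": "Application",
        "text": "5xx rate on /checkout above 5% for 10 minutes",
    },
    timeout=5,
)
resp.raise_for_status()
# Alerta deduplicates repeats of the same resource/event pair instead of spamming a channel.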
Metrics You Should Be Tracking
Metric | Why It Matters | Tool |
---|---|---|
Request latency (P99) | "Is the API slowing down?" | OpenTelemetry + OpenObserve |
Database connections | "Are we about to hit max_connections?" | pg_exporter |
Queue depth (RabbitMQ/Kafka) | "Are messages piling up?" | OpenTelemetry |
Error rates per endpoint | "Is /checkout failing more than usual?" | OpenObserve |
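Most of these can be emitted straight from the application with the OpenTelemetry metrics API. A sketch — the OTLP endpoint and the meter/metric names are assumptions, so point them at whatever actually receives your metrics:
from opentelemetry import metrics
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import PeriodicExportingMetricReader
from opentelemetry.exporter.otlp.proto.grpc.metric_exporter import OTLPMetricExporter

# Export metrics periodically over OTLP (endpoint is an assumption; adjust to your setup).
reader = PeriodicExportingMetricReader(OTLPMetricExporter(endpoint="http://openobserve:5081"))
metrics.set_meter_provider(MeterProvider(metric_readers=[reader]))
meter = metrics.get_meter("checkout-service")

checkout_errors = meter.create_counter("checkout.errors", description="Failed checkout attempts")
request_latency = meter.create_histogram("http.server.duration", unit="ms")

# Somewhere in your request handling:
checkout_errors.add(1, {"endpoint": "/checkout"})
request_latency.record(142.0, {"endpoint": "/checkout", "status": 200})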
Common Pitfalls
🔥 "Dashboard Decorating" – Pretty Grafana dashboards that nobody looks at.
🔥 "Alerting on Everything" – (Your team will ignore all alerts if 90% are noise.)
🔥 "We Don’t Need Trends!" – (Until your database suddenly runs out of disk at 2 AM.)
Stage 4: "We See Everything" – Full Observability (AKA Sleeping Through the Night)

Symptoms of Stage 4:
✅ Traces (Where is that request spending 2 seconds?)
✅ Logs (What was the exact error?)
✅ Metrics (Is this a one-off or a trend?)
✅ Alerts that actually work (No more 3 AM false positives.)
The Holy Trinity of Observability
- Metrics (OpenTelemetry) → "Is something wrong?"
- Logs (OpenObserve) → "What exactly happened?"
- Traces (OpenTelemetry) → "Where is the bottleneck?"
Example: Monitoring FastAPI with OpenTelemetry + OpenObserve
(Stolen from OpenObserve’s blog)
Step 1: Instrument FastAPI with OpenTelemetry
pip install opentelemetry-api opentelemetry-sdk opentelemetry-exporter-otlp opentelemetry-instrumentation-fastapi
Step 2: Configure OpenTelemetry in FastAPI
from fastapi import FastAPI
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor
# Set up OpenTelemetry
provider = TracerProvider()
processor = BatchSpanProcessor(OTLPSpanExporter(endpoint="http://openobserve:5080"))
provider.add_span_processor(processor)
trace.set_tracer_provider(provider)
# Instrument FastAPI
app = FastAPI()
FastAPIInstrumentor.instrument_app(app)
Step 3: Query Traces in OpenObserve

Now you can:
- See exact latency breakdowns (DB query took 800ms? Fix it.)
- Find slow endpoints (/users is 3x slower than /products? Optimize it.)
- Correlate logs & traces (No more guessing which log line matches which request.)
Example: Finding a Slow Database Query
- Trace shows /checkout is slow.
- Drill into the trace → See a 1.2s PostgreSQL query.
- Add an index → Problem solved.
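If the slow query doesn’t get its own span automatically, give it one. A sketch reusing the tracer and app from Step 2 — fetch_cart_from_postgres is a hypothetical helper standing in for your real query:
from opentelemetry import trace

tracer = trace.get_tracer("checkout-service")

@app.get("/checkout")
def checkout():
    # Give the suspect query its own span so its 1.2s stands out in the trace waterfall.
    with tracer.start_as_current_span("db.fetch_cart") as span:
        rows = fetch_cart_from_postgres()  # hypothetical helper for your real query
        span.set_attribute("db.rows_returned", len(rows))
    return {"items": len(rows)}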
Why This Beats Guessing
✅ No more "It’s slow, but we don’t know why."
✅ No more "Works on my machine."
✅ No more 3 AM debugging sessions.
Common Pitfalls
🔥 "Observability Theater" – Buying expensive tools but not using them.
🔥 "Traces Are Too Slow!" – (Sample them! You don’t need 100% traces in dev.)
🔥 "We Don’t Need This!" – (Until an outage costs you $50K and you realize you’re flying blind.)
How to Sell Monitoring to Your Boss (Without Getting Fired)

1. Start Small, Show Wins
We don’t need to build the perfect monitoring system on day one. Let’s take a phased approach that delivers quick wins while keeping costs low:
- First Week: Deploy Monit for basic uptime checks ($0 cost). This will catch crashes before users notice, reducing downtime and support calls.
- First Month: Add Fluent Bit and OpenObserve to track application errors and slow requests ($100 monthly cost). We’ll identify and fix performance bottlenecks that affect customer satisfaction.
- First Quarter: Implement OpenTelemetry for end-to-end visibility ($2,000 monthly cost). This will help us spot trends and prevent outages before they happen.
The associated costs are similar to what you would pay any reasonable cloud provider to operate this software.
Each step builds on the last, showing measurable improvements without big upfront investments. We’ll demonstrate ROI at every stage while keeping the system simple and maintainable.
2. Address Objections Like a Pro
Let’s be honest, these objections are older than dial-up internet. Here’s how to shut them down with style:
"It’s too expensive." "Our last outage cost $50K. For $5K/year, I’ll even install this myself on a Raspberry Pi taped to the server."
"We don’t have time." "We currently spend 20 hours/week playing ‘Whack-a-Mole’ with outages. This tool gives us 15 of those hours back – that’s like hiring a part-time firefighter who doesn’t need coffee breaks."
"The cloud handles it." "AWS monitors their infrastructure, not our code. That’s like trusting your landlord to water your plants. Spoiler: Your plants are dead, and so is our checkout page."
"Developers will resist." "They resist 3 AM fire drills even more. This is basically giving them the gift of sleep – and you’ll be the hero who made it happen."
Each of these responses is backed by hard data and the kind of logic that would make Spock proud. Use them liberally!
Final Thought

Monitoring isn’t a cost. It’s insurance.
- Stage 0? You’re one outage away from disaster.
- Stage 1-2? You’re reacting to problems.
- Stage 3-4? You’re preventing them.
So—what’s your excuse?
(P.S. If you’re still in Stage 0, I hope you like firefighting. Because that’s your life now.) 🔥
Bonus: Real-World Monitoring Horror Stories (And How to Avoid Them)

1. The SSL Certificate That Expired at Midnight
What Happened:
- A major e-commerce site’s SSL certificate expired at midnight.
- No monitoring → Site went down.
- Lost sales piled up before anyone noticed.
How to Avoid It:
- Monit SSL Check (like the one above).
- Let’s Encrypt + Certbot (auto-renew certificates).
2. The Database That Ran Out of Disk Space
What Happened:
- A database disk filled up at 2 AM.
- No disk monitoring → Site crashed.
- 3 hours of downtime to restore from backups.
How to Avoid It:
check filesystem data with path /dev/sda1
if space usage > 80% then alert
3. The API That Slowly Died
What Happened:
- An API endpoint degraded over weeks.
- No trend monitoring → Users complained, but no one knew why.
- $50K in lost revenue before the issue was found.
How to Avoid It:
- OpenTelemetry latency tracking.
- Alert on P99 response time.
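The P99 math itself is trivial — a toy sketch (the numbers are made up; in real life the latencies come from your metrics store and the alert goes through Alerta or PagerDuty, not print):
import statistics

def p99(latencies_ms):
    # statistics.quantiles cuts the data into 100 buckets; the last cut point is P99.
    return statistics.quantiles(latencies_ms, n=100)[98]

recent = [120, 130, 95, 2100, 140, 110, 3050, 125]  # illustrative response times in ms
if p99(recent) > 500:  # pick an SLO that matches your product; 500 ms is just an example
    print("P99 breach on /checkout: page someone before Twitter does")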
Now go forth and monitor. The world (and your sleep schedule) will thank you. 🚀