Uptime, SLAs, and Monitoring: What Hosting Customers Should Measure

System Admin · November 15, 2020 · 6 min read

Uptime Is More Than a Number on a Sales Page

Every hosting provider advertises uptime — 99.9%, 99.95%, 99.99%. These numbers look impressive until you need to hold someone accountable for a four-hour outage that cost your business real money. The gap between advertised uptime and actual experience is where hosting customers get burned. Understanding SLAs, knowing what to monitor, and having the tools to measure independently puts you in control of the conversation.

This guide covers what uptime metrics actually mean, how SLAs work in practice, and which monitoring strategies help hosting customers detect problems before their visitors do.

What Uptime Percentages Actually Mean

Uptime is expressed as a percentage of total time in a given period (usually monthly). The math is straightforward, but the implications matter:

  • 99.0%: Up to 7.3 hours of downtime per month
  • 99.9%: Up to 43.8 minutes of downtime per month
  • 99.95%: Up to 21.9 minutes of downtime per month
  • 99.99%: Up to 4.4 minutes of downtime per month

The difference between 99.9% and 99.99% sounds trivial: less than a tenth of a percentage point. But in real minutes, it is the difference between roughly 44 minutes and 4 minutes of allowed downtime per month. For an e-commerce site processing thousands of orders daily, those 39 extra minutes can represent significant lost revenue.
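The figures in the list above can be reproduced with a few lines of arithmetic. This is a minimal sketch that assumes an average month of 730.5 hours (365.25 days divided by 12); a provider that measures against calendar months will get slightly different numbers. The function name is illustrative.

```python
# Average month length in hours (assumption: 365.25 days / 12 months).
HOURS_PER_MONTH = 365.25 * 24 / 12  # 730.5


def allowed_downtime_minutes(uptime_percent: float,
                             period_hours: float = HOURS_PER_MONTH) -> float:
    """Minutes of downtime permitted in one period at the given uptime target."""
    return period_hours * 60 * (1 - uptime_percent / 100)


for target in (99.0, 99.9, 99.95, 99.99):
    minutes = allowed_downtime_minutes(target)
    print(f"{target}%: {minutes:.1f} minutes ({minutes / 60:.2f} hours) per month")
```

Passing a different `period_hours` (for example 720 for a 30-day month, or 8766 for a year) shows how the same percentage translates into a different downtime budget.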

How SLAs Work (and Where They Fall Short)

A Service Level Agreement is a contractual commitment to a certain level of service — typically uptime. If the provider fails to meet the SLA, they owe you a remedy, usually in the form of service credits. Here is what most hosting customers do not realize: service credits are almost never equivalent to the actual business impact of downtime.

A typical SLA might offer a 10% credit on your monthly bill for each 0.1 percentage point (or part of one) by which measured uptime falls below the guaranteed threshold. If you pay a hundred dollars per month under a 99.9% guarantee and suffer an hour of unplanned downtime, uptime for the month drops to about 99.86% and you might get ten dollars back. If that hour cost your business thousands in lost sales, the credit is cold comfort.
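To see how little a credit schedule like the one above pays out, the calculation can be sketched as follows. Everything here is an assumption for illustration: the 99.9% guarantee, the rule that a partial 0.1-point step counts as a full step, and the cap at 100% of the bill. Real SLAs define their own tiers, so check the contract before relying on any of these numbers.

```python
import math


def sla_credit(monthly_fee: float,
               guaranteed_pct: float,
               measured_pct: float,
               credit_per_step: float = 0.10,  # 10% of the bill per step (assumed)
               step_pct: float = 0.1,          # one step = 0.1 percentage point
               max_credit: float = 1.0) -> float:
    """Credit owed under a hypothetical stepped SLA schedule."""
    shortfall = guaranteed_pct - measured_pct
    if shortfall <= 0:
        return 0.0  # the guarantee was met, no credit
    # Round partial steps up; the small epsilon absorbs float noise
    # so that an exact 0.1-point shortfall counts as one step, not two.
    steps = math.ceil(shortfall / step_pct - 1e-9)
    return monthly_fee * min(steps * credit_per_step, max_credit)


# One hour of downtime in an average 730.5-hour month.
measured = 100 * (1 - 1 / 730.5)
print(f"measured uptime: {measured:.3f}%")
print(f"credit on a $100 plan: ${sla_credit(100, 99.9, measured):.2f}")
```

Under these assumptions the hour of downtime yields a ten dollar credit, matching the example in the text.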

What to Look for in an SLA

  • Scope: Does the SLA cover only network uptime, or does it include the entire service stack (compute, storage, DNS)? A provider can claim 100% network uptime while your server's disk fails and takes your site offline.
  • Measurement method: How does the provider measure uptime? Do they use internal monitoring, or do you have to prove the outage? Providers that require customer-reported evidence make it harder to claim credits.
  • Exclusions: Scheduled maintenance windows, force majeure events, and customer-caused outages are typically excluded. Read the fine print to understand what does not count.
  • Remedy: Service credits are standard. Cash refunds are rare. Some premium SLAs offer financial penalties, but these are usually reserved for enterprise-grade contracts.

Independent Monitoring: Trust but Verify

Do not rely on your hosting provider to tell you when your site is down. Set up independent, external monitoring that checks your site from multiple geographic locations and alerts you the moment something goes wrong.

What to Monitor

  • HTTP availability: Can your site be reached over HTTPS? A simple HTTP check that verifies a 200 OK response is the most fundamental monitor. Check from multiple locations every 60 seconds.
  • Response time (latency): How long does it take for your server to respond? Track Time to First Byte (TTFB) over time. A gradual increase often signals resource exhaustion, database performance degradation, or caching issues — problems that precede a full outage.
  • SSL certificate expiration: Monitor certificate validity. An expired certificate causes browser warnings and effectively takes your site offline for security-conscious visitors.
  • DNS resolution: Verify that your domain resolves correctly. DNS failures are invisible from the server side — your server is running fine, but nobody can find it.
  • Port checks: Verify that critical ports (443, 22, database ports if applicable) are open and responding. A firewall misconfiguration or crashed service can close a port without triggering other alerts.
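The checks listed above can each be approximated with the Python standard library. The following is a rough sketch, not a replacement for a monitoring service: it runs from a single location, the function names are made up for this example, and the elapsed time in `check_http` is only an approximation of TTFB (it includes DNS lookup, connection setup, and the TLS handshake).

```python
import socket
import ssl
import time
import urllib.error
import urllib.request
from datetime import datetime, timezone


def check_http(url: str, timeout: float = 10.0) -> dict:
    """One GET request: status code plus approximate time to first byte."""
    start = time.monotonic()
    try:
        # urlopen returns once the response headers have arrived.
        # Note that it follows redirects, so a 301 that ends in 200 counts as ok.
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            status = resp.status
    except urllib.error.HTTPError as exc:      # 4xx/5xx responses
        status = exc.code
    except (urllib.error.URLError, OSError):   # DNS failure, refused, timeout
        status = None
    return {"ok": status == 200, "status": status,
            "seconds": time.monotonic() - start}


def resolve(host: str) -> list:
    """IP addresses the host name resolves to (empty list on failure)."""
    try:
        return sorted({info[4][0] for info in socket.getaddrinfo(host, None)})
    except socket.gaierror:
        return []


def check_port(host: str, port: int, timeout: float = 5.0) -> bool:
    """True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def fetch_not_after(host: str, port: int = 443, timeout: float = 10.0) -> str:
    """The certificate's 'notAfter' field. Raises an ssl.SSLError subclass
    if the certificate is already expired or otherwise fails verification."""
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]


def days_until_expiry(not_after: str, now: datetime) -> float:
    """Days between `now` (timezone-aware) and the certificate expiry."""
    expires = datetime.fromtimestamp(ssl.cert_time_to_seconds(not_after),
                                     tz=timezone.utc)
    return (expires - now).total_seconds() / 86400
```

A scheduler (cron, a systemd timer, or a loop with a 60-second sleep) could call `check_http("https://example.com/")` and `days_until_expiry(fetch_not_after("example.com"), datetime.now(timezone.utc))`, with `example.com` standing in for your own domain. Running the same script from several regions is what gives you the multi-location view described above.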

Latency Percentiles: Beyond Averages

Average response time is misleading. If 95% of your requests complete in 200 milliseconds but 5% take 5 seconds, the average is about 440 milliseconds, which looks acceptable even though 1 in 20 requests is painfully slow. Track percentiles, especially p95 and p99. The p95 value is the response time that 95% of requests complete within (p99 likewise for 99%), so these figures surface the slow tail that averages hide.
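The scenario above (95% of requests at 200 milliseconds, 5% at 5 seconds) is easy to check numerically. This sketch uses the nearest-rank definition of a percentile, which is one of several common definitions; monitoring tools may interpolate and report slightly different values.

```python
import math
from statistics import mean, median


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("percentile() needs at least one sample")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# 100 requests: 95 fast ones and 5 very slow ones (milliseconds).
latencies_ms = [200] * 95 + [5000] * 5

print("mean  :", mean(latencies_ms))
print("median:", median(latencies_ms))
print("p95   :", percentile(latencies_ms, 95))
print("p99   :", percentile(latencies_ms, 99))
```

With this exact split the mean is 440 ms and the median is 200 ms, while p99 is 5000 ms. The p95 value lands right on the boundary and still reports 200 ms, which is a good argument for tracking p99 alongside p95 rather than relying on a single percentile.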

Alerting: Getting the Right Notifications at the Right Time

Monitoring without alerting is just data collection. Configure alerts that notify you through channels you actually check — email, SMS, push notifications, or a team messaging platform. Set thresholds thoughtfully:

  • Site down: Immediate alert after two consecutive failed checks from different locations. This avoids false positives from transient network issues.
  • High latency: Alert when p95 response time exceeds your acceptable threshold (e.g., 2 seconds) for more than five minutes. Brief spikes are normal; sustained degradation is not.
  • Certificate expiration: Alert at 30 days, 14 days, and 7 days before expiration. If automated renewal is working, you should never see the 7-day alert.
  • Error rate: Alert when HTTP 5xx errors exceed a baseline percentage. A sudden spike in server errors usually means a deployment went wrong, a dependency failed, or resource limits were hit.
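The threshold rules above translate into small decision functions. The sketch below is one possible reading of them, with hypothetical names and data shapes: checks arrive as a time-ordered list, p95 latency is sampled periodically, and "sustained" means every sample in the most recent stretch of at least five minutes exceeded the threshold.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Check:
    timestamp: float  # seconds since the epoch
    location: str     # probe location, e.g. "fra" or "nyc"
    ok: bool


def site_down(checks: list) -> bool:
    """True when the two most recent checks both failed
    and were made from different locations."""
    if len(checks) < 2:
        return False
    previous, latest = checks[-2], checks[-1]
    return (not previous.ok and not latest.ok
            and previous.location != latest.location)


def latency_sustained(p95_samples: list,
                      threshold_s: float = 2.0,
                      window_s: float = 300.0) -> bool:
    """p95_samples holds (timestamp, p95_seconds) pairs, oldest first.
    True when the unbroken run of over-threshold samples ending at the
    latest sample spans at least window_s seconds."""
    breach_start = None
    for timestamp, p95 in reversed(p95_samples):
        if p95 <= threshold_s:
            break  # the run of breaches ends here
        breach_start = timestamp
    if breach_start is None:
        return False
    return p95_samples[-1][0] - breach_start >= window_s


def cert_alerts_due(days_left: float, thresholds=(30, 14, 7)) -> list:
    """Which expiry thresholds (in days) have been reached."""
    return [t for t in thresholds if days_left <= t]
```

For example, `site_down([Check(0, "fra", False), Check(60, "nyc", False)])` returns `True`, while two failures from the same location do not trigger, since they may reflect a problem at the probe rather than at your site. An error-rate rule would follow the same pattern as `latency_sustained`, with the 5xx percentage in place of p95.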

Building a Monitoring Dashboard

A monitoring dashboard gives you a single view of your site's health. The essential panels include current status (up or down), response time trend over the past 24 hours, uptime percentage for the current month, and a log of recent incidents. Most monitoring services provide hosted dashboards. If you want to self-host, tools like Grafana combined with Prometheus or a similar metrics backend give you full control.

A public status page is also worth considering. Publishing your uptime data builds trust with customers and gives you a communication channel during incidents. Many monitoring services include a hosted status page feature.

Incident Tracking and Postmortems

Every outage or significant degradation event should be logged: when it started, when it was detected, what caused it, how it was resolved, and how long it lasted. Over time, this log reveals patterns — recurring issues with a specific service, degradation that coincides with traffic peaks, or problems tied to a particular maintenance window.
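An incident log with the fields described above can be kept in something as simple as a list of records. The sketch below uses hypothetical field names and assumes incidents do not overlap; it derives time to detect, duration, and the uptime percentage for a period, which is the same figure a dashboard or an SLA claim would use.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    started: datetime    # when the problem began
    detected: datetime   # when monitoring or a person noticed it
    resolved: datetime   # when service was restored
    cause: str
    resolution: str

    @property
    def duration(self) -> timedelta:
        return self.resolved - self.started

    @property
    def time_to_detect(self) -> timedelta:
        return self.detected - self.started


def uptime_percent(incidents, period_start: datetime,
                   period_end: datetime) -> float:
    """Uptime over a period, counting only downtime inside the period.
    Assumes incidents do not overlap one another."""
    downtime = timedelta()
    for incident in incidents:
        start = max(incident.started, period_start)
        end = min(incident.resolved, period_end)
        if end > start:
            downtime += end - start
    return 100.0 * (1 - downtime / (period_end - period_start))


# Illustrative data: a single four-hour outage in a 30-day month.
log = [Incident(started=datetime(2020, 11, 10, 2, 0),
                detected=datetime(2020, 11, 10, 2, 3),
                resolved=datetime(2020, 11, 10, 6, 0),
                cause="disk failure (example)",
                resolution="restored from backup (example)")]

month = (datetime(2020, 11, 1), datetime(2020, 12, 1))
print(f"uptime: {uptime_percent(log, *month):.3f}%")
print("time to detect:", log[0].time_to_detect)
```

A single four-hour outage in a 30-day month works out to roughly 99.44% uptime, well below a 99.9% guarantee. Having the start and end times recorded is what lets you make that case to a provider.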

For significant incidents, write a brief postmortem that covers what happened, why it happened, what the impact was, and what changes will prevent recurrence. This is not about blame; it is about learning and improving. Teams that practice blameless postmortems tend to improve their reliability faster than teams that skip them, because each incident leaves behind a concrete list of fixes.

How Monitoring Improves Your Hosting Relationship

When you have independent monitoring data, conversations with your hosting provider become productive rather than adversarial. You can show exactly when an outage started, how long it lasted, and what the impact was. This data supports SLA credit claims, informs upgrade decisions, and helps you evaluate whether your current provider is meeting your needs.

Monitoring also helps you distinguish between hosting problems and application problems. If your monitoring shows the server is reachable and responding quickly but users report errors, the issue is likely in your application code — not the hosting infrastructure.

Getting Started

You do not need an expensive enterprise monitoring stack. Free-tier plans from reputable monitoring services provide HTTP checks, basic alerting, and uptime reporting that cover the essentials. Start with HTTP availability and response time monitoring from at least three geographic locations. Add SSL and DNS monitoring. Set up alerts for downtime and sustained latency spikes. Review your data weekly and refine thresholds based on your site's normal behavior.

Monitoring is not a set-and-forget tool — it is an ongoing practice. As your site grows and your infrastructure evolves, your monitoring should grow with it. The investment is small. The visibility it provides is invaluable.
