CloudWatch

1. Why This Service Exists (The Real Problem)

The Problem: You have 50 servers, 3 databases, and 10 load balancers.

  • Visibility Black Hole: How do you know if Server #42 is spiking in CPU?
  • Log Fragmentation: Logs are scattered across local /var/log/syslog on 50 different machines. SSH-ing into each one to grep for errors is impossible.
  • Reactive Failure: You only know the site is down when customers complain on Twitter.

The Solution: A centralized repository for all metrics (graphs) and logs (text) that can trigger actions (alarms).

2. Mental Model (Antigravity View)

The Analogy: The Dashboard of a Car + The Black Box Flight Recorder.

  • Metrics: Speedometer/Fuel Gauge (numbers over time: CPU %, Memory %, Latency).
  • Logs: The Flight Recorder (text streams: "Error: DB connection failed").
  • Alarms: The "Check Engine" Light (triggers when a number crosses a line).

One-Sentence Definition: CloudWatch is a time-series database for metrics plus searchable storage for text logs.

3. Core Components (No Marketing)

  1. Metrics: A time-ordered series of data points; each point is (Timestamp, Value, Unit). E.g., (12:00, 80, %).
  2. Namespaces & Dimensions: Folders and tags for organizing metrics.
    • Namespace: AWS/EC2
    • Dimension: InstanceId=i-12345
  3. Logs Groups & Log Streams:
    • Group: The application (e.g., production-transcoder).
    • Stream: The specific instance source (e.g., i-12345-apache-access-log).
  4. Alarms: An IF statement. IF CPU > 80% for 5 minutes THEN Send Email.
  5. EventBridge (formerly CloudWatch Events): The "Cron Job" and "Event Bus" of AWS. Trigger logic based on system changes.
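
These components map directly onto API calls. A minimal boto3 sketch, with placeholder names (the MyApp/Backend namespace, instance ID i-12345, and the SNS topic ARN are all assumptions), that publishes one custom data point and creates the "IF CPU > 80% for 5 minutes" alarm:

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

# One data point: (Timestamp is implied "now", Value, Unit),
# filed under a Namespace and tagged with a Dimension.
cloudwatch.put_metric_data(
    Namespace="MyApp/Backend",                                    # hypothetical namespace
    MetricData=[{
        "MetricName": "MemoryUtilization",
        "Dimensions": [{"Name": "InstanceId", "Value": "i-12345"}],
        "Value": 80.0,
        "Unit": "Percent",
    }],
)

# The alarm: IF CPU > 80% for 5 minutes THEN notify an SNS topic.
cloudwatch.put_metric_alarm(
    AlarmName="HighCPU-i-12345",
    Namespace="AWS/EC2",
    MetricName="CPUUtilization",
    Dimensions=[{"Name": "InstanceId", "Value": "i-12345"}],
    Statistic="Average",
    Period=300,                  # one 5-minute evaluation window
    EvaluationPeriods=1,
    Threshold=80.0,
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:Prod-Critical-Alerts"],  # placeholder ARN
)
```

The "Send Email" part comes from subscribing an email address to that SNS topic; the alarm itself only publishes to the topic.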

4. How It Works Internally (Simplified)

  1. Data Ingestion:
    • By Default: AWS services push standard metrics (hypervisor-level) to the CloudWatch API.
    • Custom: You install the CloudWatch Agent inside your OS to push internal metrics (Memory, Disk Space) and Logs.
  2. Storage:
    • Metrics are stored with tiered retention: data points are rolled up to coarser resolutions as they age (1-second data is kept for 3 hours, 1-minute for 15 days, 5-minute for 63 days, 1-hour for 15 months).
    • Logs are stored indefinitely (unless you set a retention policy).
  3. Action: Alarms constantly query the time-series DB. If the threshold is breached, the alarm publishes a message to an SNS Topic or triggers Auto Scaling.
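
Under the hood, "pushing logs" is just the CloudWatch Logs API. The agent does this for you; the hand-rolled sketch below (group/stream names reused from section 3, error handling for already-existing resources omitted) is only to make the Group/Stream data model concrete:

```python
import time
import boto3

logs = boto3.client("logs")

group = "production-transcoder"            # the application (Log Group)
stream = "i-12345-apache-access-log"       # the specific source (Log Stream)

# Create the container hierarchy (raises ResourceAlreadyExistsException if present).
logs.create_log_group(logGroupName=group)
logs.create_log_stream(logGroupName=group, logStreamName=stream)

# Ship one log event into the stream.
logs.put_log_events(
    logGroupName=group,
    logStreamName=stream,
    logEvents=[{
        "timestamp": int(time.time() * 1000),   # milliseconds since epoch
        "message": "Error: DB connection failed",
    }],
)
```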

5. Common Production Use Cases

  • Infrastructure Monitoring: "Is the CPU high?"
  • Application Monitoring: "How many 500 errors per minute?" (Custom Metric; see the sketch after this list).
  • Log Aggregation: Centralizing logs from ephemeral containers/instances so they aren't lost when the instance dies.
  • Incident Response: Triggering a PagerDuty alert via SNS when the DB latency spikes.
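
For the "how many 500 errors per minute?" question, assuming the application already publishes a hypothetical Http5xxCount custom metric, a per-minute sum over the last hour can be pulled back like this:

```python
from datetime import datetime, timedelta, timezone
import boto3

cloudwatch = boto3.client("cloudwatch")

# Sum the (hypothetical) Http5xxCount custom metric in 1-minute buckets.
resp = cloudwatch.get_metric_statistics(
    Namespace="MyApp/Backend",               # hypothetical namespace
    MetricName="Http5xxCount",               # hypothetical custom metric
    StartTime=datetime.now(timezone.utc) - timedelta(hours=1),
    EndTime=datetime.now(timezone.utc),
    Period=60,                               # one data point per minute
    Statistics=["Sum"],
)

for point in sorted(resp["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], point["Sum"])
```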

6. Architecture Patterns

The "Standard" Observability Pipeline

  1. Source: EC2s / Containers running CloudWatch Agent.
  2. Destination: CloudWatch Log Group (/app/prod/backend).
  3. Filter: A Metric Filter attached to the Log Group scans incoming log events.
    • Pattern: [ERROR]
    • Action: Increment Metric ErrorCount.
  4. Alarm: If ErrorCount > 5 in 1 minute.
  5. Action: Notify SNS Topic Prod-Critical-Alerts -> PagerDuty/Slack.
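
Steps 3 and 4 of this pipeline are one API call each. A sketch assuming the /app/prod/backend group already exists and an SNS topic called Prod-Critical-Alerts is already wired to PagerDuty/Slack:

```python
import boto3

logs = boto3.client("logs")
cloudwatch = boto3.client("cloudwatch")

# Step 3: Metric Filter – every log event containing the term ERROR
# increments a custom ErrorCount metric.
logs.put_metric_filter(
    logGroupName="/app/prod/backend",
    filterName="error-count",
    filterPattern="ERROR",                    # simple term match; adjust for structured logs
    metricTransformations=[{
        "metricName": "ErrorCount",
        "metricNamespace": "MyApp/Backend",   # hypothetical namespace
        "metricValue": "1",
    }],
)

# Step 4: Alarm – more than 5 errors within one minute pages the on-call.
cloudwatch.put_metric_alarm(
    AlarmName="BackendErrorSpike",
    Namespace="MyApp/Backend",
    MetricName="ErrorCount",
    Statistic="Sum",
    Period=60,
    EvaluationPeriods=1,
    Threshold=5,
    ComparisonOperator="GreaterThanThreshold",
    TreatMissingData="notBreaching",          # no errors logged => no alarm
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:Prod-Critical-Alerts"],  # placeholder ARN
)
```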

Unified Agent Layout

  • Distribute a single CloudWatch Agent configuration via Systems Manager (SSM) Parameter Store to ensure consistent logging across 1000s of instances, as sketched below.
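
A sketch of that layout: store one agent config as a JSON document in Parameter Store (the parameter name and log paths here are assumptions), then every instance fetches it at boot.

```python
import json
import boto3

ssm = boto3.client("ssm")

# One shared agent config: push memory/disk metrics and collect app logs.
agent_config = {
    "metrics": {
        "metrics_collected": {
            "mem": {"measurement": ["mem_used_percent"]},
            "disk": {"measurement": ["used_percent"], "resources": ["/"]},
        }
    },
    "logs": {
        "logs_collected": {
            "files": {
                "collect_list": [{
                    "file_path": "/var/log/app/*.log",       # hypothetical app log path
                    "log_group_name": "/app/prod/backend",
                }]
            }
        }
    },
}

ssm.put_parameter(
    Name="AmazonCloudWatch-prod-agent-config",   # hypothetical parameter name
    Type="String",
    Value=json.dumps(agent_config),
    Overwrite=True,
)
```

Each instance then loads it with something like amazon-cloudwatch-agent-ctl -a fetch-config -m ec2 -c ssm:AmazonCloudWatch-prod-agent-config -s, typically from user data or an SSM Run Command document.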

7. IAM & Security Model

  • CloudWatch Agent Needs Permission: The IAM Role attached to the EC2 must have CloudWatchAgentServerPolicy. Without this, the agent runs but fails to upload data (Silent Failure).
  • Cross-Account Observability: You can configure a "Monitoring Account" that reads metrics/logs from "Prod" and "Staging" accounts for a single pane of glass.
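
A minimal sketch of the first point: attaching the AWS-managed CloudWatchAgentServerPolicy to a (hypothetical) instance role so the agent can actually upload data.

```python
import boto3

iam = boto3.client("iam")

# Without this managed policy, the agent starts but silently fails to
# upload metrics and logs.
iam.attach_role_policy(
    RoleName="prod-backend-instance-role",   # hypothetical role behind the EC2 instance profile
    PolicyArn="arn:aws:iam::aws:policy/CloudWatchAgentServerPolicy",
)
```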

8. Cost Model (Very Important)

This service is a hidden cost assassin.

  1. PutMetricData: Every time you push a custom metric, you pay. High resolution (1-sec) is expensive.
  2. Log Ingestion: Paying to send the logs.
  3. Log Storage: Paying to keep the logs. Archival is cheap; ingestion is expensive.
  4. Alarms: A small fee per alarm.

Cost Optimizations:

  • Turn on Retention Policies: The default is "Never Expire". Set it to 30 days for debug logs (see the sketch after this list).
  • Don't Log "Info" in Production: Too much volume. Log "Error" and "Warn" only.
  • Use VPC Endpoints: Avoid NAT Gateway data-processing charges when sending logs to CloudWatch.
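
The retention fix is a single call per Log Group (30 days here, matching the debug-log suggestion above):

```python
import boto3

logs = boto3.client("logs")

# Default retention is "Never Expire"; cap this group at 30 days.
logs.put_retention_policy(
    logGroupName="/app/prod/backend",
    retentionInDays=30,
)
```

In practice you loop this over logs.describe_log_groups() so newly created groups don't slip back to "Never Expire".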

9. Common Mistakes & Anti-Patterns

  • "Where is my RAM Usage?": AWS cannot see inside your OS. CPU is visible (Hypervisor), but RAM is private. You MUST install the CloudWatch Agent to see Memory usage.
  • Infinite Logs: Forgetting to set retention periods on Log Groups.
  • Alarm Fatigue: Setting alarms on everything. Only alarm on Actionable failures (e.g., "Site Down", not "CPU high but app fine").
  • Using it as Analytics: CloudWatch Logs Insights is great for debugging (see the sketch after this list), but too slow/expensive for Business Analytics. Use OpenSearch or Athena for that.
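
For reference, a typical debugging-sized Logs Insights query via the API looks like this (queries are asynchronous: start, then poll); the log group and query are assumptions matching the earlier pipeline:

```python
import time
import boto3

logs = boto3.client("logs")

# Last hour of ERROR lines from one log group – a debugging-sized query.
query = logs.start_query(
    logGroupName="/app/prod/backend",
    startTime=int(time.time()) - 3600,
    endTime=int(time.time()),
    queryString="fields @timestamp, @message "
                "| filter @message like /ERROR/ "
                "| sort @timestamp desc | limit 20",
)

# Insights queries run asynchronously; poll until finished.
while True:
    result = logs.get_query_results(queryId=query["queryId"])
    if result["status"] in ("Complete", "Failed", "Cancelled"):
        break
    time.sleep(1)

for row in result["results"]:
    print({field["field"]: field["value"] for field in row})
```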

10. When NOT to Use This Service

  • Complex Tracing: Use AWS X-Ray (Distributed tracing).
  • High Cardinality Metrics: If you need to tag metrics with UserID, CloudWatch gets expensive and slow. Use Prometheus/Grafana.
  • Business Intelligence: Don't calculate "Revenue per day" here. Use a DB or Data Warehouse.

11. Interview-Level Summary

  • Resolution: Standard (1 min) vs High Resolution (1 sec).
  • Memory Metrics: Not available by default; requires Agent.
  • Logs: Organized into Groups (App) and Streams (Instance).
  • Retention: Must be configured or you pay forever.
  • Metric Filters: The way to turn "Text Logs" into "Graphable Numbers".