CloudWatch
1. Why This Service Exists (The Real Problem)
The Problem: You have 50 servers, 3 databases, and 10 load balancers.
- Visibility Black Hole: How do you know if Server #42 is spiking in CPU?
- Log Fragmentation: Logs are scattered across local /var/log/syslog on 50 different machines. SSH-ing into each one to grep for errors does not scale.
- Reactive Failure: You only know the site is down when customers complain on Twitter.
The Solution: A centralized repository for all metrics (graphs) and logs (text) that can trigger actions (alarms).
2. Mental Model (Antigravity View)
The Analogy: The Dashboard of a Car + The Black Box Flight Recorder.
- Metrics: Speedometer/Fuel Gauge (Numbers over time: CPU %, Memory %, Latency).
- Logs: The Flight Recorder (Text streams: "Error: DB connection failed").
- Alarms: The "Check Engine" Light (Triggers when a number crosses a line).
One-Sentence Definition: CloudWatch is a time-series database for metrics and a searchable store for text logs, with alarms that act on both.
3. Core Components (No Marketing)
- Metrics: A time-ordered series of data points. Each data point is (Timestamp, Value, Unit). E.g., (12:00, 80, %).
- Namespaces & Dimensions: Folders and tags for organizing metrics.
- Namespace: AWS/EC2
- Dimension: InstanceId=i-12345
- Log Groups & Log Streams:
- Group: The application (e.g., production-transcoder).
- Stream: The specific instance source (e.g., i-12345-apache-access-log).
- Alarms: An IF statement. IF CPU > 80% for 5 minutes THEN Send Email.
- EventBridge (formerly CloudWatch Events): The "Cron Job" and "Event Bus" of AWS. Trigger logic based on system changes.
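To make the data model concrete, here is a minimal Python sketch of how a namespace, dimensions, and one data point fit together. The dictionary layout is modeled on the shape of a PutMetricData request as I understand it; the MetricDatum class, the build_request helper, and the MyApp/Backend namespace are illustrative assumptions, not part of any AWS SDK.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class MetricDatum:
    """One data point (timestamp, value, unit) plus the tags that identify its metric."""
    name: str
    value: float
    unit: str
    timestamp: datetime
    dimensions: dict = field(default_factory=dict)

    def to_request_entry(self) -> dict:
        # Layout modeled on one "MetricData" entry of a PutMetricData request.
        return {
            "MetricName": self.name,
            "Timestamp": self.timestamp,
            "Value": self.value,
            "Unit": self.unit,
            "Dimensions": [
                {"Name": key, "Value": val}
                for key, val in sorted(self.dimensions.items())
            ],
        }


def build_request(namespace: str, data: list) -> dict:
    """Group data points under one namespace (the "folder")."""
    return {
        "Namespace": namespace,
        "MetricData": [datum.to_request_entry() for datum in data],
    }


point = MetricDatum(
    name="CPUUtilization",
    value=80.0,
    unit="Percent",
    timestamp=datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc),
    dimensions={"InstanceId": "i-12345"},
)
# "MyApp/Backend" is a hypothetical custom namespace.
request = build_request("MyApp/Backend", [point])
print(request)
```

The namespace and the dimension set together identify which metric a data point belongs to, which is why the same metric name under a different InstanceId is a different time series.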
4. How It Works Internally (Simplified)
- Data Ingestion:
- By Default: AWS services push standard metrics (Hypervisor level) to CloudWatch API.
- Custom: You install the CloudWatch Agent inside your OS to push internal metrics (Memory, Disk Space) and Logs.
- Storage:
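- Metrics are stored with tiered retention: as data ages, it is only available at coarser resolution (sub-minute data points for 3 hours, 1-minute for 15 days, 5-minute for 63 days, 1-hour for 455 days).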
- Logs are stored indefinitely (unless you set a retention policy).
- Action: Alarms continuously evaluate the time-series DB. When a threshold is breached, the alarm changes state and publishes a message to an SNS topic or triggers an Auto Scaling action.
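The alarm step can be pictured as a sliding window over recent data points. The sketch below is a simplified local simulation, not CloudWatch's actual evaluator: the Alarm class and the notify callback are hypothetical names. It assumes one data point per minute and fires only when every point in the window breaches the threshold.

```python
from collections import deque


class Alarm:
    """Simplified alarm: ALARM when the last N datapoints all exceed the threshold."""

    def __init__(self, threshold, evaluation_periods, notify):
        self.threshold = threshold
        self.window = deque(maxlen=evaluation_periods)
        self.notify = notify  # stands in for publishing to an SNS topic
        self.state = "INSUFFICIENT_DATA"

    def observe(self, value):
        self.window.append(value)
        if len(self.window) < self.window.maxlen:
            new_state = "INSUFFICIENT_DATA"
        elif all(v > self.threshold for v in self.window):
            new_state = "ALARM"
        else:
            new_state = "OK"
        # Notify only on the transition into ALARM, not on every evaluation.
        if new_state != self.state:
            self.state = new_state
            if new_state == "ALARM":
                self.notify(f"threshold {self.threshold} breached: {list(self.window)}")
        return self.state


sent = []
# IF CPU > 80% for 5 minutes (five 1-minute datapoints) THEN notify.
alarm = Alarm(threshold=80, evaluation_periods=5, notify=sent.append)
cpu_per_minute = [50, 85, 90, 95, 88, 91, 40]
states = [alarm.observe(cpu) for cpu in cpu_per_minute]
print(states)
print(sent)
```

Notifying on the state transition rather than on every evaluation is what keeps a sustained breach from producing a message every minute.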
5. Common Production Use Cases
- Infrastructure Monitoring: "Is the CPU high?"
- Application Monitoring: "How many 500 errors per minute?" (Custom Metric).
- Log Aggregation: Centralizing logs from ephemeral containers/instances so they aren't lost when the instance dies.
- Incident Response: Triggering a PagerDuty alert via SNS when the DB latency spikes.
6. Architecture Patterns
The "Standard" Observability Pipeline
- Source: EC2s / Containers running CloudWatch Agent.
- Dest: CloudWatch Log Group (/app/prod/backend).
- Filter: Metric Filter acts on the Log stream.
- Pattern: [ERROR]
- Action: Increment Metric ErrorCount.
- Alarm: If ErrorCount > 5 in 1 minute.
- Action: Notify SNS Topic Prod-Critical-Alerts -> PagerDuty/Slack.
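The filter and alarm stages of this pipeline can be simulated locally. This is a rough sketch under stated assumptions: log lines start with an ISO timestamp, and matching is a plain substring test, which is much simpler than the real filter pattern syntax. The function names are hypothetical.

```python
from collections import Counter
from datetime import datetime


def error_count_per_minute(log_lines, pattern="[ERROR]"):
    """Metric filter stand-in: add 1 to ErrorCount per matching line, bucketed by minute."""
    counts = Counter()
    for line in log_lines:
        timestamp_text, _, message = line.partition(" ")
        if pattern in message:
            minute = datetime.fromisoformat(timestamp_text).replace(second=0, microsecond=0)
            counts[minute] += 1
    return counts


def breaching_minutes(counts, threshold=5):
    """Alarm stand-in: minutes where ErrorCount > threshold."""
    return sorted(minute for minute, count in counts.items() if count > threshold)


# Six errors in 12:00, two errors and one info line in 12:01.
log_lines = [
    f"2024-01-01T12:00:{second:02d} [ERROR] DB connection failed"
    for second in range(0, 60, 10)
] + [
    "2024-01-01T12:01:00 [INFO] request ok",
    "2024-01-01T12:01:30 [ERROR] timeout",
    "2024-01-01T12:01:45 [ERROR] timeout",
]

counts = error_count_per_minute(log_lines)
breaching = breaching_minutes(counts, threshold=5)
print(breaching)
```

Only the 12:00 bucket exceeds the threshold, so only that minute would page someone.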
Unified Agent Layout
- Store one CloudWatch Agent config in Systems Manager (SSM) Parameter Store and have every instance load it, to ensure consistent logging across 1000s of instances.
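A shared agent config is just a JSON document, so it can be generated and reviewed like any other artifact before it is stored in Parameter Store. The sketch below builds one in Python. The key names follow the CloudWatch Agent configuration schema as I understand it and should be checked against the current agent documentation; the file path, log group name, and the build_agent_config helper are hypothetical.

```python
import json


def build_agent_config(log_file, log_group, retention_days=30):
    """Return an agent config that collects memory, disk, and one log file."""
    return {
        "agent": {"metrics_collection_interval": 60},
        "metrics": {
            "namespace": "CWAgent",
            "metrics_collected": {
                # OS-level metrics the hypervisor cannot see.
                "mem": {"measurement": ["mem_used_percent"]},
                "disk": {"measurement": ["used_percent"], "resources": ["/"]},
            },
        },
        "logs": {
            "logs_collected": {
                "files": {
                    "collect_list": [
                        {
                            "file_path": log_file,
                            "log_group_name": log_group,
                            # One stream per instance, filled in by the agent.
                            "log_stream_name": "{instance_id}",
                            "retention_in_days": retention_days,
                        }
                    ]
                }
            }
        },
    }


# Hypothetical path and group name for illustration.
config = build_agent_config("/var/log/app/backend.log", "/app/prod/backend")
config_json = json.dumps(config, indent=2)
print(config_json)
```

The resulting JSON string is what would be saved as the parameter value, so every instance that loads it collects the same metrics and ships logs to the same group.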
7. IAM & Security Model
- CloudWatch Agent Needs Permission: The IAM Role attached to the EC2 instance must have the CloudWatchAgentServerPolicy managed policy. Without this, the agent runs but fails to upload data (Silent Failure).
- Cross-Account Observability: You can configure a "Monitoring Account" that reads metrics/logs from "Prod" and "Staging" accounts for a single pane of glass.
8. Cost Model (Very Important)
This service is a hidden cost assassin.
1. PutMetricData: Every time you push a custom metric, you pay. High resolution (1-sec) is expensive.
2. Log Ingestion: Paying to send the logs.
3. Log Storage: Paying to keep the logs. Archival is cheap, ingestion is expensive.
4. Alarms: Paying a small fee per alarm.
Cost Optimizations:
- Turn on Retention Policies: Default is "Never Expire". Set it to 30 days for debug logs.
- Don't Log "Info" in Production: Too much volume. Log "Error" and "Warn" only.
- Use VPC Endpoints: To avoid NAT Gateway data processing charges when sending logs to CloudWatch.
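A back-of-the-envelope calculation shows why ingestion volume tends to dominate the bill. The sketch below is a rough model only: it ignores the free tier, compression, and API request charges, and the rates in the example are placeholder numbers chosen for easy arithmetic, not real AWS prices. Substitute values from the current price list for your region.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rates:
    """Per-unit prices in dollars. Supply real values from the AWS price list."""
    per_gb_ingested: float
    per_gb_stored_month: float
    per_custom_metric_month: float
    per_alarm_month: float


def monthly_cost(gb_per_day, retention_days, custom_metrics, alarms, rates):
    """Rough steady-state monthly bill, broken down by cost driver."""
    parts = {
        "ingestion": gb_per_day * 30 * rates.per_gb_ingested,
        # At steady state, roughly retention_days worth of logs sit in storage.
        "storage": gb_per_day * retention_days * rates.per_gb_stored_month,
        "metrics": custom_metrics * rates.per_custom_metric_month,
        "alarms": alarms * rates.per_alarm_month,
    }
    parts["total"] = sum(parts.values())
    return {name: round(amount, 2) for name, amount in parts.items()}


# Placeholder rates for illustration only; these are NOT real AWS prices.
example_rates = Rates(
    per_gb_ingested=0.50,
    per_gb_stored_month=0.05,
    per_custom_metric_month=0.25,
    per_alarm_month=0.10,
)

with_info_logs = monthly_cost(10, 30, 100, 20, example_rates)  # 10 GB/day
errors_only = monthly_cost(2, 30, 100, 20, example_rates)      # 2 GB/day
print(with_info_logs)
print(errors_only)
```

With these placeholder rates, cutting log volume from 10 GB/day to 2 GB/day shrinks the ingestion line far more than any change to retention would, which matches the advice above about not logging "Info" in production.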
9. Common Mistakes & Anti-Patterns
- "Where is my RAM Usage?": AWS cannot see inside your OS. CPU is visible (Hypervisor), but RAM is private. You MUST install the CloudWatch Agent to see Memory usage.
- Infinite Logs: Forgetting to set retention periods on Log Groups.
- Alarm Fatigue: Setting alarms on everything. Only alarm on Actionable failures (e.g., "Site Down", not "CPU high but app fine").
- Using as Analytics: CloudWatch Logs Insights is great for debugging, but too slow/expensive for Business Analytics. Use OpenSearch or Athena for that.
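The "Infinite Logs" mistake is easy to audit in code. The sketch below walks all log groups and sets a retention policy on any group that has none. It assumes the client behaves like boto3's CloudWatch Logs client (describe_log_groups and put_retention_policy with these keyword arguments); the enforce_retention helper is hypothetical, and FakeLogsClient is only a stand-in so the example runs without AWS access.

```python
def enforce_retention(logs_client, retention_days=30):
    """Set retention on every log group that has none ("Never Expire")."""
    updated = []
    kwargs = {}
    while True:
        page = logs_client.describe_log_groups(**kwargs)
        for group in page.get("logGroups", []):
            # Groups without a policy have no "retentionInDays" key.
            if "retentionInDays" not in group:
                logs_client.put_retention_policy(
                    logGroupName=group["logGroupName"],
                    retentionInDays=retention_days,
                )
                updated.append(group["logGroupName"])
        token = page.get("nextToken")
        if not token:
            return updated
        kwargs = {"nextToken": token}


class FakeLogsClient:
    """In-memory stand-in for a CloudWatch Logs client (illustration only)."""

    def __init__(self, groups):
        self.groups = groups

    def describe_log_groups(self, **kwargs):
        return {"logGroups": self.groups}

    def put_retention_policy(self, logGroupName, retentionInDays):
        for group in self.groups:
            if group["logGroupName"] == logGroupName:
                group["retentionInDays"] = retentionInDays


client = FakeLogsClient([
    {"logGroupName": "/app/prod/backend"},                        # never expires
    {"logGroupName": "/app/prod/audit", "retentionInDays": 365},  # already set
])
changed = enforce_retention(client, retention_days=30)
print(changed)
```

Groups that already have a policy are left alone, so a longer retention chosen deliberately (for audit logs, say) is not overwritten.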
10. When NOT to Use This Service
- Complex Tracing: Use AWS X-Ray (Distributed tracing).
- High Cardinality Metrics: If you need to tag metrics with UserID, CloudWatch gets expensive and slow. Use Prometheus/Grafana.
- Business Intelligence: Don't calculate "Revenue per day" here. Use a DB or Data Warehouse.
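The high-cardinality warning comes down to simple multiplication: every unique combination of dimension values is a separate metric. The sketch below computes the upper bound on distinct metrics for one metric name. The dimension names and value counts are made-up examples, and distinct_metric_count is a hypothetical helper.

```python
from math import prod


def distinct_metric_count(dimension_values):
    """Upper bound on distinct metrics: the product of each dimension's value count."""
    return prod(len(values) for values in dimension_values.values())


# Made-up example: 20 endpoints x 5 status codes.
low_cardinality = {
    "Endpoint": [f"/api/v1/resource{i}" for i in range(20)],
    "StatusCode": ["200", "400", "404", "500", "503"],
}
# Same metric, now also tagged with one of 10,000 user IDs.
high_cardinality = dict(low_cardinality, UserID=range(10_000))

low = distinct_metric_count(low_cardinality)
high = distinct_metric_count(high_cardinality)
print(low, high)
```

Adding a single UserID dimension multiplies the metric count by the number of users, and each of those metrics is billed and queried separately.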
11. Interview-Level Summary
- Resolution: Standard (1 min) vs High Resolution (1 sec).
- Memory Metrics: Not available by default; requires Agent.
- Logs: Organized into Groups (App) and Streams (Instance).
- Retention: Must be configured or you pay forever.
- Metric Filters: The way to turn "Text Logs" into "Graphable Numbers".