AWS Health Dashboard

1. Why This Service Exists (The Real Problem)

The Problem: Your application is crashing, but your monitoring (CloudWatch/New Relic) says your code is fine.
- Is it me or is it AWS? You spend 4 hours debugging your own code, only to find out AWS had an ELB outage in us-east-1.
- Silent Retirements: AWS needs to retire the physical host running your critical database next Tuesday. If you miss the email, your DB shuts down unexpectedly.

The Solution: A direct feed of operational data from AWS to you about the status of AWS services and your specific resources.

2. Mental Model (Antigravity View)

The Analogy: The Airport Flight Status Board.
- Global View: "All flights to Chicago are delayed" (Service Health - AWS-wide issues).
- Personal View: "Your specific flight UA123 is cancelled" (Your Account Health - Issues affecting your EC2s/RDS).

One-Sentence Definition: The official status page for AWS global services and a notification system for scheduled maintenance on your resources.

3. Core Components (No Marketing)

Service Health (Public): The general status of all AWS services across all regions. (e.g., "EC2 in us-east-1 is degraded").
Your Account Health (Private): Events specific to your resources.
- Open Issues: Current outages affecting you.
- Scheduled Changes: Upcoming maintenance (e.g., RDS hardware upgrade).
- Other Notifications: E.g., verifying a domain, updated legal terms.
EventBridge Integration: The mechanism to pipe these alerts into your Ops channels (Slack/PagerDuty).
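
To make the three Your Account Health views concrete, here is a minimal sketch that groups event records by their eventTypeCategory field. The category values (issue, scheduledChange, accountNotification) are the ones AWS Health uses; the function name, the "Uncategorized" fallback, and the sample events in the usage comment are illustrative assumptions, and the real console also takes event status into account.

```python
from collections import defaultdict

# Which dashboard view each AWS Health event category lands in.
VIEW_BY_CATEGORY = {
    "issue": "Open Issues",
    "scheduledChange": "Scheduled Changes",
    "accountNotification": "Other Notifications",
}


def group_by_view(events):
    """Group event type codes by the dashboard view of their category."""
    grouped = defaultdict(list)
    for event in events:
        # Unknown categories go to a fallback bucket (an assumption of this sketch).
        view = VIEW_BY_CATEGORY.get(event.get("eventTypeCategory"), "Uncategorized")
        grouped[view].append(event["eventTypeCode"])
    return dict(grouped)


# Usage with made-up sample events:
# group_by_view([
#     {"eventTypeCategory": "issue", "eventTypeCode": "AWS_EC2_OPERATIONAL_ISSUE"},
#     {"eventTypeCategory": "scheduledChange", "eventTypeCode": "AWS_RDS_MAINTENANCE_SCHEDULED"},
# ])
```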

4. How It Works Internally (Simplified)

Detection: AWS internal monitoring detects a failure in a service plane (e.g., DynamoDB Control Plane in eu-west-1).
Publication: AWS Engineers post a status update to the Health API.
Routing:
- If it's a global issue -> Published to Service Health.
- If it affects specific hardware -> Maps the hardware ID to your Account ID and publishes to Your Account Health.
Delivery: You view it in Console or receive it via EventBridge.

5. Common Production Use Cases

Outage Triage: The first place to check when "everything is broken". Rule #1 of Incident Response: Check AWS Health.
Maintenance Planning: Knowing that i-12345 will be retired next week so you can migrate it during your maintenance window, not theirs.
Security Alerts: Notification of exposed access keys or abusive behavior detection (from AWS Abuse team).
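
For maintenance planning, a minimal sketch of pulling upcoming scheduled changes through the Health API might look like the following. It assumes a boto3 "health" client is passed in (the Health API needs a qualifying Support plan, and its global endpoint is in us-east-1); the function name is made up for illustration.

```python
def upcoming_scheduled_changes(health_client, services=None):
    """Return all upcoming scheduledChange events, following pagination."""
    event_filter = {
        "eventTypeCategories": ["scheduledChange"],
        "eventStatusCodes": ["upcoming"],
    }
    if services:
        # Optional narrowing, e.g. ["EC2", "RDS"].
        event_filter["services"] = list(services)

    events = []
    kwargs = {"filter": event_filter}
    while True:
        page = health_client.describe_events(**kwargs)
        events.extend(page.get("events", []))
        token = page.get("nextToken")
        if not token:
            return events
        kwargs["nextToken"] = token


# Usage (requires boto3 and AWS credentials):
# import boto3
# client = boto3.client("health", region_name="us-east-1")
# for event in upcoming_scheduled_changes(client, services=["EC2"]):
#     print(event["eventTypeCode"], event["region"], event["startTime"])
```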

6. Architecture Patterns

The "Proactive Ops" Integration

Don't manually check the dashboard. Do automate alerts.

Architecture:
1. Source: AWS Health (EventBridge Rule).
2. Filter: match the event's eventTypeCategory against ["issue", "scheduledChange", "accountNotification"].
3. Target: SNS Topic -> Lambda.
4. Action: Lambda formats the message ("AWS is doing maintenance on DB-123") and sends it to Slack (#ops-alerts) or creates a Jira Ticket.
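
The pipeline above can be sketched roughly as follows. This is a minimal example, not a production implementation: the event pattern and the field names under "detail" follow the shape of AWS Health events on EventBridge, while the SLACK_WEBHOOK_URL environment variable, the function names, and the message layout are assumptions made for illustration.

```python
import json
import os
import urllib.request

# EventBridge rule pattern for the three categories in the filter step.
EVENT_PATTERN = {
    "source": ["aws.health"],
    "detail-type": ["AWS Health Event"],
    "detail": {
        "eventTypeCategory": ["issue", "scheduledChange", "accountNotification"]
    },
}


def unwrap(event):
    """Yield EventBridge events, whether delivered directly or wrapped by SNS."""
    if "Records" in event:
        for record in event["Records"]:
            yield json.loads(record["Sns"]["Message"])
    else:
        yield event


def format_message(health_event):
    """Build a short, human-readable alert from one AWS Health event."""
    detail = health_event.get("detail", {})
    descriptions = detail.get("eventDescription", [])
    summary = descriptions[0]["latestDescription"] if descriptions else "(no description)"
    entities = [e["entityValue"] for e in detail.get("affectedEntities", [])]
    affected = ", ".join(entities) or "no specific resources listed"
    return (
        f"[{detail.get('eventTypeCategory', 'unknown')}] "
        f"{detail.get('service', '?')} {detail.get('eventTypeCode', '?')} "
        f"in {health_event.get('region', '?')}\n"
        f"Affected: {affected}\n{summary}"
    )


def post_to_slack(text, webhook_url):
    """POST a plain-text message to a Slack incoming webhook."""
    body = json.dumps({"text": text}).encode("utf-8")
    request = urllib.request.Request(
        webhook_url, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=5) as response:
        return response.status


def handler(event, context):
    """Lambda entry point: forward each Health event to Slack."""
    webhook_url = os.environ["SLACK_WEBHOOK_URL"]  # hypothetical variable name
    for health_event in unwrap(event):
        post_to_slack(format_message(health_event), webhook_url)
```

Keeping format_message free of network calls makes it easy to unit test with a saved sample event before wiring up the real rule.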

7. IAM & Security Model

Read Only Access: Give your Support/Ops team read-only access to AWS Health (for example, a policy that allows only the health:Describe* actions) so they can check status without being able to touch resources.
Organization View: In AWS Organizations, the management account (formerly called the master account) can view Health events for all member accounts once organizational view is enabled. This is crucial for centralized operations.
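
As a rough sketch of what the organization view offers programmatically, the function below lists open issues across the organization and the member accounts each one affects. It assumes a boto3 "health" client created in the management account (or a delegated administrator account) with organizational view already enabled and a Support plan that includes the Health API; the function name is made up, and pagination is left out for brevity.

```python
def org_open_issues(health_client):
    """Map each open issue's ARN to the member account IDs it affects."""
    affected = {}
    events = health_client.describe_events_for_organization(
        filter={"eventTypeCategories": ["issue"], "eventStatusCodes": ["open"]}
    ).get("events", [])
    for event in events:
        # One extra call per event to find out which accounts are hit.
        accounts = health_client.describe_affected_accounts_for_organization(
            eventArn=event["arn"]
        ).get("affectedAccounts", [])
        affected[event["arn"]] = accounts
    return affected


# Usage (requires boto3 and AWS credentials):
# import boto3
# client = boto3.client("health", region_name="us-east-1")
# print(org_open_issues(client))
```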

8. Cost Model (Very Important)

Free: The AWS Health Dashboard (formerly the "Personal Health Dashboard") is free for all users.
Paid (Business/Enterprise Support): To access the AWS Health API (programmatic access to query status), you MUST have a Business or Enterprise Support plan.
Implication: If you are on the Basic (Free) Support plan, you cannot write code to query these checks via the API. You are limited to the Console and to EventBridge rules, which push events to you but cannot be queried on demand.
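
One way to find out whether an account can use the Health API is to make a cheap call and look for a subscription error. The sketch below assumes a boto3-style client is passed in and that the error code for accounts without a qualifying Support plan is SubscriptionRequiredException; the function name is made up for illustration.

```python
def has_health_api_access(health_client):
    """Return False if the account's Support plan does not include the Health API."""
    try:
        health_client.describe_events(maxResults=10)
    except Exception as exc:
        # botocore's ClientError carries the service error in exc.response.
        code = getattr(exc, "response", {}).get("Error", {}).get("Code")
        if code == "SubscriptionRequiredException":
            return False
        raise  # Anything else (credentials, throttling, ...) is a real failure.
    return True


# Usage (requires boto3 and AWS credentials):
# import boto3
# client = boto3.client("health", region_name="us-east-1")
# print(has_health_api_access(client))
```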

9. Common Mistakes & Anti-Patterns

Ignoring Emails: AWS sends maintenance emails. If you filter them into an "Updates" folder and never read them, you will have unplanned downtime.
Assuming "All Systems Normal" is True: The Public Dashboard is often updated after the community notices an outage.
Not Aggregating: If you have 50 AWS accounts, logging into each one to check health is impractical. Use AWS Organizations to aggregate the health view.

10. When NOT to Use This Service

Application Monitoring: This does NOT monitor your app. It only monitors AWS infrastructure underneath your app. Use CloudWatch for your app.
Real-time debugging: Status updates can have a 5-15 minute lag. Trust your own metrics (CloudWatch) first.

11. Interview-Level Summary

Scope: Global (Service Health) vs Personal (Your Account Health).
Automation: Integrated via EventBridge.
Cost: API access requires Business or Enterprise Support.
Use Case: Outage confirmation and Scheduled Maintenance awareness.
Aggregator: Can view health across the entire Organization from the management (formerly master) account.