S3 (Simple Storage Service)
1. Why This Service Exists (The Real Problem)
The Problem: Storing files on a server (EC2) is a ticking time bomb.
- Disk Full: error: no space left on device at 3 AM.
- Data Loss: If the physical drive fails, the data is gone forever.
- Scaling: How do you serve 100TB of images to 1 million users from a single machine? You can't.
The Solution: Infinite storage capacity that is accessible via HTTP API from anywhere in the world.
2. Mental Model (Antigravity View)
The Analogy: Google Drive for your Application.
- Bucket: The root folder.
- Object: A file.
- Key: The full path (filename), e.g., folder1/folder2/image.jpg.
- Metadata: Tags attached to the file (e.g., Author: Bob, ContentType: image/jpeg).
One-Sentence Definition: An object store with infinite capacity, 99.999999999% (11 9s) durability, and HTTP access.
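A minimal boto3 sketch of the bucket/key/metadata model above (bucket name, key, and local file are hypothetical placeholders; assumes AWS credentials are already configured):

```python
import boto3

s3 = boto3.client("s3")

# Bucket = root folder, Key = full "path", Metadata = tags on the object.
with open("image.jpg", "rb") as f:
    s3.put_object(
        Bucket="my-app-assets",            # hypothetical bucket
        Key="folder1/folder2/image.jpg",   # the key is the whole path
        Body=f,
        ContentType="image/jpeg",
        Metadata={"author": "Bob"},        # stored as x-amz-meta-author
    )

# Fetch it back over the HTTP API from anywhere.
obj = s3.get_object(Bucket="my-app-assets", Key="folder1/folder2/image.jpg")
print(obj["ContentType"], obj["Metadata"])
```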
3. Core Components (No Marketing)
- Buckets: Global namespace (the name must be unique across all AWS accounts), e.g., my-app-assets.
- Objects: The data blobs. Max size: 5 TB per object.
- Storage Classes: The price/speed tier.
- Standard: Hot data (frequent access).
- Glacier: Cold archival (think tape storage: cheap to keep, slow to retrieve).
- Policies:
- Bucket Policy: "Who can access this bucket?" (Resource-based; see the sketch below).
- ACL (Legacy): Don't use this.
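A sketch of a resource-based bucket policy applied with boto3 (the account ID, role name, and bucket name are hypothetical placeholders):

```python
import json
import boto3

s3 = boto3.client("s3")

# "Who can access this bucket?" expressed as a resource-based policy.
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "AllowAppRoleRead",
            "Effect": "Allow",
            "Principal": {"AWS": "arn:aws:iam::123456789012:role/app-server"},
            "Action": ["s3:GetObject"],
            "Resource": "arn:aws:s3:::my-app-assets/*",
        }
    ],
}

s3.put_bucket_policy(Bucket="my-app-assets", Policy=json.dumps(policy))
```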
4. How It Works Internally (Simplified)
- Durability: When you upload a file, AWS splits it into erasure-coded chunks and scatters them across at least 3 Availability Zones.
- Consistency: Strong Consistency. If you write a new object and immediately read it, you will get the new version.
- Flat Structure: There are no "folders". photos/2023/jan/cat.jpg is just a single key string. The console pretends there are folders for UI convenience.
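A small boto3 sketch of the flat keyspace: asking the API to group keys by "/" is what produces the folder illusion (bucket name and prefixes are hypothetical):

```python
import boto3

s3 = boto3.client("s3")

# "Folders" are just key prefixes; the Delimiter makes S3 group them for you.
resp = s3.list_objects_v2(
    Bucket="my-app-assets",
    Prefix="photos/2023/",
    Delimiter="/",
)

# Keys that sit "directly inside" photos/2023/
for obj in resp.get("Contents", []):
    print("object:", obj["Key"])

# Prefixes that look like sub-folders, e.g. photos/2023/jan/
for cp in resp.get("CommonPrefixes", []):
    print("pseudo-folder:", cp["Prefix"])
```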
5. Common Production Use Cases
- Static Website Hosting: Hosting React/Vue frontends (HTML/JS/CSS) directly from S3 (dirt cheap; see the sketch after this list).
- Data Lake: Dumping all raw CSV/JSON logs into one bucket for Athena/Redshift to analyze later.
- Media Storage: Storing user profile pictures, videos, and PDFs.
- Backup Target: Storing DB snapshots.
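A minimal static-website-hosting sketch (bucket name and file paths are hypothetical; public read access still has to be granted via the bucket policy, or better, via CloudFront as in the CDN pattern below):

```python
import boto3

s3 = boto3.client("s3")

# Tell S3 which object to serve as the index page and which as the error page.
s3.put_bucket_website(
    Bucket="my-frontend-bucket",
    WebsiteConfiguration={
        "IndexDocument": {"Suffix": "index.html"},
        "ErrorDocument": {"Key": "error.html"},
    },
)

# Upload the built frontend with the right Content-Type so browsers render it.
s3.upload_file(
    "dist/index.html",
    "my-frontend-bucket",
    "index.html",
    ExtraArgs={"ContentType": "text/html"},
)
```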
6. Architecture Patterns
The "Pre-Signed URL" Pattern
Don't route file uploads through your backend server (Client -> EC2 -> S3); it wastes EC2 CPU and bandwidth. Have the client upload directly to S3 instead.
Flow:
1. Client: "I want to upload cat.jpg".
2. Backend: Checks permissions and generates a Pre-Signed URL (a temporary signed URL that allows a write to that exact key, cat.jpg, for e.g. 5 minutes).
3. Client: PUTs file directly to S3 using that URL.
4. Backend: Is notified via an S3 event (e.g., through EventBridge) that the upload is complete (see the sketch below).
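A minimal sketch of steps 2 and 3 with boto3 and requests (bucket name, key, and the 5-minute expiry are illustrative; the event wiring for step 4 is configured on the bucket and not shown):

```python
import boto3
import requests  # only used for the client-side PUT in this sketch

s3 = boto3.client("s3")

# Step 2 (backend): after checking permissions, mint a URL that allows a PUT
# of exactly this key for 5 minutes.
upload_url = s3.generate_presigned_url(
    ClientMethod="put_object",
    Params={"Bucket": "my-app-assets", "Key": "uploads/cat.jpg"},
    ExpiresIn=300,  # 5 minutes, in seconds
)

# Step 3 (client): PUT the file bytes straight to S3, bypassing the backend.
with open("cat.jpg", "rb") as f:
    response = requests.put(upload_url, data=f)
response.raise_for_status()
```

The signature is embedded in the URL itself, so the client needs no AWS credentials of its own.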
The "CDN Origin" Pattern
- S3 is fast, but CloudFront (CDN) is faster.
- Always put CloudFront in front of S3 for public assets to cache content at the Edge and reduce S3 request costs.
7. IAM & Security Model
The "Block Public Access" Switch: - Turn this ON at the account level. - Never make a bucket public unless it's a static website. - Use OAI (Origin Access Identity) or OAC to let CloudFront read the bucket while executing Block Public Access for everyone else.
Encryption:
- SSE-S3: AWS manages the keys (default).
- SSE-KMS: You manage the keys in KMS (granular control over who can decrypt).
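A sketch of both ideas with boto3 (bucket name, key, and KMS key alias are hypothetical):

```python
import boto3

s3 = boto3.client("s3")

# Flip all four Block Public Access switches on for this bucket
# (the same settings can also be applied account-wide).
s3.put_public_access_block(
    Bucket="my-app-assets",
    PublicAccessBlockConfiguration={
        "BlockPublicAcls": True,
        "IgnorePublicAcls": True,
        "BlockPublicPolicy": True,
        "RestrictPublicBuckets": True,
    },
)

# Upload an object encrypted with a customer-managed KMS key (SSE-KMS).
s3.put_object(
    Bucket="my-app-assets",
    Key="reports/q1.pdf",
    Body=b"...",                        # placeholder content
    ServerSideEncryption="aws:kms",
    SSEKMSKeyId="alias/my-app-key",     # hypothetical key alias
)
```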
8. Cost Model (Very Important)
- Storage: Cheap (roughly $0.023 per GB-month for Standard).
- Requests: Expensive ($0.005 per 1,000 PUT requests).
- Trap: A logging app writing 1KB files every millisecond will bankrupt you on Request fees, not Storage fees (quantified in the sketch below).
- Data Transfer: Free In. Expensive Out (unless going to CloudFront).
- Lifecycle Transitions: Moving 1 million tiny files to Glacier costs money per transition. (Glacier has a 40KB minimum overhead).
Optimization:
- Use Intelligent-Tiering: It automatically moves objects to infrequent-access tiers when you stop touching them.
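A back-of-the-envelope sketch of the request-fee trap above, using the rough prices quoted in this section (real prices vary by region and tier):

```python
# One 1 KB log file every millisecond, for a 30-day month.
files_per_second = 1000
seconds_per_month = 30 * 24 * 3600

puts_per_month = files_per_second * seconds_per_month      # ~2.6 billion PUTs
request_cost = puts_per_month / 1000 * 0.005                # $0.005 per 1,000 PUTs

stored_gb = puts_per_month * 1 / (1024 ** 2)                # 1 KB objects -> GB
storage_cost = stored_gb * 0.023                            # $0.023 per GB-month

print(f"PUT requests/month: {puts_per_month:,}")
print(f"Request cost: ${request_cost:,.0f}")   # ~ $13,000
print(f"Storage cost: ${storage_cost:,.0f}")   # ~ $57, even if all ~2.5 TB sat there the whole month
```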
9. Common Mistakes & Anti-Patterns
- Public Buckets: Leaking customer data because "I couldn't get the permissions to work".
- ListBucket Abuse: Running ls (ListObjects) on a bucket with 100 million objects. It's slow and expensive; S3 is not a database (see the sketch after this list).
- Filesystem Usage: Mounting S3 as a filesystem (FUSE/s3fs) on Linux. It performs terribly because S3 is not POSIX compliant (no append, no file locking).
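A sketch of why listing a huge bucket hurts: ListObjectsV2 returns at most 1,000 keys per call, so an "ls" over 100 million objects is roughly 100,000 billed API calls (bucket name is hypothetical):

```python
import boto3

s3 = boto3.client("s3")

# The paginator hides the 1,000-keys-per-call limit, but every page is still
# a separate (billed) API request.
paginator = s3.get_paginator("list_objects_v2")
pages = 0
for page in paginator.paginate(Bucket="my-huge-log-bucket"):
    pages += 1
    # each page carries up to 1,000 keys in page.get("Contents", [])
print("API calls made:", pages)
```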
10. When NOT to Use This Service
- Block Storage: If you need a hard drive for your EC2 (e.g., to install Oracle DB). Use EBS.
- Shared File System: If you need 50 servers to edit the same Word doc simultaneously. Use EFS (Elastic File System).
- Queryable DB: S3 Select exists, but it's not a database. Don't use S3 for transactional data.
11. Interview-Level Summary
- Consistency: Strong consistency for new objects and overwrites.
- Standard vs Intelligent Tiering: Intelligent Tiering moves data automatically based on access patterns.
- Encryption: Server-Side (SSE-S3, SSE-KMS, SSE-C) vs Client-Side.
- Versioning: Protects against accidental deletions/overwrites (MFA Delete adds an extra layer; see the sketch below).
- Cross-Region Replication (CRR): For disaster recovery or lower latency for users in another region.
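A minimal versioning sketch with boto3 (bucket name is hypothetical; enabling MFA Delete requires root credentials via the CLI/API, so it is not shown):

```python
import boto3

s3 = boto3.client("s3")

# Turn on versioning so overwrites and deletes keep the old versions around.
s3.put_bucket_versioning(
    Bucket="my-app-assets",
    VersioningConfiguration={"Status": "Enabled"},
)

# From now on, a DELETE only adds a "delete marker"; earlier versions
# remain recoverable until you explicitly purge them.
```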