
S3 (Simple Storage Service)

1. Why This Service Exists (The Real Problem)

The Problem: Storing files on a server (EC2) is a ticking time bomb.

  • Disk Full: error: no space left on device at 3 AM.
  • Data Loss: If the physical drive fails, the data is gone forever.
  • Scaling: How do you serve 100TB of images to 1 million users from a single machine? You can't.

The Solution: Infinite storage capacity that is accessible via HTTP API from anywhere in the world.

2. Mental Model (Antigravity View)

The Analogy: Google Drive for your Application.

  • Bucket: The root folder.
  • Object: A file.
  • Key: The full path (filename), e.g. folder1/folder2/image.jpg.
  • Metadata: Tags attached to the file (e.g., Author: Bob, ContentType: image/jpeg).

One-Sentence Definition: An object store with infinite capacity, 99.999999999% (11 9s) durability, and HTTP access.
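
A minimal sketch of that mental model with boto3 (the bucket name, key, and file are placeholders, not anything prescribed above):

    import boto3

    s3 = boto3.client("s3")

    # The "path" is just a key string; content type and metadata ride along with the object.
    with open("image.jpg", "rb") as f:
        s3.put_object(
            Bucket="my-app-assets",            # placeholder bucket name
            Key="folder1/folder2/image.jpg",   # full key, not a real directory path
            Body=f,
            ContentType="image/jpeg",
            Metadata={"author": "bob"},        # user-defined metadata, returned on GET/HEAD
        )

    obj = s3.get_object(Bucket="my-app-assets", Key="folder1/folder2/image.jpg")
    print(obj["ContentType"], obj["Metadata"])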

3. Core Components (No Marketing)

  1. Buckets: Global namespace (Name must be unique across all AWS accounts). e.g., my-app-assets.
  2. Objects: The data blobs. Max size 5TB per object.
  3. Storage Classes: The price/speed tier.
    • Standard: Hot data (Freq access).
    • Glacier: Cold archival (tape-like; retrievals take minutes to hours).
  4. Policies:
    • Bucket Policy: "Who can access this bucket?" (Resource-based).
    • ACL (Legacy): Don't use this.
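
To make the bucket-policy idea concrete, here is a rough sketch of attaching a resource-based policy with boto3; the account ID, role name, and bucket are made up:

    import json
    import boto3

    s3 = boto3.client("s3")

    # Resource-based policy: "who can access this bucket?"
    # Here: allow one specific IAM role read-only access to objects.
    policy = {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "AllowReadFromAppRole",
                "Effect": "Allow",
                "Principal": {"AWS": "arn:aws:iam::111122223333:role/app-reader"},  # hypothetical role
                "Action": "s3:GetObject",
                "Resource": "arn:aws:s3:::my-app-assets/*",
            }
        ],
    }

    s3.put_bucket_policy(Bucket="my-app-assets", Policy=json.dumps(policy))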

4. How It Works Internally (Simplified)

  1. Durability: When you upload a file, AWS splits it into erasure-coded chunks and scatters them across at least 3 Availability Zones.
  2. Consistency: Strong Consistency. If you write a new object and immediately read it, you will get the new version.
  3. Flat Structure: There are no "folders". photos/2023/jan/cat.jpg is just a single key string. The console pretends there are folders for UI convenience.
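
You can see the flat structure by listing with a Prefix and Delimiter; the "folders" only show up as CommonPrefixes that S3 computes on the fly (bucket name assumed):

    import boto3

    s3 = boto3.client("s3")

    # There is no directory tree; we just filter keys that start with "photos/2023/".
    resp = s3.list_objects_v2(
        Bucket="my-app-assets",
        Prefix="photos/2023/",
        Delimiter="/",          # asks S3 to group the next "path segment" for us
    )

    for p in resp.get("CommonPrefixes", []):
        print("pseudo-folder:", p["Prefix"])   # e.g. photos/2023/jan/
    for obj in resp.get("Contents", []):
        print("object key:", obj["Key"])       # plain key strings, no real folders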

5. Common Production Use Cases

  • Static Website Hosting: Hosting React/Vue frontends (HTML/JS/CSS) directly from S3 (dirt cheap); see the sketch after this list.
  • Data Lake: Dumping all raw CSV/JSON logs into one bucket for Athena/Redshift to analyze later.
  • Media Storage: Storing user profile pictures, videos, and PDFs.
  • Backup Target: Storing DB snapshots.
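
For the static-hosting use case above, a rough sketch of switching a bucket into website mode (bucket name and file paths assumed; in practice you also need a public-read policy or, better, CloudFront in front):

    import boto3

    s3 = boto3.client("s3")

    # Serve index.html for "/" and a custom error page for missing keys.
    s3.put_bucket_website(
        Bucket="my-app-assets",
        WebsiteConfiguration={
            "IndexDocument": {"Suffix": "index.html"},
            "ErrorDocument": {"Key": "error.html"},
        },
    )

    # Upload the built frontend with the right content type so browsers render it.
    s3.upload_file("dist/index.html", "my-app-assets", "index.html",
                   ExtraArgs={"ContentType": "text/html"})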

6. Architecture Patterns

The "Pre-Signed URL" Pattern

Don't route file uploads through your backend server (Client -> EC2 -> S3); it wastes EC2 CPU and bandwidth. Instead, have clients upload directly to S3.

Flow:

  1. Client: "I want to upload cat.jpg".
  2. Backend: Checks permission, then generates a Pre-Signed URL (a temporary token allowing write access to specifically cat.jpg for 5 minutes).
  3. Client: PUTs the file directly to S3 using that URL.
  4. Backend: Notified via an S3 event (e.g., EventBridge) that the upload is done.
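
A sketch of steps 2-3 using boto3 plus the requests library; the bucket, key, and 5-minute expiry mirror the flow above, and the file name is illustrative:

    import boto3
    import requests

    s3 = boto3.client("s3")

    # Step 2 (backend): mint a URL that allows one PUT of exactly this key for 5 minutes.
    url = s3.generate_presigned_url(
        "put_object",
        Params={"Bucket": "my-app-assets", "Key": "uploads/cat.jpg", "ContentType": "image/jpeg"},
        ExpiresIn=300,
    )

    # Step 3 (client): upload straight to S3; the backend never touches the bytes.
    # The Content-Type header must match what was signed.
    with open("cat.jpg", "rb") as f:
        resp = requests.put(url, data=f, headers={"Content-Type": "image/jpeg"})
    resp.raise_for_status()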

The "CDN Origin" Pattern

  • S3 is fast, but CloudFront (CDN) is faster.
  • Always put CloudFront in front of S3 for public assets to cache content at the Edge and reduce S3 request costs.

7. IAM & Security Model

The "Block Public Access" Switch: - Turn this ON at the account level. - Never make a bucket public unless it's a static website. - Use OAI (Origin Access Identity) or OAC to let CloudFront read the bucket while executing Block Public Access for everyone else.

Encryption:

  • SSE-S3: AWS manages the keys (default).
  • SSE-KMS: You manage the keys in KMS (granular control over who can decrypt).
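
A sketch of setting SSE-KMS as the bucket default; the KMS key ARN is a placeholder:

    import boto3

    s3 = boto3.client("s3")

    # Every new object gets encrypted with this KMS key unless the PUT overrides it.
    s3.put_bucket_encryption(
        Bucket="my-app-assets",
        ServerSideEncryptionConfiguration={
            "Rules": [{
                "ApplyServerSideEncryptionByDefault": {
                    "SSEAlgorithm": "aws:kms",
                    "KMSMasterKeyID": "arn:aws:kms:us-east-1:111122223333:key/EXAMPLE-KEY-ID",  # placeholder
                },
                "BucketKeyEnabled": True,   # reduces the number of KMS requests (and their cost)
            }],
        },
    )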

8. Cost Model (Very Important)

  • Storage: Cheap (~$0.023 per GB-month for Standard).
  • Requests: Expensive ($0.005 per 1,000 PUT requests).
    • Trap: A logging app writing 1KB files every millisecond will bankrupt you on Request fees, not Storage fees.
  • Data Transfer: Free In. Expensive Out (unless going to CloudFront).
  • Lifecycle Transitions: Moving 1 million tiny files to Glacier costs money per transition. (Glacier has a 40KB minimum overhead).

Optimization: Use Intelligent-Tiering; it automatically moves files to infrequent-access tiers when you stop touching them.
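
A sketch of a lifecycle rule that moves logs to Intelligent-Tiering after 30 days, to Glacier after 180, and deletes them after a year; the prefix and day counts are arbitrary examples:

    import boto3

    s3 = boto3.client("s3")

    s3.put_bucket_lifecycle_configuration(
        Bucket="my-app-assets",
        LifecycleConfiguration={
            "Rules": [{
                "ID": "archive-logs",
                "Filter": {"Prefix": "logs/"},
                "Status": "Enabled",
                "Transitions": [
                    {"Days": 30, "StorageClass": "INTELLIGENT_TIERING"},
                    {"Days": 180, "StorageClass": "GLACIER"},
                ],
                "Expiration": {"Days": 365},
            }],
        },
    )

Remember the trap above: each of those transitions is billed per object, so millions of tiny files can make the rule itself expensive.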

9. Common Mistakes & Anti-Patterns

  • Public Buckets: Leaking customer data because "I couldn't get the permissions to work".
  • ListBucket Abuse: Running ls on a bucket with 100 million objects. It's slow and expensive. S3 is not a database.
  • Filesystem Usage: Mounting S3 as a filesystem (FUSE/s3fs) on Linux. It performs terribly because S3 is not POSIX compliant (no append, no file locking).

10. When NOT to Use This Service

  • Block Storage: If you need a hard drive for your EC2 (e.g., to install Oracle DB). Use EBS.
  • Shared File System: If you need 50 servers to edit the same Word doc simultaneously. Use EFS (Elastic File System).
  • Queryable DB: S3 Select exists, but it's not a database. Don't use S3 for transactional data.

11. Interview-Level Summary

  • Consistency: Strong consistency for new objects and overwrites.
  • Standard vs Intelligent Tiering: Intelligent Tiering moves data automatically based on access patterns.
  • Encryption: Server-Side (SSE-S3, SSE-KMS, SSE-C) vs Client-Side.
  • Versioning: Protects against accidental deletion/overwrites (MFA Delete adds an extra layer).
  • Cross-Region Replication (CRR): For disaster recovery, or to serve users in another region with lower latency.
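
As a final sketch, enabling versioning on a bucket, which both the versioning and CRR bullets above depend on (CRR requires versioning on the source and destination buckets); the bucket name is a placeholder:

    import boto3

    s3 = boto3.client("s3")

    # Once enabled, overwrites and deletes create new versions / delete markers
    # instead of destroying data.
    s3.put_bucket_versioning(
        Bucket="my-app-assets",
        VersioningConfiguration={"Status": "Enabled"},
    )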