Snowflake vs. S3: What Storage Solution Is Best for Unstructured Data?

Introduction

Ask an engineer about their data stack and you’ll get S3 and Snowflake in the first breath. That’s the standard answer—S3 holds everything, Snowflake queries it. For years, that was enough.

But that two-layer setup has cracks now. ML pipelines don’t just query data—they churn through it repeatedly. Document processing needs fast, stateful access. Observability platforms hit the same files thousands of times. None of that fits cleanly into “store in S3, query in Snowflake.”

What’s missing is a third layer: something between storage and analytics that treats active data differently. Not just durable, not just queryable—fast and shared, with filesystem semantics that modern compute actually expects.

This piece walks through why S3 and Snowflake aren’t competitors (they never were), and where something like Archil fills the hole they leave open.

Durable Storage in the Cloud: What S3 Is Built For

S3 is the foundation. It stores objects in buckets: data plus metadata plus a unique key. Each object can be up to 5 TB, and objects are automatically replicated across multiple availability zones. It scales effectively without limit and is designed for eleven nines of durability.

AWS built S3 for reliability first. You interact with it through REST APIs and SDKs, which makes it compatible with nearly everything. Teams use it as a landing zone for raw data, a long-term archive, and a neutral handoff point between systems. Parquet files, JSON logs, images, PDFs—all of it goes into S3.
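
If you haven't touched it in a while, the interaction model is about as simple as it sounds. Here's a minimal sketch using boto3; the bucket name and keys are placeholders:

```python
import boto3

# Placeholder bucket and keys; assumes AWS credentials are already configured
# (environment variables, an instance profile, or a shared credentials file).
s3 = boto3.client("s3")

# Upload a raw document: the object is just bytes plus metadata under a key.
s3.upload_file("invoice-0001.pdf", "example-raw-landing-zone",
               "invoices/2024/invoice-0001.pdf")

# List what landed under a prefix, which is the closest S3 gets to a directory.
resp = s3.list_objects_v2(Bucket="example-raw-landing-zone", Prefix="invoices/2024/")
for obj in resp.get("Contents", []):
    print(obj["Key"], obj["Size"])

# Pull it back down for processing somewhere else.
s3.download_file("example-raw-landing-zone",
                 "invoices/2024/invoice-0001.pdf", "/tmp/invoice-0001.pdf")
```

That simplicity is exactly why everything integrates with it, and also why it stops at "put bytes in, get bytes out."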

What S3 doesn’t do is low-latency access. Operations are optimized for durability and aggregate throughput, not per-request latency. And it definitely doesn’t act like a filesystem: no shared mutable files, no POSIX semantics, nothing that looks like a mounted disk.

Analytical Compute: How Snowflake Uses Stored Data

Snowflake sits in a different layer entirely. It doesn’t store arbitrary objects like S3. Instead, it ingests data—usually from S3—and reorganizes it into columnar format for fast querying.

The big architectural move Snowflake made was separating storage from compute. Data lives centrally, and you spin up independent virtual warehouses to query it. That design lets you run high-concurrency workloads without everything stepping on each other.

Snowflake handles structured and semi-structured data. You can query JSON, Parquet, Avro—whatever—using SQL. Teams pair it with ELT tools to transform raw data into clean tables for dashboards and reporting.
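
In practice, the load-and-query loop looks something like the sketch below. It uses the Python connector; the connection parameters, the s3_landing_stage stage, and a raw_events table with a single VARIANT column named payload are all assumptions for illustration:

```python
import snowflake.connector

# Illustrative only: account, credentials, and object names are placeholders.
conn = snowflake.connector.connect(
    account="my_account",
    user="my_user",
    password="...",
    warehouse="ANALYTICS_WH",
    database="ANALYTICS",
    schema="RAW",
)
cur = conn.cursor()

# Load JSON files staged in S3 into a table with one VARIANT column (payload).
cur.execute("""
    COPY INTO raw_events
    FROM @s3_landing_stage/events/
    FILE_FORMAT = (TYPE = JSON)
""")

# Query the semi-structured data with SQL; the colon syntax reaches into VARIANT.
cur.execute("""
    SELECT payload:customer_id::STRING AS customer_id, COUNT(*)
    FROM raw_events
    GROUP BY 1
""")
for customer_id, n in cur.fetchall():
    print(customer_id, n)
```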

Where Snowflake shines: governed analytics, data sharing, BI workloads. What it doesn’t try to be: a general-purpose compute layer or a filesystem.

Why Two Layers Aren’t Enough for Modern Workloads

For a lot of workloads, S3 + Snowflake is still fine. But once you add ML training, heavy document processing, or shared compute environments, you hit problems fast.

Applications need repeated access to the same files across multiple instances. Restarting a job shouldn’t mean re-downloading terabytes. A failure shouldn’t force you to rebuild your entire working set from scratch. Going directly to S3 every time adds latency. Staging everything onto local disks or network filesystems adds operational mess.

There’s a structural gap here. S3 gives you durability and scale. Snowflake gives you analytics and governance. Neither one is built to act as a high-performance shared filesystem for active compute.

Filling that gap requires a dedicated access layer—one that keeps S3 as the source of truth but changes how compute interacts with it.

The Access Layer: Filesystem Semantics on Top of Object Storage

Most compute frameworks expect filesystem behavior. They want mountable paths, file locking, atomic renames, consistent directories. You can’t fake that efficiently using object APIs alone.
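
A concrete example of the mismatch: object storage has no rename. What a filesystem does in one atomic step becomes a copy plus a delete against S3. A rough comparison, with placeholder bucket and paths:

```python
import os
import boto3

s3 = boto3.client("s3")
bucket = "example-bucket"  # placeholder

# "Renaming" an object in S3 is really a copy followed by a delete:
# two separate requests, not atomic, and the object's bytes get rewritten.
s3.copy_object(
    Bucket=bucket,
    CopySource={"Bucket": bucket, "Key": "staging/report.json"},
    Key="final/report.json",
)
s3.delete_object(Bucket=bucket, Key="staging/report.json")

# On a mounted filesystem the same step is a single atomic metadata operation,
# which is what job schedulers and checkpointing code generally rely on.
os.replace("/mnt/data/staging/report.json", "/mnt/data/final/report.json")
```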

An access layer bridges the gap by exposing filesystem semantics on top of object storage. Instead of forcing your application to adapt to S3’s constraints, it makes S3 adapt to what your application expects. The result is simpler pipelines, faster restarts, and fewer hacks.

That’s the role Archil is designed to fill.

Archil: High-Performance Filesystem Access Backed by S3

Archil gives you POSIX-compliant virtual volumes backed directly by S3. These volumes mount like local disks but scale automatically with your dataset size. Multiple compute instances can attach to the same volume and see a consistent view of shared files without any manual coordination.

Because Archil preserves native S3 formats, your data stays accessible through both filesystem operations and standard object APIs. Active data is cached close to compute, which drops latency well below what you’d get hitting S3 directly—and you don’t have to pre-provision capacity.
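
Day to day, that means ordinary file operations against a mounted path, with the same objects still reachable over the S3 API. The mount point and bucket layout below are assumptions for illustration, not Archil specifics:

```python
from pathlib import Path
import boto3

# Hypothetical mount point for a volume backed by an S3 bucket.
volume = Path("/mnt/archil/invoices")

# Ordinary file I/O works because the volume exposes POSIX semantics.
checkpoint = volume / "state" / "last_processed.txt"
checkpoint.parent.mkdir(parents=True, exist_ok=True)
checkpoint.write_text("invoices/2024/invoice-0001.pdf\n")

pdfs = sorted(volume.glob("2024/*.pdf"))
print(f"{len(pdfs)} documents visible to every instance mounting this volume")

# Because the data stays in native S3 formats, the same files remain
# reachable through the standard object API as well.
s3 = boto3.client("s3")
obj = s3.get_object(Bucket="example-invoice-bucket",
                    Key="invoices/2024/invoice-0001.pdf")
print(obj["ContentLength"])
```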

This works especially well for workloads with tons of files, frequent writes, or shared state across instances. Developers can use existing tools unchanged. Operators get simpler capacity planning and lower costs because you’re only paying for active data.

Putting the Layers Together

In a modern architecture, each layer has one job. S3 is the durable system of record. Archil provides fast, shared access for compute running against active datasets. Snowflake consumes transformed data for analytics, reporting, and governance.

For teams running performance-sensitive compute on S3-backed data, this access layer becomes as foundational as storage and analytics themselves.

These systems don’t overlap or compete—they reinforce each other. Storage stays scalable and cheap. Compute stays flexible. Analytics stays fast and governed.

A Practical Example: High-Volume Document Processing

Take a SaaS platform processing invoices and contracts at scale. Raw PDFs land in S3 for durability and compliance. Compute instances mount an Archil volume to run OCR and extraction jobs directly against the shared dataset, maintaining state across retries and failures.

Structured outputs get loaded into Snowflake for reporting and audits. The original files stay in S3 for long-term retention. You avoid repeated downloads, eliminate manual staging, and recovery becomes trivial when jobs fail.
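
Sketched out, the processing loop is mostly ordinary file handling. The paths below are hypothetical, and extract_fields stands in for whatever OCR or extraction step you actually run:

```python
import json
from pathlib import Path

# Assumes a shared volume mounted at /mnt/archil/docs on every worker.
DOCS = Path("/mnt/archil/docs/raw")
OUT = Path("/mnt/archil/docs/extracted")
DONE = Path("/mnt/archil/docs/state/done.txt")

def extract_fields(pdf_path: Path) -> dict:
    # Placeholder for the real OCR / field-extraction step.
    return {"source": pdf_path.name, "status": "parsed"}

OUT.mkdir(parents=True, exist_ok=True)
DONE.parent.mkdir(parents=True, exist_ok=True)
done = set(DONE.read_text().splitlines()) if DONE.exists() else set()

for pdf in sorted(DOCS.glob("*.pdf")):
    if pdf.name in done:
        continue  # already handled by an earlier run or another instance
    record = extract_fields(pdf)
    (OUT / f"{pdf.stem}.json").write_text(json.dumps(record))
    # Appending to shared state means a retry resumes where the last run stopped.
    with DONE.open("a") as f:
        f.write(pdf.name + "\n")

# The JSON records in OUT are what subsequently gets loaded into Snowflake.
```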

Conclusion

S3 and Snowflake are still foundational, but they’re no longer enough on their own. Once you’re running compute-heavy or stateful workloads, you need a dedicated access layer.

Splitting durability, access, and analytics into distinct layers makes systems easier to scale and operate. S3 handles reliable storage. Snowflake delivers analytical power. Archil enables high-performance interaction with active data—each one optimized for its role.

Designing explicitly around these layers helps you move faster today and stay resilient as your data platform grows.