Flock Architecture: Our Journey Building a DuckDB-Powered Warehouse

Intro
Last time, we talked about why data warehouses today suck. To be honest, we were being audacious: Snowflake and Databricks are great products, but they're expensive.
This is covered in depth in the previous post, but they are expensive because, as a user, you:
- have limited control over your infrastructure
- pay heavy margins on top of the infra costs
- pay a “big data” bill even if you don’t have “big data”
In this post, we’ll discuss how we addressed these concerns in our product.
We’ll review the Chakra flock architecture, benchmarks on our storage providers, and an honest review of our shortcomings and sources of optimization long-term.
What’s the hype about DuckDB?
DuckDB is one of the most exciting new technologies for data engineers and data scientists over the last few years.

What is it?
- DuckDB is an in-process SQL database system that allows you to analyze data quickly and efficiently. It uses a columnar, vectorized query engine, processes data in batches, and makes efficient use of the CPU cache, making it 20-40 times faster than traditional relational databases.
- It’s easy to use. It uses an industry standard SQL dialect that most data scientists and engineers are already comfortable with.
- It supports a host of integrations including parquet, delta-lake, iceberg and other open-table formats.

What is it NOT?
- A massively parallel processing (MPP) system. All queries run on one machine as opposed to many machines.
- A multi-player experience. DuckDB traditionally runs locally on a single machine, which makes it hard to share and consume data within and outside of your organization.
- A concurrent system. DuckDB doesn't support multiple read/write processes, even if they are non-conflicting.
In building our architecture, we had to address what DuckDB is not while championing its best qualities - its speed, ease of use, and host of integrations.
Scaling Out vs Scaling Up
On the first point: while DuckDB does not support MPP, it can handle most workloads by scaling up instead of scaling out. By scaling up, we specifically mean increasing capacity on a single machine instead of federating compute across many.

While the choice between scaling out vs scaling up depends on the application, our experience suggests that scaling up is more cost-effective for our use case, which we can pass on to our customers.
This conclusion is supported by both literature [1, 2] and our empirical results. Scaling up avoids the hidden complexities often found in scaled-out systems, such as:
- Process start-up times that add latency
- Map-reduce operations that create overhead
- Collection steps that can bottleneck performance across most data sizes
Multiplayer Support + Concurrency
When we say DuckDB does not support multiplayer mode, we mean multiple read/write requests on the same data assets. DuckDB supports the following configurations:
- One process can both read and write to the database.
- Multiple processes can read from the database, but no processes can write.

We opted for the first configuration, where all requests for the same data assets within an organization route to the same process. Importantly, concurrency is supported within that single process, allowing us to service multiple requests from the same organization by routing them to it.
The limitation of this infrastructure is, of course: what happens if hundreds of users want to write to the same data assets? Write contention between users is a common challenge in real-time data warehouses. We've segmented processes by "database," meaning that if organizations opt for logical partitioning of data (customers, goods, services, etc.), they shouldn't run into write hot-spotting very often. Separately, on reads, all of the assets on our data marketplace are attached directly from object storage via S3 as read replicas, so they have no concurrency issues at all.
Flock Architecture
Okay, great! We have a solve for MPP and an approach for multiplayer + concurrency. Overall, our flock architecture works as follows:

New User Sign-Up:
- Provision a new bucket in object storage on behalf of the user, with appropriate access controls (ACLs). Users ultimately own the keys to this storage.
- Seed the bucket with a `.db` file. This contains all metadata for our tables along with manually uploaded data. Today, this file also holds programmatically added data, but we're actively working to move that data into an Open Table Format (Iceberg).
- Spin up a Database Session.
New Database Session:
- Spin up a new Compute process.
- Pull the `.db` file into the Compute process locally. Why don't we just reference this file in object storage directly? DuckDB's S3 attach only supports read operations (which we use in the data marketplace), so we must pull this file on session start.
- Process read/write user queries on the local `.db` file, directly against persistent disk.
Session Shutdown:
- Sync the `.db` file back to object storage.
We use an intelligent orchestration system to shut down sessions based on historical user query patterns. Minimizing session creations and shutdowns is a continued source of optimization that we are refining as we better understand usage patterns.
Crucially, with the Flock Architecture, users:
- retain extensive control of their object storage with none of the maintenance headache
- can dynamically select providers across storage (and soon compute)
- can enjoy a multi-player, concurrent experience
- can enjoy <100ms latency on most queries because the datasets are read directly from disk
Object Storage Providers
The important caveat of the flock architecture is that it's extremely egress-heavy.
On session creation, we pull the `.db` file locally, which results in significant movement of data. While we support AWS S3, we're working with some awesome storage providers that charge no or low egress costs.
In addition to charging no or low egress, these providers are materially cheaper than S3, by 25-55%, savings we can pass on to our users.
How do they perform against a similar no-egress product like Cloudflare R2?


Net / net: AWS is still the king on performance, but our other providers perform as well as (if not better than) Cloudflare R2 while being significantly cheaper. While the object storage sync is important, performance currently matters most at session creation and shutdown, which we're continuing to refine via the orchestration system.
Try it out!
If you're interested in checking out our performant, multi-player DuckDB experience, you can sign-up for free on our website.
We have a rich catalog of integrations (with more to follow), including:
- A first-class Python SDK
- REST API
- Marimo Notebook for light-weight BI
Next Time
A look at DuckDB performance testing and a detailed analysis of cost and pricing.