December 19, 2024

Introducing: Chakra

Mission

Chakra is on a mission to organize the world’s structured data.

We’re building tooling that enables researchers and agents to access the most important datasets in a cost-effective, up-to-date, and royalty-rich fashion.

With Chakra, anyone can access critical structured datasets for analytics as well as for model training, retrieval, and inference. Most importantly, value from the use of those datasets will flow back to the creators of that data.

Problem Space

Today, LLMs (large language models) perform poorly when referencing structured data.

Without access to a unified data store, LLMs “try their best” using unstructured datasets like news, but ultimately do not create a rich, engaging, or accurate end-experience for consumers.

First, what do we mean by structured data?

Structured data is tabular - rows and columns that typically sit in a spreadsheet or, in the case of enterprises, in a “data warehouse.” Popular traditional data warehouse companies include Snowflake and Databricks.

Examples of the most important structured datasets include:

  1. Financial data (stocks, bonds, crypto, etc.)
  2. Social data (Twitter, Reddit, Stack Overflow, LinkedIn)
  3. Census and demographic data (population, housing, income, education)
  4. Consumer data (credit card spend)

Today, these datasets largely sit siloed in the originating organization’s data warehouse.

What are some examples of LLMs performing poorly when referencing structured data?

Let’s ask a few LLMs a very basic question:

What’s the SPY vs BTC performance over the last 5 years?

Claude (3.5 Sonnet): returns code to plot the query, populated with hallucinated data.

ChatGPT (o1): performs slightly better (it has access to search with the latest update), but the charts are largely useless because it lacks access to the raw data. The timestamps of the data aren’t quite right, but the values are at least accurate!

Financial datasets are by far the best catalogued in news, which the LLMs (with search) have access to, but a query about Twitter yields far worse results.

What’s Anatoly Yakovenko’s latest tweet?

Perplexity: references tweets from 9-12 months ago (found in popular news articles)

ChatGPT (4o): slightly better, but it still references very stale data from news articles on outlets like The Block and Medium.

Why do LLMs perform poorly at referencing structured datasets?

1. Access: The APIs for these key datasets are typically gated. Despite users contributing all of the value to the data, datasets like Twitter and Reddit are either (1) blocked for LLM inference or (2) inordinately expensive.

2. Architecture: Most LLM “search” engines are engineered in the following way:

  • Search the open web based on the user query and get back the top N results
  • Rank the results
  • Shove the resulting data in a context window and hope the LLM can serve relevant results

This approach works very well for unstructured user queries (“plan me a 4-day trip to Italy”), and Perplexity’s success speaks to this.
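Here is a minimal sketch of that retrieve-rank-stuff pattern; web_search, rerank, and llm_complete are hypothetical stand-ins, not real library calls:

from dataclasses import dataclass

@dataclass
class Document:
    url: str
    text: str

def web_search(query: str, limit: int) -> list[Document]:
    # Hypothetical stand-in for an open-web search API.
    raise NotImplementedError

def rerank(query: str, docs: list[Document]) -> list[Document]:
    # Hypothetical relevance ranker (real systems often use a cross-encoder).
    raise NotImplementedError

def llm_complete(prompt: str) -> str:
    # Hypothetical call into an LLM completion endpoint.
    raise NotImplementedError

def answer(query: str, n_results: int = 10) -> str:
    docs = web_search(query, limit=n_results)      # 1. search the open web
    ranked = rerank(query, docs)                   # 2. rank the results
    context = "\n\n".join(d.text for d in ranked)  # 3. stuff the context window
    prompt = f"Context:\n{context}\n\nQuestion: {query}"
    return llm_complete(prompt)                    # ...and hope for relevance

Notice that nothing in this loop ever consults a structured datastore; the model only sees whatever prose the search step happened to surface.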

This architecture struggles with the examples above because the LLM never references a unified datastore for structured data.

This is changing with recent innovation in agent architectures.

Regardless of the architecture implementation, which we expect will improve, access will remain a problem for these agents.

3. Royalty / Value Flow: The current ecosystem lacks a mechanism for fairly compensating data contributors. When companies like Twitter or Reddit generate valuable datasets through user contributions, those users receive no compensation when their data is used for AI training or inference.

Even when organizations are willing to pay for data access, there's no established framework for:

  • Tracking how the data is actually used
  • Ensuring value flows back to original contributors
  • Maintaining transparency around data usage and monetization
  • Creating incentives for continued high-quality data contributions

This creates a negative feedback loop where valuable datasets become increasingly restricted, expensive, or stale, ultimately limiting the capabilities of AI systems that could benefit from this structured data.

Our Masterplan

1. Build a verifiable data warehouse (live in private beta) where enterprises can store and transform their data.

2. Develop an open data marketplace where enterprises that create proprietary data can share and monetize their datasets.

3. Launch our consumer platform where users can contribute their valuable datasets to the data marketplace. Our query attribution engine ultimately ties all usage of the end data back to the contributing user.

4. Build an agent kit where researchers can seamlessly incorporate structured data into model inference.

Verifiable Data Warehouse

In order to meet our mission of organizing the world’s structured data, we need a place for the data to live. We’ve spent the last few months creating a warehouse that looks and feels like Snowflake, but under the hood uses decentralized (1) storage and (2) compute primitives. This provides the following advantages:

  1. Cost - 70% cheaper than comparable centralized products. Largely as a result of economies of scale, these infrastructure primitives are much cheaper to operate, and we pass those savings on to our customers.
  2. Security - we work with SOC 2 compliant, encrypted storage providers like Impossible Cloud and Akave. All of the cost advantages of decentralized physical infrastructure (DePIN), none of the security or privacy headaches.
  3. Censorship resistance - critically important for the user-contributed data that will live on our data marketplace.
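To make the “looks and feels like Snowflake” claim concrete, here is a hypothetical sketch of the developer experience we are aiming for; the chakra package, its connect() call, and the table name below are illustrative assumptions, not a published API:

# Hypothetical client sketch; the package name, connect(), and query()
# are illustrative assumptions, not a published API.
import chakra

# Connect much like you would to a traditional warehouse...
conn = chakra.connect(api_key="YOUR_API_KEY")

# ...and issue familiar SQL, while storage and compute run on
# decentralized providers under the hood.
rows = conn.query("""
    SELECT date, close
    FROM market_data.spy_daily
    WHERE date >= '2020-01-01'
    ORDER BY date
""")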

Open Data Marketplace

We plan to prioritize two categories of datasets:

  1. Enterprise datasets - clear value transfer between the data buyer and data producer.
  2. Hybrid datasets - data where the contributing value comes from end consumers. We’re building a query attribution engine: every time a data buyer queries such a dataset, value flows back to the contributing users.
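As a toy illustration of what per-query attribution could mean (the fee and the revenue split below are invented numbers, and the function is a sketch rather than our production engine):

from collections import defaultdict

QUERY_FEE = 0.01         # hypothetical fee a data buyer pays per query
CONTRIBUTOR_SHARE = 0.7  # hypothetical fraction routed back to contributors

def attribute_query(rows_touched: list[dict]) -> dict[str, float]:
    # Split the query's contributor pool across the users whose rows
    # the query actually touched, weighted by row count.
    pool = QUERY_FEE * CONTRIBUTOR_SHARE
    counts: dict[str, int] = defaultdict(int)
    for row in rows_touched:
        counts[row["contributor_id"]] += 1
    total = sum(counts.values())
    return {user: pool * n / total for user, n in counts.items()}

# Example: a query touches three rows from two contributors.
payouts = attribute_query([
    {"contributor_id": "alice"},
    {"contributor_id": "alice"},
    {"contributor_id": "bob"},
])
# payouts -> {'alice': 0.0047, 'bob': 0.0023} (approximately)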

Consumer Platform

This will be the home of user-contributed datasets. Users can opt in to contributing data and earning royalties on it. Datasets we plan to prioritize include Reddit, Twitter, Stack Overflow, GitHub, LinkedIn, and more.

Collection will be anonymized, with clear privacy policies in place, so that the contributing user’s identity stays secure.

Agent Kit

The Agent Kit is how researchers ultimately interface with the marketplace data. We’re building a rich Python SDK to seamlessly pull, transform, and synthesize data from our marketplace into model inference.
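A hypothetical sketch of how that could feel in practice; the chakra_agent_kit module and every call below are assumptions made for illustration, not a shipped interface:

# Hypothetical Agent Kit usage; the module name and every call here
# are illustrative assumptions, not a shipped interface.
from chakra_agent_kit import Marketplace

mp = Marketplace(api_key="YOUR_API_KEY")

# 1. Pull a structured dataset from the marketplace.
prices = mp.dataset("market_data/spy_daily").to_dataframe()

# 2. Transform it with ordinary pandas operations.
prices["daily_return"] = prices["close"].pct_change()

# 3. Synthesize it into model inference as grounded context.
answer = mp.agent("analyst").ask(
    "How did SPY perform over the last 5 years?",
    context=prices.tail(252 * 5),  # ~5 years of trading days
)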

Some example agents we expect can be deployed based on marketplace data:

  • Dataixbt - an agent with the same vernacular as aixbt but with access to onchain data (like Akash)
  • TechRecruitingAgent - by analyzing GitHub contributions, Stack Overflow answers, and LinkedIn data, this agent could identify promising developers in specific domains (like Rust or ML), spotting emerging talent based on meaningful contributions rather than just keywords
  • Inversebrahgent - an agent that combines historical price data, social media discussions, and git commits to tell the story of important moments in crypto history, linking price movements to developer activity and sentiment
  • PelosiAgent - an agent that analyzes Nancy Pelosi’s trading patterns and shares key insights
  • TolyAgent - an agent that speaks technical gibberish in the same manner as Anatoly

Ultimately, our four pieces - warehouse, marketplace, consumer platform, and agent kit - must work together harmoniously, like distinct chakras, to build a best-in-class experience.

Team

Our team combines deep experience building performant, high-quality data systems with a track record of creating rich consumer experiences.

We hail from organizations like Microsoft, Snap, hedge funds, and Berachain, as well as from independent MEV searching. If this mission speaks to you, check out our jobs board.

Join Our Community

Website: https://chakra.dev

Twitter / X: https://x.com/chakra_ai

Discord: https://discord.gg/chakra-ai

Careers: https://chakra-labs.notion.site


Join The Community

SELECT * FROM data_engineering_community WHERE enthusiasm = 'high'

Drop in with whatever greeting passes your validation checks - GM, hello_world(), or a simple wave.