kamu-data/kamu-cli

About

kamu (pronounced kæmˈuː) is a command-line tool for management and verifiable processing of structured data.

It's a green-field project that aims to enable global collaboration on data on the same scale as seen today in software.

You can think of kamu as:

  • Local-first data lakehouse - a free alternative to Databricks / Snowflake / Microsoft Fabric that can run on your laptop without any accounts, and scale to a large on-prem cluster
  • Kubernetes for data pipelines - an infrastructure-as-code framework for building ETL pipelines using a wide range of open-source SQL engines
  • Git for data - a tamper-proof ledger that handles data ownership and preserves full history of changes to source data
  • Blockchain for data - a verifiable computing system for transforming data and recording fine-grained provenance and lineage
  • Peer-to-peer data network - a set of open data formats and protocols for:
    • Non-custodial data sharing
    • Federated querying of global data as if one giant database
    • Processing pipelines that can span across multiple organizations.

Featured Video

Kamu: Unified On/Off-Chain Analytics Tutorial

Quick Start

Use the installer script (Linux / macOS / WSL2):

curl -s "https://get.kamu.dev" | sh

How it Works

Ingest from any source

kamu works well with popular data extractors like Debezium and provides many built-in sources, ranging from polling data on the web to MQTT brokers and blockchain logs.

Ingesting data
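
As a rough sketch of the flow (the dataset name below is illustrative): the source - a URL to poll, an MQTT topic, or a blockchain log - is declared once in the dataset's metadata, and pulling the dataset fetches and appends whatever is new:

# Poll the declared source and append any new records to the dataset
kamu pull covid19.case-details

# Inspect the most recently ingested records
kamu tail covid19.case-details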

Track tamper-proof history

Data is stored in Open Data Fabric (ODF) format - an open Web3-native format inspired by Apache Iceberg and Delta.

In addition to "table" abstraction on top of Parquet files, ODF provides:

  • Cryptographic integrity and commitments
  • Stable references over real-time data
  • Decentralized identity, ownership, attribution, and permissions (based on W3C DIDs)
  • Rich extensible metadata (e.g. licenses, attachments, semantics)
  • Compatibility with decentralized storages like IPFS

Unlike Iceberg and Delta, which encourage continuous loss of history through Change-Data-Capture, the ODF format is history-preserving. It encourages working with data in event form and dealing with inaccuracies through explicit retractions and corrections.
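
This history is directly inspectable: a dataset is an append-only chain of metadata blocks, each referencing the hash of its predecessor. A minimal sketch (the dataset name is illustrative and the output shape varies by version):

# Print the dataset's metadata chain - data slices, watermarks, schema and
# source changes - as a hash-linked, append-only log
kamu log covid19.case-details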

Explore, query, document

kamu offers a wide range of integrations, including:

  • Embedded SQL shell for quick EDA
  • Integrated Jupyter notebooks for ML/AI
  • Embedded Web UI with SQL editor and metadata explorer
  • Apache Superset and many other BI solutions
SQL Shell
Integrated Jupyter notebook
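
All of these are launched straight from the CLI; a quick sketch (which engines and notebook images are available depends on your setup):

# Interactive SQL shell for ad-hoc exploration
kamu sql

# Jupyter environment with access to the workspace datasets
kamu notebook

# Embedded Web UI with the SQL editor and metadata explorer
kamu ui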

Build enterprise-grade ETL pipelines

Data in kamu can only be transformed through code. An SQL query that cleans one dataset or combines two via JOIN can be used to create a derivative dataset.

kamu doesn't implement data processing itself - it integrates many popular data engines (Flink, Spark, DataFusion...) as plugins, so you can build an ETL flow that uses the strengths of different engines at different steps of the pipeline:

Complex ETL pipeline in Kamu Web UI
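
As a rough sketch (the manifest and dataset names here are hypothetical): a derivative dataset's manifest declares its inputs, the engine to run on, and the SQL query, and the pipeline is advanced by pulling the derived dataset:

# Register a derivative dataset whose manifest declares its inputs,
# the engine (e.g. datafusion, spark, flink), and the SQL transformation
kamu add case-details.cleaned.yaml

# Run the transformation, catching the derivative up with its inputs
kamu pull case-details.cleaned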

Get near real-time consistent results

All derivative datasets use stream processing, which results in some revolutionary qualities:

  • Input data is read only once, minimizing traffic
  • Configurable balance between low latency and high consistency
  • High autonomy - once a pipeline is written, it can run and deliver fresh data forever with little to no maintenance.
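
In practice this means refreshing an entire pipeline is a single, idempotent operation; a minimal sketch (check kamu pull --help for the exact flags in your version):

# Pull everything in the workspace: root datasets poll their sources, and
# derivative datasets re-run their streaming transformations over the new input
kamu pull --all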

Share datasets with others

ODF datasets can be shared via any conventional (S3, GCS, Azure) or decentralized (IPFS) storage and easily replicated. Sharing a large dataset is as simple as:

kamu push covid19.case-details "s3://datasets.example.com/covid19.case-details/"

Because dataset identity is an inseparable part of the metadata, a dataset can be copied, but everyone on the network will know who the owner is.

Reuse verifiable data

kamu will store the transformation code in the dataset metadata and ensure that it's deterministic and reproducible. This is a form of verifiable computing.

You can send a dataset to someone else and they can confirm that the data they see in fact corresponds to the inputs and code:

# Download the dataset
kamu pull "s3://datasets.example.com/covid19.case-details/"

# Attempt to verify the transformations
kamu verify --recursive covid19.case-details

Verifiability allows you to establish trust in data processed by someone you don't even know and detect if they act maliciously.

Verifiable trust allows people to reuse and collaborate on data on a global scale, similarly to open-source software.

Query the world's data as one big database

Through federation, data in different locations can be queried as if it were all in one big data lakehouse - kamu takes care of computing the results most efficiently, potentially delegating parts of the processing to other nodes.

Every query result is accompanied by a cryptographic commitment that you can use to reproduce the same query days or even months later.
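
As a small local illustration (the dataset name is hypothetical, and the commitment-carrying responses described above are a property of the query API rather than of this particular command):

# Run a one-off query against workspace datasets without opening the interactive shell
kamu sql --command 'SELECT count(*) FROM "covid19.case-details"'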

Start small and scale progressively

kamu offers unparalleled flexibility of deployment options:

  • You can build, test, and debug your data projects and pipelines on a laptop
  • Incorporate online storage for larger volumes, while still processing data locally
  • When you need real-time processing and 24/7 querying, you can run the same pipelines in kamu-node deployed as a small server
  • A node can be deployed in Kubernetes and scale to a large cluster.

Get data to and from blockchains

Using kamu you can easily read on-chain data to run analytics on smart contracts, and provide data to blockchains via the novel Open Data Fabric oracle.

Community

If you like what we're doing, support us by starring the repo - it helps us a lot!

Subscribe to our YouTube channel to get fresh tech talks and deep dives.

Stop by and say "hi" in our Discord Server - we're always happy to chat about data.

If you'd like to contribute, start here.