I don’t know whether blockchain technology will ultimately thrive and end up being widely used, but what is clear is that many of the innovations introduced by this disruptive technology are here to stay. From the implementation of new consensus algorithms, to the engineering of better distributed network protocols, to the design and implementation of new cryptographic primitives, blockchain technology has brought us many advancements that will prevail even if we end up abandoning its everlasting promises.
One of the advancements that may well prevail is immutable databases. Since the announcement of Amazon’s Quantum Ledger Database (Amazon QLDB) in late 2018, I have been wondering whether there was a niche in corporations for this kind of database. This week I came across an article on an open source immutable database, ImmuDB, and I decided to dig a bit deeper into the matter. First Amazon, now an open source initiative: this was something worth exploring.
How does it work?
Immutable databases are centralized database systems where information is stored in a way that allows its integrity to be cryptographically verified. Every data change is tracked, and the complete history of changes is maintained so that the integrity of the database can be verified over time. This is why we call them “immutable”: because the history of all changes performed on the data store is preserved, whenever there is an unintended or malicious modification it can be detected, reported, and in many cases even recovered from. I highly recommend this set of FAQs to get a quick understanding of what immutable databases can and cannot do.
Immutable databases use verifiable cryptographic primitives and data structures to ensure the integrity of the stored data. Let’s take ImmuDB as an example (the open source project I will talk a bit more about in just a moment). ImmuDB uses a Merkle tree to store data and protect its integrity. Thus, when at t0 we add the key k0 with value v0, the root of the database’s Merkle tree has a value of H0 (the hash of (k0, v0)). As we keep adding new information to the database, the tree keeps growing and the root keeps changing. When at t1 we update the value of k0 to v1, the tree grows a new branch and its root changes to H01. This process is repeated with every write to the database, whether it updates an existing key or stores data under a new one.
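To make this concrete, here is a minimal toy sketch in Go of an append-only Merkle root that changes with every write. This is my own simplified construction for illustration (the leaf hashing and pairing scheme are assumptions, not ImmuDB’s actual implementation):

package main

import (
	"crypto/sha256"
	"fmt"
)

// leafHash hashes a key/value pair into a fixed-size leaf.
func leafHash(key, value string) [32]byte {
	return sha256.Sum256([]byte(key + "|" + value))
}

// merkleRoot folds a list of leaf hashes into a single root,
// pairing adjacent nodes level by level (an odd node is promoted).
func merkleRoot(leaves [][32]byte) [32]byte {
	level := leaves
	for len(level) > 1 {
		var next [][32]byte
		for i := 0; i < len(level); i += 2 {
			if i+1 < len(level) {
				next = append(next, sha256.Sum256(append(level[i][:], level[i+1][:]...)))
			} else {
				next = append(next, level[i]) // promote the odd node
			}
		}
		level = next
	}
	return level[0]
}

func main() {
	// t0: add (k0, v0); the root is H0, the hash of the single leaf.
	leaves := [][32]byte{leafHash("k0", "v0")}
	fmt.Printf("root at t0: %x\n", merkleRoot(leaves))

	// t1: update k0 to v1; a new leaf is appended and the root changes.
	leaves = append(leaves, leafHash("k0", "v1"))
	fmt.Printf("root at t1: %x\n", merkleRoot(leaves))
}

Every write, update or insert alike, appends a leaf and produces a new root, and it is this chain of roots that makes the history verifiable.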
With this data structure, validating the integrity of the data stored in the system is easy. Imagine that we want to verify the integrity of the data stored by client A, k0 and k1. To do this we just need to generate a proof that the first Merkle root is consistent with the second one, generated after new data was added to the database. To build this proof we only need to (i) take the nodes from the branches of the first version of the Merkle tree (when client A added its data); (ii) take the highest possible nodes of the new branches generated in the tree after client B’s interaction with the system; and (iii) reconstruct the root of the Merkle tree and check that its value matches the root of the current tree in the database. Thus, we take H01 and H2 from the first tree, and H3 and H456 from the second tree, and reconstruct the tree up to the root. If the root obtained equals the actual root of the second version of the tree, no data has been changed since client A added its data. If, on the contrary, information had been modified in any way after client A added its data, H01 and H2 of the first tree wouldn’t match those in the second tree, leading to a different root when rebuilding the tree.
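The sketch below replays exactly this scenario with the same toy hashing as above (again, an illustration of the idea rather than ImmuDB’s real proof format): from H01 and H2 we recover client A’s root, and adding H3 and H456 we rebuild the current root.

package main

import (
	"bytes"
	"crypto/sha256"
	"fmt"
)

// hashPair concatenates two hashes and hashes the result.
func hashPair(a, b []byte) []byte {
	h := sha256.Sum256(append(append([]byte{}, a...), b...))
	return h[:]
}

// leaf hashes a raw entry.
func leaf(s string) []byte {
	h := sha256.Sum256([]byte(s))
	return h[:]
}

func main() {
	// First tree: the three leaves present when client A added its data.
	h0, h1, h2 := leaf("k0|v0"), leaf("k1|v1"), leaf("k2|v2")
	h01 := hashPair(h0, h1)
	oldRoot := hashPair(h01, h2) // the odd node h2 is paired at the top

	// Second tree: client B has appended four more leaves (h3..h6).
	h3, h4, h5, h6 := leaf("k3|v3"), leaf("k4|v4"), leaf("k5|v5"), leaf("k6|v6")
	h23 := hashPair(h2, h3)
	h45 := hashPair(h4, h5)
	h456 := hashPair(h45, h6)
	newRoot := hashPair(hashPair(h01, h23), h456)

	// Verifier side: {H01, H2} must rebuild the old root, and
	// {H01, H2, H3, H456} the new one. Any modification of old data
	// would change H01 or H2 and both reconstructions would fail.
	oldOK := bytes.Equal(hashPair(h01, h2), oldRoot)
	newOK := bytes.Equal(hashPair(hashPair(h01, hashPair(h2, h3)), h456), newRoot)
	fmt.Println("old data intact:", oldOK && newOK)
}

In this self-contained demo the check trivially passes, of course; the point is that a verifier holding only the old root and a handful of proof nodes can run the same reconstruction against the database’s current root.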
In this case Merkle trees are used to verify the integrity of data, but more complex immutable database systems could be devised where, instead of Merkle trees, other cryptographic primitives such as zero-knowledge proofs are used to ensure tamper-proofness (although, to be honest, I don’t know if in many cases it would compensate for the overhead).
Where to use it?
An immutable database is managed by a single entity; there is no distribution or replication of data between nodes owned and managed by different entities. So don’t be mistaken: immutable databases won’t replace blockchain networks at all, but they are the perfect fit for specific use cases such as the one illustrated below:
— “I want to build my own blockchain-based system to reliably track all the changes to my stock”, said the CIO of Company A.
— “That sounds great, Mr. Boss. What are the entities involved in these updates? Who needs to write to this blockchain, and what is the level of trust between the participants?”, asked Mr. HardWorker from Consultancy Company Inc.
— “Ah, no, no! Just me. I want the different units of my company to modify the data in the database as they do now, but I want to keep track of the history of all these changes so that no inconsistencies appear in our systems. What’s more, I need to be sure that this blockchain can accommodate the high transaction load of my business. But the only company writing to this blockchain will be us”, stated Mr. Boss.
— “Let me introduce you then to Immutable Databases, the solution to your problems”, concluded Mr. HardWorker triumphantly. He had the sale almost closed.
I guess this toy example makes my point. Immutable databases are ideal for use cases where we want the benefits of a tamper-proof storage system without the complexities and potential overhead of a blockchain system, because the database will only be written to by a single entity (or a small number of trusted ones).
Actually, I expect to start seeing immutable databases applied to some of these use cases in no time:
To immutably store every update to sensitive database fields (credit card or bank account data) of an existing application database.
To store CI/CD recipes in order to protect build and deployment pipelines.
To store public certificates (a widespread use case in corporate blockchains).
As an additional hash store for the checksums of digital objects.
To store log streams (i.e. audit logs) in a tamper-proof way.
Of course, I would never use an immutable database to store large pieces of data. If we need to offer tamper-proofness for large files, we can follow the blockchain approach: hash the large file, and track changes to this hash using an immutable database.
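As a quick sketch of that pattern, here is how one could stream a large file through SHA-256 and keep only the digest for the database (the file name is an arbitrary example, and the immuclient command in the final comment is the one shown later in this post):

package main

import (
	"crypto/sha256"
	"fmt"
	"io"
	"log"
	"os"
)

func main() {
	// Example path; any large artifact (backup, video, dataset) works.
	f, err := os.Open("big-backup.tar.gz")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	// Stream the file through SHA-256 so memory use stays constant
	// regardless of the file size.
	h := sha256.New()
	if _, err := io.Copy(h, f); err != nil {
		log.Fatal(err)
	}
	digest := fmt.Sprintf("%x", h.Sum(nil))
	fmt.Println(digest)

	// Only the digest goes into the immutable database, e.g.:
	//   ./immuclient safeset big-backup.tar.gz <digest>
}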
Actually, I would like to finish this section with a quote from the Amazon QLDB documentation, which perfectly states where to use immutable databases:
Q: Is Amazon Quantum Ledger Database a distributed ledger or blockchain service?
Amazon QLDB is not a blockchain or distributed ledger technology. Blockchain and distributed ledger technologies focus on solving the problem of decentralized applications involving multiple parties where there can be no single entity that owns the application, and the parties do not necessarily trust each other fully. On the other hand, QLDB is a ledger database purpose-built for customers who need to maintain a complete and verifiable history of data changes in an application that they own. Amazon QLDB offers history, immutability and verifiability combined with the familiarity, scalability and ease of use of a fully managed AWS database. If your application requires decentralization and involves multiple, untrusted parties, a blockchain solution may be appropriate. If your application requires a complete and verifiable history of all application data changes, but does not involve multiple, untrusted parties, Amazon QLDB is a great fit.
Having a look at ImmuDB
Enough with the theory. Let’s see an immutable database in action. I will focus on ImmuDB, the open source project I mentioned at the beginning of this post. And we’ll start with a video (it is always relaxing to watch one before getting to work).
ImmuDB consists of the following parts:
immudb is the server binary that listens on port 3322 on localhost and provides a gRPC interface.
immugw is the intelligent REST proxy that connects to immudb and provides a RESTful interface for applications. It is recommended to run immudb and immugw on separate machines to enhance security.
immuadmin is the admin CLI for immudb and immugw. You can install and manage the service installation for both components and get statistics as well as runtime information.
The easiest way to run ImmuDB and start playing with it is to clone the repo, build all the binaries, and start the database and the gateway as follows:
$ git clone https://github.com/codenotary/immudb.git
$ cd immudb
$ make all
$ ./immudb -d
$ ./immugw -d
You can also run ImmuDB using Docker. In my case I wasn’t able to connect the gateway to the database, which is why I went for the local deployment, but I thought it was worth mentioning.
$ docker run -it -d --name immudb -p 3322:3322 -p 9497:9497 codenotary/immudb:latest
$ docker run -it -d -p 3323:3323 --name immugw --env IMMUGW_IMMUDB_ADDRESS=immudb codenotary/immugw:latest
With the database running, I wanted to try an SDK to integrate ImmuDB with a simple application, but apparently only the REST API and the gRPC interface are available to interact with the system for now. According to the documentation, drivers will soon be available for Java, .NET, Golang, Python, and Node.js, but until then we will have to settle for the immuclient CLI.
Adding a key and a value with the immuclient is pretty straightforward:
$ ./immuclient safeset mykeytest1 myvaluetest1
We see that the result of the command is the addition of the key to the database, the hash of the data, and whether the addition was verified. As long as no data in the database is forged, we will keep getting this output every time we add new data.
To get a key from the database we can use:
$ ./immuclient safeget mykeytest1
Or we can get the history:
$ ./immuclient history mykeytest1
If you want to “maliciously” modify information, you can go to ./db (the default data directory) and mess around with the files. I invite you to do this if you are curious to see what happens when you try to add new data once the database has been “corrupted” ;)
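If you prefer to script the corruption, a throwaway sketch like this flips a single byte in one of the files under ./db (the file name and offset are placeholders; point it at whatever file you actually find there):

package main

import (
	"log"
	"os"
)

func main() {
	// Placeholder path: replace with a real file from your ./db directory.
	path := "db/somefile"
	f, err := os.OpenFile(path, os.O_RDWR, 0)
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	// Read one byte at an arbitrary offset, flip its bits, write it back.
	buf := make([]byte, 1)
	if _, err := f.ReadAt(buf, 100); err != nil {
		log.Fatal(err)
	}
	buf[0] ^= 0xFF
	if _, err := f.WriteAt(buf, 100); err != nil {
		log.Fatal(err)
	}
	log.Println("flipped one byte in", path)
}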
Finally, if you were wondering about the kind of load scenarios in which this technology can be applied, ImmuDB’s GitHub repo includes some notes on performance.
This is all for today, folks! I would love to know what you think about immutable databases and the potential uses they may have. And if I don’t hear from you before, see you next week!
Thanks for this introductory article; I find it interesting that hash-addressed storage is suddenly entering the gestalt. I'm still trying to figure out the use-case for these centralised merkle-proof databases though; their value seems to be in proving to _others_ that such-and-such hasn't been corrupted. The problem with Merkle trees is that you can only prove that such-and-such a tree is not corrupt; you can't prove that it is the 'right' tree. It'd be very easy for a corrupt engineer (or CEO) to compute a new ledger and quietly dispose of the one with incriminating evidence of some sort of corporate fraud.
Perhaps the value of these databases is for an auditor or other third-party oversight entity to run the ledger. Now it's starting to look an awful lot like... A blockchain ;-)
I also think there's no problem in storing huge blobs in these databases. I think it's fairly cheap to compute hashes for them (and even if it were expensive, you have to calculate the hash anyway, and computing the merkle proof after that is super cheap). As long as the tech can handle it, and there isn't any economic barrier (e.g., expensive hosted data storage, gas cost on a public blockchain) I don't see an issue.
I also think there's exciting opportunity for distributed versions of this that don't use blockchain. I can think of one really popular example: Git! It's essentially a hash-addressed, immutable, merkle-proof-secured, distributed key/value store. In fact it's the hash addressing that makes it so great for a distributed system; as a monotonically increasing data set, it adheres to the CALM principle (Kleppmann and Alvaro), which means that coordination and consensus protocols aren't even needed for use in a distributed system. (Note: branch pointers are the only exception; as non-monotonically-increasing data points they do require coordination, which usually looks like human agreement to not clobber the remote with `git push -f` 😉)
I like to at least imagine a world in which authenticated data is easily distributed around. I started working on something similar to the described databases, with a different take: each "block" is a read-only SQLite database: https://github.com/ivoras/daisy . And as an important step, starting and cloning / distributing databases is "daisy newchain ..." and "daisy pull ..."