@adlrocha - Performance Best Practices in Hyperledger Fabric I: Introduction
The "Fabric Into Production" Series (Part 1)
Alfonso de la Rocha · Mar 22
For the past few months my team and I have been extensively analyzing, understanding, and improving the performance of Hyperledger Fabric in order to draw up a set of performance best practices for the deployment of Fabric production networks. I was supposed to present this work first in a talk at the Hyperledger Global Forum in Phoenix, and on my return at a Hyperledger Meetup in Madrid. Unfortunately, due to the well-known global circumstances we are living through these days, this was impossible.
I didn’t want all this work to be forgotten, especially when I think it could benefit the Fabric community so much. A lot of companies are developing use cases over Hyperledger Fabric, but few are talking about how hard moving a Fabric-based use case into production can be, and about the best practices to follow in order to succeed at it. That is why I decided to start this series of publications, where I will share everything we learned while moving our proof-of-concept Hyperledger Fabric networks into production. With this series I hope to lay the groundwork for further public performance analysis in corporate environments (my initial goal when submitting my talk to the Hyperledger Global Forum).
I guess everyone starts developing their first Fabric-based use case in the same way. You go to Fabric’s “Build Your First Network” documentation, you deploy your simple network, you implement your business logic in a chaincode, deploy it in your “first network” and voilà! You’ve deployed your first blockchain use case. This is great for proofs of concept, because with this simple setup you can test whether it makes sense to use blockchain technology for your business problem; it allows you to validate the UX, implement and fine-tune your chaincode business logic, and even try your system with a limited number of users.
The problem comes once you’ve validated all of your preliminary assumptions, you’ve seen that your system works and is adding value to your company, and you want to move your use case “as-is” into production… this is where things start to break.
Your “first network” was prepared to comfortably accommodate a few thousand transactions a day to support your PoC use case, but in production you start noticing a significant increase in the number of transactions per second. And with this increase in the load on your network, the pressure and stress you feel, and the sweat on your forehead, increase proportionately. At this point you start questioning many of your initial assumptions:
How much load is my infrastructure ready to support? I didn’t consider this.
What is the optimal number of peers in the network for my use case? My “first network” has a few of them, but is this enough? I don’t know.
Users interact with the Fabric network through an SDK. Am I using it right? Could the SDK be harming my network’s performance?
And WTF is this MVCC Conflict Error I am getting with high loads?
ARGHHHH!!!! Oh my! The performance! The network! Everything is falling apart!!
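That MVCC conflict error, at least, has a simple explanation. Fabric records the version of every key a transaction reads at endorsement time and, at commit time, invalidates the transaction if any of those keys has changed in the meantime. Here is a conceptual sketch of that validation rule in plain Python (this is an illustration of the mechanism, not Fabric code; all names are mine):

```python
# Conceptual sketch (not Fabric code): why MVCC_READ_CONFLICT appears
# under load. The ledger tracks a version per key; a transaction is
# valid only if every key it read is still at the version it observed
# during endorsement.

class StateDB:
    def __init__(self):
        self.store = {}  # key -> (value, version)

    def read(self, key):
        return self.store.get(key, (None, 0))

    def commit(self, read_set, write_set):
        # Validation phase: every key read must still be at the
        # version observed during endorsement.
        for key, version in read_set.items():
            if self.read(key)[1] != version:
                return "MVCC_READ_CONFLICT"
        # Commit phase: apply writes, bumping each key's version.
        for key, value in write_set.items():
            self.store[key] = (value, self.read(key)[1] + 1)
        return "VALID"

db = StateDB()
db.commit({}, {"balance": 100})  # seed the key at version 1

# Two transactions endorsed concurrently read the same key version...
_, v = db.read("balance")
tx1 = ({"balance": v}, {"balance": 90})
tx2 = ({"balance": v}, {"balance": 80})

print(db.commit(*tx1))  # VALID
print(db.commit(*tx2))  # MVCC_READ_CONFLICT
```

Under high load, many in-flight transactions get endorsed against the same key version and race to commit; all but the first are invalidated, which is exactly why this error shows up the moment traffic grows.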
The aforementioned scenario is more than typical when moving your first blockchain network into production. The underlying complexity of blockchain platforms sometimes makes it hard to understand their core mechanisms and their behavior under high load. In order to explain why Fabric’s performance degraded so much with a high transaction throughput, we had to understand what the bottlenecks of the infrastructure were. The first thing we did when we faced this problem was to turn to the literature. Had anyone already published anything related to the performance of Hyperledger Fabric? Fortunately, there were a few interesting works out there. These were the ones that helped us the most:
And almost at the end of our performance evaluations (January 2020), this great paper was published, confirming many of the conclusions we had reached. It came to us as if it had fallen from the sky:
These papers were very helpful for our work, but they differed in certain aspects from the analysis we were looking to do:
These papers were strictly academic: they considered isolated test environments instead of the real corporate environments we were facing.
These papers assumed an unlimited amount of underlying computing power available to run the network, while in our case every new machine meant an additional cost that we had to add to our use case’s cost model.
We needed to understand how Hyperledger Fabric scaled not only technically, in terms of performance, but also in terms of costs, as we wanted to build products whose cost model is sustainable enough to scale with the load of the production system.
In short, the result of our work wasn’t supposed to be an optimal configuration of the network and the maximum number of transactions an infrastructure could accommodate, but a set of tuples [infrastructure, performance, cost], so that, according to the product and the use case, we could decide on an optimal setup (again, both technically and in terms of costs).
The Experimental Setup
For our experiments we considered an environment that most closely resembled what we were aiming to build in production (and where we think permissioned blockchain networks are moving): a dynamic, multi-client, general-purpose network where all members can deploy and interact with chaincodes and applications. This differs from the typical setup, where a Hyperledger Fabric network hosts a very specific use case whose expected load can be “predicted”. In our case, what we want to build are general-purpose networks able to accommodate several heterogeneous use cases, each with its own requirements and expected load. And we wanted the networks to be able to offer an assured baseline performance to each of them.
Consequently, we used the following infrastructure as a baseline for our tests:
We considered a network with a set of peers (two in the baseline setup) and their corresponding state databases (CouchDB or LevelDB). Throughout the tests we modified the number of peers, the databases used, etc. from the baseline configuration in order to understand their impact on performance. More about this in future publications of the series.
To let applications from the outside world interact with the network, we built a driver API that abstracts the complexity of the Fabric SDK library. If you come from the Ethereum world, what we built here is the equivalent of the RPC endpoint used to interact with Ethereum nodes. Each peer has one of these HFService drivers connected to it. Building the HFService has been key to understanding the impact of the SDK on the performance of the network (again, you’ll learn more about this by the end of the series).
Throughout all our tests we used Raft for our ordering service, and in our baseline configuration we used three ordering nodes. As part of our exploration, we also played with this number to understand the impact of orderers on performance.
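Three is the usual starting point because Raft only stays available while a majority (quorum) of ordering nodes is up, so a cluster of n nodes tolerates ⌊(n − 1)/2⌋ crashed orderers. A quick sketch of that arithmetic (standard Raft math, not something specific to our setup):

```python
# Raft availability math: a cluster of n ordering nodes needs a
# majority to make progress, so it tolerates floor((n - 1) / 2) crashes.

def quorum(n):
    return n // 2 + 1

def tolerated_crashes(n):
    return (n - 1) // 2

for n in (1, 3, 5, 7):
    print(f"{n} orderers: quorum {quorum(n)}, tolerates {tolerated_crashes(n)} crash(es)")
```

So three orderers survive one failure, five survive two, and even numbers buy you nothing over the odd number below them, which is why baseline clusters are sized 3 or 5.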
Finally, as we are focused on real production environments, for the deployment of our networks we used Kubernetes for container orchestration, Prometheus for monitoring our infrastructure, and Gatling to generate the test load on the network.
The Testing Layers
The goal of our analysis was to deeply understand Fabric’s architecture in order to draw clear conclusions about how the different layers affect performance, and about the potential bottlenecks that might appear under high transaction loads. Thus, we decided to abstract a Fabric network into the following layers to ease our analysis. This is the structure we followed throughout all our tests. As already mentioned, while we were approaching the end of our work, Chaum et al. published a similar abstraction stack in their work, reinforcing our abstraction hypothesis:
Infrastructure: This layer represents the underlying infrastructure hosting the different modules of the Fabric network. In the analysis of this layer we’ll see how the deployment of a Fabric network over Kubernetes or bare metal, the available computational resources of the infrastructure, and even its latency, can affect the network’s performance.
Architecture: In this layer we will focus on analyzing optimal architecture configurations to get the most out of the underlying infrastructure. Here we managed to answer important questions such as: what is an optimal number of peers, how can we dynamically scale a Fabric network, and when is it better to use CouchDB or LevelDB in your network.
Protocol: We could definitely achieve better performance by modifying Fabric’s core protocol, but this would mean having to maintain a fork of Fabric’s source code and build a support team for the project. Hence, at this layer we focused only on optimal protocol configurations to fine-tune the performance of our network (fine-tuning the number of transactions per block, the endorsement policies, CouchDB’s cache strategy, etc.).
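To give a flavor of the kind of knob this layer exposes: the block-cutting parameters live in the orderer section of configtx.yaml. The values below are purely illustrative, not our recommendations (those come later in the series):

```yaml
Orderer:
  # Illustrative values only: the right numbers depend on your load profile.
  BatchTimeout: 2s            # max time to wait before cutting a block
  BatchSize:
    MaxMessageCount: 100      # max transactions per block
    AbsoluteMaxBytes: 10 MB   # hard cap on block size
    PreferredMaxBytes: 2 MB   # preferred block size
```

Bigger blocks amortize commit overhead at high load but add latency at low load, which is why these numbers are a tuning exercise rather than a one-size-fits-all setting.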
Chaincodes: Yes, people! Chaincodes are critical for performance. How you implement your business logic in the blockchain can deeply affect your future performance in production. In our case, we did extensive work profiling and refactoring all of our chaincodes to improve the performance of our network, and in the process we managed to infer several design best practices for chaincodes that I will share with you throughout this series.
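As a taste of the kind of pitfall involved: chaincode must be deterministic, because every endorsing peer executes it independently, and if their write-sets disagree the transaction cannot satisfy the endorsement policy. A conceptual sketch in plain Python (not real chaincode; the function names are mine):

```python
# Conceptual sketch (not real chaincode): why chaincode must be
# deterministic. Each endorser simulates the transaction on its own,
# and the resulting write-sets must match.

import uuid

def deterministic_cc(args):
    # The write-set depends only on the inputs, so all endorsers agree.
    return {"asset": args["id"], "owner": args["owner"]}

def nondeterministic_cc(args):
    # The write-set depends on a locally generated value (a UUID here,
    # standing in for a timestamp or random nonce), so every endorser
    # produces a different write-set and validation fails.
    return {"asset": args["id"], "nonce": str(uuid.uuid4())}

def endorse(chaincode, args, endorsers=2):
    # Each endorser executes the chaincode independently; the proposal
    # only succeeds if all write-sets match.
    write_sets = [chaincode(args) for _ in range(endorsers)]
    return all(ws == write_sets[0] for ws in write_sets)

args = {"id": "asset1", "owner": "alice"}
print(endorse(deterministic_cc, args))     # True: write-sets match
print(endorse(nondeterministic_cc, args))  # False: write-sets diverge
```

Timestamps, random numbers, and map-iteration order are the classic sources of this kind of divergence, and they surface only once real endorsement policies and real load are in play.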
Finally, the SDK. In my opinion, one of the most disregarded and least understood pieces of a Fabric network, and surprisingly a big contributor to the performance of Fabric-based applications. While building our HFService, the SDK was the reason for many of our bottlenecks until we fine-tuned it. Bear in mind that the SDK is responsible for orchestrating the entire lifecycle of a transaction, so if misused it can undo all the good work we may already have done fine-tuning the other underlying layers.
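To make “orchestrating the entire lifecycle” concrete, here is a heavily simplified sketch of the flow the SDK drives: endorse, order, then wait for the commit result. Every step is a network round-trip, which is why how the SDK batches, retries, and waits matters so much. This is plain Python with stub classes; all names are hypothetical, not a real SDK API:

```python
# Conceptual sketch (not a real SDK API): the transaction lifecycle the
# SDK orchestrates. Peer and Orderer are in-memory stand-ins for the
# real network components.

class Peer:
    def __init__(self):
        self.ledger = []

    def endorse(self, chaincode, args):
        # Simulate the chaincode and return the (signed) write-set.
        return chaincode(args)

    def commit(self, block):
        # Validate the block and append it to the local ledger.
        self.ledger.append(block)
        return True

class Orderer:
    def order(self, endorsed_tx):
        # Cut a block containing the transaction (batching elided).
        return [endorsed_tx]

def submit_transaction(chaincode, args, peers, orderer):
    # 1. Endorsement: send the proposal to enough peers to satisfy
    #    the endorsement policy, and check their responses match.
    responses = [p.endorse(chaincode, args) for p in peers]
    if any(r != responses[0] for r in responses):
        raise RuntimeError("endorsement mismatch")
    # 2. Ordering: hand the endorsed transaction to the ordering
    #    service for inclusion in a block.
    block = orderer.order(responses[0])
    # 3. Commit: learn from the peers whether the transaction was
    #    validated or invalidated (e.g. an MVCC conflict).
    return all(p.commit(block) for p in peers)

peers = [Peer(), Peer()]
ok = submit_transaction(lambda a: {"k": a["k"]}, {"k": "v"}, peers, Orderer())
print(ok)  # True
```

Misconfigure any step (endorsing against too many peers, submitting without waiting for commit events, retrying blindly on conflicts) and the SDK alone can throttle an otherwise well-tuned network.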
To be continued…
This is the first publication of a series dedicated to moving Hyperledger Fabric networks into production. In this article I wanted to set the context and give you an overview of the publications yet to come (see if you get the itch). In the next few weeks I will go through all of our work, sharing all the knowledge we have acquired throughout our extensive analysis. In order not to overload the segment of my audience less interested in blockchain technology, I will intersperse articles on other subjects, publishing a new part of this series every two weeks instead of every week.
In any case, if you don’t want to miss any publication in this series, don’t forget to subscribe. And if you can’t stand the wait and want to learn a bit more about my work, don’t hesitate to contact me.