@adlrocha - Performance Best Practices in Hyperledger Fabric IV: SDK
The "Fabric Into Production" Series (Part 4)
We have finally reached the last part of the series. By now you should be an expert on how to build an infrastructure to host a Fabric network with performance in mind; how to decide on an optimal network architecture; how to fine-tune Fabric’s protocol; and how to design your chaincodes so that they don’t become the bottleneck of your production system. There is only one piece left to complete our performance puzzle in Fabric: the SDK.
Fabric’s SDK provides an API to interact with a Hyperledger Fabric network. It is our gateway to our Fabric system from the outside world. The SDK gives us a way to manage Fabric identities; send transactions to the network and manage their lifecycle; deploy and interact with chaincodes; and communicate with the CA. Precisely because it is the gateway to the network, we realized early in our journey (though not without pain) that using the SDK correctly, and understanding its interaction with Fabric components, was key to the performance and success of our systems.
Fabric-SDK: The interaction gateway
As mentioned in previous publications, our aim was to build a general-purpose Fabric network where anyone could deploy their chaincodes and use it as a substrate to build their Fabric-based applications. Consequently, we decided to build a piece of software that we called the “HFService”, a driver API that lets external applications interact easily with peers and other entities in the network (let’s be honest, if you have used Fabric’s SDK you’ll agree with me that it is not exactly easy to use; a simpler abstraction layer wouldn’t hurt).
For the implementation of the HFService we used fabric-sdk-node (Fabric’s official SDK written in Node.js). As depicted in the following image, HFService is just a simple, lightweight API server coupled to our peers (analogous to Geth’s RPC), wrapping the functionality of the Fabric SDK but exposing a smaller set of higher-level commands that are easier to use for the layman than the low-level SDK API. The kind of functions you can expect from the HFService are: “deploy chaincode”, “query chaincode”, “invoke chaincode”, etc. Again, really similar to Geth’s RPC.
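To make this more concrete, here is what an HFService-style endpoint could look like. Everything in the sketch is hypothetical (the route, the `invokeChaincode` helper and the header carrying the user identity are illustrative, not the real HFService code); the point is simply that the HTTP layer exposes a high-level command and delegates all the SDK plumbing to a wrapper.

```javascript
// Hypothetical HFService-style endpoint: a thin HTTP layer on top of a fabric-sdk-node wrapper.
const express = require('express');
const { invokeChaincode } = require('./fabric-helper'); // hypothetical wrapper around fabric-sdk-node

const app = express();
app.use(express.json());

// POST /chaincodes/:id/invoke   body: { "fcn": "transfer", "args": ["a", "b", "10"] }
app.post('/chaincodes/:id/invoke', async (req, res) => {
  try {
    const result = await invokeChaincode({
      chaincodeId: req.params.id,
      fcn: req.body.fcn,
      args: req.body.args,
      user: req.headers['x-user-id'], // identity on whose behalf the transaction is signed
    });
    res.json({ status: 'VALID', result });
  } catch (err) {
    res.status(500).json({ status: 'ERROR', message: err.message });
  }
});

app.listen(3000, () => console.log('HFService-style gateway listening on :3000'));
```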
We performed all our tests using the HFService as the gateway to communicate with the Fabric infrastructure. Thus, Gatling directed its load to a set of endpoints exposed by the different HFServices in the network (in our case, one per peer).
The first unpleasant surprise the SDK gave us came the moment we started our load tests against the infrastructure (thinking that our HFService was perfectly designed after all the tests it had passed in our PoCs, where we had already used it). We started experiencing “weird” issues. At first we thought they were related to Fabric’s limitations under high load, but again (as with chaincodes), we were the problem. We had underestimated the importance of the SDK. Let me walk you through the problems we faced and how we solved them:
Our first issue involved a pattern that, done correctly, is actually a great feature for improving Fabric’s performance (the fire-and-forget approach), but in our case it was a big old bug. We were implementing the transaction lifecycle in the HFService wrong. This led to the SDK sending a transaction to a peer, getting it endorsed, sending it to the orderer, and never waiting for the result of the ordering service. This wasn’t really a problem in a PoC environment where transaction loads were small and, with high probability, any transaction that reached the orderer was accepted by the infrastructure. However, under high loads, MVCC conflicts occurred, inconsistencies appeared after the ordering process, etc. In short, the SDK was reporting as successful transactions that actually weren’t, and this was a complete mess. Fortunately, once we realized the mistake, it was easy to fix. The solution resides in the following sentence of the documentation:
After the transaction proposal has been successfully endorsed, and before the transaction message has been successfully sent to the orderer, the application should register a listener to be notified when the transaction achieves finality, which is when the block containing the transaction gets added to the peer's ledger/blockchain.
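In fabric-sdk-node (fabric-client v1.x) terms, the fix looks roughly like the sketch below: register a listener on a channel event hub for the transaction ID before handing the endorsed transaction to the orderer, and only report success once the commit event comes back as VALID. This is a minimal sketch under those assumptions; `client`, `channel` and `endorserPeer` are assumed to be configured elsewhere, and the timeout and error handling are deliberately naive.

```javascript
// Sketch of the full transaction lifecycle with fabric-client (fabric-sdk-node v1.x).
async function submitAndWaitForCommit(client, channel, endorserPeer, request) {
  const txId = client.newTransactionID();
  const txIdString = txId.getTransactionID();

  // 1. Endorsement: send the proposal to the endorsing peer(s).
  const [proposalResponses, proposal] = await channel.sendTransactionProposal({
    ...request,
    txId,
    targets: [endorserPeer],
  });
  const allGood = proposalResponses.every((r) => r.response && r.response.status === 200);
  if (!allGood) throw new Error('Proposal was not endorsed');

  // 2. Register a commit listener BEFORE sending the transaction to the orderer.
  const eventHub = channel.newChannelEventHub(endorserPeer);
  const committed = new Promise((resolve, reject) => {
    const timer = setTimeout(() => {
      eventHub.unregisterTxEvent(txIdString);
      eventHub.disconnect();
      reject(new Error('Commit event timeout'));
    }, 30000);
    eventHub.registerTxEvent(
      txIdString,
      (tx, code) => {
        clearTimeout(timer);
        code === 'VALID' ? resolve(code) : reject(new Error(`Transaction invalidated: ${code}`));
      },
      (err) => { clearTimeout(timer); reject(err); },
      { disconnect: true } // clean up the event hub once we get an answer
    );
    eventHub.connect();
  });

  // 3. Ordering: send the endorsed transaction and wait for finality, not just the orderer's ACK.
  const ordererResponse = await channel.sendTransaction({ proposalResponses, proposal });
  if (ordererResponse.status !== 'SUCCESS') throw new Error('Orderer rejected the transaction');
  return committed; // resolves only when the block containing the tx is committed on the peer
}
```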
This got our SDK working fine for a while, but we reached a new stalemate when we realized that we were reusing event hubs for different users in the platform. This resulted in applications not being notified correctly when their transactions were committed, and in transactions from different users getting mixed up. Again, this is something we didn’t notice in our PoCs but started experiencing with the high number of users of a production system. There are actually several possible strategies to manage Fabric’s event hubs (check the documentation), and after a lot of tests we drew the conclusion that the best way (from a performance point of view) to manage users in the SDK was to reuse client instances only for the same user, with their corresponding cached context, so that every user had a separate connection (and no shared contexts). Thus, when a new user arrives at the platform, a new client and a new context are created, with their corresponding event hub. Whenever this user returns to the platform, their client and context are recovered. In this way, we prevent client instances from being reused by different users, and we remove the issue of mixing contexts between them (this mixing led to the SDK confusing requests in the client). If you haven’t used fabric-sdk before this may not make any sense to you. Don’t worry, you’ll find the corresponding lesson at the end of the section for future reference.
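In practice this can be as simple as a per-user cache of client instances. The sketch below is hypothetical (it assumes a connection profile, a channel name, and that each user is already enrolled in the credential store), but it captures the idea: one fabric-client instance, one user context and one source of event hubs per platform user, never shared.

```javascript
// Hypothetical per-user client cache with fabric-client (fabric-sdk-node v1.x).
const Client = require('fabric-client');

const clientCache = new Map(); // username -> { client, channel }

async function getClientForUser(username) {
  if (clientCache.has(username)) {
    return clientCache.get(username); // returning user: reuse their own client and cached context
  }
  // New user: build a dedicated client with its own connection and user context.
  const client = Client.loadFromConfig('connection-profile.yaml'); // assumed connection profile
  await client.initCredentialStores();
  await client.getUserContext(username, true); // load the user's enrolled identity from the store
  const channel = client.getChannel('mychannel'); // assumed channel name from the profile
  const entry = { client, channel };
  clientCache.set(username, entry);
  return entry;
}
```

Event hubs are then created from each user’s own channel object (as in the lifecycle sketch above), so commit notifications for one user never end up in another user’s client.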
Once we had the SDK working without errors and at full-speed, we started exploring different strategies to see which one was the best for performance at an SDK level:
First, we tried different SDK-to-peer mappings. We went from having a single SDK instance pointing to every peer in the network, to having an SDK instance per peer, each pointing to a single peer. The results were clear: a 1:1 mapping between SDK instances and peers led to better performance.
Our next test was to see if the type of peer the SDK pointed to affected the transaction throughput, and the answer was clearly yes. The throughput was significantly higher when the SDKs pointed to endorser peers.
Finally, we wanted to understand if the choice of orderer node to which we sent transactions affected performance. The impact wasn’t as significant as in the case of endorsing peers, but with Raft, if we send the transaction to the leader orderer, the communication overhead of the consensus is minimized and an appreciable improvement is obtained (a small sketch of how both choices can be expressed in the SDK follows below). Contact me on Twitter if you want to know how we managed to “predict” which node was “expected” to be the leader before deploying a network. It is a fun story, but it’s too long, and of too little practical value, to be included in this write-up.
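A rough sketch of how these last two choices can be expressed with fabric-client: pass the endorsing peer as the proposal target, and pass the orderer explicitly when submitting. `endorserPeer` and `leaderOrderer` are assumed to be Peer/Orderer objects built elsewhere (e.g. from the connection profile), and `mycc` is a hypothetical chaincode name.

```javascript
// Sketch: steering endorsement to a specific peer and delivery to a specific orderer.
async function submitSteered(client, channel, endorserPeer, leaderOrderer) {
  const txId = client.newTransactionID();
  const [proposalResponses, proposal] = await channel.sendTransactionProposal({
    chaincodeId: 'mycc',
    fcn: 'transfer',
    args: ['a', 'b', '10'],
    txId,
    targets: [endorserPeer],   // endorse on the peer this SDK instance is coupled to
  });
  return channel.sendTransaction({
    proposalResponses,
    proposal,
    orderer: leaderOrderer,    // deliver straight to the orderer we expect to be the Raft leader
  });
}
```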
Learnings in the SDK
Be sure that you clearly understand how the SDK works and how to manage transaction lifecycles.
Do not reuse Fabric client instances for different users. Reuse clients for the same user so each user has a different connection and their contexts don’t get mixed up.
For the best performance, the SDK should point to endorser peers and, when using Raft, to the leader orderer.
Use a 1:1 SDK-to-endorsing-peer mapping; this leads to the best performance. Every endorsing peer should have its own SDK instance (i.e. its own HFService) connected.
Layer 2 performance improvements
If you recall from Part 3, one of the ways I suggested to improve performance at the chaincode level was to build a Layer 2 infrastructure that works in concert with chaincodes in order to increase the underlying transaction throughput of the infrastructure. These Layer 2 systems are built using the SDK (in our case, we are building them as features of the HFService). So what can we build to improve our system’s transaction throughput?
Fire-and-forget: We can follow an asynchronous approach and fire-and-forget transactions in the SDK. Instead of waiting for the whole transaction lifecycle to finish before reporting success, the SDK sends the transaction, registers a commit listener, and immediately returns to the caller, so it can serve other requests in the meantime. This complicates the design a bit, and it may not fit every use case, but it minimizes the overhead in the SDK and lets the infrastructure do its thing without making users wait.
Queues: Building queues in the SDK allows us to minimize MVCC errors. A strategy we follow, for instance, is queueing the transactions for each chaincode in the platform in a different queue, flattening the load for each chaincode. Thus, instead of the SDK pushing transactions to the infrastructure, it is the infrastructure (through the SDK) that pulls from the queue whenever it is ready. This serializes all the calls to a chaincode, avoiding MVCC conflicts due to parallel read/writes (you’ll find a small sketch of this idea right after this list). I faced this same problem in a previous system I built, and a similar scenario was the reason I designed Goxyq (something we already talked about in this newsletter).
Batches: As mentioned in the previous part, chaincodes can be designed so that they understand batches (or blocks) of transactions. These batches would be built by the SDK and sent in a single transaction to the chaincode, which is responsible for translating them into individual actions.
Decentralized transaction pool: This is the most disruptive Layer 2 system we have designed. Following an approach similar to the queues, we built a common transaction pool shared between all the SDKs of the network (all the HFServices), so that each of them can pull a transaction from the pool whenever it is available. Thus, the transaction pool flattens the load, balancing it between all the available nodes.
Needless to say, all these enhancements may be combined at will, as they are not mutually exclusive.
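Here is the promised sketch of the queue idea: an in-memory FIFO per chaincode where the HTTP layer only enqueues, and a single worker per chaincode pulls and submits transactions one at a time. This is only illustrative (a production version would use a proper message broker and the commit-listener flow sketched earlier); `submitFn` stands for whatever function actually drives the SDK, such as the hypothetical `submitAndWaitForCommit` above.

```javascript
// Minimal sketch of per-chaincode queues: writes to the same chaincode are serialized,
// so MVCC conflicts caused by parallel read/writes on the same keys disappear.
const queues = new Map(); // chaincodeId -> { jobs: [], running: boolean }

function enqueueTransaction(chaincodeId, request, submitFn) {
  if (!queues.has(chaincodeId)) queues.set(chaincodeId, { jobs: [], running: false });
  const queue = queues.get(chaincodeId);

  const job = new Promise((resolve, reject) => {
    queue.jobs.push({ request, resolve, reject });
  });
  drain(chaincodeId, submitFn); // wake the worker if it is idle
  return job; // the caller can await it, or fire-and-forget and check the result later
}

async function drain(chaincodeId, submitFn) {
  const queue = queues.get(chaincodeId);
  if (queue.running) return; // a worker is already pulling from this queue
  queue.running = true;
  while (queue.jobs.length > 0) {
    const { request, resolve, reject } = queue.jobs.shift();
    try {
      resolve(await submitFn(request)); // one transaction at a time per chaincode
    } catch (err) {
      reject(err);
    }
  }
  queue.running = false;
}
```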
Thank you for following this series.
So this is it for now! I really hope you enjoyed this series. This is the first full series of the newsletter, so I would love to know your opinion about this publication format. Throughout this month I shared all my experience and suffering moving Hyperledger Fabric into production: from the impact of the infrastructure, to my ignorance of the SDK. The result of this work is a set of best practices for Fabric deployments at each of its layers, which I’ve tried to compile at the end of each section for ease of reference.
Of course, this is a work in progress, and I may have made mistakes in the process, so please do not hesitate to get in touch with suggestions and feedback. I really hope to see other analyses of Fabric in production like this one, from which the whole community can benefit in order to make corporate blockchains and DLT networks a reality. Looking ahead, Fabric 2.0 will require a reevaluation of all the best practices included in this series (especially at the upper layers, considering the changes to chaincodes and the SDK). So expect new releases of this series soon. Stay tuned (i.e. subscribe)!
Part IV: The SDK (you are here)