Shoal Framework Explained: How to Reduce Bullshark Latency on Aptos?
Original: Aptos Labs
Translation: Aptos Global
Overview
Aptos Labs has solved two important open problems in DAG BFT, greatly reducing latency and, for the first time, eliminating the need for timeouts in deterministic practical protocols. Overall, this improves Bullshark's latency by 40% in the failure-free case and by 80% in the failure case.
Shoal is a framework that enhances any Narwhal-based consensus protocol (e.g., DAG-Rider, Tusk, Bullshark) with pipelining and leader reputation. Pipelining reduces DAG ordering latency by introducing an anchor in every round, and leader reputation further improves latency by ensuring that anchors are associated with the fastest validators. Moreover, leader reputation lets Shoal take advantage of asynchronous DAG construction to eliminate timeouts in all scenarios. This allows Shoal to provide a property we call Universal Responsiveness, which subsumes the commonly required Optimistic Responsiveness.
Our technique is very simple: we run multiple instances of the underlying protocol sequentially, one after the other. So, when instantiated with Bullshark, we get a school of "sharks" running a relay race.
Motivation
In the pursuit of high performance in blockchain networks, there has been a constant focus on reducing communication complexity. However, this approach did not result in a significant increase in throughput. For example, the Hotstuff implementation in an early version of Diem only achieved 3500 TPS, far below our goal of achieving 100k+ TPS.
However, recent breakthroughs stem from the realization that data propagation is the main bottleneck of leader-based protocols, and that it can benefit from parallelization. The Narwhal system separates data propagation from the core consensus logic and proposes an architecture where all validators propagate data simultaneously, while consensus components only order a smaller amount of metadata. The Narwhal paper reports a throughput of 160,000 TPS.
In a previous post, we introduced Quorum Store, our Narwhal implementation that decouples data propagation from consensus, and described how we use it with our current consensus protocol, Jolteon. Jolteon is a leader-based protocol that combines Tendermint's linear fast path with PBFT-style view changes to reduce Hotstuff latency by 33%. However, a leader-based consensus protocol clearly cannot fully exploit Narwhal's throughput potential: despite separating data propagation from consensus, the Hotstuff/Jolteon leader remains a bottleneck as throughput grows.
Therefore, we decided to deploy Bullshark, a consensus protocol with zero communication overhead, on top of Narwhal DAG. Unfortunately, the DAG structure that supports Bullshark's high throughput comes with a 50% latency penalty compared to Jolteon.
In this post, we describe how Shoal achieved a dramatic reduction in Bullshark latency.
DAG-BFT background
Let's start by understanding the relevant background of this article. For a detailed description of Narwhal and Bullshark, please refer to the DAG meets BFT post.
Each vertex in a Narwhal DAG is associated with a round number. To enter round r, a validator must first obtain n-f vertices belonging to round r-1 (where n is the number of validators and f the maximum number of faulty ones). Each validator can broadcast one vertex per round, and each vertex references at least n-f vertices from the previous round. Due to network asynchrony, different validators may observe different local views of the DAG at any point in time. Here is an illustration of a possible local view:
Figure 1: The causal history of the round-2 vertex of validator 2 is highlighted in green.
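To make the round structure concrete, the following is a minimal Python sketch of a validator's local DAG view that enforces the n-f rule above. The names (Vertex, LocalDAG) and example sizes N and F are illustrative assumptions, not the Aptos implementation:

```python
from dataclasses import dataclass

N = 4  # total number of validators (example size, not from the post)
F = 1  # maximum number of faulty validators; a quorum is N - F = 3

@dataclass(frozen=True)
class Vertex:
    round: int
    author: int          # validator id in 0..N-1
    parents: frozenset   # round r-1 vertices that this vertex references

class LocalDAG:
    """One validator's local view of the Narwhal DAG."""

    def __init__(self):
        # round number -> {author id -> vertex}
        self.by_round: dict[int, dict[int, Vertex]] = {}

    def add_vertex(self, v: Vertex) -> None:
        # Every vertex after round 1 must reference at least N - F
        # vertices from the previous round.
        if v.round > 1 and len(v.parents) < N - F:
            raise ValueError("vertex needs at least N - F parents")
        self.by_round.setdefault(v.round, {})[v.author] = v

    def can_enter_round(self, r: int) -> bool:
        # A validator advances to round r once it has obtained
        # N - F vertices belonging to round r - 1.
        return len(self.by_round.get(r - 1, {})) >= N - F
```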
A key property of the DAG is non-equivocation: if two validators have the same vertex v in their local views of the DAG, they have exactly the same causal history of v.
Total Order
It is possible to agree on the total order of all vertices in a DAG without additional communication overhead. To this end, validators in DAG-Rider, Tusk, and Bullshark interpret the structure of the DAG as a consensus protocol, where vertices represent proposals and edges represent votes.
Although the quorum intersection logic on the DAG structure differs, all existing Narwhal-based consensus protocols have the following structure (a sketch follows Figure 2 below):
Predetermined anchors: every few rounds (e.g., every two rounds in Bullshark) there is a pre-determined leader, and the leader's vertex is called the anchor;
Ordering anchors: validators independently but deterministically decide which anchors to order and which to skip;
Ordering causal histories: validators process their ordered lists of anchors one by one and, for each anchor, order all previously unordered vertices in its causal history by some deterministic rule.
Figure 2: Illustration of a possible local view of the DAG in the Bullshark protocol. In this example, the red and yellow anchors are ordered, while the green one (not in the DAG) is skipped. Therefore, to order the DAG, validators deterministically order the causal history of the red anchor first, followed by that of the yellow anchor.
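Continuing the LocalDAG sketch above, here is a minimal outline of this three-step skeleton; the anchor convention and the `decide_anchor` argument are placeholders, not Bullshark's actual voting logic:

```python
def anchor_round(r: int) -> bool:
    # Step 1: predetermined anchors. This sketch follows the Bullshark
    # convention mentioned above: an anchor every two rounds.
    return r % 2 == 0

def order_dag(dag: LocalDAG, decide_anchor) -> list:
    """Steps 2 and 3: order anchors, then order their causal histories.

    `decide_anchor(dag, r)` stands in for the protocol-specific rule
    (e.g., Bullshark's voting logic) that returns the round-r anchor to
    order, or None if it must be skipped.
    """
    ordered, seen = [], set()
    for r in sorted(dag.by_round):
        if not anchor_round(r):
            continue
        anchor = decide_anchor(dag, r)   # step 2: order or skip this anchor
        if anchor is None:
            continue
        # Step 3: collect the anchor's not-yet-ordered causal history.
        stack, history = [anchor], []
        while stack:
            v = stack.pop()
            if v in seen:
                continue
            seen.add(v)
            history.append(v)
            stack.extend(v.parents)
        # Emit it by a deterministic rule, oldest rounds first.
        ordered.extend(sorted(history, key=lambda u: (u.round, u.author)))
    return ordered
```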
The key to satisfying safety is to ensure that in step (2) above, all honest validators create ordered lists of anchors that share the same prefix. In Shoal, we make the following observation about all of the above protocols:
**All validators agree on the first ordered anchor.**
Bullshark Latency
Bullshark's latency depends on the number of rounds between ordered anchors in the DAG. While the most practical, partially synchronous version of Bullshark has better latency than the asynchronous version, it is still far from optimal.
Problem 1: average block latency. In Bullshark, every even round has an anchor, and the vertices of every odd round are interpreted as votes. In the common case, it takes two rounds of the DAG to order an anchor; however, the vertices in an anchor's causal history need many more rounds to wait for an anchor to be ordered. In the common case, vertices in odd rounds need three rounds, while non-anchor vertices in even rounds need four (see Figure 3).
Figure 3: In the common case, the anchor in round i+1 takes two rounds to be ordered. The vertices in round i are ordered at the same time, so their latency is three rounds. The vertices in round i+1, however, have to wait for the next anchor to be ordered (the one in round i+3), so their ordering latency is four rounds.
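For concreteness, this common-case accounting can be captured in a small helper (a sketch only; the numbers come directly from the analysis above):

```python
def common_case_latency(r: int, is_anchor: bool) -> int:
    """Rounds until a round-r vertex is ordered in failure-free Bullshark.

    Assumes, as in the text, that anchors live in even rounds: an anchor
    is ordered after 2 rounds, odd-round (vote) vertices after 3, and
    even-round non-anchor vertices after 4.
    """
    if r % 2 == 0:                 # anchor round
        return 2 if is_anchor else 4
    return 3                       # vote round, ordered with the next anchor
```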
Problem 2: failure-case latency. The above analysis applies to the failure-free case. If, on the other hand, a round's leader fails to broadcast its anchor fast enough, the anchor cannot be ordered (and is therefore skipped), and all unordered vertices from previous rounds must wait for the next anchor to be ordered. This can significantly degrade performance in a geo-replicated network, especially since Bullshark uses timeouts while waiting for the leader.
Shoal Framework
Shoal solves both of these latency problems by enhancing Bullshark (or any other Narwhal-based BFT protocol) with pipelining, allowing an anchor in every round and reducing the latency of all non-anchor vertices in the DAG to three rounds. Shoal also introduces a zero-overhead leader reputation mechanism in the DAG that biases selection toward fast leaders.
Challenges
In the context of DAG protocols, pipelining and leader reputation are considered difficult problems for the following reasons:
Previous attempts at pipelining tried to modify the core Bullshark logic, but this seems inherently impossible.
Leader reputation, introduced in DiemBFT and formalized in Carousel, is the idea of dynamically selecting future leaders (in Bullshark, the anchors) based on validators' past performance. While disagreement about leader identity does not violate safety in those protocols, in Bullshark it can lead to completely different orderings. This gets to the heart of the problem: dynamically and deterministically selecting round anchors is necessary to solve consensus, yet validators must agree on an ordered history in order to select future anchors.
As evidence of the problem's difficulty, we note that Bullshark implementations, including the one currently in production, do not support these features.
Protocol
Despite the challenges above, it turns out that the solution, as the saying goes, lies hidden behind simplicity.
In Shoal, we rely on the ability to perform local computation over the DAG, and we implement the ability to preserve and reinterpret information from earlier rounds. With the core insight that all validators agree on the first ordered anchor, Shoal sequentially composes multiple Bullshark instances and pipelines them, such that (1) the first ordered anchor is the switching point between instances, and (2) the causal history of the anchors is used to compute leader reputation.
Pipelining
Similar to Bullshark, validators agree a priori on the potential anchors, i.e., there is a known mapping F: R -> V from rounds to leaders. Shoal runs instances of Bullshark one after the other, such that for each instance the anchors are predetermined by the mapping F. Each instance orders one anchor, which triggers the switch to the next instance.
Initially, Shoal launches the first instance of Bullshark in the first round of the DAG and runs it until the first ordered anchor is identified, say at round r. All validators agree on this anchor. Therefore, all validators can deterministically agree to reinterpret the DAG starting from round r+1. Shoal just starts a new instance of Bullshark at round r+1.
In the best case, this allows Shoal to order one anchor in every round. The anchor of the first round is ordered by the first instance. Then Shoal starts a new instance in the second round, which orders an anchor of its own; another new instance then orders an anchor in the third round, and the process continues. See the illustration below:
Figure 4: Vertices corresponding to the leaders determined by F are marked with crowns. The first instance of Bullshark interprets the DAG with anchors in rounds 1, 3, and 5. Bullshark determines that the anchor in round 1 (marked with a green checkmark) is the first to be ordered in the first instance. (Note that in general this anchor could be skipped, and some other anchor would be ordered first.) The rest of the first instance is then ignored; a new instance of Bullshark starts from round 2, with anchors in rounds 2 and 4.
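Continuing the sketch above, here is a minimal outline of the instance-switching loop (hypothetical names; `run_instance` stands in for a full Bullshark instance):

```python
def leader_map(r: int) -> int:
    # Placeholder for the predetermined mapping F: rounds -> leaders.
    return r % N

def shoal_pipeline(dag: LocalDAG, run_instance) -> list:
    """Compose Bullshark instances so each one orders a single anchor.

    `run_instance(dag, start_round, F)` stands in for one Bullshark
    instance: it interprets the DAG from `start_round` onward, using F
    to locate anchors, and returns (anchor, r) for the first anchor it
    orders, where r is that anchor's round.
    """
    ordered_anchors = []
    start_round = 1
    while start_round in dag.by_round:
        anchor, r = run_instance(dag, start_round, leader_map)
        ordered_anchors.append(anchor)
        # All validators agree on this first ordered anchor, so they all
        # switch here: the next instance reinterprets the DAG from r + 1.
        start_round = r + 1
    return ordered_anchors
```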
Leader reputation
Skipping anchors during Bullshark ordering increases latency. In such cases, pipelining is powerless, since a new instance cannot start before the previous instance has ordered an anchor. To make the responsible leader less likely to be elected in the future when an anchor is missed, Shoal uses a reputation mechanism that assigns each validator a score based on its recent activity. A validator that is responsive and participates in the protocol receives a high score; otherwise, it is assigned a low score, since it may be crashed, slow, or malicious.
The idea is to deterministically recompute the predefined mapping F from rounds to leaders whenever the scores are updated, biasing it toward leaders with higher scores. For validators to agree on the new mapping, they must agree on the scores, and hence on the history used to derive them.
In Shoal, pipelining and leader reputation combine naturally because they both use the same core technique: reinterpreting the DAG after agreeing on the first ordered anchor.
In fact, the only difference is that after ordering the anchor in round r, validators compute a new mapping F', effective from round r+1, based on the causal history of the ordered anchor in round r. They then execute a new instance of Bullshark starting from round r+1 using the updated anchor-selection function F'. See the figure below:
Figure 5: Vertices corresponding to leaders determined by F are marked with transparent crowns. The first instance of Bullshark orders the anchor in round 1, marked with a green checkmark, and then computes a new mapping F' based on the information in the anchor's causal history. Leaders determined by F' are marked with colored crowns.
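Continuing the same sketch, here is a minimal outline of the remapping step. The scoring rule is a hypothetical stand-in; the protocol only requires a deterministic function of the ordered anchor's causal history, so that all validators derive the same F':

```python
def reputation_scores(history: list) -> dict:
    # Hypothetical scoring rule: count each validator's vertices in the
    # ordered anchor's causal history. Any deterministic function of the
    # agreed history would do, since all validators compute the same one.
    scores = {v: 0 for v in range(N)}
    for vertex in history:
        scores[vertex.author] += 1
    return scores

def updated_leader_map(scores: dict):
    """Build F': rounds -> leaders, biased toward high-scoring validators."""
    ranked = sorted(scores, key=lambda v: scores[v], reverse=True)
    fast = ranked[: N - F]          # keep only the top N - F validators
    def F_prime(r: int) -> int:
        return fast[r % len(fast)]
    return F_prime
```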
No more timeouts
Timeouts play a crucial role in all leader-based deterministic partially synchronous BFT implementations. However, they add complexity that increases the amount of internal state to manage and observe, which complicates debugging and demands more observability techniques.
Timeouts can also significantly increase latency, because it is important to configure them appropriately, and they usually need to be adjusted dynamically since they depend heavily on the environment (the network). Before moving to the next leader, the protocol pays the faulty leader's full timeout delay as a penalty. Timeout settings therefore cannot be too conservative; yet if timeouts are too short, the protocol may skip good leaders. For example, we observed that under high load, leaders in Jolteon/Hotstuff were overwhelmed, and timeouts expired before they could drive progress.
Unfortunately, leader-based protocols such as Hotstuff and Jolteon inherently require timeouts to ensure that the protocol makes progress whenever a leader fails. Without timeouts, even a single crashed leader could halt the protocol forever. Since a faulty leader cannot be distinguished from a slow one during asynchrony, timeouts can cause validators to view-change through all leaders without any consensus liveness.
In Bullshark, timeouts are used in DAG construction to ensure that, during synchrony, honest leaders add anchors to the DAG fast enough for them to be ordered.
We observe that the DAG construction provides a "clock" for estimating the speed of the network. In the absence of timeouts, rounds keep advancing as long as n-f honest validators keep adding vertices to the DAG. Although Bullshark may not order at network speed (because of problematic leaders), the DAG still grows at network speed despite leader failures and network asynchrony. Eventually, once a fault-free leader broadcasts its anchor fast enough, the anchor's entire causal history will be ordered.
In our evaluation, we compared Bullshark with and without timeouts in the following cases:
A fast leader, i.e., one at least as fast as the other validators. In this case, both variants provide the same latency, since anchors are ordered and timeouts are not used.
A faulty leader. In this case, Bullshark without timeouts provides better latency, since validators immediately skip its anchor, whereas with timeouts they wait for the anchor to arrive before moving on.
A slow leader. This is the only case in which Bullshark performs better with timeouts: without them, the anchor may be skipped because the leader cannot broadcast it fast enough, whereas with timeouts, validators wait for the anchor.
In Shoal, avoiding timeouts and leader reputation go hand in hand. Repeatedly waiting for a slow leader increases latency, and the leader reputation mechanism prevents slow validators from being elected as leaders. In this way, the system leverages the fast validators to run at network speed in all realistic scenarios.
Note that the FLP impossibility result shows that no deterministic consensus protocol can avoid timeouts entirely. Shoal cannot circumvent this result, since there exists a theoretical adversarial schedule of events that would prevent all anchors from being ordered. Instead, Shoal falls back to a timeout after a configurable number of consecutive anchor skips. In practice, this is extremely unlikely to happen.
Universal Responsiveness
The Hotstuff paper popularized the notion of Optimistic Responsiveness which, while never formally defined, intuitively means that the protocol can run at network speed under good conditions, including a fast leader and a synchronous network.
Shoal provides a strictly better property, which we call Universal Responsiveness: in contrast to Hotstuff, Shoal keeps running at network speed even if leaders fail for a configurable number of consecutive rounds or the network goes through periods of asynchrony. See the more detailed comparison in the table below.
Note that Universal Responsiveness provides strictly stronger progress guarantees during asynchrony and when leaders fail. During synchrony with a slow leader, the two properties are incomparable, since it depends on how slow the leader is. With leader reputation, however, slow leaders should rarely appear in Shoal.
Evaluation
We implemented Bullshark and Shoal on top of Quorum Store, our Narwhal implementation. A detailed comparison between Shoal, Bullshark, and Jolteon can be found in the evaluation section of the Shoal paper; here we present some highlights.
First, to demonstrate the power of asynchronous DAG construction, we compare Bullshark with and without timeouts. The full Bullshark paper assumes an asynchronous network but provides a fast-path mode, and therefore requires timeouts in every round; we call this version Vanilla Bullshark. We observe that in a standalone partially synchronous setting, no timeouts are needed in the voting rounds; we call this version Vanilla Bullshark w/o Vote timeout. Baseline Bullshark is the version without any timeouts.
The graph below compares the impact of timeouts on Bullshark's latency with and without failures. Clearly, Baseline Bullshark (no timeouts) performs best in the failure case. Without failures, Baseline Bullshark is comparable to Vanilla Bullshark w/o Vote timeout. This is because, as mentioned earlier, without a leader reputation mechanism, timeouts can have an advantage when a leader is good but slow.
Figure 6: Impact of timeouts on Bullshark latency without failures (left) and with failures (right), with 50 validators in the failure case.
Next, we instantiate Shoal with Baseline Bullshark (no timeouts) and demonstrate the latency improvements from pipelining and the leader reputation mechanism, with and without failures. For completeness, we also compare it against Jolteon in both cases.
Figure 7 below shows the failure-free case: while pipelining and leader reputation each reduce latency individually, combining them yields the best latency.
As for Jolteon, it cannot scale beyond 20 validators, and even running on top of Quorum Store, which removes the data propagation bottleneck, it achieves only about half the throughput of Bullshark/Shoal.
The results show that Shoal greatly improves Bullshark's latency. As for Jolteon, note that we measured only consensus latency: since Jolteon cannot run natively on top of a DAG, it would require additional latency for decoupled data propagation, which we did not measure. Therefore, under high load, Shoal should match Jolteon's end-to-end latency.
Figure 7: Throughput and latency without failures. Shoal PL and Shoal LR support only pipelining and leader reputation, respectively.
Figure 8 below shows the failure case with 50 validators, where the leader reputation mechanism significantly improves latency by reducing the probability that a failed validator is elected as leader. Note that with 16 failures out of 50 validators, Shoal's latency is 65% lower than Baseline Bullshark's.