What are the benefits of Data Availability Sampling technology? Why do we need it?

Photo by Testalize.me on Unsplash
Prerequisite knowledge:
In this article, Data Availability is referred to as Data Publication. Established terms that contain Data Availability, such as DAS and DAC, keep their original wording so that readers can still connect them to the original English text. For an introduction to the name Data Publication, please refer to:
This series of articles will introduce the operating mechanism of DAS through Danksharding, as well as the similarities and differences between Celestia, EigenDA, and AvailDA. The first article will introduce why we need DAS and the benefits that DAS brings.
In the Ethereum ecosystem today, data publication comes up most often in discussions of L2 design. An L2 is itself a chain, so it has its own blocks and transaction data, which raises the question of where that data should be placed. The question matters because users need this data to ensure security.
For more information about the relationship between rollups and data publication, please refer to:
But in fact the problem is not limited to L2: the Ethereum chain itself also faces the data publication problem. Light nodes do not download complete block data the way full nodes do, so a light node has to trust that “when a new block appears, the complete data of that block has actually been published.” When a light node is tricked into believing an “incomplete” block, the impact is the same as when it is tricked into believing an “invalid” block: it ends up following a fork that nobody else recognizes.

Full nodes will not believe incomplete blocks, but light nodes will
So whom does an Ethereum light node currently trust to ensure that the data of a new block has been completely published? The answer is “validators”. When a light node receives a new block, it does not download the full block data; instead, it checks how many validators voted for the block. When enough validators have voted for the block, the light node believes that the complete data of the block has indeed been published. This is an Honest Majority assumption: the belief that most validators are honest.

When enough validators sign the block, the light node will believe that the block has been fully released.
Note: At present, Ethereum light nodes do not actually collect and verify the votes of hundreds of thousands of validators, because the number is too large and the computation too resource-intensive. Instead, a much smaller group of validators, the Sync Committee, is assigned, and the signatures of the validators in it serve as the guarantee for light nodes. This is considered a transitional approach until a more complete and mature design replaces it.
“Believing that most validators are honest” sounds like a reasonable choice, but what if we can do better? Suppose that one day most validators really do collude to deceive us, claiming that the complete block data exists when it does not. What if we could avoid being deceived as long as there are a few honest participants in the p2p network?
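The trust model described above can be sketched in a few lines of Python. This is only an illustration of the Honest Majority check: the function name, the committee size of 512, and the two-thirds threshold are assumptions made for the example, not values taken from this article.

```python
def light_node_accepts(signature_bits, min_fraction=2 / 3):
    """Return True if enough committee members signed the block header.

    signature_bits: list of booleans, one per committee member
    (a hypothetical representation of who signed).
    min_fraction: assumed participation threshold.
    """
    if not signature_bits:
        return False
    signed = sum(1 for bit in signature_bits if bit)
    return signed >= min_fraction * len(signature_bits)


# A committee of 512 members (assumed size) in which 400 members signed.
votes = [True] * 400 + [False] * 112
print(light_node_accepts(votes))
```

Nothing in this check looks at the block data itself, which is why a colluding majority could vouch for a block whose data was never published.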
Such a great capability will definitely not appear out of thin air. To have this capability, you need to build a stable enough p2p network, have enough users, and even add privacy functions to the network layer. This capability is the focus of this series of articles - Data Availability Sampling (DAS).
In a blockchain with DAS, light nodes do not just passively receive new block data; they jointly participate in the operation of DAS. For each block, every light node goes to the p2p network, requests several fragments of that block, stores them, and shares them with other nodes that ask for them. As in the decentralized file-sharing protocol BitTorrent, nodes in the network store and share the data they care about together instead of relying on a centralized server.

Light nodes work together to retrieve and share block fragment data through the p2p network
Note: In an ideal scenario, blockchain users would all run light nodes, participate in the operation of DAS and ensure security together, instead of completely trusting other nodes like now.
A light node cannot believe that the block data has been completely published until it has successfully obtained every fragment it requested. Once it has obtained all of them, however, it can safely believe that the block data has been published in full. But why can a light node safely believe this even though it holds only a few fragments? Because the other light nodes in the network also store fragments of their own, so when necessary, everyone can work together to piece the complete data back together from one another’s fragments. DAS also has another remarkable property: the fragments stored by the light nodes do not need to cover the complete block data. The light nodes can restore 100% of the data as long as they collectively hold 50% of it.

As long as light nodes have more than 50% of the fragment data, they can restore the complete block data.
Note: It is not necessarily 50%, it may be higher, depending on the setting of the DAS, but it will definitely be less than 100%.
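To make the “restore 100% from 50%” property concrete, here is a minimal sketch based on polynomial interpolation over a prime field, the idea behind Reed-Solomon codes. It is a toy under stated assumptions: the function names, the modulus, and the 2x extension factor are chosen for illustration, and real DAS designs use more elaborate constructions.

```python
P = 2**31 - 1  # a prime modulus, chosen only for this example


def lagrange_eval(points, x):
    """Evaluate at x the unique polynomial of degree < len(points)
    that passes through all (xi, yi) in points, working modulo P."""
    total = 0
    for i, (xi, yi) in enumerate(points):
        num, den = 1, 1
        for j, (xj, _) in enumerate(points):
            if i != j:
                num = num * (x - xj) % P
                den = den * (xi - xj) % P
        total = (total + yi * num * pow(den, -1, P)) % P
    return total


def extend(data):
    """Turn k original values into 2k coded fragments.
    Fragments 0..k-1 equal the original data."""
    points = list(enumerate(data))
    return [lagrange_eval(points, x) for x in range(2 * len(data))]


def recover(known, k):
    """Rebuild the k original values from any k known fragments.
    known maps fragment index -> fragment value."""
    if len(known) < k:
        raise ValueError("not enough fragments to restore the data")
    points = list(known.items())[:k]
    return [lagrange_eval(points, x) for x in range(k)]


data = [10, 20, 30, 40]
coded = extend(data)
# Keep only 4 of the 8 fragments (50%), including just one original value.
survivors = {i: coded[i] for i in (1, 4, 6, 7)}
print(recover(survivors, k=4) == data)
```

Any 4 of the 8 fragments would work equally well here, because 4 points determine a polynomial of degree below 4.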
Therefore, with DAS, even if most validators vote for a block, a light node will not simply believe it. The light node performs sampling, that is, it requests fragments of the block, and only when it has obtained all the requested fragments does it believe that the block data has been completely published.
On today’s Ethereum, which does not have DAS, light nodes must rely on “most validators are honest”, which is an Honest Majority assumption. Once Ethereum adopts DAS, light nodes rely on “a small number of nodes are honest (they sample and store the data)”, which is an Honest Minority assumption.
Note: “A small number” means that the number of light nodes that must sample and store data in order to restore the complete data is small relative to all the (very many) light nodes in the network.
As mentioned earlier, a light node that receives all the fragments it requested believes that the block data has been fully published, and the light nodes in the network can work together to restore the complete data from their fragments when needed. But what if the block producer is malicious and never released enough block data in the first place? What if it targets one particular light node, serves exactly the fragments that node requests, and then provides no further data?
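The sampling rule can be written down as a short sketch. The `fetch` callback stands in for a request to the p2p network and is hypothetical, as are the function name and parameters; the only logic taken from the text is “accept the block only if every requested fragment arrives, and keep what you received”.

```python
import random


def sample_block(fetch, total_fragments, num_samples, rng=None):
    """Request num_samples distinct random fragments of one block.

    fetch(index) is a hypothetical network call that returns the fragment,
    or None if nobody serves it. Returns (accepted, stored_fragments).
    """
    rng = rng or random.Random()
    indices = rng.sample(range(total_fragments), num_samples)
    stored = {}
    for index in indices:
        fragment = fetch(index)
        if fragment is None:
            # A single missing fragment is enough to withhold belief.
            return False, stored
        stored[index] = fragment
    # Every request was answered: believe the block and keep the
    # fragments so they can be shared with other nodes later.
    return True, stored


published = {i: f"fragment-{i}" for i in range(64)}
accepted, stored = sample_block(published.get, total_fragments=64, num_samples=8)
print(accepted, len(stored))
```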

Alice successfully obtained the fragments she requested, so she believes that the block data has been completely published and ends up following a block that other nodes have discarded.
That unlucky light node will be deceived, and this is the limitation and trade-off of DAS: the guarantee of “complete publication of the data” that DAS provides is probabilistic, not absolute. It is still better than having to fully trust the majority of validators. If a probabilistic guarantee does not feel safe enough and you want to be 100% sure that the block data has been completely published, then the only option is to run a full node yourself and download the complete block data.
Note: The program running the light node has no way of knowing that it has been cheated. For the program, as long as it receives all the requested data, it will believe that the block has been fully published. Users can only find out through their own social channels that a certain block is actually incompletely released, and quickly instruct their light nodes to mark the block as incomplete data.
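How strong is a probabilistic guarantee? A rough way to see it, assuming the block producer withholds a fixed set of fragments and cannot tell which node is asking: if less than half of the fragments are available, each random request succeeds with probability below 1/2, so the chance that all of a node’s requests succeed shrinks exponentially with the number of samples. The helper below is a hypothetical sketch of that bound, not part of any DAS specification.

```python
def samples_needed(max_fooled_probability, available_fraction=0.5):
    """Smallest number of samples s such that
    available_fraction ** s <= max_fooled_probability.

    Treats every sample as an independent draw that succeeds with
    probability at most available_fraction (a simplifying assumption).
    """
    if not 0 < max_fooled_probability < 1:
        raise ValueError("max_fooled_probability must be between 0 and 1")
    if not 0 < available_fraction < 1:
        raise ValueError("available_fraction must be between 0 and 1")
    samples = 0
    probability = 1.0
    while probability > max_fooled_probability:
        probability *= available_fraction
        samples += 1
    return samples


# Samples required to push the chance of being fooled below one in a billion.
print(samples_needed(1e-9))
```

The bound does not cover an attacker who recognizes the node and answers its requests selectively, which is the targeted case discussed in this article.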
If the block producer is malicious and wants to deceive some light nodes, then unfortunately it will be able to deceive them. However, there is an upper limit on the number of light nodes it can deceive: the attacker cannot release too many fragments, or the light nodes really would be able to work together to restore the complete data.
At this point you may be worried: does your security rest only on “having no enemies among block producers” or “the attacker does not know who I am, so it will not target me”? Yes, and this is why it was mentioned earlier that DAS needs privacy features at the network layer. If the attacker can tell who is requesting a fragment, or can tell that fragments A, B, and C are being requested by the same party, it can easily single out a victim and serve exactly the data that victim asks for. If the network layer provides privacy, the attacker has no way of knowing who is requesting a given fragment, so it cannot target a specific victim, and its attack becomes far less efficient: it cannot even determine whether the intended victim, or any light node at all, has been deceived.
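The upper limit can be illustrated with a small simulation. It assumes a greedy attacker who can see each node’s complete request and serves it only if the total number of distinct fragments released stays below the restoration threshold. The model, the function name, and the numbers are illustrative assumptions.

```python
import random


def count_deceived(total, threshold, samples_per_node, num_nodes, seed=0):
    """Simulate a greedy attacker serving light nodes one at a time.

    The attacker fully serves a node only if the set of fragments it has
    released so far, plus that node's request, stays below `threshold`
    distinct fragments (the amount needed to restore the block).
    Returns (deceived_nodes, fragments_released).
    """
    rng = random.Random(seed)
    released = set()
    deceived = 0
    for _ in range(num_nodes):
        wanted = set(rng.sample(range(total), samples_per_node))
        if len(released | wanted) < threshold:
            released |= wanted
            deceived += 1
    return deceived, len(released)


deceived, released = count_deceived(
    total=512, threshold=256, samples_per_node=16, num_nodes=1000
)
print(f"{deceived} of 1000 nodes deceived, {released} fragments released")
```

However many nodes ask, the attacker runs out of room once it approaches the threshold, so only a bounded number of them can be fully served.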

If the attacker does not know who is requesting information, it will be very difficult to deceive light nodes.
For DAS to be secure, it requires three things: block data that is encoded with erasure coding, light nodes that together store enough fragment data, and a robust p2p network with privacy features.
What would go wrong if blocks were not encoded with erasure coding and light nodes sampled the original block content directly? The fragments sampled by the light nodes would have to reach 100% coverage to guarantee that the data is complete. Even if the light nodes together sampled 99% of the block data, the block would still be incomplete and could not be accepted.

Without erasure coding, as long as a little bit of the block is missing, it means that the block has not been fully released.
If the block is encoded with erasure coding, then 100% of the data can be restored from, for example, any 50% of it. This means that the light nodes can restore the complete block data as long as the fragments they have sampled together reach 50% coverage. Compared with 100% coverage, 50% coverage is a much easier requirement to meet, and it becomes much harder for a malicious block producer to deceive light nodes by hiding part of the data.
Note: 50% is just an example. The percentage required for restoration depends on the design.
If the light nodes do not store enough fragment data, 100% of the data cannot be restored even when the block data is encoded with erasure coding. For example, if the light nodes store only 40% of the fragment data in total, they cannot restore 100% of the data together, and every one of these nodes has been fooled into believing that the block data was fully published.
How can we ensure that light nodes store enough data? Either there must be enough light nodes, or each light node must take enough samples. If there are enough light nodes, each one does not need to sample many times; if there are not, each one must sample enough times to ensure that the light nodes together store enough data.
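The difference erasure coding makes to a single sampling node can be put into numbers. The sketch below compares two hypothetical blocks: an uncoded block of 256 fragments where the producer hides just one fragment, and a coded block of 512 fragments where any 256 suffice for restoration, so the producer has to hide at least 257 of them. The sizes and the sample count of 16 are assumptions for the example.

```python
from math import comb


def detection_probability(total, withheld, samples):
    """Probability that at least one of `samples` distinct, uniformly
    random requests lands on a withheld fragment."""
    if samples > total:
        raise ValueError("cannot request more fragments than exist")
    all_available = comb(total - withheld, samples)
    return 1 - all_available / comb(total, samples)


# No erasure coding: hiding a single fragment already makes the block incomplete.
plain = detection_probability(total=256, withheld=1, samples=16)

# With erasure coding: the producer must hide more than half of the fragments.
coded = detection_probability(total=512, withheld=257, samples=16)

print(f"without coding: {plain:.4f}, with coding: {coded:.6f}")
```

Without coding, one node with 16 samples usually misses the single hidden fragment; with coding, hiding enough data to block restoration is almost always noticed.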
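Whether a group of light nodes holds enough data comes down to the union of what they store. A minimal sketch, with a hypothetical function name and the 50% threshold used as the example value:

```python
def can_restore(node_stores, total_fragments, required_fraction=0.5):
    """Check whether the fragments held across all nodes reach the
    coverage needed for restoration.

    node_stores: iterable of sets of fragment indices, one set per node.
    required_fraction: share of distinct fragments needed (example value).
    """
    held = set()
    for store in node_stores:
        held.update(store)
    return len(held) >= required_fraction * total_fragments


# Three nodes holding overlapping fragments of a 10-fragment block.
stores = [{0, 1}, {1, 2, 3}, {3}]
print(can_restore(stores, total_fragments=10))        # 4 distinct fragments
print(can_restore(stores + [{7}], total_fragments=10))  # 5 distinct fragments
```

Overlapping samples do not add coverage, which is why the number of distinct fragments matters rather than the total number of samples taken.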
Note: If the number of light nodes keeps growing while the number of samples per node stays the same, the amount of data they can store together actually increases. For example, Celestia, which will be introduced later in this series, is a blockchain that supports flexible block sizes: Celestia’s block size can be adjusted according to the number of light nodes in the network.
Light nodes need to share fragment data through the p2p network so that the complete block data can be restored when necessary. If the p2p network is unstable and cannot handle a large number of data requests, nodes may be unable to obtain some fragments. It is also necessary to prevent every fragment from circulating through the same p2p network and overloading its bandwidth. Ideally, a light node receives only the data it requests, rather than having all the other, irrelevant data flow through it as well.
In addition, the network layer needs privacy features; otherwise an attacker can identify individual light nodes. The attacker then publishes less than 50% of the data but still serves the fragments requested by the light node it has singled out, misleading that node into believing that the block data has been completely published.
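The trade-off between the number of light nodes and the number of samples per node can be estimated with a simple model. Assuming every node samples distinct fragments uniformly at random and independently of the others, a given fragment is missed by one node with probability 1 - s/N, so the expected coverage after n nodes is 1 - (1 - s/N)^n. The function names and numbers below are illustrative, and the result is an expectation, not a guarantee.

```python
def expected_coverage(total, samples_per_node, num_nodes):
    """Expected fraction of distinct fragments held by at least one node."""
    miss = 1 - samples_per_node / total
    return 1 - miss ** num_nodes


def nodes_needed(total, samples_per_node, target_coverage):
    """Smallest number of nodes whose expected coverage reaches the target."""
    if samples_per_node <= 0 or samples_per_node > total:
        raise ValueError("samples_per_node must be between 1 and total")
    if not 0 < target_coverage < 1:
        raise ValueError("target_coverage must be between 0 and 1")
    nodes = 0
    while expected_coverage(total, samples_per_node, nodes) < target_coverage:
        nodes += 1
    return nodes


# Fewer samples per node means more nodes are needed, and vice versa.
print(nodes_needed(total=512, samples_per_node=16, target_coverage=0.5))
print(nodes_needed(total=512, samples_per_node=32, target_coverage=0.5))
```

Read the other way around, the same formula shows why a growing network can support larger blocks: with the number of samples per node fixed, more nodes cover more fragments.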
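The role of network-layer privacy can be shown with a toy model of a malicious block producer. Everything here is hypothetical: the producer has released only 100 of 512 fragments, and `requester_id` models whether the attacker can tell who is asking (`None` means the request is anonymous).

```python
def make_malicious_producer(victim_id, released_indices):
    """Build a fetch function for an attacker that has published only
    `released_indices` but answers every request it can attribute to
    `victim_id`."""
    def fetch(index, requester_id=None):
        if requester_id is not None and requester_id == victim_id:
            # Targeted response: the victim always gets what it asks for.
            return f"fragment-{index}"
        if index in released_indices:
            return f"fragment-{index}"
        return None  # withheld from everyone else
    return fetch


def all_samples_served(fetch, indices, requester_id=None):
    """True if every requested fragment was returned."""
    return all(fetch(i, requester_id) is not None for i in indices)


fetch = make_malicious_producer("alice", released_indices=set(range(100)))
samples = [5, 300, 410]

# Identity visible to the attacker: Alice is served and deceived.
print(all_samples_served(fetch, samples, requester_id="alice"))
# Anonymous requests: the withheld fragments are missing, so Alice is not fooled.
print(all_samples_served(fetch, samples))
```

With anonymous requests the attacker can only choose which fragments to release to everyone, which brings the situation back to the untargeted, probabilistic case.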
The details of these three parts will be introduced in more detail in this series of articles.





