From Blockchain to LLM: An In-Depth Look at the Evolution and Challenges of Data Indexing Technology
Since Satoshi Nakamoto decided to embed a message in the genesis block, the data structure of the Bitcoin chain has undergone a series of changes.
I started studying blockchain development in depth in 2022, and the first book I read was "Mastering Ethereum". It is an excellent book that gave me a deep understanding of the fundamentals of Ethereum and blockchain. From today's perspective, however, some of the development techniques in the book have become somewhat outdated: the early chapters have you run a node on your personal laptop, and even the wallet dApp requires you to download and sync a light node yourself. This reflects how early developers and hackers worked in the blockchain ecosystem between 2015 and 2018.
Back in 2017, there were hardly any node service providers. From a supply-and-demand perspective, user activity was limited and nodes were mainly used to submit transactions, so maintaining or hosting a full node yourself was not much of a burden: there were few RPC requests to process, and transfers were infrequent. Most early adopters of Ethereum were tech geeks with a deep understanding of blockchain development, accustomed to maintaining Ethereum nodes directly, creating transactions, and managing accounts through the command line or an integrated development environment.
As a result, early projects usually had a very clean UI/UX. Some of them didn't even have a frontend, and user activity was fairly low. The characteristics of these projects were determined mainly by two factors: user behavior and the data structure of the chain.
The Rise of Node Providers
As more and more users with no programming background joined blockchain networks, the technical architecture of decentralized applications changed as well. The original model, in which users hosted their own nodes, gradually gave way to nodes hosted by the project teams.
Teams tend to choose node hosting services mainly because the rapid growth of on-chain data makes the cost of running your own node rise steadily over time.
However, self-hosting nodes remains a challenge even for project teams, especially small ones, as it requires ongoing maintenance effort and hardware costs. This complex node-hosting work is therefore usually entrusted to companies that specialize in node maintenance. It is worth noting that the timing of these companies' large-scale build-out and fundraising coincides with the rise of cloud services in the North American technology industry.
| Project | Category | Founded |
| --- | --- | --- |
| Alchemy | Nodes | 2017 |
| Infura | Nodes | 2016 |
| NowNodes | Nodes | 2019 |
| QuickNodes | Nodes | 2017 |
| Anchor | Nodes | 2017 |
| ChainStack | Nodes | 2018 |
Simply hosting nodes remotely does not solve the whole problem, especially now that protocols such as DeFi and NFTs are emerging. Developers have to deal with a lot of data problems, because what blockchain nodes provide is raw data: it is neither standardized nor cleaned, and still needs to be extracted, transformed, and loaded.
For example, suppose I am the developer of an NFT project and I want to support NFT trading or display NFTs. My front end then needs to read the NFT data in a user's EOA account in real time. An NFT is really just a standardized token: owning an NFT means owning a token with a unique ID generated by the NFT contract, while the image is actually metadata, which may be SVG data or a link to an image on IPFS. Although Ethereum's Geth client provides indexing commands, for projects with heavy front-end requirements it is impractical to keep querying Geth and piping the results back to the front end. And some functions, such as order auctions and NFT transaction aggregation, must be carried out off-chain: user instructions are collected first and then submitted to the chain at the appropriate time.
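To make this concrete, here is a minimal sketch, assuming web3.py with a placeholder RPC endpoint and contract address, of what reading a single NFT's owner and metadata URI directly from a node looks like:

```python
# A minimal sketch of reading ERC-721 state directly from a node with web3.py.
# The RPC URL, contract address, and token ID are placeholders.
from web3 import Web3

w3 = Web3(Web3.HTTPProvider("http://localhost:8545"))  # assumed RPC endpoint

ERC721_ABI = [
    {"name": "ownerOf", "type": "function", "stateMutability": "view",
     "inputs": [{"name": "tokenId", "type": "uint256"}],
     "outputs": [{"name": "", "type": "address"}]},
    {"name": "tokenURI", "type": "function", "stateMutability": "view",
     "inputs": [{"name": "tokenId", "type": "uint256"}],
     "outputs": [{"name": "", "type": "string"}]},
]

# Replace with a real, checksummed NFT contract address before running.
nft = w3.eth.contract(address="0x...", abi=ERC721_ABI)
token_id = 1

owner = nft.functions.ownerOf(token_id).call()          # current holder of the token
metadata_uri = nft.functions.tokenURI(token_id).call()  # often an IPFS link to JSON metadata
print(owner, metadata_uri)
```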
Therefore, a simple data layer was born: to meet users' requirements for real-time, accurate data, the project team has to build its own database and data-analysis functions.
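A minimal sketch of such a layer, assuming web3.py, SQLite, and placeholder contract address and block range, might extract Transfer logs, clean them into rows, and load them into a local table:

```python
# A minimal sketch of the "simple data layer": extract ERC-721 Transfer logs,
# clean them into rows, and load them into a local SQLite table.
# The RPC endpoint, contract address, and block range are placeholders.
import sqlite3
from web3 import Web3

w3 = Web3(Web3.HTTPProvider("http://localhost:8545"))
TRANSFER_TOPIC = w3.keccak(text="Transfer(address,address,uint256)").hex()

db = sqlite3.connect("nft_index.db")
db.execute("""CREATE TABLE IF NOT EXISTS transfers
              (block INTEGER, tx_hash TEXT, token_id TEXT, sender TEXT, receiver TEXT)""")

logs = w3.eth.get_logs({
    "fromBlock": 17_000_000, "toBlock": 17_000_100,  # placeholder range
    "address": "0x...",                              # placeholder NFT contract (checksummed)
    "topics": [TRANSFER_TOPIC],
})

for log in logs:
    # ERC-721 Transfer indexes from, to, and tokenId, so all three arrive as topics.
    sender = "0x" + log["topics"][1].hex()[-40:]
    receiver = "0x" + log["topics"][2].hex()[-40:]
    token_id = int(log["topics"][3].hex(), 16)
    db.execute("INSERT INTO transfers VALUES (?,?,?,?,?)",
               (log["blockNumber"], log["transactionHash"].hex(), str(token_id), sender, receiver))

db.commit()
```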
**How did the data indexer evolve?**
Starting a project is usually a relatively simple affair. You have an idea, set some goals, find the best engineers, and build a working prototype, which usually includes a front end and a few smart contracts.
However, it is quite difficult to make a project scale. You need to think carefully about the design of the system from day one; otherwise you can quickly run into what I usually call the "icing problem".
I borrowed this term from the Iron Man movie, and it describes the situation of most startups very well. When startups grow rapidly (attract a lot of users), they often run into trouble they did not foresee at the start. In the movie, the villain never expected his armor to fly into space because he didn't account for icing at altitude. Likewise, for developers of many Web3 projects, the "icing problem" is coping with the increased burden of mass user adoption: the server side comes under heavy pressure as the number of users grows dramatically, and there are also issues on the blockchain side, such as network problems or node outages.
Most of the time it is a backend issue. In some blockchain game protocols, for example, this situation is not uncommon: the teams never foresaw how many users would participate, and by the time they scrambled to add more servers and hire more data engineers to parse on-chain data, it was too late. These technical problems cannot be solved simply by adding more backend engineers; as I said before, these considerations should be built into the plan from the start.
The second problem involves adding new blockchains. You probably avoided server-side issues in the first place and hired a bunch of good engineers. However, your users may not be happy with the current blockchain. They want your service to also run on other popular chains like zk chains or L2 chains. Your project structure might end up looking like this:
In this type of system, you have full control over your data, which allows for better management and increased security. The system limits call requests, reducing the risk of overload and increasing efficiency. And the setup is compatible with the front end, ensuring a seamless integration and user experience.
However, operation and maintenance costs multiply, which can put a strain on your resources. Adding a new blockchain requires repeating the same work each time, which is time-consuming and inefficient. Selecting data from large datasets increases query times, potentially slowing down the whole process. And due to blockchain network issues such as rollbacks and reorganizations, data can become tainted, compromising its integrity and reliability.
A project's design reflects the shape of your team. Adding more nodes and building a backend-heavy system means hiring more engineers to operate those nodes and decode the raw data.
This model is similar to the early days of the Internet, when e-commerce platforms and application developers chose to build their own IDC (Internet Data Center) facilities. But as user requests grow and the state of the blockchain network explodes, costs rise in step with the complexity of the system design. This approach also hinders rapid market expansion: some high-performance public blockchains require hardware-intensive node operation, while data synchronization and cleaning consume both human resources and time.
If you're trying to build a blockchain-based NFT marketplace or cool game, isn't it surprising that 65% of your team members are backend and data engineers?
**Perhaps developers will wonder: why doesn't someone decode and deliver this on-chain data for them, so they can focus on building better products?**
**I believe this is why indexers exist.**
To lower the barrier to building Web3 applications and accessing blockchain networks, many development teams, including ours, have chosen to bundle steps such as archive node maintenance, on-chain data ETL (extract, transform, load), and database access. Tasks that project teams originally had to handle themselves are now offered as integrated multi-chain data and node APIs.
With the help of these APIs, users can customize on-chain data according to their needs. This covers everything from popular NFT metadata, monitoring on-chain activity of specific addresses, to tracking transaction data of specific token liquidity pools. I often refer to this approach as part of the structure of modern Web3 projects.
The financing and construction of data-layer and index-layer projects took place mainly around 2022. I believe the business practices of these index-layer and data-layer projects are closely tied to the design of their underlying data architecture, especially to the design of OLAP (On-Line Analytical Processing) systems. Adopting a suitable core engine is key to optimizing the performance of the index layer, including improving indexing speed and ensuring stability. Commonly used engines include Hive, Spark SQL, Presto, Kylin, Impala, Druid, and ClickHouse. Among them, ClickHouse is a powerful database widely used by Internet companies; it was open-sourced in 2016 and raised $250 million in 2021.
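As an illustration of how an index layer might sit on such an engine, here is a minimal sketch assuming the clickhouse-driver Python client, a local ClickHouse server, and a hypothetical decoded_transfers table; the schema and rows are illustrative, not a real product's design:

```python
# A minimal sketch of loading decoded transfer events into ClickHouse,
# assuming the clickhouse-driver client and a local ClickHouse server.
# Table name, schema, and rows are illustrative placeholders.
from clickhouse_driver import Client

client = Client(host="localhost")

client.execute("""
    CREATE TABLE IF NOT EXISTS decoded_transfers (
        block_number UInt64,
        tx_hash      String,
        token_id     String,
        sender       String,
        receiver     String
    ) ENGINE = MergeTree() ORDER BY (block_number, tx_hash)
""")

rows = [
    (17_000_001, "0xabc", "42", "0x1111", "0x2222"),  # placeholder rows
]
client.execute(
    "INSERT INTO decoded_transfers (block_number, tx_hash, token_id, sender, receiver) VALUES",
    rows,
)

# A typical index-layer query: all transfers of a given token, newest first.
result = client.execute(
    "SELECT block_number, sender, receiver FROM decoded_transfers "
    "WHERE token_id = %(tid)s ORDER BY block_number DESC",
    {"tid": "42"},
)
```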
Therefore, the emergence of a new generation of databases and improved data index optimization architectures has led to the creation of the Web3 data index layer. This enables companies in this field to provide data API services in a faster and more efficient manner.
However, the edifice of on-chain data indexing still sits under two dark clouds.
Two Dark Clouds
The first dark cloud is the impact of blockchain network stability on the server side. Although the blockchain network itself is highly stable, the same cannot be said for data transmission and processing. Events such as reorganizations (reorgs) and rollbacks can challenge the data stability of an indexer.
A blockchain reorganization happens when nodes temporarily lose synchronization and two different versions of the chain exist at the same time. It can be triggered by system failures, network delays, or even malicious behavior. When the nodes resynchronize, they converge on a single canonical chain, and the blocks on the abandoned fork are discarded.
When a reorganization occurs, the indexer may already have processed data from blocks that are eventually discarded, polluting its database. Indexers must therefore adapt: discard the data from the invalidated chain and reprocess the data from the newly accepted chain.
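A minimal sketch of how an indexer might detect and recover from a reorg, assuming web3.py and an in-memory map of recently indexed block hashes (a real indexer would persist this state and roll its database back together with the block data):

```python
# A minimal sketch of reorg detection in an indexer, assuming web3.py.
# `indexed` stands in for the indexer's persisted view of the chain.
from web3 import Web3

w3 = Web3(Web3.HTTPProvider("http://localhost:8545"))
indexed = {}  # block number -> block hash we have already indexed

def index_block(number: int) -> None:
    block = w3.eth.get_block(number)
    parent = indexed.get(number - 1)
    if parent is not None and block["parentHash"] != parent:
        # The new block does not build on what we indexed: a reorg happened.
        # Walk back until the chains agree, discarding the stale entries.
        rollback_to = number - 1
        while rollback_to in indexed and w3.eth.get_block(rollback_to)["hash"] != indexed[rollback_to]:
            del indexed[rollback_to]   # drop stale entry (and DB rows, in a real system)
            rollback_to -= 1
        for n in range(rollback_to + 1, number):
            index_block(n)             # reprocess the now-canonical blocks
    indexed[number] = block["hash"]
    # ... extract, clean, and load this block's transactions and logs here ...
```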
Such adjustments may result in increased resource usage and potentially delay the availability of data. In extreme cases, frequent or large-scale block reorganizations can severely impact the reliability and performance of services that depend on indexers, including those Web3 applications that use APIs to fetch data.
The second dark cloud is data format compatibility and the diversity of data standards across blockchain networks.
In the field of blockchain technology, there are many different networks, each with its own unique data standards. For example, there are EVM (Ethereum Virtual Machine) compatible chains, non-EVM chains, and zk (zero-knowledge) chains, each of which has its own special data structure and format.
This is undoubtedly a big challenge for indexers. In order to provide useful and accurate data through APIs, indexers need to be able to handle these diverse data formats. However, since there is no universal standard for blockchain data, different indexers may use different API standards. This can lead to data compatibility issues, where data extracted and transformed from one indexer may not be usable in another system.
Additionally, as developers explore in this multi-chain world, they often face the challenge of dealing with these different data standards. A solution that works for one blockchain network may not work for another, making it difficult to develop applications that can interact with multiple networks.
Indeed, the challenges facing the blockchain indexing industry are reminiscent of the two unsolved problems in physics identified by Lord Kelvin at the turn of the 20th century, which eventually gave birth to revolutionary fields such as quantum mechanics and relativity.
Faced with these challenges, the industry has indeed taken steps, such as delaying indexing until blocks are sufficiently confirmed, integrating stream processing through Kafka pipelines, and even establishing standards consortiums to strengthen the blockchain indexing industry. For now, these measures mitigate the instability of blockchain networks and the diversity of data standards well enough for indexers to provide accurate and reliable data.
However, just as the advent of quantum theory revolutionized our understanding of the physical world, we can also consider more radical ways to improve blockchain data infrastructure.
**After all, the existing infrastructure, with its neatly organized data warehouses and stacks, may seem too perfect, too beautiful to be true.**
So, **is there any other way?**
Finding Patterns
Let's go back to the original topic, the emergence of node providers and indexers, and consider a peculiar question: why didn't node providers appear back in 2010, and why did indexers suddenly appear in large numbers and attract so much investment in 2022?
I believe the discussion above has partially answered this. It comes down to the widespread adoption of cloud computing and data warehousing technologies across the software industry, not just in crypto.
Something special happened in the crypto world as well, especially once the ERC20 and ERC721 standards entered the public eye. The DeFi summer then made on-chain data far more complicated: call transactions are routed through many different smart contracts, and instead of the simple transfer data of the early days, the format and complexity of on-chain data changed and grew dramatically.
Although the cryptocurrency community has always emphasized its separation from traditional Web2 technology, we cannot ignore the fact that crypto infrastructure relies on continuous breakthroughs in mathematics, cryptography, cloud technology, and big data. Like traditional Chinese mortise-and-tenon joinery, the various components of the crypto ecosystem are tightly interlocked.
The progress and innovative application of technology is always bound by certain objective preconditions. For example, without elliptic curve cryptography, today's cryptocurrency ecosystem could not exist. Likewise, the practical application of zero-knowledge proofs would not have been possible without the seminal paper on zero-knowledge proofs published at MIT in 1985. So we see an interesting pattern: **the wide adoption and expansion of node service providers rests on the rapid growth of global cloud services and virtualization technology**, while **the development of the on-chain data layer rests on the flourishing of excellent open-source database architectures and services, the same data solutions that many business intelligence products have relied on in recent years**. These are technical prerequisites that startups must meet to achieve commercial viability. Among Web3 projects, those built on advanced infrastructure tend to have an advantage over those that rely on outdated architectures; the erosion of OpenSea's market share by faster, more user-friendly NFT marketplaces is a vivid example.
In addition, an obvious trend is emerging: artificial intelligence (AI) and LLM technologies have gradually matured and are becoming ready for broad application.
**Therefore, an important question emerges: how will AI change the landscape of on-chain data?**
Fortune-telling
Predicting the future is always fraught with difficulty, but we can explore possible answers by looking at the problems encountered in blockchain development. **Developers have a clear demand: accurate, timely, and easy-to-understand on-chain data.**
One problem we currently face is that complex SQL queries are required to obtain or display certain data in bulk. This is why the open SQL functionality provided by Dune is so popular in the crypto community: users don't need to write SQL from scratch to build charts, they only need to fork an existing query and swap in the address of the smart contract they care about. Even so, this is still too complicated for the average user who simply wants to view liquidity or airdrop data under certain conditions.
In my opinion, the first step toward solving this problem is to use LLMs and natural language processing.
We can build a more user-centric "data query" interface on top of LLM techniques. Today, users must use query languages such as SQL or GraphQL to extract on-chain data from APIs or query studios. With an LLM, we can offer a more intuitive, human way of asking: users express their question in natural language, the LLM translates it into a suitable query, and they get back the answer they need.
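A minimal sketch of that translation step might look like the following; `llm_complete` is a placeholder for whatever LLM completion API is used, and the `transfers` schema mirrors the toy SQLite table from earlier, both of which are assumptions rather than an existing product:

```python
# A sketch of natural-language-to-SQL query translation over an indexed table.
# `llm_complete` is a placeholder for any LLM completion call (hosted or local);
# the `transfers` schema mirrors the toy SQLite table above.
import sqlite3

SCHEMA = "transfers(block INTEGER, tx_hash TEXT, token_id TEXT, sender TEXT, receiver TEXT)"

def llm_complete(prompt: str) -> str:
    """Placeholder: call your LLM of choice and return its text output."""
    raise NotImplementedError

def ask(question: str, db: sqlite3.Connection):
    prompt = (
        f"You are a SQL assistant. Given the table {SCHEMA}, "
        f"write a single SQLite query answering: {question}\n"
        "Return only the SQL."
    )
    sql = llm_complete(prompt)      # natural language in, SQL out
    return db.execute(sql).fetchall()

# Usage (once llm_complete is wired to a real model):
# db = sqlite3.connect("nft_index.db")
# ask("Which address received the most NFTs in the last 100 blocks?", db)
```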
From the developer's perspective, AI can also improve the analysis of on-chain contract events and ABI decoding. Today, the details of many DeFi contracts have to be parsed and decoded by hand. With AI in the loop, contract disassembly techniques can be improved and the corresponding ABI retrieved quickly. Combined with a large language model (LLM), such a setup can intelligently parse function signatures and handle a wide variety of data types. When the system is further combined with a stream-processing framework, it can analyze transaction data in real time to meet users' immediate needs.
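For a sense of what the manual version of that decoding looks like today, here is a small sketch, assuming web3.py and eth-abi (v4-style API) with a placeholder log object, that recovers an ERC-20 Transfer event from its raw topics and data:

```python
# A sketch of manually decoding an ERC-20 Transfer log from raw topics and data,
# the kind of work AI-assisted tooling could automate. The log is a placeholder.
from web3 import Web3
from eth_abi import decode  # eth-abi v4 style

TRANSFER_SIG = Web3.keccak(text="Transfer(address,address,uint256)")

def decode_transfer(log: dict) -> dict:
    assert log["topics"][0] == TRANSFER_SIG, "not a Transfer event"
    # Indexed parameters (from, to) live in the topics; the value is in `data`.
    sender = decode(["address"], bytes(log["topics"][1]))[0]
    receiver = decode(["address"], bytes(log["topics"][2]))[0]
    value = decode(["uint256"], bytes(log["data"]))[0]
    return {"from": sender, "to": receiver, "value": value}
```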
From a more global perspective, the goal of an indexer is to provide users with accurate data. As I highlighted before, a problem with the on-chain data layer is that individual pieces of data are scattered across different indexer databases, isolated from each other. To meet diverse data needs, some designers integrate all on-chain data into a single database so users can select what they need from one dataset; other protocols include only certain data, such as DeFi or NFT data. But the problem of incompatible data standards remains: developers sometimes need to fetch data from multiple sources and reformat it in their own database, which adds to their maintenance burden, and they cannot quickly migrate to another provider if one data provider has a problem.
So, how can LLMs and AI solve this problem? LlamaIndex gave me an idea: what if developers didn't need an indexer at all, but instead used a deployed agent service to read raw on-chain data directly? The agent combines indexer technology with an LLM. From the user's point of view, they don't need to know anything about APIs or query languages; they just ask questions and get instant feedback.
Equipped with LLM technology, the agent understands and processes raw data and converts it into a format users can easily understand. Users no longer face complex APIs or query languages; they simply ask questions in natural language and get real-time feedback. This makes on-chain data more accessible and user-friendly, attracting a wider user base.
The agent approach also addresses the problem of incompatible data standards. Because it is designed to parse and process raw on-chain data, it can adapt to different data formats and standards, so developers no longer need to reformat data from multiple sources, reducing their workload.
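As a purely speculative sketch of what such an agent loop might look like, the tool set and the `llm_complete` helper below are hypothetical placeholders, reusing the stub from the earlier query example:

```python
# A speculative sketch of an on-chain data agent: the LLM chooses a tool,
# the tool reads raw chain data, and the LLM phrases the answer.
# `llm_complete` and the tool set are hypothetical placeholders.
import json
from web3 import Web3

w3 = Web3(Web3.HTTPProvider("http://localhost:8545"))

def get_balance(address: str) -> str:
    return str(w3.eth.get_balance(address))   # raw wei balance from the node

def get_block_number(_: str = "") -> str:
    return str(w3.eth.block_number)

TOOLS = {"get_balance": get_balance, "get_block_number": get_block_number}

def llm_complete(prompt: str) -> str:
    """Placeholder for an LLM call; expected to return text (JSON for the planning step)."""
    raise NotImplementedError

def agent_answer(question: str) -> str:
    plan = json.loads(llm_complete(
        f"Tools: {list(TOOLS)}. Pick one tool and an argument to answer: {question}. "
        'Reply as JSON {"tool": ..., "arg": ...}.'
    ))
    raw = TOOLS[plan["tool"]](plan["arg"])
    # Second pass: turn the raw value into a human-readable answer.
    return llm_complete(f"Question: {question}\nRaw result: {raw}\nAnswer in plain language.")
```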
Of course, this is just a speculation about the future development trajectory of on-chain data. But in technology, it's often these bold ideas and theories that drive revolutionary progress. We should remember that whether it is the invention of the wheel or the birth of the blockchain, all major breakthroughs start from someone's assumption or "crazy" idea.
As we embrace change and uncertainty, we are also challenged to continually push the boundaries of possibility. Against this backdrop, we envision a world where the combination of AI, LLM, and blockchain will breed a more open and inclusive technological field.
Chainbase upholds this vision and is committed to making it a reality.
Our mission at Chainbase is to create an open, friendly, transparent, and sustainable crypto data infrastructure. Our goal is to make this data simple for developers to use, eliminating the need for complex refactoring of the backend technology stack. In this way, we hope to usher in a future in which technology not only serves users but empowers them.
However, I must clarify that this is not our roadmap. Rather, this is my personal reflection on the recent development and progress of on-chain data in the community as a developer relations representative.