Why good data is hard to find

Information Asymmetry and Data

The data market is characterized by a significant information asymmetry between data producers and consumers. Data providers often have comprehensive insights into the origin, accuracy, completeness, and evolution of their data sets, whereas consumers must rely on providers’ representations.

For financial data, in particular, the timestamps of observations and the completeness of datasets, records, and revisions are vital. Timestamps are the foundation of reliable backtests, and completeness is required to guard against various biases in the data and analysis, such as survivorship bias.

Adverse Selection and Market Dynamics

Data that LOOKS VALUABLE can be very cheap to create. One need only slightly modify an existing dataset or backfill data knowing what you want it to say to make a very valuable-looking dataset:

“GPT5.1, Construct a dataset of daily parking lot car counts for physical locations of Goliath Retail, Inc. that is correlated with the price change of the S&P 500 in the following calendar month.”

However, TRULY VALUABLE data is often very expensive to produce, requires highly skilled practitioners, and offers a tremendous edge.

These economics lend themselves to adverse selection, as lower-quality data can flood the market and make it difficult for consumers to identify high-quality data. While years in business, reputation, etc., are potential alternative indicators that consumers can use, their correlation to data quality is quite imperfect. Furthermore, the long and expensive journey for new vendors and datasets toward building credibility and proving their value imposes a severe barrier to new valuable data and prevents consumers from realizing its value sooner. This is especially costly in finance, where the predictive value of new data, or alpha, can be arbitraged away very quickly.

A few example markets illustrate the point:

  • In the financial data market, many vendors claim to maintain point-in-time data but at the same time, claim that many of their competitors’ data isn’t truly point-in-time
  • In the financial data market, data consumers typically require lengthy trials before deploying data in production because only live consumption of data assures them of its completeness and evolution
  • In the data analytics market, insights are often based on proprietary algorithms which are marketed using unverifiable claims

A Vicious Cycle

Aware of their informational disadvantage, data users may be unwilling to pay premium prices for data, knowing they cannot accurately assess its value. Alternatively, they may test every dataset live to ensure its quality. Live testing is very expensive.

The necessity for extensive and expensive diligence depresses transaction value and lengthens the sales cycle, thus disincentivizing the production of high-quality data. Simply put, potential providers of superior data find it unprofitable to compete in a market that treats all with equal suspicion and cannot reward quality.

Sellers, on the other hand, must contend with the difficulty of distinguishing their high-quality offerings in a market skeptical of their claims.

This negatively reinforcing cycle reduces the size of the market as a whole and makes deals more difficult to consummate. For example, a vendor with valuable data may not wish to offer lengthy trials and may be unable to satisfy certain buyers’ needs, or may exit the market altogether.

Mitigating the Market for Lemons in Data

Addressing this problem requires reducing information asymmetries between buyers and sellers and enhancing trust. The traditional ways to build confidence, such as contracts and data trials, are costly and do not scale well.

vBase allows a data producer to incontrovertibly show anyone when data was created, by whom, and how it was revised, as well as share any valuable metadata. By establishing this record for the complete history of a particular dataset, a producer can build strong trust with potential users, increase the value of their data, and eliminate or shorten the long road to credibility via trials and time served.

In the financial realm, in particular, provably point-in-time data eliminates the potential for lookahead bias and makes any signals or backtests calculated from the underlying data provably point-in-time.

What’s more, the proof of integrity and completeness of data, its timestamps, and all revisions exist on the neutral peer-to-peer blockchain infrastructure that does not rely on the trustworthiness, benevolence, or even the existence of vBase or any other company. Even if a producer or consumer does not wish to use vBase, or if vBase ceases to operate, the underlying infrastructure and records will remain available!

Want to learn more?

Contact hello@vbase.com.test for more information