Data-Driven Insights

It’s no secret that the startup world lacks transparency. It consists of tiny, young companies with limited reporting requirements, little public data, and frequent misinterpretation of the small amount of data that does exist. Collecting accurate information and drawing definitive conclusions about early-stage investments is nearly impossible. Or at least it once was. With the advent of self-reporting and a rise in both media and regulatory attention, information on the early-stage market is growing. Yet data-driven insights are only powerful if coupled with a deep understanding of the data and its limitations.

Data Challenges

Incomplete Data

The first challenge is that early-stage investments are privately placed. This means that unless a startup or an investor brings the funding round into the public domain, it may very well remain under the radar. The Securities and Exchange Commission does require that a Form D be filed within 15 days of the first sale of securities within an offering, yet in practice both compliance and enforcement are weak.

Traditional databases – PitchBook, Preqin, and Thomson Reuters – provide incentives for institutional investors (such as accelerators and early-stage venture capital firms) to record their investments, in order to properly benchmark activity against peers. However, the vast majority of early-stage investments are completed by individual investors — who lack that incentive to report, as well as a process by which reporting can actually happen.

Crowdsourced as well as algorithmic databases are attempting to overcome this challenge, but are still in the relatively early stages.

Lack of Standard Terminology (and Methodology)

Terminology is not yet standardized across the early-stage landscape. For instance, in this piece (and across the seedchange institute), early-stage refers to a round of financing after “friends and family” but before Series A. However, some industry players use “early-stage” to refer to all pre-Series A investments, while others capture Series A through C investments under the early-stage umbrella. Still others equate angel investments with early-stage investments, even though anecdotal evidence (and research by the Center for Venture Research) suggests that angels invest across multiple stages, so the two terms are by no means the same.

More problematic still, a significant portion of the data across all the databases we examined is classified as “stage unknown.” Absent a detailed examination of each individual investment, it is impossible to tell if these financings are early-stage, friends and family, down rounds, gap financings, grants, or the like. Moreover, a portion of convertible notes appears to be mis-categorized under “venture debt.” Here too, a detailed examination of each round is the only way to confirm accurate classification.

Lastly, there is no universal consensus on when to record a convertible note — at issuance or at conversion. For instance, CrunchBase often records at issuance but MoneyTree, which draws upon Thomson Reuters’ data, records at conversion. Since conversion, if it occurs at all, occurs months to years after issuance, analyses will differ.
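To make the effect of the two recording conventions concrete, here is a minimal sketch in Python. The notes and their years are made up for illustration and do not come from any of the databases discussed; the point is only that the same set of notes produces different yearly counts depending on the convention.

```python
from collections import Counter

# Made-up convertible notes: (issuance_year, conversion_year).
# conversion_year is None when the note never converted.
notes = [
    (2011, 2012),
    (2011, 2013),
    (2012, None),
    (2012, 2013),
]

# Convention 1: record each note in the year it was issued.
by_issuance = Counter(issued for issued, _ in notes)

# Convention 2: record each note in the year it converted;
# notes that never convert are never recorded at all.
by_conversion = Counter(conv for _, conv in notes if conv is not None)

print("By issuance:  ", dict(by_issuance))
print("By conversion:", dict(by_conversion))
```

In this toy example the two conventions disagree on both the timing and the total number of deals, since the unconverted note drops out entirely under the second convention.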

Limited History

Adding to the hurdles is the limited history of data. On one front, data on early-stage investments – in particular non-institutional deals – has only recently become more robust. For instance, CrunchBase, a crowdsourced database, was launched in 2007, and only in the past few years has it become the norm for early-stage companies to create a profile and list funding rounds. On the other front, the market is changing and new players have recently entered the scene. For example, the average accelerator is less than four years old, so not enough time has passed to assess the outcomes of these programs.

Overview of Industry Databases

Keeping these serious challenges in mind, below is a comparison of five commonly cited databases collecting information on early-stage investments:


Data collection methods, coverage, and definitions vary wildly from one database to the next.  But since analyses are sensitive to the underlying data, it is critical that the caveats are understood in order to avoid misinterpretation.

For instance, CrunchBase and CB Insights have recently captured roughly twice as many early-stage investment deals as traditional databases. Why? The reason is likely a rise in self-reporting as well as wider press coverage of early-stage investment rounds. Additionally, the traditional databases often capture an early-stage deal only if an institutional investor participated in that round or in a subsequent financing. They therefore necessarily capture fewer deals and are skewed toward companies that survive to further rounds (survivor bias), which changes the dataset significantly. This is not a knock on traditional databases. Instead, it’s a recognition that their mission, to cover venture capital deals, does not align with our goal: to analyze early-stage investment deals completed by both institutional and individual investors.

Early-Stage Investment Deals

Notes: CrunchBase includes all deals (regardless of size or year company founded) classified as seed, angel, convertible note, equity crowdfunding, or product crowdfunding; CB Insights, PitchBook, and Preqin include “seed / angel”; Preqin includes deals across North America; Thomson Reuters data is taken from the NVCA 2013 Yearbook and includes seed capital provided by angels, friends and family.

Using CrunchBase: Our Rationale

Despite the challenges and caveats, insights backed by data are powerful. Our survey of the data landscape has led us to conclude that CrunchBase is the best underlying dataset for our analyses. CB Insights was a close runner-up, but we didn’t see enough evidence that its data was significantly different from CrunchBase’s. That said, we look forward to further engaging with data providers as the industry and our analyses evolve.

The primary critique of CrunchBase is that its data is crowdsourced. Taking Wikipedia as a successful example, we believe that crowdsourced data has the potential to be just as accurate as data from other sources. To increase the quality of its data, CrunchBase does not allow profiles to be deleted even after a company has closed, actively verifies and corrects inaccurate data, and lets institutions verify the accuracy of their public data through its Venture Program.

“We created the CrunchBase Venture Program to ensure that the startup community continues to have an open, up-to-date, and accurate database of companies, investors, and entrepreneurs. Through this initiative, venture funds, angel groups, accelerators, and incubators can guarantee that their public data is accurately represented inside CrunchBase.” – CrunchBase Venture Program

Moreover, CrunchBase has taken upon itself the task of analyzing its data output versus that of its competitors. This helped tip us toward using CrunchBase to run our own analyses.

CrunchBase Methodology

In order to ensure an accurate interpretation of the CrunchBase dataset, we conducted multiple analyses using various definitions and insights from the industry.  Here is an overview of our methodology as of July 2014.

CrunchBase contains over 70,000 investment rounds raised by over 40,000 companies across the globe. But since CrunchBase launched in 2007, we included only companies that launched in 2007 or later to avoid survivor bias: older companies are likely to appear in the database only if they survived long enough to be listed.

In addition, only U.S.-based companies are included in the dataset, as CrunchBase’s non-U.S. penetration is unknown.  For instance, we have not been able to easily assess whether or not it is the norm in places like China or Brazil for startups to have a CrunchBase profile.

These two restrictions have narrowed the data set to approximately 20,000 investment rounds raised by over 12,000 U.S.-based companies launched in 2007 or later.
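The two restrictions above amount to a simple filter. The sketch below is a minimal illustration, not the code used for the analysis: the record layout, the field names (`founded_year`, `country_code`), the `"USA"` country code, and the sample companies are all assumptions made for the example.

```python
from dataclasses import dataclass

CRUNCHBASE_LAUNCH_YEAR = 2007  # from the text: CrunchBase launched in 2007


@dataclass
class Company:
    # Hypothetical record layout; the field names are assumptions.
    name: str
    founded_year: int
    country_code: str


def in_scope(company: Company) -> bool:
    """Keep U.S.-based companies launched in 2007 or later."""
    return (
        company.country_code == "USA"
        and company.founded_year >= CRUNCHBASE_LAUNCH_YEAR
    )


# Made-up sample records for illustration only.
companies = [
    Company("Alpha", 2005, "USA"),  # dropped: launched before 2007
    Company("Beta", 2009, "USA"),   # kept
    Company("Gamma", 2011, "BRA"),  # dropped: not U.S.-based
]

scoped = [c for c in companies if in_scope(c)]
print([c.name for c in scoped])
```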

In order to conduct analyses, we defined three key terms:

Early-Stage Investment Round

Definition: An investment round greater than (or equal to) $100,000 and less than $2 million that is classified as seed, angel, convertible note, equity crowdfunding, or product crowdfunding.

Rationale: Investment rounds of less than $100,000 are often raised from friends and family and investment rounds greater than (or equal to) $2 million are often equity-based priced rounds and considered “super” early-stage rounds.

Caveat: Roughly 1,300 potential early-stage investment rounds were classified as unknown and thus excluded from our dataset since we could not verify that they were not down rounds, gap financings, grants, etc.
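The definition above translates directly into a small predicate. This is a sketch under stated assumptions: the function name and the lowercase, underscore-separated round-type labels are hypothetical spellings chosen for the example, while the thresholds and the list of round types come from the definition itself.

```python
# Round types named in the definition; the exact label spellings are assumptions.
EARLY_STAGE_TYPES = {
    "seed",
    "angel",
    "convertible_note",
    "equity_crowdfunding",
    "product_crowdfunding",
}

MIN_AMOUNT_USD = 100_000    # inclusive lower bound
MAX_AMOUNT_USD = 2_000_000  # exclusive upper bound


def is_early_stage_round(round_type: str, amount_usd: int) -> bool:
    """Apply the early-stage definition: right type and right size."""
    return (
        round_type in EARLY_STAGE_TYPES
        and MIN_AMOUNT_USD <= amount_usd < MAX_AMOUNT_USD
    )


print(is_early_stage_round("seed", 100_000))     # lower bound is included
print(is_early_stage_round("seed", 2_000_000))   # upper bound is excluded
print(is_early_stage_round("unknown", 500_000))  # unknown rounds are excluded
```

Note that a round labeled "unknown" fails the type check regardless of its size, which mirrors the caveat about the roughly 1,300 excluded rounds.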

Technology Company

Definition: A company in the technology or hardware sector.

Rationale: To maximize the likelihood of high returns, skilled early-stage investors often look for companies in the technology space, as it lends itself to highly scalable growth, rapid business development and the possibility of quickly rising revenue.

Caveat: CrunchBase has 502 unique industry tags. Each company selects between one and twenty-one tags to include in its profile. Thus, the accuracy of our data is dependent on our ability to correctly categorize the 502 tags as technology or not technology. There are gray areas; we erred on the side of being inclusive.

Data quality is also dependent on each startup’s accuracy in tagging its industry sector.

We believe our classifications are roughly correct, though not perfect, and look forward to updates from CrunchBase as it attempts to solve this problem.
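One way to picture the tag-based classification is sketched below. The handful of tags shown is a hypothetical stand-in for the full mapping of 502 tags, which is not reproduced in this piece. The rule that a single technology tag is enough to count a company as a technology company is also an assumption, offered as one plausible reading of "inclusive".

```python
# Hypothetical subset of tags categorized as technology. The real mapping
# covers all 502 CrunchBase tags and is not reproduced here.
TECHNOLOGY_TAGS = {"software", "mobile", "hardware", "e-commerce"}


def is_technology_company(tags: list[str]) -> bool:
    """Assumed inclusive rule: one technology tag is enough."""
    return any(tag.lower() in TECHNOLOGY_TAGS for tag in tags)


# Made-up company profiles for illustration only.
print(is_technology_company(["Restaurants", "Mobile"]))  # has a technology tag
print(is_technology_company(["Restaurants"]))            # has none
```

Under this rule, a mis-tagged profile or a borderline tag changes the result for the whole company, which is why both caveats above matter.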

Traditional Technology Company

Definition: A company in the technology or hardware sector, excluding bio-technology, medical devices, clean-technology, nano-technology, and others that require regulatory approval or significant upfront capital.

Rationale: Technology companies that require regulatory approval or significant upfront capital often have no choice but to take non-traditional funding paths. Their investors are often skilled in these particular sectors and take a different approach to analyzing and funding companies.

Caveat: Again, the industry classifications are considered to be roughly right, but not perfect.

Overall, while we are confident that our analyses are directionally correct and within the ballpark, they should not be interpreted as definitive – at least not yet. As CrunchBase releases additional data tools and our ability to understand the data improves, we hope to increase our confidence and accuracy.

A Glance at the Data

The number of early-stage investment deals has risen over tenfold since 2007. While a portion of this hockey stick growth is likely attributable to a rise in self-reported funding rounds, evidence suggests that early-stage investing is gaining traction.


The average amount raised in early-stage financing rounds has held steady in the $0.6-0.7 million range.


In terms of geography, California has a leadership position, with roughly 40% of all early-stage deals. New York is a distant second and Massachusetts is in third place.


The number of unique investors has grown significantly since 2007. Again, a portion of this growth is likely attributable to a rise in self-reported funding rounds, yet anecdotal evidence points to a rise in the number of accredited investors, accelerators, and new venture capital firms that are getting into the game.


Perhaps surprisingly to some, in terms of investment count, the top ten early-stage investors accounted for only 9% of traditional technology investments in 2013.


A deep dive into early-stage investments in traditional technology shows that ballpark expectations can be set in regard to financing paths. A close look at the 2009 class, which is almost mature at this point, shows that roughly 60% of early-stage deals went on to at least one follow-on round. Almost 20% of ventures have experienced a confirmed exit, while the remaining are either self-sustaining or have closed their doors. Unfortunately, due to limited data, analyzing exit values and calculating returns is not wise, as the data is biased towards headline deals.
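The two cohort metrics discussed above, the follow-on rate and the confirmed-exit rate, are simple shares of a cohort. The sketch below shows the calculation on a made-up cohort of five ventures; the record layout and the data are hypothetical and chosen only so that the toy numbers echo the rough proportions quoted in the text.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Venture:
    # Hypothetical record layout; the field names are assumptions.
    name: str
    follow_on_rounds: int
    confirmed_exit: bool


# Made-up cohort for illustration only.
cohort = [
    Venture("A", 0, False),
    Venture("B", 1, False),
    Venture("C", 3, True),
    Venture("D", 2, False),
    Venture("E", 0, False),
]


def share(ventures: list[Venture], predicate: Callable[[Venture], bool]) -> float:
    """Fraction of ventures in the cohort that satisfy the predicate."""
    return sum(1 for v in ventures if predicate(v)) / len(ventures)


follow_on_rate = share(cohort, lambda v: v.follow_on_rounds >= 1)
exit_rate = share(cohort, lambda v: v.confirmed_exit)

print(f"Follow-on rate: {follow_on_rate:.0%}")
print(f"Confirmed-exit rate: {exit_rate:.0%}")
```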


One thing to keep in mind is that “grand slams” heavily skew averages. For instance, the average amount raised in follow-on round 5 jumps to $160.7 million due to Uber’s $1.2 billion Series D round in 2014. Excluding this outlier, the average amount raised is only $74 million.
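The size of this effect can be checked with a little arithmetic. The sketch below uses only the three figures quoted above and backs out how many rounds the averages imply. The round count is an inference from rounded numbers, not a figure reported in the data.

```python
# Figures from the text, in millions of dollars.
avg_with_outlier = 160.7
avg_without_outlier = 74.0
outlier = 1200.0  # Uber's Series D

# If n rounds average avg_with_outlier, and the other n - 1 rounds
# average avg_without_outlier, then:
#   n * avg_with = (n - 1) * avg_without + outlier
# Solving for n:
implied_n = (outlier - avg_without_outlier) / (avg_with_outlier - avg_without_outlier)
n = round(implied_n)

# Sanity check: rebuild the average from the implied round count.
recomputed_avg = ((n - 1) * avg_without_outlier + outlier) / n

print(f"Implied number of rounds: {implied_n:.2f} (about {n})")
print(f"Recomputed average with outlier: ${recomputed_avg:.1f} million")
```

The implied count comes out at about 13 rounds, so a single deal among a dozen or so is enough to more than double the average. With samples this small, a median would be a steadier summary than a mean.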

The results of this 2009 cohort are roughly in line with an analysis conducted by CB Insights on its 2009 class of technology companies.

Comparing data across mature company vintages, the follow-on rate does not swing wildly from one year to the next. As such, ballpark figures can be extracted.


Finally, as with the funding funnel, the fraction of ventures with a confirmed exit does not swing wildly from year to year. It is reasonable to assume that roughly 20% of early-stage investments experience an exit.

Moreover, since it is unlikely that 20% exit and the other 80% all go to zero, it can be argued that another portion – though it is unclear what percentage – of ventures experience some sort of exit, likely less notable than the 20% already accounted for. These exits don’t make headlines and are not reported in databases as they are likely mediocre at best.

In closing, these insights are not static. We’ll continue to dig into the data and report back our findings on this metric, on the others in this piece, and on new measures we develop as we sort through the available data for useful insights to share.