Predicting Venture Capital Investment Success
Have you ever wondered how startups raise money or grow large enough to become publicly traded companies? Venture capital firms are often the early investors with deep pockets, ready to fund what they believe will be a profitable investment. Large venture capital (VC) firms often look for returns of more than 3x when they invest in high-growth, market-disruptive companies. However, sometimes these firms miss an opportunity or invest in a startup that fails, which hurts their returns and must be balanced out by successful investments. The aim of this project is to classify companies as either successful (returns a profit for the VC firm) or unsuccessful (does not meet investment expectations), then train a machine learning model to predict that outcome for new companies.
I collected our data from Crunchbase, one of the world's leading private market databases. Our query covered US companies founded between 2005 and 2013 that contained data on key features like Operating Status, Acquisition Status & IPO Status.

From this search we were able to compile 7,569 companies into our dataset before we started cleaning the data. The data frame had 123 features after setting the index to the organization name. We started by removing every feature with more than 50% of its values missing and creating a class column with a default value of 0, since we hadn't set the success criteria yet. It was interesting to see the distribution of companies by founding state across the US, with California leading the pack by a wide margin.
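A minimal sketch of that cleaning step, assuming a CSV export from Crunchbase (the file name and column label below are illustrative, not the exact export schema):

```python
import pandas as pd

# Load the Crunchbase export and index it by organization name
# (illustrative file/column names, not the exact export schema).
df = pd.read_csv("crunchbase_us_2005_2013.csv")
df = df.set_index("Organization Name")

# Drop every feature that is missing in more than 50% of companies.
df = df.loc[:, df.isna().mean() <= 0.5]

# Placeholder class label; the success criteria are applied later.
df["class"] = 0
```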
Exploratory Data Analysis:
From there we plotted bar graphs to explore the data a bit further.
To set success criteria, we referenced scholarly papers and determined that a successful company is one that is publicly traded, has had more than 7 funding rounds, or has had a funding round in 2018 or later. The thinking behind these criteria was that a publicly traded company is large enough and opens up an exit opportunity after the 180-day lockup period, while a company with more than 7 funding rounds or a funding round in 2018 or later was assumed to still be growing and raising at higher valuations. Both cases also open up an exit point for VC firms to capitalize on gains.
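In code, the labeling could look like the sketch below; the column names ("IPO Status", "Number of Funding Rounds", "Last Funding Date") are assumptions about the Crunchbase fields rather than confirmed names:

```python
# Mark a company as successful (1) if any of the three criteria hold;
# column names are assumed Crunchbase fields, not confirmed ones.
is_public = df["IPO Status"].eq("Public")
many_rounds = df["Number of Funding Rounds"] > 7
recent_round = (
    pd.to_datetime(df["Last Funding Date"], errors="coerce").dt.year >= 2018
)

df["class"] = (is_public | many_rounds | recent_round).astype(int)
```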
Following the initial data exploration steps, we dug into the VC firms with the largest number of successful companies & the industries that drew the most VC investment.


As expected, software, SaaS & tech overall were the leading growth industries, reflecting the global transformation in technology over the years 2005–2013.
Data Pre-Processing:
Next, we dove into developing different ML models to classify each company as successful or unsuccessful. To transform the data into a consistent format for ML models, we cleaned it using one-hot encoding, tf-idf (term frequency–inverse document frequency) vectorization & feature type conversion. Since the data was not a 50/50 split between successful & unsuccessful companies, we used an undersampling technique from the imblearn package in Python.
We tried various models on our dataset with the aim of predicting successful companies without overfitting. Model complexity and training time were also factors, since we wanted to be able to iterate on our hyperparameter selection. After building each model we analyzed its accuracy on both the training & test sets and graphed the confusion matrix to determine which models were the best fit for our use case. Outlined below is a description of each algorithm we used along with some of its hyperparameters.
In addition, we used a grid search approach to determine the best train/test split, optimizing for the combined recall score across the training & test sets.
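A rough sketch of that preprocessing pipeline, assuming placeholder column names for the categorical and text features actually used, with the remaining columns already converted to numeric types:

```python
from imblearn.under_sampling import RandomUnderSampler
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import OneHotEncoder

# Column names here are placeholders; the passthrough columns are assumed
# to have already been converted to numeric types.
preprocess = ColumnTransformer(
    transformers=[
        ("cats", OneHotEncoder(handle_unknown="ignore"),
         ["Headquarters Location", "Industry Groups"]),
        ("text", TfidfVectorizer(max_features=500), "Description"),
    ],
    remainder="passthrough",
)

X = preprocess.fit_transform(df.drop(columns=["class"]))
y = df["class"]

# Balance the classes by undersampling the majority (unsuccessful) class.
X_bal, y_bal = RandomUnderSampler(random_state=42).fit_resample(X, y)
```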
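One way to read that search, sketched here with a random forest as a stand-in model and an illustrative list of candidate split fractions:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Try several test-set fractions and keep the split with the highest
# combined train + test recall (model and fractions are illustrative).
best = None
for test_size in [0.2, 0.25, 0.3, 0.35]:
    X_tr, X_te, y_tr, y_te = train_test_split(
        X_bal, y_bal, test_size=test_size, stratify=y_bal, random_state=42
    )
    clf = RandomForestClassifier(random_state=42).fit(X_tr, y_tr)
    combined = (recall_score(y_tr, clf.predict(X_tr))
                + recall_score(y_te, clf.predict(X_te)))
    if best is None or combined > best[0]:
        best = (combined, (X_tr, X_te, y_tr, y_te))

_, (X_train, X_test, y_train, y_test) = best
```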
Random Forest:
We used the random forest algorithm from sklearn to apply ensemble decision tree learning to our dataset. We also wanted to take advantage of random forest's feature importance attribute, which tells us which features mattered most when splitting the data. From this list, the top 5 most important features were:

We used bootstrapping as well as the out-of-bag (OOB) score to discover which features had the most importance. Our OOB score was 0.7482, which was close to our training & test accuracy.
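A sketch of that setup; hyperparameters beyond bootstrapping and the OOB score are left at the sklearn defaults, and feature_names is an assumed variable holding the post-encoding column names:

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Bootstrapped forest with the out-of-bag estimate enabled.
rf = RandomForestClassifier(bootstrap=True, oob_score=True, random_state=42)
rf.fit(X_train, y_train)

print(f"OOB score: {rf.oob_score_:.4f}")  # ~0.7482 in our run

# Rank features by impurity-based importance; feature_names is assumed to
# hold the post-encoding column names.
importances = pd.Series(rf.feature_importances_, index=feature_names)
print(importances.sort_values(ascending=False).head(5))
```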
Using the most important features, we created a list of private companies that fulfilled the key success criteria. The aim was to produce a list of companies ready for late-stage venture capital investment. Using the industry criteria & revenue range, a sample of our list is attached here:
‘UiPath’, ‘Automattic’, ‘Interos’, ‘appfire’, ‘Bossa Nova Robotics’, ‘DialogTech’, ‘Nayax’, ‘Performive’, ‘RockYou’, ‘Persivia’
This would be used by a VC firm to select the next winners based on past success.
Neural Net:
Following our Random Forest model, we implemented the MLPClassifier from sklearn. We chose a single hidden layer of 100 neurons and a maximum of 50 iterations, with early stopping set to true. After 11 iterations the model stopped improving and training halted with a validation score of 0.728814. We used the SGD solver with its standard log-loss objective. We knew that neural networks generally outperform less complex models on large datasets; ours could be classified as medium-sized, but our best results still came from the Neural Net model.
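Roughly, assuming the SGD solver and sklearn defaults everywhere else:

```python
from sklearn.neural_network import MLPClassifier

# One hidden layer of 100 units, at most 50 epochs, with early stopping;
# MLPClassifier always minimises log-loss for classification.
mlp = MLPClassifier(
    hidden_layer_sizes=(100,),
    max_iter=50,
    early_stopping=True,
    solver="sgd",
    random_state=42,
)
mlp.fit(X_train, y_train)
print(f"Best validation score: {mlp.best_validation_score_:.6f}")  # ~0.728814
```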
Logistic Regression:
We used our logistic regression model as a baseline to compare the other models' performance against, since it is a simpler model that still works very well on our largely continuous features.
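The baseline itself is a one-liner with sklearn defaults (max_iter raised here only so the solver converges):

```python
from sklearn.linear_model import LogisticRegression

# Simple baseline for the more complex models to beat.
logreg = LogisticRegression(max_iter=1000).fit(X_train, y_train)
```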
Boosting:
We wanted to apply boosting techniques to our dataset to see whether combining weak classifiers would improve results. We started with the AdaBoost algorithm, using the default parameters with 100 estimators. After getting results similar to our other models, we tried gradient boosting with 100 estimators, a learning rate of 1, and a max depth of 1. Both boosting algorithms produced similar results, with gradient boosting slightly outperforming AdaBoost in this case. A full table of results for each algorithm can be seen in the results section.
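A sketch of both boosters with the settings described above (random_state added only for reproducibility):

```python
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier

# AdaBoost: default parameters apart from 100 weak learners.
ada = AdaBoostClassifier(n_estimators=100, random_state=42).fit(X_train, y_train)

# Gradient boosting: 100 estimators, learning rate 1, stumps as weak learners.
gb = GradientBoostingClassifier(
    n_estimators=100, learning_rate=1.0, max_depth=1, random_state=42
).fit(X_train, y_train)
```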
KNN:
KNN is a lazy, nonparametric learner that is heavily affected by its hyperparameters, chiefly the number of neighbors k. We chose k = 7 and left the remaining parameters at their defaults. KNN performed better than logistic regression on the training set but not on the test set, likely because it had less data to learn from.
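The corresponding sketch, with everything except k left at the sklearn defaults:

```python
from sklearn.neighbors import KNeighborsClassifier

# k = 7 nearest neighbours; distance metric and weights stay at defaults.
knn = KNeighborsClassifier(n_neighbors=7).fit(X_train, y_train)
```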
Ensemble — Soft Voting & Hard Voting:
We combined 7 estimators in this ensemble: Random Forest, Neural Net, LinearSVC, Logistic Regression, AdaBoost, Gradient Boosting & KNN. We chose the default parameters for both soft & hard voting. Soft voting performed better than hard voting overall thanks to the added flexibility of averaging predicted probabilities rather than counting votes.
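A sketch of both voting schemes, reusing the estimators fitted above; soft voting needs predicted probabilities, so LinearSVC is wrapped in a calibrator here, which is an assumption about how those probabilities were obtained:

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.ensemble import VotingClassifier
from sklearn.svm import LinearSVC

estimators = [
    ("rf", rf), ("mlp", mlp), ("svc", CalibratedClassifierCV(LinearSVC())),
    ("lr", logreg), ("ada", ada), ("gb", gb), ("knn", knn),
]

# Hard voting takes the majority class; soft voting averages probabilities.
hard_vote = VotingClassifier(estimators, voting="hard").fit(X_train, y_train)
soft_vote = VotingClassifier(estimators, voting="soft").fit(X_train, y_train)
```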
Results:

Our primary metric for evaluating model success is the recall score, or true positive rate. Venture capital firms expect to take some losses as long as they can correctly select successful companies whose returns far exceed those losses. Therefore, recall matters much more than precision, so that an opportunity, especially a unicorn like Facebook, Uber or UiPath, isn't missed.
From the table above, the Neural Net performs the best on the test set, with scores very similar to those on the training set. Such a small difference between the two indicates that the model is not overfitting to the training set and is a good model to use. Neural nets often outperform other models, especially as the number of features & data points increases.
Discussion:
Our model can successfully classify companies based on the preset criteria with higher accuracy than the baselines reported in other research papers (~7% better). One reason for this difference is that most papers do not focus on US companies as we do, and instead use international data. While there can be successful startups abroad, comparing companies that don't face the same regulatory risk, entrepreneurial support, or other outside factors is not an apples-to-apples comparison, and the model will perform worse overall. Most venture capitalists have an area of expertise, and this model could further aid them in surveying US-based startups.
Full code is attached here: https://github.com/colaso96/Predicting_Venture_Capital_Investment_Success