We were approached by a nationally ranked venture capital firm and accelerator that brings together startup founders, investors, corporations, job seekers, universities, musicians, and artists. The platform includes more than 75 programs spanning startup accelerators, corporate programming, speaker series, conferences, skills accelerators, and fellowships. They believe that everyone deserves opportunities regardless of race, place, or gender.
Many of their programs are focused on geographical areas, largely outside of major venture capital hubs. Program X in particular works with very early-stage companies in various metro areas ranging from Birmingham to Cheyenne. Not every metro area is a good fit for the program, but programs have been successful in many different locations.
What do these locations have in common? If you were to pick a city to open a new accelerator in, what kinds of cities would you look at?
These questions aren't unique to startup accelerators. They apply to any business that knows it has been successful in a number of locations and wants an idea of where to expand next.
Several processing and cleaning steps preceded modeling and analysis, such as compiling and categorizing lists of the firm's operations by program type and merging in a large census dataset. The census dataset consisted of more than 70 variables covering demographics such as education level, income, housing statistics, and types of employment. With the data in place, we began by selecting a subset of cities where programs had already shown success and another subset where the firm had yet to establish a presence. These subsets served as the foundation for training our model, allowing us to discern the factors associated with successful markets.
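As a rough illustration of this preparation step, the sketch below joins a census extract to a list of program cities and standardizes the numeric columns so that variables on different scales are comparable. It uses plain Python rather than DataChat, and everything in it is an assumption for illustration: the city names, the two columns, the values, and the helper names `build_table` and `standardize` are all invented.

```python
from statistics import mean, pstdev

# Hypothetical census extract keyed by city; all values are invented.
census = {
    "City A": {"median_income": 60000.0, "pct_bachelors": 35.0},
    "City B": {"median_income": 40000.0, "pct_bachelors": 20.0},
    "City C": {"median_income": 50000.0, "pct_bachelors": 30.0},
    "City D": {"median_income": 50000.0, "pct_bachelors": 35.0},
}
# Hypothetical list of cities that already host a program.
program_cities = {"City A", "City C"}


def build_table(census, program_cities):
    """Join census features to the program list and flag labeled cities."""
    return [
        {"city": city, **features, "has_program": city in program_cities}
        for city, features in sorted(census.items())
    ]


def standardize(rows, columns):
    """Return copies of rows with the given columns converted to z-scores."""
    out = [dict(r) for r in rows]
    for col in columns:
        values = [r[col] for r in rows]
        mu, sigma = mean(values), pstdev(values)
        for r in out:
            # A constant column carries no information; map it to zero.
            r[col] = (r[col] - mu) / sigma if sigma else 0.0
    return out


table = standardize(
    build_table(census, program_cities),
    ["median_income", "pct_bachelors"],
)
for row in table:
    print(row)
```

Standardizing first matters for the later steps: both distance-based clustering and gradient-based classifiers behave poorly when one variable (such as income in dollars) dwarfs another (such as a percentage).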
As we analyzed the data, we observed that specific variables predicted success differently across markets. These differences helped clarify what characterizes a thriving market, and we distilled them into a set of conclusions about the conditions that support the firm's accelerator programs.
To tackle this problem, we took two complementary approaches. The data presented a particular challenge: cities where accelerator programs have succeeded are labeled (the positive class), but the majority of cities are unlabeled. An unlabeled city is not a known failure. We simply have no information about its potential for success, so the data contains no true negative class.
The first approach was Positive Unlabeled (PU) classification. To start, we trained a classification model with the cities where accelerator programs have thrived as the labeled positive examples, so that the model could learn the characteristics of successful cities. We then used the trained model to predict class probabilities for the unlabeled cities. Instead of directly labeling them, we used the predicted probabilities as a guide. Modeling in DataChat allowed us to see not only the predicted class for each city, but also the model's estimated probability for that prediction.
To understand which cities were the best fit, we simply filtered out the true accelerator cities from the dataset, then sorted on the likelihood that a city was predicted true. The result was a ranked list of cities most similar to those where they already ran accelerators.
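The ranking step described above can be sketched in a few lines of plain Python. This is not DataChat's implementation, which this post does not detail. It is a minimal illustration that assumes a common PU shortcut: treat unlabeled cities as provisional negatives, fit a probabilistic classifier (here a small hand-written logistic regression), then drop the known program cities and sort the rest by predicted probability. The city names and the two standardized features are invented.

```python
import math


def predict_proba(weights, bias, x):
    """Logistic probability that a city belongs to the positive class."""
    z = bias + sum(w * v for w, v in zip(weights, x))
    z = max(-30.0, min(30.0, z))  # clamp to keep math.exp well-behaved
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic(rows, labels, epochs=2000, lr=0.1):
    """Fit logistic regression weights by batch gradient descent."""
    n, n_features = len(rows), len(rows[0])
    weights, bias = [0.0] * n_features, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_features, 0.0
        for x, y in zip(rows, labels):
            err = predict_proba(weights, bias, x) - y
            for j in range(n_features):
                grad_w[j] += err * x[j]
            grad_b += err
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


# Hypothetical cities: ([education_z, income_z], has_program).
cities = {
    "City A": ([1.2, 0.8], True),
    "City B": ([0.8, 1.2], True),
    "City C": ([1.0, 1.0], True),
    "City D": ([0.9, 0.9], False),
    "City E": ([-1.1, -0.8], False),
    "City F": ([-0.8, -1.1], False),
    "City G": ([0.1, 0.1], False),
}


def rank_candidates(cities):
    """Score every city, then rank only those without a program yet."""
    names = list(cities)
    rows = [cities[name][0] for name in names]
    # Unlabeled cities are provisional negatives, not confirmed failures.
    labels = [1.0 if cities[name][1] else 0.0 for name in names]
    weights, bias = train_logistic(rows, labels)
    scored = [
        (name, predict_proba(weights, bias, cities[name][0]))
        for name in names
        if not cities[name][1]
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


ranking = rank_candidates(cities)
for name, score in ranking:
    print(f"{name}: {score:.2f}")
```

In this toy data, the unlabeled city whose features most resemble the program cities rises to the top of the list. The scores are best read as a similarity ranking rather than calibrated probabilities of success, since some of the provisional negatives may in fact be good fits.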
One limitation of this approach is that it assumes all the cities where accelerator programs ran were uniformly successful. In reality, while some cities may have experienced tremendous success, others might have faced challenges or even failed to thrive. As a result, DataChat's Positive Unlabeled approach is effective at finding cities similar to those where programs were conducted, but it does not differentiate between cities where the programs succeeded and cities where they didn't.
To address this limitation and refine our expansion strategy, we incorporated a complementary technique: clustering. Clustering is a data analysis method that groups similar data points together based on their shared characteristics. Using clustering algorithms such as K-means, hierarchical clustering, or DBSCAN, we aimed to identify distinct clusters of cities with varying degrees of success.
Clustering through DataChat allowed us to partition the cities into distinct groups based on similarities in demographics, economic indicators, consumer behavior, and other relevant factors. Each cluster represented a unique category of cities that share common traits, making it easier to identify potential target locations for our accelerator programs. We trained a few models, the most successful being K-means.
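For readers unfamiliar with K-means, here is a minimal sketch of the algorithm in plain Python. It is not DataChat's implementation. The points are invented two-feature cities, assumed to be standardized already, and the function names and parameters (`kmeans`, `sq_dist`, `seed`) are hypothetical.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, max_iterations=100, seed=0):
    """Basic K-means: alternate assignment and centroid updates."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    assignments = None
    for _ in range(max_iterations):
        # Assignment step: each point joins its nearest centroid.
        new_assignments = [
            min(range(k), key=lambda c: sq_dist(p, centroids[c]))
            for p in points
        ]
        if new_assignments == assignments:
            break  # converged: no point changed cluster
        assignments = new_assignments
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignments, centroids


# Hypothetical standardized features, e.g. [education_z, income_z].
points = [
    [1.0, 1.1], [0.9, 1.0], [1.1, 0.9],
    [-1.0, -0.9], [-1.1, -1.0], [-0.9, -1.1],
]
assignments, centroids = kmeans(points, k=2)
print(assignments)
```

Because K-means relies on distances, the number of clusters `k` and the scaling of the input variables both shape the result, which is one reason to try several models and compare them.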
The first step in analyzing these clusters was to identify the cluster that contained the highest percentage of Program X cities – the cities where the accelerator program, Program X, had been successful. This cluster represented the most promising group of cities, as it shared similarities with the locations where the accelerator had already achieved success. Once this cluster (cluster 1) was identified, the hand analysis process involved a detailed inspection of the other cities within that same cluster. The goal was to understand what factors made these cities similar to the successful Program X cities and what potential opportunities they held for future expansion.
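The cluster-selection step can be expressed compactly. The sketch below computes the share of Program X cities in each cluster, picks the cluster with the highest share, and lists the remaining cities in that cluster as candidates for closer inspection. The cluster labels, city names, and the helper `best_cluster` are hypothetical.

```python
from collections import defaultdict


def best_cluster(assignments, is_program_city):
    """Return the cluster with the highest share of program cities."""
    totals, hits = defaultdict(int), defaultdict(int)
    for cluster, flag in zip(assignments, is_program_city):
        totals[cluster] += 1
        hits[cluster] += int(flag)
    shares = {cluster: hits[cluster] / totals[cluster] for cluster in totals}
    top = max(shares, key=shares.get)
    return top, shares


# Hypothetical clustering output and program flags.
names = ["City A", "City B", "City C", "City D",
         "City E", "City F", "City G", "City H"]
clusters = [1, 1, 1, 1, 0, 0, 2, 2]
program = [True, True, False, True, False, False, True, False]

top, shares = best_cluster(clusters, program)
# Cities in the top cluster that do not yet host the program.
candidates = [
    name
    for name, cluster, flag in zip(names, clusters, program)
    if cluster == top and not flag
]
print(top, shares[top], candidates)
```

A share-based rule favors small clusters, where a single program city can produce a high percentage, so it is worth checking cluster sizes alongside the shares before committing to one cluster.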
Our hand analysis of the clusters surfaced several characteristics shared by cities in the successful clusters. These cities had moderate to high population densities and sizeable populations, and they skewed slightly toward a younger demographic. They also showed higher levels of educational attainment and higher median household incomes.
By combining techniques in DataChat with hand analysis, we extracted actionable strategies from our client’s raw data. The Positive Unlabeled technique provided a launching point for analysis, giving us quickly digestible insights into the data. Clustering and hand analysis provided a robust method for grouping cities and then understanding the trends and relationships within those clusters. Together, these approaches allowed us to discern not only which cities are the best next candidates for startup accelerator programs, but also why.
DataChat is a cohesive analytics platform that uses natural language to make a broad range of data science tools, including data wrangling, preparation, exploration, visualization, and predictive modeling, accessible to everyone to improve business outcomes. Contact us or schedule a demo to learn more about how DataChat can help you improve your business outcomes.