From government, health, and financial data to weather, baseball, and Star Trek, countless free data collections are available to scratch your analytical itch.
Bosses love to hear the word “free.” Everyone wants to get something for nothing. The good news is that a growing number of free data collections are available for use. Some of them may even be useful for your project or career.
What’s the catch? Sometimes there isn’t one. Many of the sources below come from government agencies. Once they’ve gathered the information, it often costs them very little to share it with the public. Technically, it’s not free because you pay for it on April 15. But the good news is that it won’t come out of your project budget.
Other data collections are a sophisticated form of advertising. All the major cloud companies host a variety of open data sets. You don’t need to use their cloud servers, but performance is much better when the bits are stored in the same data center. Cloud companies could buy 30-second spots during the Super Bowl, but this form of advertising is a better deal for everyone.
One danger of working with free data is that the boss will assume it is also free of problems. Sometimes the data will require a little extra work. Perhaps the government agency that collected it prefers its own special format. Perhaps the data needs to be reorganized to suit your needs. Most likely you will need to write a little code to make it work.
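As a rough illustration of that "little code," here is a minimal Python sketch that turns a padded, pipe-delimited export into plain CSV. The format, the column names, and the sample values are all made up for illustration; a real agency file will have its own quirks.

```python
import csv
import io


def agency_to_csv(raw: str, delimiter: str = "|") -> str:
    """Convert a delimited export with padded fields into clean CSV text."""
    reader = csv.reader(io.StringIO(raw), delimiter=delimiter)
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for row in reader:
        # Strip the padding that fixed-width style exports tend to carry.
        cleaned = [field.strip() for field in row]
        if not any(cleaned):
            continue  # skip blank lines
        writer.writerow(cleaned)
    return out.getvalue()


# Hypothetical sample, not taken from any real agency file.
SAMPLE = (
    "STATION | DATE       | TEMP_F\n"
    "KBOS    | 2020-01-01 | 41\n"
    "\n"
    "KJFK    | 2020-01-01 | 44\n"
)

if __name__ == "__main__":
    print(agency_to_csv(SAMPLE))
```

The function works on text in memory, so the same cleanup can be applied to a downloaded file by reading it into a string first.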
Some data projects work like open source software and work best when everyone contributes a small part. I have a weather station in my backyard connected to the Personal Weather Station network, which collects data from nearly a quarter of a million citizen scientists. Participation is essential, but you also get to take advantage of everyone else’s work at the same time. If your work will build on these projects, be prepared to pull your weight.
The good news is that the barriers to entry are low. You don’t have to ask permission and you don’t need to ask for forgiveness. Here are some corners of the web where you can just start downloading and exploring.
The General Services Administration (GSA) maintains Data.gov, a large catalog of data sets that the U.S. government shares publicly. At the time of writing, there are 210,756 entries, many of which come from agencies that specialize in supporting trade (maritime, agricultural, energy). You won’t find any secrets from the secret agencies, though, and nothing from Area 51.
Some data sources are no more than a file repository. Kaggle is more than that. It started with over 50,000 different data sets and then added basic tools (Jupyter notebooks) for making sense of them. There are 400,000 public notebooks that other data scientists have shared to analyze the underlying data. On top of that, Kaggle has added online courses on how to use it all and mixes in contests with real cash prizes.
For example, Cornell’s Lab of Ornithology is offering $25,000 for the best classifiers of bird songs, or what they call “birdcalls.” The OpenVaccine initiative will award $25,000 to the best models for predicting how RNA degradation will affect COVID-19 vaccines. There’s a lot of serious work to be found among the CSV and JSON files, but if you need a break, there is fun to be had too. For example, one data set is filled with the lines of dialogue from every episode of six major Star Trek series.
The FiveThirtyEight website is dedicated to reporting stories backed by rich collections of data. When possible, it also shares these data sets so you can do your own research. There are past records of its predictions for the major sports leagues, explorations of social attitudes such as a survey asking men what it means to be a man, and, of course, relentless polling on upcoming political contests.
The United Nations agency responsible for helping raise healthy children around the world (UNICEF) shares various data sets that are useful to anyone with the same goal. You can find the big picture in summary data sets like the 2019 World Children’s Health Statistics Table, for those who want to track change by the numbers. More focused pictures can be found in tables exploring how iodized salt affects disease or the success of primary education.
The Ohio State Library keeps an up-to-date web page with pointers to some of the largest collections of economic and financial data. There are historical U.S. data collections and others gathered by the World Bank. Some require academic accounts and some are free to the public.
American sports are blessed with fans who are proficient enough with computers to build rich data collections about the players and the results of their games. For example, Sean Lahman’s database contains complete batting and pitching statistics from 1871 to 2019. There are also more detailed tables covering fielding statistics, managerial changes, and World Series results. These may be less complete, and some are devoted to the modern era, which in major league baseball began in the 20th century.
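To give a feel for working with season-by-season batting tables of this kind, here is a minimal Python sketch that totals at-bats and hits per player and derives a career batting average. It assumes a CSV with `playerID`, `yearID`, `AB`, and `H` columns, modeled on the Lahman batting table; the rows below are invented for illustration and are not real statistics.

```python
import csv
import io
from collections import defaultdict

# Made-up rows for illustration only; not real player statistics.
SAMPLE_BATTING = """playerID,yearID,AB,H
playera01,1998,500,150
playera01,1999,400,100
playerb01,1998,0,0
"""


def career_batting_average(csv_text):
    """Return {playerID: hits / at-bats} summed over all seasons."""
    totals = defaultdict(lambda: [0, 0])  # playerID -> [at_bats, hits]
    for row in csv.DictReader(io.StringIO(csv_text)):
        # Older seasons can have empty cells, so treat blanks as zero.
        totals[row["playerID"]][0] += int(row["AB"] or 0)
        totals[row["playerID"]][1] += int(row["H"] or 0)
    # Players with no at-bats have no defined average and are left out.
    return {
        player: round(hits / at_bats, 3)
        for player, (at_bats, hits) in totals.items()
        if at_bats > 0
    }


if __name__ == "__main__":
    print(career_batting_average(SAMPLE_BATTING))
```

Summing hits and at-bats before dividing avoids the common mistake of averaging the per-season averages, which would weight a short season the same as a full one.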
Project Retrosheet was started to gather play-by-play accounts of every major league game wherever possible, and it is now complete back to 1974. If you happen to have access to a scorecard from an earlier game, check the “most wanted” list to see if you can fill a gap. The Chadwick Baseball Bureau maintains a GitHub repo of the data if you prefer.
The Society for American Baseball Research maintains a list of other sources, including services from commercial organizations such as FanGraphs, Baseball-Reference, and Major League Baseball itself.
If you’re just looking for a specific data set, Google Dataset Search lets you search the entire web for data sets by keyword. Results can be filtered by license, data format, and time since the last update. Some of the most compelling data sets are also included in Google’s public data directory, which not only lists sources but also provides interactive overview pages. For example, World Bank data charts birth rates against life expectancy, and you can use a slider to track how the relationship changes over the years.
Amazon Web Services
AWS users who want data stored in S3 buckets can turn to the Registry of Open Data on AWS, or RODA. The thousands of data collections are diverse, but the highlights tend to be collections from sources AWS works with publicly, such as the Space Telescope Science Institute (stars), NOAA (NEXRAD weather radar imagery), and Common Crawl (more than 25 billion web pages). Of course, there are some good examples to help you get started analyzing the data with AWS services like Lambda or Comprehend.
Microsoft also hosts several data sets on Azure. City planners can dig into the filings from the New York City Taxi and Limousine Commission, which tracks every fare. Economists and traders can study commodity price records for insight into inflation and economic change. All are ready to be analyzed using Microsoft’s machine learning tools.
Some of the content we post on Facebook is private because we make it so. Some is shared only with friends. Some is completely open. Facebook supports research on the so-called “Facebook graph” with its Graph API. It’s not like downloading an entire data set, but it can be useful for some queries. Just remember that not everyone uses the same privacy settings, so you may not see every person or every post.
Yelp, the site famous for reviews of restaurants, bars, and other public accommodations, shares a lot of information in a public data set that you can study. There are over eight million reviews of more than 200,000 businesses waiting for you or your AI to parse. They are a good source of training data for natural language processing and machine learning.
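Review collections like this are often shipped as JSON Lines, with one review object per line. The Python sketch below reads such lines and averages the star rating per business. The field names `business_id`, `stars`, and `text` are assumptions modeled on the Yelp review file, and the sample reviews are invented.

```python
import json
from collections import defaultdict

# Invented sample records; field names are assumptions, not a documented schema.
SAMPLE_REVIEWS = "\n".join([
    json.dumps({"business_id": "b1", "stars": 5, "text": "Great tacos"}),
    json.dumps({"business_id": "b1", "stars": 3, "text": "Slow service"}),
    json.dumps({"business_id": "b2", "stars": 4, "text": "Solid coffee"}),
])


def average_stars(lines):
    """Average the star rating per business from an iterable of JSON lines."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        review = json.loads(line)
        sums[review["business_id"]] += review["stars"]
        counts[review["business_id"]] += 1
    return {business: sums[business] / counts[business] for business in sums}


if __name__ == "__main__":
    print(average_stars(SAMPLE_REVIEWS.splitlines()))
```

Because the function takes any iterable of lines, an open file handle can be passed in directly, which keeps memory use flat even for a multi-gigabyte review file.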
Open data set
Extracting website content
Not all data sits in a database that is easily accessible through an API. A large amount of information is embedded in websites and needs to be separated from them with some clever tool. This so-called web scraping is still a pretty good method, but it may have legal limitations. Some sites ban it in their terms of service, and others watch for too many requests from one user and then cut the user off or slow down responses.
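One small courtesy before scraping is to check a site’s robots.txt, which is separate from its terms of service and does not replace reading them. The Python sketch below uses the standard library’s `urllib.robotparser` on an in-memory robots.txt; the rules, the bot name, and the example.com URLs are all hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, not copied from any real site.
ROBOTS_TXT = """User-agent: *
Disallow: /private/
Crawl-delay: 5
"""


def build_rules(robots_text):
    """Parse robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


rules = build_rules(ROBOTS_TXT)
BOT_NAME = "my-research-bot"  # hypothetical user agent string

if __name__ == "__main__":
    for url in ("https://example.com/data/table.html",
                "https://example.com/private/report.html"):
        print(url, "allowed" if rules.can_fetch(BOT_NAME, url) else "blocked")
    # Seconds the site asks crawlers to wait between requests, if stated.
    print("crawl delay:", rules.crawl_delay(BOT_NAME))
```

Sleeping for at least the stated crawl delay between requests is one simple way to avoid tripping the rate limits mentioned above.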
Tools like Puppeteer help spin up one (or more!) headless versions of a web browser, so downloading a web page, extracting the right data, and doing it over and over again becomes simpler. Headless versions are now available for most major browsers, thanks to the software testing community, which needs to automate the testing process. Scraping the web may not always be appropriate, but it may be the fastest way to get the data you need. There is nothing more open than the open web.
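Puppeteer itself is a JavaScript library that drives a real browser. For pages that are plain static HTML, a much lighter approach can be enough. The Python sketch below swaps in the standard library’s `html.parser` to pull the rows out of an HTML table; it is not a substitute for a headless browser when a page builds its content with scripts. The sample markup and values are invented.

```python
from html.parser import HTMLParser


class TableExtractor(HTMLParser):
    """Collect the text of every table cell, grouped by row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None  # drop any cell left unclosed


def extract_rows(html):
    parser = TableExtractor()
    parser.feed(html)
    parser.close()
    return parser.rows


# Invented sample page fragment.
SAMPLE_HTML = """
<table>
  <tr><th>City</th><th>High</th></tr>
  <tr><td>Boston</td><td>41</td></tr>
  <tr><td>New York</td><td>44</td></tr>
</table>
"""

if __name__ == "__main__":
    for row in extract_rows(SAMPLE_HTML):
        print(row)
```

The extracted rows are plain lists of strings, so they can be handed straight to `csv.writer` or loaded into whatever analysis tool comes next.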