Data Acquisition for Beginners
By Lylla Younes
Data acquisition illustration
In Short: Finding and gathering information is essential to any investigation, but not all data sets are created equal. This introduction includes descriptions of common file types, methods for gathering data, and a how-to guide on converting your data into a format more easily analysed by computer programs.
Once you have found information that is pertinent to your investigation, how do you turn it into a machine-readable, analyzable form? What are the different forms and file types that data can take?
In this introductory guide, we will discuss the basics of data files and consider different methods of data acquisition using open source tools and Microsoft Excel.
Data acquisition is the process by which researchers find and gather information. In many cases, data is not available in an easily downloadable, analyzable format. It could be scattered across a hundred PDF files or located in a data table on a webpage without a download option. In some cases, the data does not exist anywhere, and the researcher must create their own dataset.
Examples highlighting the importance of data acquisition in investigative work are endless. Take, for instance, the data behind ProPublica’s database of priests in the US who have been credibly accused of sexual misconduct. Journalists at ProPublica had to pull lists of priests from hundreds of PDF files into one central and searchable database. Or The Guardian’s interactive database, The Counted, which documented every police killing in the US in 2015 and 2016. People at The Guardian acquired and aggregated the information by painstakingly combing through local news clippings, among other methods. Or, take Ad.watch, an ongoing project where an artist and a researcher have been spending months collecting data of political ads on Facebook, Snapchat, Instagram etc. to show the extent to which political actors invest in voter targeting via the social media platforms (see more about their methodology in this ETI Kit guide.)
File Types and Machine-Readable Formats
Not all datasets are created equal. The first thing to consider if you want to begin to analyze a dataset is whether or not the available format allows for computer-based analysis.
In many instances, investigators will want to pull data from a website or document into “machine-readable format,” or a structured form that can be processed by a computer. A text file with a jumble of words is considered unstructured. You cannot open it in Microsoft Excel and find its information displayed in neat rows and columns. Generally, any data that you want to analyze using a computer program or software must be machine-readable.
Delimited Data Files
One of the most common machine-readable formats is CSV, short for “comma separated values.” A CSV file is formatted the way it sounds: each line in the file is a row of data, and the information in each row of data is separated by commas. The comma, in this case, is called a “delimiter,” or a character that marks the beginning or end of a unit of data. While commas are perhaps the most common delimiters that you will find in data files, you might encounter other characters as well. For example, a TSV file, short for “tab separated values,” separates each unit of data by a tab. Delimited files are useful because they are very easy to read, write, and manipulate across software and programming languages.
How do you open a delimited file?
Files with the extension “.csv” or “.tsv” indicate the use of a delimiter. They might also have the extension of a text file (“.txt”), or no extension at all. Let’s check out an example similar to one you might encounter when trying to download publicly available information.
tsv example Example: data in comma-separated vs. tab-separated format.
Let’s say that you’re investigating the surface water quality in an area of Arizona where an industrial plant recently spilled a dangerous quantity of chemical waste. Your goal is to determine whether the nearby community’s drinking water was compromised.
Through your research, you find the United States Geological Survey (USGS), a governmental research organization that collects vast amounts of data on the environment. Start by downloading this sample of water quality data from the state of Arizona. This data was pulled directly from USGS’s water data portal.
Once you have downloaded the file, open it up in your text editor of choice - I use Sublime Text.
`` > **Note**: > > [Sublime Text](https://www.sublimetext.com/3) is one of the most popular open source text editors > because it has text highlighting for many different programming > languages. That means, it colors certain parts of the code differently > to indicate syntax. It also has a lot of customizable features (meaning you can adapt them to your needs) and > plug-in options. I recommend it! You will notice that the file has no extension. But a little bit of scrolling makes it clear that the file is delimited by _**[tabs](https://en.wikipedia.org/wiki/Tab_key)**_. Now that you know the delimiter, you can open the file in Microsoft Excel. ```cik-note
For this guide, we will be using Microsoft Excel and Google Sheets to open delimited files and analyze data. However, you will find similar functionality in software like Libre Office and Apache Open Office. It is good to keep in mind that while Google Sheets is very similar to Microsoft Excel, it relies on a browser, and therefore has certain limitations. For example, large quantities of data can be slow and difficult to manage. A file with a million rows of data could cause your browser to crash. That said, once you understand the high-level concepts in this guide, you should not have too much trouble applying them to different software.
First add the extension “.txt” to the file, so that Excel can know that it is a file type that it can read. You will also notice that the first 75 or so lines of the file are just notes and not analyzable data. It is a good idea to copy and paste those lines into a different text file for reference, and delete them from the original file.
Open a clean Excel workbook and click File > Import. Select “Text File” and hit Import.
Import Screenshot by Lylla Younes
Select the file that you downloaded (it will only be selectable after you add the appropriate extension).
Import4 Screenshots by Lylla Younes
Select “Delimited”. Hit Next and then Finish. Accept the default cell location for the data. The result should be a clean Excel file with your delimited data – ready to analyze.
Import6 Screenshots by Lylla Younes
You can repeat this process with any type of delimited text file. It is also possible to import a tab-separated file (.tsv), and save it as a comma-separated file (.csv). Spend some time opening different types of delimited files between your text editor and Excel to get familiar with their formatting.
Like delimited files, JSON files are extremely easy to use. They are human-readable and can be edited in a text editor. They can be compressed to a single line, making them lightweight.
Here, “lightweight” refers to the fact that the file is easier (and therefore faster) for a computer program or software to read, either because of the way it is formatted, or because it is simply not a large file.
You will probably encounter a shapefile if you are working on an investigation that involves mapping or any type of geospatial analysis. To borrow from ESRI, one of the world’s largest suppliers of geographic information systems (GIS) software, a shapefile is a “format for storing geometric location and attributing information for geographic features.” These features may be represented as points (usually latitude/longitude coordinates), lines, or polygons (areas). A shapefile, which has the file extension “.shp,” usually comes in a folder of files. This means that when you download a shapefile, you will end up downloading a folder, but the file in that folder that you will want to pull into your chosen software will end with extension “.shp.” We will discuss in detail the various methods of opening and manipulating shapefiles in a different guide of the Kit.
Other File Types
The above sections contain only a few of the dozens of possible file types you may encounter while working with data. Since these are very common and standard file types, it is easy to convert between them and other types of files.
Scraping Lists and Tables with Google Sheets
data acquisition illustration
In some cases, the data that we are interested in is displayed in a table on a webpage. What should we do if we want to download that data for manipulation or analysis, but there’s no download button? We could try to highlight the whole table and copy and paste it into Microsoft Excel or Google Sheets. That works some of the time, but not all. Let’s look at a simple example.
This Wikipedia page has a list of historical nuclear weapons stockpiles and nuclear tests by country. When I highlight all the contents of the table and copy it (Command+C on Mac / Ctrl+C on Windows), and then paste it into an empty, new spreadsheet in Google Sheets (Command+V on Mac / Ctrl+V on Windows), I get a bunch of jumbled text on 2 rows:
attempt2 Screenshots by Lylla Younes
Fortunately, in a case like this, we have a plan B. Google Sheets has a built-in function that enables users to scrape data from tables and lists on web pages into Google Sheets spreadsheets. From there, you can download the spreadsheets, and continue your work offline in whatever software or program you choose, or you could continue your analysis in Google Sheets. Let’s take a look at how it works.
This web page details the Google Sheets InnerHTML function, which imports data from a table or list within an HTML page. If you are not familiar with how Google Sheets functions work, this support document is a good place to start.
To use InnerHTML, type a function into a cell in the spreadsheet. Every function starts with an equals (=) sign. The InnerHTML document explains that the function takes three inputs:
The URL of the web page the data is on. It’s important that the URL is in quotation (” “) marks.
The type of structure that the data will be contained in. For this function, the input will be either “list” or “table.” Again, don’t forget the quotation marks.
The number of the table or list on the page. For example, if there are 3 tables on a page and you want to import data from the second one, you would put 2. No need for quotation marks here.
For our example, we are interested in the first table on the page. Therefore, our function will look like this:
It is important to follow this exact punctuation format. Pay close attention to where and when you add your commas and quotation marks.
When I type this into the first cell of an empty Google Spreadsheet and hit enter, I see my data appear in clean format, ready to be analyzed:
attempt3 Screenshot by Lylla Younes
In this example, we used a fairly small table. Keep in mind that you can use this function even on very large tables or lists, as long as they’re on the same webpage.
Wrangling Data from PDFs
Data wrangling is the process of converting data from its raw, unstructured form to a form that is analyzable by computer software and programming languages.
There might be times when you have data in a local file on your computer, but it is not in machine-readable format. One of the most common non-machine-readable formats for data is a PDF file. Have you ever seen pages and pages of tables in a PDF file and thought, “what a complete waste?” If only that data were in a spreadsheet so that you could easily search, sort, and filter it.
A common solution to this problem is writing some Python code to scrape the data from the PDF file. But this takes a level of computer programming knowledge that not everyone has. Luckily, there are a number of different options for non-coders trying to get data from a PDF into machine-readable format. In this guide, we will explore just one of those options, a free and open source software called Tabula. Tools like Comet Docs are really effective, but require a paid subscription to use them. That said, every software has its limitations, and there are times when what will work in one software won’t work in another. The aim of this section is to introduce you to an excellent open source tool and expose you to the high-level concepts that will enable you to seek out other tools for similar purposes in the future, should you need to.
Tabula’s explicit purpose is “liberating data tables locked inside PDF files.” Let’s see how this works in practice.
First, download the version of Tabula compatible with your operating system. You’ll find installation instructions about halfway down the page. Once you’ve downloaded the software, open Tabula on your computer by double clicking the icon. You will notice that instead of opening an application on your computer, Tabula will open a new window in your internet browser. If that does not happen, go to http://localhost:8080. You should see a page that looks like this:
tabula0 Screenshot by Lylla Younes
Next, to practice a bit, download this PDF document, which contains a roster of attorneys in Maine’s district court system (we’re using this file for the sake of example).
Now let’s try importing the PDF file into Tabula. Click Browse, and locate and select the file on your computer.
tabula1 Screenshot by Lylla Younes
Either double click the file or select Open.
tabula2 Screenshot by Lylla Younes
Then click Import.
tabula3 Screenshot by Lylla Younes
Depending on the size of the PDF, it may take some time for the file to upload. Be patient!
tabula4 Screenshot by Lylla Younes
Once the file is uploaded, you should see a preview of your PDF in the browser. Now it’s time to extract the tables.
tabula5 Screenshot by Lylla Younes
The easiest way to extract data tables from a PDF using Tabula is with the AutoDetect Tables feature. You can use this feature when many or all the tables in the PDF document are formatted in the same way, as is typically the case. Luckily, our example file fits this description. Click AutoDetect Tables.
tabula6 Screenshot by Lylla Younes
You should see a semi-transparent red rectangle highlighting the entire area of the table. If it’s a little off, you can manually click a corner of the rectangle and drag it so that it covers the entire table.
tabula7 Screenshot by Lylla Younes
Once you’re comfortable with the area you’ve selected, click Repeat this Selection to copy the selection area to every page in the PDF document.
tabula8 Screenshot by Lylla Younes
At this point, you should hopefully be able to see your data removed from the tables and in tabular form. Click Export to download the data into a CSV format that you can drag into Excel.
tabula9 Screenshot by Lylla Younes
Retrieving Data from APIs
Data illustration cik
One of the most important skills in data acquisition is learning to request information from APIs.
The term API is an acronym that stands for Application Programming Interface. You can think of it as a post office on the internet where you can send a request and, in return, receive a parcel that you’re asking for. To better understand this metaphor, let’s step back and consider how the internet works. The World Wide Web is essentially a large network of connected servers.
A server is a computer program or a hardware device that provides a service to another computer program and its user (known as the client). An example of a typical server service is providing data to other computers.
Every page on the internet is stored somewhere in the world on a remote server, or a remotely located computer designed to process requests. When you type www.twitter.com into your browser, a request is sent to Twitter’s remote server. The server processes the request, and sends back a response. Your browser then processes the response information and displays the page.
In this example, your browser interacted with Twitter’s API — the part of Twitter’s remote server that receives requests and sends responses. In most cases, when we load a new page on the internet, we don’t know what’s happening in the background. But some companies and organizations have made their APIs freely available to the public, and have published detailed documentation that enables developers to request data and understand the way the API’s information is structured. One popular example is Google’s Geocoding API, which allows users to send an address to Google’s server and in turn receive the location’s latitude and longitude. As you might imagine, this functionality can be useful for mapping projects in which a researcher wants to plot points across a geography. Another example is the Spotify API, which returns metadata about artists, songs, and albums from Spotify’s Data Catalogue.
In many cases, companies place limitations on what people can do with their APIs for free. For example, the company behind this weather API only allows users to gather 3 days of forecasts, rather than the 10 they would be able to request if they paid for premium API access.
In other cases, companies charge high sums for developers to use their APIs. In fact, there are companies whose main product is their API! Take, for example, PredictHQ, a company with a detailed weather forecasting API that is used by companies such as Uber and Alaska Airlines. TomTom’s Traffic API has information on traffic flow and incidents in over 77 countries. These examples illustrate how governments and companies rely on APIs to acquire detailed real time data to inform logistics operations.
There are many well-documented, open source APIs that can be useful for journalists and investigators. The Twitter API, for example, allows developers to request tweets in a given timeframe, with a particular hashtag. Many journalists have used this API to understand the dynamics of public opinion and the spread of misinformation.
Let’s practice with APIs
Let’s find a basic API to practice with. The Holiday API contains data on holidays in dozens of countries around the world. As with almost all professionally developed APIs, you have to sign up to receive a unique API key to access the data. You will use this key to request data from the API.
Digital safety and privacy
Whenever you sign up for an API key or any other online access/services for that matter, it is recommended you create an email address that is not connected to your personal or regular work email.
Read about essential digital safety awareness and recommended measures when conducting online research in the Exposing the Invisible Kit’s “Safety First!” guide.
api Screenshot of https://holidayapi.com/. Source: Lylla Younes
Sign up for the API here and click the Get Your Free API Key button, in the center of the home page (it should show up after you’ve made an account). Halfway down the resulting page, you should see a dashboard containing basic account information, including your API key. You will use this key shortly, when you make your first request to the API.
api 2 Screenshot by Lylla Younes
It’s always a good idea to spend some time exploring an API before requesting data from it. If you click on the Countries tab at the top of the page of the Holiday API, you will see all of the countries for which the API has holiday data. If you click on a country name, you can see what type of holiday information is available for you, for that place. Spend some time clicking through the countries and seeing what kind of information you can learn from this API. Once you feel like you have an idea of the information available to you, let’s figure out how to request data.
Ideally, the API you are using has some sort of documentation that you can read in order to understand how to access its data. Click the Developers tab at the top of the Holiday API page. The name of the tab should give you a clue that this is where you can find technical specifications about data requests.
A well-developed API will provide several essential pieces of information that you will need to make a data request. First of all, it should provide a base URL, from which you will build your request. This URL is typically just the URL of the API itself. You add parameters to the URL, to specify the information that you want the API to send you. Certain parameters are required in order to make an API request. For example, if you had to make an API key to access the API, that key will almost certainly be a required parameter in your request. As you can see on the Developers page, there are three parameters that you must provide when making a request to this API:
api3 Screenshot by Lylla Younes
Most API documentation includes some example code showing developers how to query, or request data, from the API:
api4 Screenshot by Lylla Younes
Clicking on the downward caret (circled in red) reveals a dropdown containing the name of different programming languages. Clicking on any given language displays example code for querying the API in that language. What we need to pay attention to is the base URL, or the root URL from which we will build our request. When you request data from an API, your requested parameters will be appended to the end of the API’s base URL. We’ll see this process in action momentarily.
While most developers write computer programs to query APIs, you don’t have to know how to code to access an API’s data. We will be working with Postman, an interface that takes user input to format and send requests to an API. Let’s get set up with this service to see how it works.
First, create an account on Postman, and then download the desktop application for your operating system. It is possible to use Postman in the browser without downloading the application, but I have found that the company places significant limitations on the types of requests you can make in the browser.
Once you open the desktop application and sign in, you should see a page like this:
api5 Screenshot by Lylla Younes
Let’s make our first request! At the top left-hand corner of the page, click the button + New:
api6 Screenshot by Lylla Younes
In the menu that shows up, click Request:
api7 Screenshot by Lylla Younes
Type in a name for your request (use something that relates to the API that you’re querying). You may want to also add a short description detailing the request you want to make.
api8 Screenshot by Lylla Younes
You have to add your request to a Collection, which is basically just a folder containing requests. Click + Create Collection, and give your collection a name. Then click the check mark and select Save.
api9 Screenshot by Lylla Younes
Now it is time to build your request! Let’s refer back to the API’s developer’s documentation, which lists the parameters that are required for any request to the API: the country, the API key, and the year. As with any API, there is a particular way that users are required to input these parameters. For example, the documentation indicates that the country parameter must be in ISO 3166-1 alpha-2 or ISO 3166-1 alpha-3 format, which are standardized two- and three-letter codes for representing countries. States / provinces should be in ISO 3166-2 format. Clicking any of the hyperlinked format types opens a Wikipedia page with the format’s specifications. Clicking on the API’s list of available countries leads to a table of all the countries for which the API has data, including the country codes in their appropriate formats. The United Arab Emirates, for example, could be requested with code AE or ARE.
Let’s request holiday data from Brazil for the year 2019. Under the Query Params header, start by inputting your API key. You will type in “key” as your parameter and the API key itself for the value. It should look something like this:
api10 Screenshot by Lylla Younes
Next, add the country and year:
api11 Screenshot by Lylla Younes
Finally, click send. Your request will be sent to the API, and you should receive a response in JSON form, as shown below:
Screenshot by Lylla Younes
You can save the response as a JSON file on your computer by clicking the Save Response button on the top right-hand corner of the response box.
Let’s take a more detailed look at the information and structure of the response. JSON data is organized by key-value pairs. The “keys” are the elements on the left side of the “:” and the values are on the right. So for example, in the response above, the first key is “status” and the value of that key is “200” which indicates that the request was successful. The information we are most interested in is contained within the key “holidays.” The value of that key is a list of all the holidays in Brazil during the year 2019. Each holiday in the list is structured as an individual JSON object with metadata on the holiday. The first holiday, for example, is New Year’s Day. Scroll through all the holidays in the list and skim their metadata to understand the layout of the information.
Of course, in your future investigative work, you will likely be interested in APIs containing different sorts of information. Shipping routes, geographical locations, and social media activity are all examples of data points that can be gleaned through APIs. Getting familiar with these important “internet post offices” will help you gather information and understand how the powerful corporations and governing bodies of the world streamline their own data acquisition practices.
Published March 2021
Articles and Guides
Tools and Databases
Comet Docs. An online document management system.
Python. A programming language that allows developers to write various programs such as web applications, websites, data analysis tools.
SublimeText. Text editor for code, markup.
Tabula. A tool for extracting data tables locked inside PDF files.
Tableau. A suite of software applications for data visualisation and analysis.
API – standing for “application programming interface”, is a software tool that facilitates communication between a user and a dataset, amongst other things. Through an API a platform (for instance, Facebook) can make its data accessible to external developers for free or under some conditions or fees.
Data acquisition - the process by which you find and gather information.
Data wrangling – the process of converting data from its raw, unstructured form to a form that is analyzable by computer software and programming languages.
Hashtag – symbol introduced by the number sign, or hash symbol, #, is a type of metadata tag used on social networks such as Twitter and other microblogging services. It lets users apply dynamic, user-generated tagging that helps other users easily find messages with a specific theme or content. (source Wikipedia)
Machine-readable data – information that is in a form that can be processed by computer software and programming languages. Machine-readable data should be organized in a file format like CSV or JSON.
Metadata – information that describes properties of a file, be it image, document, sound recording, map etc. For example the contents of an image are the visible elements in it, while the date the image was taken, the location and device it was taken on, are called metadata.
Programming language – a formal language made up of a set of instructions for producing output. An example of a programming language is Python. Like a human language, programming languages have “syntax,” or rules for how the user should write code to achieve their desired output.
Python - a programming language that allows developers to write various programs such as web applications, websites, data analysis tools: https://www.python.org/.
Python editor - application to navigate, debug, compile and run scripts in the Python language.
Scraping – the process of extracting data/information from human-readable content, like that on a webpage, into a machine-readable format, like a CSV file.
Server – a computer program or a hardware device that provides a service to another computer program and its user (known as the client). An example of a typical server service is providing data to other computers.