Almost always, we come across data that is not in the format we need. Sometimes it contains exactly the information we need, and other times we need additional information that can be derived from the data at hand. In this tutorial we will explore the second case, where the data is present but we need something more.
The motivation behind this post is very simple. I was able to find data on all the subway stops in New York City (NYC), so I knew the latitude and longitude of each stop. What I was missing was the actual physical address. The task was to use the Google Geocode API to extract the physical address of each subway stop.
The obvious question is: why do I need this sort of data? I am trying to create a visualization that requires it. For now, let us concentrate on the task at hand: how do we extract the data using the Google API service?
If you are new to the idea of web scraping, you may not have heard of an API. API is an acronym for Application Programming Interface. Many websites provide API services, and the idea is simply to give access to their data. It is surprising how many websites provide API services nowadays. Some of the well-known ones are the New York Times, Twitter, Facebook and Uber. These services help developers build apps based on the information the API provides. You can read more on this topic by simply typing “API” into Google.
For the current project we will need a Google ID, which you might already have. To extract data from the API service you will also need to create an API key. This key is used as an input when constructing the link.
Go to the Google Developer Console and click on Credentials under APIs & Auth. Now click on Add credential -> API Key -> Browser key -> give it a name -> Create. This will create an API key. Note that every API service needs a separate key and that the number of calls made to each API service is limited. The Google Developer Console will help you track the number of free calls made to the service. To understand what information you can extract from an API service, please refer to the respective API documentation provided by Google.
You are now all set to access the Google API service! Yay!
Link, XML, JSON and More:
To extract data using the API service we need to create a simple URL (a web link), in which some parts are standard API information and some are custom, based on the user's requirements. The following is a breakdown of the Geocode API link.
If we paste this into a browser we get nothing useful back; the request simply errors out, because the API key and the parameters are missing. The following is the link with all the parameters:
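As a rough sketch of that anatomy, the R snippet below assembles a geocode link from its parts. The base URL and the `address` and `key` parameter names follow Google's Geocoding API documentation; the address value and the `YOUR_API_KEY` placeholder are only illustrative, and the variable names are my own.

```r
# Standard part of every Geocode API link
base_url <- "https://maps.googleapis.com/maps/api/geocode/"

# Output format requested from the service: "xml" or "json"
output_format <- "xml"

# Custom part: the place to look up (illustrative value) and your own key
address <- "1600 Amphitheatre Parkway, Mountain View, CA"
api_key <- "YOUR_API_KEY"  # placeholder, replace with the key you created

# URLencode() escapes spaces and commas so the link stays valid
geocode_link <- paste0(
  base_url, output_format,
  "?address=", URLencode(address, reserved = TRUE),
  "&key=", api_key
)

print(geocode_link)
```

Everything before the question mark is the standard part; everything after it is what you customize for each request.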
If you have your API key, you can use the link mentioned above along with your key. Paste the entire link into your favorite browser and you should get XML output. Now repeat the procedure but change xml to json in the link, and you will see the same result in JSON format. For our task we will use the XML format to get the data and filter it. The XML output in your browser contains a lot of information, and our first task is to check whether the information we are looking for is present.
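To make that check concrete, here is a minimal sketch in base R. It works on a hand-written, heavily abbreviated sample of the XML rather than a live response, and it assumes the address we want sits in a `formatted_address` tag, as described in Google's Geocoding API documentation. The pattern matching is only a quick look; it is not a substitute for properly parsing the XML tree.

```r
# Abbreviated, hand-written sample of the XML output (not a live response)
sample_xml <- c(
  "<GeocodeResponse>",
  " <status>OK</status>",
  " <result>",
  "  <type>street_address</type>",
  "  <formatted_address>277 Bedford Ave, Brooklyn, NY 11211, USA</formatted_address>",
  " </result>",
  "</GeocodeResponse>"
)

# Which lines carry a formatted address?
is_address_line <- grepl("<formatted_address>", sample_xml, fixed = TRUE)
has_address <- any(is_address_line)

# Keep only the text between the opening and closing tags of the first match
address_line <- sample_xml[is_address_line][1]
address <- sub(".*<formatted_address>(.*)</formatted_address>.*", "\\1", address_line)

print(has_address)
print(address)
```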
We need reverse geocoding, since we will supply the latitude and longitude of each subway stop as input and get back the address of that location as XML output. The following link is an example of reverse geocoding with the latlng parameter.
Try changing 40.714224 and -73.961452 to some other latitude and longitude and you will see the XML output update. The output is a long list of street names, zip codes, latitude and longitude values, neighborhoods and so on. You may or may not need all of this information. We will learn about parsing the XML tree and filtering the data so that it can be used efficiently.
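Since the real task involves many stops, a sketch of how one reverse geocode link per stop could be built may help. The `latlng` and `key` parameter names come from Google's Geocoding API documentation; the two-row `stops` table, its column names and the second pair of coordinates are made up for illustration, and `YOUR_API_KEY` is a placeholder.

```r
# Made-up table standing in for the subway stop data
stops <- data.frame(
  stop_name = c("Example stop A", "Example stop B"),
  lat = c(40.714224, 40.755),
  lng = c(-73.961452, -73.987)
)

api_key <- "YOUR_API_KEY"  # placeholder, replace with your own key

# paste0() is vectorized, so this builds one link per row
stops$link <- paste0(
  "https://maps.googleapis.com/maps/api/geocode/xml",
  "?latlng=", stops$lat, ",", stops$lng,
  "&key=", api_key
)

print(stops$link)
```

Only the latitude and longitude change from one link to the next; the rest of the link stays the same for every stop.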
In Part 2 we will learn how to build the link using the paste() function in R and then use the readLines() function to extract the information.