The Government of India has an initiative to make its data publicly available and easily accessible. The website, data.gov.in, hosts a large number of datasets. My point in writing this post is to encourage the government departments that are part of the open data initiative to adopt R to improve the quality of their data and reports. I am no expert myself, but I want to highlight what open source technologies such as R make possible, and how a small change would help these agencies publish high quality data and also help the people who use these datasets.
R is widely used open source software. Many financial institutions use R and Python to develop and test financial models. Companies such as Airbnb and Twitter use R to analyze data, and public institutions such as the Bank of England have started using R to generate reports.
Whenever I am directed from the open data website to a government agency's website, I find that the quality of the reports, data and visualizations is poor. I have always thought life could be much easier if the Govt. of India adopted open source technologies such as R or Python. I am an avid user of R, so I think it is useful to state a few of its obvious advantages:
- Easy installation
- Easy to learn
- Data analysis
- Reproducible research
- Can generate high quality visualizations
- Easy integration with webpages
- High quality reports can be generated
- Can be used to generate in house R packages
- Presentations can be prepared quickly
Easy installation:
Growing up in India you get used to the bureaucracy, so it is understandable that government employees, a.k.a. babus, are reluctant to download or install an open source technology such as R. In that regard, installing R is quick and, most importantly, free. An R installation is incomplete without RStudio, which in my view is the best IDE for it. If any government employee stumbles on this post: R is distributed through CRAN (the Comprehensive R Archive Network) and RStudio Desktop through its developer's website.
You should have both R and RStudio installed within about 3 minutes.
Easy to learn:
I am sure government employees in various departments are familiar with Word and Excel and may not want to learn new software. Learning R is easy, and if you spend just 30 minutes a day on it you will be a pro in no time. One of the best ways to learn R is to start with questions: how do I import data, how do I calculate a mean, how do I add a column, how do I filter on a column, how do I draw a line chart or a scatter plot, and so on.
Asking questions is just the beginning; the next step is to go to the internet and find the answers. Learning R is a bit frustrating at first, but remember that the internet is your friend: put these questions to Google and you will be amazed at how many people have had exactly the same question. R has a very active community, so the best way to learn is by trial and error.
I usually spend 15 minutes on R-bloggers, which collects blog articles on R. It is an easy way to keep track of what people around the world and the R community are working on. The code and data are generally open, so you can try to replicate their results and learn R along the way.
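To make the starter questions above concrete, here is a minimal sketch in base R that answers each of them in turn. The district names and figures are made up purely for illustration, and the file is written to a temporary location so the script runs on its own.

```r
# Hypothetical example data: district names and figures are invented.
districts <- data.frame(
  district   = c("A", "B", "C", "D"),
  population = c(120000, 340000, 90000, 210000),
  literate   = c(84000, 272000, 54000, 168000)
)

# How do I import data? Save the table as a CSV, then read it back in.
csv_path <- tempfile(fileext = ".csv")
write.csv(districts, csv_path, row.names = FALSE)
census <- read.csv(csv_path)

# How do I calculate a mean?
mean_population <- mean(census$population)

# How do I add a column?
census$literacy_rate <- 100 * census$literate / census$population

# How do I filter on a column? Keep districts with a rate of 75% or more.
high_literacy <- census[census$literacy_rate >= 75, ]

# How do I draw a scatter plot? Here it is written to a temporary PDF file.
plot_path <- tempfile(fileext = ".pdf")
pdf(plot_path)
plot(census$population, census$literacy_rate,
     xlab = "Population", ylab = "Literacy rate (%)",
     main = "Literacy rate by district (made-up data)")
dev.off()

print(mean_population)
print(high_literacy)
```

In RStudio you can drop the `pdf()` and `dev.off()` lines and the chart will appear in the Plots pane instead.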
Data analysis:
The primary reason the Govt. of India should rely on R is the sheer amount of data generated these days. Storing data in a spreadsheet does not resolve any issue; the data has to be carefully analyzed. R makes it easy to analyze large datasets. I have seen Excel freeze and crash countless times, but I have never had that happen with R. R also has many packages for data quality testing, exploratory analysis and statistical analysis.
One of the issues I face with the Govt. of India open data initiative is that the data is not well structured. A lot of it is in PDF format, which is of little use to anyone who wants to analyze it: they first have to extract the data into CSV or Excel. The hard work does not end there. The tables in these PDFs have merged column headers, which ruin the formatting when pasted into Excel. Learning R will also teach you how best to publish data so that it is easier for the public to use. The tidyverse collection of R packages, which includes dplyr and ggplot2, can make life easier for everyone.
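As an illustration of what "well structured" can mean in practice, the sketch below takes a table laid out the way a printed report might show it, with one column per census year, and turns it into a tidy CSV with one row per state and year. It uses only base R's `reshape()`; the state names and literacy figures are invented, and the column names are my own assumptions.

```r
# Hypothetical figures in "report" layout: one column per census year.
wide <- data.frame(
  state         = c("State A", "State B"),
  literacy_2001 = c(60.5, 70.2),
  literacy_2011 = c(66.1, 75.8)
)

# Reshape to tidy form: one row per state and year, one column per variable.
long <- reshape(
  wide,
  direction = "long",
  varying   = c("literacy_2001", "literacy_2011"),
  v.names   = "literacy_rate",
  timevar   = "year",
  times     = c(2001, 2011),
  idvar     = "state"
)
rownames(long) <- NULL

# Publish as plain CSV, which any tool (R, Python, Excel) can read directly.
out_path <- tempfile(fileext = ".csv")
write.csv(long, out_path, row.names = FALSE)

print(long)
```

A file in this shape has a single header row and no merged cells, so users can filter, join and plot it without any manual cleanup.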
Reproducible research:
Reproducing the results in a government report is a sanity check. Currently there is no way for readers to ascertain the validity of these results. With R, the analysis is written as code, so rerunning the code reproduces the results. If the analysis is performed in R, it is easy to share the code along with the report so that readers can see how the conclusions were reached. Anyone can install R and recreate the analysis, which would also increase trust in the Govt. agencies.
Generating high quality visuals:
The reason I learned R was to generate visuals. Scatter plots, line charts and histograms each take only a few lines of code, and they can be included in reports without loss of quality. One of the best packages for this is ggplot2, which will also force you to tidy your data.
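Here is a rough sketch of what "a few lines of code" looks like with ggplot2, assuming the package is installed (`install.packages("ggplot2")`). The enrolment numbers are invented for the example. Saving to PDF keeps the chart as vector graphics, so it stays sharp at any size in a report.

```r
library(ggplot2)

# Hypothetical yearly figures, already in tidy form (one row per state-year).
enrolment <- data.frame(
  year     = rep(2012:2016, times = 2),
  state    = rep(c("State A", "State B"), each = 5),
  students = c(410, 425, 440, 460, 475, 300, 320, 335, 360, 390)
)

# A line chart with one coloured line per state.
p <- ggplot(enrolment, aes(x = year, y = students, colour = state)) +
  geom_line() +
  geom_point() +
  labs(
    title  = "School enrolment by state (made-up data)",
    x      = "Year",
    y      = "Students (thousands)",
    colour = "State"
  ) +
  theme_minimal()

# Save as a vector PDF in a temporary folder.
chart_path <- file.path(tempdir(), "enrolment.pdf")
ggsave(chart_path, plot = p, width = 7, height = 4)
```

Notice that ggplot2 expects exactly the tidy layout described earlier: one column for the x axis, one for the y axis and one for the grouping.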
I have come across a number of reports published by government agencies which have really bad style. I think this topic in itself deserves a post of its own.
Automation is the word of 2017. By automation I mean using R to generate reports and perform routine daily tasks automatically. An automated script can check for errors, generate plots of new data received from the census bureau, or send emails when thresholds or benchmarks are breached. Automating reports and data quality checks saves a lot of time and gives government employees more confidence that the published data is free of avoidable human error.
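A minimal sketch of such an automated check is shown below. The function name `check_quality`, the column names and the 0 to 100 range are all assumptions of mine, not an existing package. The script only prints a message; in a real setup that line could be replaced by an email or a log entry, and the script could be run on a schedule with `Rscript`.

```r
# Hypothetical helper: returns a character vector describing any problems
# found in one numeric column of a data frame (empty if the data is clean).
check_quality <- function(data, value_col, lower, upper) {
  values   <- data[[value_col]]
  problems <- character(0)

  # Check 1: missing values.
  if (anyNA(values)) {
    problems <- c(problems,
                  sprintf("%d missing value(s)", sum(is.na(values))))
  }

  # Check 2: values outside the allowed range.
  out_of_range <- !is.na(values) & (values < lower | values > upper)
  if (any(out_of_range)) {
    problems <- c(problems,
                  sprintf("%d value(s) outside [%s, %s]",
                          sum(out_of_range), lower, upper))
  }

  problems
}

# Made-up batch of newly received data with two deliberate errors.
new_batch <- data.frame(
  district      = c("A", "B", "C", "D"),
  literacy_rate = c(71.2, NA, 104.5, 68.0)
)

problems <- check_quality(new_batch, "literacy_rate", lower = 0, upper = 100)

if (length(problems) > 0) {
  message("Data quality check failed: ", paste(problems, collapse = "; "))
} else {
  message("Data quality check passed.")
}
```

The same function can be called on every new file before it is published, so the checks are applied identically each time instead of depending on someone remembering to do them.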
Easy integration with webpages:
Since I started following the Govt. of India's developments in open data, I have realized that almost every department and state has a website. The technology and hardware are already in place; all that is needed is a way to put interactive graphics on these websites.
Why interactive graphics? Allowing visitors to play with the data and download only what they need can save a lot of effort. Interactive graphics also let visitors explore the data and simply download the plot.
High quality reports can be generated:
The reports published by the Govt. of India are in PDF format, and even if I want to read them without downloading them they still open as a PDF within the webpage. The rmarkdown package can help resolve this. R Markdown lets you create high quality reports, including HTML pages, in your own style, and integrate shiny and ggplot2 visuals within them.
R Markdown also lets government employees include the code and data files they used, making the results easier to reproduce.
Can be used to generate in house R packages:
Companies like Twitter and Airbnb have been building R packages for in-house use to perform various tasks. Given its large datasets, the Govt. of India could develop R packages to be shared within or across departments. Once a package reaches a mature stage it could also be made public via GitHub or CRAN. Packages can also be developed for sharing data or connecting to data servers for easier data extraction.
Presentations can be prepared quickly:
R can also generate presentations from R Markdown files. Slides built with the slidify package can integrate R code, visuals, animations etc. much more seamlessly.
Using R on a daily basis and seeing the Indian government take the right steps in making data open, I felt I should contribute to the initiative. I think adopting R and Python would make this initiative much more meaningful.