I have met people who think the Nobel Prize is a big deal, and others who think it's just another prize. I think it's kind of a big deal. This post is about something different, though. I came across the following visualization created by Nature magazine and tweeted by Nessa Carey.
It was also the time of year when the Nobel Prizes were being announced. The Nature visualization makes a good point, and I wondered about the categories that are not part of STEM: what about Economics, Peace or Literature?
The main objective of this post is to generate a similar visual for Economics and, specifically, to learn how to generate a dot plot in R.
The data for the plot is downloaded from the Kaggle website. However, the data only covers 1901 to 2016. To include more recent years, we can fill in the missing data using the Nobel Prize website. A good way to use this post is to reproduce the plot in R and then play around with the function arguments to fit your own data.
Data transformation / Cleaning:
I noticed that the data file has some duplicates, so I used Excel to remove them. To remove duplicates in R instead, we can use the unique() function.
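For readers who would rather stay in R, here is a minimal sketch of the de-duplication step. The small data frame and its Name column are made up for illustration and are not taken from the Kaggle file; the sketch only covers rows that are exact copies of each other.

```r
# Toy data frame (hypothetical values) containing one exact duplicate row
laureates <- data.frame(
  Year = c(2009, 2009, 2009, 2010),
  Name = c("Laureate A", "Laureate B", "Laureate B", "Laureate C"),
  Sex  = c("Female", "Male", "Male", "Male"),
  stringsAsFactors = FALSE
)

# unique() keeps the first copy of every fully identical row
deduped <- unique(laureates)

nrow(laureates)  # number of rows before de-duplication
nrow(deduped)    # number of rows after de-duplication
```

It is probably safest to do this on the full file, before narrowing it down to a few columns: once the names are gone, two different male laureates who shared a prize in the same year would look like duplicates of each other.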
All code is available at the end of this page. If you are new to R, it helps to know the details of the functions and their arguments.
We will use two packages, dplyr and ggplot2. The dplyr package is used for data manipulation, and ggplot2 is used to generate plots in R. If we are using these packages for the first time, we need to install them with install.packages("dplyr") and install.packages("ggplot2").
Once they are installed, the packages are loaded with library(dplyr) and library(ggplot2).
Now we can import the data using the read.csv() function. The first argument is the name of the file that holds the data, and the second argument ensures that R does not convert strings into factors. The %>% operator (the pipe) is used to make code more readable. The best way to understand it is to replace it with the word “then”. So here we are saying “load the data using read.csv(), then select”.
read.csv("archive.csv", stringsAsFactors = FALSE) %>% select(Year, Category, Sex) -> nbdata
The select() function from the dplyr package is used to select only the columns we need for our dot plot. We then use the “->” assignment operator to create a new data frame called nbdata, which stores only our selected columns.
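To try the read-then-select step without downloading anything, the sketch below feeds read.csv() a few lines of inline text instead of a file. The rows and the Name column are hypothetical stand-ins; the real archive.csv has many more columns.

```r
library(dplyr)

# Hypothetical stand-in for archive.csv, passed as text instead of a file name
csv_text <- "Year,Category,Name,Sex
1969,Economics,Laureate A,Male
2009,Economics,Laureate B,Female
2009,Peace,Laureate C,Male"

# Read the data, then keep only the three columns needed for the plot
read.csv(text = csv_text, stringsAsFactors = FALSE) %>%
  select(Year, Category, Sex) -> nbdata_demo

names(nbdata_demo)
```

The Name column is dropped by select(), while all three rows are kept.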
An important point to understand here is what we are visualizing in the dot plot. We need to show the number of people who won the Nobel Prize each year in each category. To do that, we create a dummy variable column. The column is “1” if the winner is male and “2” if the winner is female. The following code creates the new column in nbdata.
nbdata %>% mutate(dummy = if_else(Sex == "Male", 1, 2)) -> nbdata
The mutate() function from the dplyr package creates a new column called dummy, using the if_else() function to classify each row by the Sex column in nbdata.
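A small sketch of the same mutate() and if_else() combination on made-up rows shows what ends up in the new column. The data frame here is purely illustrative.

```r
library(dplyr)

# Three hypothetical winners
demo <- data.frame(
  Year = c(1969, 2009, 2009),
  Sex  = c("Male", "Female", "Male"),
  stringsAsFactors = FALSE
)

# dummy is 1 where Sex is "Male" and 2 for every other value
demo %>% mutate(dummy = if_else(Sex == "Male", 1, 2)) -> demo

demo$dummy  # 1 2 1
```

Note that if_else() sends every value other than "Male" to 2, so a blank Sex entry (assumed here to be possible, for example for prizes awarded to organizations) would also be coded as 2.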
Finally, we can plot the data using the geom_dotplot() function from the ggplot2 package. The entire section of code might look intimidating, but it's simple if you read it closely. We will break the section into 3 parts.
- ggplot() :
This function lays the foundation of the plot. Basically, we are telling R which data to use and which variable to plot on the x axis. The first argument to ggplot() is the data. However, we would like to plot only the data for the Economics category. Hence, we use filter() from the dplyr package to filter nbdata for Category == "Economics".
ggplot(filter(nbdata, Category == "Economics"), aes(x = Year, fill = factor(Sex)))
The aes() function assigns the aesthetics for the plot. Here we assign the x axis for our dot plot: we want the Year variable plotted on the x axis. The fill argument is used to color the dots based on the Sex variable.
To learn more about these functions, simply type ?filter or ?ggplot in the R console.
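The dot plot can be generated by using the geom_dotplot() function from the ggplot2 package.
geom_dotplot(binwidth = 1, method = "histodot", stackgroups = TRUE)
In the above code we are instructing R to generate a dot plot by stacking dots. Setting binwidth = 1 gives each year its own bin, method = "histodot" uses fixed-width bins as in a histogram, and stackgroups = TRUE stacks the dots of both sexes in a single column for each year. A very detailed explanation of this is available here.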
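The first two parts can be tried out together on a handful of made-up rows before running them on the real data. The sketch below assumes nothing beyond dplyr and ggplot2; the toy data frame is hypothetical.

```r
library(dplyr)
library(ggplot2)

# Made-up rows: three Economics winners and two Peace winners
toy <- data.frame(
  Year     = c(2009, 2009, 2009, 2010, 2010),
  Category = c("Economics", "Economics", "Peace", "Economics", "Peace"),
  Sex      = c("Female", "Male", "Male", "Male", "Female"),
  stringsAsFactors = FALSE
)

# Part 1: keep only the Economics rows and map Year to x, Sex to fill
toy_econ <- filter(toy, Category == "Economics")
p <- ggplot(toy_econ, aes(x = Year, fill = factor(Sex)))

# Part 2: one dot per winner, stacked within each one-year bin
p <- p + geom_dotplot(binwidth = 1, method = "histodot", stackgroups = TRUE)

print(p)
```

With this toy data the plot should show two stacked dots at 2009 (one per sex) and a single dot at 2010.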
As a last step, we add a theme to our plot by cleaning up the background, removing the y axis labels and adding titles. Readers can learn more about themes and their elements here.
library(dplyr)
library(ggplot2)

# Read the data and keep only the columns needed for the plot
read.csv("archive.csv", stringsAsFactors = FALSE) %>%
  select(Year, Category, Sex) -> nbdata

# Dummy column: 1 for male winners, 2 otherwise
nbdata %>% mutate(dummy = if_else(Sex == "Male", 1, 2)) -> nbdata

# Dot plot of Economics winners, one dot per winner, colored by sex
eco <- ggplot(filter(nbdata, Category == "Economics"), aes(x = Year, fill = factor(Sex))) +
  geom_dotplot(binwidth = 1, method = "histodot", stackgroups = TRUE) +
  labs(y = "", x = "Years", title = "Gender Imbalance in Economics Winners", subtitle = "1901-2016") +
  theme_minimal() +
  theme(axis.title.y = element_blank(),
        axis.text.y = element_blank(),
        axis.ticks.y = element_blank()) +
  guides(fill = guide_legend(title = ""))

plot(eco)
As a result of the above code we get the following plot.
- As a next step we can save the plot as a PDF, JPEG or PNG. We can then import the plot into Adobe Illustrator or Inkscape (open source) and add elements like text. For example, we can add the name of the one female Nobel Prize winner, and we can then remove the legend, since the text annotation already identifies her.
- We can also show more years on the x axis. Instead of a 10-year gap we can show every 5 years or every year.
- Play around with the legend and move it inside the plot.
I have skipped this section since I wanted to keep this plot limited to what could be achieved in R and the ggplot2 package.
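For readers who want to experiment anyway, two of these tweaks (the tick spacing and saving to a file) can be sketched in ggplot2 itself. The data, the 5-year spacing and the output file name below are all assumptions made for illustration.

```r
library(ggplot2)

# Made-up winners spread over a few years
toy <- data.frame(
  Year = c(1970, 1975, 1975, 1980, 1985),
  Sex  = c("Male", "Male", "Female", "Male", "Male"),
  stringsAsFactors = FALSE
)

# One tick every 5 years instead of the default spacing
year_breaks <- seq(1970, 1985, by = 5)

p <- ggplot(toy, aes(x = Year, fill = factor(Sex))) +
  geom_dotplot(binwidth = 1, method = "histodot", stackgroups = TRUE) +
  scale_x_continuous(breaks = year_breaks)

# Save as PDF so the plot can be opened in Illustrator or Inkscape
out_file <- file.path(tempdir(), "economics_dotplot.pdf")  # hypothetical file name
ggsave(out_file, plot = p, width = 8, height = 3)
```

Changing by = 5 to by = 1 would label every year, which is likely to get crowded on the full 1901-2016 range.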
Where to go from here:
We can plot all the other categories, such as Literature or Peace prize winners. We can also generate a visual that plots all the categories and shows the stark difference between STEM fields and the other categories.
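One possible way to put all the categories in a single visual is to drop the filter() step and add facet_wrap(), which draws one panel per category. This is only a sketch on made-up data, not the plot from this post.

```r
library(ggplot2)

# Made-up winners across three categories
toy <- data.frame(
  Year     = c(2009, 2009, 2010, 2009, 2010, 2010),
  Category = c("Economics", "Economics", "Economics",
               "Literature", "Literature", "Peace"),
  Sex      = c("Female", "Male", "Male", "Female", "Male", "Male"),
  stringsAsFactors = FALSE
)

# Same dot plot as before, but with one panel per category stacked vertically
all_cat <- ggplot(toy, aes(x = Year, fill = factor(Sex))) +
  geom_dotplot(binwidth = 1, method = "histodot", stackgroups = TRUE) +
  facet_wrap(~ Category, ncol = 1) +
  theme_minimal() +
  theme(axis.title.y = element_blank(),
        axis.text.y = element_blank(),
        axis.ticks.y = element_blank())

print(all_cat)
```

Using ncol = 1 keeps the panels on a shared x axis, so the years line up and the categories can be compared at a glance.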
If you happen to use the code to replicate this or to do something new, do leave a comment and a link.