This article provides an example of how to use owidR to create a country-level dataset consisting of multiple variables. It does this in the context of trying to answer the research question: does higher internet use lead to higher levels of democracy? This is based on research done by (citation). We will do a similar analysis using data gathered from Our World in Data. This analysis is only intended as an example of how to use owidR, not as a robust research paper.
The article assumes some basic knowledge of dplyr and ggplot2. If you aren’t familiar with either of these you should still be able to follow along, but I would recommend reading R for Data Science. An understanding of basic R is essential.
To begin we’ll load owidR using the library() function. We’ll also load dplyr and ggplot2, as well as plm and texreg, which we’ll be using to do the analysis.
library(owidR)
library(dplyr)
library(ggplot2)
library(plm)
library(texreg)
Searching for and importing data using owidR is very easy. First, we can search for data on a topic using owid_search(). We’ll start by searching for data to use as our explanatory variable: internet use. To do this I enter the keyword “internet” as the argument to owid_search():
owid_search("internet")
#> No internet connection available: returning blank tibble
Running this line of code returns around 10 datasets about the internet. Let’s use the dataset with the title: Share of the population using the Internet. The corresponding chart_id for this data is “share-of-individuals-using-the-internet”. Using the chart_id as an argument to owid() imports that data into R, and we assign it to an object called internet. We use the rename argument to give the value column a shorter, cleaner name.
internet <- owid("share-of-individuals-using-the-internet", rename = "internet_use")
#> No internet connection available: returning blank tibble
internet
#> # A tibble: 1 × 3
#>   entity year  value
#>   <lgl>  <lgl> <lgl>
#> 1 NA     NA    NA
We can find information about the source of the data by using owid_source(), with the owid dataset object as the argument. This gives us the original publisher of the data as well as a link to the data. For some datasets, additional information about how the variables are calculated is also provided. Using view_chart() takes you to the Our World in Data webpage for that dataset, where there is also additional information and a pretty graph.
owid_source(internet)
#> owid object had not connected to ourworldindata.org
#> NULL
To create a simple plot of how internet use has changed over time, use owid_plot(), filtering to give the World total. Given that this function is a wrapper around ggplot2, you can use normal ggplot2 functions to further manipulate the graph. I’m going to add a title using labs() and change the y axis scale so that it runs from 0 to 100. This makes the graph clearer to interpret given that the value is a percentage; otherwise small variations can appear large.
owid_plot(internet, filter = "World") +
  labs(title = "Share of the World Population using the Internet") +
  scale_y_continuous(limits = c(0, 100))
#> owid object had not connected to ourworldindata.org
We can see how internet use varies between countries by creating a choropleth map using owid_map(). It shows that, in 2017, there was still a large variation in the level of internet use between countries, with many African countries having particularly low use.
owid_map(internet, year = 2017) +
  labs(title = "Share of Population Using the Internet, 2017")
#> owid object had not connected to ourworldindata.org
It’s also possible to compare countries’ levels of internet use across time, again using owid_plot(). With the argument summarise = FALSE, owid_plot() will show individual countries instead of aggregating them into the total. You can then use the filter argument to select which countries you want to be displayed.
owid_plot(internet, summarise = FALSE, filter = c("United Kingdom", "Spain", "Russia", "Egypt", "Nigeria")) +
  labs(title = "Share of Population Using the Internet") +
  # The labels argument allows you to make it clear that the value is a percentage
  scale_y_continuous(limits = c(0, 100), labels = scales::label_number(suffix = "%"))
#> owid object had not connected to ourworldindata.org
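As a quick check of what the labels argument does, the short sketch below calls the labelling function directly on a few values instead of inside a plot. The accuracy = 1 argument and the name percent_label are my additions for illustration, so that whole numbers are printed; they are not part of the plot code above.

```r
library(scales)

# label_number() returns a function that turns numbers into formatted strings.
# accuracy = 1 (an assumption added here) rounds to whole numbers.
percent_label <- label_number(accuracy = 1, suffix = "%")

# Apply the labeller to some example axis breaks
percent_label(c(0, 50, 100))
```

Passing a function like this to scale_y_continuous() changes only how the axis breaks are printed, not the underlying data.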
Now let’s get data on democracy, first searching for a data source and then importing it using owid(). Using that data we’ll do some similar exploration to what we did with the internet use data.
democracy <- owid("electoral-democracy")
#> No internet connection available: returning blank tibble
democracy
#> # A tibble: 1 × 3
#>   entity year  value
#>   <lgl>  <lgl> <lgl>
#> 1 NA     NA    NA
owid_source(democracy)
#> owid object had not connected to ourworldindata.org
#> NULL
owid_map(democracy, palette = "YlGn") +
  labs(title = "Electoral Democracy")
#> owid object had not connected to ourworldindata.org
So we’ve done some nice exploratory analysis and produced some pretty graphs, but now let’s get into some more in-depth analysis. We’ll use a fixed effects (FE) regression to estimate the average effect that an increase in internet use has on democracy within a country. If you aren’t familiar with FE regression, this article explains its purpose well and this chapter from Introduction to Econometrics with R shows how to implement it in R. Specifically, we’re going to use a within-unit fixed effects model.
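To make the idea of a within-unit estimator more concrete, here is a minimal sketch in base R using simulated data. The toy data frame and all of its column names are made up for illustration and have nothing to do with the Our World in Data variables. It subtracts each entity's own mean from the outcome and the predictor, then shows that the resulting slope matches the one from a regression with a dummy variable for each entity.

```r
set.seed(1)

# Hypothetical panel: 3 entities, 5 observations each,
# each entity has its own fixed intercept (1, 5, -3) and a common slope of 2
toy <- data.frame(
  entity = rep(c("A", "B", "C"), each = 5),
  x = rnorm(15)
)
toy$y <- 2 * toy$x + rep(c(1, 5, -3), each = 5) + rnorm(15, sd = 0.1)

# Within transformation: subtract each entity's mean from x and y
toy$x_dm <- toy$x - ave(toy$x, toy$entity)
toy$y_dm <- toy$y - ave(toy$y, toy$entity)

# Slope from the demeaned data (no intercept needed after demeaning)
within_slope <- coef(lm(y_dm ~ 0 + x_dm, data = toy))[["x_dm"]]

# Slope from a regression with one dummy variable per entity
lsdv_slope <- coef(lm(y ~ x + factor(entity), data = toy))[["x"]]

all.equal(within_slope, lsdv_slope)
```

Because the entity means are removed, anything that is constant within an entity drops out of the estimate, which is why this kind of model only uses variation over time within each country.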
This model will require us to adjust for confounding factors, so we’ll need some extra data. I’m going to use data on variables that I think might be confounding the relationship between internet use and democracy. These are: GDP per Capita, Government Expenditure, Age Dependency and Unemployment. There are almost certainly more confounding factors, so feel free to use owid_search() to find data on other variables you think might be confounders and add them to the analysis.
gdp <- owid("gdp-per-capita-worldbank", rename = "gdp")
#> No internet connection available: returning blank tibble
gov_exp <- owid("total-gov-expenditure-gdp-wdi", rename = "gov_exp")
#> No internet connection available: returning blank tibble
age_dep <- owid("age-dependency-ratio-of-working-age-population", rename = "age_dep")
#> No internet connection available: returning blank tibble
unemployment <- owid("unemployment-rate", rename = "unemp")
#> No internet connection available: returning blank tibble
In order to create an FE model, all these separate dataframes now need to be combined into one. To do this I’m going to use the left_join() function from dplyr and create a new dataframe called data that combines all the other dataframes.
data <- internet %>%
  left_join(democracy) %>%
  left_join(gdp) %>%
  left_join(gov_exp) %>%
  left_join(age_dep) %>%
  left_join(unemployment)
#> Joining, by = c("entity", "year", "value")
#> Joining, by = c("entity", "year", "value")
#> Joining, by = c("entity", "year", "value")
#> Joining, by = c("entity", "year", "value")
#> Joining, by = c("entity", "year", "value")
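The join messages show that left_join() picks its key columns automatically from whatever column names the two dataframes share. A minimal sketch of the same join with the keys written out is given below, assuming entity and year are the intended keys. The two toy tibbles and the column name electdem are hypothetical stand-ins for imported datasets.

```r
library(dplyr)

# Hypothetical toy data standing in for two imported datasets
internet_toy <- tibble(
  entity = c("A", "A", "B"),
  year = c(2014, 2015, 2015),
  internet_use = c(40, 45, 10)
)
democracy_toy <- tibble(
  entity = c("A", "B"),
  year = c(2015, 2015),
  electdem = c(0.8, 0.3)
)

# Naming the keys explicitly avoids joining on an unintended shared column
combined_toy <- left_join(internet_toy, democracy_toy, by = c("entity", "year"))
combined_toy
```

Every row of the left-hand table is kept, so a country-year with no match in the right-hand table ends up with NA in the new column rather than being dropped.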
Now that we have a combined dataset we can get to the analysis. First, let’s use ggplot2 to create a graph to see the correlation between internet use and democracy in 2015.
data %>%
  filter(year == 2015) %>%
  ggplot(aes(internet_use, electdem_vdem_owid)) +
  geom_point(colour = "#57677D") +
  geom_smooth(method = "lm", colour = "#DC5E78") +
  labs(title = "Relationship Between Internet Use and Electoral Democracy", x = "Internet Use", y = "Electoral Democracy") +
  theme_owid()
#> Loading required namespace: showtext
#> Warning in theme_owid(): importing fonts requires an internet connection please
#> use import_fonts = FALSE
#> Error in FUN(X[[i]], ...): object 'internet_use' not found
There appears to be some relationship but this could easily be explained by countries with higher development also being more democratic and not actually the result of internet access. That’s why we control for GDP and the other confounders. Next, we’ll create two models, one with just internet use and democracy, and the other with the confounders added.
fe_model <- plm(electdem_vdem_owid ~ internet_use, data,
                effect = c("individual"), index = "entity")
#> Error in 1:Ti[i]: NA/NaN argument
fe_model_2 <- plm(electdem_vdem_owid ~ internet_use + gdp + gov_exp + age_dep + unemp, data,
                  effect = c("individual"), index = "entity")
#> Error in 1:Ti[i]: NA/NaN argument
htmlreg(list(fe_model, fe_model_2))
#> Error in "list" %in% class(l): object 'fe_model' not found
You can see that internet use has a significant positive effect in the first model, but once the confounders are added the effect is insignificant. This means that our model provides no evidence that internet use has an effect on democracy. However, feel free to play around with this data yourself and see if you get a different result when other variables are used.