Link to the slides here

We can do it in R. Code by Keith McNulty.

```
# libraries needed
library(tidyverse)
library(ggplot2)
library(viridis)
library(gganimate)
library(wbstats)
# pull the country data down from the World Bank - three indicators
wbstats::wb(indicator = c("SP.DYN.LE00.IN", "NY.GDP.PCAP.CD", "SP.POP.TOTL"),
            country = "countries_only", startdate = 1960, enddate = 2018) %>%
  # pull down mapping of countries to regions and join
  dplyr::left_join(wbstats::wbcountries() %>%
                     dplyr::select(iso3c, region)) %>%
  # spread the three indicators
  tidyr::pivot_wider(id_cols = c("date", "country", "region"),
                     names_from = indicator, values_from = value) %>%
  # plot the data
  ggplot2::ggplot(aes(x = log(`GDP per capita (current US$)`),
                      y = `Life expectancy at birth, total (years)`,
                      size = `Population, total`)) +
  ggplot2::geom_point(alpha = 0.5, aes(color = region)) +
  ggplot2::scale_size(range = c(.1, 16), guide = FALSE) +
  ggplot2::scale_x_continuous(limits = c(2.5, 12.5)) +
  ggplot2::scale_y_continuous(limits = c(30, 90)) +
  viridis::scale_color_viridis(discrete = TRUE, name = "Region", option = "viridis") +
  ggplot2::labs(x = "Log GDP per capita",
                y = "Life expectancy at birth") +
  ggplot2::theme_classic() +
  ggplot2::geom_text(aes(x = 7.5, y = 60, label = date), size = 14,
                     color = 'lightgrey', family = 'Oswald') +
  # animate it over years
  gganimate::transition_states(date, transition_length = 1, state_length = 1) +
  gganimate::ease_aes('cubic-in-out')
```
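
The pipeline above returns a `gganim` object. To render and save the animation, something like the following sketch works, assuming the whole pipeline was first assigned to an object `p` (the file name is illustrative):

```
# Render the animation and write it out as a GIF
# (assumes the pipeline above was assigned to `p`)
anim <- gganimate::animate(p, nframes = 200, fps = 10, width = 700, height = 500)
gganimate::anim_save("gapminder.gif", animation = anim)
```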

In 1986, a group of urologists in London published a research paper in **The British Medical Journal** that compared the effectiveness of two different methods to remove kidney stones. Treatment A was open surgery (invasive) and treatment B was percutaneous nephrolithotomy (less invasive). When they looked at the results from 700 patients, treatment B had a higher success rate. However, when they stratified the patients by kidney stone size, treatment A had a better success rate in each subgroup. What is going on here? This well-known statistical phenomenon is called Simpson's paradox, which occurs when trends that appear in subgroups disappear or reverse when the subgroups are combined. This tutorial will explore Simpson's paradox using multiple regression and other statistical tools.
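
Before touching the data, here is a minimal numeric sketch of the paradox (the numbers below are made up purely for illustration):

```
# Toy illustration of Simpson's paradox (made-up numbers)
# Within each subgroup, A beats B:
#   easy cases: A 90/100 (0.90) vs. B 720/900 (0.80)
#   hard cases: A 270/900 (0.30) vs. B 20/100 (0.20)
a_success <- c(90, 270); a_total <- c(100, 900)  # A mostly gets hard cases
b_success <- c(720, 20); b_total <- c(900, 100)  # B mostly gets easy cases
sum(a_success) / sum(a_total)  # 0.36 overall for A
sum(b_success) / sum(b_total)  # 0.74 overall for B -- the trend reverses
```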

```
# packages
library(data.table) #for Data Manipulation and for fast reading and writing data
library(dplyr) #for Data Manipulation
library(pander) #for nicer output
library(ggplot2) #for visualizations
library(broom) #to tidy up
# Reading the dataset kidney_stone_data.csv into data
data <- fread("Data/kidney_stone_data.csv")
# Taking a look at the first few rows of the dataset
pander(head(data))
```

treatment | stone_size | success |
---|---|---|
B | large | 1 |
A | large | 1 |
A | large | 0 |
A | large | 1 |
A | large | 1 |
B | large | 1 |

The data contains three columns: `treatment` (A or B), `stone_size` (large or small) and `success` (0 = failure, 1 = success). To start, we want to know which treatment had a higher success rate regardless of stone size. Let's create a table with the number of successes and frequency of success by each treatment using tidyverse syntax.

```
# Calculating the number and frequency of success and failure of each treatment
data %>%
  group_by(treatment, success) %>%
  summarise(N = n()) %>%
  mutate(Freq = round(N / sum(N), 3))
```

```
## # A tibble: 4 x 4
## # Groups: treatment [2]
## treatment success N Freq
## <chr> <int> <int> <dbl>
## 1 A 0 77 0.22
## 2 A 1 273 0.78
## 3 B 0 61 0.174
## 4 B 1 289 0.826
```

From this descriptive table of treatment and success rate, we see that treatment B performed better on average than treatment A (82% vs. 78% success rate). Now, let's consider stone size and see what happens. We are going to stratify the data into small vs. large stone subcategories and compute the same success count and rate by treatment, as we did in the previous task.

The final table will be treatment X stone size X success.

```
# Calculating number and frequency of success and failure by stone size for each treatment
sum_data <- data %>%
  group_by(treatment, stone_size, success) %>%
  summarise(N = n()) %>%
  mutate(Freq = round(N / sum(N), 3))
# Printing out the data frame we just created
pander(sum_data)
```

treatment | stone_size | success | N | Freq |
---|---|---|---|---|
A | large | 0 | 71 | 0.27 |
A | large | 1 | 192 | 0.73 |
A | small | 0 | 6 | 0.069 |
A | small | 1 | 81 | 0.931 |
B | large | 0 | 25 | 0.312 |
B | large | 1 | 55 | 0.688 |
B | small | 0 | 36 | 0.133 |
B | small | 1 | 234 | 0.867 |

What is going on here? When stratified by stone size, treatment A had better results for both large and small stones compared to treatment B (i.e., 73% and 93% vs. 69% and 87%). Sometimes a plot is a more efficient way to communicate hidden numerical information in the data. In this task, we are going to apply a plotting technique to reveal the hidden information.

```
# Creating a bar plot to show stone size count within each treatment
sum_data %>%
  ggplot(aes(x = treatment, y = N)) +
  geom_bar(aes(fill = stone_size), stat = "identity")
```

From the bar plot, we noticed an unbalanced distribution of kidney stone sizes in the two treatment options. Large kidney stone cases tended to be in treatment A, while small kidney stone cases tended to be in treatment B. Can we confirm this hypothesis with statistical testing?

Let's analyze the association between stone size (i.e., case severity) and treatment assignment using a statistical test called the **Chi-squared** test, which is appropriate for testing the association between two categorical variables. This test result, together with the common knowledge that a more severe case is more likely to fail regardless of treatment, will shed light on the root cause of the paradox.
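
As a side note, the test compares the observed cell counts of the treatment-by-stone-size table with the counts expected under independence. A minimal sketch of the statistic (with Yates' continuity correction, matching the output below):

```
# Rough sketch of what chisq.test computes for a 2x2 table
tab <- table(data$treatment, data$stone_size)
expected <- outer(rowSums(tab), colSums(tab)) / sum(tab)
sum((abs(tab - expected) - 0.5)^2 / expected)  # ~189, as in the tidy output below
```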

```
# Run a Chi-squared test
trt_ss <- chisq.test(data$treatment, data$stone_size)
# Print out the result in tidy format
tidy(trt_ss)
```

```
## # A tibble: 1 x 4
## statistic p.value parameter method
## <dbl> <dbl> <int> <chr>
## 1 189. 4.40e-43 1 Pearson's Chi-squared test with Yates' cont~
```

Now, we are confident that stone size/case severity is indeed the lurking variable (aka confounding variable) in this study of kidney stone treatment and success rate. The good news is that there are ways to remove the effect of the lurking variable.

Let’s practice using multiple logistic regression to remove the unwanted effect of stone size, and then tidy the output with a function from the `broom` package.

```
# Running a multiple logistic regression
m <- glm(data = data, success ~ stone_size + treatment, family = "binomial")
# Print out model coefficient table in tidy format
pander(tidy(m))
```

term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | 1.033 | 0.1345 | 7.684 | 1.546e-14 |
stone_sizesmall | 1.261 | 0.239 | 5.274 | 1.333e-07 |
treatmentB | -0.3572 | 0.2291 | -1.559 | 0.1189 |

We successfully fit a multiple logistic regression and pulled out the model coefficient estimates! Typically (and arbitrarily), P-values below 0.05 indicate statistical significance. Another way to examine whether a significant relationship exists or not is to look at the 95% confidence interval (CI) of the estimate. In our example, we are testing to see:

- if the effect of a small stone is the same as a big stone
- if treatment A is as effective as treatment B.

If the 95% CI for a coefficient estimate covers zero, we cannot conclude that one is different from the other. Otherwise, there is a significant effect.
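
As a quick arithmetic check using the coefficient table above, the 95% CI for `treatmentB` is:

```
# 95% CI for the treatmentB coefficient (estimate and std.error from the table above)
-0.3572 + c(-1.96, 1.96) * 0.2291  # approximately (-0.806, 0.092): covers zero
```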

```
# Save the tidy model output into an object
tidy_m <- tidy(m)
# Plot the coefficient estimates with 95% CI for each term in the model
tidy_m %>%
  ggplot(aes(x = term, y = estimate)) +
  geom_pointrange(aes(ymin = estimate - 1.96 * std.error,
                      ymax = estimate + 1.96 * std.error)) +
  geom_hline(yintercept = 0)
```

Based on the coefficient estimate plot and the model output table, there is enough information to generate insights about the study. Is treatment A superior to B after taking into account the effect of stone size/severity level?

Recall that a coefficient represents the effect size of the specific model term. A positive coefficient means that the term is positively related to the outcome. For categorical predictors, the coefficient is the effect on the outcome relative to the reference category. In this study, large stones and treatment A are the reference categories.
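
Since this is a logistic regression, exponentiating a coefficient converts it to an odds ratio. A quick sketch using the estimates from the table above:

```
# Odds ratios from the model coefficients (estimates from the table above)
exp(1.261)    # ~3.53: small stones have ~3.5 times the odds of success vs. large
exp(-0.3572)  # ~0.70: treatment B has ~30% lower odds than A, but not significant
```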

Is a small stone more likely to result in success after controlling for the treatment effect?

**Yes**

Is treatment A significantly better than B?

**No**

In 1973, the University of California, Berkeley was one of the top-ranked universities in the United States. As such, Berkeley attracted thousands of applicants to its graduate school. But how many made the cut?

We will start off by loading the `UCBAdmissions` dataset, which is included in base R. It shows the number of male and female students who were admitted to or rejected from the six largest departments at Berkeley. Since the dataset takes the unwieldy form of a three-dimensional array, we will convert it to tidy format using the `tidy` function from the `broom` package. Then we will be ready to start doing some analysis.

```
# Loading UCBAdmissions dataset
data(UCBAdmissions)
```

Printing the dataset to the console:

Admit | Gender | A | B | C | D | E | F |
---|---|---|---|---|---|---|---|
Admitted | Male | 512 | 353 | 120 | 138 | 53 | 22 |
Admitted | Female | 89 | 17 | 202 | 131 | 94 | 24 |
Rejected | Male | 313 | 207 | 205 | 279 | 138 | 351 |
Rejected | Female | 19 | 8 | 391 | 244 | 299 | 317 |

```
# Loading broom package
library(broom)
# Converting UCBAdmissions to tidy format
ucb_tidy <- tidy(UCBAdmissions)
```

Printing the dataset to the console:

Admit | Gender | Dept | n |
---|---|---|---|
Admitted | Male | A | 512 |
Rejected | Male | A | 313 |
Admitted | Female | A | 89 |
Rejected | Female | A | 19 |
Admitted | Male | B | 353 |
Rejected | Male | B | 207 |
Admitted | Female | B | 17 |
Rejected | Female | B | 8 |
Admitted | Male | C | 120 |
Rejected | Male | C | 205 |
Admitted | Female | C | 202 |
Rejected | Female | C | 391 |
Admitted | Male | D | 138 |
Rejected | Male | D | 279 |
Admitted | Female | D | 131 |
Rejected | Female | D | 244 |
Admitted | Male | E | 53 |
Rejected | Male | E | 138 |
Admitted | Female | E | 94 |
Rejected | Female | E | 299 |
Admitted | Male | F | 22 |
Rejected | Male | F | 351 |
Admitted | Female | F | 24 |
Rejected | Female | F | 317 |

The data is more readable now that it is in tidy format, but since it is split by department and displays raw counts, it is difficult for us to infer any kind of gender bias. To do that, we need to aggregate over department and ask ourselves, overall, what proportion of men and women were accepted into Berkeley in 1973.

Here we make use of the `dplyr` package for all of our data manipulation tasks. We aggregate over department using the `group_by` function to get the total number of men and women who were accepted into or rejected from Berkeley that year, as well as the proportion accepted in each case. That will leave us in a better place to understand any accusations of gender bias.

```
# Loading the dplyr library
library(dplyr)
# Aggregate over department
ucb_tidy_aggregated <- ucb_tidy %>%
  group_by(Admit, Gender) %>%
  summarize(n = sum(n)) %>%
  ungroup() %>%
  group_by(Gender) %>%
  mutate(prop = n / sum(n)) %>%
  filter(Admit == "Admitted")
# Print aggregated dataset
ucb_tidy_aggregated
```

```
## # A tibble: 2 x 4
## # Groups: Gender [2]
## Admit Gender n prop
## <chr> <chr> <dbl> <dbl>
## 1 Admitted Female 557 0.304
## 2 Admitted Male 1198 0.445
```

From the previous task, we can see that *44.5% of male applicants* were accepted into Berkeley, as opposed to *30.4% of female applicants.* Now we can start to see the problem. Did Berkeley’s graduate admissions department really discriminate against women in 1973?

Before we consider alternative explanations, let's visualize the discrepancy through a simple bar chart using `ggplot2`. This won't add much to our understanding of the problem right now, but it will act as a useful reference point later on.

For clarity, we will format the acceptance rate as a percentage using the `percent()` function from the `scales` package.

```
# Load the ggplot2 and scales packages
library(ggplot2)
library(scales)
# Prepare the bar plot
gg_bar <- ucb_tidy_aggregated %>%
  ggplot(aes(x = Gender, y = prop, fill = Gender)) +
  geom_col() +
  geom_text(aes(label = percent(prop)), vjust = -1) +
  labs(title = "Acceptance rate of male and female applicants",
       subtitle = "University of California, Berkeley (1973)",
       y = "Acceptance rate") +
  scale_y_continuous(labels = percent, limits = c(0, 0.5)) +
  guides(fill = FALSE)
# Print the bar plot
print(gg_bar)
```

The bar plot confirms what we already knew: a higher proportion of men were accepted than women. But what happens when we separate the graph out by department?

Now we can return to our `ucb_tidy` dataset. After calculating the proportion of acceptances/rejections, we will plot a separate chart for each department using the `facet_wrap()` function in `ggplot2`. This will give us an idea of how acceptance rates differ by department, as well as by gender.

```
# Calculate acceptance/rejection rate
ucb_by_dept <- ucb_tidy %>%
  group_by(Gender, Dept) %>%
  mutate(prop = n / sum(n)) %>%
  filter(Admit == "Admitted")
# Print the dataset
print(ucb_by_dept)
```

```
## # A tibble: 12 x 5
## # Groups: Gender, Dept [12]
## Admit Gender Dept n prop
## <chr> <chr> <chr> <dbl> <dbl>
## 1 Admitted Male A 512 0.621
## 2 Admitted Female A 89 0.824
## 3 Admitted Male B 353 0.630
## 4 Admitted Female B 17 0.68
## 5 Admitted Male C 120 0.369
## 6 Admitted Female C 202 0.341
## 7 Admitted Male D 138 0.331
## 8 Admitted Female D 131 0.349
## 9 Admitted Male E 53 0.277
## 10 Admitted Female E 94 0.239
## 11 Admitted Male F 22 0.0590
## 12 Admitted Female F 24 0.0704
```

```
# Prepare the bar plot for each department
gg_bar_faceted <- ucb_by_dept %>%
  ggplot(aes(Gender, prop, fill = Gender)) +
  geom_col() +
  geom_text(aes(label = percent(prop)), vjust = -1) +
  labs(title = "Acceptance rate of male and female applicants",
       subtitle = "University of California, Berkeley (1973)",
       y = "Acceptance rate") +
  scale_y_continuous(labels = scales::percent, limits = c(0, 1)) +
  facet_wrap(~Dept) +
  guides(fill = FALSE)
# Print the bar plot for each department
print(gg_bar_faceted)
```

## Alternative explanations

Now that we have separated out our analysis by department, the interpretation has changed rather dramatically. Although men were indeed more likely to be admitted into Departments C and E, women were more likely to be admitted into all other departments. So what's really going on here?

If you turn your attention to the first two plots, you can see that Departments A and B were quite easy to get into. However, relatively few women applied to these departments – only 108 women applied to Department A, as opposed to 825 men!

At this stage, we can hypothesise that the effect of gender on acceptance is null when you control for department. We can test that hypothesis using **binary logistic regression,** but first we need to de-aggregate the dataset so that each row represents one student. That should leave us with 4,526 rows – one row for each student who applied to Berkeley that year.

```
# Define a function that repeats each value in a column n times
multiply_rows <- function(column, n) {
  rep(column, n)
}
# Create the new de-aggregated data frame using the multiply_rows function
ucb_full <- data.frame(Admit = multiply_rows(ucb_tidy$Admit, ucb_tidy$n),
                       Gender = multiply_rows(ucb_tidy$Gender, ucb_tidy$n),
                       Dept = multiply_rows(ucb_tidy$Dept, ucb_tidy$n))
# Check that the number of rows equals the number of students
nrow(ucb_full)
```

`## [1] 4526`
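
As an aside, `tidyr::uncount()` does the same de-aggregation in one step (a sketch, assuming tidyr >= 0.8.0 is installed):

```
# Equivalent de-aggregation: one row per student, weighted by n
library(tidyr)
ucb_full_alt <- uncount(ucb_tidy, n)
nrow(ucb_full_alt)  # 4526
```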

The data is now in the right format for us to do some hypothesis testing. Great! But first let's try to predict admittance using gender alone. We will use the built-in `glm()` function to fit a generalised linear model, making sure to set `family = "binomial"` because the outcome variable is binary (`Admitted` or `Rejected`).

By default, `Admit` is coded such that `Admitted` is level 1 and `Rejected` is level 2 (because of their alphabetical order). Since `glm()` will assume that level 2 represents 'success', we will reverse the coding of `Admit` so we are predicting the probability of admittance rather than rejection.

To change the coding of a variable, you can use the `fct_relevel()` function from the `forcats` package.

```
# Load the forcats library
library(forcats)
# Reverse the coding of the Admit variable
ucb_full$Admit <- fct_relevel(ucb_full$Admit, "Rejected", "Admitted")
# Run the regression
glm_gender <- glm(Admit ~ Gender, data = ucb_full, family = "binomial")
# Summarize the results
summary(glm_gender)
```

```
##
## Call:
## glm(formula = Admit ~ Gender, family = "binomial", data = ucb_full)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.0855 -1.0855 -0.8506 1.2722 1.5442
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.83049 0.05077 -16.357 <2e-16 ***
## GenderMale 0.61035 0.06389 9.553 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 6044.3 on 4525 degrees of freedom
## Residual deviance: 5950.9 on 4524 degrees of freedom
## AIC: 5954.9
##
## Number of Fisher Scoring iterations: 4
```
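
A quick way to read this output on the odds scale is to exponentiate the fitted coefficient (a sketch using the model above):

```
# Odds ratio for being male vs. female (the reference) from the gender-only model
exp(coef(glm_gender))["GenderMale"]  # ~1.84
```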

Sure enough, when we predict the probability of admission as a function of gender alone, the effect is statistically significant (p < 0.001). Specifically, the odds of being admitted are `exp(0.61035) = 1.84` times higher for men. However, what happens if we control for department?

```
# Run the regression, including Dept as an explanatory variable
glm_genderdept <- glm(Admit ~ Gender + Dept, data = ucb_full, family = "binomial")
# Summarize the results
summary(glm_genderdept)
```

```
##
## Call:
## glm(formula = Admit ~ Gender + Dept, family = "binomial", data = ucb_full)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.4773 -0.9306 -0.3741 0.9588 2.3613
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.68192 0.09911 6.880 5.97e-12 ***
## GenderMale -0.09987 0.08085 -1.235 0.217
## DeptB -0.04340 0.10984 -0.395 0.693
## DeptC -1.26260 0.10663 -11.841 < 2e-16 ***
## DeptD -1.29461 0.10582 -12.234 < 2e-16 ***
## DeptE -1.73931 0.12611 -13.792 < 2e-16 ***
## DeptF -3.30648 0.16998 -19.452 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 6044.3 on 4525 degrees of freedom
## Residual deviance: 5187.5 on 4519 degrees of freedom
## AIC: 5201.5
##
## Number of Fisher Scoring iterations: 5
```

Finally, we can see Simpson’s paradox at play – when you control for the effect of department on the probability of admission, the effect of gender disappears. In fact, it even reverses, suggesting that – controlling for department – you were actually more likely to be admitted as a woman! However, this effect is not statistically significant (p > 0.05), so we conclude that there was not a campus-wide bias against applicants of either gender in 1973.
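
On the odds scale, using the `GenderMale` estimate from the output above:

```
# Controlling for department, the gender odds ratio is close to 1
exp(-0.09987)  # ~0.905: males' odds of admission ~10% lower, p = 0.217 (n.s.)
```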

That said, individual departments often handle their own admissions processes, so it is plausible that bias exists in one department but not another. Let’s take a look at Department A, where 82.4% of women were admitted but only 62.1% of men. Is the difference statistically significant?

```
# Filter for Department A
dept_a <- ucb_full %>% filter(Dept == "A")
# Run the regression
glm_gender_depta <- glm(Admit ~ Gender, data = dept_a, family = "binomial")
# Summarize the results
summary(glm_gender_depta)
```

```
##
## Call:
## glm(formula = Admit ~ Gender, family = "binomial", data = dept_a)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.8642 -1.3922 0.9768 0.9768 0.9768
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.5442 0.2527 6.110 9.94e-10 ***
## GenderMale -1.0521 0.2627 -4.005 6.21e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1214.7 on 932 degrees of freedom
## Residual deviance: 1195.7 on 931 degrees of freedom
## AIC: 1199.7
##
## Number of Fisher Scoring iterations: 4
```

Well then! If we take Department A in isolation, we find there is a statistically significant bias in favour of women. So does that mean that the department discriminated against men?

Not necessarily. After all, the bias might exist simply because the female applicants to Department A were better qualified that year. In their article dealing with this issue, Bickel, Hammel & O’Connell (1975) define discrimination as “the exercise of decision influenced by the sex of the applicant when that is immaterial to the qualifications for entry”. Since we do not have any data on the respective qualifications of the candidates, we cannot say whether any gender bias in their admissions process amounted to discrimination.

Although now more than 40 years old, the Berkeley problem is a useful reminder about the dangers of aggregation and omitted variable bias, especially in relation to matters of such legal and ethical importance as discrimination. Where bias does exist – as it does in the case of Department A – it is always worth considering whether there are any other factors that could explain the discrepancy.

```
# Define bias
bias <- "A pattern of association between a particular decision and a particular sex of applicant, of sufficient strength to make us confident that it is unlikely to be the result of chance alone"
# Define discrimination
discrimination <- "The exercise of decision influenced by the sex of the applicant when that is immaterial to the qualifications for entry"
# Is bias equal to discrimination?
bias == discrimination
```

`## [1] FALSE`

This project was completed as part of DataCamp Projects.

Life expectancy at birth is a measure of the average number of years a newborn is expected to live. It takes into account several demographic factors like gender, country, and year of birth.

Life expectancy at birth can vary over time and between countries for many reasons: the evolution of medicine, the degree of development of countries, or the effect of armed conflicts. Life expectancy varies between genders as well. The data shows that women live longer than men. Why? Several potential factors have been suggested, including biological differences and the theory that women tend to be more health conscious.

Let’s create some plots to explore the inequalities in life expectancy at birth around the world. We will use a dataset from the United Nations Statistics Division, which is available here.

```
# This sets plot images to a nice size
options(repr.plot.width = 6, repr.plot.height = 6)
# Loading packages
library(dplyr)
library(tidyr)
library(ggplot2)
# Loading data
life_expectancy <- read.csv("Data/UNdata.csv")
# Taking a look at the first few rows
head(life_expectancy)
```

```
## Country.or.Area Subgroup Year
## 1 Afghanistan Female 2000-2005
## 2 Afghanistan Female 1995-2000
## 3 Afghanistan Female 1990-1995
## 4 Afghanistan Female 1985-1990
## 5 Afghanistan Male 2000-2005
## 6 Afghanistan Male 1995-2000
## Source Unit
## 1 UNPD_World Population Prospects_2006 (International estimate) Years
## 2 UNPD_World Population Prospects_2006 (International estimate) Years
## 3 UNPD_World Population Prospects_2006 (International estimate) Years
## 4 UNPD_World Population Prospects_2006 (International estimate) Years
## 5 UNPD_World Population Prospects_2006 (International estimate) Years
## 6 UNPD_World Population Prospects_2006 (International estimate) Years
## Value Value.Footnotes
## 1 42 NA
## 2 42 NA
## 3 42 NA
## 4 41 NA
## 5 42 NA
## 6 42 NA
```

Life expectancy of men vs. women by country

Let's manipulate the data to make our exploration easier. We will build the dataset for our first plot, in which we will represent the average life expectancy of men and women across countries for the last period recorded in our data (2000-2005).

```
# Subsetting and reshaping the life expectancy data
subdata <- life_expectancy %>%
  filter(Year == "2000-2005") %>%
  select(c(Country.or.Area, Subgroup, Value)) %>%
  spread(Subgroup, Value)
# Taking a look at the first few rows
head(subdata)
```

```
## Country.or.Area Female Male
## 1 Afghanistan 42 42
## 2 Albania 79 73
## 3 Algeria 72 70
## 4 Angola 43 39
## 5 Argentina 78 71
## 6 Armenia 75 68
```

Visualize I

A scatter plot is a useful way to visualize the relationship between two variables. It is a simple plot in which points are arranged on two axes, each of which represents one of those variables.

Let's create a scatter plot using `ggplot2` to represent life expectancy of males (on the x-axis) against females (on the y-axis). We will create a straightforward plot in this task, without many details. We will take care of these kinds of things shortly.

```
# Plotting male and female life expectancy
ggplot(subdata, aes(x = Male, y = Female)) + geom_point()
```

Reference lines I

A good plot must be easy to understand. There are many tools in `ggplot2` to achieve this goal and we will explore some of them now. Starting from the previous plot, let's set the same limits for both axes as well as place a diagonal line for reference. After doing this, the difference between men and women across countries will be easier to interpret.

After completing this task, we will see how most of the points are arranged above the diagonal and how there is a significant dispersion among them. What does this all mean?

```
# Adding an abline and changing the scale of axes of the previous plot
ggplot(subdata, aes(x = Male, y = Female)) +
  geom_point() +
  scale_x_continuous(limits = c(35, 85)) +
  scale_y_continuous(limits = c(35, 85)) +
  geom_abline(intercept = 0, slope = 1, linetype = 2)
```

Plot titles and axis labels

A key point in making a plot understandable is placing clear labels on it. Let's add titles, axis labels, and a caption to refer to the source of data. Let's also change the appearance to make it clearer.

```
# Adding labels to the previous plot
ggplot(subdata, aes(x = Male, y = Female)) +
  geom_point(colour = "white", fill = "chartreuse3", shape = 21, alpha = .55, size = 5) +
  geom_abline(intercept = 0, slope = 1, linetype = 2) +
  scale_x_continuous(limits = c(35, 85)) +
  scale_y_continuous(limits = c(35, 85)) +
  labs(title = "Life Expectancy at Birth by Country",
       subtitle = "Years. Period: 2000-2005. Average.",
       caption = "Source: United Nations Statistics Division",
       x = "Males",
       y = "Females")
```

Highlighting remarkable countries I

Now, we will label some points of our plot with the names of their corresponding countries. We want to draw attention to some special countries where the gap in life expectancy between men and women is significantly high. These will be the final touches on this first plot.

```
# Subsetting data to obtain countries of interest
top_male <- subdata %>% arrange(Male - Female) %>% head(3)
top_female <- subdata %>% arrange(Female - Male) %>% head(3)
# Adding text to the previous plot to label countries of interest
ggplot(subdata, aes(x = Male, y = Female, label = Country.or.Area)) +
  geom_point(colour = "white", fill = "chartreuse3", shape = 21, alpha = .55, size = 5) +
  geom_abline(intercept = 0, slope = 1, linetype = 2) +
  scale_x_continuous(limits = c(35, 85)) +
  scale_y_continuous(limits = c(35, 85)) +
  labs(title = "Life Expectancy at Birth by Country",
       subtitle = "Years. Period: 2000-2005. Average.",
       caption = "Source: United Nations Statistics Division",
       x = "Males",
       y = "Females") +
  geom_text(data = top_male, size = 3) +
  geom_text(data = top_female, size = 3) +
  theme_bw()
```

How has life expectancy by gender evolved?

Since our data contains historical information, let's see now how life expectancy has evolved in recent years. Our second plot will represent the difference between men and women across countries between two periods: 2000-2005 and 1985-1990.

Let's start building a dataset called `subdata2` for our second plot.

```
# Subsetting, mutating and reshaping the life expectancy data
subdata2 <- life_expectancy %>%
  filter(Year %in% c("1985-1990", "2000-2005")) %>%
  mutate(Sub_Year = paste(Subgroup, Year, sep = "_")) %>%
  mutate(Sub_Year = gsub("-", "_", Sub_Year)) %>%
  select(-Subgroup, -Year) %>%
  spread(Sub_Year, Value) %>%
  mutate(diff_Female = Female_2000_2005 - Female_1985_1990) %>%
  mutate(diff_Male = Male_2000_2005 - Male_1985_1990)
# Taking a look at the first few rows
head(subdata)
```

```
## Country.or.Area Female Male
## 1 Afghanistan 42 42
## 2 Albania 79 73
## 3 Algeria 72 70
## 4 Angola 43 39
## 5 Argentina 78 71
## 6 Armenia 75 68
```

Visualize II

Now let's create our second plot, in which we will represent average life expectancy differences between "1985-1990" and "2000-2005" for men and women.

```
# A nice first version of the plot with an abline, scaled axes, and labels
ggplot(subdata2, aes(x = diff_Male, y = diff_Female, label = Country.or.Area)) +
  geom_point(colour = "white", fill = "chartreuse3", shape = 21, alpha = .55, size = 5) +
  geom_abline(intercept = 0, slope = 1, linetype = 2) +
  scale_x_continuous(limits = c(-25, 25)) +
  scale_y_continuous(limits = c(-25, 25)) +
  labs(title = "Life Expectancy at Birth by Country in Years",
       subtitle = "Difference between 1985-1990 and 2000-2005. Average.",
       caption = "Source: United Nations Statistics Division",
       x = "Males",
       y = "Females") +
  theme_bw()
```

Reference lines II

Adding reference lines can make plots easier to understand. We already added a diagonal line to visualize differences between men and women more clearly. Now we will add two more lines to help identify the countries where life expectancy increased or decreased over the period analyzed.

```
# Adding a hline and vline to the previous plot
ggplot(subdata2, aes(x = diff_Male, y = diff_Female, label = Country.or.Area)) +
  geom_point(colour = "white", fill = "chartreuse3", shape = 21, alpha = .55, size = 5) +
  geom_abline(intercept = 0, slope = 1, linetype = 2) +
  scale_x_continuous(limits = c(-25, 25)) +
  scale_y_continuous(limits = c(-25, 25)) +
  geom_hline(yintercept = 0) +
  geom_vline(xintercept = 0) +
  labs(title = "Life Expectancy at Birth by Country",
       subtitle = "Years. Difference between 1985-1990 and 2000-2005. Average.",
       caption = "Source: United Nations Statistics Division",
       x = "Males",
       y = "Females") +
  theme_bw()
```

Highlighting remarkable countries II

As we did in the first plot, let's label some points. Concretely, we will label the three countries where the aggregated average life expectancy for men and women increased the most, and the three where it decreased the most over the period.

```
# Subsetting data to obtain countries of interest
top <- subdata2 %>% arrange(diff_Male + diff_Female) %>% head(3)
bottom <- subdata2 %>% arrange(-(diff_Male + diff_Female)) %>% head(3)
# Adding text to the previous plot to label countries of interest
ggplot(subdata2, aes(x = diff_Male, y = diff_Female, label = Country.or.Area), guide = FALSE) +
  geom_point(colour = "white", fill = "chartreuse3", shape = 21, alpha = .55, size = 5) +
  geom_abline(intercept = 0, slope = 1, linetype = 2) +
  scale_x_continuous(limits = c(-25, 25)) +
  scale_y_continuous(limits = c(-25, 25)) +
  geom_hline(yintercept = 0, linetype = 2) +
  geom_vline(xintercept = 0, linetype = 2) +
  labs(title = "Life Expectancy at Birth by Country",
       subtitle = "Years. Difference between 1985-1990 and 2000-2005. Average.",
       caption = "Source: United Nations Statistics Division",
       x = "Males",
       y = "Females") +
  geom_text(data = top, size = 3) +
  geom_text(data = bottom, size = 3) +
  theme_bw()
```

```
# Load relevant packages
library(dplyr)
library(ggplot2)
library(corrplot)
library(MatchIt)
library(psych)
library(GGally)
library(gridExtra)
```

```
#read the dataset
High_Note <- read.csv("Data/peer_influence.csv", header = TRUE)
```

`describeBy(High_Note, group = High_Note$adopter, mat = FALSE, digits=2)`

```
##
## Descriptive statistics by group
## group: 0
## vars n mean sd median trimmed
## ï..ID 1 40300 20150.50 11633.75 20150.50 20150.50
## age 2 40300 23.95 6.37 23.00 23.09
## male 3 40300 0.62 0.48 1.00 0.65
## friend_cnt 4 40300 18.49 57.48 7.00 10.28
## avg_friend_age 5 40300 24.01 5.10 23.00 23.40
## avg_friend_male 6 40300 0.62 0.32 0.67 0.65
## friend_country_cnt 7 40300 3.96 5.76 2.00 2.66
## subscriber_friend_cnt 8 40300 0.42 2.42 0.00 0.13
## songsListened 9 40300 17589.44 28416.02 7440.00 11817.64
## lovedTracks 10 40300 86.82 263.58 14.00 36.35
## posts 11 40300 5.29 104.31 0.00 0.23
## playlists 12 40300 0.55 1.07 0.00 0.45
## shouts 13 40300 29.97 150.69 4.00 8.84
## adopter 14 40300 0.00 0.00 0.00 0.00
## tenure 15 40300 43.81 19.79 44.00 43.72
## good_country 16 40300 0.36 0.48 0.00 0.32
## mad min max range skew kurtosis se
## ï..ID 14937.19 1 40300 40299 0.00 -1.20 57.95
## age 4.45 8 79 71 1.97 6.80 0.03
## male 0.00 0 1 1 -0.50 -1.75 0.00
## friend_cnt 7.41 1 4957 4956 32.67 2087.42 0.29
## avg_friend_age 3.95 8 77 69 1.84 7.15 0.03
## avg_friend_male 0.35 0 1 1 -0.52 -0.72 0.00
## friend_country_cnt 1.48 0 129 129 4.74 38.29 0.03
## subscriber_friend_cnt 0.00 0 309 309 72.19 8024.62 0.01
## songsListened 10576.87 0 1000000 1000000 6.05 105.85 141.55
## lovedTracks 20.76 0 12522 12522 13.12 335.93 1.31
## posts 0.00 0 12309 12309 73.92 7005.34 0.52
## playlists 0.00 0 98 98 28.21 1945.28 0.01
## shouts 4.45 0 7736 7736 22.53 779.12 0.75
## adopter 0.00 0 0 0 NaN NaN 0.00
## tenure 22.24 1 111 110 0.05 -0.70 0.10
## good_country 0.00 0 1 1 0.59 -1.65 0.00
## --------------------------------------------------------
## group: 1
## vars n mean sd median trimmed
## ï..ID 1 3527 42064.00 1018.30 42064.00 42064.00
## age 2 3527 25.98 6.84 24.00 25.05
## male 3 3527 0.73 0.44 1.00 0.79
## friend_cnt 4 3527 39.73 117.27 16.00 23.69
## avg_friend_age 5 3527 25.44 5.21 24.36 24.83
## avg_friend_male 6 3527 0.64 0.25 0.67 0.65
## friend_country_cnt 7 3527 7.19 8.86 4.00 5.36
## subscriber_friend_cnt 8 3527 1.64 5.85 0.00 0.84
## songsListened 9 3527 33758.04 43592.73 20908.00 25811.69
## lovedTracks 10 3527 264.34 491.43 108.00 161.68
## posts 11 3527 21.20 221.99 0.00 1.44
## playlists 12 3527 0.90 2.56 1.00 0.59
## shouts 13 3527 99.44 1156.07 9.00 23.89
## adopter 14 3527 1.00 0.00 1.00 1.00
## tenure 15 3527 45.58 20.04 46.00 45.60
## good_country 16 3527 0.29 0.45 0.00 0.23
## mad min max range skew kurtosis se
## ï..ID 1307.65 40301 43827 3526 0.00 -1.20 17.15
## age 4.45 8 73 65 1.68 4.39 0.12
## male 0.00 0 1 1 -1.03 -0.94 0.01
## friend_cnt 17.79 1 5089 5088 26.04 1013.79 1.97
## avg_friend_age 3.91 12 62 50 1.68 5.05 0.09
## avg_friend_male 0.25 0 1 1 -0.54 -0.05 0.00
## friend_country_cnt 4.45 0 136 136 3.61 24.53 0.15
## subscriber_friend_cnt 0.00 0 287 287 34.05 1609.52 0.10
## songsListened 23276.82 0 817290 817290 4.71 46.64 734.03
## lovedTracks 140.85 0 10220 10220 6.52 80.96 8.27
## posts 0.00 0 8506 8506 26.52 852.38 3.74
## playlists 1.48 0 118 118 28.84 1244.31 0.04
## shouts 11.86 0 65872 65872 52.52 2969.09 19.47
## adopter 0.00 1 1 0 NaN NaN 0.00
## tenure 20.76 0 111 111 0.02 -0.62 0.34
## good_country 0.00 0 1 1 0.94 -1.12 0.01
```

```
# Log-transform variables whose values are much larger than the others (heavily skewed)
High_Note <- High_Note %>% mutate(ln_songsListened = log(songsListened + 1))
High_Note <- High_Note %>% mutate(ln_lovedTracks = log(lovedTracks + 1))
```

```
#Start with some visualizations
ggcorr(High_Note, palette = "RdBu", label = TRUE)
```

`pairs(~age+friend_cnt+ln_songsListened+posts, col=High_Note$adopter, data=High_Note, main="Scatterplot Matrix")`

```
#Demographics
Age_plot <- ggplot(High_Note, aes(x = factor(adopter), y = age)) + geom_boxplot(aes(fill = factor(adopter))) + labs(x = "Adopter")
Age_plot
```

```
Age_plotbygender <- ggplot(High_Note, aes(x=factor(adopter), y=age, color = factor(male))) + geom_boxplot() + labs(x = "Adopter")
Age_plotbygender
```

```
# Peer influence
friend_cnt_plot <- ggplot(High_Note, aes(x = factor(adopter), y = friend_cnt)) +
  geom_boxplot(aes(fill = factor(adopter))) +
  ylim(0, 1250) +
  labs(x = "Adopter")
friend_cnt_plot
```

`## Warning: Removed 13 rows containing non-finite values (stat_boxplot).`

```
subscriber_friend_cnt_plot <- ggplot(High_Note, aes(x = factor(adopter), y = subscriber_friend_cnt)) +
geom_boxplot(aes(fill = factor(adopter))) + ylim(0, 50) + labs(x = "Adopter")
subscriber_friend_cnt_plot
```

`## Warning: Removed 9 rows containing non-finite values (stat_boxplot).`

```
#user engagement
ln_songsListened_plot <- ggplot(High_Note, aes(x = factor(adopter), y = ln_songsListened)) + geom_boxplot(aes(fill = factor(adopter)))
ln_songsListened_plot
```

```
ln_lovedTracks_plot <- ggplot(High_Note, aes(x = factor(adopter), y = ln_lovedTracks)) + geom_boxplot(aes(fill = factor(adopter)))
ln_lovedTracks_plot
```

## Propensity Score Matching

```
#Transform subscriber friend count into the Treatment and control variable with Treatment being 1 and Control being 0
High_Note$Treatment <- ifelse(High_Note$subscriber_friend_cnt >= 1, 1, 0)
```

```
# For those who have 1 or more subscriber friends, on average 18% are premium subscribers,
# while for those who have 0 subscriber friends, on average 5% are premium subscribers.
High_Note %>%
  group_by(Treatment) %>%
  summarise(mean_adopter = mean(adopter))
```

```
## # A tibble: 2 x 2
## Treatment mean_adopter
## <dbl> <dbl>
## 1 0 0.0524
## 2 1 0.178
```

`with(High_Note, t.test(adopter ~ Treatment))`

```
##
## Welch Two Sample t-test
##
## data: adopter by Treatment
## t = -30.961, df = 11815, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.1330281 -0.1171869
## sample estimates:
## mean in group 0 mean in group 1
## 0.05243501 0.17754250
```

```
High_Note_cov <- c("age", "male", "friend_cnt", "avg_friend_age", "avg_friend_male",
                   "friend_country_cnt", "ln_songsListened", "ln_lovedTracks",
                   "posts", "playlists", "shouts", "tenure", "good_country")
High_Note %>%
  group_by(Treatment) %>%
  select(one_of(High_Note_cov)) %>%
  summarise_all(funs(mean(., na.rm = T)))
```

`## Adding missing grouping variables: `Treatment``

```
## Warning: funs() is soft deprecated as of dplyr 0.8.0
## Please use a list of either functions or lambdas:
##
## # Simple named list:
## list(mean = mean, median = median)
##
## # Auto named with `tibble::lst()`:
## tibble::lst(mean, median)
##
## # Using lambdas
## list(~ mean(., trim = .2), ~ median(., na.rm = TRUE))
## This warning is displayed once per session.
```

```
## # A tibble: 2 x 14
## Treatment age male friend_cnt avg_friend_age avg_friend_male
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0 23.7 0.629 10.4 23.8 0.613
## 2 1 25.4 0.636 54.0 25.4 0.636
## # ... with 8 more variables: friend_country_cnt <dbl>,
## # ln_songsListened <dbl>, ln_lovedTracks <dbl>, posts <dbl>,
## # playlists <dbl>, shouts <dbl>, tenure <dbl>, good_country <dbl>
```

```
# T-test for each covariate across treatment groups
list_of_var <- c("age", "male", "friend_cnt", "avg_friend_age", "avg_friend_male",
                 "friend_country_cnt", "ln_songsListened", "ln_lovedTracks",
                 "posts", "playlists", "shouts", "tenure", "good_country")
lapply(list_of_var, function(v) {
  t.test(High_Note[, v] ~ High_Note$Treatment)
})
```

```
## [[1]]
##
## Welch Two Sample t-test
##
## data: High_Note[, v] by High_Note$Treatment
## t = -20.841, df = 14645, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -1.778544 -1.472749
## sample estimates:
## mean in group 0 mean in group 1
## 23.74756 25.37321
##
##
## [[2]]
##
## Welch Two Sample t-test
##
## data: High_Note[, v] by High_Note$Treatment
## t = -1.3459, df = 15986, p-value = 0.1784
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.018236129 0.003388028
## sample estimates:
## mean in group 0 mean in group 1
## 0.6288378 0.6362618
##
##
## [[3]]
##
## Welch Two Sample t-test
##
## data: High_Note[, v] by High_Note$Treatment
## t = -33.707, df = 9903.1, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -46.12459 -41.05469
## sample estimates:
## mean in group 0 mean in group 1
## 10.43133 54.02097
##
##
## [[4]]
##
## Welch Two Sample t-test
##
## data: High_Note[, v] by High_Note$Treatment
## t = -27.658, df = 15667, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -1.744514 -1.513611
## sample estimates:
## mean in group 0 mean in group 1
## 23.76137 25.39043
##
##
## [[5]]
##
## Welch Two Sample t-test
##
## data: High_Note[, v] by High_Note$Treatment
## t = -7.7114, df = 23020, p-value = 1.294e-14
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.02846397 -0.01692672
## sample estimates:
## mean in group 0 mean in group 1
## 0.6131124 0.6358077
##
##
## [[6]]
##
## Welch Two Sample t-test
##
## data: High_Note[, v] by High_Note$Treatment
## t = -65.05, df = 10372, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -6.861271 -6.459857
## sample estimates:
## mean in group 0 mean in group 1
## 2.725062 9.385626
##
##
## [[7]]
##
## Welch Two Sample t-test
##
## data: High_Note[, v] by High_Note$Treatment
## t = -72.169, df = 24545, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -1.702958 -1.612902
## sample estimates:
## mean in group 0 mean in group 1
## 7.944104 9.602034
##
##
## [[8]]
##
## Welch Two Sample t-test
##
## data: High_Note[, v] by High_Note$Treatment
## t = -64.938, df = 15507, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -1.557214 -1.465962
## sample estimates:
## mean in group 0 mean in group 1
## 2.446598 3.958186
##
##
## [[9]]
##
## Welch Two Sample t-test
##
## data: High_Note[, v] by High_Note$Treatment
## t = -7.3649, df = 9933.6, p-value = 1.914e-13
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -22.76492 -13.19424
## sample estimates:
## mean in group 0 mean in group 1
## 2.543377 20.522956
##
##
## [[10]]
##
## Welch Two Sample t-test
##
## data: High_Note[, v] by High_Note$Treatment
## t = -10.492, df = 11238, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.2546958 -0.1745100
## sample estimates:
## mean in group 0 mean in group 1
## 0.5294671 0.7440700
##
##
## [[11]]
##
## Welch Two Sample t-test
##
## data: High_Note[, v] by High_Note$Treatment
## t = -11.426, df = 9888.1, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -100.04703 -70.74591
## sample estimates:
## mean in group 0 mean in group 1
## 16.42304 101.81951
##
##
## [[12]]
##
## Welch Two Sample t-test
##
## data: High_Note[, v] by High_Note$Treatment
## t = -14.696, df = 15805, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -3.792309 -2.899752
## sample estimates:
## mean in group 0 mean in group 1
## 43.20268 46.54871
##
##
## [[13]]
##
## Welch Two Sample t-test
##
## data: High_Note[, v] by High_Note$Treatment
## t = 2.0956, df = 16030, p-value = 0.03613
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 0.0007383591 0.0220968020
## sample estimates:
## mean in group 0 mean in group 1
## 0.3546936 0.3432760
```

Propensity score estimation

We estimate the propensity score by running a logit model (a probit also works) where the outcome variable is a binary indicator of treatment status. What covariates should we include? For the matching to give a causal estimate in the end, we need to include every covariate that is related to both the treatment assignment and the potential outcomes. Therefore, at this point I choose to include all variables.
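
Formally, the propensity score is the conditional probability of being treated given the observed covariates; with a logit link it takes the form

$$e(X_i) = \Pr(T_i = 1 \mid X_i) = \frac{\exp(X_i\beta)}{1 + \exp(X_i\beta)}$$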

```
m_ps <- glm(Treatment ~ age + male + friend_cnt + avg_friend_age + avg_friend_male + friend_country_cnt +
ln_songsListened + ln_lovedTracks + posts + playlists + shouts + tenure + good_country,
family = binomial(), data = High_Note)
```

`## Warning: glm.fit: algorithm did not converge`

`## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred`

`summary(m_ps)`

```
##
## Call:
## glm(formula = Treatment ~ age + male + friend_cnt + avg_friend_age +
## avg_friend_male + friend_country_cnt + ln_songsListened +
## ln_lovedTracks + posts + playlists + shouts + tenure + good_country,
## family = binomial(), data = High_Note)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -4.1675 -0.5870 -0.4007 -0.1806 6.8050
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -6.7989512 0.1066723 -63.737 < 2e-16 ***
## age 0.0211928 0.0028963 7.317 2.53e-13 ***
## male 0.0127828 0.0303040 0.422 0.67316
## friend_cnt 0.0281816 0.0010230 27.549 < 2e-16 ***
## avg_friend_age 0.0836847 0.0036034 23.224 < 2e-16 ***
## avg_friend_male 0.2322512 0.0524685 4.426 9.58e-06 ***
## friend_country_cnt 0.1021891 0.0046711 21.877 < 2e-16 ***
## ln_songsListened 0.1700614 0.0088009 19.323 < 2e-16 ***
## ln_lovedTracks 0.1282609 0.0076995 16.658 < 2e-16 ***
## posts 0.0006760 0.0002331 2.900 0.00373 **
## playlists -0.0108768 0.0131725 -0.826 0.40897
## shouts -0.0004121 0.0001525 -2.703 0.00688 **
## tenure -0.0031864 0.0007966 -4.000 6.33e-05 ***
## good_country 0.0032863 0.0293230 0.112 0.91077
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 46640 on 43826 degrees of freedom
## Residual deviance: 33588 on 43813 degrees of freedom
## AIC: 33616
##
## Number of Fisher Scoring iterations: 25
```

Using this model, we can now calculate the propensity score for each user. It is simply the user's predicted probability of being treated, given the estimates from the logit model.

```
prs_df <- data.frame(pr_score = predict(m_ps, type = "response"),
Treatment = m_ps$model$Treatment)
head(prs_df)
```

```
## pr_score Treatment
## 1 0.12150210 0
## 2 0.03447076 0
## 3 0.04815195 0
## 4 0.22831497 1
## 5 0.65092056 0
## 6 0.18205340 0
```

Examining the region of common support

After estimating the propensity score, it is useful to plot histograms of the estimated propensity scores by treatment status.

```
labs <- paste("Treatment type:", c("1 or more friends", "0 friends"))
prs_df %>%
  mutate(SubscriberType = ifelse(Treatment == 1, labs[1], labs[2])) %>%
  ggplot(aes(x = pr_score)) +
  geom_histogram(color = "white", binwidth = 0.025) +
  facet_wrap(~SubscriberType) +
  xlab("Probability of Treatment") +
  theme_bw()
```

The method we use below is to find pairs of observations that have very similar propensity scores but differ in their treatment status. We use the MatchIt package for this. It estimates the propensity score in the background and then matches observations based on the method of choice ("nearest" in this case).

```
High_Note_nomiss <- High_Note %>%  # MatchIt does not allow missing values
  select(adopter, Treatment, one_of(High_Note_cov)) %>%
  na.omit()
mod_match <- matchit(Treatment ~ age + male + friend_cnt + avg_friend_age + avg_friend_male +
                       friend_country_cnt + ln_songsListened + ln_lovedTracks + posts +
                       playlists + shouts + tenure + good_country,
                     method = "nearest", data = High_Note_nomiss)
```

`## Warning: glm.fit: algorithm did not converge`

`## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred`

We can get some information about how successful the matching was using `summary(mod_match)` and `plot(mod_match)`.

`summary(mod_match)`

```
##
## Call:
## matchit(formula = Treatment ~ age + male + friend_cnt + avg_friend_age +
## avg_friend_male + friend_country_cnt + ln_songsListened +
## ln_lovedTracks + posts + playlists + shouts + tenure + good_country,
## data = High_Note_nomiss, method = "nearest")
##
## Summary of balance for all data:
## Means Treated Means Control SD Control Mean Diff
## distance 0.4689 0.1532 0.1496 0.3157
## age 25.3732 23.7476 6.2245 1.6256
## male 0.6363 0.6288 0.4831 0.0074
## friend_cnt 54.0210 10.4313 15.2769 43.5896
## avg_friend_age 25.3904 23.7614 5.0577 1.6291
## avg_friend_male 0.6358 0.6131 0.3343 0.0227
## friend_country_cnt 9.3856 2.7251 3.1024 6.6606
## ln_songsListened 9.6020 7.9441 2.7002 1.6579
## ln_lovedTracks 3.9582 2.4466 1.9781 1.5116
## posts 20.5230 2.5434 33.7947 17.9796
## playlists 0.7441 0.5295 0.9673 0.2146
## shouts 101.8195 16.4230 79.7381 85.3965
## tenure 46.5487 43.2027 19.7212 3.3460
## good_country 0.3433 0.3547 0.4784 -0.0114
## eQQ Med eQQ Mean eQQ Max
## distance 0.2772 0.3157 0.6429
## age 1.0000 1.6296 5.0000
## male 0.0000 0.0074 1.0000
## friend_cnt 22.0000 43.5838 4794.0000
## avg_friend_age 1.5909 1.6369 11.5000
## avg_friend_male 0.0738 0.0958 0.3636
## friend_country_cnt 5.0000 6.6598 95.0000
## ln_songsListened 1.2880 1.6583 6.0283
## ln_lovedTracks 1.5315 1.5115 2.8332
## posts 0.0000 17.8829 9535.0000
## playlists 0.0000 0.2092 26.0000
## shouts 15.0000 85.1764 59168.0000
## tenure 3.0000 3.3473 10.0000
## good_country 0.0000 0.0114 1.0000
##
##
## Summary of balance for matched data:
## Means Treated Means Control SD Control Mean Diff
## distance 0.4689 0.3151 0.1854 0.1538
## age 25.3732 25.7794 7.5478 -0.4062
## male 0.6363 0.6585 0.4743 -0.0222
## friend_cnt 54.0210 21.5264 23.3770 32.4946
## avg_friend_age 25.3904 26.0051 6.4510 -0.6146
## avg_friend_male 0.6358 0.6478 0.2587 -0.0120
## friend_country_cnt 9.3856 5.0678 4.6172 4.3178
## ln_songsListened 9.6020 9.4839 1.6281 0.1181
## ln_lovedTracks 3.9582 3.7265 1.8645 0.2317
## posts 20.5230 6.2683 60.7389 14.2546
## playlists 0.7441 0.6678 0.9905 0.0762
## shouts 101.8195 36.7657 134.5236 65.0539
## tenure 46.5487 47.1905 19.1378 -0.6418
## good_country 0.3433 0.3602 0.4801 -0.0169
## eQQ Med eQQ Mean eQQ Max
## distance 0.1147 0.1538 0.4184
## age 0.0000 0.4599 6.0000
## male 0.0000 0.0222 1.0000
## friend_cnt 12.0000 32.4946 4794.0000
## avg_friend_age 0.3333 0.8619 14.0000
## avg_friend_male 0.0119 0.0258 0.1463
## friend_country_cnt 2.0000 4.3178 95.0000
## ln_songsListened 0.1447 0.1653 2.1972
## ln_lovedTracks 0.3081 0.2664 0.8255
## posts 0.0000 14.2546 9535.0000
## playlists 0.0000 0.1194 95.0000
## shouts 9.0000 65.0539 59168.0000
## tenure 1.0000 0.9592 3.0000
## good_country 0.0000 0.0169 1.0000
##
## Percent Balance Improvement:
## Mean Diff. eQQ Med eQQ Mean eQQ Max
## distance 51.2952 58.6333 51.2929 34.9201
## age 75.0137 100.0000 71.7766 -20.0000
## male -198.9313 0.0000 -198.6301 0.0000
## friend_cnt 25.4535 45.4545 25.4436 0.0000
## avg_friend_age 62.2701 79.0476 47.3451 -21.7391
## avg_friend_male 47.2679 83.8776 73.0291 59.7722
## friend_country_cnt 35.1733 60.0000 35.1656 0.0000
## ln_songsListened 92.8737 88.7676 90.0310 63.5514
## ln_lovedTracks 84.6695 79.8827 82.3770 70.8626
## posts 20.7178 0.0000 20.2893 0.0000
## playlists 64.4694 0.0000 42.9197 -265.3846
## shouts 23.8214 40.0000 23.6246 0.0000
## tenure 80.8203 66.6667 71.3452 70.0000
## good_country -48.0096 0.0000 -48.2143 0.0000
##
## Sample sizes:
## Control Treated
## All 34004 9823
## Matched 9823 9823
## Unmatched 24181 0
## Discarded 0 0
```

`plot(mod_match)`

To create a data frame containing only the matched observations, use the `match.data()` function.

```
dta_m <- match.data(mod_match)
dim(dta_m)
```

`## [1] 19646 17`

Examining covariate balance in the matched sample

Visual Inspection

```
fn_bal <- function(dta, variable) {
  dta$variable <- dta[, variable]
  dta$Treatment <- as.factor(dta$Treatment)
  support <- c(min(dta$variable), max(dta$variable))
  ggplot(dta, aes(x = distance, y = variable, color = Treatment)) +
    geom_point(alpha = 0.2, size = 1.3) +
    geom_smooth(method = "loess", se = F) +
    xlab("Propensity score") +
    ylab(variable) +
    theme_bw() +
    ylim(support)
}
grid.arrange(
  fn_bal(dta_m, "age"),
  fn_bal(dta_m, "male") + theme(legend.position = "none"),
  fn_bal(dta_m, "friend_cnt"),
  fn_bal(dta_m, "avg_friend_age") + theme(legend.position = "none"),
  fn_bal(dta_m, "friend_country_cnt"),
  fn_bal(dta_m, "ln_songsListened") + theme(legend.position = "none"),
  fn_bal(dta_m, "ln_lovedTracks"),
  fn_bal(dta_m, "playlists") + theme(legend.position = "none"),
  fn_bal(dta_m, "tenure"),
  fn_bal(dta_m, "good_country") + theme(legend.position = "none"),
  nrow = 6, widths = c(1, 0.8)
)
```

Difference of Means

```
dta_m %>%
  group_by(Treatment) %>%
  select(one_of(High_Note_cov)) %>%
  summarise_all(funs(mean))
```

`## Adding missing grouping variables: `Treatment``

```
## # A tibble: 2 x 14
## Treatment age male friend_cnt avg_friend_age avg_friend_male
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0 25.8 0.658 21.5 26.0 0.648
## 2 1 25.4 0.636 54.0 25.4 0.636
## # ... with 8 more variables: friend_country_cnt <dbl>,
## # ln_songsListened <dbl>, ln_lovedTracks <dbl>, posts <dbl>,
## # playlists <dbl>, shouts <dbl>, tenure <dbl>, good_country <dbl>
```

```
lapply(High_Note_cov, function(v) {
  t.test(dta_m[, v] ~ dta_m$Treatment)
})
```

```
## [[1]]
##
## Welch Two Sample t-test
##
## data: dta_m[, v] by dta_m$Treatment
## t = 3.9186, df = 19521, p-value = 8.937e-05
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 0.2030131 0.6093660
## sample estimates:
## mean in group 0 mean in group 1
## 25.77940 25.37321
##
##
## [[2]]
##
## Welch Two Sample t-test
##
## data: dta_m[, v] by dta_m$Treatment
## t = 3.2559, df = 19640, p-value = 0.001132
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 0.008832644 0.035552982
## sample estimates:
## mean in group 0 mean in group 1
## 0.6584546 0.6362618
##
##
## [[3]]
##
## Welch Two Sample t-test
##
## data: dta_m[, v] by dta_m$Treatment
## t = -24.769, df = 10477, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -35.06619 -29.92292
## sample estimates:
## mean in group 0 mean in group 1
## 21.52642 54.02097
##
##
## [[4]]
##
## Welch Two Sample t-test
##
## data: dta_m[, v] by dta_m$Treatment
## t = 7.3709, df = 18749, p-value = 1.765e-13
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 0.4511959 0.7780907
## sample estimates:
## mean in group 0 mean in group 1
## 26.00507 25.39043
##
##
## [[5]]
##
## Welch Two Sample t-test
##
## data: dta_m[, v] by dta_m$Treatment
## t = 3.428, df = 19374, p-value = 0.0006094
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 0.005124659 0.018810817
## sample estimates:
## mean in group 0 mean in group 1
## 0.6477754 0.6358077
##
##
## [[6]]
##
## Welch Two Sample t-test
##
## data: dta_m[, v] by dta_m$Treatment
## t = -38.82, df = 13820, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -4.535843 -4.099808
## sample estimates:
## mean in group 0 mean in group 1
## 5.067800 9.385626
##
##
## [[7]]
##
## Welch Two Sample t-test
##
## data: dta_m[, v] by dta_m$Treatment
## t = -4.8926, df = 19535, p-value = 1.003e-06
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.1654825 -0.0708156
## sample estimates:
## mean in group 0 mean in group 1
## 9.483885 9.602034
##
##
## [[8]]
##
## Welch Two Sample t-test
##
## data: dta_m[, v] by dta_m$Treatment
## t = -8.294, df = 19474, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.2864991 -0.1769691
## sample estimates:
## mean in group 0 mean in group 1
## 3.726452 3.958186
##
##
## [[9]]
##
## Welch Two Sample t-test
##
## data: dta_m[, v] by dta_m$Treatment
## t = -5.6784, df = 11062, p-value = 1.394e-08
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -19.175269 -9.333944
## sample estimates:
## mean in group 0 mean in group 1
## 6.26835 20.52296
##
##
## [[10]]
##
## Welch Two Sample t-test
##
## data: dta_m[, v] by dta_m$Treatment
## t = -3.4421, df = 14535, p-value = 0.0005789
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.11967088 -0.03282836
## sample estimates:
## mean in group 0 mean in group 1
## 0.6678204 0.7440700
##
##
## [[11]]
##
## Welch Two Sample t-test
##
## data: dta_m[, v] by dta_m$Treatment
## t = -8.5779, df = 10471, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -79.91979 -50.18792
## sample estimates:
## mean in group 0 mean in group 1
## 36.76565 101.81951
##
##
## [[12]]
##
## Welch Two Sample t-test
##
## data: dta_m[, v] by dta_m$Treatment
## t = 2.3025, df = 19612, p-value = 0.02132
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 0.09544044 1.18807784
## sample estimates:
## mean in group 0 mean in group 1
## 47.19047 46.54871
##
##
## [[13]]
##
## Welch Two Sample t-test
##
## data: dta_m[, v] by dta_m$Treatment
## t = 2.4805, df = 19642, p-value = 0.01313
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 0.003545363 0.030252866
## sample estimates:
## mean in group 0 mean in group 1
## 0.3601751 0.3432760
```
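
With roughly 20,000 matched observations, even trivial differences in means produce small p-values, so a scale-free diagnostic such as the standardized mean difference (SMD) is often more informative for judging balance. A minimal base-R sketch, assuming `dta_m` and `High_Note_cov` as defined above:

```
# standardized mean difference per covariate, pooled-SD denominator;
# |SMD| below ~0.1 is a common rule of thumb for acceptable balance
smd <- sapply(High_Note_cov, function(v) {
  x1 <- dta_m[[v]][dta_m$Treatment == 1]
  x0 <- dta_m[[v]][dta_m$Treatment == 0]
  (mean(x1) - mean(x0)) / sqrt((var(x1) + var(x0)) / 2)
})
round(smd, 3)
```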

Estimating treatment effects

Estimating the treatment effect is simple once we have a matched sample that we are happy with. We can use a t-test:

`with(dta_m, t.test(adopter ~ Treatment))`

```
##
## Welch Two Sample t-test
##
## data: adopter by Treatment
## t = -16.725, df = 18453, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.09098592 -0.07189711
## sample estimates:
## mean in group 0 mean in group 1
## 0.09610099 0.17754250
```

In the matched sample, about 17.8% of treated users (those with subscriber friends) are adopters versus about 9.6% of controls. Since `adopter` is binary, we can also estimate the effect with a logistic regression, with or without covariates:

```
glm_treat1 <- glm(adopter ~ Treatment, data = dta_m, family = "binomial")
summary(glm_treat1)
```

```
##
## Call:
## glm(formula = adopter ~ Treatment, family = "binomial", data = dta_m)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.6252 -0.6252 -0.4495 -0.4495 2.1644
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.24132 0.03423 -65.48 <2e-16 ***
## Treatment 0.70823 0.04323 16.38 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 15683 on 19645 degrees of freedom
## Residual deviance: 15404 on 19644 degrees of freedom
## AIC: 15408
##
## Number of Fisher Scoring iterations: 4
```
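
The `Treatment` coefficient of 0.708 is on the log-odds scale; exponentiating it (a one-liner on the same fit) puts it on the more readable odds-ratio scale:

```
# odds ratio for Treatment in the covariate-free model, roughly 2.03
exp(coef(glm_treat1)["Treatment"])
```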

```
glm_treat2 <- glm(adopter ~ age + male + friend_cnt + avg_friend_age + avg_friend_male + friend_country_cnt +
ln_songsListened + ln_lovedTracks + posts + playlists + shouts + tenure + good_country + Treatment,
data = dta_m, family = "binomial")
summary(glm_treat2)
```

```
##
## Call:
## glm(formula = adopter ~ age + male + friend_cnt + avg_friend_age +
## avg_friend_male + friend_country_cnt + ln_songsListened +
## ln_lovedTracks + posts + playlists + shouts + tenure + good_country +
## Treatment, family = "binomial", data = dta_m)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.8606 -0.5791 -0.4414 -0.3078 3.0933
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -6.432e+00 2.287e-01 -28.118 < 2e-16 ***
## age 1.562e-02 4.257e-03 3.670 0.000243 ***
## male 2.999e-01 4.886e-02 6.138 8.38e-10 ***
## friend_cnt 2.092e-04 2.780e-04 0.752 0.451845
## avg_friend_age 2.881e-02 5.681e-03 5.071 3.96e-07 ***
## avg_friend_male 9.103e-02 9.660e-02 0.942 0.346005
## friend_country_cnt -4.700e-03 3.701e-03 -1.270 0.204087
## ln_songsListened 2.035e-01 1.928e-02 10.554 < 2e-16 ***
## ln_lovedTracks 2.672e-01 1.342e-02 19.916 < 2e-16 ***
## posts 1.653e-04 9.384e-05 1.762 0.078131 .
## playlists 3.776e-02 1.315e-02 2.872 0.004084 **
## shouts 9.645e-05 7.059e-05 1.366 0.171798
## tenure -3.914e-03 1.262e-03 -3.101 0.001928 **
## good_country -3.986e-01 4.780e-02 -8.339 < 2e-16 ***
## Treatment 6.379e-01 4.628e-02 13.783 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 15683 on 19645 degrees of freedom
## Residual deviance: 14366 on 19631 degrees of freedom
## AIC: 14396
##
## Number of Fisher Scoring iterations: 5
```

In a logistic regression, the response being modeled is the log-odds that Y = 1. Each regression coefficient gives the change in the log-odds of the response for a one-unit change in the predictor. Because log-odds are difficult to interpret, we can exponentiate the coefficients to obtain odds ratios.

For example, the odds of being a premium subscriber are multiplied by a factor of about 1.35 (exp(0.2999)) when an individual is male, holding the other covariates fixed:

`exp(coef(glm_treat2))`

```
## (Intercept) age male
## 0.001609923 1.015744376 1.349707685
## friend_cnt avg_friend_age avg_friend_male
## 1.000209180 1.029229423 1.095303373
## friend_country_cnt ln_songsListened ln_lovedTracks
## 0.995310871 1.225697366 1.306303144
## posts playlists shouts
## 1.000165322 1.038485243 1.000096459
## tenure good_country Treatment
## 0.996093907 0.671239518 1.892417465
```
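
The `Treatment` odds ratio of roughly 1.89 says that, among matched users, having subscriber friends nearly doubles the odds of adopting. To attach uncertainty to these odds ratios, we can exponentiate confidence limits as well; a minimal sketch on the same fit, using Wald intervals via `confint.default`:

```
# 95% Wald confidence intervals on the odds-ratio scale
exp(cbind(OR = coef(glm_treat2), confint.default(glm_treat2)))
```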

Next, we switch datasets and analyze the Amazon co-purchase network for books using igraph. The `products` table holds one row per item; `copurchase` holds directed Source → Target co-purchase links (the file paths below are an assumption; adjust to your setup).

```
# Load relevant packages
library(car)
library(dplyr)
library(tidyr)
library(igraph)
library(ggplot2)
library(corrplot)

# Read the two tables (file paths are assumed here; adjust as needed)
products <- read.csv("Data/products.csv", stringsAsFactors = FALSE)
copurchase <- read.csv("Data/copurchase.csv", stringsAsFactors = FALSE)
```

`head(products)`

```
## id title
## 1 1 Patterns of Preaching: A Sermon Sampler
## 2 2 Candlemas: Feast of Flames
## 3 3 World War II Allied Fighter Planes Trading Cards
## 4 4 Life Application Bible Commentary: 1 and 2 Timothy and Titus
## 5 5 Prayers That Avail Much for Business: Executive
## 6 6 How the Other Half Lives: Studies Among the Tenements of New York
## group salesrank review_cnt downloads rating
## 1 Book 396585 2 2 5.0
## 2 Book 168596 12 12 4.5
## 3 Book 1270652 1 1 5.0
## 4 Book 631289 1 1 4.0
## 5 Book 455160 0 0 0.0
## 6 Book 188784 17 17 4.0
```

`head(copurchase)`

```
## Source Target
## 1 1 2
## 2 1 4
## 3 1 5
## 4 1 15
## 5 2 11
## 6 2 12
```

```
# We are only interested in a subset of this dataset: books with a
# salesrank between 0 and 150,000
books.products <- filter(products, group == "Book"
                         & salesrank <= 150000 & salesrank >= 0)
books.copurchase <- filter(copurchase, Source %in% books.products$id
                           & Target %in% books.products$id)
```
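
A quick sanity check on the filter (assuming the objects created above):

```
# how many books and co-purchase links survive the filter
nrow(books.products)
nrow(books.copurchase)
```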

```
# Find the ID with the highest indegree (the indegree of a vertex is the
# number of head ends adjacent to it, i.e. edges pointing into it)
indegree.df <- summarize(group_by(books.copurchase, Target), indegree = n()) %>%
  arrange(desc(indegree))
names(indegree.df)[1] <- "id"
head(indegree.df)
```

```
## # A tibble: 6 x 2
## id indegree
## <int> <int>
## 1 4429 46
## 2 33 44
## 3 244 35
## 4 302 22
## 5 5913 20
## 6 626 16
```

```
# Find the ID with the highest outdegree (the outdegree of a vertex is the
# number of tail ends adjacent to it, i.e. edges pointing out of it)
outdegree.df <- summarize(group_by(books.copurchase, Source), outdegree = n()) %>%
  arrange(desc(outdegree))
names(outdegree.df)[1] <- "id"
head(outdegree.df)
```

```
## # A tibble: 6 x 2
## id outdegree
## <int> <int>
## 1 126396 5
## 2 151687 5
## 3 4993 4
## 4 26268 4
## 5 28040 4
## 6 29680 4
```
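
As a cross-check (a sketch using the same filtered edge list), igraph can compute these counts directly once the edge list is turned into a graph object:

```
# build a directed graph and compare against the dplyr counts above
g_check <- graph_from_data_frame(books.copurchase, directed = TRUE)
head(sort(degree(g_check, mode = "in"), decreasing = TRUE))
head(sort(degree(g_check, mode = "out"), decreasing = TRUE))
```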

```
books.graph <- merge(books.products, indegree.df, by = "id", all.x = TRUE)  # merge the data frames
books.graph <- merge(books.graph, outdegree.df, by = "id", all.x = TRUE)    # merge the data frames
books.graph$indegree[is.na(books.graph$indegree)] <- 0    # assign 0 where indegree is NA
books.graph$outdegree[is.na(books.graph$outdegree)] <- 0  # assign 0 where outdegree is NA
books.graph <- mutate(books.graph, degree = indegree + outdegree)  # total degree = indegree + outdegree
head(books.graph)
```

```
## id
## 1 12
## 2 33
## 3 39
## 4 45
## 5 74
## 6 77
## title
## 1 Fantastic Food with Splenda : 160 Great Recipes for Meals Low in Sugar, Carbohydrates, Fat, and Calories
## 2 Double Jeopardy (T*Witches, 6)
## 3 Night of Many Dreams : A Novel
## 4 Beginning ASP.NET Databases using C#
## 5 Service Delivery (It Infrastructure Library Series)
## 6 Water Touching Stone
## group salesrank review_cnt downloads rating indegree outdegree degree
## 1 Book 24741 12 12 4.5 5 1 6
## 2 Book 97166 4 4 5.0 44 0 44
## 3 Book 57186 22 22 3.5 4 0 4
## 4 Book 48408 4 4 4.0 0 0 0
## 5 Book 27507 2 2 4.0 1 1 2
## 6 Book 27012 11 11 4.5 3 1 4
```

```
# Find the book with the highest total degree
filter(books.graph, degree == max(books.graph$degree))
```

```
## id title group salesrank review_cnt
## 1 4429 Harley-Davidson Panheads, 1948-1965/M418 Book 147799 3
## downloads rating indegree outdegree degree
## 1 3 4.5 46 1 47
```

```
g <- graph_from_data_frame(books.copurchase, directed = TRUE)  # create a directed graph
# keep only the connected subcomponent containing id "4429"
sg <- induced_subgraph(g, subcomponent(g, "4429", "all"), impl = "auto")
sg <- simplify(sg, remove.multiple = F, remove.loops = T)  # drop self-loops
V(sg)
```

```
## + 756/756 vertices, named, from 2bed862:
## [1] 77 130 148 187 193 224 321 322 422 556
## [11] 577 626 724 1051 1644 1817 1822 1851 1971 2071
## [21] 2210 2279 2285 2326 2330 2332 2343 2423 2470 2501
## [31] 2505 2558 2572 2657 2658 2806 2807 2959 3032 3119
## [41] 3191 3217 3306 3588 3670 3737 3861 3909 4002 4014
## [51] 4068 4099 4140 4174 4184 4185 4222 4223 4345 4429
## [61] 4977 4993 4994 5018 5163 5164 5293 5355 5388 5623
## [71] 5638 5639 5655 5670 5821 5851 5875 6012 6014 6392
## [81] 6411 6445 6546 6711 6713 6817 6942 7196 7198 7222
## [91] 7233 7325 7376 7406 7544 7743 7754 7775 7839 7841
## + ... omitted several vertices
```

`E(sg)`

```
## + 986/986 edges from 2bed862 (vertex names):
## [1] 77 ->422 130 ->78 148 ->302 187 ->321 187 ->322 187 ->78
## [7] 193 ->224 224 ->193 224 ->33 321 ->187 321 ->322 321 ->78
## [13] 322 ->187 322 ->321 322 ->78 422 ->77 422 ->1644 556 ->78
## [19] 577 ->33 626 ->33 724 ->302 1051->302 1644->422 1644->5293
## [25] 1817->976 1822->193 1822->724 1851->78 1971->193 2071->3155
## [31] 2210->2279 2210->2285 2279->2210 2279->2326 2285->2330 2326->193
## [37] 2326->2210 2330->2343 2330->2345 2332->4140 2343->2285 2343->2330
## [43] 2423->5410 2470->556 2501->3588 2505->2501 2558->33 2572->4184
## [49] 2572->4185 2657->2658 2658->77 2806->2807 2807->302 2959->1673
## [55] 3032->2558 3119->976 3191->2279 3217->4319 3306->2071 3306->4345
## + ... omitted several edges
```

```
# name the result diameter_path so it does not shadow igraph::diameter()
diameter_path <- get_diameter(sg)
diameter_path
```

```
## + 10/756 vertices, named, from 2bed862:
## [1] 37895 27936 21584 10889 11080 14111 4429 2501 3588 6676
```
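
`get_diameter()` returns the vertices along one longest shortest path; the length of that path in edges (a quick check on the same subgraph) comes from `diameter()`:

```
# number of edges on the longest shortest path in sg
diameter(sg, directed = TRUE)
```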

```
# Plot the subgraph around id "4429": highlight the diameter path in red,
# book 4429 in green and book 33 in gold
V(sg)$color <- ifelse(V(sg)$name %in% diameter_path$name, "red", "lightblue")
V(sg)["4429"]$color <- "green"
V(sg)["33"]$color <- "gold"
E(sg)$color <- "darkgray"
E(sg, path = diameter_path)$color <- "red"
E(sg)$width <- 1
E(sg, path = diameter_path)$width <- 3
options(repr.plot.width = 100, repr.plot.height = 100)
plot(sg, layout = layout_with_fr, vertex.size = 1, vertex.label = NA, edge.arrow.size = 0.05)
```

In the resulting plot, the green vertex is book 4429 (the highest-degree book), the gold vertex is book 33, and the red edges trace the diameter of the subcomponent.
