B Interdependencies
B.1 Variance and covariance
Social scientists try to explain variance and covariance. It is therefore a good idea to learn the formulas for variance and covariance by heart. The sample variance of a continuous random variable X, VAR(X), is as follows:
\[VAR(X)=s_{xx}^2 = s_x^2= \frac{\Sigma^n_{i=1}(X_i-\overline{X})(X_i-\overline{X})}{n-1}= \frac{\Sigma^n_{i=1}(X_i-\overline{X})^2}{n-1}\]
The sample standard deviation is given by:
\[STD(X)=\sqrt{s_x^2}=s_x\]
The sample covariance of two continuous random variables X and Y, COV(X,Y), is as follows:
\[COV(X,Y)=s_{xy}^2 = \frac{\Sigma^n_{i=1}(X_i-\overline{X})(Y_i-\overline{Y})}{n-1}\]
The sample correlation coefficient between two continuous random variables X and Y, COR(X,Y), is the covariance of the standardized variables (\(z_x=(X-\overline{X})/s_x\)) and hence:
\[COR(X,Y)=r_{xy} = \frac{s_{xy}^2}{s_x s_y}= \frac{\Sigma^n_i(X_i-\overline{X})(Y_i-\overline{Y})}{\sqrt{\Sigma^n_i(X_i-\overline{X})^2}\sqrt{\Sigma^n_i(Y_i-\overline{Y})^2}}\] For completeness, the population equivalent of the covariance is:
\[\sigma _{xy}^2 = \frac{\Sigma^N_{i=1}(X_i - \mu_x)(Y_i-\mu_y)}{N},\] with \(\mu\) the population mean. The correlation within the population is:
\[\rho_{xy} = \frac{\sigma_{xy}^2}{\sigma_x \sigma_y}\]
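These sample formulas are easy to verify in base R. A minimal check with made-up numbers (the vectors below are purely illustrative):

```r
# Verify the sample formulas against R's built-in functions.
x <- c(2, 4, 6, 8, 10)
y <- c(1, 3, 2, 5, 4)
n <- length(x)

var_x <- sum((x - mean(x))^2) / (n - 1)                 # sample variance
cov_xy <- sum((x - mean(x)) * (y - mean(y))) / (n - 1)  # sample covariance
cor_xy <- cov_xy / (sd(x) * sd(y))                      # sample correlation

var_x   # 10
cov_xy  # 4
cor_xy  # 0.8
# identical to var(x), cov(x, y) and cor(x, y)
```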
B.1.1 Want to learn more?!
I strongly recommend that you read the online book on probability by Pishro-Nik (2016)!
B.2 Intraclass correlation
B.2.1 Dyadic data
Let us suppose we have dyadic data, for example on the political opinions of two marriage partners. We want to know whether these data are interdependent.
Run the code chunk below to simulate some data.
require(MASS)
set.seed(9864) # We set a seed so that the random numbers we generate are the same each time and we thus end up with the same dataset. Please note that to be absolutely sure to get the same dataset, we need to run the same R version (and package versions).
# let us start with simulating the opinion of both partners.
Sigma <- matrix(c(10, 4, 4, 5), 2, 2)
opinions <- mvrnorm(n = 1000, mu = c(4, 5), Sigma)
opinion_W <- opinions[, 1]
opinion_M <- opinions[, 2]
dyad_id <- 1:1000
# and let's put everything together
data <- data.frame(dyad_id, opinion_W, opinion_M)
# add some description to the data
attr(data, "description") <- "This is a simulated dataset to illustrate interdependencies of observations within dyads (i.e. heterosexual couples). The dataset is in wide-format: one row refers to one couple. Variables with \"_W\" refer to women,\"_M\" refer to men."
# I don't think the variables need any further description.
B.2.1.1 Describe data
Let's have a look at our data.
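The code chunk itself is not echoed here; the output below was presumably produced along these lines (the use of `psych::describe` is an assumption based on the output format):

```r
head(data)                 # first six couples
str(data)                  # structure, including the description attribute
summary(data)              # distribution summaries per variable
attr(data, "description")  # the dataset description
psych::describe(data)      # detailed descriptives (requires the psych package)
```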
#> dyad_id opinion_W opinion_M
#> 1 1 1.180285 3.651525
#> 2 2 9.930618 7.117465
#> 3 3 4.022491 2.205877
#> 4 4 2.990720 7.485650
#> 5 5 3.024059 8.292194
#> 6 6 8.408048 4.720610
#> 'data.frame': 1000 obs. of 3 variables:
#> $ dyad_id : int 1 2 3 4 5 6 7 8 9 10 ...
#> $ opinion_W: num 1.18 9.93 4.02 2.99 3.02 ...
#> $ opinion_M: num 3.65 7.12 2.21 7.49 8.29 ...
#> - attr(*, "description")= chr "This is a simulated dataset to illustrate interdependencies of observations within dyads (i.e. heterosexual cou"| __truncated__
#> dyad_id opinion_W opinion_M
#> Min. : 1.0 Min. :-5.337 Min. :-2.992
#> 1st Qu.: 250.8 1st Qu.: 2.141 1st Qu.: 3.517
#> Median : 500.5 Median : 4.222 Median : 5.013
#> Mean : 500.5 Mean : 4.201 Mean : 5.006
#> 3rd Qu.: 750.2 3rd Qu.: 6.170 3rd Qu.: 6.545
#> Max. :1000.0 Max. :14.476 Max. :11.670
#> [1] "This is a simulated dataset to illustrate interdependencies of observations within dyads (i.e. heterosexual couples). The dataset is in wide-format: one row refers to one couple. Variables with \"_W\" refer to women,\"_M\" refer to men."
#> vars n mean sd median trimmed mad min max range skew kurtosis se
#> dyad_id 1 1000 500.50 288.82 500.50 500.50 370.65 1.00 1000.00 999.00 0.00 -1.20 9.13
#> opinion_W 2 1000 4.20 3.18 4.22 4.19 3.01 -5.34 14.48 19.81 0.05 0.01 0.10
#> opinion_M 3 1000 5.01 2.27 5.01 5.02 2.25 -2.99 11.67 14.66 -0.06 -0.03 0.07
B.2.1.2 Interdependencies: correlation
There are different (naive and less naive) ways to check for interdependence.
For more background information see this page by David A. Kenny.
Also check out paragraph 3.3 of the book by Snijders and Bosker (1999).
Let us start with something that immediately pops into mind… a correlation.
cov(data$opinion_M, data$opinion_W) # the covariance between the two variables. Have a look at the simulation: this is indeed close to the value of 4 we put into Sigma.
#> [1] 4.154203
cov(scale(data$opinion_M), scale(data$opinion_W)) # the covariance between the two standardized variables. That is the correlation.
#> [,1]
#> [1,] 0.5741921
#>
#> Pearson's product-moment correlation
#>
#> data: data$opinion_M and data$opinion_W
#> t = 22.156, df = 998, p-value < 2.2e-16
#> alternative hypothesis: true correlation is not equal to 0
#> 95 percent confidence interval:
#> 0.5311041 0.6143179
#> sample estimates:
#> cor
#> 0.5741921
This would indicate a strong and significant correlation. Remember that our data are in wide format. A better way is to calculate the correlation on a long dataset. This method is called the double entry method. Why is this a better way? It takes into account that the variance and mean of the opinions of men and women may be different. The end result will more closely resemble the ICC we will encounter later.
var1 <- c(data$opinion_M, data$opinion_W)
var2 <- c(data$opinion_W, data$opinion_M)
cor.test(var1, var2)
#>
#> Pearson's product-moment correlation
#>
#> data: var1 and var2
#> t = 26.576, df = 1998, p-value < 2.2e-16
#> alternative hypothesis: true correlation is not equal to 0
#> 95 percent confidence interval:
#> 0.4779248 0.5427243
#> sample estimates:
#> cor
#> 0.5110503
Lower, but still significant.
It is even possible to obtain a negative (and significant) correlation. For example, try repeating the above with the dataset we would get after running the following simulation.
require(MASS)
set.seed(9864) # We set a seed so that the random numbers we generate are the same each time and we thus end up with the same dataset. Please note that to be absolutely sure to get the same dataset, we need to run the same R version (and package versions).
# let us start with simulating the opinion of both partners.
Sigma <- matrix(c(10, 4, 4, 5), 2, 2)
opinions <- mvrnorm(n = 1000, mu = c(20, 25), Sigma)
opinion_W <- opinions[, 1]
opinion_M <- opinions[, 2]
dyad_id <- 1:1000
# and let's put everything together
data <- data.frame(dyad_id, opinion_W, opinion_M)
# add some description to the data
attr(data, "description") <- "This is a simulated dataset to illustrate interdependencies of observations within dyads (i.e. heterosexual couples). The dataset is in wide-format: one row refers to one couple. Variables with \"_W\" refer to women,\"_M\" refer to men."
# I don't think the variables need any further description.
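To see the sign flip, apply the double entry method to this new dataset (a sketch, reusing the code from above):

```r
# Double entry on the shifted dataset (means 20 for women, 25 for men).
var1 <- c(data$opinion_M, data$opinion_W)
var2 <- c(data$opinion_W, data$opinion_M)
cor.test(var1, var2)
# The pooled mean (~22.5) lies between the two group means, so men score
# mostly above it while their partners score mostly below it. These negative
# cross-products outweigh the positive within-couple covariance, and the
# double-entry correlation turns negative.
```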
B.2.1.3 ICC
The intraclass correlation is the correlation between two random members of the same cluster. There are many mathematical definitions of the ICC. Let us start with a definition from the ANOVA tradition:
\[ ICC = \frac{(MS_B - MS_W)}{(MS_B + MS_W)} \] where,
\[ MS_B = 2 \cdot VAR(\bar{X}_{dyad}) \]
and
\[ MS_W = \frac{\sum_{i=1}^{N_{dyads}}(X_{ego,i} - X_{alter,i})^2}{2 N_{dyads}} \]
Let’s have a go!
MSB <- var((data$opinion_M + data$opinion_W)/2) * 2
MSW <- (sum((data$opinion_M - data$opinion_W)^2))/(2 * length(data$opinion_W))
ICC_anova <- (MSB - MSW)/(MSB + MSW)
ICC_anova
#> [1] 0.5114198
Do you see that the ICC is very close to the correlation based on the dataset in long format (double entry method)? Thus, in practice, the double entry method is a very convenient check for interdependencies when you are working with dyadic data.
Most of you are probably more familiar with the definitions of the ICC provided in textbooks on multilevel analysis, where the intraclass correlation - at least for continuous dependent variables - is defined as the between variance (i.e. the variance in dyad means) divided by the total variance (i.e. the sum of the between and within variance). There is only one problem: we need the variances present in the population, but in our data we only observe the variances present in our sample. The observed between variance needs to be corrected. Below I will show you how to do that.
First make a dataset in long format.
# first make a dataset in long format.
dyadmean <- (data$opinion_M + data$opinion_W)/2
data_long <- rbind(data, data)
data_long$partner_id <- rep(1:2, each = 1000)
data_long$dyad_id <- rep(1:1000, times = 2)
data_long$dyadmean <- c(dyadmean, dyadmean)
# let the first dyad entry refer to the woman and the second to the man
data_long$opinion <- ifelse(data_long$partner_id == 1, data_long$opinion_W, data_long$opinion_M)
# also define the opinion of the partner
data_long$opinion_P <- ifelse(data_long$partner_id == 2, data_long$opinion_W, data_long$opinion_M)
head(data_long)
#> dyad_id opinion_W opinion_M partner_id dyadmean opinion opinion_P
#> 1 1 1.180285 3.651525 1 2.415905 1.180285 3.651525
#> 2 2 9.930618 7.117465 1 8.524041 9.930618 7.117465
#> 3 3 4.022491 2.205877 1 3.114184 4.022491 2.205877
#> 4 4 2.990720 7.485650 1 5.238185 2.990720 7.485650
#> 5 5 3.024059 8.292194 1 5.658127 3.024059 8.292194
#> 6 6 8.408048 4.720610 1 6.564329 8.408048 4.720610
With this dataset in long format we can calculate the ICC.
# first calculate the between variance of our sample. Note that we only need each dyad
# once (thus N_dyads = 1000)
S_B <- var(data_long$dyadmean[1:1000])
# within variance
SW <- sum((data_long$opinion - data_long$dyadmean)^2)/1000 # we divide by the number of dyads
# We now need to correct the observed between variance to reflect the population between variance.
S_B_E <- S_B - SW/2
ICC_ML <- S_B_E/(S_B_E + SW)
ICC_ML
#> [1] 0.5114198
This is, of course, exactly the same as ICC_anova. But the procedure is quite cumbersome. It is a lot easier to estimate an empty multilevel model, which also yields the ICC (after some tweaking). See below.
require(nlme)
# estimate empty model with ML
mlme <- lme(opinion ~ 1, data = data_long, random = list(~1 | dyad_id))
summary(mlme)
# Standard deviations are reported instead of variances; extract the variances.
VarCorr(mlme)
# The intercept variance is the between-level variance; the residual variance is the
# observation / within-level variance. Based on these numbers we can calculate the ICC ourselves.
varests <- as.numeric(VarCorr(mlme)[1:2])
varests
ICC_MLb <- varests[1]/sum(varests)
ICC_MLb
#> Linear mixed-effects model fit by REML
#> Data: data_long
#> AIC BIC logLik
#> 9491.554 9508.355 -4742.777
#>
#> Random effects:
#> Formula: ~1 | dyad_id
#> (Intercept) Residual
#> StdDev: 1.998493 1.953361
#>
#> Fixed effects: opinion ~ 1
#> Value Std.Error DF t-value p-value
#> (Intercept) 4.603242 0.07682306 1000 59.92005 0
#>
#> Standardized Within-Group Residuals:
#> Min Q1 Med Q3 Max
#> -3.28830880 -0.51102270 0.01645208 0.57161606 2.75065678
#>
#> Number of Observations: 2000
#> Number of Groups: 1000
#> dyad_id = pdLogChol(1)
#> Variance StdDev
#> (Intercept) 3.993973 1.998493
#> Residual 3.815621 1.953361
#> [1] 3.993973 3.815621
#> [1] 0.5114188
In this course we will rely heavily on the lavaan package. We can also calculate the ICC with lavaan.
require("lavaan")
model <- "
level: 1
opinion ~ 1 #regression model
opinion ~~ opinion #variance
level: 2
opinion ~ 1
opinion ~~ opinion
"
fit <- lavaan(model = model, data = data_long, cluster = "dyad_id")
summary(fit)
#> lavaan 0.6.17 ended normally after 7 iterations
#>
#> Estimator ML
#> Optimization method NLMINB
#> Number of model parameters 3
#>
#> Number of observations 2000
#> Number of clusters [dyad_id] 1000
#>
#> Model Test User Model:
#>
#> Test statistic 0.000
#> Degrees of freedom 0
#>
#> Parameter Estimates:
#>
#> Standard errors Standard
#> Information Observed
#> Observed information based on Hessian
#>
#>
#> Level 1 [within]:
#>
#> Intercepts:
#> Estimate Std.Err z-value P(>|z|)
#> opinion 0.000
#>
#> Variances:
#> Estimate Std.Err z-value P(>|z|)
#> opinion 3.816 0.171 22.361 0.000
#>
#>
#> Level 2 [dyad_id]:
#>
#> Intercepts:
#> Estimate Std.Err z-value P(>|z|)
#> opinion 4.603 0.077 59.950 0.000
#>
#> Variances:
#> Estimate Std.Err z-value P(>|z|)
#> opinion 3.988 0.277 14.391 0.000
#> opinion
#> 0.511
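The last two lines of the output (the 0.511) report the ICC; they appear to come from lavaan's inspection helper (an assumption based on the output format):

```r
# lavaan can report the ICC of a fitted multilevel model directly
# (assuming the `fit` object from the chunk above):
lavInspect(fit, "icc")
```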
The take-home message is that the two observations within each dyad are indeed interrelated. Is this a lot? Is it significant?
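One way to answer the significance question is a likelihood-ratio test: compare the empty multilevel model with a single-level model that omits the random intercept (a sketch, assuming `data_long` and the nlme package from the chunks above):

```r
library(nlme)
# model without a dyad-level random intercept
m0 <- gls(opinion ~ 1, data = data_long, method = "REML")
# model with a dyad-level random intercept (as estimated above)
m1 <- lme(opinion ~ 1, random = ~1 | dyad_id, data = data_long, method = "REML")
anova(m0, m1)  # a small p-value indicates significant within-dyad clustering
```

Because the between variance is tested at the boundary of its parameter space (it cannot be negative), this test is conservative; halving the reported p-value is a common correction (see Snijders and Bosker 1999).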
B.2.2 Egonet / Socionet data
The above procedure for calculating the ICC can also be used for egonet data.
Let us suppose we have egonet data, for example on the political opinions of egos and their friends (alters). We want to know whether these data are interdependent.
Run the code chunk below to simulate some data.
require(MASS)
set.seed(9864) # We set a seed so that the random numbers we generate are the same each time and we thus end up with the same dataset. Please note that to be absolutely sure to get the same dataset, we need to run the same R version (and package versions).
# let us start with simulating the opinions of ego and the alters.
Sigma <- matrix(sample(c(1, 2, 3), 36, replace = T), 6, 6)
Sigma[lower.tri(Sigma)] <- t(Sigma)[lower.tri(Sigma)]
diag(Sigma) <- c(5, 4, 6, 3, 7, 6)
# Sigma
opinions <- mvrnorm(n = 1000, mu = c(4, 4, 4, 4, 4, 4), Sigma)
opinion_ego <- opinions[, 1]
opinion_alter1 <- opinions[, 2]
opinion_alter2 <- opinions[, 3]
opinion_alter3 <- opinions[, 4]
opinion_alter4 <- opinions[, 5]
opinion_alter5 <- opinions[, 6]
egonet_id <- 1:1000
# and let's put everything together
data <- data.frame(egonet_id, opinion_alter1, opinion_alter2, opinion_alter3, opinion_alter4, opinion_alter5)  # note: ego's own opinion is simulated but not included in the dataset
# I don't think the variables need any further description.
B.2.2.1 Describe data
Let's have a look at our data.
#> egonet_id opinion_alter1 opinion_alter2 opinion_alter3 opinion_alter4 opinion_alter5
#> 1 1 -0.002812297 2.341753 3.184144 0.06355581 0.7949200
#> 2 2 6.420180903 2.757233 5.955405 5.78889612 7.6351737
#> 3 3 4.035976025 1.730848 2.447139 2.77713143 0.9910096
#> 4 4 1.572971112 4.217697 4.451395 0.28006793 3.0813234
#> 5 5 1.847221600 3.205552 4.836978 0.37768238 4.1432356
#> 6 6 5.896886119 5.500884 4.302935 3.28402390 -2.2955251
#> 'data.frame': 1000 obs. of 6 variables:
#> $ egonet_id : int 1 2 3 4 5 6 7 8 9 10 ...
#> $ opinion_alter1: num -0.00281 6.42018 4.03598 1.57297 1.84722 ...
#> $ opinion_alter2: num 2.34 2.76 1.73 4.22 3.21 ...
#> $ opinion_alter3: num 3.18 5.96 2.45 4.45 4.84 ...
#> $ opinion_alter4: num 0.0636 5.7889 2.7771 0.2801 0.3777 ...
#> $ opinion_alter5: num 0.795 7.635 0.991 3.081 4.143 ...
#> egonet_id opinion_alter1 opinion_alter2 opinion_alter3 opinion_alter4
#> Min. : 1.0 Min. :-1.903 Min. :-6.030 Min. :-1.951 Min. :-4.307
#> 1st Qu.: 250.8 1st Qu.: 2.744 1st Qu.: 2.508 1st Qu.: 2.958 1st Qu.: 2.205
#> Median : 500.5 Median : 4.035 Median : 4.169 Median : 4.118 Median : 4.038
#> Mean : 500.5 Mean : 4.052 Mean : 4.125 Mean : 4.121 Mean : 3.948
#> 3rd Qu.: 750.2 3rd Qu.: 5.435 3rd Qu.: 5.825 3rd Qu.: 5.286 3rd Qu.: 5.753
#> Max. :1000.0 Max. :10.038 Max. :12.676 Max. :11.346 Max. :11.918
#> opinion_alter5
#> Min. :-3.012
#> 1st Qu.: 2.540
#> Median : 4.128
#> Mean : 4.153
#> 3rd Qu.: 5.762
#> Max. :12.227
#> NULL
#> vars n mean sd median trimmed mad min max range skew kurtosis
#> egonet_id 1 1000 500.50 288.82 500.50 500.50 370.65 1.00 1000.00 999.00 0.00 -1.20
#> opinion_alter1 2 1000 4.05 1.99 4.04 4.06 1.99 -1.90 10.04 11.94 0.01 -0.21
#> opinion_alter2 3 1000 4.12 2.45 4.17 4.12 2.46 -6.03 12.68 18.71 -0.04 0.28
#> opinion_alter3 4 1000 4.12 1.74 4.12 4.13 1.73 -1.95 11.35 13.30 -0.03 0.35
#> opinion_alter4 5 1000 3.95 2.63 4.04 3.97 2.63 -4.31 11.92 16.22 -0.09 -0.11
#> opinion_alter5 6 1000 4.15 2.49 4.13 4.14 2.38 -3.01 12.23 15.24 0.04 -0.11
#> se
#> egonet_id 9.13
#> opinion_alter1 0.06
#> opinion_alter2 0.08
#> opinion_alter3 0.06
#> opinion_alter4 0.08
#> opinion_alter5 0.08
Reshape into long format.
require("tidyverse")
data_long <- tidyr::pivot_longer(data = data, cols = everything()[-1], names_to = "alter", values_to = "opinion")
head(data_long)
#> # A tibble: 6 × 3
#> egonet_id alter opinion
#> <int> <chr> <dbl>
#> 1 1 opinion_alter1 -0.00281
#> 2 1 opinion_alter2 2.34
#> 3 1 opinion_alter3 3.18
#> 4 1 opinion_alter4 0.0636
#> 5 1 opinion_alter5 0.795
#> 6 2 opinion_alter1 6.42
B.2.2.2 ICC via ML
require(nlme)
# estimate empty model with ML
mlme <- lme(opinion ~ 1, data = data_long, random = list(~1 | egonet_id))
summary(mlme)
# Standard deviations are reported instead of variances; extract the variances.
VarCorr(mlme)
# The intercept variance is the between-level variance; the residual variance is the
# observation / within-level variance. Based on these numbers we can calculate the ICC ourselves.
varests <- as.numeric(VarCorr(mlme)[1:2])
varests
ICC_MLb <- varests[1]/sum(varests)
ICC_MLb
#> Linear mixed-effects model fit by REML
#> Data: data_long
#> AIC BIC logLik
#> 21909.51 21929.06 -10951.75
#>
#> Random effects:
#> Formula: ~1 | egonet_id
#> (Intercept) Residual
#> StdDev: 1.206648 1.941675
#>
#> Fixed effects: opinion ~ 1
#> Value Std.Error DF t-value p-value
#> (Intercept) 4.079877 0.04701084 4000 86.78588 0
#>
#> Standardized Within-Group Residuals:
#> Min Q1 Med Q3 Max
#> -3.76909452 -0.60344130 0.01233244 0.59998904 4.12122284
#>
#> Number of Observations: 5000
#> Number of Groups: 1000
#> egonet_id = pdLogChol(1)
#> Variance StdDev
#> (Intercept) 1.455998 1.206648
#> Residual 3.770102 1.941675
#> [1] 1.455998 3.770102
#> [1] 0.2786013
B.2.2.3 ICC via SEM/lavaan
require("lavaan")
model <- "
level: 1
opinion ~ 1 #regression model
opinion ~~ opinion #variance
level: 2
opinion ~ 1
opinion ~~ opinion
"
fit <- lavaan(model = model, data = data_long, cluster = "egonet_id")
summary(fit)
#> lavaan 0.6.17 ended normally after 11 iterations
#>
#> Estimator ML
#> Optimization method NLMINB
#> Number of model parameters 3
#>
#> Number of observations 5000
#> Number of clusters [egonet_id] 1000
#>
#> Model Test User Model:
#>
#> Test statistic 0.000
#> Degrees of freedom 0
#>
#> Parameter Estimates:
#>
#> Standard errors Standard
#> Information Observed
#> Observed information based on Hessian
#>
#>
#> Level 1 [within]:
#>
#> Intercepts:
#> Estimate Std.Err z-value P(>|z|)
#> opinion 0.000
#>
#> Variances:
#> Estimate Std.Err z-value P(>|z|)
#> opinion 3.770 0.084 44.721 0.000
#>
#>
#> Level 2 [egonet_id]:
#>
#> Intercepts:
#> Estimate Std.Err z-value P(>|z|)
#> opinion 4.080 0.047 86.829 0.000
#>
#> Variances:
#> Estimate Std.Err z-value P(>|z|)
#> opinion 1.454 0.100 14.514 0.000
#> opinion
#> 0.278