Analysis of global and local spatial autocorrelation based on the empirical Bayes index modification of Moran’s I
load("data/kreiseWithCovidMeldedatumWeekly.Rdata")
load("data/nationalAggCases.Rdata")
Preparation
Get the field names for the dates for which we have data; these will be used in the loops. Since all fields have the same prefix we can easily get their indices with a regular expression (with grep). These indices are then used to subset the field names vector of the data frame. For plot labeling we drop the prefix and store the names in a different vector of the same size.
idxCaseFieldPosition <- grep(pattern = "casesWeek_", x = names(kreiseWithCovidMeldeWeeklyCleaned))
namesCaseFieldPosition <- names(kreiseWithCovidMeldeWeeklyCleaned)[idxCaseFieldPosition]
# strip prefix
dateStrVec <- gsub(pattern = "casesWeek_", replacement = "", x = namesCaseFieldPosition)
What are neighbors?
tm_shape(kreiseWithCovidMeldeWeeklyCleaned) + tm_borders()
Defining neighborhoods and the spatial weight matrix W
Spatial autocorrelation analysis requires a neighborhood definition. A neighborhood (or contiguity) matrix \(C\) indicates whether pairs of spatial features are considered neighbors or not. A spatial weight matrix \(W\) is a weighted form of such a neighborhood matrix. \(W\) represents the possible spatial interactions for the selected neighborhood and weighting approach.
In contrast to temporal autocorrelation, where neighbors are defined based on the lag between observations, the situation is more complex for spatial data. A couple of basic approaches are available in {spdep} to define neighbors:
- polygons that share an edge or a node
- points (centroids in our case) in a specific distance band
- k-nearest neighbors (distance based on centroids for polygons)
- geometric approaches such as the Delaunay triangulation, the Gabriel graph or the sphere of influence
The different approaches can also be combined by set operations such as union or difference.
It is also possible to include higher order neighbors, i.e. neighbors of neighbors.
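A minimal sketch of how higher order neighbors could be derived with {spdep}, using the contiguity neighbor list polynb that is created further below: nblag() returns one nb object per lag order, and nblag_cumul() merges them into a single cumulative neighbor list.
# sketch: higher order neighbors (uses polynb defined later in this document)
polynbLags <- nblag(polynb, maxlag = 2)    # list with 1st and 2nd order neighbors
polynbUpTo2 <- nblag_cumul(polynbLags)     # neighbors up to 2nd order in one nb object
table(card(polynbUpTo2))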
In our case we could also use more advanced approaches, e.g. based on driving time between the population centers of the districts or based on mobility data derived from mobile phone data. We will stick to the basic approaches here due to time constraints.
To assess the usefulness of a neighborhood for the question at hand one might consider the following aspects in addition to a visual inspection:
- are there islands (observations without neighbors)? Is it useful to consider these as real islands or are there connections in the real world? An island might be connected by ferry, bridge or airplane to other spatial units, so it might be good to incorporate that link manually. Islands cause problems for the upcoming analysis steps and their interpretation, so it is good to ensure that they are really required.
- is the neighborhood definition symmetric? k-nearest neighborhood definitions frequently lead to asymmetric neighborhood definitions that are problematic for analysis and interpretation. It is possible to make asymmetric neighborhood definitions symmetric by a union of the upper and lower triangular part of the adjacency matrix.
It is possible to adjust the weight of neighbors based on additional information. Distance is frequently used for this purpose by defining an inverse distance weighting ( \(w_{i,j} = \frac{1}{dist_{i,j}^p}\) ), with p as a tuning parameter that controls how strongly weights drop with distance.
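As a sketch, the tuning parameter p could be applied to a list of neighbor distances as returned by nbdists() (the helper name idwWeights is made up for illustration; the code further below uses p = 1):
# hypothetical helper: inverse-distance weights from a list of neighbor distances
idwWeights <- function(distList, p = 1) {
    lapply(distList, function(d) 1/d^p)    # larger p: weights drop faster with distance
}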
In an additional step we need to create the weight matrix \(W\) from the (potentially distance weighted) neighborhood matrix \(C\). In R \(W\) is represented as a weighted list. The following options exist:
- B is the basic binary coding
- W is row standardised (sums over all links to n)
- C is globally standardised (sums over all links to n)
- U is equal to C divided by the number of neighbours (sums over all links to unity)
- S is the variance-stabilizing coding scheme proposed by Tiefelsdorf et al. 1999, p. 167-168 (sums over all links to n).
For most situations row standardized coding is recommended. It ensures that the weights of all neighbors sum to unity, which is a useful property. The weighted sum of a lagged attribute (predictor or response) is thereby simply the weighted mean. If binary coding were used this would not be the case and observations with more neighbors would get more weight than observations with fewer neighbors.
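A minimal sketch illustrating the difference on a toy 3 x 3 grid (not part of the COVID data, {spdep} assumed loaded): with row-standardized weights the spatial lag of x is the mean of the neighbors' values, with binary weights it is their sum.
nbToy <- cell2nb(3, 3)                            # rook contiguity on a 3 x 3 grid
xToy <- 1:9
lag.listw(nb2listw(nbToy, style = "W"), xToy)     # neighborhood means
lag.listw(nb2listw(nbToy, style = "B"), xToy)     # neighborhood sums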
If zero policy is set to TRUE, weights vectors of zero length are inserted for regions without neighbors in the neighbors list. These will in turn generate lag values of zero. The spatially lagged value of x for a zero-neighbor region will then be zero, which may (or may not) be a sensible choice.
Distance based neighborhood definition
centroidsKreise <- st_centroid(kreiseWithCovidMeldeWeeklyCleaned, of_largest_polygon = TRUE)
distnb <- dnearneigh(centroidsKreise, d1 = 1, d2 = 70 * 10^3)
table(card(distnb))
2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27
2 3 4 5 5 9 16 14 13 18 15 29 27 13 17 20 27 16 20 14 14 12 12 7 10 4
28 29 30 31 32 33 34 35 36 37 38
4 11 2 9 6 6 6 6 1 2 1
plot(distnb, coords = st_coordinates(centroidsKreise), col = "red", cex = 0.7)
Contiguity based neighborhood definition
polynb <- poly2nb(kreiseWithCovidMeldeWeeklyCleaned)
table(card(polynb))
1 2 3 4 5 6 7 8 9 10 11 12
27 26 44 47 60 74 68 30 15 6 2 1
plot(polynb, coords = st_coordinates(centroidsKreise), col = "red", cex = 0.7)
10-nearest neighbors
As an alternative we could use the 10 nearest neighbors.
k10nb <- knn2nb(knearneigh(st_coordinates(centroidsKreise), k = 10))
table(card(k10nb))
10
400
plot(k10nb, coords = st_coordinates(centroidsKreise), col = "red", cex = 0.7)
The neighbor list is not symmetric.
is.symmetric.nb(k10nb)
[1] FALSE
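If we wanted to work with the k-nearest neighbor definition anyway, one way to enforce symmetry is make.sym.nb(), which adds the missing reverse links (a sketch, not used in the analysis below):
k10nbSym <- make.sym.nb(k10nb)    # add link j -> i wherever i -> j exists
is.symmetric.nb(k10nbSym)
table(card(k10nbSym))             # some districts now have more than 10 neighbors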
Graph based neighborhoods
- Minimum spanning tree: connects all nodes while minimizing the total edge length
- Relative neighborhood graph: two nodes are connected if the lens formed by the circles around them (with radius equal to their distance) contains no other node
- Gabriel graph: two nodes are connected if there is no other node inside the circle whose diameter is the segment between them
- Delaunay triangulation: nodes are connected if they form a triangle whose circumcircle contains no other node
The Gabriel graph is a subgraph of the Delaunay triangulation and has the relative neighborhood graph as a subgraph.
require(tripack) # for triangulation, you might need to install this package
Graph-based neighborhoods can be defined as follows.
# Delaunay triangulation
delauny_nb <- tri2nb(st_coordinates(centroidsKreise))
# sphere of influence
soi_nb <- graph2nb(soi.graph(delauny_nb, st_coordinates(centroidsKreise)))
# Gabriel graph
gabriel_nb <- graph2nb(gabrielneigh(st_coordinates(centroidsKreise)))
# relative neighborhood graph
rg_nb <- graph2nb(relativeneigh(st_coordinates(centroidsKreise)))
table(card(delauny_nb))
3 4 5 6 7 8 9
2 29 120 131 95 20 3
table(card(soi_nb))
1 2 3 4 5 6 7 8
19 42 63 75 96 72 28 5
table(card(gabriel_nb))
0 1 2 3 4 5 6 7
62 79 97 74 63 16 7 2
table(card(rg_nb))
0 1 2 3 4
101 125 102 65 7
plot(delauny_nb, coords = st_coordinates(centroidsKreise), col = "red", cex = 0.7)
mtext("Delauny Triangulation")
plot(soi_nb, coords = st_coordinates(centroidsKreise), col = "red", cex = 0.7)
mtext("Sphere of influence")
plot(gabriel_nb, coords = st_coordinates(centroidsKreise), col = "red", cex = 0.7)
mtext("Gabriel graph")
plot(rg_nb, coords = st_coordinates(centroidsKreise), col = "red", cex = 0.7)
mtext("Relative graph")
Combination of contiguity and distance based neighborhood definitions
For the following analysis we will be using a combination of the contiguity and the distance based neighborhood definition.
unionedNb <- union.nb(distnb, polynb)
table(card(unionedNb))
3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28
4 4 5 4 9 17 15 13 18 15 29 27 13 17 20 27 16 20 14 14 12 12 7 10 4 4
29 30 31 32 33 34 35 36 37 38
11 2 9 6 6 6 6 1 2 1
plot(unionedNb, coords = st_coordinates(centroidsKreise), col = "red", cex = 0.7)
We account for the different distances between the centroids by an inverse distance weighting. Note that we row-standardize the weights afterwards.
dlist <- nbdists(unionedNb, st_coordinates(centroidsKreise))
dlist <- lapply(dlist, function(x) 1/x)
unionedListw_d <- nb2listw(unionedNb, glist = dlist, style = "W")
save(unionedListw_d, file = "data/unionedListw_d_berlinNotSubdivided.Rdata")
hist(unlist(unionedListw_d$weights), las = 1, xlab = "weights", main = "unioned contiguity and 50km distance nb, idw, W")
Global Moran’s I
Let’s now calculate the global Moran’s I as a measure of global spatial autocorrelation.
\[I = \frac{n}{S_0} \frac{\sum_i \sum_j w_{ij} (x_i - \bar{x}) (x_j - \bar{x})}{\sum_i (x_i - \bar{x})^2}\]
With \(i\) and \(j\) indices of the districts, \(S_0\) the global sum of the weights in weight matrix \(W\) and \(w_{ij}\) elements of \(W\).
\[ S_0 = \sum_i \sum_j w_{ij}\]
The range of Moran’s I depends on the largest and smallest eigenvalue of the weight matrix \(W\). Often the interval is roughly -0.5 to 1.15.
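As a sketch, the attainable range for a given weight matrix can be approximated from the eigenvalues of the symmetrized weight matrix (the helper moranRange is illustrative, not a {spdep} function; for row-standardized weights \(n/S_0 = 1\)):
# illustrative helper: approximate range of Moran's I for a listw object
moranRange <- function(lw) {
    wMat <- listw2mat(lw)                                 # weights list -> matrix
    range(eigen((wMat + t(wMat))/2, only.values = TRUE)$values)
}
moranRange(unionedListw_d)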
Since we have count data and a varying population at risk in each district, we use the empirical Bayes index modification of Moran’s I. I loop through all weeks (stored in separate fields, so I loop over the field names), calculate the empirical Bayes index modification of Moran’s I and store the results in a list. Afterwards I extract the Moran’s I value and the p-value and store them together with the date in a data.frame that is then used for plotting.
The statistic used in the empirical Bayes index modification of Moran’s I is:
\[EBI = \frac{m}{S_0} \frac{\sum_i \sum_j w_{ij} z_i z_j}{\sum_i (z_i - \bar{z})^2}\]
- m is the number of observations
- n the number of cases (observed events)
- x the population
- \(S_0 = \sum_i \sum_j w_{ij}\) the sum of the weights
- \(z_i = (p_i - b) / \sqrt{v_i}\) - the deviation of the estimated marginal mean standardized by an estimate of its standard deviation
- \(p_i = n_i / x_i\) - the estimated rate
- \(v_i = a + (b / x_i)\) - the marginal variance of \(p_i\)
- \(a = s^2 - b / (\sum_i x_i / m)\)
- \(s^2 = \sum_i x_i (p_i - b)^2 / \sum_i x_i\)
- b is the marginal expectation of \(p_i\)
The permutation test is based on a permutation of the vector \((z_1, z_2, ..., z_n)\). For each permuted map the EBI is calculated and stored. Finally, the observed EBI is compared against the vector of EBIs from the permutations to derive the p-value.
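A minimal sketch of the permutation logic, assuming ebiObs holds the observed EBI and ebiPerm the EBI values of the permuted maps (both objects are illustrative; EBImoran.mc does all of this internally):
nsim <- 999
# one-sided Monte-Carlo p-value: rank of the observed value among observed + permuted values
pValue <- (sum(ebiPerm >= ebiObs) + 1)/(nsim + 1)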
n <- length(dateStrVec)
mcEBayWeekVec <- vector("list", n)

for (i in 1:n) {
    casesName <- paste0("casesWeek_", dateStrVec[i])
    ebiMc <- EBImoran.mc(n = st_drop_geometry(kreiseWithCovidMeldeWeeklyCleaned) %>%
        pull(casesName), x = kreiseWithCovidMeldeWeeklyCleaned$EWZ, listw = unionedListw_d,
        nsim = 999)
    mcEBayWeekVec[[i]] <- ebiMc
}
mcVals <- sapply(mcEBayWeekVec, FUN = function(x) x$statistic)
mcpVals <- sapply(mcEBayWeekVec, FUN = function(x) x$p.value)

mcEBayesDf <- data.frame(MC = mcVals, p_value = mcpVals, date = lubridate::ymd(dateStrVec))
ggplot(mcEBayesDf, aes(x = MC, y = p_value)) + geom_point(alpha = 0.5) + geom_hline(yintercept = 0.05,
    lty = 3)
p1 <- ggplot(mcEBayesDf, aes(y = MC, x = date)) + geom_line()
Are the peaks in global spatial autocorrelation at the peaks of the pandemic waves?
p2 <- ggplot(nationalAggCases, aes(x = week, y = incidence)) + geom_line()
ggarrange(p1, p2, ncol = 1, nrow = 2)
The last week might not be complete. We see that the peaks in the incidence rate co-occur with high spatial autocorrelation. However, we also see that in the summer of 2020 we had a strong increase in spatial autocorrelation while the incidence rate was relatively low.
Local clusters
Local Moran’s I (also called LISA, local indicator of spatial association) calculates Moran’s I for each observation and its neighborhood, checks its significance and compares the observation’s value with the values in its neighborhood.
\[ I_i = \frac{x_i-\bar{x}}{\sum_{k=1}^{n}(x_k-\bar{x})^2/(n-1)} \sum_{j=1}^{n}w_{ij}(x_j-\bar{x}) \]
With \(i\) the observation (district) index, \(w_{ij}\) the elements of the spatial weight matrix \(W\), \(x\) the value of interest (here the empirical Bayes index of the incidence rate) and \(n\) the number of observations. \(\bar{x}\) is the global average.
For the interpretation, local Moran’s I is categorized (if significant) by comparing the value and its neighborhood to the global mean. The following categories are used:
- High-high: value higher than the global mean, surrounded by similar (high) values, i.e. a cluster of high values
- Low-low: value lower than the global mean, surrounded by similar (low) values, i.e. a cluster of low values
- High-low: high value (compared to the mean) in a neighborhood of low values (negative spatial autocorrelation), i.e. a high local outlier
- Low-high: low value (compared to the mean) in a neighborhood of high values (negative spatial autocorrelation), i.e. a low local outlier
We will reuse the empirical Bayes index calculated above. We can do this by extracting the values from the list.
empBayIndexWeek <- sapply(mcEBayWeekVec, FUN = function(x) x$z)
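The four categories above correspond to the quadrants of the Moran scatterplot (value vs. spatially lagged value). As a quick sketch for a single week (week index 8 chosen arbitrarily):
moran.plot(empBayIndexWeek[, 8], listw = unionedListw_d,
    xlab = "EB index", ylab = "spatially lagged EB index")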
The following function is used to calculate local Moran’s I and to convert the result into a factor that encodes significance and category.
getLocalMoranFactor <- function(x, listw, pval, quadr = "mean", p.adjust.method = "holm") {
    if (!(quadr %in% c("mean", "median", "pysal")))
        stop("getLocalMoranFactor: quadr needs to be one of the following values: 'mean', 'median', 'pysal'")
    # local Moran's I with p-value adjustment for multiple testing
    lMc <- localmoran(x, listw = listw, p.adjust.method = p.adjust.method)
    # quadrant classification (High-High, Low-Low, ...) stored as an attribute
    lMcQuadr <- attr(lMc, "quadr")
    lMcFac <- as.character(lMcQuadr[, quadr])
    # mark districts with non-significant local Moran's I
    idx <- which(lMc[, 5] > pval)
    lMcFac[idx] <- "Not sign."
    lMcFac <- factor(lMcFac, levels = c("Not sign.", "Low-Low", "Low-High", "High-Low",
        "High-High"))
    return(lMcFac)
}
Next we loop over the matrix that stores the empirical Bayes index values for each week and calculate local Moran’s I. For each week and district we get one value that indicates whether local Moran’s I is significant and, if so, to which category it belongs. As we added the district identifier we can join the results to the sf object and plot the maps.
lMcFac <- data.frame(RS = kreiseWithCovidMeldeWeeklyCleaned$RS)
for (i in 1:n) {
    fieldName <- paste0("lMcFacWeek_", dateStrVec[i])
    res <- getLocalMoranFactor(empBayIndexWeek[, i], listw = unionedListw_d, pval = 0.05)
    res <- as.data.frame(res)
    names(res) <- fieldName
    lMcFac <- cbind(lMcFac, res)
}
kreiseWithCovidMeldeWeeklyCleanedLMc <- left_join(kreiseWithCovidMeldeWeeklyCleaned,
    lMcFac, by = c("RS"))
Plot maps with incidence rate and local Moran’s I
<- c("white", "midnightblue", "lightblue", "lightpink", "red") localMcPalette
Next we loop over all weeks and create tmap graphics that can then be used for plotting or to create an animation.
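The code that builds mapList4animation is not shown here. A minimal sketch of how such a list could be created (layout and titles are assumptions, and the original maps also showed the incidence rate alongside local Moran’s I):
# sketch only: one tmap per week showing the local Moran's I category
mapList4animation <- vector("list", n)
for (i in 1:n) {
    fieldName <- paste0("lMcFacWeek_", dateStrVec[i])
    mapList4animation[[i]] <- tm_shape(kreiseWithCovidMeldeWeeklyCleanedLMc) +
        tm_polygons(col = fieldName, palette = localMcPalette, title = "Local Moran's I") +
        tm_layout(main.title = dateStrVec[i])
}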
ani.options("ffmpeg")
saveVideo({
ani.options(interval = 0.5, verbose = FALSE)
for (aMap in mapList4animation) {
print(aMap)
ani.pause()
}video.name = "covid19incWeekWithLMc.mp4", single.opts = "utf8: false", autoplay = FALSE,
}, ani.height = 1080, ani.width = 1920, title = "Temporal development of covid-19 incidence in Germany",
description = "Temporal development of covid-19 incidence in Germany, based on the Meldedatum in a similar way as the RKI maps. Based on a weekly aggregation. Together with local Moran's I.")
Let’s look at a few examples
ggplot(nationalAggCases, aes(x = week, y = incidence)) + geom_line() + geom_vline(xintercept = lubridate::ymd(dateStrVec[8]),
lty = 2, col = "red")
mapList4animation[[8]]
ggplot(nationalAggCases, aes(x = week, y = incidence)) + geom_line() + geom_vline(xintercept = lubridate::ymd(dateStrVec[12]),
lty = 2, col = "red")
mapList4animation[[12]]
ggplot(nationalAggCases, aes(x = week, y = incidence)) + geom_line() + geom_vline(xintercept = lubridate::ymd(dateStrVec[33]),
lty = 2, col = "red")
mapList4animation[[33]]
ggplot(nationalAggCases, aes(x = week, y = incidence)) + geom_line() + geom_vline(xintercept = lubridate::ymd(dateStrVec[50]),
lty = 2, col = "red")
mapList4animation[[50]]
ggplot(nationalAggCases, aes(x = week, y = incidence)) + geom_line() + geom_vline(xintercept = lubridate::ymd(dateStrVec[58]),
lty = 2, col = "red")
mapList4animation[[58]]
ggplot(nationalAggCases, aes(x = week, y = incidence)) + geom_line() + geom_vline(xintercept = lubridate::ymd(dateStrVec[93]),
lty = 2, col = "red")
mapList4animation[[93]]
All contents are licensed under GNU General Public License v3.0