Data Analysis with QGIS#
🚧 This training platform and its entire content are under ⚠️ construction ⚠️ and may not be shared or published! 🚧
Competences:
This module covers a general understanding of data analysis: how to create statistics, buffers, and heatmaps, and how to break rivers, roads, or areas into segments. The following modules will cover more complex analysis methods.
Data & Spatial Analysis#
Even in a single layer, a lot of analysis is possible. However, sometimes the things we want to analyse are split across multiple layers. In order to get these insights we use the spatial and non-spatial GIS-processing tools we learned in the previous modules. In this module, we will look at how to apply these tools and collect and work with data to create meaningful insights. We will go over a few examples of data analysis that are common in humanitarian work.
Spatial analysis#
A spatial analysis can be a result of combining several layers with different information in a single map.
Geographic analysis helps us answer questions like:
What patterns are in the data?
How can we summarise any trends?
What's nearby?
Which areas are affected?
What's inside or outside a boundary?
How do phenomena change with location?
How do locations change over time?
Before doing any sort of processing, you need to familiarise yourself with the data and understand it.
The first step is to read the metadata from the source and understand what data was collected, who collected the data, and how the data was collected.
Next, open the attribute table and look at the different features and attributes available. What do the attributes show and what are they called?
Now you can start visualising the data:
You can visualize the data cartographically by assigning or categorizing the data using symbols
You can create charts from the attribute table
You can look for patterns, averages, outliers
We are usually looking for ways to describe our data to an audience. Sometimes spatial analysis will be used to provide recommendations for activities. Considering the amount of data available online, it is always important to take a step back and assess the data itself before rushing in to manipulate it:
Reliability: Can I trust this data?
Interest: Do I need this data?
Usage: Am I able to use this data?
Comprehensiveness: Is this data complete?
Date: How old is this data?
Sensitivity: Is this data sensitive?
With spatial analysis, you can build predictive models to plan ahead of disasters. BUT: Not all analysis is complex! Just knowing how many features are in a layer is useful. Simple analysis includes:
Ranking
Categorizing
Above/below threshold
Affected Areas
Population distribution
It is important to know the limitations of the data at your disposal - don't try to use unsuitable data for analysis (e.g. if you know a survey sample is not representative).
Attention
Spatial Representation and Analysis: there are some spatial analysis problems that are difficult to avoid completely. For example the Modifiable Areal Unit Problem, where the results look different depending on the unit of analysis.
There are two main types of data analysis:
Thematic analyses focus on visual variation according to a given attribute of the data (one of its characteristics). They are performed on a specific field of the attribute table for the layer, whether textual or numerical. The graphical representation (symbology) changes according to the attribute.
For instance: variations in size depending on population numbers in a refugee camp area.
Spatial analyses are performed on spatialized phenomena such as: presence/absence of the phenomenon, its relationship with other phenomena or entities, distribution in space. They are performed on the geometry and position of elements, as well as on their relationship with other elements. Spatial analyses can create new values or elements.
For example: crossing two satellite images to extract flooded areas between two dates; or crossing latrine and water catchment areas in a refugee camp; using a digital elevation model to determine which buildings have a high flooding risk.
Length, Surface, Circumference#
Knowing how big an area is, or how long a road section is, is already an important analysis. For example, you can know how much of a road network is inaccessible, or how much area is affected by flooding.
These geometrical attributes can be calculated using the field calculator or the processing tool "Add geometry attributes".
The field calculator has the following functions to calculate geometry attributes as new fields in the attribute table:
| Function | Description |
|---|---|
| `$area` | Returns the area of the current feature. The area calculated by this function respects both the current project's ellipsoid setting and area unit settings. |
| `$length` | Returns the length of a linestring. If you need the length of the border of a polygon, use `$perimeter` instead. The length calculated by this function respects both the current project's ellipsoid setting and distance unit settings. |
| `$perimeter` | Returns the perimeter length of the current feature. The perimeter calculated by this function respects both the current project's ellipsoid setting and distance unit settings. |
For example, to calculate the area of polygons:

1. Open the attribute table.
2. Open the field calculator.
3. Check the box `Create a new field`.
4. Enter an `Output field name`: "Area".
5. Select the `Output field type`: in this case we want a number with decimals, so we select "decimal number".
6. Enter `$area` into the expression window.
7. Select `OK`.

In the attribute table, you will find a new column called `Area` with the respective area for each feature.
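Under the hood, the planar part of such an area calculation boils down to the shoelace formula. The sketch below is a simplified, purely planar illustration, not the QGIS implementation: `$area` additionally respects the project's ellipsoid and area-unit settings. The rectangle coordinates are invented for the example.

```python
# Planar polygon area via the shoelace formula -- a simplified, planar
# version of what $area computes for a projected CRS in metres.

def polygon_area(vertices):
    """Return the unsigned area of a simple polygon given as (x, y) tuples."""
    n = len(vertices)
    twice_area = 0.0
    for i in range(n):
        x1, y1 = vertices[i]
        x2, y2 = vertices[(i + 1) % n]  # wrap around to close the ring
        twice_area += x1 * y2 - x2 * y1
    return abs(twice_area) / 2.0

# A 100 m x 50 m rectangle (coordinates in metres, projected CRS):
print(polygon_area([(0, 0), (100, 0), (100, 50), (0, 50)]))  # 5000.0
```

This is also why the units of your CRS matter: the formula simply multiplies coordinate values, so degrees in, nonsense out.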
Note
The unit of measurement of the calculated area depends on the unit settings of the current project's CRS (metric or geographic). In most cases you want metres or kilometres. Make sure the units of your CRS are metres to get the correct values.
You can check this by opening the CRS selector (bottom right corner) and reading the information of your selected CRS.
Example: Calculating the length of roads
Basic statistics#
In the field calculator, we can calculate the length, area, or perimeter of each feature in a dataset. However, we might also want aggregate statistics on a dataset (average length/area, total length/area).
QGIS comes with two basic processing tools to generate statistics:
| Processing tool | Description |
|---|---|
| "Basic statistics for fields" | This algorithm generates basic statistics (count, sum, mean, median, standard deviation, quartiles, …) from the values in a field in the attribute table of a vector layer. Numeric, date, time and string fields are supported. The statistics returned will depend on the field type. Statistics are generated as an HTML file. |
| "Statistics by categories" | This algorithm calculates statistics of fields depending on a parent class. |
Example: Statistics by categories
In this example we have a road network which has been intersected with a flood extent. A new field ("Flood") has been calculated containing information on whether the road is flooded or not (Y = flooded, N = not flooded). The length of each road has been calculated using the `$length` function in the field calculator, as a new column called "Length".
We want to calculate the total length of flooded and unflooded road respectively.
1. Open the "Statistics by categories" tool.
2. Select the road network layer.
3. Under `Field to calculate statistics on`, select `Length`.
4. Under `Field(s) with categories`, select `Flood`.
5. Specify a location to save the statistics file.
6. Click `Run`.
After completion, a new layer will appear in your layer tab. This will not contain spatial attributes and is a simple attribute table with the statistics. In our case, we will have the basic statistics (min, max, range, sum, median, sd, etc.) for all the road features with the flooding values "Y" and "N".
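The sum statistic produced above can be sketched in a few lines of plain Python: group the rows by their category field and accumulate the value field. The field names "Flood" and "Length" follow the example; the length values are invented for illustration.

```python
# Total road length per flood status -- what "Statistics by categories"
# computes for the "sum" statistic. Values are made up for illustration.

roads = [
    {"Flood": "Y", "Length": 120.0},
    {"Flood": "N", "Length": 340.5},
    {"Flood": "Y", "Length": 80.0},
    {"Flood": "N", "Length": 59.5},
]

totals = {}
for road in roads:
    # accumulate Length under the road's Flood category
    totals[road["Flood"]] = totals.get(road["Flood"], 0.0) + road["Length"]

print(totals)  # {'Y': 200.0, 'N': 400.0}
```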
Tip
You can add a table of the statistics to your print layout by using the "Add attribute table" tool in the print layout composer.
Insert statistics examples
Buffer analysis#
Creating a buffer is a helpful analysis to determine what lies in proximity of, for example, a contaminated water source or another hazard, and to assess vulnerability. Buffer analysis is often used to map the riparian zones along rivers, for instance to devise environmental protection zones.
Proximity analysis
Estimated vulnerability analysis
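At its core, selecting features inside a circular buffer around a point hazard is a distance test. The sketch below illustrates this for a single hazard point, assuming coordinates in a projected CRS with metre units; the well and household coordinates are invented.

```python
import math

# A point lies inside a buffer of radius r around a hazard
# if its distance to the hazard is at most r.

def within_buffer(point, hazard, radius_m):
    dx = point[0] - hazard[0]
    dy = point[1] - hazard[1]
    return math.hypot(dx, dy) <= radius_m  # Euclidean distance test

contaminated_well = (1000.0, 1000.0)
households = [(1050.0, 1000.0), (1000.0, 1400.0), (1200.0, 1150.0)]

# Households within a 250 m buffer of the contaminated well:
affected = [h for h in households if within_buffer(h, contaminated_well, 250.0)]
print(len(affected))  # 2
```

QGIS builds actual buffer polygons (which also work for lines and polygons, such as rivers), but for point hazards the result of a buffer-and-intersect is equivalent to this distance test.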
Density Map Analysis#
Density maps are very useful in communicating the intensity of a phenomenon in an area. Point data is spatially aggregated to show the number of incidents in that area. For example, the number of schools or the number of disease cases.
It is important to consider that most demographic or economic data needs to be normalized (e.g. by number of inhabitants). To assess the significance of the number of schools, you will need to know the population of the area: for example, the number of schools per 1,000 inhabitants, or the number of disease cases per 100 persons.
There are a few different types of density maps. The most common are heatmaps and hexagon grid maps. In both cases, the intensity of a phenomenon is calculated from point data (rarely from lines or polygons).
discrete vs. continuous?
Heatmaps#
Heat maps use the features in a dataset to calculate the relative density of points on a map. The density is displayed as a colour ramp, with colours ranging from "cool" (low density) to "hot" (high density). Heatmaps are useful when you have a large number of features covering an area, with places where these features cluster together; they help us visualize the spatial patterns of a layer.
To create a heatmap, you first need a layer containing data points or "samples". These points are distributed across an area, with some areas containing more points than others. The density of the points in space determines the intensity of the colour on the heat map.
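The density calculation itself can be sketched as a kernel density estimate: every sample point within the search radius contributes a weight that falls off with distance. The quartic kernel used below is one of the kernel shapes QGIS offers; the sample coordinates are invented for illustration.

```python
import math

# Kernel density at one location: sum the quartic-kernel weights of all
# sample points within the search radius.

def kernel_density(location, samples, radius):
    density = 0.0
    for sx, sy in samples:
        d = math.hypot(location[0] - sx, location[1] - sy)
        if d < radius:
            u = d / radius  # normalized distance, 0..1
            density += (15.0 / 16.0) * (1.0 - u * u) ** 2  # quartic kernel
    return density

samples = [(0, 0), (1, 0), (0.5, 0.5), (5, 5)]
# Density is higher in the cluster around the origin than at the lone point:
print(kernel_density((0, 0), samples, radius=2.0) >
      kernel_density((5, 5), samples, radius=2.0))  # True
```

Computing this value for every pixel of a raster grid yields the heatmap surface; the radius plays exactly the role described below for the symbology-based method.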
In QGIS, there are two methods to create heatmaps. The first method uses the symbology tab and is generally a lot faster. The second method uses the interpolation tool "Heatmap (Kernel Density Estimation)" and offers more parameters to adjust. The advantage of the processing tool is that you can set a radius in metric units (for example the number of points in a 100-metre radius, instead of millimetres or pixels of your computer screen) and set a variable radius determined by another attribute. The next section will discuss the creation of heatmaps using the symbology tab. A guide on how to create a heatmap using the processing tool can be found here
insert link
Using the symbology tab to create a heatmap#
You can create a heatmap in the symbology tab of a point or polyline layer. Navigate to the symbology tab and select the `Heatmap` symbolization method. Here, you can adjust the colour ramp, radius, and maximum value. The radius (in millimetres on your screen) determines the size of the circle that is used to aggregate the points. If it gets bigger, more points can be aggregated and the "heat" increases. The maximum value determines the value that is given the "hottest" colour. By default, it is set to the highest number of aggregated points; alternatively, you can set a threshold above which everything gets the "hottest" colour. Reducing it changes the visualization drastically.
As you can see, the information communicated through the different maps changes drastically. This is why you need to be transparent on what parameters you have set to create the heatmap.
Assigning a weight to the samples
Assigning weight to sample points can be useful when your dataset has additional information (such as the type of incident, or the sampled amount of rainfall) and you want to integrate this information into your heatmap.
Hex Maps (Hexagon Grids)#
Hexagon grids are used to aggregate point incidents in order to normalize geographic data or to mitigate the modifiable areal unit problem (problems arising from using irregularly shaped polygons). In GIS, we commonly use rectangles (e.g. raster data) or hexagons, as these geometries can be repeated in an evenly spaced grid without leaving gaps.
The advantage of using hexagons is that it is a polygon that closely resembles a circle (where the distance to the centre is equal at every point along the outline), but still leaves no gaps when placed as a grid. This means that it is also possible to use absolute values (no normalizations), since the spatial units have the same size.
Another advantage is that you can use the hexagon grid as spatial units and combine multiple variables (for example, number of incidents per population size) or calculate indexes.
Hexagon grids are especially useful for density maps, for example of the number of conflict events or water points in an area.
To create a hexagon grid map, you will first need to create a hexagon grid, by using the "Create Grid" vector tool.
Next, you will need to join the point data with the hexagon grid. We want to know the number of points that lie inside each hexagon cell. To count them, we use the vector tool "Count points in polygon". The result will be a hexagonal grid where each polygon has a value for the number of points in that area.
The final step will be to visualize the data by assigning a graduated symbology to the polygons. You can play around with the transparency of your layers to make more information visible.
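The counting step can be sketched with a standard ray-casting point-in-polygon test: a point is inside a cell if a horizontal ray from the point crosses the cell boundary an odd number of times. The sketch below uses a square cell for brevity (the test works for any simple polygon, hexagons included); all coordinates are invented.

```python
# Counting points inside one grid cell -- the core of the
# "Count points in polygon" tool, via a ray-casting test.

def point_in_polygon(point, polygon):
    x, y = point
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does this edge cross the horizontal ray to the right of the point?
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x_cross > x:
                inside = not inside  # odd number of crossings = inside
    return inside

cell = [(0, 0), (4, 0), (4, 4), (0, 4)]  # a square cell for simplicity
points = [(1, 1), (2, 3), (5, 5), (-1, 2)]
numpoints = sum(point_in_polygon(p, cell) for p in points)
print(numpoints)  # 2
```

Repeating this for every cell of the grid yields the "NUMPOINTS" field described in the steps below.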
Example video: Creating a hex map
1. Create a hexagon grid with the tool "Create Grid". Select `Hexagon (Polygon)` as the grid type. The grid extent should be set to the layer/area of interest. Select the horizontal and vertical spacing according to the scale of your map. Optional: remove the unnecessary polygons.
2. Use the tool "Count points in polygon" to add an attribute field with the number of points inside each hexagon cell. The `Polygons` input should be your hexagon grid. The new layer will have an attribute field called "NUMPOINTS".
3. Assign a graduated symbology to the `Count` layer. Select "NUMPOINTS" as the value and categorize the classes as you wish.
Tip
You can remove the hexagon cells that are not overlapping with the reference layer:
Select by location all the cells that intersect with your reference polygon/layer.
Invert the selection.
Delete the selected hexagon cells.
Save the changes you made to the layer.
Analysis by joining attributes#
Joining datasets is a common and useful way to get new insights by adding the information of one table to another, using key attributes to identify the features that are to be joined. For example: the population size of each district is in one table, the number of hospitals per district is in a second table, and you wish to combine the two tables to know how many hospitals per population size there are in each district.
- You have two separate data tables with information you wish to aggregate (join);
- Both tables share key identifiers (`CNTY_NAME` in this example);
- The key identifiers serve as the relationship between the two tables;
- The tables will be combined via the key identifiers;
- Joining tables will create a new table where the attribute values are added to the key identifiers.
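The hospitals-per-population example above can be sketched as a simple key-based join: build a lookup from the second table, then attach its values to the first and derive the rate. All table contents are invented for illustration; only the key field name `CNTY_NAME` comes from the example.

```python
# A minimal attribute join on a shared key field, plus a derived rate
# (hospitals per 100,000 inhabitants). All values are invented.

population = [
    {"CNTY_NAME": "District A", "POP": 200_000},
    {"CNTY_NAME": "District B", "POP": 50_000},
]
hospitals = [
    {"CNTY_NAME": "District A", "HOSPITALS": 4},
    {"CNTY_NAME": "District B", "HOSPITALS": 2},
]

# Index the second table by the key identifier:
lookup = {row["CNTY_NAME"]: row["HOSPITALS"] for row in hospitals}

joined = []
for row in population:
    count = lookup.get(row["CNTY_NAME"], 0)  # 0 if no match in the join table
    joined.append({**row, "HOSPITALS": count,
                   "PER_100K": count / row["POP"] * 100_000})

print(joined[0]["PER_100K"], joined[1]["PER_100K"])  # 2.0 4.0
```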
Pivoting tables#
Sometimes, the tables are in a format that is not suitable for joining. For example, you may have multiple zones per country, making the field `CNTRY_NAME` not suitable for aggregation. In this case, it is useful to pivot the table. This means that the fields for the zones and their respective area sizes are aggregated under the country: the values of the column `ZONE` are turned into columns, with the area values in these columns. Now you can aggregate this table with additional information that has data on countries.
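The pivot described above can be sketched as turning a long table (one row per country/zone pair) into a wide one (one row per country, one entry per zone). The field names `CNTRY_NAME` and `ZONE` follow the text; the `AREA` field and all values are invented for illustration.

```python
# Pivoting a long table into a wide table keyed by country.
# One row per (country, zone) pair goes in; one row per country comes out.

rows = [
    {"CNTRY_NAME": "Chad",  "ZONE": "Zone 1", "AREA": 120.0},
    {"CNTRY_NAME": "Chad",  "ZONE": "Zone 2", "AREA": 80.0},
    {"CNTRY_NAME": "Niger", "ZONE": "Zone 1", "AREA": 200.0},
]

pivoted = {}
for row in rows:
    country = pivoted.setdefault(row["CNTRY_NAME"], {})
    country[row["ZONE"]] = row["AREA"]  # zone becomes a column

print(pivoted["Chad"])  # {'Zone 1': 120.0, 'Zone 2': 80.0}
```

The pivoted table now has one row per country and can be joined with any other country-level table.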
Selecting appropriate locations according to a set of criteria#
Interpolation#
insert links
Spatial interpolation uses point data to estimate values at other, unknown points. This is extremely useful for spatial phenomena that are continuous, such as rainfall or temperature. For example, you have point data of the temperatures at weather stations, but you want to estimate the temperatures between these stations. The interpolated result is called a statistical surface. Interpolation can be used to calculate elevation data, precipitation, snow accumulation, water table and population density, for example.
Interpolating data can be highly useful, since extensive data collection is costly and rarely possible. Data collection for continuous phenomena is usually conducted at only a small number of locations. Interpolation models use these points to calculate a raster surface with estimated values for each raster cell.
There are many different interpolation methods, each suited to a different type of phenomenon or able to take different characteristics into account. In GIS, the most commonly used interpolation methods are Spline interpolation, Inverse Distance Weighted interpolation (IDW), and Kriging. In the following subchapters, we will take a look at these methods and discuss their strengths and shortcomings.
Note
Remember that there is no interpolation method that can be applied to every situation. Some methods are more useful for particular inquiries or certain types of data. The method of interpolation you use should always depend on the type of data, phenomenon, and research interest you have.
IDW-Interpolation (Inverse Distance Weighted)#
In the IDW interpolation method, the distance from a sample point to the point to be estimated dictates how much the sample point's value influences the value of the unknown point. A weighting coefficient dictates how quickly the influence of a sample point drops off as the distance increases: the further away a known sample point is, the less influence it has on the point to be estimated.
Keep in mind that IDW interpolation has a few disadvantages. For example, the quality of the calculated statistical surface decreases if the distribution of sample points is uneven. Additionally, the highest and lowest values of the interpolated surface can only occur at sample points, which is probably not the case in the real world. This often results in peaks or pits around the sample data points (see the IDW interpolation example) (adapted from the QGIS documentation).
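The weighting scheme described above can be sketched directly: each sample's weight is the inverse of its distance raised to a power `p` (the weighting coefficient), and the estimate is the weighted mean. The sample coordinates and values below are invented for illustration.

```python
import math

# Inverse Distance Weighted estimate at one unknown location.
# The larger p is, the faster a sample's influence drops with distance.

def idw(location, samples, p=2.0):
    """samples: list of ((x, y), value). Returns the IDW estimate."""
    num = 0.0
    den = 0.0
    for (sx, sy), value in samples:
        d = math.hypot(location[0] - sx, location[1] - sy)
        if d == 0.0:
            return value  # exactly on a sample point: return its value
        w = 1.0 / d ** p  # inverse-distance weight
        num += w * value
        den += w
    return num / den

samples = [((0.0, 0.0), 10.0), ((2.0, 0.0), 20.0)]
# The midpoint is equidistant from both samples, so the weights are equal:
print(idw((1.0, 0.0), samples))  # 15.0
```

Note how the estimate can never exceed the largest sample value: every output is a weighted mean of the inputs, which is exactly the peaks-and-pits limitation described above.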
Spline Interpolation#
Triangulated Irregular Network#
TIN interpolation is commonly called Delaunay triangulation. This interpolation method creates a triangular surface from each point's nearest neighbour points. To achieve this, circles are drawn around the known sample points, and their intersections are used as the corners of the triangles (see [figure]). TIN interpolation is usually used to compute digital elevation models (DEMs).
The problem with TIN statistical surfaces is that they are not smooth and may seem jagged, since they are based on triangles of varying sizes. Furthermore, triangulation is not suited to extrapolating data beyond the area where sample points have been collected (adapted from the QGIS documentation).
Kriging#
Kriging is a geostatistical method used to estimate values for spatial units where the phenomenon of interest has not been measured at every point. Kriging can also integrate covariates into the interpolation: for example, a sample point's weighting is then influenced not only by its distance to the measured temperatures; temperature is also influenced by the altitude of the sample point.
Outlook#
There are many analysis methods in GIS. However, setting up an analysis method can be quite time-consuming, and creating a new analysis method for every research question makes it hard to compare the results of different analyses. This is why model building and automation are used frequently when working with GIS data. A model can be seen as an analysis blueprint that only needs input data to perform a certain type of analysis. Since the parameters are the same, and similar datasets are needed for the model to work properly, the results can be compared. If you are interested in model building and automation, check out module 7.