Exercise: Data preparation (Importing PDF tables into QGIS)#

Characteristics of the exercise#

Instructions for the Trainers#

Attention

This exercise makes use of the tabula.technology tool, an open source application which let’s you easily extract tables from a PDF-file. Keep in mind that

Trainers Corner

Prepare the training

  • Take the time to familiarise yourself with the exercise and the provided material.

  • Prepare a white-board. It can be either a physical whiteboard, a flip-chart, or a digital whiteboard (e.g. Miro board) where the participants can add their findings and questions.

  • Before starting the exercise, make sure everybody has installed QGIS and has downloaded and unzipped the data folder.

  • Check out How to do trainings? for some general tips on training conduction

Conduct the training

Introduction:

  • Introduce the idea and aim of the exercise.

  • Provide the download link and make sure everybody has unzipped the folder before beginning the tasks.

Follow-along:

  • Show and explain each step yourself at least twice and slow enough so everybody can see what you are doing, and follow along in their own QGIS-project.

  • Make sure that everybody is following along and doing the steps themselves by periodically asking if anybody needs help or if everybody is still following.

  • Be open and patient to every question or problem that might come up. Your participants are essentially multitasking by paying attention to your instructions and orienting themselves in their own QGIS-project.

Wrap up:

  • Leave time for any issues or questions concerning the tasks at the end of the exercise.

  • Leave some time for open questions.

Exercise#

Available Data#

Download the datasets here and unzip them.

Dataset

Source

Description

Administrative boundaries for pakistan

HDX

The administrative boundaries (adm0-adm3) for pakistan can be accessed via the humanitarian data exchange. For this exercise, we are interested in the districts (adm2)

Percent distribution of households by material used for walls

Pakistan Bureau of Statistics

Table 7.7 in the Pakistan Social and living Standards Measurement Survey (2019-20) shows the materials used for walls per household in percents.

Task 1: Get the data from the PDF file into a CSV file#

  1. In your web-browser, go to Tabula.technology and download the application.

../../_images/en_tabula_website.png

The Tabula.technology website with the download links to the left#

  1. Unzip the downloaded file into a location of your choosing (e.g., Programs, Desktop, …).

  2. Open the folder where you unzipped the file and open the “Tabula” application

../../_images/en_tabula_folder.png
  1. A new browser window will open with this address: http://localhost:8080. This is the application. If the browser does not open automatically, you can open the browser and enter this go to this address manually.

../../_images/en_tabula_import.png
  1. Now let’s import the PDF file with the wall types into tabula:

    1. Click on Browse and navigate to the exercise data folder: ...\data\input and select the PDF “pakistan_wall_type7.7”. Click Open.

    2. Click Import and wait for the PDF to load. Once loaded, it will open automatically.

../../_images/en_Tabula_main_view.png
  1. Here we will select the portion of the PDF that contains the data table. Tabula expects a table with one row of headers at the top for each column, followed by the rows with the data. By dragging a rectangle on the PDF, we can create a selection where tabula should look for the data table. Drag a rectangle and adjust the boarders so the table fits as precisely as possible into selection. Make sure to only capture the relevant information. Since the headers in this table has an unconventional formatting, it should be left out so the resulting csv table is easier to adjust. We will add the headers manually once extracted. F

../../_images/en_tabula_selection.png
  1. Once you are satisfied with the selection, you can click on Repeat this Selection to duplicate the selection on the next pages.

  2. Take a look on the following pages and make sure the table is still fully contained by the selection.

  3. Click on Preview & Export Extracted Data on the top right of the window.

  4. A new window will appear where the data will show up. At first, nothing will be visible. First, click on Stream on the left.

../../_images/en_tabula_preview_extracted_1.png
  1. The data from the PDF table will appear in the main window. Review the table.

../../_images/en_tabula_data_preview.png
  1. Click on Export, this will save the .csv into your downloads folder.

  2. Move the file to the /data/interim/-folder.

Congratulations, the data from the PDF has been extracted into a CSV file!

Task 2: Clean the data from errors and unwanted entries#

Note

The necessary steps to filter the data might be different depending on the editor you use. In this exercise, we will go through the workflow with the free version of Microsoft Excel.

  1. Open the extracted CSV file in Excel. It might look like this:

../../_images/en_tabula_csv_excel.png
  1. Excel does not automatically recognise the comma delimited format. We can fix this by selecting the column A, navigating to Data > Text to Columns. A new window will open. Leave the settings as they are and click Ok.

Note

In the web version of excel, you can fix the columns by selecting column A, navigating to Data > Split Text to Columns

../../_images/en_m5_data_cleaning_ex_task2.png

The data table should now look like this. It is still missing column headers (red)#

  1. We have to add the column names back, as we did not extract them with tabular. Right-click on the first row and select Insert 1 Row Above. A new row should appear.

  2. Enter the column headings as they are in the PDF-file:

Province & District

Burnt Bricks/Blocks

Mud Bricks/Mud

Wood

Other

Total

  1. Now we can format the data into an excel table. This will allow us to apply filters to remove the unwanted entries. Select the columns A to F and navigate to Home > Format as Table and select a table styling of your choosing.

Now we have a usable .csv file with the information. However, there are still some unwanted entries and a few formatting mistakes. First, let’s remove all the rows with the data on rural and urban distribution. We want to visualise the distribution on district level, so a distinction between urban and rural is not necessary.

  1. Click on the small arrow next to the “Province & District” column. A dropdown-menu will open.

  2. In the dropdown-menu, uncheck the Select All-Box and scroll down to check all the entries with the value Urban and Rural. Also add the entries that have numerical values attached to Urban or Rural. Click Apply.

../../_images/en_m5_data_cleaning_ex_filter_excel.png
../../_images/en_m5_data_cleaning_ex_filter_excel_2.png
  1. The table should now only show the entries with rural or urban in the “Province & District”. Select them all and click on Remove selected rows.

  2. Now, we have to remove the filter we applied.

  3. Lets go through the resulting table and see if we need to fix more entries. There are a few entries where the the values for the percentages are not formatted correctly. Copy the values from the columns Burnt Bricks/Blocks, Mud Bricks/Mud, Wood, Other one cell to the right and enter the numerical value that mistakenly got added to the Provinces & District column.

  4. Save the formatted csv file in the input data folder (.../module_5_ex_7_data_cleansing/data/input/) under the name tabula-pakistan_wall_type7.7.

Great! Now we are ready to import the CSV-file into QGIS!

Task 3: Import the Data into QGIS#

Note

Fuzzy merge is a technique used in data processing to combine two datasets based on similar, but not necessarily identical, values. This is particularly useful when dealing with data that may have inconsistencies, such as typos or different formats.

Instead of looking for exact matches, fuzzy merge uses algorithms to compare values and determine how similar they are. For example, “Jon” and John” might be considered a match because how similar they are. Each comparison generates a similarity score, which indicates how closely the two values match.

  1. Open QGIS and create a new project.

  2. Save the project in the exercise folder.

  3. Import the CSV layer into QGIS.

  4. Import the administrative boundaries located in the data input folder: .../module_5_ex_7_data_cleansing/data/input/

  5. Now let’s perform a fuzzy merge: Open the field calculator for the layer called tabula-pakistan_wall_type7.7 and enter the following expression:

    array_first(aggregate(
    layer:= 'pak_admbnda_adm2_wfp_20220909',
    aggregate:='array_agg',
    expression:=ADM2_EN,
    filter:=levenshtein(ADM2_EN, attribute(@parent, 'Province & Disctrict')) <= 2,
    order_by:=levenshtein(ADM2_EN, attribute(@parent, 'Province & Disctrict'))
    ))
    
  6. Enter an Output field name, set the Output field type to Text (string) and increase the output field length to 40.

  7. Click Ok. The CSV file should now have a new column. In this column you will find the values from the adm2 polygon layer which have been matched using the fuzzy merge algorithm.

  8. We can now perform an attribute join by selecting the ADM2_EN and newly created fuzzy merge column as identifying column. In the processing toolbox, search for the tool “Join attributes by field value”. Double-click on it.

  9. A new window will open. Here, set the following parameters:

    1. Input layer: ADM2 polygon layer

    2. Table field: ADM2_EN

    3. Input layer 2: tabula-pakistan_wall_type7.7

    4. Table field 2: Fuzzy_match

    5. Layer 2 fields to copy: Burnt Bricks/Blocks, Mud Bricks/Mud, Wood, Other

../../_images/en_3.36_m5_ex_7_attr_join.png

The “Join attributes by field value” parameters#

  1. Click Run.

Congratulations, we have now successfully joined the information from the PDF file to a polygon layer!

Task 4: Visualisation of the Data#

  1. Open the layer styling panel and select the Graduated symbolisation method.

  2. Under Value, select Burnt Bricks/Blocks.

  3. Click on Classify.

The resulting symbolisation could look something like this:

../../_images/m5_ex_7_visualisation_result.png

The darker the colour, the higher the percentage of buildings having burnt bricks or blocks as wall type. The grey areas are the districts where we there is no data available.#