Exercise: Data preparation (Importing PDF tables into QGIS)#

Characteristics of the exercise#

Attention

This exercise has two different solution methods:

  • The first one uses Microsoft Excel with an active Microsoft 365 Subscription, making use of the advanced features “Get Data From PDF” and the Fuzzy Merge Function in the Power Query Editor.

  • The second solution uses Open-Source Software to achieve the same result. However, the workflow is quite different.

Instructions for the Trainers#

Attention

This exercise makes use of the tabula.technology tool, an open source application which let’s you easily extract tables from a PDF-file. Keep in mind that

Trainers Corner

Prepare the training

  • Take the time to familiarise yourself with the exercise and the provided material.

  • Prepare a white-board. It can be either a physical whiteboard, a flip-chart, or a digital whiteboard (e.g. Miro board) where the participants can add their findings and questions.

  • Before starting the exercise, make sure everybody has installed QGIS and has downloaded and unzipped the data folder.

  • Check out How to do trainings? for some general tips on training conduction

Conduct the training

Introduction:

  • Introduce the idea and aim of the exercise.

  • Provide the download link and make sure everybody has unzipped the folder before beginning the tasks.

Follow-along:

  • Show and explain each step yourself at least twice and slow enough so everybody can see what you are doing, and follow along in their own QGIS-project.

  • Make sure that everybody is following along and doing the steps themselves by periodically asking if anybody needs help or if everybody is still following.

  • Be open and patient to every question or problem that might come up. Your participants are essentially multitasking by paying attention to your instructions and orienting themselves in their own QGIS-project.

Wrap up:

  • Leave time for any issues or questions concerning the tasks at the end of the exercise.

  • Leave some time for open questions.

Exercise#

Available Data#

  • Administrative Boundaries for pakistan

  • Excel files

Workflow 1: Tabula and QGIS#


Task 1: Get the data from the PDF file into a CSV file#

  1. In your web-browser, go to Tabula.technology and download the application.

../../_images/en_tabula_website.png

The Tabula.technology website with the download links to the left#

  1. Unzip the downloaded file into a location of your choosing (e.g., Programs, Desktop, …).

  2. Open the folder where you unzipped the file and open the “Tabula” application

../../_images/en_tabula_folder.png
  1. A new browser window will open with this address: http://localhost:8080. This is the application. If the browser does not open automatically, you can open the browser and enter this go to this address manually.

../../_images/en_tabula_import.png
  1. Now let’s import the PDF file with the wall types into tabula:

    1. Click on Browse and navigate to the exercise data folder: ...\data\input and select the PDF “pakistan_wall_type7.7”. Click Open.

    2. Click Import and wait for the PDF to load. Once loaded, it will open automatically.

../../_images/en_Tabula_main_view.png
  1. Here we will select the portion of the PDF that contains the data table. Tabula expects a table with one row of headers at the top for each column, followed by the rows with the data. By dragging a rectangle on the PDF, we can create a selection where tabula should look for the data table. Drag a rectangle and adjust the boarders so the table fits as precisely as possible into selection. Make sure to only capture the relevant information. Since the headers in this table has an unconventional formatting, it should be left out so the resulting csv table is easier to adjust. We will add the headers manually once extracted. F

../../_images/en_tabula_selection.png
  1. Once you are satisfied with the selection, you can click on Repeat this Selection to duplicate the selection on the next pages.

  2. Take a look on the following pages and make sure the table is still fully contained by the selection.

  3. Click on Preview & Export Extracted Data on the top right of the window.

  4. A new window will appear where the data will show up. At first, nothing will be visible. First, click on Stream on the left.

../../_images/en_tabula_preview_extracted_1.png
  1. The data from the PDF table will appear in the main window. Review the table.

../../_images/en_tabula_data_preview.png
  1. Click on Export, this will save the .csv into your downloads folder.

  2. Move the file to the /data/interim/-folder.

Congratulations, the data from the PDF has been extracted into a CSV file!

Task 2: Clean the data from errors and unwanted entries#

Note

The necessary steps to filter the data might be different depending on the editor you use. In this exercise, we will go through the workflow with the free version of Microsoft Excel.

  1. Open the extracted CSV file in Excel. It might look like this:

../../_images/en_tabula_csv_excel.png
  1. Excel does not automatically recognise the comma delimited format. We can fix this by selecting the column A, navigating to Data > Text to Columns. A new window will open:



Workflow 2: Microsoft Excel and Power Query#

Task 1: Get the data from the PDF file into Excel#

Note

Before starting manipulating the data, make yourself familiar with the PDF-file. Open it and look at how the data is presented. How are the columns & rows organised? What kind of measurements units have been used? On how many pages is the data distributed?

  1. Open Microsoft Excel and open a new workbook.

  2. In the navigation bar, navigate to Data and click on the tool Get Data > From file > From PDF.

  3. A new window will open. Select the file from your download folder and click Open.

  4. Another window called “Navigator” will open. Here you can select the page from which excel will extract the data.

  5. Check the box Select Multiple Items and select the tables Select Page 1 and click Load.

  6. The data will be loaded onto a new Excel-sheet.

    Note

    You will notice that the table is not formatted correctly. This has to be fixed manually.

    Insert Image

  7. Give the columns the correct names, as they are presented in the PDF file (Province & District, Burnt Bricks/Blocks, Mud Bricks/Mud, Wood, Other, Total) on each page.

  8. Delete the first two rows so the table starts with the provinces and the percentage values.

  9. Save the file as a “Comma delimited text”-file (.csv). Click Yes.

Congratulations, the data from the PDF has been extracted and saved to a usable .csv-file!

Task 2: Prepare and clean the data for import into QGIS#

Caution

The key identifiers need to match exactly. Otherwise, QGIS will not be able to join two layers. We need to prepare the files so the attributes will match.

  1. Delete each row with the value “Urban” and “Rural” in the column “Province & District”. Check for district names where the “Urban” or “Rural” has been added at the end.

Image