Exercise: Data preparation (Importing PDF tables into QGIS)#
Characteristics of the exercise#
Attention
This exercise has two different solution methods:
The first one uses Microsoft Excel with an active Microsoft 365 Subscription, making use of the advanced features “Get Data From PDF” and the Fuzzy Merge Function in the Power Query Editor.
The second solution uses Open-Source Software to achieve the same result. However, the workflow is quite different.
Instructions for the Trainers#
Attention
This exercise makes use of the tabula.technology tool, an open source application which let’s you easily extract tables from a PDF-file. Keep in mind that
Trainers Corner
Prepare the training
Take the time to familiarise yourself with the exercise and the provided material.
Prepare a white-board. It can be either a physical whiteboard, a flip-chart, or a digital whiteboard (e.g. Miro board) where the participants can add their findings and questions.
Before starting the exercise, make sure everybody has installed QGIS and has downloaded and unzipped the data folder.
Check out How to do trainings? for some general tips on training conduction
Conduct the training
Introduction:
Introduce the idea and aim of the exercise.
Provide the download link and make sure everybody has unzipped the folder before beginning the tasks.
Follow-along:
Show and explain each step yourself at least twice and slow enough so everybody can see what you are doing, and follow along in their own QGIS-project.
Make sure that everybody is following along and doing the steps themselves by periodically asking if anybody needs help or if everybody is still following.
Be open and patient to every question or problem that might come up. Your participants are essentially multitasking by paying attention to your instructions and orienting themselves in their own QGIS-project.
Wrap up:
Leave time for any issues or questions concerning the tasks at the end of the exercise.
Leave some time for open questions.
Exercise#
Available Data#
Administrative Boundaries for pakistan
Excel files
Workflow 1: Tabula and QGIS#
Task 1: Get the data from the PDF file into a CSV file#
In your web-browser, go to Tabula.technology and download the application.

The Tabula.technology website with the download links to the left#
Unzip the downloaded file into a location of your choosing (e.g., Programs, Desktop, …).
Open the folder where you unzipped the file and open the “Tabula” application

A new browser window will open with this address: http://localhost:8080. This is the application. If the browser does not open automatically, you can open the browser and enter this go to this address manually.

Now let’s import the PDF file with the wall types into tabula:
Click on
Browse
and navigate to the exercise data folder:...\data\input
and select the PDF “pakistan_wall_type7.7”. ClickOpen
.Click
Import
and wait for the PDF to load. Once loaded, it will open automatically.

Here we will select the portion of the PDF that contains the data table. Tabula expects a table with one row of headers at the top for each column, followed by the rows with the data. By dragging a rectangle on the PDF, we can create a selection where tabula should look for the data table. Drag a rectangle and adjust the boarders so the table fits as precisely as possible into selection. Make sure to only capture the relevant information. Since the headers in this table has an unconventional formatting, it should be left out so the resulting csv table is easier to adjust. We will add the headers manually once extracted. F

Once you are satisfied with the selection, you can click on
Repeat this Selection
to duplicate the selection on the next pages.Take a look on the following pages and make sure the table is still fully contained by the selection.
Click on
Preview & Export Extracted Data
on the top right of the window.A new window will appear where the data will show up. At first, nothing will be visible. First, click on
Stream
on the left.

The data from the PDF table will appear in the main window. Review the table.

Click on
Export
, this will save the.csv
into your downloads folder.Move the file to the
/data/interim/
-folder.
Congratulations, the data from the PDF has been extracted into a CSV file!
Task 2: Clean the data from errors and unwanted entries#
Note
The necessary steps to filter the data might be different depending on the editor you use. In this exercise, we will go through the workflow with the free version of Microsoft Excel.
Open the extracted CSV file in Excel. It might look like this:

Excel does not automatically recognise the comma delimited format. We can fix this by selecting the column A, navigating to
Data
>Text to Columns
. A new window will open:
Workflow 2: Microsoft Excel and Power Query#
Task 1: Get the data from the PDF file into Excel#
Note
Before starting manipulating the data, make yourself familiar with the PDF-file. Open it and look at how the data is presented. How are the columns & rows organised? What kind of measurements units have been used? On how many pages is the data distributed?
Open Microsoft Excel and open a new workbook.
In the navigation bar, navigate to
Data
and click on the toolGet Data
>From file
>From PDF
.A new window will open. Select the file from your download folder and click
Open
.Another window called “Navigator” will open. Here you can select the page from which excel will extract the data.
Check the box
Select Multiple Items
and select the tables Select Page 1 and clickLoad
.The data will be loaded onto a new Excel-sheet.
Note
You will notice that the table is not formatted correctly. This has to be fixed manually.
Insert Image
Give the columns the correct names, as they are presented in the PDF file (Province & District, Burnt Bricks/Blocks, Mud Bricks/Mud, Wood, Other, Total) on each page.
Delete the first two rows so the table starts with the provinces and the percentage values.
Save the file as a “Comma delimited text”-file (
.csv
). ClickYes
.
Congratulations, the data from the PDF has been extracted and saved to a usable .csv
-file!
Task 2: Prepare and clean the data for import into QGIS#
Caution
The key identifiers need to match exactly. Otherwise, QGIS will not be able to join two layers. We need to prepare the files so the attributes will match.
Delete each row with the value “Urban” and “Rural” in the column “Province & District”. Check for district names where the “Urban” or “Rural” has been added at the end.
Image