Download latest OSM data as GeoPackage#

In this notebook we demonstrate how you can download the latest OSM data in GeoPackage format.

These are the steps you see further down:

  • Set the connection parameters.

  • Prepare your input parameters, e.g. define area of interest and OSM tag filter.

  • Download data using PyIceberg and DuckDB.

  • Filter and process data with DuckDB.

  • Export results into geopackage file with GeoPandas.

Getting started#

Set connection params.

from dotenv import load_dotenv
load_dotenv()
True
import os

s3_user = os.environ["S3_ACCESS_KEY_ID"]  # add your user here
s3_password = os.environ["S3_SECRET_ACCESS_KEY"]  # add your password here

Configure DuckDB.

!pip install duckdb==1.0.0
import duckdb

con = duckdb.connect(
    config={
        'threads': 8,
        'max_memory': '8GB',
    }
)
con.install_extension("spatial")
con.load_extension("spatial")

Set the connection params to Iceberg Rest Catalog.

!pip install "pyiceberg[s3fs,duckdb,sql-sqlite,pyarrow]"
from pyiceberg.catalog.rest import RestCatalog

catalog = RestCatalog(
    name="default",
    **{
        "uri": "https://sotm2024.iceberg.ohsome.org",
        "s3.endpoint": "https://sotm2024.minio.heigit.org",
        "py-io-impl": "pyiceberg.io.pyarrow.PyArrowFileIO",
        "s3.access-key-id": s3_user,
        "s3.secret-access-key": s3_password,
        "s3.region": "eu-central-1"
    }
)

Download data with PyIceberg table scan#

In this step we can already filter all OSM contributions by four major factors. We will perform more detailed filtering (e.g. for OSM tags values) later:

  • status (e.g. latest, historic or deleted OSM features)

  • location (using the bounding box coordinates of each OSM feature)

  • geometry type (e.g. for Polygons, Linestrings or Points)

  • time (e.g. the edit timestamp of each OSM contribution)

# Set iceberg table
namespace = 'geo_sort'
tablename = 'contributions'
icebergtable = catalog.load_table((namespace, tablename))

# Define status filter
status = 'latest'

# Define location filter
bboxes = {
    'heidelberg': (8.629761, 49.379556, 8.742371, 49.437890),
    'nairobi': (36.650938, -1.444471, 37.103887, -1.163522),
    'mannheim': (8.41416, 49.410362, 8.58999, 49.590489), 
    'berlin': (13.088345, 52.338271, 13.761161, 52.675509)
}

selected_region = 'nairobi'
xmin, ymin, xmax, ymax = bboxes[selected_region]

# Define geometry type filter
geometry_type = 'Polygon'

# Define time filter (optional)
min_timestamp = '2024-01-01T00:00:00'
max_timestamp = '2024-06-01T00:00:00'

Furthermore, we define which attributes / columns this download should contain. Check out the dataset description page to get an overview on all available columns.

Usually you rarely want to extract all available columns as this would reduce speed of the data download. Here we are going to download the following information:

  • user_id

  • osm_id

  • osm_version

  • valid_from

  • tags

  • geometry

import time
start_time = time.time()

icebergtable.scan(
    row_filter=(
        f"status = '{status}' "
        f"and geometry_type = '{geometry_type}' "
        f"and (bbox.xmax >= {xmin} and bbox.xmin <= {xmax}) "
        f"and (bbox.ymax >= {ymin} and bbox.ymin <= {ymax}) "
    ),
    selected_fields=(
        "user_id",
        "osm_id",
        "osm_version",
        "valid_from",
        "tags",
        "geometry",
    ),
).to_duckdb('raw_osm_data',connection=con)

download_time = round(time.time() - start_time, 3)
print(f"download took {download_time} sec.")
download took 15.965 sec.

Filter and process data with DuckDB#

Second, we use DuckDB to perform the more detailed filtering. In this step we can filter for:

  • tags

  • location (using the exact geometry of each OSM contribution)

import time
start_time = time.time()

query = f"""
DROP TABLE IF EXISTS osm_data;
CREATE TABLE osm_data AS
(
SELECT a.*  
FROM
    raw_osm_data as a,
WHERE 1=1
    and tags['building'][1] is not null
    and tags['building'][1] != 'no'
)
;
"""
con.sql(query)

processing_time = round(time.time() - start_time, 3)
print(f"processing took {processing_time} sec.")
processing took 0.552 sec.

Save data as GeoPackage#

Show the structure of the data we have just downloaded.

query = """
DESCRIBE
FROM osm_data;
"""
con.sql(query)
┌─────────────┬───────────────────────┬─────────┬─────────┬─────────┬─────────┐
│ column_name │      column_type      │  null   │   key   │ default │  extra  │
│   varchar   │        varchar        │ varchar │ varchar │ varchar │ varchar │
├─────────────┼───────────────────────┼─────────┼─────────┼─────────┼─────────┤
│ user_id     │ INTEGER               │ YES     │ NULL    │ NULL    │ NULL    │
│ valid_from  │ TIMESTAMP             │ YES     │ NULL    │ NULL    │ NULL    │
│ osm_id      │ VARCHAR               │ YES     │ NULL    │ NULL    │ NULL    │
│ osm_version │ INTEGER               │ YES     │ NULL    │ NULL    │ NULL    │
│ tags        │ MAP(VARCHAR, VARCHAR) │ YES     │ NULL    │ NULL    │ NULL    │
│ geometry    │ VARCHAR               │ YES     │ NULL    │ NULL    │ NULL    │
└─────────────┴───────────────────────┴─────────┴─────────┴─────────┴─────────┘

Inspect a few features.

query = """
SELECT *
FROM osm_data
LIMIT 5;
"""
con.sql(query)
┌──────────┬─────────────────────┬────────────────┬─────────────┬──────────────────────┬───────────────────────────────┐
│ user_id  │     valid_from      │     osm_id     │ osm_version │         tags         │           geometry            │
│  int32   │      timestamp      │    varchar     │    int32    │ map(varchar, varch…  │            varchar            │
├──────────┼─────────────────────┼────────────────┼─────────────┼──────────────────────┼───────────────────────────────┤
│  1122708 │ 2021-09-07 23:04:37 │ way/500108893  │           5 │ {building=yes, rai…  │ POLYGON ((36.89974139999999…  │
│ 13366421 │ 2023-01-10 06:13:29 │ way/1130552834 │           1 │ {building=yes, add…  │ POLYGON ((36.8719435 -1.241…  │
│ 18306654 │ 2023-11-11 01:58:08 │ way/88406439   │          16 │ {building=yes, aer…  │ POLYGON ((36.9238579 -1.331…  │
│ 17770290 │ 2022-11-27 17:57:39 │ way/1117632750 │           1 │ {building=yes}       │ POLYGON ((36.6824404 -1.442…  │
│  3733993 │ 2023-04-11 09:02:45 │ way/1161291128 │           1 │ {building=yes}       │ POLYGON ((36.6824528 -1.438…  │
└──────────┴─────────────────────┴────────────────┴─────────────┴──────────────────────┴───────────────────────────────┘

Count the number of features.

query = f"""
SELECT count(*)
FROM osm_data
"""
con.sql(query)
┌──────────────┐
│ count_star() │
│    int64     │
├──────────────┤
│       527170 │
└──────────────┘

Export as GeoPackage via GeoPandas.

import geopandas as gpd

start_time = time.time()
query = f"""
    SELECT * FROM osm_data
"""
df = con.sql(query).df()

gdf = gpd.GeoDataFrame(
    df,
    geometry=gpd.GeoSeries.from_wkt(df['geometry'])
).set_crs('epsg:4326')

output_filename = f"../data/{selected_region}_osm_data.gpkg"
gdf.to_file(output_filename, driver='GPKG')
processing_time = round(time.time() - start_time, 3)
print(f"processing took {processing_time} sec.")
processing took 16.175 sec.

Work with the data in QGIS#

Add your geopackage file in QGIS, e.g. via drag-and-drop or through file manager.