Connect to Apache Iceberg

What is Apache Iceberg?

  • Apache Iceberg is a table format that combines object storage (here, a MinIO object store) with features you are used to from a database, such as tables with a defined schema

Connect to Apache Iceberg Catalog via PyIceberg

Adjust the code below to supply your MinIO access keys. As written, it reads them from the environment variables S3_ACCESS_KEY_ID and S3_SECRET_ACCESS_KEY:

import os

s3_user = os.environ["S3_ACCESS_KEY_ID"]  # your MinIO access key (user)
s3_password = os.environ["S3_SECRET_ACCESS_KEY"]  # your MinIO secret key (password)
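
If one of the environment variables is not set, the lookup above fails with a bare KeyError. The following is a minimal sketch of a friendlier check; it is not part of the original notebook, and `read_s3_credentials` is a hypothetical helper name chosen for illustration.

```python
import os


def read_s3_credentials(env=os.environ):
    """Return (access_key, secret_key), or raise a readable error.

    `env` is any mapping; it defaults to the process environment.
    The helper name is an assumption, not part of PyIceberg.
    """
    names = ("S3_ACCESS_KEY_ID", "S3_SECRET_ACCESS_KEY")
    # Treat unset and empty variables the same way.
    missing = [name for name in names if not env.get(name)]
    if missing:
        raise RuntimeError(
            "Missing environment variable(s): " + ", ".join(missing)
        )
    return env[names[0]], env[names[1]]
```

With both variables set, `s3_user, s3_password = read_s3_credentials()` would replace the two assignments above.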

Run this line if you haven’t installed the Python libraries yet, e.g. when you are running this notebook in Google Colab.

!pip install "pyiceberg[s3fs,duckdb,sql-sqlite,pyarrow]"

Set up the connection to the Iceberg catalog.

from pyiceberg.catalog.rest import RestCatalog

catalog = RestCatalog(
    name="default",
    **{
        "uri": "https://sotm2024.iceberg.ohsome.org",
        "s3.endpoint": "https://sotm2024.minio.heigit.org",
        "py-io-impl": "pyiceberg.io.pyarrow.PyArrowFileIO",
        "s3.access-key-id": s3_user,
        "s3.secret-access-key": s3_password,
        "s3.region": "eu-central-1"
    }
)
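
Before moving on, it can help to confirm that the catalog is actually reachable. The following is a minimal sketch, not part of the original notebook: `can_reach_catalog` is a hypothetical helper name, and it assumes that a failed request surfaces as a Python exception from `list_namespaces()`.

```python
def can_reach_catalog(catalog) -> bool:
    """Return True if a simple metadata request succeeds.

    `catalog` is expected to offer `list_namespaces()`, as the
    PyIceberg catalog created above does. The broad `except` is a
    deliberate simplification for interactive notebook use.
    """
    try:
        catalog.list_namespaces()
    except Exception as exc:
        print(f"Could not reach the catalog: {exc}")
        return False
    return True
```

Calling `can_reach_catalog(catalog)` with the catalog created above should return `True` if the URI and credentials are accepted.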

Get an overview

Find out what data exists and where to find it. Tables in Iceberg are organized in groups called namespaces.

  1. List all existing namespaces

  2. List the tables that exist in a namespace

  3. Get some table metadata

Currently, this catalog consists of only a single namespace. You can think of a namespace as the equivalent of a schema in PostgreSQL or other databases.

catalog.list_namespaces()
[('geo_sort',)]

In this step we list which tables are available in this namespace.

catalog.list_tables('geo_sort')
[('geo_sort', 'benni_test_heidelberg'),
 ('geo_sort', 'contributions'),
 ('geo_sort', 'contributions_germany')]
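
The two listing calls above can be combined into a single overview of the whole catalog. This is a minimal sketch under the assumption that identifiers are tuples of strings, as in the outputs shown above; `list_all_tables` is a hypothetical helper name and not part of PyIceberg.

```python
def list_all_tables(catalog):
    """Map each namespace name to the names of the tables it contains.

    Relies only on `catalog.list_namespaces()` and
    `catalog.list_tables(namespace)`, the two calls used above.
    """
    overview = {}
    for namespace in catalog.list_namespaces():
        # A namespace is a tuple such as ('geo_sort',).
        tables = catalog.list_tables(namespace)
        # A table identifier is a tuple whose last item is the table name.
        overview[".".join(namespace)] = [identifier[-1] for identifier in tables]
    return overview
```

For the catalog above, `list_all_tables(catalog)` would be expected to return one entry, keyed by `geo_sort`, holding the three table names.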

Let’s inspect a single Iceberg table and list all of its columns (attributes). We will explain these in detail on the next page.

iceberg_table = catalog.load_table(('geo_sort', 'contributions'))
display(iceberg_table)
contributions(
  1: user_id: optional int,
  2: valid_from: optional timestamp,
  3: valid_to: optional timestamp,
  4: osm_type: optional string,
  5: osm_id: optional string,
  6: osm_version: optional int,
  7: contrib_type: optional string,
  8: members: optional list<struct<32: type: optional string, 33: id: optional long, 34: role: optional string, 35: geometry: optional binary>>,
  9: status: optional string,
  10: changeset: optional struct<36: id: optional long, 37: timestamp: optional timestamp, 38: tags: optional map<string, string>, 39: hashtags: optional list<string>, 40: editor: optional string>,
  11: tags: optional map<string, string>,
  12: tags_before: optional map<string, string>,
  13: map_features: optional struct<48: aerialway: optional boolean, 49: aeroway: optional boolean, 50: amenity: optional boolean, 51: barrier: optional boolean, 52: boundary: optional boolean, 53: building: optional boolean, 54: craft: optional boolean, 55: emergency: optional boolean, 56: geological: optional boolean, 57: healthcare: optional boolean, 58: highway: optional boolean, 59: historic: optional boolean, 60: landuse: optional boolean, 61: leisure: optional boolean, 62: man_made: optional boolean, 63: military: optional boolean, 64: natural: optional boolean, 65: office: optional boolean, 66: place: optional boolean, 67: power: optional boolean, 68: public_transport: optional boolean, 69: railway: optional boolean, 70: route: optional boolean, 71: shop: optional boolean, 72: sport: optional boolean, 73: telecom: optional boolean, 74: water: optional boolean, 75: waterway: optional boolean>,
  14: area: optional long,
  15: area_delta: optional long,
  16: length: optional long,
  17: length_delta: optional long,
  18: xzcode: optional struct<76: level: optional int, 77: code: optional long>,
  19: country_iso_a3: optional list<string>,
  20: bbox: optional struct<79: xmin: optional double, 80: ymin: optional double, 81: xmax: optional double, 82: ymax: optional double>,
  21: xmin: optional double,
  22: xmax: optional double,
  23: ymin: optional double,
  24: ymax: optional double,
  25: centroid: optional struct<83: x: optional double, 84: y: optional double>,
  26: quadkey_z10: optional string,
  27: h3_r5: optional long,
  28: geometry_type: optional string,
  29: geometry_valid: optional boolean,
  30: geometry: optional string
),
partition by: [status, geometry_type],
sort order: [],
snapshot: Operation.APPEND: id=1440840715635230871, schema_id=0
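
The same metadata can also be read programmatically instead of from the printed representation. The sketch below assumes the table object exposes `schema()` and `spec()`, each with a `fields` collection whose items have a `name`, as PyIceberg tables do; `summarize_table` is a hypothetical helper name.

```python
def summarize_table(table):
    """Collect top-level column names and partition field names."""
    schema = table.schema()  # current table schema
    spec = table.spec()      # current partition spec
    return {
        "columns": [field.name for field in schema.fields],
        "partition_fields": [field.name for field in spec.fields],
    }
```

For the table loaded above, `summarize_table(iceberg_table)` would be expected to list the 30 top-level columns, with `status` and `geometry_type` as partition fields.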

Let’s now take a closer look at the data structure and what it means for your data analysis.