Connect to Apache Iceberg#
What is Apache Iceberg?#
Apache Iceberg is an open table format that combines object storage (here, a MinIO object store) with features you are used to from a database, such as tables, schemas, and snapshots.
Connect to Apache Iceberg Catalog via PyIceberg#
Adjust the code below to supply your MinIO access keys. As written, it reads them from the environment variables S3_ACCESS_KEY_ID and S3_SECRET_ACCESS_KEY:
import os
s3_user = os.environ["S3_ACCESS_KEY_ID"] # add your user here
s3_password = os.environ["S3_SECRET_ACCESS_KEY"] # add your password here
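If the environment variables are not set (for example in a fresh Google Colab session), the lookup above raises a KeyError. A minimal sketch of a fallback, assuming you are willing to type the keys interactively; the helper name `read_credential` is hypothetical and not part of the original notebook:

```python
import os
from getpass import getpass


def read_credential(env_name, prompt, environ=os.environ, ask=getpass):
    """Return the value of env_name, or ask for it interactively if it is unset."""
    value = environ.get(env_name)
    if value:
        return value
    # Fall back to a hidden prompt so the key is not echoed into the notebook output.
    return ask(prompt)


# Usage (commented out so that pasting the block has no side effects):
# s3_user = read_credential("S3_ACCESS_KEY_ID", "MinIO access key: ")
# s3_password = read_credential("S3_SECRET_ACCESS_KEY", "MinIO secret key: ")
```

The `environ` and `ask` parameters exist only so the helper can be tried out without touching the real environment or blocking on input.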
Run this line if you haven’t installed the Python libraries yet, e.g. when you are running this in Google Colab.
!pip install "pyiceberg[s3fs,duckdb,sql-sqlite,pyarrow]"
Set up the connection to the Iceberg catalog.
from pyiceberg.catalog.rest import RestCatalog
catalog = RestCatalog(
    name="default",
    **{
        "uri": "https://sotm2024.iceberg.ohsome.org",
        "s3.endpoint": "https://sotm2024.minio.heigit.org",
        "py-io-impl": "pyiceberg.io.pyarrow.PyArrowFileIO",
        "s3.access-key-id": s3_user,
        "s3.secret-access-key": s3_password,
        "s3.region": "eu-central-1"
    }
)
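The same connection can also be expressed with PyIceberg's generic `load_catalog` entry point, which selects the catalog implementation from a `type` property. A sketch, assuming the same endpoints as above; the helper names `rest_catalog_properties` and `connect` are hypothetical and only serve to keep the settings in one place:

```python
def rest_catalog_properties(access_key, secret_key):
    """Collect the connection settings used above in a plain dictionary."""
    return {
        "type": "rest",  # tells load_catalog to use the REST catalog implementation
        "uri": "https://sotm2024.iceberg.ohsome.org",
        "s3.endpoint": "https://sotm2024.minio.heigit.org",
        "py-io-impl": "pyiceberg.io.pyarrow.PyArrowFileIO",
        "s3.access-key-id": access_key,
        "s3.secret-access-key": secret_key,
        "s3.region": "eu-central-1",
    }


def connect(access_key, secret_key):
    """Open the catalog; this needs network access and valid keys."""
    # Imported lazily so the properties helper works without PyIceberg installed.
    from pyiceberg.catalog import load_catalog

    return load_catalog("default", **rest_catalog_properties(access_key, secret_key))


# Usage (requires the credentials defined earlier):
# catalog = connect(s3_user, s3_password)
```

Keeping the properties in a separate function makes it easy to check what will be sent to the catalog before any connection is attempted.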
Get an overview#
Find out what data exists and where to find it. Tables in Iceberg are organized in groups called namespaces. The steps in this section are:
List all existing namespaces
List the tables that exist in a namespace
Get some table metadata
Currently this catalog consists of only a single namespace. You can think of a namespace as a schema
in PostgreSQL or other databases.
catalog.list_namespaces()
[('geo_sort',)]
In this step we list which tables are available in this namespace.
catalog.list_tables('geo_sort')
[('geo_sort', 'benni_test_heidelberg'),
('geo_sort', 'contributions'),
('geo_sort', 'contributions_germany')]
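The two listing calls can be combined into a single overview of the whole catalog. A minimal sketch, assuming `catalog` is the object created above; the function name `catalog_overview` is hypothetical:

```python
def catalog_overview(catalog):
    """Map each namespace name to the list of table names it contains."""
    overview = {}
    for namespace in catalog.list_namespaces():
        # Identifiers are tuples of strings, e.g. ('geo_sort', 'contributions').
        tables = catalog.list_tables(namespace)
        # The last element of a table identifier is the table name itself.
        overview[".".join(namespace)] = [table[-1] for table in tables]
    return overview


# Usage:
# catalog_overview(catalog)
```

With the listing shown above, this would map 'geo_sort' to its three table names. The catalog contents may of course change over time.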
Let’s inspect a single Iceberg table and list all columns / attributes from this table. (We will explain these in detail on the next page.)
iceberg_table = catalog.load_table(('geo_sort', 'contributions'))
display(iceberg_table)
contributions(
1: user_id: optional int,
2: valid_from: optional timestamp,
3: valid_to: optional timestamp,
4: osm_type: optional string,
5: osm_id: optional string,
6: osm_version: optional int,
7: contrib_type: optional string,
8: members: optional list<struct<32: type: optional string, 33: id: optional long, 34: role: optional string, 35: geometry: optional binary>>,
9: status: optional string,
10: changeset: optional struct<36: id: optional long, 37: timestamp: optional timestamp, 38: tags: optional map<string, string>, 39: hashtags: optional list<string>, 40: editor: optional string>,
11: tags: optional map<string, string>,
12: tags_before: optional map<string, string>,
13: map_features: optional struct<48: aerialway: optional boolean, 49: aeroway: optional boolean, 50: amenity: optional boolean, 51: barrier: optional boolean, 52: boundary: optional boolean, 53: building: optional boolean, 54: craft: optional boolean, 55: emergency: optional boolean, 56: geological: optional boolean, 57: healthcare: optional boolean, 58: highway: optional boolean, 59: historic: optional boolean, 60: landuse: optional boolean, 61: leisure: optional boolean, 62: man_made: optional boolean, 63: military: optional boolean, 64: natural: optional boolean, 65: office: optional boolean, 66: place: optional boolean, 67: power: optional boolean, 68: public_transport: optional boolean, 69: railway: optional boolean, 70: route: optional boolean, 71: shop: optional boolean, 72: sport: optional boolean, 73: telecom: optional boolean, 74: water: optional boolean, 75: waterway: optional boolean>,
14: area: optional long,
15: area_delta: optional long,
16: length: optional long,
17: length_delta: optional long,
18: xzcode: optional struct<76: level: optional int, 77: code: optional long>,
19: country_iso_a3: optional list<string>,
20: bbox: optional struct<79: xmin: optional double, 80: ymin: optional double, 81: xmax: optional double, 82: ymax: optional double>,
21: xmin: optional double,
22: xmax: optional double,
23: ymin: optional double,
24: ymax: optional double,
25: centroid: optional struct<83: x: optional double, 84: y: optional double>,
26: quadkey_z10: optional string,
27: h3_r5: optional long,
28: geometry_type: optional string,
29: geometry_valid: optional boolean,
30: geometry: optional string
),
partition by: [status, geometry_type],
sort order: [],
snapshot: Operation.APPEND: id=1440840715635230871, schema_id=0
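The printed summary is convenient to read, but the same metadata is also available programmatically through the table object. A sketch, assuming `iceberg_table` was loaded as above; the helper name `top_level_columns` is hypothetical, while `schema()`, `spec()` and `current_snapshot()` are methods of the PyIceberg table:

```python
def top_level_columns(schema):
    """Return (field_id, name, type) for each top-level column of an Iceberg schema."""
    return [
        (field.field_id, field.name, str(field.field_type))
        for field in schema.fields
    ]


# Usage (requires the table loaded above):
# for field_id, name, field_type in top_level_columns(iceberg_table.schema()):
#     print(field_id, name, field_type)
# print(iceberg_table.spec())              # partitioning
# print(iceberg_table.current_snapshot())  # latest snapshot, or None for an empty table
```

Nested fields such as the entries of `changeset` or `bbox` are not expanded here; only the top-level columns are listed.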
Let’s dive deeper now into the data structure and what you can expect for your data analysis.