Vandalism Detection for LineStrings#
In this notebook we demonstrate how you detect and visualize vandalism in OSM. We are going to make use of one of the enriched attributes we have added to ohsome-data-insights.
These are the steps you see further down:
Set the connection parameters.
Prepare your input parameters, e.g. define area of interest and shape filters.
Download data using PyIceberg and DuckDB.
Filter and process data with DuckDB.
Visualize the results on a map.
Getting started#
Set connection params.
from dotenv import load_dotenv
load_dotenv()
True
import os
s3_user = os.environ["S3_ACCESS_KEY_ID"] # add your user here
s3_password = os.environ["S3_SECRET_ACCESS_KEY"] # add your password here
Configure DuckDB.
!pip install duckdb==1.0.0
import duckdb
con = duckdb.connect(
config={
'threads': 8,
'max_memory': '8GB',
}
)
con.install_extension("spatial")
con.load_extension("spatial")
Set the connection params to Iceberg Rest Catalog.
!pip install "pyiceberg[s3fs,duckdb,sql-sqlite,pyarrow]"
from pyiceberg.catalog.rest import RestCatalog
catalog = RestCatalog(
name="default",
**{
"uri": "https://sotm2024.iceberg.ohsome.org",
"s3.endpoint": "https://sotm2024.minio.heigit.org",
"py-io-impl": "pyiceberg.io.pyarrow.PyArrowFileIO",
"s3.access-key-id": s3_user,
"s3.secret-access-key": s3_password,
"s3.region": "eu-central-1"
}
)
Prepare the input parameters for your analysis#
# Set iceberg table
namespace = 'geo_sort'
tablename = 'contributions'
icebergtable = catalog.load_table((namespace, tablename))
# Define status filter
status = 'history'
# Define geometry type filter
geometry_type = 'LineString'
# Define time filter (optional)
min_timestamp = '2023-01-01T00:00:00'
# Define length_delta threshold in meter.
# This is a 5000 kilometers!
length_delta_threshold = 5_000_000
Get the Data#
First, we do an iceberg table scan with a pre-filter. This is a fast way to download all potential OSM elements that are needed for our analysis.
Here we use one of the enriched attributes we have calculated in advance for each OSM contribution. We use the length_delta
attribute to inspect all the OSM elements with an unusual large change. Here we download all linestrings which changed more than 5000 kilometers in length.
import time
start_time = time.time()
icebergtable.scan(
row_filter=(
f"status = '{status}' "
f"and geometry_type = '{geometry_type}' "
f"and length_delta >= '{length_delta_threshold}' "
),
selected_fields=(
"user_id",
"osm_id",
"valid_from",
"tags",
"geometry",
)
).to_duckdb('raw_osm_data',connection=con)
download_time = round(time.time() - start_time, 3)
print(f"download took {download_time} sec.")
download took 7.797 sec.
Display OSM user activity on map#
Get data from DucDKB into GeoPandas dataframe.
import geopandas as gpd
map_query = """
SELECT
osm_id,
user_id,
valid_from::varchar as date,
epoch_ms(valid_from) as valid_from,
geometry
FROM raw_osm_data;
"""
df = con.sql(map_query).df()
# convert the data to geodata
gdf = gpd.GeoDataFrame(
df,
geometry=gpd.GeoSeries.from_wkt(df['geometry'])
).set_crs('epsg:4326')
Define map parameters and style.
import datetime
import lonboard
from palettable.matplotlib import Viridis_20
# compute lonboard color style for contious color map
min_valid_from = datetime.datetime(2008,1,1).timestamp()
max_valid_from = datetime.datetime(2024,7,1).timestamp()
# the lonboard map definition
layer = lonboard.PathLayer.from_geopandas(
gdf,
extensions=[lonboard.layer_extension.DataFilterExtension(filter_size=1)],
get_filter_value=gdf["valid_from"], # replace with desired column
filter_range=[min_valid_from, max_valid_from], # replace with desired filter range
get_color=[255, 0, 0, 125],
get_width=1.5,
width_units='pixels'
)
currentness_map = lonboard.Map(
basemap_style=lonboard.basemap.CartoBasemap.DarkMatter,
layers=[layer]
)
Define Dates Slider Selection Widget and link to map filter range.
from datetime import date, timedelta
import ipywidgets
from traitlets import directional_link
start = datetime.datetime(2008,1,1)
end = datetime.datetime(2024,6,1)
delta = end - start # returns timedelta
dates = [start + timedelta(days=i) for i in range(delta.days + 1)]
options = [(i.strftime('%d-%b-%Y'), 1000 * i.timestamp()) for i in dates]
date_slider = ipywidgets.SelectionRangeSlider(
options=options,
index=(0, len(dates)-1),
description='Last Edit:',
layout=ipywidgets.Layout(width='1000px'),
disabled=False
)
directional_link(
(date_slider, 'value'),
(layer, 'filter_range')
)
<traitlets.traitlets.directional_link at 0x78f640dabe30>
Display the map. Have fun exploring and moving around the time slider!
display(currentness_map, date_slider)