Lesson 5. OvertureMaps Data#
From Local to Global analysis with Arrow#

Introduction#
In this Lesson, we will fetch Overture Maps data from the local level in Helsinki and the national level of Finland using tags like Buildings and Points of Interest (POI). Then, we will escalate the fetching process at the global level using grids and storing data in parquet format on our local disk (Scratch). The Scratch disk in the Puhti supercomputer has a capacity of 1 Tb which is way bigger than the projappl disk with 50 Gb. So, it is recommended to use the Scratch disk in processes where it is needed to store big amounts of data. We will also explore the compression capacity of Python for big data like PyArrow and cloud formats like GeoParquet. The test done at downloading the entire global data of Overture Maps showed that the data can be kept in only 3.8 GB using 417 GeoParquets.
This exercise is designed for teaching purposes of Spatial Data Science with High Performance Computing (HPC) at Aalto University. Thanks to the computational resources provided by CSC this exercise was tested in the Puhti supercomputer using Jupyter interactive session.
Objective#
To fetch global datasets from Overture Maps using Parallel Computing.
To analyze global data using GeoParquet and PyArrow. Which main category is the biggest at the global level?
Datasets#
Look at the whole data review at Overture Maps Documentation.
Points of Interest#
Data with the tag Places contains points of representations of real-world facilities, services, businesses, or amenities. Test done at the global level.
Buildings#
Data with the tag Buildings describes human-made structures with roofs or interior spaces that are permanently or semi-permanently in one place (source: OSM building definition). Done locally for Helsinki.
Country Admin Level#
The country border layer was downloaded from Natural Earth and stored in Allas for your availability. The Geopackage file is ~4 MB.
A low-resolution version was used directly from Geopandas data store.
Capacity#
2 Cores
60 GB memory
80 GB disk memory
Output#
A workflow that can be used for the analysis of big data at the global level for categorical variables and geo-visualization.
Hands-on coding#
Follow the instructions and run every cell in the supercomputer.
Importing Python libraries#
Be sure that you have installed the overturemaps-py and lonboard in your environment. Get familiar with this library by reading a bit the lonboard Documentation. Also, be aware that PyArrow and H3Pandas should be in the environment.
For parallelization, we are going to use Dask.
import numpy as np
from shapely.geometry import Polygon
import overturemaps
import geopandas as gpd
import pandas as pd
from IPython.display import IFrame
import gc
import dask_geopandas as geodask
import dask
import pyarrow.parquet as pq
import plotly.express as px
import matplotlib.pyplot as plt
from matplotlib.colors import LogNorm
from palettable.colorbrewer.sequential import Oranges_9, PuRd_9, YlGnBu_9, PuBuGn_9, YlOrRd_9, RdPu_7
from palettable.colorbrewer.qualitative import Paired_12
from palettable.colorbrewer.diverging import RdYlGn_7, Spectral_9, PiYG_6
from lonboard import Map, PolygonLayer, ScatterplotLayer
from lonboard.colormap import apply_continuous_cmap, apply_categorical_cmap
import pydeck
import h3pandas
import h3
import datashader as ds
import datashader.transfer_functions as ds_function
from datashader.utils import export_image
from functools import partial
import colorcet as cc
import os, shutil, glob
import time
import warnings
warnings.simplefilter("ignore")
# results folder
if not os.path.exists('output'):
os.makedirs('output')
scratch = '/scratch/project_2009245'
Builings of Helsinki (Local)#
Find a Bounding Box#
The overture map tool works using a Bounding Box coordinates. You can easily find the coordinates using the online tool Bounding Box Klokantech. You have to select CSV from the dropdown options. We will use a decent area of the Helsinki Region for this initial part.

# bbox coordinates in CSV format
bbox = 24.503673, 60.120966, 25.254511, 60.356827
Materialize data in memory#
The overturemaps uses the function overturemaps.record_batch_reader which is an iterator of Arrow. With read_all() we will fetch the data from the AWS S3 Cloud so it might take some minutes.
Let’s get a look to the data types we can fetch using overturemaps.get_all_overture_types()
overturemaps.get_all_overture_types()
['locality',
'locality_area',
'administrative_boundary',
'building',
'building_part',
'division',
'division_area',
'place',
'segment',
'connector',
'infrastructure',
'land',
'land_cover',
'land_use',
'water']
%%time
table = overturemaps.record_batch_reader("building", bbox).read_all()
# Temporarily required as of Lonboard 0.8 to avoid a Lonboard bug
# table = table.combine_chunks()
CPU times: user 3.32 s, sys: 902 ms, total: 4.22 s
Wall time: 38.4 s
Let’s take a look at the column names of the Arrow table object. The Arrow Table has a different behavior in comparison of a Pandas Dataframe. If you need to do operation it might have a different sintax than common so it might be good to check the documentation before. For now, we will continue without operations but at we will use them at the global level.
table.column_names
['id',
'geometry',
'bbox',
'version',
'update_time',
'sources',
'subtype',
'names',
'class',
'level',
'has_parts',
'height',
'num_floors',
'min_height',
'min_floor',
'facade_color',
'facade_material',
'roof_material',
'roof_shape',
'roof_direction',
'roof_orientation',
'roof_color',
'eave_height']
Normalization of building’s height#
We will use lonboard to visualize and color the building’s height. Building heights tend to scale non-linearly. That is, most buildings are relatively low, but a few are very tall. So that the low buildings aren’t completely washed out, we’ll use matplotlib’ LogNorm to normalize these values accordingly to a log scale. Take a look at more material in the Lonboard Documentation
# replace height nans in a numpy array
heights = table["height"].to_numpy()
heights = np.nan_to_num(heights, nan=1)
# normalize heights in logarithmic scale
normalizer = LogNorm(1, heights.max(), clip=True)
normalized_heights = normalizer(heights)
Add color palette#
You can add color to lonboard to a column with continuous values using apply_continuous_cmap, take a look to the documentation
Let’s add a color from colorbrewer. You can find more options in the documentation but for now, we will use Oranges_9 or PuRd_9.
Oranges_9.mpl_colormap
# let's apply the color palette
colors = apply_continuous_cmap(normalized_heights, Oranges_9)
Create a PolygonLayer#
Using lonboard you have to create different Layer objects to run the map instance. Take a look at the options in the Layers Documentation. We use the PolygonLayer because we are plotting polygons in the building geometry.
# ---- Create a Layer
layer = PolygonLayer(
# --- Select only a few attribute columns from the table
table=table.select(["id", "height", "geometry", "names"]),
extruded=True,
get_elevation=heights,
get_fill_color=colors,
)
Map instance using lonboard#
Configure the view state so the map starts pitched.
view_state = {
"longitude": 24.940576,
"latitude": 60.176289,
"zoom": 13.6,
"pitch": 40,
"bearing": 4,
}
m = Map(layer, view_state=view_state)
# set a name for html
html_file = 'helsinki_buildings.html'
#path
html_path = os.path.join('output', html_file)
m.to_html(html_path)
You will find some issues in some cases visualizing the m instance. So, we will open an iFrame using a local HTML that we downloaded.
# IFrame(src=html_path, width=1600 , height=700)
To GeoParquet#
In order to transform our data to GeoParquet we first need to create a GeoDataFrame. The main idea of having GeoParquet is that we have less memory used when writing and storing files.
# to GeoDataFrame
geodata = gpd.GeoDataFrame(table.to_pandas(),
geometry=gpd.GeoSeries.from_wkb(table['geometry'])
)
geodata.head(3)
| id | geometry | bbox | version | update_time | sources | subtype | names | class | level | ... | min_height | min_floor | facade_color | facade_material | roof_material | roof_shape | roof_direction | roof_orientation | roof_color | eave_height | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 08b0899790774fff020055d68982d703 | POLYGON ((24.50829 60.12120, 24.50817 60.12114... | {'xmin': 24.50817108154297, 'xmax': 24.5083942... | 0 | 2021-03-03T09:13:10.000Z | [{'property': '', 'dataset': 'OpenStreetMap', ... | residential | None | garage | NaN | ... | NaN | NaN | None | None | None | None | NaN | None | None | NaN |
| 1 | 08b0899790776fff0200be20f6049017 | POLYGON ((24.50847 60.12105, 24.50837 60.12100... | {'xmin': 24.50836753845215, 'xmax': 24.5087223... | 0 | 2021-03-03T09:13:10.000Z | [{'property': '', 'dataset': 'OpenStreetMap', ... | None | None | None | NaN | ... | NaN | NaN | None | None | None | None | NaN | None | None | NaN |
| 2 | 08b0899790773fff02007229f1b4e6c4 | POLYGON ((24.50899 60.12138, 24.50904 60.12131... | {'xmin': 24.508981704711914, 'xmax': 24.509342... | 0 | 2021-03-03T09:13:10.000Z | [{'property': '', 'dataset': 'OpenStreetMap', ... | None | None | None | NaN | ... | NaN | NaN | None | None | None | None | NaN | None | None | NaN |
3 rows × 23 columns
# let's check how much memory is it using
memory = geodata.memory_usage(index=True).sum()
print(f'Total memory used for GeoDataFrame: {memory/1000000000} GB')
Total memory used for GeoDataFrame: 0.037009499 GB
# to Geoparquet
geodata.to_parquet('output/buildings_helsinki.parquet', index=False)
POIS of Finland (National)#
Define Finland bbox#
Same as we did for Helsinki, we are going to use a bounding box that fetches data from Finland. We will deal with administrative borders to keep only the POIS and Finland country boundaries.

fin_bbox = 19.84, 59.68, 31.78, 70.25
Get data POIS#
Fetch all POIS from our bounding box that covers Finland.
%%time
table = overturemaps.record_batch_reader("place", fin_bbox).read_all()
# Temporarily required as of Lonboard 0.8 to avoid a Lonboard bug
# table = table.combine_chunks()
CPU times: user 1.38 s, sys: 478 ms, total: 1.86 s
Wall time: 7.89 s
To Finland border#
To get the data delimited to the Finland’s border we will use the Country Admin layer. Let’s download it into our working space first. We need to do this delimitation using Clip from Geopandas
We use this dataset due to quality accuracy.
!wget -N https://a3s.fi/swift/v1/AUTH_a6b8530017f34af9861fcf45a738ad3f/L2-CellTowers/L2-CountryAdmin-data.gpkg
--2024-07-08 16:32:18-- https://a3s.fi/swift/v1/AUTH_a6b8530017f34af9861fcf45a738ad3f/L2-CellTowers/L2-CountryAdmin-data.gpkg
Resolving a3s.fi (a3s.fi)... 86.50.254.18, 86.50.254.19
Connecting to a3s.fi (a3s.fi)|86.50.254.18|:443... connected.
HTTP request sent, awaiting response... 304 Not Modified
File ‘L2-CountryAdmin-data.gpkg’ not modified on server. Omitting download.
# read country layer
world_admin = gpd.read_file('L2-CountryAdmin-data.gpkg')
# clean and get needed columns
world_admin = world_admin.dropna()
world_gdf = world_admin[['name', 'continent', 'geometry']]
world_gdf.head(3)
| name | continent | geometry | |
|---|---|---|---|
| 0 | Uganda | Africa | MULTIPOLYGON (((33.92110 -1.00194, 33.92027 -1... |
| 1 | Uzbekistan | Asia | MULTIPOLYGON (((70.97081 42.25467, 70.98054 42... |
| 2 | Ireland | Europe | MULTIPOLYGON (((-9.97014 54.02083, -9.93833 53... |
# get Finland border
fin = world_gdf.loc[world_gdf['name']=='Finland']
ax = fin.plot();
ax.axis('off');
Let’s transform our arrow table into a geodataframe to operate the Clip
# to GeoDataFrame
geodata = gpd.GeoDataFrame(table.to_pandas(),
geometry=gpd.GeoSeries.from_wkb(table['geometry'])
)
# Clip data to Finland
fin_pois = gpd.clip(geodata, fin)
fin_pois.head(3)
| id | geometry | bbox | version | update_time | sources | names | categories | confidence | websites | socials | emails | phones | brand | addresses | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 80379 | 08f1126d13813811035492970601cdd4 | POINT (25.02573 60.15663) | {'xmin': 25.025728225708008, 'xmax': 25.025732... | 0 | 2024-05-10T00:00:00.000Z | [{'property': '', 'dataset': 'meta', 'record_i... | {'primary': 'Koivusaari', 'common': None, 'rul... | {'main': 'active_life', 'alternate': ['island']} | 0.545360 | None | [https://www.facebook.com/407047872667629] | None | None | None | [{'freeform': None, 'locality': 'Helsinki', 'p... |
| 80394 | 08f1126d13b98c750373f6c5e88f8cfc | POINT (25.03381 60.16034) | {'xmin': 25.033811569213867, 'xmax': 25.033815... | 0 | 2024-05-10T00:00:00.000Z | [{'property': '', 'dataset': 'meta', 'record_i... | {'primary': 'Tahvonlahdenniemi', 'common': Non... | {'main': 'attractions_and_activities', 'altern... | 0.667112 | None | [https://www.facebook.com/268426200024634] | None | None | None | [{'freeform': None, 'locality': None, 'postcod... |
| 80396 | 08f1126d13710ca603d51db4e2126d09 | POINT (25.04591 60.15272) | {'xmin': 25.045907974243164, 'xmax': 25.045911... | 0 | 2024-05-10T00:00:00.000Z | [{'property': '', 'dataset': 'meta', 'record_i... | {'primary': 'Santahaminan ala-asteen koulu', '... | {'main': 'elementary_school', 'alternate': None} | 0.917202 | [https://www.hel.fi/peruskoulut/fi/koulut/sant... | [https://www.facebook.com/318142011597385] | None | [+358931089709] | None | [{'freeform': 'Santahaminantie 5', 'locality':... |
len(fin_pois)
181531
# let's check how much memory is it using
memory_fin = fin_pois.memory_usage(index=True).sum()
print(f'Total memory used for GeoDataFrame: {memory_fin/1000000000} GB')
Total memory used for GeoDataFrame: 0.022509844 GB
# to Geoparquet
fin_pois.to_parquet('output/pois_finland.parquet', index=False)
Map from GeoDataFrame#
Previously, we used the arrow table to visualization in lonboard. Now, we are going to test it using the converted GeoDataFrame.
YlGnBu_9.mpl_colormap
# let's apply the color palette
confidence_array = fin_pois['confidence'].to_numpy()
colors = apply_continuous_cmap(confidence_array, YlGnBu_9)
# ---- Create a Layer
layer = ScatterplotLayer.from_geopandas(
# --- Select only a few attribute columns from the table
fin_pois[["id", "categories", "geometry", "confidence"]],
get_radius = np.array([500*value for value in fin_pois['confidence'].to_numpy()]),
get_fill_color=colors,
)
view_state = {
"longitude": 26.814052,
"latitude": 62.399194,
"zoom": 6,
"pitch": 40,
"bearing": 4,
}
m = Map(layer, view_state=view_state)
# set a name for html
html_file = 'finland_pois_border.html'
#path
html_path = os.path.join('output', html_file)
m.to_html(html_path)
# IFrame(src=html_path, width=1600 , height=700)
POIS Worldwide (Global)#
Grid strategy#
We are going to create a Grid that covers globally the fetching process. We will use our World Admin layer and define a grid size in degrees. Then, the strategy is to create a process that fetches POI data per Grid and stores it in the local disk. This process is the one we will parallelize per each Grid.
We use degrees to avoid projection issues. By using degrees we can divide the globe in exact numbers.
Check the next function
def create_grid(study_area, grid_size_m):
'''
Give back a Geodataframe with specific grid size based on study area
study_area: geodaframe. default point geometry. crs not geographic
grid_size_m: size of grid side in meters
return: Geodataframe of polygons. Grid
'''
from shapely.geometry import Polygon
print(f'CRS of study area: {study_area.crs.name}')
# get bounds of province
# xmin, ymin, xmax, ymax = study_area.buffer(100).total_bounds
xmin, ymin, xmax, ymax = -180, -90, 180, 90
# grid size
length = grid_size_m
wide = grid_size_m
cols = list(np.arange(xmin, xmax + wide, wide))
rows = list(np.arange(ymin, ymax + length, length))
polygons = []
for x in cols[:-1]:
for y in rows[:-1]:
polygons.append(Polygon([(x,y), (x+wide, y), (x+wide, y+length), (x, y+length)]))
# grid = gpd.GeoDataFrame({'geometry':polygons}, crs = study_area.crs)
grid = gpd.GeoDataFrame({'geometry':polygons}, crs = 4326)
# ----- Remove empty grids
index_true = []
for row in grid.itertuples():
# check if contains
contains = study_area.intersects(row.geometry)
if True in contains.unique():
# catch index of grid that contains study area
index_true.append(row.Index)
# subset only valid grids
grid_valid = grid.loc[grid.index.isin(index_true)]
# reset index
grid_valid = grid_valid.reset_index(drop=True)
print(f'Total grids: {len(grid_valid)}')
return grid_valid
Global limits covered#
We will get a Geographic grid within the limits of the global boundaries.
We will start defining all Continents and Islands in the globe. We use the given country admin layer with high resolution that contains islands and we add the Antarctic from Natural Earth low resolution.
# using Natural Earth to add Antarctica
gdf = gpd.read_file(gpd.datasets.get_path('naturalearth_lowres'))
Antarctica = gdf.loc[gdf.continent=='Antarctica'][['name', 'geometry']];
# merge
world_gdf_globe = pd.concat([world_gdf[['name', 'geometry']], Antarctica])
print(f'Crs: {world_gdf_globe.crs.name}')
ax = world_gdf_globe.plot(figsize=(10,6))
ax.axis('off');
Crs: WGS 84
world_gdf_globe.crs.name
'WGS 84'
Grid creation and visualization#
Let’s use our function create_grid. It will give back a WGS84 grid layer that is overlapped in the administrative boundaries defined above.
# create a global grid
grid_globe = create_grid(world_gdf_globe, 10)
CRS of study area: WGS 84
Total grids: 428
print(f'Total Cells: {len(grid_globe)}')
ax = grid_globe.plot(figsize=(10,6), edgecolor='black')
ax.axis('off');
Total Cells: 428
grid_globe.to_file('output/grid_global.gpkg')
fig, ax = plt.subplots(figsize=(16, 14))
# plot
world_gdf_globe.plot(ax=ax,
# color='darkblue',
# markersize= 3,
alpha = 0.6,
)
grid_globe.plot(ax=ax,
facecolor="none",
edgecolor="black",
linewidth=0.6,
alpha=0.6
)
# ------- lables
grid_globe['coords'] = grid_globe['geometry'].apply(lambda x: x.representative_point().coords[:][0])
grid_globe['index'] = grid_globe.index
for idx, row in grid_globe.iterrows():
ax.text(row.coords[0],
row.coords[1],
s=row['index'],
horizontalalignment='center',
# color='red',
fontsize=6,
bbox={'facecolor': 'white', 'alpha':0.4, 'pad': 1, 'edgecolor':'none'})
plt.axis('off');
# a frame for names
names = gpd.GeoDataFrame({'name':[f'n{num}' for num in grid_globe.index],
'geometry':grid_globe.geometry.centroid})
%%time
# view state pydeck
view_state = pydeck.ViewState(latitude=40, longitude=0, zoom=0)
# set height and width variables
view = pydeck.View(type="_GlobeView", controller=True)#, width=800, height=800)
layers = [
pydeck.Layer(
"GeoJsonLayer",
id="world-admin",
data=world_gdf_globe,
stroked=False,
filled=True,
get_fill_color=[8, 173, 225], #, [230, 230, 230], # grey
# get_line_color=[8, 173, 225],
line_width_min_pixels=0.6
),
pydeck.Layer(
"GeoJsonLayer",
id="world-grid",
data=grid_globe,
stroked=True,
filled=False,
# get_fill_color=[230, 230, 230],
get_line_color=[75, 132, 216], #[216, 57, 57],
line_width_min_pixels=1
),
pydeck.Layer(
"TextLayer",
data=names,
get_position="geometry.coordinates",
get_size=12,
get_color=[75, 132, 216], #[216, 57, 57],
get_text="name",
),
]
deck = pydeck.Deck(
views=[view],
initial_view_state=view_state,
layers=layers,
map_provider=None,
# Note that this must be set for the globe to be opaque
parameters={"cull": True},
)
# deck.to_html("output/map.html")
CPU times: user 450 ms, sys: 20.4 ms, total: 470 ms
Wall time: 469 ms
deck