Lesson 3. Land Cover Classification

Lesson 3. Land Cover Classification#

Land Cover classification using Random Forest with in-situ and EO data in Finland#

Background#

This type of land cover interpretation provides the background for various environmental monitoring applications, such as the calculation of environmental indicators (e.g. “land degradation”), biotope inventories, nutrient loss modelling, and assessment of the consequences of legislation or land use planning.

The Finnish Environment Institute (SYKE) has worked since the 1990s recognizing the importance of such up-to-date spatial data. SYKE has participated in the European land cover monitoring since the early 2000s within the Corine programme, which has produced Corine land cover interpretations and modifications for the years 2000, 2006, 2012, and 2018. In addition, more local interpretations have been made with various partners, such as the remote sensing of Northern Lapland carried out for Metsähallitus in recent years. There are also various projects, such as FEO and Geoportti, which aim to improve the availability of environmental spatial information.

More information check the Land Cover Monitoring in Finland

Objective#

To perform Land Cover Classification using Random Forest ML with Python in a selected region of Finland.
To use the Lucas 2018 inventory dataset and Earth Observation data from the GeoCubes Finland service.

Learning outcome#

Generally, during this tutorial, you will learn:

How to conduct a generalized Land Cover classification using the scikit-learn Python library
How to use and download data from the GeoCubes service
How to use the Dask library in CSC’s Puhti supercomputer
How to interpret the classification results and improve them

Input datasets:#

1. In situ data - Lucas 2018#

In this exercise the In-situ dataset is the Lucas 2018 land inventory dataset from Finland. LUCAS (Land USe and Coverage Area frame Survey) is a land inventory organized by EUROSTAT, the statistical office of the European Union, which collects data on land cover and land use from the survey plots. There are about 330,000 plots throughout the EU, of which about 17,000 are in Finland. Part of the data collection is done by field visits and part by interpreting aerial photographs.

We are going to download the Lucas 2018 scores for the wholeof Finland as lucas_2018.shp during the exercise and you will find it available in the output folder. If you want to explore it independently you can download it here. The dataset contains the main land cover group in the column “lc_class” including province delimitation.

A more detailed land cover class can be found in column LC1 and for more detailed information on land cover and use classes check the Lucas 2018 Technical Document if needed.

Here is a quick summary of the codes:

Code	Land Cover
1	Build up areas
2	Agriculture
3	Forests
4	Sparsely vegetated forests
5	Grasslands
6	Unvegetated soil
7	Water
8	Wetlands

2. EO data features#

In this exercise, you will use raster datasets found in the GeoCubes Finland service as classification features. So, select a province and download the data of your choice, delimited by that province, using a pixel size of 10 m.

The variables we will use as classification features from GeoCubes Finland are:

Name	Feature
ndvi_max	Normalized Vegetation Index
dem	Digital Elevation Model
tree_classes	Tree density
veg_height	Vegetation density
canopy_cover	Canopy cover
s2_nir	Sentinel 2 Near Infrared
s2_rgb	Sentinel 2 bands: red, green, and blue

Study area#

For this exercise, we have selected the Lapland province. Feel free to choose the area of your preference. You will notice that even a smaller area can be selected by Bounding Box (Grid) from every province. Follow step by step the exercise.

HPC Resources#

CSC Machine-Puhti:

*Partition
Cores 8
Local Memory 64 Gb
Disk 240 Gb

Import needed Python libraries#

# for geo-data analysis
import rasterio
from rasterio.merge import merge
from rasterio.plot import show, show_hist
from shapely.geometry import Polygon
import rioxarray as rxr
import numpy as np
import xarray as xr
import geopandas as gpd
import pandas as pd

# for visualization
# import datashader as ds
# import datashader.transfer_functions as ds_function
# from datashader.utils import export_image
import matplotlib.pyplot as plt
# import colorcet as cc

import glob
import os, sys, shutil
import urllib
import time
from datetime import date

# for parallelization
import multiprocessing as mp
from dask import delayed
from dask import compute
from dask.diagnostics import ProgressBar
from dask.distributed import Client
import dask
import dask.array as da

# for ML modelling and running
from sklearn.preprocessing import LabelEncoder
from pprint import pprint
from tqdm.notebook import tqdm
import itertools

import tpot
import argparse
import pickle

Import Point-EO from local#

The PointEO is a Python library (scripts) stored locally that were developed internally in SYKE. The purpose of the library is to facilitate the land use prediction using the Random Forest model. By packing the function in a local library it is easier to call them all just when they are needed and makes the process simpler.

The library is given already in the local folder and you just need to run the command sys.path.append() to call available libraries in your home directory.

Feel free to take a look at the Point-EO GitHub Repository.

pointeo

# adding point-eo from local scripts
home = os.getcwd()
sys.path.append(home)

import point_eo.scripts.sample_xarray as sample_xarray # NOTE: this is my modification to original point-eo repo
import point_eo.scripts.analysis as analysis
import point_eo.scripts.tpot_train as tpot_train
import point_eo.scripts.predict as predict
import point_eo.scripts.postprocess_prediction as postprocess_prediction

We will create an output folder and we will name it basepth

# results folder
if not os.path.exists('output'):
    os.makedirs('output')
    
basepth = os.path.join(home, 'output')

Download Lucas 2018 from Allas#

The dataset was stored previously in Allas, the CSC’s cloud storage, and we will download it to our root folder for further use using wget. In case you want to save in a specific folder use the command -P folder_path

!wget -N https://a3s.fi/geoportti_training/lucas2018.tar

--2024-04-08 17:04:47--  https://a3s.fi/geoportti_training/lucas2018.tar
Resolving a3s.fi (a3s.fi)... 86.50.254.19, 86.50.254.18
Connecting to a3s.fi (a3s.fi)|86.50.254.19|:443... connected.
HTTP request sent, awaiting response... 304 Not Modified
File ‘lucas2018.tar’ not modified on server. Omitting download.

Then, we will decompress the datasets in the root folder using the command tar. The command tar unzip previously compressed data and you need to specify the .tar file and the folder where you want to decompress using -C command.

!tar -xf 'lucas2018.tar'

Once you have run the cell you will find the Lucas 2018 data in the subfolder training_data.

The folder contains the data for the whole of Finland and each province.

Let’s take a look at the files and let’s visualize the data at the national level and the selected province we are working on.

# check the files in Lucas 2018 training_data folder
lucas_paths = glob.glob(os.path.join(home, 'training_data/*.shp'))

lucas_files = [path.split('/')[-1] for path in lucas_paths]

print('Files for Lucas 2018 per province:\n')

for file in sorted(lucas_files):

    print('  ', file)

Files for Lucas 2018 per province:

   lucas2018_10_etelasavo.shp
   lucas2018_11_pohjoissavo.shp
   lucas2018_12_pohjoiskarjala.shp
   lucas2018_13_keskisuomi.shp
   lucas2018_14_etelapohjanmaa.shp
   lucas2018_15_pohjanmaa.shp
   lucas2018_16_keskipohjanmaa.shp
   lucas2018_17_pohjoispohjanmaa.shp
   lucas2018_18_kainuu.shp
   lucas2018_19_lappi.shp
   lucas2018_1_uusimaa.shp
   lucas2018_21_ahvenanmaa.shp
   lucas2018_2_varsinaissuomi.shp
   lucas2018_4_satakunta.shp
   lucas2018_5_kantahame.shp
   lucas2018_6_pirkanmaa.shp
   lucas2018_7_paijathame.shp
   lucas2018_8_kymenlaakso.shp
   lucas2018_9_etelakarjala.shp
   lucas_2018.shp

Lucas 2018 in Finland#

Let’s give a quick look at national level.

%%time

# Lucas 2018 in Finland
fi_path = os.path.join(home, 'training_data/lucas_2018.shp')
lucas2018_fi = gpd.read_file(fi_path)

CPU times: user 3.44 s, sys: 84 ms, total: 3.52 s
Wall time: 5.33 s

lucas2018_fi.dtypes

OBJECTID         int64
POINT_ID         int64
OFFICE_PI        int64
SURVEYDATE      object
OBS_DIST        object
OBS_DIRECT       int64
OBS_TYPE         int64
LC1             object
LC2             object
LU1             object
LU2             object
LC1_name        object
LU1_name        object
photolinkN      object
photolinkE      object
photolinkS      object
photolinkW      object
photolinkP      object
LCluokka         int64
maakunta_1       int64
geometry      geometry
dtype: object

lucas2018_fi.LU1_name.value_counts()

LU1_name
Forestry                                                 7864
Agriculture                                              3535
Semi-natural and natural areas not in use                2452
Residential                                               711
Road transport                                            698
Amenities, museums, leisure                               224
Mining and quarrying                                      171
Fallow land                                                76
Community services                                         60
Electricity, gas and thermal power distribution            58
Other abandoned areas                                      56
Commerce                                                   38
Railway transport                                          34
Sport                                                      34
Kitchen garden                                             33
Logistics and storage                                      28
Abandoned residential areas                                18
Wood based products                                        15
Air transport                                              14
Construction                                               13
Water transport                                             9
Machinery and equipment                                     6
Abandoned transport areas                                   4
Chemical and allied industries and manufacturing            4
Abandoned industrial areas                                  4
Aquaculture and fishing                                     4
Coal, oil and metal processing                              3
Energy production                                           3
Waste treatment                                             3
Manufacturing of food, beverages and tobacco products       2
Transport via pipelines                                     1
Financial, professional and information services            1
Production of non-metal mineral goods                       1
Telecommunication                                           1
Other primary production                                    1
Printing and reproduction                                   1
Name: count, dtype: int64

plt.style.use('ggplot')

fig, ax = plt.subplots(figsize=(6, 8))

# set class
class_name = ['Forestry', 'Agriculture', 'Residential', 'Semi-natural and natural areas not in use']
lu_class = lucas2018_fi.loc[lucas2018_fi['LU1_name'].isin(class_name)].sort_values('LU1_name', ascending=False)

# plot
lu_class.plot(ax=ax,
               column='LU1_name', 
               categorical=True, 
               cmap = 'tab10_r',
               markersize= 10,
               alpha = 0.6,
               legend=True,
               legend_kwds={'bbox_to_anchor':(1.2, 0.8)}
              )

plt.axis('off');

../../_images/a95f352d8b5810d164868563e449adcf2905b9e358cd740f6bab7a224dc35f92.png

Lucas 2018 in Lapland#

We will work in this exercise using the Lapland region.

# get path
lapland_path = os.path.join(home, 'training_data/lucas2018_19_lappi.shp')

# read
lucas2018_lapland = gpd.read_file(lapland_path)

lucas2018_lapland.LU1_name.value_counts()

LU1_name
Semi-natural and natural areas not in use          1333
Forestry                                           1053
Agriculture                                          61
Road transport                                       50
Residential                                          26
Amenities, museums, leisure                          21
Mining and quarrying                                 12
Other abandoned areas                                 9
Community services                                    8
Electricity, gas and thermal power distribution       4
Abandoned transport areas                             4
Fallow land                                           3
Air transport                                         2
Commerce                                              2
Water transport                                       2
Logistics and storage                                 1
Construction                                          1
Wood based products                                   1
Coal, oil and metal processing                        1
Production of non-metal mineral goods                 1
Other primary production                              1
Energy production                                     1
Name: count, dtype: int64

lucas2018_lapland.crs.name

'ETRS89 / TM35FIN(E,N)'

plt.style.use('ggplot')

fig, ax = plt.subplots(figsize=(4, 6))

# set class
class_name = ['Forestry', 'Agriculture', 'Residential', 'Semi-natural and natural areas not in use', 'Road transport',]
lu_class = lucas2018_lapland.loc[lucas2018_lapland['LU1_name'].isin(class_name)].sort_values('LU1_name', ascending=False)

# plot
lu_class.plot(ax=ax,
               column='LU1_name', 
               categorical=True, 
               cmap = 'tab10',
               markersize= 12,
               alpha = 0.8,
               legend=True,
               legend_kwds={'bbox_to_anchor':(1.2, 0.8)}
              )

plt.axis('off');

../../_images/b5d8f9bd60cf4c64696994f72020a6aadc9c955986c76655866afb99a58bc709.png

Download input rasters from Allas#

To keep this exercise straightforward we have stored the input raster in Allas. This will make it easy to download and use in the workflow.

You will download a folder input_data that contains:

Filename	Feature	Used in model
ndvi_max_2021_lapland.tif	Normalized Vegetation Index	Yes
km2_lapland.tif	Digital Elevation Model	Yes
tree_classes_lapland.tif	Tree density	Yes
vegetation_height_lapland.tif	Vegetation density	Yes
canopy_cover_lapland.tif	Canopy cover	Yes
s2_nir_lapland.tif	Sentinel 2 Near Infrared	Yes
s2_rgb_lapland.tif	Sentinel 2 bands: red, green, and blue	No
s2_rgb_lapland_band1.tif	Sentinel 2 band blue	Yes
s2_rgb_lapland_band2.tif	Sentinel 2 band green	Yes
s2_rgb_lapland_band3.tif	Sentinel 2 band red	Yes

!wget -N https://a3s.fi/geoportti_training/input_data.tar

--2024-04-08 17:05:02--  https://a3s.fi/geoportti_training/input_data.tar
Resolving a3s.fi (a3s.fi)... 86.50.254.18, 86.50.254.19
Connecting to a3s.fi (a3s.fi)|86.50.254.18|:443... connected.
HTTP request sent, awaiting response... 304 Not Modified
File ‘input_data.tar’ not modified on server. Omitting download.

!tar -xf 'input_data.tar'

Using GeoCubes for downloading data (Optional)#

GeoCubes offers raster data for different provinces. You can download datasets using urllib by selecting the region and data from API links in GeoCubes API access.

Attention

If you want to download data from GeoCubes using another Finland province of your interest be aware that sometimes the API might not be working due to maintenance

The next code will help you to download data using API endpoints. It uses a new created folder and urllib.request.urlretrieve()

            # -- Create a results folder
            
            if not os.path.exists('input_data'):
                os.makedirs('input_data')
                
            # -- Function to download from GeoCubes
            
            def download_data(url, output_folder, file_name):
                '''
                Download images from URL to a local path 
                    - params: URL, string
                    - output_folder, string
                    - file_name, string
                '''
                # get url and path
                download_url = url
                out_fn = os.path.join(output_folder, file_name)
                
                print(f"Downloading... {download_url}")
                
                try:
                urllib request
                    r = urllib.request.urlretrieve(download_url, out_fn)
                    
                    print(f"Download completed. Results saved to {out_fn}/n")
            
                except:
            
                    print(f"URL not valid {download_url}\n")
            
            # -- Using function (Example)
            
            url = 'https://vm0160.kaj.pouta.csc.fi/geocubes/clip/10/puustoisuusluokat/maakuntajako:19/202'
            input_data_folder = os.path.join(home, 'input_data')
            file_name = 'tree_classes_lapland.tif'
            
            # run
            download_data(url, input_data_folder, file_name)

Split multi-band rasters to single-band (Optional)#

If you have downloaded data from a different province using GeoCubes be aware that RGB images are a composite of Red, Green, and Blue images. As we prefer to have independent data it is recommended to split this multi-band image into single-band images.

For that, you can use the next function in your workflow:

            # -- Function to split a multi-band raster
            
            def split_multiband_raster(input_path, output_folder):
                '''
                Read from local folder a multi-band image and split it single-band
                    - input_path: multi-band tif
                    - output_folder: store single-band tif folder
                '''
                
                # Open the multiband raster
                ds = rxr.open_rasterio(input_path)
            
                # Get the number of bands
                num_bands = ds.shape[0]
            
                # Loop through each band
                for band_index in range(num_bands):
                    
                    # Get the band data
                    band = ds[band_index]
            
                    # Create the output file name
                    output_name = f"multi-band_split_band_{band_index + 1}.tif"
                    output_path = os.path.join(output_folder, output_name)
            
                    # Save the band as a single-band raster
                    band.rio.to_raster(output_path)
            
                    print(f"Band {band_index + 1} saved as {output_path}")
            
             # -- Using function (Example)
            
            input_raster = "input_data/s2_rgb_lapland.tif"
            output_folder = "input_data"
            
            # run
            split_multiband_raster(input_raster, output_folder)

Visualization of input raster - NDVI example#

Let’s take a look at the input raster data we are going to use. The library that support raster visualization is rasterio.

# remove the file we are not using
remove_path = glob.glob(os.path.join(home, 'input_data/s2_rgb_lapland.tif'))

# delete not needed file RGB Composition
os.remove(remove_path[0])

# check the files in raster input_data folder
raster_paths = glob.glob(os.path.join(home, 'input_data/*.tif'))

raster_files = [path.split('/')[-1] for path in raster_paths]

print('Files for input raster per variable:\n')

for file in sorted(raster_files):

    print('  ', file)

Files for input raster per variable:

   canopy_cover_lapland.tif
   km2_lapland.tif
   ndvi_max_2021_lapland.tif
   s2_nir_lapland.tif
   s2_rgb_lapland_band1.tif
   s2_rgb_lapland_band2.tif
   s2_rgb_lapland_band3.tif
   tree_classes_lapland.tif
   vegetation_height_lapland.tif

raster_paths

['/projappl/project_2009245/GeoHPC-dev/source/lessons/L3/input_data/canopy_cover_lapland.tif',
 '/projappl/project_2009245/GeoHPC-dev/source/lessons/L3/input_data/ndvi_max_2021_lapland.tif',
 '/projappl/project_2009245/GeoHPC-dev/source/lessons/L3/input_data/vegetation_height_lapland.tif',
 '/projappl/project_2009245/GeoHPC-dev/source/lessons/L3/input_data/s2_rgb_lapland_band1.tif',
 '/projappl/project_2009245/GeoHPC-dev/source/lessons/L3/input_data/s2_rgb_lapland_band3.tif',
 '/projappl/project_2009245/GeoHPC-dev/source/lessons/L3/input_data/s2_nir_lapland.tif',
 '/projappl/project_2009245/GeoHPC-dev/source/lessons/L3/input_data/tree_classes_lapland.tif',
 '/projappl/project_2009245/GeoHPC-dev/source/lessons/L3/input_data/s2_rgb_lapland_band2.tif',
 '/projappl/project_2009245/GeoHPC-dev/source/lessons/L3/input_data/km2_lapland.tif']

fig, ax = plt.subplots(figsize=(8, 10), layout='constrained')

# get path and open
source = 'input_data/ndvi_max_2021_lapland.tif'
img = rasterio.open(source)

# fig
show(img.read(), title='NDVI Lapland', cmap='Greens', ax=ax);

# colorbar
cmap = plt.cm.Greens

fig.colorbar(plt.cm.ScalarMappable(cmap=cmap), 
             ax=ax, 
            )

plt.axis('off');

../../_images/721aadd069bcd41c786dbb61f95fe77b5e036a7719356136f6637725136f03e5.png

Land Use Class frequency visualization#

Let’s take a look at the frequency of Land Use Classes in our selected province. The column we will use is the LCluokka

# Land Use class
lucas2018_lapland.LCluokka.hist(figsize=(8, 4), bins=20)

plt.xticks(ticks = [0, 1, 2, 3, 4, 5, 6, 7, 8],
            labels=['non-built', 'Build up areas', 'Agriculture', 'Forests', 
                    'Sparsely vegetated forests', 'Grasslands',
                    'Unvegetated soil', 'Water', 'Wetlands'],
          rotation=70
          );
    

../../_images/2364aa5b4f6e7b14ffa63e25864346e53a1089723e919bbcde36197e5c837528.png

Store bands for point-eo sampling script#

We will keep the band names in a band_names.txt file for further use in our model.

band_name_file = os.path.join(basepth, 'band_names.txt')

with open(band_name_file, 'w') as fp:
    
    for item in band_names:
        
        fp.write("%s\n" % item)

Sample raster using point-eo for sampling#

In this process we are going to sample the raster data we have using the Lucas 2018 dataset. To use the point-eo tool we will use a parsing library called argparse that creates inputs variables while using scripts in terminal.

Follow the next process using the sample_xarray function from point-eo tool.

%%time

# -- paths
input_shape = os.path.join(home, 'training_data/lucas2018_19_lappi.shp')
output_sample_folder = os.path.join(basepth,'sampled')

# --  initiate parsing
parser = argparse.ArgumentParser()
subparsers = parser.add_subparsers(title="Available commands", dest="script")

# -- choose sampling xarrat - data cube
sample_xarray.add_args(subparsers)

# -- arguments
args = parser.parse_args(['sample_raster',
                            '--input', input_shape,
                            '--input_raster', 'dummy_placeholder',
                            '--target', 'LCluokka',
                            '--band_names', band_name_file,
                            '--out_folder', output_sample_folder])

# -- set input_raster to xarray variable - Data Cube
args.input_raster = cube

# -- RUN
sample_xarray.main(args)

Sampling xarray dataset using points from /projappl/project_2009245/GeoHPC-dev/source/lessons/L3/training_data/lucas2018_19_lappi.shp
Using bandnames:
['canopy_cover_lapland',
 'ndvi_max_2021_lapland',
 'vegetation_height_lapland',
 's2_rgb_lapland_band1',
 's2_rgb_lapland_band3',
 's2_nir_lapland',
 'tree_classes_lapland',
 's2_rgb_lapland_band2',
 'km2_lapland']
Saved outputs to /projappl/project_2009245/GeoHPC-dev/source/lessons/L3/output/sampled/xarray__lucas2018_19_lappi__LCluokka.csv
CPU times: user 35.2 s, sys: 25.9 s, total: 1min 1s
Wall time: 3min 13s

# read sampling file
path = os.path.join(basepth,'sampled', 'xarray__lucas2018_19_lappi__LCluokka.csv')

df = pd.read_csv(path)

df.head()

	LCluokka	canopy_cover_lapland	ndvi_max_2021_lapland	vegetation_height_lapland	s2_rgb_lapland_band1	s2_rgb_lapland_band3	s2_nir_lapland	tree_classes_lapland	s2_rgb_lapland_band2	km2_lapland
0	6	127.0	112.0	255.0	1047.0	841.0	1500.0	30.0	977.0	859.72390
1	6	127.0	113.0	255.0	1135.0	995.0	1510.0	30.0	1103.0	778.58010
2	5	127.0	142.0	255.0	743.0	600.0	2150.0	30.0	730.0	699.85986
3	3	15.0	178.0	4.0	224.0	271.0	3281.0	11.0	423.0	197.96394
4	3	2.0	167.0	3.0	267.0	311.0	2159.0	11.0	429.0	151.15020

Training with Random Forest model#

Let’s continue training our model using Random Forest implemented in point-eo tool.

#  Training, analysis with the defaults (Random Forest)

# -- paths
input_sampling_file = os.path.join(basepth,'sampled','xarray__lucas2018_19_lappi__LCluokka.csv')
rf_folder = os.path.join(basepth,'model_fitting_RF')


# -- start parser
parser = argparse.ArgumentParser()
subparsers = parser.add_subparsers(title="Available commands", dest="script")
analysis.add_args(subparsers)

# -- arguments
args = parser.parse_args(['analysis',
                            '--input', input_sampling_file,
                            '--out_folder', rf_folder,
                            '--out_prefix', 'RF_geohpc_',
                            '--decimal', '.',
                            '--separator',','])

# -- RUN
analysis.main(args)

Input csv: /projappl/project_2009245/GeoHPC-dev/source/lessons/L3/output/sampled/xarray__lucas2018_19_lappi__LCluokka.csv
Model: RF


### Data info ###

Columns. First one is chosen as target
Index		Column
0		LCluokka
1		canopy_cover_lapland
2		ndvi_max_2021_lapland
3		vegetation_height_lapland
4		s2_rgb_lapland_band1
5		s2_rgb_lapland_band3
6		s2_nir_lapland
7		tree_classes_lapland
8		s2_rgb_lapland_band2
9		km2_lapland


Shape of data table:
rows: 2598
columns: 10

Target class distribution
label	count
LCluokka
3    1271
8     741
4     235
5     194
0      52
6      42
7      38
2      17
1       8
Name: count, dtype: int64




### Starting cross-validation ###

Random forest parameters:
{'bootstrap': True, 'ccp_alpha': 0.0, 'class_weight': None, 'criterion': 'gini', 'max_depth': None, 'max_features': 'sqrt', 'max_leaf_nodes': None, 'max_samples': None, 'min_impurity_decrease': 0.0, 'min_samples_leaf': 1, 'min_samples_split': 2, 'min_weight_fraction_leaf': 0.0, 'monotonic_cst': None, 'n_estimators': 100, 'n_jobs': -1, 'oob_score': False, 'random_state': None, 'verbose': 0, 'warm_start': False}

Fold 0:
Accuracy: 0.721
Precision: 0.680
F1: 0.692

Fold 1:
Accuracy: 0.688
Precision: 0.632
F1: 0.652

Fold 2:
Accuracy: 0.723
Precision: 0.682
F1: 0.693

Fold 3:
Accuracy: 0.711
Precision: 0.661
F1: 0.679

Fold 4:
Accuracy: 0.717
Precision: 0.667
F1: 0.682

Overall results:
Accuracy: 0.712
Precision: 0.664
F1: 0.680

Using categorical units to plot a list of strings that are all parsable as floats or dates. If these strings should be plotted as numbers, cast to the appropriate data type before plotting.
Using categorical units to plot a list of strings that are all parsable as floats or dates. If these strings should be plotted as numbers, cast to the appropriate data type before plotting.
Using categorical units to plot a list of strings that are all parsable as floats or dates. If these strings should be plotted as numbers, cast to the appropriate data type before plotting.
Using categorical units to plot a list of strings that are all parsable as floats or dates. If these strings should be plotted as numbers, cast to the appropriate data type before plotting.
Using categorical units to plot a list of strings that are all parsable as floats or dates. If these strings should be plotted as numbers, cast to the appropriate data type before plotting.
Using categorical units to plot a list of strings that are all parsable as floats or dates. If these strings should be plotted as numbers, cast to the appropriate data type before plotting.
              precision    recall  f1-score   support

           0       0.50      0.31      0.38        52
           1       0.00      0.00      0.00         8
           2       0.00      0.00      0.00        17
           3       0.81      0.90      0.85      1271
           4       0.26      0.10      0.14       235
           5       0.32      0.19      0.23       194
           6       0.47      0.40      0.44        42
           7       0.91      0.84      0.88        38
           8       0.66      0.79      0.72       741

    accuracy                           0.71      2598
   macro avg       0.44      0.39      0.40      2598
weighted avg       0.66      0.71      0.68      2598

Saved classification graph to /projappl/project_2009245/GeoHPC-dev/source/lessons/L3/output/model_fitting_RF/RF_geohpc___xarray__lucas2018_19_lappi__LCluokka__2024-04-08T17-11-41_metrics.png

Saved confusion matrix to /projappl/project_2009245/GeoHPC-dev/source/lessons/L3/output/model_fitting_RF/RF_geohpc___xarray__lucas2018_19_lappi__LCluokka__2024-04-08T17-11-41_confusion.png

Saved permutation importance to /projappl/project_2009245/GeoHPC-dev/source/lessons/L3/output/model_fitting_RF/RF_geohpc___xarray__lucas2018_19_lappi__LCluokka__2024-04-08T17-11-41_permutation_importance.png
Saved model to /projappl/project_2009245/GeoHPC-dev/source/lessons/L3/output/model_fitting_RF/RF_geohpc___xarray__lucas2018_19_lappi__LCluokka__2024-04-08T17-11-41_model.pkl

../../_images/18e2d2cb739592d4f19c4a095cd419648830602d5e0ae229cb2bc9afdb97fdd7.png

../../_images/e2b8c8147a7a627adad199a37056842fcc967f553e1d91459d858ca86d690a23.png

../../_images/5a37912c7ea78ab4bff9df075515eb3163c6fa06bedcb91f316bb9cdb9186db7.png

Parallelization of Random Forest#

grid

Grid creation#

We have a quite large area of the province so we are going to create grids that allow us to parallelize the process using each square. Follow step by step about the grid creation and the Land Use Class prediction with Random Forest.

Get familiar with the next function.

def create_grid(study_area, grid_size_m):
    '''
    Give back a Geodataframe with specific grid size based on study area
        study_area: geodaframe. default point geometry. crs not geographic
        grid_size_m: size of grid side in meters

        return: Geodataframe of polygons. Grid
    '''
    print(f'CRS of study area: {study_area.crs.name}')
    
    # get bounds of province
    xmin, ymin, xmax, ymax = study_area.buffer(100).total_bounds

    # grid size
    length = grid_size_m
    wide = grid_size_m

    cols = list(np.arange(xmin, xmax + wide, wide))
    rows = list(np.arange(ymin, ymax + length, length))

    polygons = []
    for x in cols[:-1]:
        for y in rows[:-1]:
            polygons.append(Polygon([(x,y), (x+wide, y), (x+wide, y+length), (x, y+length)]))
    
    grid = gpd.GeoDataFrame({'geometry':polygons}, crs = study_area.crs)

    # ----- Remove empty grids

    index_true = []
    
    for row in grid.itertuples():

        # check if contains
        contains = study_area.intersects(row.geometry)
  
        if True in contains.unique():

            # catch index of grid that contains study area
            index_true.append(row.Index)

    # subset only valid grids
    grid_valid = grid.loc[grid.index.isin(index_true)]

    # reset index
    grid_valid = grid_valid.reset_index(drop=True)
    
    print(f'Total grids: {len(grid_valid)}')
    
    return grid_valid

# create grids
grid = create_grid(lucas2018_lapland, 60000)

CRS of study area: ETRS89 / TM35FIN(E,N)
Total grids: 42

# save grid
grid.to_file('output/grid.gpkg')

plt.style.use('ggplot')

fig, ax = plt.subplots(figsize=(6, 8))

# plot
lucas2018_lapland.plot(ax=ax,
                       color='darkblue',
                       markersize= 3,
                       alpha = 0.6,
                      )

grid.plot(ax=ax,
          facecolor="none", 
          edgecolor="darkred",
          linewidth=1.4,
          alpha=0.6
         )

plt.axis('off');

../../_images/1cadcdf3b9281078fe182319cec2e4bf530356673aa00396a50a700d25694423.png

Automating Inference with Random Forest PointEO#

We are going to pack the inference of Random Forest per area. The next function will use as input the fitted model, the independent grid, and some paths for results.

Note

We added a couple of lines in the function that remove intermediate files. We do it to save memory in the disk.

def run_random_forest(data_cube, grid_square, grid_index, output_folder = basepth):
    '''
    Run random forest using default PointEO parameters with Grid squares
        data_cube: data cube with input raster of the whole study area
        grid_square: geom 
        grid_index: n index of the grid for model prediction
    '''
    grid_index = '00'+ str(grid_index)
    grid_index = grid_index[-2:]

    # -- default filepaths

    # paths: model, predict folder
    
    model_path = glob.glob(os.path.join(basepth, f'model_fitting_RF', f'RF*LCluokka*{date.today()}*_model.pkl'))[0]
    print(f'Model: {model_path}')
    
    predict_folder = os.path.join(basepth, f'predict_output_RF_{grid_index}')
    resolution_file = os.path.join(basepth, f"out_RF_{grid_index}.vrt")
    postprocess_folder = os.path.join(basepth, f'land_use_results_RF')
    
    result_path = os.path.join(basepth, f'preprocess_RF_Grid-{grid_index}.tif')

    # --------------------------

    # --> if exists omit the grid cell

    ms_result = os.path.join(postprocess_folder, f'out_RF_{grid_index}_S.tif')
    
    if not os.path.exists(ms_result):
        
        # get bounds of grid
        xmin = grid_square.bounds[0]
        ymin = grid_square.bounds[1]
        xmax = grid_square.bounds[2]
        ymax = grid_square.bounds[3]
        
        # subset data cube
        Fx = cube.where((xmin < cube.x) 
                        & (cube.x < xmax)
                        & (ymin < cube.y) 
                        & (cube.y < ymax), drop=True).to_array().squeeze()
    
        # cell size of the interference blocks
        grid_cells = predict.create_cell_grid(Fx, 10000)
        
        # load rf-model from the analysis stem
        with open(model_path, "rb") as f:
            RF_model = pickle.load(f)
    
        # ----- predict Random Forest
    
        
        predict.calculate(model = RF_model,
                          Fx = Fx,
                          cell_list = grid_cells,
                          verbose=2,
                          out_folder = predict_folder)
    
        # combine blocks
        predict.merge_folder(predict_folder, resolution_file)
    
        # save as geotiff
        Rx = rxr.open_rasterio(
                                resolution_file,
                                chunks={"band": -1, "x": 2**10, "y": 2**10},
                                lock=False,
                                parallel=True,
                            )
    
        # save as tif
        Rx.rio.to_raster(result_path)
    
        # ----- Postprocess prediction
    
        # initiate parser
        parser = argparse.ArgumentParser()
        subparsers = parser.add_subparsers(title="Available commands", dest="script")
        postprocess_prediction.add_args(subparsers)
    
        # arguments
        args = parser.parse_args(['postprocess_prediction',
                                    '--input_raster', resolution_file,
                                    '--out_folder', postprocess_folder
                                    ])
    
        # run
        postprocess_prediction.main(args)

        # ---------------------------------------------------------- remove intermediate info for saving memory

        # folders
        folders = glob.glob(os.path.join(output_folder, '*output*'))
        
        for folder in folders:
        
            shutil.rmtree(folder)
        
        # files
        files = glob.glob(os.path.join(output_folder, '*.vrt'))
        
        for file in files:
        
            os.remove(file)

        pre_files = glob.glob(os.path.join(output_folder, 'preprocess*.tif'))
    
        for pre_file in pre_files:
    
            os.remove(pre_file)

    else:

        print(f'processed {ms_result}')

Single grid square test#

Let’s check the model using only a single grid square from our grid.

s = time.time()

# test function with 1 grid square

grid_index = 1

# get geom
geom_grid = grid.at[grid_index, 'geometry']

# Run RF
run_random_forest(cube, geom_grid, grid_index);

single_time = time.time() - s

Model: /projappl/project_2009245/GeoHPC-dev/source/lessons/L3/output/model_fitting_RF/RF_geohpc___xarray__lucas2018_19_lappi__LCluokka__2024-04-08T14-22-12_model.pkl
processed /projappl/project_2009245/GeoHPC-dev/source/lessons/L3/output/land_use_results_RF/out_RF_01_S.tif

Land Use prediction visualization#

The results will be found in the postprocess_folder that we named land_use_results_RF. The file that contains the suffix _S is the one that contains the predicted land use. In this case the file is out_RF_25_S.tif

Let’s check briefly the results for the Grid-25

Create a Dask Client#

To use the computational resources in Parallel we need to create a Dask Cluster using the Client function. You can read a bit more in the Dask Documentation. There is an extension for Jupyter Lab called dask-labextension that we installed in our environment. You will find a Dask logo in the left options bar of Jupyter Lab that will allow you to create a Cluster. Simply, click on the + NEW cluster, and then click on < > to add the client to the cell.

Note

You need to delete the code in the cell below and add a new client tcp. Like in Figure 1.

Feel free to Launch dashboard in Jupyter Lab to see the performance of the supercomputer.

from dask.distributed import Client

client = Client("tcp://127.0.0.1:37379")
client

Client

Client-16d2f008-f5b2-11ee-a55f-ac1f6bb239f0

Connection method: Direct
Dashboard: http://127.0.0.1:8787/status

Scheduler Info

Scheduler

Scheduler-75a8662b-7085-42dc-9637-8774587bdda6

Comm: tcp://127.0.0.1:37379	Workers: 4
Dashboard: http://127.0.0.1:8787/status	Total threads: 8
Started: Just now	Total memory: 32.00 GiB

Workers

Worker: 0

Comm: tcp://127.0.0.1:35675	Total threads: 2
Dashboard: http://127.0.0.1:36423/status	Memory: 8.00 GiB
Nanny: tcp://127.0.0.1:33259
Local directory: /run/nvme/job_21127759/tmp/dask-scratch-space/worker-luvc5oke
Tasks executing:	Tasks in memory:
Tasks ready:	Tasks in flight:
CPU usage: 2.0%	Last seen: Just now
Memory usage: 136.39 MiB	Spilled bytes: 0 B
Read bytes: 10.20 kiB	Write bytes: 9.54 kiB

Worker: 1

Comm: tcp://127.0.0.1:45239	Total threads: 2
Dashboard: http://127.0.0.1:36173/status	Memory: 8.00 GiB
Nanny: tcp://127.0.0.1:46161
Local directory: /run/nvme/job_21127759/tmp/dask-scratch-space/worker-ofjrg1a1
Tasks executing:	Tasks in memory:
Tasks ready:	Tasks in flight:
CPU usage: 2.0%	Last seen: Just now
Memory usage: 137.73 MiB	Spilled bytes: 0 B
Read bytes: 9.85 kiB	Write bytes: 9.19 kiB

Worker: 2

Comm: tcp://127.0.0.1:40011	Total threads: 2
Dashboard: http://127.0.0.1:34795/status	Memory: 8.00 GiB
Nanny: tcp://127.0.0.1:34085
Local directory: /run/nvme/job_21127759/tmp/dask-scratch-space/worker-p28zfw34
Tasks executing:	Tasks in memory:
Tasks ready:	Tasks in flight:
CPU usage: 2.0%	Last seen: Just now
Memory usage: 137.26 MiB	Spilled bytes: 0 B
Read bytes: 9.88 kiB	Write bytes: 9.22 kiB

Worker: 3

Comm: tcp://127.0.0.1:45167	Total threads: 2
Dashboard: http://127.0.0.1:34653/status	Memory: 8.00 GiB
Nanny: tcp://127.0.0.1:40351
Local directory: /run/nvme/job_21127759/tmp/dask-scratch-space/worker-4v7hbxse
Tasks executing:	Tasks in memory:
Tasks ready:	Tasks in flight:
CPU usage: 2.0%	Last seen: Just now
Memory usage: 137.09 MiB	Spilled bytes: 0 B
Read bytes: 8.61 kiB	Write bytes: 7.94 kiB

Delayed functions#

In Dask you can parallelize custom algorithms using the dask.delayed function. Not everything can fit in a Dask object so in this case the parallelization is sent as a function to the scheduler or cluster.

We will use delayed functions for each grid square.

grid.head()

	geometry
0	POLYGON ((249631.084 7580490.936, 309631.084 7...
1	POLYGON ((249631.084 7640490.936, 309631.084 7...
2	POLYGON ((309631.084 7280490.936, 369631.084 7...
3	POLYGON ((309631.084 7340490.936, 369631.084 7...
4	POLYGON ((309631.084 7400490.936, 369631.084 7...

%%time

# create a list of delayed objects
delayed_RF_functions = [dask.delayed((run_random_forest)(cube, geom, i)) 
                        for i, geom in grid.itertuples()]

# run
with ProgressBar():
    dask.compute(delayed_RF_functions)

Visualization of the final result#

Let’s look at the final map of our province. Remember the results are stored in the folder land_use_results_RF in the files specified with index and S key like: out_RF_<index>_S.tif.

land_use_results = glob.glob(os.path.join(basepth, 'land_use_results_RF', 'out*_S.tif'))

land_use_results

['/projappl/project_2009245/GeoHPC-dev/source/lessons/L3/output/land_use_results_RF/out_RF_25_S.tif',
 '/projappl/project_2009245/GeoHPC-dev/source/lessons/L3/output/land_use_results_RF/out_RF_02_S.tif',
 '/projappl/project_2009245/GeoHPC-dev/source/lessons/L3/output/land_use_results_RF/out_RF_39_S.tif',
 '/projappl/project_2009245/GeoHPC-dev/source/lessons/L3/output/land_use_results_RF/out_RF_17_S.tif',
 '/projappl/project_2009245/GeoHPC-dev/source/lessons/L3/output/land_use_results_RF/out_RF_28_S.tif',
 '/projappl/project_2009245/GeoHPC-dev/source/lessons/L3/output/land_use_results_RF/out_RF_15_S.tif',
 '/projappl/project_2009245/GeoHPC-dev/source/lessons/L3/output/land_use_results_RF/out_RF_10_S.tif',
 '/projappl/project_2009245/GeoHPC-dev/source/lessons/L3/output/land_use_results_RF/out_RF_08_S.tif',
 '/projappl/project_2009245/GeoHPC-dev/source/lessons/L3/output/land_use_results_RF/out_RF_12_S.tif',
 '/projappl/project_2009245/GeoHPC-dev/source/lessons/L3/output/land_use_results_RF/out_RF_18_S.tif',
 '/projappl/project_2009245/GeoHPC-dev/source/lessons/L3/output/land_use_results_RF/out_RF_11_S.tif',
 '/projappl/project_2009245/GeoHPC-dev/source/lessons/L3/output/land_use_results_RF/out_RF_00_S.tif',
 '/projappl/project_2009245/GeoHPC-dev/source/lessons/L3/output/land_use_results_RF/out_RF_13_S.tif',
 '/projappl/project_2009245/GeoHPC-dev/source/lessons/L3/output/land_use_results_RF/out_RF_21_S.tif',
 '/projappl/project_2009245/GeoHPC-dev/source/lessons/L3/output/land_use_results_RF/out_RF_01_S.tif',
 '/projappl/project_2009245/GeoHPC-dev/source/lessons/L3/output/land_use_results_RF/out_RF_14_S.tif',
 '/projappl/project_2009245/GeoHPC-dev/source/lessons/L3/output/land_use_results_RF/out_RF_23_S.tif',
 '/projappl/project_2009245/GeoHPC-dev/source/lessons/L3/output/land_use_results_RF/out_RF_20_S.tif',
 '/projappl/project_2009245/GeoHPC-dev/source/lessons/L3/output/land_use_results_RF/out_RF_36_S.tif',
 '/projappl/project_2009245/GeoHPC-dev/source/lessons/L3/output/land_use_results_RF/out_RF_34_S.tif',
 '/projappl/project_2009245/GeoHPC-dev/source/lessons/L3/output/land_use_results_RF/out_RF_27_S.tif',
 '/projappl/project_2009245/GeoHPC-dev/source/lessons/L3/output/land_use_results_RF/out_RF_29_S.tif',
 '/projappl/project_2009245/GeoHPC-dev/source/lessons/L3/output/land_use_results_RF/out_RF_16_S.tif',
 '/projappl/project_2009245/GeoHPC-dev/source/lessons/L3/output/land_use_results_RF/out_RF_40_S.tif',
 '/projappl/project_2009245/GeoHPC-dev/source/lessons/L3/output/land_use_results_RF/out_RF_06_S.tif',
 '/projappl/project_2009245/GeoHPC-dev/source/lessons/L3/output/land_use_results_RF/out_RF_26_S.tif',
 '/projappl/project_2009245/GeoHPC-dev/source/lessons/L3/output/land_use_results_RF/out_RF_31_S.tif',
 '/projappl/project_2009245/GeoHPC-dev/source/lessons/L3/output/land_use_results_RF/out_RF_24_S.tif',
 '/projappl/project_2009245/GeoHPC-dev/source/lessons/L3/output/land_use_results_RF/out_RF_03_S.tif',
 '/projappl/project_2009245/GeoHPC-dev/source/lessons/L3/output/land_use_results_RF/out_RF_09_S.tif',
 '/projappl/project_2009245/GeoHPC-dev/source/lessons/L3/output/land_use_results_RF/out_RF_05_S.tif',
 '/projappl/project_2009245/GeoHPC-dev/source/lessons/L3/output/land_use_results_RF/out_RF_22_S.tif',
 '/projappl/project_2009245/GeoHPC-dev/source/lessons/L3/output/land_use_results_RF/out_RF_32_S.tif',
 '/projappl/project_2009245/GeoHPC-dev/source/lessons/L3/output/land_use_results_RF/out_RF_33_S.tif',
 '/projappl/project_2009245/GeoHPC-dev/source/lessons/L3/output/land_use_results_RF/out_RF_37_S.tif',
 '/projappl/project_2009245/GeoHPC-dev/source/lessons/L3/output/land_use_results_RF/out_RF_35_S.tif',
 '/projappl/project_2009245/GeoHPC-dev/source/lessons/L3/output/land_use_results_RF/out_RF_38_S.tif',
 '/projappl/project_2009245/GeoHPC-dev/source/lessons/L3/output/land_use_results_RF/out_RF_30_S.tif',
 '/projappl/project_2009245/GeoHPC-dev/source/lessons/L3/output/land_use_results_RF/out_RF_07_S.tif',
 '/projappl/project_2009245/GeoHPC-dev/source/lessons/L3/output/land_use_results_RF/out_RF_41_S.tif',
 '/projappl/project_2009245/GeoHPC-dev/source/lessons/L3/output/land_use_results_RF/out_RF_19_S.tif',
 '/projappl/project_2009245/GeoHPC-dev/source/lessons/L3/output/land_use_results_RF/out_RF_04_S.tif']

Let’s check the last results in QGIS

final_map