Imputation Algorithms

This example shows how to use the BRAILS imputation algorithms to fill in missing data in an inventory. The user has a CSV file from which the asset inventory is created. Some rows of that file are complete, while others are missing values in one or more columns. For example, for some rows the roof shape attribute may be missing. In the excerpt below, erabuilt is NA and fparea, repaircost, constype, and occupancy are empty in every row shown.

index,erabuilt,numstories,roofshape,fpAreas,occupancy2,fparea,repaircost,constype,occupancy,Lat,Lon
0,NA,NA,Gable,998,Other,,,,,37.86730380000001,-122.42452141999999
1,NA,NA,Hip,4411,Residential,,,,,37.870170718181825,-122.42705018181817
2,NA,NA,Gable,2669,Other,,,,,37.86590688571429,-122.43521675714284
3,NA,2,Hip,8599,Residential,,,,,37.86981550909091,-122.4255108090909
4,NA,2,Hip,10802,Residential,,,,,37.869326593333334,-122.42639508666667
......
......
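To see how incomplete such a file is before imputing it, the share of missing values per column can be tallied directly. The following is a minimal sketch using only the Python standard library. It assumes that both an empty field and the string NA mark a missing value, and the helper name missing_percentages is hypothetical (it is not part of BRAILS). It runs on the five rows excerpted above, so its percentages differ from those of the full file.

```python
import csv
import io

# First rows of the example file, copied from the excerpt above.
SAMPLE_CSV = """index,erabuilt,numstories,roofshape,fpAreas,occupancy2,fparea,repaircost,constype,occupancy,Lat,Lon
0,NA,NA,Gable,998,Other,,,,,37.86730380000001,-122.42452141999999
1,NA,NA,Hip,4411,Residential,,,,,37.870170718181825,-122.42705018181817
2,NA,NA,Gable,2669,Other,,,,,37.86590688571429,-122.43521675714284
3,NA,2,Hip,8599,Residential,,,,,37.86981550909091,-122.4255108090909
4,NA,2,Hip,10802,Residential,,,,,37.869326593333334,-122.42639508666667
"""

# Assumption: an empty field or the string "NA" denotes a missing value.
MISSING_MARKERS = {"", "NA"}


def missing_percentages(csv_text):
    """Return {column: percentage of rows whose value is a missing marker}."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    if not rows:
        return {}
    percentages = {}
    for column in rows[0]:
        n_missing = sum(row[column].strip() in MISSING_MARKERS for row in rows)
        percentages[column] = 100.0 * n_missing / len(rows)
    return percentages


percentages = missing_percentages(SAMPLE_CSV)
for column, percent in percentages.items():
    if percent > 0:
        print(f"{column}: {percent:.2f}%")
```

Columns with no missing values, such as roofshape in this excerpt, are skipped in the printout.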

The Python script shown below first creates the inventory from this CSV file and then creates a KNN imputer. Invoking the imputer's impute() method returns a second inventory in which the missing fields are filled in, in general with a number of possible values for each field.

# Written: sy Aug 2024
# License: BSD-2

"""
imputation.py
================

This is a simple BRAILS++ example to demonstrate imputing (estimating the
missing pieces of) an inventory dataset.

"""

import os

from brails.utils.importer import Importer
from brails.types.asset_inventory import AssetInventory


# create the importer
importer = Importer()

#
# create an asset inventory from the contents of a csv file
#

file_path = "./example_Tiburon.csv"

inventory = AssetInventory()
inventory.read_from_csv(file_path, keep_existing=True, id_column='index')

#
# the inventory is incomplete: it contains missing data, as shown for
# the asset with index 4
#

print(f'INCOMPLETE ASSET: {inventory.get_asset_features(4)[1]}')

#
# create a KNN imputer and use it to fill in the missing values
#

knn_imputer_class = importer.get_class("KnnImputer")
imputer = knn_imputer_class()
new_inventory = imputer.impute(inventory, n_possible_worlds=10)

#
# save the imputed inventory to a geojson file
#

filepath = 'tmp/imputed_inventory.geojson'
directory = os.path.dirname(filepath)
os.makedirs(directory, exist_ok=True)

new_inventory.write_to_geojson(filepath)

print(f'COMPLETE ASSET: {new_inventory.get_asset_features(4)[1]}')

The script is run by issuing the following from a terminal window:

python3 imputation.py

which produces output similar to the following:

INCOMPLETE ASSET: {'index': 4, 'erabuilt': 'NA', 'numstories': 2, 'roofshape': 'Hip', 'fpAreas': 10802, 'occupancy2': 'Residential', 'fparea': '', 'repaircost': '', 'constype': '', 'occupancy': '', 'type': 'building'}


Missing percentages among 3249 assets
erabuilt: 14.19%
numstories: 13.82%
occupancy2: 3.42%
fparea: 14.19%
repaircost: 14.19%
constype: 14.19%
occupancy: 14.19%
Primitive imputation done.
Running the main imputation. This may take a while.
Done imputation. It took 0.01 mins


COMPLETE ASSET: {'index': 4, 'erabuilt': 1975.0, 'numstories': 2, 'roofshape': 'Hip', 'fpAreas': 10802, 'occupancy2': 'Residential', 'fparea': [np.float64(4654.0), np.float64(1784.0), np.float64(4654.0), np.float64(4654.0), np.float64(2582.0), np.float64(2582.0), np.float64(4654.0), np.float64(2582.0), np.float64(2582.0), np.float64(1784.0)], 'repaircost': [np.float64(236440.24), np.float64(419571.341), np.float64(419571.341), np.float64(236440.24), np.float64(419571.341), np.float64(236440.24), np.float64(419571.341), np.float64(322763.278), np.float64(236440.24), np.float64(425334.148)], 'constype': ['W1', 'W1', 'S1', 'S1', 'RM1', 'W1', 'S1', 'RM1', 'S1', 'RM1'], 'occupancy': ['RES1', 'RES1', 'COM4', 'RES1', 'RES1', 'RES3A', 'RES1', 'RES3A', 'COM4', 'COM4'], 'type': 'building'}
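In the COMPLETE ASSET record, scalar features sit next to features that hold a list of ten samples. One way to read such a record is to count how often each value occurs among the samples. The sketch below does this for a subset of the features of the asset with index 4, with values copied from the output above. The helper name sample_frequencies is hypothetical and not part of BRAILS.

```python
from collections import Counter

# Subset of the COMPLETE ASSET record printed above (values copied verbatim).
asset = {
    "erabuilt": 1975.0,
    "numstories": 2,
    "roofshape": "Hip",
    "constype": ["W1", "W1", "S1", "S1", "RM1", "W1", "S1", "RM1", "S1", "RM1"],
    "occupancy": ["RES1", "RES1", "COM4", "RES1", "RES1",
                  "RES3A", "RES1", "RES3A", "COM4", "COM4"],
}


def sample_frequencies(features):
    """Count the occurrences of each value in every list-valued feature."""
    return {
        name: Counter(value)
        for name, value in features.items()
        if isinstance(value, list)
    }


frequencies = sample_frequencies(asset)
for name, counts in frequencies.items():
    print(name, counts.most_common())
```

Scalar features such as erabuilt are left out of the summary because they carry a single value rather than a set of samples.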

Imputation Notebook

Below is a link to a Jupyter notebook that runs this basic code, with graphics to better understand the output.


Note

  1. Imputation is a statistical technique used to handle missing data by replacing it with substituted values. In BrailsPlusPlus the goal is to fill in the gaps in an inventory dataset to ensure that analyses can proceed without having to throw away assets from the inventory due to missing values. Imputation is used in many other fields like data science, machine learning, and statistics.

  2. There are a number of imputation algorithms outlined in the literature. These algorithms produce either a single value for each missing data point, e.g. the mean, mode, or median, or a number of possible values for each missing data point, e.g. k-nearest neighbors. The BrailsPlusPlus algorithms produce the latter.

  3. When multiple possible values are generated or exist for a feature of an Asset in the inventory, that feature holds a list of possible options, e.g. constype in the above example.

  4. For the imputation algorithms specifically, if every entry in the list of possible options for a feature has the same value, a single value is provided instead of a list, e.g. erabuilt in the above example.

  5. Possible values represented by a list are not generated only by imputation. They represent unknown quantities in the inventory and can also come from Processors, Inferers, or user-supplied datasets.

  6. When several steps in a workflow generate samples for a field of one or more assets in the inventory, each step that generates samples is expected to generate the same number of samples as already exists in the inventory, overriding any user request to the contrary. With this enforced, the user can request a set of distinct possible worlds from the inventory.
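The notes above can be illustrated with a toy example. This is not the BRAILS KnnImputer implementation. The distance measure (plain Euclidean distance on Lat and Lon), the choice of k, the sampling scheme, the helper names knn_samples and possible_world, and the donor data are all assumptions made for this sketch. It draws a fixed number of samples from the nearest complete assets, collapses them to a single value when they all agree (note 4), and then picks one possible world out of the resulting record (note 6).

```python
import math
import random


def knn_samples(target, donors, feature, k=3, n_possible_worlds=10, seed=0):
    """Draw n_possible_worlds values of `feature` from the k nearest donors.

    Assumption: nearness is Euclidean distance on (Lat, Lon).
    """
    nearest = sorted(
        donors,
        key=lambda d: math.hypot(d["Lat"] - target["Lat"], d["Lon"] - target["Lon"]),
    )[:k]
    rng = random.Random(seed)
    samples = [rng.choice(nearest)[feature] for _ in range(n_possible_worlds)]
    # Collapse to a single value when every sample agrees (note 4).
    return samples[0] if len(set(samples)) == 1 else samples


def possible_world(features, i):
    """Pick realization i from list-valued features and keep scalars (note 6)."""
    return {
        name: (value[i] if isinstance(value, list) else value)
        for name, value in features.items()
    }


# Made-up donors with known values, and one asset with missing features.
donors = [
    {"Lat": 0.00, "Lon": 0.00, "constype": "W1", "erabuilt": 1975},
    {"Lat": 0.01, "Lon": 0.00, "constype": "S1", "erabuilt": 1975},
    {"Lat": 0.00, "Lon": 0.01, "constype": "RM1", "erabuilt": 1975},
    {"Lat": 5.00, "Lon": 5.00, "constype": "C1", "erabuilt": 1920},
]
asset = {"Lat": 0.005, "Lon": 0.005}

asset["erabuilt"] = knn_samples(asset, donors, "erabuilt")  # neighbors agree
asset["constype"] = knn_samples(asset, donors, "constype")  # neighbors differ

print(asset)
print(possible_world(asset, 0))
```

Because every list in a record has the same length, indexing all of them with the same i yields one internally consistent realization of the asset.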