Imputation Algorithms

This example shows how to use the BRAILS imputation algorithms to fill in missing data in an inventory. The user has a CSV file from which the asset inventory is created. Some rows of that file are complete and others have missing column values; for example, the roof shape attribute may be missing for some rows. The first rows of such a file are shown below.

index,erabuilt,numstories,roofshape,fpAreas,occupancy2,fparea,repaircost,constype,occupancy,Lat,Lon
0,NA,NA,Gable,998,Other,,,,,37.86730380000001,-122.42452141999999
1,NA,NA,Hip,4411,Residential,,,,,37.870170718181825,-122.42705018181817
2,NA,NA,Gable,2669,Other,,,,,37.86590688571429,-122.43521675714284
3,NA,2,Hip,8599,Residential,,,,,37.86981550909091,-122.4255108090909
4,NA,2,Hip,10802,Residential,,,,,37.869326593333334,-122.42639508666667
......
......
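
Before imputing, it can help to check how incomplete the file is. The following is a minimal sketch using only the Python standard library and the rows listed above. It assumes that both empty cells and the literal string NA mark a missing value; the helper name missing_percentages is hypothetical and is not part of BRAILS.

```python
import csv
import io

# First rows of the example file, copied from the listing above.
SAMPLE_CSV = """index,erabuilt,numstories,roofshape,fpAreas,occupancy2,fparea,repaircost,constype,occupancy,Lat,Lon
0,NA,NA,Gable,998,Other,,,,,37.86730380000001,-122.42452141999999
1,NA,NA,Hip,4411,Residential,,,,,37.870170718181825,-122.42705018181817
2,NA,NA,Gable,2669,Other,,,,,37.86590688571429,-122.43521675714284
3,NA,2,Hip,8599,Residential,,,,,37.86981550909091,-122.4255108090909
4,NA,2,Hip,10802,Residential,,,,,37.869326593333334,-122.42639508666667
"""

# Assumption: an empty cell and the literal "NA" both mean "missing".
MISSING_MARKERS = {"", "NA"}


def missing_percentages(csv_text):
    """Return {column: percentage of rows whose value is missing}."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    if not rows:
        return {}
    result = {}
    for column in rows[0]:
        n_missing = sum(row[column].strip() in MISSING_MARKERS for row in rows)
        result[column] = 100.0 * n_missing / len(rows)
    return result


percentages = missing_percentages(SAMPLE_CSV)
for column, percent in percentages.items():
    if percent > 0:
        print(f"{column}: {percent:.2f}%")
```

Only the five rows shown here are counted, so the percentages differ from those reported for the full file.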

In the Python script shown below, the inventory is first created from this CSV file and a KNN imputer is then created. When the imputer's impute() method is invoked, it returns a second inventory in which the missing fields contain a number of possible values.

# Written: sy Aug 2024
# License: BSD-2

"""
imputation.py
================

This is a simple BRAILS++ example to demonstrate imputing (estimating the
missing pieces of) an inventory dataset.

"""

import os

from brails.utils.importer import Importer
from brails.types.asset_inventory import AssetInventory


# create the importer
importer = Importer()

#
# create an asset inventory from the contents of a csv file
#

file_path = "./example_Tiburon.csv"

inventory = AssetInventory()
inventory.read_from_csv(file_path, keep_existing=True, id_column='index')

#
# the inventory is incomplete: some features are missing, as shown for the
# asset with id 4
#

print(f'INCOMPLETE ASSET: {inventory.get_asset_features(4)[1]}')

# create a KNN imputer and request 10 possible values for the missing fields
knn_imputer_class = importer.get_class("KnnImputer")
imputer = knn_imputer_class()
new_inventory = imputer.impute(inventory, n_possible_worlds=10)

#
# save the imputed inventory to a geojson file
#

filepath = 'tmp/imputed_inventory.geojson'
directory = os.path.dirname(filepath)
os.makedirs(directory, exist_ok=True)

new_inventory.write_to_geojson(filepath)

print(f'COMPLETE ASSET: {new_inventory.get_asset_features(4)[1]}')

The script is run by issuing the following from a terminal window:

python3 imputation.py

and the application produces output similar to the following:

INCOMPLETE ASSET: {'index': 4, 'erabuilt': 'NA', 'numstories': 2, 'roofshape': 'Hip', 'fpAreas': 10802, 'occupancy2': 'Residential', 'fparea': '', 'repaircost': '', 'constype': '', 'occupancy': '', 'type': 'building'}


Missing percentages among 3249 assets
erabuilt: 14.19%
numstories: 13.82%
occupancy2: 3.42%
fparea: 14.19%
repaircost: 14.19%
constype: 14.19%
occupancy: 14.19%
Primitive imputation done.
Running the main imputation. This may take a while.
Done imputation. It took 0.01 mins


COMPLETE ASSET: {'index': 4, 'erabuilt': 1975.0, 'numstories': 2, 'roofshape': 'Hip', 'fpAreas': 10802, 'occupancy2': 'Residential', 'fparea': [np.float64(4654.0), np.float64(1784.0), np.float64(4654.0), np.float64(4654.0), np.float64(2582.0), np.float64(2582.0), np.float64(4654.0), np.float64(2582.0), np.float64(2582.0), np.float64(1784.0)], 'repaircost': [np.float64(236440.24), np.float64(419571.341), np.float64(419571.341), np.float64(236440.24), np.float64(419571.341), np.float64(236440.24), np.float64(419571.341), np.float64(322763.278), np.float64(236440.24), np.float64(425334.148)], 'constype': ['W1', 'W1', 'S1', 'S1', 'RM1', 'W1', 'S1', 'RM1', 'S1', 'RM1'], 'occupancy': ['RES1', 'RES1', 'COM4', 'RES1', 'RES1', 'RES3A', 'RES1', 'RES3A', 'COM4', 'COM4'], 'type': 'building'}
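
The list-valued features in the output above can be summarized as relative frequencies. The sketch below copies the categorical lists for asset 4 from that output into a plain dictionary. The helper relative_frequencies is hypothetical and is not a BRAILS function.

```python
from collections import Counter

# Features of asset 4 after imputation, copied from the output above
# (numeric lists omitted for brevity).
imputed_features = {
    "erabuilt": 1975.0,
    "numstories": 2,
    "roofshape": "Hip",
    "constype": ["W1", "W1", "S1", "S1", "RM1",
                 "W1", "S1", "RM1", "S1", "RM1"],
    "occupancy": ["RES1", "RES1", "COM4", "RES1", "RES1",
                  "RES3A", "RES1", "RES3A", "COM4", "COM4"],
}


def relative_frequencies(values):
    """Map each distinct value to its share of the samples."""
    counts = Counter(values)
    return {value: count / len(values) for value, count in counts.most_common()}


# Only list-valued features carry more than one possible value.
summary = {
    name: relative_frequencies(value)
    for name, value in imputed_features.items()
    if isinstance(value, list)
}
for name, shares in summary.items():
    print(name, shares)
```

For this asset, S1 is the most frequent construction type (4 of 10 samples) and RES1 the most frequent occupancy (5 of 10 samples).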

Imputation Notebook

Below is a link to a Jupyter notebook that runs this basic code, with graphics to better understand the output.

Note

Imputation is a statistical technique used to handle missing data by replacing it with substituted values. In BrailsPlusPlus the goal is to fill in the gaps in an inventory dataset to ensure that analyses can proceed without having to throw away assets from the inventory due to missing values. Imputation is used in many other fields like data science, machine learning, and statistics.
The literature describes many imputation algorithms. They produce either a single value for each missing data point (e.g., the mean, mode, or median) or a number of possible values for each missing data point (e.g., k-nearest neighbors). The BrailsPlusPlus algorithms produce the latter.
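
To illustrate how a nearest-neighbor approach can yield several possible values rather than one, here is a toy sketch. It is not the BRAILS KnnImputer implementation: the function knn_possible_values, its parameters, and the sample rows are all hypothetical, and the distance is computed on unscaled features for simplicity.

```python
import random


def knn_possible_values(target, complete_rows, known_keys, missing_key,
                        k=3, n_samples=5, seed=0):
    """Draw n_samples candidate values for target[missing_key].

    Candidates are sampled with replacement from the k complete rows
    closest to the target in the known numeric features.
    """
    def squared_distance(row):
        # Plain squared Euclidean distance; a real imputer would
        # normally scale the features first.
        return sum((row[key] - target[key]) ** 2 for key in known_keys)

    neighbours = sorted(complete_rows, key=squared_distance)[:k]
    rng = random.Random(seed)
    return [rng.choice(neighbours)[missing_key] for _ in range(n_samples)]


# Hypothetical complete assets (illustrative values, not from the dataset).
complete_rows = [
    {"numstories": 1, "fpAreas": 1000, "constype": "W1"},
    {"numstories": 2, "fpAreas": 9000, "constype": "S1"},
    {"numstories": 2, "fpAreas": 11000, "constype": "RM1"},
    {"numstories": 5, "fpAreas": 50000, "constype": "C1"},
]
target = {"numstories": 2, "fpAreas": 10802}

candidates = knn_possible_values(
    target, complete_rows,
    known_keys=["numstories", "fpAreas"],
    missing_key="constype", k=2, n_samples=10,
)
print(candidates)
```

With k=2, the two closest rows are the S1 and RM1 assets, so every candidate is one of those two construction types.
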
When multiple possible values are generated or exist for a feature of an Asset in the inventory, that feature holds a list of possible values, e.g. constype in the above example.
For the imputation algorithms specifically, if all of the possible values for a feature are identical, a single value is stored instead, e.g. erabuilt in the above example.
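
The rule for identical samples can be sketched as follows. The helper collapse_if_constant is hypothetical and only mirrors the behavior described here; it is not taken from the BRAILS source.

```python
def collapse_if_constant(samples):
    """Return the shared value if all samples agree, else the list itself."""
    if samples and all(sample == samples[0] for sample in samples):
        return samples[0]
    return samples


# Ten identical samples collapse to one value, as for erabuilt above.
era = collapse_if_constant([1975.0] * 10)

# Differing samples stay a list, as for constype above.
constype = collapse_if_constant(
    ["W1", "W1", "S1", "S1", "RM1", "W1", "S1", "RM1", "S1", "RM1"]
)

print(era)
print(constype)
```
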
Lists of possible values are not generated only by imputation. They represent unknown quantities in the inventory and can also come from Processors, Inferers, or user-supplied datasets.
When several steps in a workflow generate samples for a field of one or more assets in the inventory, each step is expected to generate the same number of samples as already exists in the inventory, overriding any user request to the contrary. With this enforced, the user can request a set of distinct possible worlds from the inventory.
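
One way to picture a possible world is as the i-th sample taken from every list-valued feature, with single-valued features passed through unchanged. The sketch below works on a plain dictionary of features. The helpers sample_count and possible_world are hypothetical and do not represent the inventory's actual API.

```python
def sample_count(features):
    """Return the common length of all list-valued features (0 if none)."""
    lengths = {len(value) for value in features.values()
               if isinstance(value, list)}
    if len(lengths) > 1:
        # This is the situation the equal-sample-count rule prevents.
        raise ValueError(f"inconsistent sample counts: {sorted(lengths)}")
    return lengths.pop() if lengths else 0


def possible_world(features, world_index):
    """Return one concrete realization of an asset's features."""
    return {
        name: value[world_index] if isinstance(value, list) else value
        for name, value in features.items()
    }


# Categorical features of asset 4, copied from the example output.
asset = {
    "erabuilt": 1975.0,
    "numstories": 2,
    "constype": ["W1", "W1", "S1", "S1", "RM1",
                 "W1", "S1", "RM1", "S1", "RM1"],
    "occupancy": ["RES1", "RES1", "COM4", "RES1", "RES1",
                  "RES3A", "RES1", "RES3A", "COM4", "COM4"],
}

worlds = [possible_world(asset, i) for i in range(sample_count(asset))]
print(len(worlds))
print(worlds[2])
```

Because every list has the same length, each index gives a complete, consistent realization of the asset. If the lengths differed, the check in sample_count would raise an error.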