.. _imputation_exammple:

Imputation Algorithms
=====================

This is an example of how to use the BRAILS imputation algorithms to fill in missing data in an inventory. In this example, the user has a CSV file that they will use to create the asset inventory. That CSV file contains rows, some of which are complete and some of which have missing column values. For example, for some rows the roof shape attribute may be missing.

.. code-block::

   index,erabuilt,numstories,roofshape,fpAreas,occupancy2,fparea,repaircost,constype,occupancy,Lat,Lon
   0,NA,NA,Gable,998,Other,,,,,37.86730380000001,-122.42452141999999
   1,NA,NA,Hip,4411,Residential,,,,,37.870170718181825,-122.42705018181817
   2,NA,NA,Gable,2669,Other,,,,,37.86590688571429,-122.43521675714284
   3,NA,2,Hip,8599,Residential,,,,,37.86981550909091,-122.4255108090909
   4,NA,2,Hip,10802,Residential,,,,,37.869326593333334,-122.42639508666667
   ......
   ......

In the Python script shown below, the inventory is first created from this CSV file and a **KNN** imputer is then created. When the imputer's **impute()** method is invoked, it returns a second inventory, which contains a number of possible values for each of the missing fields.

.. literalinclude:: imputation.py
   :language: python
   :linenos:

The script is run by issuing the following from a terminal window:

.. code-block::

   python3 imputation.py

and the application would produce:

.. literalinclude:: output.txt
   :linenos:

Imputation Notebook
-------------------

Below is a link to a Jupyter notebook that runs this basic code, with graphics to better understand the output.

.. nbgallery::
   ./imputation_example.ipynb

.. note::

   #. Imputation is a statistical technique for handling missing data by replacing it with substituted values. In |app| the goal is to fill in the gaps in an inventory dataset so that analyses can proceed without having to discard assets from the inventory due to missing values. Imputation is also used in many other fields, such as data science, machine learning, and statistics.
   #. There are a number of algorithms outlined in the literature. These algorithms produce either a single value for each missing data point, e.g. **mean**, **modal**, **median**, or a number of possible values for each missing data point, e.g. **K-NearestNeighbour**. |app| algorithms produce the latter.
   #. When multiple possible values are generated or exist for any feature of an **Asset** in the inventory, that feature holds a list of possible options, e.g. **constype** in the above example.
   #. For the imputation algorithms specifically, if all of the possible options for a certain feature have the same value, a single value is provided, e.g. **eraBuilt** in the above example.
   #. These lists of possible values are not generated only by imputation. They represent unknown quantities in the inventory, and they can come from Processors, Inferers, or user-supplied datasets.
   #. When multiple options in the workflow generate a number of samples for a field of one or more assets in the inventory, any option that generates samples is expected to generate the same number of samples as already exist, **overriding any user request to the contrary**. With this enforced, the user can request from the inventory a set of distinct **possible worlds**.
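
The ideas in the note above can be illustrated with a minimal, standalone sketch. It is not the BRAILS implementation and does not use the BRAILS API: the toy inventory, the ``x``/``y`` coordinates, and the helper names ``knn_impute`` and ``possible_world`` are all assumptions made for illustration. Each missing value is filled with the values of the ``k`` nearest assets that have that feature; the result collapses to a single value when all candidates agree, and otherwise stays a list from which one "possible world" can be selected by index.

.. code-block:: python

   import math

   # Made-up toy inventory (not BRAILS data); None marks a missing value.
   ASSETS = [
       {"x": 0.0, "y": 0.0, "roofshape": "Gable", "numstories": None},
       {"x": 1.0, "y": 0.0, "roofshape": "Hip", "numstories": 2},
       {"x": 0.0, "y": 2.0, "roofshape": "Gable", "numstories": 2},
       {"x": 5.0, "y": 5.0, "roofshape": None, "numstories": 1},
       {"x": 4.0, "y": 5.0, "roofshape": "Flat", "numstories": 3},
   ]


   def knn_impute(assets, target, feature, k=2):
       """Return candidate values for `feature` from the k nearest donor assets."""
       # Donors are the assets that actually have a value for this feature.
       donors = [a for a in assets if a[feature] is not None]
       donors.sort(
           key=lambda a: math.dist((a["x"], a["y"]), (target["x"], target["y"]))
       )
       values = [a[feature] for a in donors[:k]]
       if not values:
           return None  # nothing to borrow from
       # Collapse to a single value when every candidate agrees.
       return values[0] if len(set(values)) == 1 else values


   def possible_world(asset, i):
       """Pick sample i from every list-valued feature; keep single values as is."""
       return {
           key: (val[i] if isinstance(val, list) else val)
           for key, val in asset.items()
       }


   # Build a second inventory in which missing fields hold the imputed candidates.
   filled = [dict(a) for a in ASSETS]
   for asset in filled:
       for feature in ("roofshape", "numstories"):
           if asset[feature] is None:
               asset[feature] = knn_impute(ASSETS, asset, feature)

   print(filled[0]["numstories"])
   print(filled[3]["roofshape"])
   print(possible_world(filled[3], 1))

In this sketch every imputed list has length ``k``, which mirrors the requirement that all sample-generating steps produce the same number of samples, so that index ``i`` identifies one consistent possible world across all features.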