NEEMS Lecture: 2. Data Preparation

In the previous section we started by visualizing the NEEMS data as pie charts and tables. This section covers preparing the data for training: filling empty cells, converting the categorical columns to a one-hot encoding, and reducing the table to the columns that are relevant for training.

2.1 Filling Empty Cells

We work on the narratives variable here. To check your implementation, put the following code below the function definition and run it. The given implementation is also used later in the notebook to check the data for inconsistencies.

# Print the modified data
fill_empty_cells(narratives)
# Check whether any column still contains empty cells
fill_empty_cells(narratives).isna().any()
# If every entry is False, the function works.

This function replaces the null entries in the data with predefined placeholder values.
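The idea can be tried on a small hand-made table first. The following is a minimal sketch, not the lecture's solution: the table `toy` and its column names are assumptions made for illustration, and the placeholders are passed as a dictionary so that each column gets its own fill value.

```python
import pandas as pd

# Hypothetical stand-in for the narratives table; the column names are assumptions.
toy = pd.DataFrame({
    "parent": ["Transporting", None, "Transporting"],
    "previous": [None, "Reaching", "Grasping"],
    "next": ["Grasping", "Lifting", None],
})

# Replace missing values column by column with an explicit placeholder label.
placeholders = {"parent": "NoParent", "previous": "NoPrevious", "next": "NoNext"}
filled = toy.fillna(value=placeholders)

# One boolean per column: True means the column still contains a missing value.
remaining = filled.isna().any()
print(filled)
print(remaining)
```

If a single yes/no answer is more convenient than one boolean per column, `filled.isna().any().any()` reduces the check to one value.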

# Solution
def fill_empty_cells(data):
    filled_data = data.copy()
 
    filled_data[header_names.PARENT] = filled_data[header_names.PARENT].fillna('NoParent')
    #TODO Fill the remaining empty cells
    filled_data[header_names.NEXT] = filled_data[header_names.NEXT].fillna('NoNext')
    filled_data[header_names.PREVIOUS] = filled_data[header_names.PREVIOUS].fillna('NoPrevious')
    return filled_data

2.2 Transform Categorical Values to Numeric Values

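One-hot encoding replaces each categorical feature with one column per category, holding 1 where the row belongs to that category and 0 elsewhere, which gives learning algorithms a numeric representation to work with. When you print the function output for the narratives data, scroll to the right to see that the table has gained columns.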
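To see what the expansion looks like, here is a minimal sketch on a hand-made table. The table `toy` and its column names are assumptions made for illustration. `dtype=int` is passed explicitly because recent pandas versions return True/False indicator columns by default, while the text above speaks of 0 and 1.

```python
import pandas as pd

# Hypothetical stand-in for the narratives table; the column names are assumptions.
toy = pd.DataFrame({
    "type": ["Reaching", "Grasping", "Lifting"],
    "previous": ["NoPrevious", "Reaching", "Grasping"],
    "next": ["Grasping", "Lifting", "NoNext"],
})

# One indicator column per distinct value, named '<prefix>_<value>'.
type_dummies = pd.get_dummies(toy["type"], prefix="type", dtype=int)
previous_dummies = pd.get_dummies(toy["previous"], prefix="previous", dtype=int)

# Append the indicator columns to the right of the original table.
# The 'next' column is left as it is because it is the value to predict.
encoded = pd.concat([toy, type_dummies, previous_dummies], axis=1)
print(encoded.columns.tolist())
```

Each of the two encoded features has three distinct values here, so the three original columns grow to nine.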

def transform_categorial_to_one_hot_encoded(data):
    encoded_data = data.copy()
 
    encoded_parent_data = pd.get_dummies(encoded_data[header_names.PARENT], prefix='parent')
    encoded_data = pd.concat([encoded_data, encoded_parent_data],axis=1)
 
    #TODO Transform the remaining categorical features into one-hot-encoded features
    #Hint: NEXT must not be encoded
    encoded_type_data = pd.get_dummies(encoded_data[header_names.TYPE], prefix='type')
    encoded_previous_data = pd.get_dummies(encoded_data[header_names.PREVIOUS], prefix='previous')
    encoded_data = pd.concat([encoded_data, 
                              encoded_type_data, 
                              encoded_previous_data],axis=1)
 
    return encoded_data

2.3 Data Cleaning

To predict which action follows a given action, we need none of the original columns except NEXT, which is the prediction target. The features are the columns generated by one-hot encoding. Remember to remove the ID column as well!
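The cleaning step amounts to dropping columns by name. A minimal sketch on a hand-made table follows; the table `encoded` and its column names are assumptions made for illustration. It uses the `columns=` keyword of `DataFrame.drop`, which removes several columns in one call and avoids the positional axis argument that newer pandas versions no longer accept.

```python
import pandas as pd

# Hypothetical already-encoded table; the column names are assumptions.
encoded = pd.DataFrame({
    "id": ["a1", "a2"],
    "type": ["Reaching", "Grasping"],
    "next": ["Grasping", "NoNext"],
    "type_Grasping": [0, 1],
    "type_Reaching": [1, 0],
})

# Drop the raw columns in one call; 'next' stays because it is the prediction target.
cleaned = encoded.drop(columns=["id", "type"])
print(cleaned.columns.tolist())
```

`drop` returns a new table and leaves `encoded` unchanged, so the raw columns remain available for inspection.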

def clean(data):
    cleaned_data = data.copy()
 
    #TODO Decide which columns are not required to be able to predict the next robot action
    #Hint: The NEXT column IS required.
 
    cols = [header_names.PARENT,
            header_names.TYPE,
            header_names.START_TIME, 
            header_names.END_TIME, 
            header_names.DURATION,
            header_names.PREVIOUS,
            header_names.ID]
 
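    # Drop all unneeded columns in one call; the positional axis argument
    # of drop() is not accepted by newer pandas versions.
    cleaned_data = cleaned_data.drop(columns=cols)

    return cleaned_data

2.4 Data Preparation Pipeline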

Apply the three functions above to the narratives data, in the order in which they were introduced: fill, encode, clean.
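The order of the steps matters. `pd.get_dummies` ignores missing values by default, so a row that is encoded before it is filled ends up with zeros in all of its indicator columns, and cleaning before encoding would remove the columns that still have to be encoded. The sketch below illustrates the chaining with `DataFrame.pipe` as one possible style. The helper functions `fill`, `encode` and `drop_raw` and the column names are simplified, hypothetical stand-ins, not the lecture's functions.

```python
import pandas as pd

def fill(data):
    # Replace missing values with explicit placeholder labels.
    return data.fillna({"previous": "NoPrevious", "next": "NoNext"})

def encode(data):
    # Append one 0/1 indicator column per distinct 'previous' value.
    dummies = pd.get_dummies(data["previous"], prefix="previous", dtype=int)
    return pd.concat([data, dummies], axis=1)

def drop_raw(data):
    # Remove the raw column that has been replaced by its indicators.
    return data.drop(columns=["previous"])

# Hypothetical stand-in for the narratives table.
toy = pd.DataFrame({
    "previous": [None, "Reaching"],
    "next": ["Grasping", None],
})

# Each step receives the result of the previous one: fill, then encode, then clean.
prepared = toy.pipe(fill).pipe(encode).pipe(drop_raw)
print(prepared)
```

Reassigning a variable step by step, as in the function below, does the same thing; `pipe` only makes the order visible in a single expression.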

def prepare_data(data):
    prepared_data = data.copy()
 
    prepared_data = fill_empty_cells(prepared_data)
    #TODO apply all preparation methods to prepared_data
    prepared_data = transform_categorial_to_one_hot_encoded(prepared_data)
    prepared_data = clean(prepared_data)
 
    return prepared_data

2.5 Prepared Data Evaluation

#TODO store the prepared narratives in a prepared_narratives variable and evaluate them by printing them
prepared_narratives = prepare_data(narratives)
print(prepared_narratives)
#TODO verify that the prepared narratives do not have any empty cells
prepared_narratives.isna().any()

In the next chapter we will talk about decision trees, our preferred model design.

ease/machinelearning/data_preparation.txt · Last modified: 2020/06/22 11:39 by s_fuyedc
