Following the EDA segment of this article (part 1), we will be diving straight into the modeling process. We will touch on the following:
To reiterate for the bad pupils who didn’t read part 1: we are analysing building energy performance diagnostic data. The goal is to build models capable of predicting the energy consumption (or energy label) of a building from its features.
We could go one of two ways: a regression problem where we predict the energy consumption, or a classification problem where we predict the energy class.
We decided to go the regression route because, well, we wear the trousers in this relationship (and for more accurate predictions).
So to get our ducks in a row:
★ Problem: Regression with a little twist ;) (You’ll figure it out below!)
★ Target: Energy consumption per building
★ Features: Age, location, different building surfaces, temperatures
The following is the workflow we implemented:
During data analysis, we saw that each department has specific energy behaviour. However, the dataset in hand does not give any descriptive information about the departments other than their postal code.
Consequently, we sought the information elsewhere. It is safe to presume, at least at this stage, that the weather is a big catalyst in energy consumption, ergo we decided to ingest temperature data into the models.
We extract the data from a public government website. It details daily temperatures (min, max and average) per department, spanning from January 2018 up until January 2020.
Using the dates, we average each temperature (min, max and average) per season and per department. We get the following:
Then, we simply merge it with our main dataset.
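The aggregation and merge described above can be sketched with pandas as follows. All column names (`department`, `date`, `t_min`, `t_max`, `t_avg`, `energy_kwh`) are assumptions for illustration, not the actual schema:

```python
import pandas as pd

# Hypothetical daily temperature records per department.
temps = pd.DataFrame({
    "department": ["75", "75", "13", "13"],
    "date": pd.to_datetime(["2018-01-10", "2018-07-15",
                            "2018-01-10", "2018-07-15"]),
    "t_min": [1.0, 16.0, 4.0, 20.0],
    "t_max": [6.0, 28.0, 11.0, 32.0],
    "t_avg": [3.5, 22.0, 7.5, 26.0],
})

# Map each month to a meteorological season.
season_of = {12: "winter", 1: "winter", 2: "winter",
             3: "spring", 4: "spring", 5: "spring",
             6: "summer", 7: "summer", 8: "summer",
             9: "autumn", 10: "autumn", 11: "autumn"}
temps["season"] = temps["date"].dt.month.map(season_of)

# Average each temperature per department and season, then pivot
# so every (stat, season) pair becomes its own column.
seasonal = (temps.groupby(["department", "season"])[["t_min", "t_max", "t_avg"]]
            .mean()
            .unstack("season"))
seasonal.columns = [f"{stat}_{season}" for stat, season in seasonal.columns]
seasonal = seasonal.reset_index()

# Merge with the main building dataset on the department code.
buildings = pd.DataFrame({"department": ["75", "13"],
                          "energy_kwh": [210.0, 180.0]})
merged = buildings.merge(seasonal, on="department", how="left")
```

A left merge keeps every building row even if a department is missing from the temperature file.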
Since we already took on the major load of data cleaning in part 1, the only preprocessing step needed is to encode categorical features. The most problematic one is the department code, mainly because of its high cardinality (96 values).
Since the machine learning fairies have blessed us with LightGBM and CatBoost, which handle categorical features natively, we only need to encode this feature for the decision tree regressor. We tried one-hot encoding, which did not bode well for training (96 extra features). We settled for frequency encoding.
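Frequency encoding simply replaces each category by how often it occurs. A minimal sketch with pandas (the column name and values are hypothetical stand-ins, not the actual dataset schema; in practice the frequencies should be computed on the training split only, to avoid leakage):

```python
import pandas as pd

# Hypothetical department codes standing in for the real column.
df = pd.DataFrame({"department": ["75", "75", "75", "13", "13", "69"]})

# Replace each category by its relative frequency.
freq = df["department"].value_counts(normalize=True)
df["department_freq"] = df["department"].map(freq)
```

Unlike one-hot encoding, this adds a single numeric column regardless of cardinality, which is why it scales to 96 departments.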
We do not need to scale or normalise the data, since we mainly use tree-based models.
Now, this is where it gets a little confusing (the twist ;)). As stated above, this is a regression problem. However, we will not be evaluating the model with regression metrics but with classification metrics, simply because, in our case, they are more expressive and easier to explain business-wise.
If I were to tell you we get a 70% recall on class D, it would be easier to comprehend than an RMSE of 140, right?
Converting a predicted energy consumption to a class is very easy since the problem already comes with a label system (energy classes A through G).
Now, you might ask: why not take the easy way out and just define a classification problem? And to you I say:
Simply put, we prefer having both and we can use the regression prediction to infer the class but not the other way around.
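The conversion boils down to binning the predicted consumption. A minimal sketch, using the pre-2021 French DPE cut-offs in kWh/m²/year as an assumed threshold set (substitute the actual cut-offs of your label system):

```python
import pandas as pd

# Assumed energy-class thresholds (kWh/m²/year), pre-2021 French DPE scale.
bins = [-float("inf"), 50, 90, 150, 230, 330, 450, float("inf")]
labels = list("ABCDEFG")

def to_energy_class(consumption):
    """Map predicted consumption values to energy classes A-G."""
    return pd.cut(pd.Series(consumption), bins=bins, labels=labels)

# Example: three hypothetical regression outputs.
classes = to_energy_class([42.0, 140.0, 510.0])
```

`pd.cut` uses right-closed intervals by default, so a prediction of exactly 50 still lands in class A.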
We use recall, precision and f1-score for each of the 7 classes.
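Once the predictions are converted to classes, scikit-learn's `classification_report` gives all three metrics per class in one call. Shown here on hypothetical labels:

```python
from sklearn.metrics import classification_report

# Hypothetical true vs. predicted classes after converting regression outputs.
y_true = ["D", "D", "E", "C", "B", "D", "E"]
y_pred = ["D", "E", "E", "D", "B", "D", "D"]

# Per-class precision, recall and f1-score as a nested dict.
report = classification_report(y_true, y_pred, output_dict=True,
                               zero_division=0)
```

`zero_division=0` avoids warnings for classes that are never predicted, which, as we will see below, happens to class G.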
We use the mean energy consumption as a baseline.
We then mostly train tree-based models: Decision Tree Regressor, LightGBM and CatBoost.
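A sketch of the baseline-versus-model setup on toy data (the features and target here are synthetic stand-ins, not the building dataset). LightGBM's `LGBMRegressor` and CatBoost's `CatBoostRegressor` plug into the same fit/predict interface:

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

# Synthetic stand-in for the real features (age, surfaces, temperatures...).
X = rng.normal(size=(500, 4))
y = 200 + 30 * X[:, 0] - 20 * X[:, 1] + rng.normal(scale=5, size=500)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Baseline: always predict the mean training consumption.
baseline = DummyRegressor(strategy="mean").fit(X_train, y_train)

# One of the tree-based models.
tree = DecisionTreeRegressor(max_depth=6, random_state=0).fit(X_train, y_train)

baseline_score = baseline.score(X_test, y_test)  # R², ~0 by construction
tree_score = tree.score(X_test, y_test)
```

Any model worth keeping should comfortably beat the mean baseline.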
b. Hyperparameter tuning
We use grid search. It obviously takes a long time to train, but it is worth it. Also, we’re too lazy to do Bayesian optimisation 🤷♀️.
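A grid search sketch with scikit-learn, shown on the decision tree with a small, hypothetical grid (the real grids were per-model and wider):

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3))
y = 5 * X[:, 0] + rng.normal(scale=0.5, size=300)

# Illustrative grid, not the one used in the project.
param_grid = {"max_depth": [3, 6, 10], "min_samples_leaf": [1, 5, 20]}

search = GridSearchCV(DecisionTreeRegressor(random_state=0),
                      param_grid, cv=3,
                      scoring="neg_root_mean_squared_error")
search.fit(X, y)

best_params = search.best_params_
```

Grid search trains one model per parameter combination and fold, which is exactly why it is slow: 9 combinations × 3 folds = 27 fits here.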
c. Cross Validation
We use the good old fashioned k-fold cross validation.
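In scikit-learn terms, on toy data (the fold count and scorer are illustrative choices):

```python
import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
y = 4 * X[:, 0] + rng.normal(scale=0.5, size=200)

# Plain k-fold: 5 splits, shuffled once up front for stability.
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(DecisionTreeRegressor(random_state=0),
                         X, y, cv=cv, scoring="r2")
```

Averaging the five fold scores gives a less optimistic estimate than a single train/test split.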
We build a Pipeline that executes the validation and evaluation tasks common to all models, and a Model class, specific to each model, for preprocessing and training.
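The split can be sketched as follows. This is a hypothetical skeleton illustrating the design, not the actual repo code: a `Model` wrapper owns model-specific preprocessing and training, while a shared `Pipeline` runs the common evaluation:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

class Model:
    """Owns the preprocessing and training specific to one estimator."""
    def __init__(self, estimator, preprocess=None):
        self.estimator = estimator
        self.preprocess = preprocess or (lambda X: X)

    def fit(self, X, y):
        self.estimator.fit(self.preprocess(X), y)
        return self

    def predict(self, X):
        return self.estimator.predict(self.preprocess(X))


class Pipeline:
    """Runs the validation/evaluation steps shared by all models."""
    def __init__(self, models):
        self.models = models  # dict of name -> Model

    def run(self, X_train, y_train, X_test, y_test):
        results = {}
        for name, model in self.models.items():
            model.fit(X_train, y_train)
            pred = model.predict(X_test)
            # Common evaluation step: here, mean absolute error.
            results[name] = float(np.abs(pred - y_test).mean())
        return results


# Tiny usage example on synthetic data.
X = np.arange(20, dtype=float).reshape(-1, 1)
y = 2 * X.ravel()
results = Pipeline({"tree": Model(DecisionTreeRegressor(random_state=0))}).run(
    X[:15], y[:15], X[15:], y[15:])
```

The benefit is that adding a fourth model only means writing a new `Model`; the evaluation code is shared.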
We use the MLflow Tracking component to manage our experiments: logging parameters, metrics, models and artifacts (confusion matrices, feature importance plots, etc.).
After running all experiments, MLflow allows us to compare the metrics and keep track of the best hyperparameters.
The detailed code can be found in this GitHub repo.
a. Confusion matrices
We first take a look at the four confusion matrices side by side:
We can see that the best-predicted classes are ‘D’ and ‘E’ across the board (except for the mean baseline, where logically only class D is a winner), which makes sense since they are the most frequent classes. The three tree-based models have similar matrices. Unfortunately, class ‘G’ is never predicted, which, again, makes sense since it has the smallest support and we did not let our models run for very long.
We also notice that wrongly predicted samples are usually assigned to a neighbouring class, i.e. the next higher or lower class.
The three models have very close performances, with similar values for the classification metrics across the 7 energy classes:
We mainly focus on the f1-score since it balances precision and recall (it is their harmonic mean), and we do not have a specific need favouring one over the other.
Class D comes out on top once again followed closely by classes B, E and A.
The models have surprisingly underperformed in predicting class C, even though it is the third most represented class in the dataset.
b. Feature Importance
We use SHAP values to explain our model outputs. We focus on feature importance. The plots below show the distribution of the impact each feature has on the model output.
The age, surface, building type and average winter temperature are the most important features for the model. The lower the age, the lower the energy consumption, which makes sense since buildings have become more energy efficient over time, for example through the use of insulating materials.
The age, the surface and the building type come out on top once again for both LightGBM and CatBoost, with minor differences in SHAP values. The postal code is a winner for LightGBM, being almost as important to the model as the building type, whereas it drops a little lower for CatBoost, its place taken by the number of floors and the maximum winter temperature.
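As an aside for readers without the `shap` package installed: the same question (which features actually drive the predictions?) can also be probed with scikit-learn's permutation importance. This is a plainly different, lighter-weight technique than the SHAP analysis above, shown here on toy data:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(0)
# Toy data: feature 0 drives the target, feature 1 is pure noise.
X = rng.normal(size=(400, 2))
y = 10 * X[:, 0] + rng.normal(scale=0.5, size=400)

model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)

# Shuffle each feature in turn and measure how much the score degrades.
result = permutation_importance(model, X, y, n_repeats=5, random_state=0)
importances = result.importances_mean
```

Permutation importance only gives a global ranking, whereas SHAP also shows the direction and per-sample distribution of each feature's impact.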
We have somewhat successfully modelled the problem. The performance can be enhanced by widening the grid search and prolonging training time.
Obviously, we had a class imbalance issue, which could be mitigated by adding more data for classes G and A. However, the under-representation of these classes is not a symptom of poor data quality but rather a reflection of reality.
We did not model greenhouse gas emissions since they are strongly correlated with energy consumption and can be inferred with a simple linear regression.
And there you have it!
We hope you remember this next time you rent or buy a house; ALWAYS do your energy due diligence ;)