Use Case #2: Predicting Buildings’ Energy Consumption using Machine Learning

April 14, 2022
Photo by Daniele Franchi on Unsplash

Part 1: Exploratory Data Analysis


If you’ve ever lived in Europe, chances are you’ve already seen an EU energy label on an appliance such as a washing machine or a dryer. You know, that label that tells you how much havoc that fridge is going to wreak on your already exorbitant electricity bill.

Well, the same principle goes for buildings. In Europe, you can get a label attached to your building (i.e. an energy performance certificate) detailing how energy efficient it is (ranging from class A 🤩 to class G 😩).

In France, it’s called the DPE (Diagnostic de Performance Énergétique). In short, it is a diagnosis of a building that results in two labels: an energy label (for how much energy the building consumes ⚡) and a climate label (for how much greenhouse gas it produces).

Many factors influence the labels: the age of the construction, whether the windows are insulated, sun exposure, the local climate, etc.

Energy label and Climate label

Needless to say, at the risk of sounding dramatic, the higher you climb, the more you and the environment (so basically all of humanity) pay.

In this article, we will look at DPE data from France, analyse it and build ML models that predict the energy consumption of a building as well as its greenhouse gas emissions.

We’ll approach it as follows:

  1. Data source, description and stack
  2. Data cleaning and exploration (the fun stuff)
  3. Modelling and prediction (the really fun stuff) (Part 2)

Let’s hit the ground running 🏃‍♀️

The code as well as the data used can be found in this GitHub repo.

1. Data Source, Description & Stack

We used a public DPE dataset published by the French government agency ADEME.

The data contain 9.5 million rows. Each row describes one building diagnosis (which we will henceforth refer to as a DPE), with fields ranging from the diagnosis date and the building description to the two labels assigned to the building.

The diagnoses span from 2001 to 2020.

We sampled 5% of the data, about 500k rows, to explore and model.
We use Python and Google Colab as our main stack. The Python packages’ details can be found in the GitHub repo.
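
For the curious, here is a minimal sketch of how a 5% sample could be drawn without loading all 9.5 million rows into memory at once. The function name `sample_csv`, the chunk size and the seed are our own illustrative choices; the repo may well do this differently.

```python
import pandas as pd

def sample_csv(path, frac=0.05, chunksize=100_000, seed=42):
    """Draw roughly `frac` of the rows of a large CSV, one chunk at a time."""
    parts = []
    for i, chunk in enumerate(pd.read_csv(path, chunksize=chunksize)):
        # Use a different seed per chunk so chunks are not sampled identically.
        parts.append(chunk.sample(frac=frac, random_state=seed + i))
    return pd.concat(parts, ignore_index=True)
```

Calling `sample_csv("dpe.csv")` (a hypothetical file name) would return a DataFrame holding about 5% of the rows of each chunk.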

2. Data Cleaning & Exploration

To be honest, this was not the cleanest dataset out there, but we made do.

We will briefly go through the data cleaning steps we took before diving into data analysis.

  • First steps:

The dataset originally has 66 columns. Using the metadata file provided with the data, we first dropped the columns we did not need: those describing the diagnosis itself rather than the building, such as the ‘id’ of the person in charge, the ‘id’ and the name of the method used, and address details such as the street number.

That leaves us with 34 columns.

  • Missing values:

Next, we looked at missing values and deleted any columns with more than 80% missing data.
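
As a rough sketch, the 80% rule can be written in a couple of lines of pandas. The helper name `drop_sparse_columns` and the toy data are ours, for illustration only.

```python
import numpy as np
import pandas as pd

def drop_sparse_columns(df, max_missing=0.8):
    """Drop the columns whose share of missing values exceeds `max_missing`."""
    missing_share = df.isna().mean()  # fraction of missing values per column
    return df.loc[:, missing_share <= max_missing]

toy = pd.DataFrame({
    "mostly_empty": [np.nan, np.nan, np.nan, np.nan, np.nan, 1.0],  # 5/6 missing
    "complete": [1, 2, 3, 4, 5, 6],
})
print(drop_sparse_columns(toy).columns.tolist())
```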

When we looked at how missingness correlates across the remaining columns, we found that missing values cluster in about four groups of columns: when one column of a group is missing, the others in that group tend to be missing too. This means we can delete the rows with missing values without losing too many rows.

Heat map for missing values
Dendrogram for missing values
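
Plots like these are often drawn with the missingno package (whether the repo uses it is an assumption on our part). The numbers behind them can be sketched with plain pandas: turn each column into a missing/not-missing indicator and correlate the indicators. The helper name and the toy columns below are hypothetical.

```python
import numpy as np
import pandas as pd

def nullity_correlation(df):
    """Correlation between the missingness indicators of columns that have gaps."""
    is_missing = df.isna()
    # Keep only columns that are sometimes, but not always, missing.
    partial = is_missing.loc[:, is_missing.any() & ~is_missing.all()]
    return partial.astype(int).corr()

toy = pd.DataFrame({
    "wall_area": [1.0, np.nan, 3.0, np.nan],
    "wall_type": ["a", None, "b", None],
    "year": [1990, 2001, np.nan, 1975],
})
corr = nullity_correlation(toy)
print(corr.round(2))

# Rows with any missing value are then dropped.
cleaned = toy.dropna()
```

A correlation of 1 between two indicators means the two columns are always missing together, so dropping incomplete rows costs no more rows than dropping on either column alone.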
  • Correlation:

Next, we looked at the correlation between the remaining columns. This led us to set aside two columns, each of which has a correlation of 1 with another column.

That leaves us with about 280k rows and the following 28 columns:

Dataframe info
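
A small sketch of how perfectly correlated columns can be spotted: compute the absolute correlation matrix, look only at its upper triangle so each pair is checked once, and flag the second column of every near-perfect pair. The function name and the threshold of 0.999 are our own choices.

```python
import numpy as np
import pandas as pd

def redundant_columns(df, threshold=0.999):
    """Numeric columns that are near-perfectly correlated with an earlier column."""
    corr = df.select_dtypes("number").corr().abs()
    # Upper triangle only: each pair is checked once, and the first column of a pair is kept.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    return [col for col in upper.columns if (upper[col] >= threshold).any()]

toy = pd.DataFrame({"a": [1, 2, 3, 4], "b": [2, 4, 6, 8], "c": [4, 1, 3, 2]})
print(redundant_columns(toy))
```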
  • Column inspection:

After getting rid of the elephants in the room, we move on to a more thorough inspection of each and every problematic column.

Construction year (année de construction):

The column has many outliers. We replace the values smaller than 1500 or bigger than 2020 with the column median of 1967.
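
A minimal sketch of this replacement follows. One assumption to flag: here the median is computed on the plausible years only, which may differ from how the 1967 above was obtained. The function name is ours.

```python
import pandas as pd

def fix_construction_year(years, low=1500, high=2020):
    """Replace years outside [low, high] with the median of the plausible years."""
    plausible = years.between(low, high)  # missing years also count as not plausible
    median = years[plausible].median()
    return years.where(plausible, median)

years = pd.Series([1967, 0, 1948, 2999, 1990])
print(fix_construction_year(years).tolist())
```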

There is a surge of construction in the 1940s, most likely due to rebuilding after the end of WW2.

Construction year histogram

We use this column to compute the age of each building at the time of its diagnosis. Since no single column states the diagnosis date clearly and correctly, we combine two columns (the diagnostician’s visit date and the official DPE date).
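
The exact way the two dates are combined is not spelled out here, so the sketch below makes an assumption: it prefers the official DPE date and falls back to the visit date when the former is missing or unparseable. The column names are hypothetical stand-ins for the dataset’s own.

```python
import pandas as pd

def building_age(df, year_col="annee_construction",
                 dpe_date_col="date_etablissement_dpe",
                 visit_date_col="date_visite_diagnostiqueur"):
    """Age of each building, in years, at the time of its diagnosis."""
    dpe_date = pd.to_datetime(df[dpe_date_col], errors="coerce")
    visit_date = pd.to_datetime(df[visit_date_col], errors="coerce")
    # Assumption: prefer the official DPE date, fall back to the visit date.
    diagnosis_year = dpe_date.fillna(visit_date).dt.year
    # Clip at zero in case a diagnosis is dated before the recorded construction year.
    return (diagnosis_year - df[year_col]).clip(lower=0)

toy = pd.DataFrame({
    "annee_construction": [1967, 2005],
    "date_etablissement_dpe": ["2015-03-01", None],
    "date_visite_diagnostiqueur": ["2015-02-20", "2018-06-11"],
})
print(building_age(toy).tolist())
```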

Living space (surface habitable):

The box plot shows some outliers, and judging by the column description, values above 1000 m² for the living space seem unreasonable, so we treat them as outliers.

This cutoff is strictly based on intuition, as we have no expert evidence to support it.
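
To put a hand-picked cutoff in perspective, it can be compared with the usual box-plot fence (Q3 + 1.5 × IQR), beyond which a box plot draws points as outliers. The sketch below is illustrative; the helper names, the column name and the toy values are ours.

```python
import pandas as pd

def upper_fence(values, k=1.5):
    """Upper box-plot fence: Q3 + k * IQR."""
    q1, q3 = values.quantile([0.25, 0.75])
    return q3 + k * (q3 - q1)

def drop_above(df, column, cutoff):
    """Keep only the rows whose `column` is at most `cutoff`."""
    return df[df[column] <= cutoff]

toy = pd.DataFrame({"surface_habitable": [40, 60, 80, 100, 120, 5000]})
print(upper_fence(toy["surface_habitable"]))
print(len(drop_above(toy, "surface_habitable", 1000)))
```

The fence is usually much stricter than a cutoff like 1000, which is one reason to prefer a generous, hand-picked threshold when large values are rare but legitimate.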

Number of floors (nombre de niveaux):

Many of the values make absolutely no sense: the tallest building in France has only 56 floors.

After looking up the addresses on Google Maps, it appears that neither the addresses nor the number of floors are reliable features.

Since we cannot make sense of the outliers, and since they only represent 101 rows, we chose to cut all rows with more than 25 floors. The choice of 25 is based on intuition once again.

As John Lennon once said:

I use my intuition, it takes me for a ride
Intuition takes me there
Intuition takes me anywhere 🎶

Postal code (code INSEE commune actualisé):

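Strictly speaking, this column holds the INSEE commune code rather than the postal code, but its first two digits identify the department in the same way. About 100 rows don’t have a valid code, so we delete them.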

We look at the frequency count of each French department:

Postal code histogram

The departments Nord (59), Bouches-du-Rhône (13), Paris (75) and Rhône (69) have the most diagnosed buildings. That makes sense, since they are among the most populated departments in France.
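
Here is a sketch of how the department can be pulled out of the code column and counted. It assumes the codes are read as 5-character strings (so leading zeros survive), treats Corsica’s 2A/2B prefixes as valid, and uses three characters for overseas departments starting with 97. The function name and the validation pattern are our own.

```python
import pandas as pd

VALID_CODE = r"(?:\d{2}|2[AB])\d{3}"

def department(codes):
    """Department taken from 5-character commune codes; invalid codes become missing."""
    codes = codes.astype("string").str.strip().str.upper()
    valid = codes.str.fullmatch(VALID_CODE).fillna(False).astype(bool)
    # Overseas departments (codes starting with 97) use three characters.
    overseas = codes.str.startswith("97").fillna(False).astype(bool)
    dept = codes.str[:2].mask(overseas, codes.str[:3])
    return dept.where(valid)

codes = pd.Series(["59350", "75056", "2A004", "97105", "abc", None, "59009"])
dept = department(codes)
print(dept.value_counts())  # frequency count per department, missing values ignored
```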

The surfaces (floors, doors, walls, windows facing north, east…):

These columns have a few negative values, which we consider to be typing errors. We replace them with their positive counterparts.

Moving on to the pièce de résistance: targets.

We will be focusing on two different, but highly correlated, targets: the energy consumption (in kWh ep./m².year, where ‘ep’ stands for primary energy, énergie primaire) and the greenhouse gas emissions (in kg eqCO2/m².year).

Energy consumption (consommation d’énergie):

The target is expressed in two columns, the energy consumption (consommation_energie) and the corresponding class (classe_consommation_energie).

The energy consumption column does not have any missing values. However, it does have negative and zero values.

As for the negative values, we looked up the addresses on Google Maps and none of them belongs to an energy-producing facility. So either the ‘-’ sign was a data-entry mistake, or the buildings have solar power (or some other form of power generation) integrated.

We treat the negative values as an oversight and convert them to their positive counterparts.

We delete the zero values, since we have no reason to believe these buildings are actually net-zero. We get the following histogram. The mean energy consumption is 206.8 kWh ep./m².year.

Energy consumption histogram
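
The two cleaning steps above (flip the negatives, drop the zeros) can be sketched as follows. The function name is ours; `consommation_energie` is the column named earlier.

```python
import pandas as pd

def clean_consumption(df, column="consommation_energie"):
    """Flip negative consumption values to positive, then drop the zero rows."""
    out = df.copy()
    out[column] = out[column].abs()
    # Strictly positive values only; this would also drop missing values, if there were any.
    return out[out[column] > 0]

toy = pd.DataFrame({"consommation_energie": [-120.0, 0.0, 250.0]})
print(clean_consumption(toy)["consommation_energie"].tolist())
```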

The energy consumption class column contains unknown values and, quite often, wrong ones. Consequently, we reassign the classes from the energy consumption column, using the standard energy labelling (A through G).

Energy consumption class count
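
Reassigning the classes boils down to binning the consumption values. The sketch below uses `pandas.cut` with the thresholds of the DPE energy scale that was in force before the 2021 reform (A up to 50, B up to 90, C up to 150, D up to 230, E up to 330, F up to 450, G above). These numbers are an assumption on our part, so check them against the label figure before reusing them.

```python
import numpy as np
import pandas as pd

# Assumed upper bounds (kWh ep./m².year) of classes A to F; G is everything above.
ENERGY_BINS = [-np.inf, 50, 90, 150, 230, 330, 450, np.inf]
LABELS = list("ABCDEFG")

def energy_class(consumption):
    """Map consumption values to classes A-G; each bin includes its upper bound."""
    return pd.cut(consumption, bins=ENERGY_BINS, labels=LABELS, right=True)

toy = pd.Series([45, 50, 206.8, 331, 900])
print(energy_class(toy).astype(str).tolist())
```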

More than a third of the diagnosed buildings are labelled D, followed by label E with 25%, whereas class A accounts for a mere 4%.

Suffice it to say, we still have a long way to go and a lot of work to do to reach energy efficiency ☘️.

Let’s look at the total energy consumption per department:

The departments Nord (59), Rhône (69), Paris (75) and Haute-Garonne (31) have the highest total energy consumption, which makes sense since they have some of the largest populations in France.

We dive deeper into the energy consumption, and use the postal code to plot the energy consumption mean per French department.

Hautes-Alpes (05), Lozère (48), Cantal (15) and Nièvre (58) have the highest mean energy consumption per building, surpassing 254 kWh ep./m².year. That makes sense, since they’re fairly cold departments.

To get a more general idea, and for fun’s sake (because who doesn’t like maps?), we’ll look at the mean energy consumption per French region.
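
The difference between the total and the mean per department is easy to see in a small group-by sketch. The column name `departement` and the toy numbers are hypothetical.

```python
import pandas as pd

def consumption_by_department(df, dept_col="departement",
                              value_col="consommation_energie"):
    """Total, mean and number of diagnoses per department, sorted by mean."""
    stats = df.groupby(dept_col)[value_col].agg(
        total="sum", mean="mean", buildings="count"
    )
    return stats.sort_values("mean", ascending=False)

toy = pd.DataFrame({
    "departement": ["59", "59", "59", "59", "05", "05"],
    "consommation_energie": [200, 100, 150, 180, 300, 260],
})
stats = consumption_by_department(toy)
print(stats)
```

In this toy example the department with many buildings leads on the total, while the one with few but energy-hungry buildings leads on the mean, which is the same pattern as in the real data.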

Greenhouse Gas Emissions (GES):

This column has about 4 negative values, which we delete. The mean is 27 kg eqCO2/m².year.

The GHG column distribution is as follows:

GHG emissions histogram

Same as before, the greenhouse gas emission (GHG) class column is questionable at best. We reassign the classes from the values of the GHG emissions column, using the labels seen above.

GHG class count
GHG count pie chart

24% of the diagnosed buildings are classified C, followed by class B with 18%. Once again, class A is way down the list, coming in at 9.5%.

And if you haven’t already, this is where you ask yourself just how horrible an impact your house has. Probably a lot 👀
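
Percentages like these come straight out of a normalised value count. A minimal sketch, with a helper name of our own:

```python
import pandas as pd

def class_shares(classes, order="ABCDEFG"):
    """Percentage of rows per class, listed from A to G."""
    shares = classes.value_counts(normalize=True).mul(100).round(1)
    # Reindex so that every class appears, in order, even when it has no rows.
    return shares.reindex(list(order), fill_value=0.0)

toy = pd.Series(list("CCBA"))
print(class_shares(toy))
```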

For the highest total GHG emissions, surprise surprise, the fantastic four strike again: Nord (59), Rhône (69), Paris (75) and Haute-Garonne (31).

Let’s take a look at which departments score the highest mean GHG emissions:

The departments Ardennes (08), Creuse (23) and Moselle (57) have the highest mean GHG emissions per building.

And voilà! That concludes the data analysis portion of this article.

In part 2, we will be diving into building ML models to predict the targets.

PS: Think about all of this next time you leave the heater on when you leave the house, or “forget” to unplug your phone charger…Consequences!

Useful links:

DPE explained (it’s in French though 🤓)

Part 2
