January 18th 2021

Buildings, Data & Dinosaurs? Why M&V Matters!

Ok - so buildings and dinosaurs may not initially seem like an obvious pairing, but let me explain.

This article will explore some concepts relating to data analysis and comparison methods within the context of building energy performance. More specifically we will examine some example use cases in which the process of Data Analytics is not the same as Data Visualisation.

So if you're interested in Energy Management, Building Analytics, Data Visualisation and Measurement & Verification then please keep reading. Oh, and don't worry... we'll get to the dinosaur soon enough!

First we'll start with a quick maths history lesson and an introduction to the statistician Francis Anscombe. Back in 1973 Anscombe published a set of numbers now known as Anscombe's Quartet. The dataset itself is simple in nature and consists of four sets of eleven (x,y) value pairs:

As you can see there is nothing unique about the numbers themselves. In the simple table view above it is a challenge to make out any form of pattern or shape in the data or even to spot any significant outlying value. To progress further with our exploration we now need to visualise the data:

In visualising the numbers in Anscombe's Quartet the shape of the four individual data sets immediately becomes clear. Set 1 is scattered, Set 2 is a curve, Set 3 is a linear ramp and Set 4 is a vertical line. Sets 3 and 4 also both have a single high outlying value.

It's important to stress at this point in our analysis that it's largely the power of the human eye and brain that has driven this visualisation and shape definition process. So far we have not performed any numerical calculations.

So as a next step let's take a look at some of the basic maths and statistics for the number sets. The calculated stats will include the Mean, Sample variance, Correlation, Linear regression line and R-squared.

In calculating this selection of data summary metrics the magic of Anscombe's Quartet becomes clear. Each of the four unique (x,y) datasets returns an identical set of summary statistics.
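This effect is easy to check numerically. The sketch below uses the published Quartet values and standard textbook formulas for the mean, regression line and correlation; all four sets return (to two or three decimal places) the same results.

```python
import statistics

# Anscombe's Quartet (1973): four sets of eleven (x, y) value pairs
quartet = {
    "Set 1": ([10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5],
              [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "Set 2": ([10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5],
              [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "Set 3": ([10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5],
              [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "Set 4": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
              [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

def summary(x, y):
    """Return (mean x, mean y, slope, intercept, correlation) for one set."""
    mx, my = statistics.mean(x), statistics.mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    slope = sxy / sxx
    intercept = my - slope * mx
    r = sxy / (sxx * syy) ** 0.5
    return mx, my, slope, intercept, r

for name, (x, y) in quartet.items():
    mx, my, slope, intercept, r = summary(x, y)
    # every set prints mean x=9.00, mean y=7.50, y ~ 3.00 + 0.500x, r ~ 0.816
    print(f"{name}: mean x={mx:.2f}, mean y={my:.2f}, "
          f"y={intercept:.2f}+{slope:.3f}x, r={r:.3f}")
```

Despite their radically different shapes when plotted, every set shares the same mean, the same fitted regression line y = 3.00 + 0.500x and the same correlation of 0.816.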

So what does this tell us from a data analysis perspective? It tells us that mathematically speaking (using this limited set of summary statistics at least) the four data sets are identical. Yet we know from the previous visual analysis that the data sets are not identical, as each displays its own unique shape and form: scatter, curve, linear ramp and vertical line.

This short introduction to Anscombe's Quartet highlights an important lesson: Data Visualisation and Data Analytics are not the same thing. Numerical analysis can return identical summary statistics even when the shapes of the underlying data sets are very different. Or, to quote Ronald H. Coase, "if you torture the data long enough, it will confess to anything."

Anscombe's Quartet is a very well known and well studied dataset that has been around for nearly half a century but what happens when we try and push the boundaries and make it bigger? In today's Big Data world is it possible to create a larger version of Anscombe's Quartet? Is it possible to have more sets of larger number combinations that still return identical numerical summary statistics?

Thanks to some great open source work in this area these questions have already been answered through the development of a dataset called the 'Datasaurus Dozen' (the original 'datasaurus' was drawn by Alberto Cairo, and the Dozen was built around it by Justin Matejka and George Fitzmaurice). The Datasaurus Dozen essentially acts as a bigger and more complicated version of Anscombe's Quartet: instead of the four sets of numbers in the original Quartet there are now twelve sets of numbers in the Dozen (plus a special 13th!). Visualised, the Datasaurus Dozen looks like this:

When a range of summary statistics is calculated for each of these twelve datasets, identical values are again computed (to two decimal places at least).

And what about the Dinosaur? Well it turns out there is a special 13th set of numbers included within the Dozen, and guess what it looks like when plotted? 

Both Anscombe's Quartet and the Datasaurus Dozen teach us an important lesson in the field of data analysis. Data Visualisation is as important as, if not more important than, Data Analysis via statistical number crunching. We can let loose advanced Artificial Intelligence (AI) routines and Machine Learning (ML) based algorithms, but we need to be sure first that we are not trying to analyse a monster - or a dinosaur.

Both Anscombe's Quartet and the Datasaurus Dozen were structured datasets, i.e. human/computer generated sets of numbers produced with the single purpose of proving a point. The datasets collected from modern Building Management System (BMS) and Internet of Things (IoT) driven infrastructure, however, are largely unstructured and not as simple. So how can we apply this lesson within the context of today's built environment and our Smart Building dominated cities and communities?

The bigger the building the more complex the datasets become and the more sophisticated our response and analytics toolkit needs to be. If it's possible for simple datasets like the Quartet and Datasaurus to provide us with numerical false truths then it's more than possible for our buildings to provide us with false truths too.

This led me to consider whether it would be possible to create a new set of numbers that illustrates the same concepts as Anscombe's Quartet and the Datasaurus Dozen but is specifically shaped to suit the field of building performance analysis.

Utilising the IES Virtual Environment software suite, a basic building energy model was created and run through a series of climate control sequences. Simulation 1 considered a heating-dominated building in the Northern European climate zone. Simulation 2 evolved the control strategy to be cooling-plant dominated. Simulation 3 is seasonally independent, whereby the building uses an equal amount of energy in every month of the year. The reported monthly energy totals (MWh) for this Dynamic Thermal Model or DTM Trio look like this:

A standard technique used in the energy management industry would now be to calculate an Energy Use Intensity (or EUI) rating for each of the simulation scenarios in this Trio. In doing so, however, the same EUI figure of 40,415 kWh is returned for all three simulations.
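The effect is easy to reproduce with illustrative numbers. The monthly profiles and the 12,000 m² floor area below are hypothetical (not the actual IES VE model output); the point is simply that three very differently shaped profiles can return the same annual total and therefore the same EUI.

```python
# Three hypothetical monthly energy profiles (MWh) - illustrative only,
# not the actual IES VE model output. Each sums to the same annual total.
heating_dominated = [70, 60, 50, 35, 25, 15, 10, 10, 20, 40, 65, 80]  # winter peaks
cooling_dominated = [10, 15, 25, 40, 60, 75, 80, 70, 50, 30, 15, 10]  # summer peaks
flat_profile      = [40] * 12                                         # seasonally independent

FLOOR_AREA_M2 = 12_000  # assumed gross floor area (hypothetical)

for name, profile in [("Heating", heating_dominated),
                      ("Cooling", cooling_dominated),
                      ("Flat", flat_profile)]:
    annual_kwh = sum(profile) * 1_000      # MWh -> kWh
    eui = annual_kwh / FLOOR_AREA_M2       # kWh/m2 per year
    print(f"{name}: annual={annual_kwh:,.0f} kWh, EUI={eui:.1f} kWh/m2")
```

All three profiles report an identical 480,000 kWh annual total and an identical EUI of 40.0 kWh/m², even though one building needs heating-plant investment, one needs cooling-plant investment and one needs neither.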

This goes some way to replicating the overall effect of Anscombe's Quartet and the Datasaurus Dozen. The EUI rating alone is not sufficiently granular to allow the real performance characteristics and unique nature of the building to be analysed.

In order to deliver an energy efficiency program within the Simulation 1 facility, a range of Energy Conservation Measures (or ECMs) targeted around the building heating plant would be required. The Simulation 2 facility, however, would require a very different set of ECMs targeted around the building cooling plant for an energy efficiency upgrade to be effective. So EUI based analysis and energy benchmarking alone may well lead to the implementation of an ineffective CapEx program with no ROI.

In reality this could, for example, involve upgrading the chiller plant in Simulation 1 whilst upgrading the boiler plant in Simulation 2. Both of these example energy upgrade programs would have little effect on delivering any significant percentage energy reduction and would have been ill-conceived.

For this reason EUI based energy benchmarking alone is not sufficient to drive energy investment programs, and we need to go much further as an industry. International standards exist (e.g. the IPMVP) which provide an outline roadmap of how we can enhance our analysis approach. Measurement & Verification (M&V) is the backbone of methodologies such as the IPMVP, which introduce statistical metrics such as Normalised Mean Bias Error (NMBE) and Root Mean Squared Error (RMSE).

More detailed calibration metrics such as NMBE and RMSE certainly move our energy analysis approach well away from the error-prone basic EUI method and set a higher standard. The implementation of an M&V process allows for more rigorous and targeted analysis of building energy use at both Monthly and Hourly levels, but in turn better-quality data is needed.
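As a sketch of what these calibration metrics look like in practice, the snippet below implements NMBE and the commonly paired CV(RMSE) (the RMSE normalised by the mean of the measured data) in a simplified form that ignores the degrees-of-freedom adjustment found in guidance such as ASHRAE Guideline 14. The meter readings are hypothetical.

```python
import math
import statistics

def nmbe(measured, simulated):
    """Normalised Mean Bias Error (%) - simplified form, ignoring the
    degrees-of-freedom adjustment used in formal calibration guidance."""
    n = len(measured)
    bias = sum(m - s for m, s in zip(measured, simulated))
    return 100 * bias / (n * statistics.mean(measured))

def cv_rmse(measured, simulated):
    """Coefficient of Variation of the RMSE (%), same simplification."""
    n = len(measured)
    rmse = math.sqrt(sum((m - s) ** 2 for m, s in zip(measured, simulated)) / n)
    return 100 * rmse / statistics.mean(measured)

# Hypothetical monthly meter readings vs model predictions (MWh)
measured  = [100, 110, 120, 95]
simulated = [95, 108, 118, 97]
print(f"NMBE = {nmbe(measured, simulated):.2f}%")        # ~1.65%
print(f"CV(RMSE) = {cv_rmse(measured, simulated):.2f}%")  # ~2.86%
```

Because NMBE lets over- and under-predictions cancel while CV(RMSE) does not, the two metrics are reported together: a model can have a near-zero bias yet still fit the measured data poorly.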

In order to deliver Monthly based M&V, regular and accurate monthly meter readings are needed. This may not always be possible (especially during the current COVID crisis, with building access being limited) and it is not uncommon to encounter buildings with highly irregular meter readings. Relying on manual meter readings being taken regularly by FM teams is therefore not a solid approach to M&V, and it often leads to poor quality data and the overall failure of any planned M&V process.

This is especially true at a sub-meter level. It may well be possible to have regular and accurate monthly data recorded at the Main Incoming utility level (e.g. Electricity/Gas/Water). But in a building with, say, 50-100 sub-meters, unless FM teams are completing regular monthly inspections and recording manual readings across all sub-meters, data is often lost, with human error being the main driving factor.

In order to implement an Hourly based M&V program, hourly data is obviously required as a prerequisite. This moves us away from manually based meter reading M&V programs to Automatic Meter Reading (or AMR) based technology. Through the use of Building Management System (BMS) infrastructure, high quality hourly (or sub-hourly) data can be automatically collected and stored in a time-series database or data archive facility.

By harvesting these hourly and sub-hourly data archives collected from the BMS more advanced M&V programs can ultimately be delivered although new challenges arise as the datasets become larger and more complex in nature. Data healing and cleansing routines are often needed in order to support both the data visualisation and data analytics process.
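As a minimal sketch of one such data healing routine, the snippet below fills short gaps in an hourly meter series by linear interpolation between the nearest good readings either side of the gap. Real cleansing pipelines need further rules (long outages, meter rollovers, timestamp alignment), and the readings shown are hypothetical.

```python
def fill_gaps(readings):
    """Fill runs of None in a meter series by linear interpolation
    between the nearest good readings either side of each gap.
    Gaps at the start or end of the series are left unfilled."""
    filled = list(readings)
    i = 0
    while i < len(filled):
        if filled[i] is None:
            start = i - 1                  # last good reading before the gap
            end = i
            while end < len(filled) and filled[end] is None:
                end += 1                   # first good reading after the gap
            if start >= 0 and end < len(filled):
                step = (filled[end] - filled[start]) / (end - start)
                for j in range(i, end):
                    filled[j] = filled[start] + step * (j - start)
            i = end
        else:
            i += 1
    return filled

# Hypothetical hourly kWh readings with a two-hour logger outage
hourly = [10.0, None, None, 16.0, 18.0]
print(fill_gaps(hourly))  # [10.0, 12.0, 14.0, 16.0, 18.0]
```

Interpolation like this is only defensible for short gaps; longer outages usually need to be flagged and excluded from the M&V baseline rather than invented.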

Despite these challenges, having an effective strategy for the implementation of an Hourly based M&V program is an essential part of truly understanding the performance, or DNA make-up, of your building.

So is your building performing like a Dinosaur or are you achieving a Gold Star standard? Well why not let Data Visualisation, Data Analytics and Measurement & Verification (M&V) answer that question for you? But make sure that you understand the difference or get a suitably qualified M&V professional on-board who does!