The heterogeneity of the energy sector at the residential level makes it difficult to develop specific strategies for fostering the energy transition towards renewable or more
sustainable energy sources. Achieving a proper segmentation of residential consumers is of utmost relevance to determine the impact of potential policy instruments. This
Deliverable presents a segmentation methodology, which creates a dictionary of electrical behaviour from the load profiles of households. This methodology can be roughly divided into four steps:
(1) the processing of a wide sample of residential electrical load profiles in order to complete missing values, detect and correct inconsistencies, and provide a suitable format;
(2) the extraction of features, understood as a summary of values of each load profile;
(3) the application of several cluster analysis methods to the extracted features; and
(4) the analysis of results to identify the most representative behaviour patterns as well as the assessment of regional and temporal trends among the elements.
The household segmentation provided in this Deliverable will be used both as an external variable of the causal diagram in Task 2.2 and as a key unit of assessment in Work Package 4.
The data sources collected consist of a pool of twenty datasets of load profiles from eight different countries, obtained from open-access online resources and from electricity suppliers whose data are not publicly accessible. The collected datasets are heterogeneous: they differ in length, sampling frequency, and measurement units, and in the way they store the load profile information. Each dataset therefore requires its own processing tasks. In addition, some datasets incorporate demographic and socioeconomic information that must be processed separately.
The particular characteristics of each individual dataset have been summarised in a table indicating the number of supply points and their type, the country of origin, the sampling period, and the duration of the data collection process, among others. The five datasets selected for analysis are (1) Electric cooperatives from Spain, (2) ISSDA from Ireland, (3) Low Carbon London from the UK, (4) Elergone Energia from Portugal, and (5) NEEA from the USA, totalling 33,114 supply points.
The Deliverable then discusses the data processing performed, that is, all the actions that extract relevant information from the original dataset files: data extraction, data cleaning, and feature extraction. With regard to data extraction, it explains how the original dataset files are translated into raw files. Raw files
consist of useful information in a common and simple format. They contain only timestamped values of the electricity consumed by individual households and are stored as comma-separated values (CSV) files. The data extraction process of the five selected datasets is thoroughly described. For example, the Electric cooperatives dataset required dealing with the specificities of the Spanish electricity market and its Electrical Metering Information System (SIMEL), whereas extracting data from the remaining datasets was less complex. Raw files, however, are still not ready for analysis, as they may contain gaps of missing values, outliers, or other data
inconsistencies that must be harmonised. This process is commonly known as data cleaning. Five main operations have been performed in this work: (1) data imputation, (2)
adaptation to local time, (3) avoidance of COVID-19 lockdown dates, (4) exclusion of short load profiles, and (5) exclusion of 0-valued load profiles.
Data imputation is the process of replacing missing data. Two different imputation strategies are applied depending on the length of the sequence of missing values. Sequences of missing values spanning eight consecutive hours or less are imputed by linear interpolation. Sequences spanning more than eight consecutive hours are imputed by the Last Observation Carried Forward (LOCF) method using a 7-day season. This means that the missing samples are replaced with the values observed at the same time of day and on the same day of the week in the preceding week.
Regarding the adaptation to local time, all time series have been referenced to their local time zone so that the different datasets can be compared, and the discontinuities produced by daylight saving time have been eliminated. In addition, all time series that span COVID-19 lockdowns or stay-at-home periods have been split into two parts: a pre-COVID-19 one, excluding the lockdown period, and a post-COVID-19 one, including the in-lockdown and post-lockdown periods. Finally, all time series shorter than one year or with all their values equal to 0 have been excluded from the general processing.
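The two imputation strategies described above can be sketched as follows. This is a minimal illustration and not the code used in the Deliverable: it assumes a regularly sampled load profile held in a plain list with `None` marking missing samples, and the function name `impute_load_profile` and its parameters are hypothetical. The handling of short gaps at the very start or end of the series, where no interpolation is possible, is also an assumption.

```python
from typing import List, Optional


def impute_load_profile(
    values: List[Optional[float]],
    samples_per_hour: int = 1,
    max_interp_hours: int = 8,
    season_days: int = 7,
) -> List[Optional[float]]:
    """Fill gaps (None) in a regularly sampled load profile.

    Interior gaps of at most `max_interp_hours` are linearly interpolated.
    Longer gaps copy the value found `season_days` earlier (seasonal LOCF).
    Samples with no value one season earlier are left as None.
    """
    max_gap = max_interp_hours * samples_per_hour
    season = season_days * 24 * samples_per_hour
    out = list(values)
    n = len(out)
    i = 0
    while i < n:
        if out[i] is not None:
            i += 1
            continue
        # Find the end (exclusive) of this run of missing samples.
        j = i
        while j < n and out[j] is None:
            j += 1
        gap = j - i
        if gap <= max_gap and i > 0 and j < n:
            # Short interior gap: straight line between the two neighbours.
            left, right = out[i - 1], out[j]
            for k in range(i, j):
                out[k] = left + (right - left) * (k - i + 1) / (gap + 1)
        else:
            # Long gap (or a short gap touching either end of the series,
            # an assumption of this sketch): reuse the sample from one
            # season earlier. Moving forward in time lets gaps longer than
            # one season be filled from values that were just imputed.
            for k in range(i, j):
                if k >= season:
                    out[k] = out[k - season]
        i = j
    return out
```

For hourly data the defaults give an 8-sample interpolation limit and a 168-sample season, so a 12-hour gap on a Wednesday afternoon is filled with the readings from the previous Wednesday afternoon.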
Time series feature extraction is a dimensionality reduction technique that finds common characteristics in the data and provides a more manageable and representative subset of variables. It plays a predominant role in the data processing. Essentially, feature extraction translates each load profile, regardless of its length or range of dates, into a
reduced set of meaningful values, the so-called features, thus reducing the complexity of any subsequent processing. In total, 3,179 features are extracted from each time series,
which can be categorised into seven types.
First are the basic statistics, which include statistical moments (mean, variance, skewness, kurtosis); quartiles and deciles; outlier-related statistics, such as the interquartile range; and the sum of all the values of the time series.
Second are the so-called seasonal aggregates, which form the largest group of features. These features are obtained by splitting the time series into subsets and then calculating summary statistics for each. Fourteen groups of subsets are defined, each subset corresponding to a particular time band (e.g. hour intervals, days of the week, months, and more complex combinations of those). A subset comprises all samples of the time series whose date and time fall within its time band. The summary statistics computed for each subset are the mean, the standard deviation, and (in some groups) the sum.
Third are the peak and off-peak time bands. These features are obtained by splitting the time series into subsets and then adding all samples of each subset together. Eight groups of subsets are defined, and the time band containing the maximum value is identified as the peak time band. Similarly, the time band with the minimum value is identified as the off-peak time band.
Fourth are the lag k-day autocorrelations, that is, the correlations between values that are k time periods apart. The selected values of k span from 1 to 28 days.
Fifth are the load factors, typical of electrical system analysis. The load factor is the average load divided by the peak load in a specified time period (days, weeks, and years).
The last two groups are computed using existing software packages: the tsfeatures R package, which includes 64 features such as STL decompositions, autocorrelation coefficients, seasonal strengths, entropies, and other values resulting from different analyses; and the catch22 package, which includes 22 features whose selection was based on their successful classification performance.
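To make the second to fifth feature types more concrete, the sketch below computes one seasonal aggregate group, its peak and off-peak bands, a lag autocorrelation, and a load factor for a single load profile. It is an illustration under stated assumptions, not the Deliverable's implementation: the function names, the `band_of` callback, and the use of the population standard deviation are all hypothetical choices, and the actual fourteen and eight groups of time bands are not reproduced here.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean, pstdev
from typing import Callable, Dict, Hashable, List, Sequence, Tuple

Sample = Tuple[datetime, float]  # (local timestamp, consumed energy)


def seasonal_aggregates(
    samples: Sequence[Sample], band_of: Callable[[datetime], Hashable]
) -> Dict[Hashable, Dict[str, float]]:
    """Mean, standard deviation and sum of the samples in each time band.

    `band_of` maps a timestamp to its time band, e.g. an hour interval.
    The population standard deviation is an assumption of this sketch.
    """
    groups: Dict[Hashable, List[float]] = defaultdict(list)
    for timestamp, value in samples:
        groups[band_of(timestamp)].append(value)
    return {
        band: {"mean": mean(vals), "sd": pstdev(vals), "sum": sum(vals)}
        for band, vals in groups.items()
    }


def peak_and_off_peak(
    samples: Sequence[Sample], band_of: Callable[[datetime], Hashable]
) -> Tuple[Hashable, Hashable]:
    """Time bands with the largest and the smallest total consumption."""
    totals = {
        band: stats["sum"]
        for band, stats in seasonal_aggregates(samples, band_of).items()
    }
    return max(totals, key=totals.get), min(totals, key=totals.get)


def autocorrelation(values: Sequence[float], lag: int) -> float:
    """Sample autocorrelation at `lag`, expressed in number of samples."""
    n = len(values)
    mu = mean(values)
    denom = sum((v - mu) ** 2 for v in values)
    if denom == 0 or lag >= n:
        return float("nan")
    num = sum((values[t] - mu) * (values[t + lag] - mu) for t in range(n - lag))
    return num / denom


def load_factor(values: Sequence[float]) -> float:
    """Average load divided by peak load over the given period."""
    peak = max(values)
    return mean(values) / peak if peak else float("nan")
```

With hourly data, a lag of k days corresponds to `lag = 24 * k`, and a grouping such as `lambda ts: ts.hour // 6` (four six-hour blocks) or `lambda ts: ts.weekday()` can be passed as `band_of`. Applying `load_factor` to the samples of one day, one week, or one year gives the load factor for that period.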