Exploring the socioeconomic and geographical impacts on electricity consumption.
In this report, we explore the Power Outages dataset compiled by Sayanti Mukherjee, Roshanak Nateghi, and Makarand Hastak and perform quantitative analysis on a selection of its features. The dataset contains 1536 rows (entries) and 57 columns (features).
The motivation for choosing the power outages dataset stems from our desire to better understand how socioeconomic factors play a role in the manufacturing and maintenance of the United States’ electrical grid. This will help us identify which groups may be at risk of suffering from the increasingly devastating effects of climate change and how utility companies respond to more marginal socioeconomic and geographical groups in the U.S.
We decided to choose a question which would elucidate how utility companies might choose which states to expand into and how socioeconomic/geographical features may affect the prevalence of electricity usage in a state.
The question that this report seeks to answer is the following: How does a state’s total gross state product and the size of its urban clusters, specifically its population proportion in urban clusters and the percent area of the state designated as urban clusters, affect the total residential electricity sales?
To address the research question, the analysis will focus on the following columns from the dataset:
TOTAL.REALGSP
Represents the real or inflation-adjusted gross state product of a state contributed by all industries and sectors in that particular state. This metric is a key indicator of the economic strength of a state, measured in constant 2009 U.S. dollars.
POPPCT_UC
Denotes the percentage of a state’s population residing in urban clusters—defined as areas with populations between 2,500 and 50,000. For example, we may imagine these areas to be small towns, townships, code cities, and villages.
AREAPCT_UC
Measures the percentage of the total area of a state that is designated as part of urban clusters.
RES.SALES
Measures the total electricity consumption in the residential sector of a state, recorded in megawatt-hours.
This analysis is critical for understanding how socioeconomic and geographical variables drive electricity consumption patterns across the United States. By answering the investigative question, the findings can:
By integrating these insights, this report aims to provide actionable recommendations for improving the resilience and equity of the energy sector.
For more details on the dataset, refer to the original publication:
Power Outages Dataset
To ensure our dataset was prepared for meaningful analysis, several data cleaning steps were undertaken. Each step addressed specific issues arising from the data generation process, ensuring that the final dataset was consistent, accurate, and ready for exploration. Below, we describe these steps in detail:
Upon loading the dataset into a DataFrame, the following issues were identified:
The variables column consisted entirely of NaN values across all rows and contributed no useful information.
The OBS column, which served as an outdated index, was redundant for our analysis.
Several columns holding numeric data were not stored as float64.
Action Taken to Clean Data:
Dropped the variables and OBS columns using the .drop() method.
Converted the affected columns to float64 using the convert_to_float() function.
Next, we standardized the units of our features, particularly our monetary features, into more comprehensible terms. We expressed every variable measuring the total GSP of a state or a sector in billions of dollars to make our later plots more legible.
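The cleaning steps above can be sketched roughly as follows. This is a minimal illustration on made-up rows, not the report's actual code: the body of convert_to_float() is a guess at what such a helper might do, and the scale factor assumes the raw GSP values are recorded in millions of dollars.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the raw outages table (all values are made up).
raw = pd.DataFrame({
    "variables": [np.nan, np.nan, np.nan],
    "OBS": [1, 2, 3],
    "U.S._STATE": ["Minnesota", "Texas", "Vermont"],
    "TOTAL.REALGSP": ["274182", "1276984", "27363"],  # read in as strings
    "RES.SALES": ["2332915", "11222212", None],
})


def convert_to_float(df, columns):
    """Hypothetical helper: coerce the given columns to float64."""
    out = df.copy()
    for col in columns:
        out[col] = pd.to_numeric(out[col], errors="coerce").astype("float64")
    return out


# Assumption: the raw GSP figures are in millions of dollars.
GSP_TO_BILLIONS = 1_000.0

# Drop the empty and redundant columns, fix dtypes, then rescale GSP.
cleaned = raw.drop(columns=["variables", "OBS"])
cleaned = convert_to_float(cleaned, ["TOTAL.REALGSP", "RES.SALES"])
cleaned["TOTAL.REALGSP"] = cleaned["TOTAL.REALGSP"] / GSP_TO_BILLIONS
```

Using errors="coerce" turns unparseable entries into NaN rather than raising, which keeps them visible to the later missing-value check.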
Lastly, we checked whether any of the columns relevant to our question (POPPCT_UC, TOTAL.REALGSP, AREAPCT_UC, RES.SALES, etc.) had missing values. The RES.SALES feature, which describes the electricity consumption in the residential sector in megawatt-hours, has 22 missing values.
To resolve the missing values in the RES.SALES column, we used conditional probabilistic imputation. We chose this method because total residential electricity sales differ greatly from state to state due to population and urbanization metrics, so it is necessary to group by state before imputing. Additionally, probabilistic imputation avoids introducing bias around the center of the RES.SALES data. Simpler methods, such as mean imputation, would have created artificial spikes in the data distribution, misrepresenting the variability present in the original dataset. By employing conditional probabilistic imputation, we ensured that the integrity and natural spread of the data were preserved.
To ensure the dataset was suitable for analysis, we examined the RES.SALES feature for missing values across all states. The goal was to verify that each state had some non-missing entries for RES.SALES, enabling valid conditional imputation. To assess this, we created a summary table that displayed the total number of entries for each state and the number of entries with missing RES.SALES values.
The analysis revealed that nearly all states had sufficient valid entries for RES.SALES, confirming that conditional imputation could be performed. However, the state of Alaska presented a unique challenge: it had only one entry in the entire dataset, and the RES.SALES value for this entry was missing. Furthermore, most features for Alaska contained missing values, making its inclusion in the dataset impractical.
Given this, we decided to remove Alaska from the dataset. Since it contributed only a single incomplete data point out of nearly 1,540 total entries, its removal would not significantly impact the dataset’s integrity or future modeling efforts. Including such a sparse datapoint would likely hinder extrapolation and add noise to the analysis.
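A per-state check of this kind, followed by removing any state with no observed RES.SALES values, might look like the sketch below. The rows are made up for illustration, and the column names total_entries and missing_res_sales are our own labels rather than names from the report.

```python
import numpy as np
import pandas as pd

# Toy rows (made up): one state has a single entry with RES.SALES missing.
df = pd.DataFrame({
    "U.S._STATE": ["Texas", "Texas", "Texas", "Vermont", "Vermont", "Alaska"],
    "RES.SALES": [11.2e6, np.nan, 12.1e6, 0.19e6, 0.21e6, np.nan],
})

# Summary table: entries per state and how many of them lack RES.SALES.
summary = df.groupby("U.S._STATE")["RES.SALES"].agg(
    total_entries="size",
    missing_res_sales=lambda s: s.isna().sum(),
)

# States where every entry is missing cannot be imputed conditionally.
no_observed = summary.index[
    summary["total_entries"] == summary["missing_res_sales"]
]
df_kept = df[~df["U.S._STATE"].isin(no_observed)].reset_index(drop=True)
```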
With Alaska excluded, we proceeded with conditional probabilistic imputation for the RES.SALES column. This step ensured that all missing values in this feature were addressed while maintaining the relationships within the data. After this process, the RES.SALES feature no longer contained any missing values, enabling us to move confidently to the next stage of exploratory data analysis.
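One common way to implement conditional probabilistic imputation is to fill each missing value with a random draw from the observed values of the same state. The sketch below assumes that approach; the helper name impute_within_group and the toy rows are ours, and the report's own implementation may differ in detail.

```python
import numpy as np
import pandas as pd


def impute_within_group(values, rng):
    """Fill NaNs by sampling, with replacement, from the group's observed values."""
    values = values.copy()
    missing = values.isna()
    observed = values[~missing].to_numpy()
    if missing.any() and observed.size > 0:
        values[missing] = rng.choice(observed, size=int(missing.sum()))
    return values


rng = np.random.default_rng(0)

# Toy rows (made up); each state keeps at least one observed value.
df = pd.DataFrame({
    "U.S._STATE": ["Texas", "Texas", "Texas", "Vermont", "Vermont", "Vermont"],
    "RES.SALES": [11.2e6, np.nan, 12.1e6, 0.19e6, 0.21e6, np.nan],
})

# Impute state by state so draws never cross state boundaries.
df["RES.SALES"] = df.groupby("U.S._STATE")["RES.SALES"].transform(
    lambda s: impute_within_group(s, rng)
)
```

Because every fill is an actually observed value from the same state, the per-state spread is kept instead of piling values up at the mean.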
To better analyze the data, we grouped the entries by state and calculated the mean values of the relevant columns, creating a set of interesting aggregates. These aggregates include total gross state product (TOTAL.REALGSP), percentage of the population in urban clusters (POPPCT_UC), percentage of the land area represented by urban clusters (AREAPCT_UC), and residential electricity sales (RES.SALES). This grouping provides valuable insights into the relationships between economic, geographic, and energy consumption variables while removing the frequency bias caused by certain states being over- or underrepresented in the dataset. For instance, the state of Texas has considerably more entries than the state of Vermont in the original outages dataset. This would shift the peaks of our relevant distributions towards overrepresented states, which would not give us an accurate understanding of the relationship between these variables. Thus, we perform a groupby() operation on our relevant columns with respect to the state, allowing us to look past the bias introduced by the frequency of entries per state.
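The effect of this grouping can be illustrated on a few made-up rows in which one state appears three times and another once. The numbers are toy values chosen only to make the arithmetic easy to follow.

```python
import pandas as pd

cols = ["TOTAL.REALGSP", "POPPCT_UC", "AREAPCT_UC", "RES.SALES"]

# Toy rows (made up): Texas is reported three times, Vermont once.
df = pd.DataFrame({
    "U.S._STATE": ["Texas", "Texas", "Texas", "Vermont"],
    "TOTAL.REALGSP": [1200.0, 1250.0, 1300.0, 27.0],
    "POPPCT_UC": [8.0, 8.0, 8.0, 22.0],
    "AREAPCT_UC": [0.6, 0.6, 0.6, 0.4],
    "RES.SALES": [11.0e6, 11.5e6, 12.0e6, 0.2e6],
})

# One row per state: each state now carries equal weight.
state_means = df.groupby("U.S._STATE")[cols].mean()

row_level_mean = df["POPPCT_UC"].mean()             # pulled towards Texas
state_level_mean = state_means["POPPCT_UC"].mean()  # each state counted once
```

Here the row-level mean of POPPCT_UC is 11.5, while the mean of the state averages is 15.0, which shows how repeated entries for one state can pull a summary statistic towards it.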
Below is a pivot table displaying the average total gross state product (TOTAL.REALGSP), average percentage of the population in urban clusters (POPPCT_UC), average percentage of the land area represented by urban clusters (AREAPCT_UC), and average residential electricity sales (RES.SALES) for each U.S. state. These interesting aggregates highlight patterns and relationships between economic indicators, geographic factors, and residential electricity usage.
To understand the distribution of the percentage of the population living in urban clusters across U.S. states, we performed a univariate analysis on the POPPCT_UC feature. The goal of this analysis was to examine how urban cluster population percentages vary across states and interpret any patterns or trends that emerge.
We first visualized the raw distribution of the POPPCT_UC values using a box plot. Subsequently, to account for state-level differences, we grouped the data by U.S._STATE and calculated the mean POPPCT_UC for each state. This step helped mitigate any bias introduced by multiple entries for the same state, creating a clearer view of state-level trends.
The box plot displays the distribution of urban cluster population percentages across states. Key observations include:
The right-skewed nature of the data suggests that urban clusters, as a percentage of the total population, are generally less prominent in most states. However, the presence of a few high outliers indicates that certain states are exceptions, potentially due to unique demographic or geographic factors that lead to higher concentrations of populations in urban clusters.
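The direction of skew and the presence of high outliers can also be checked numerically rather than read off the box plot alone. The sketch below uses made-up state-level percentages and the usual 1.5 × IQR box-plot fence; a positive skewness coefficient (and a mean above the median) indicates a longer right tail.

```python
import pandas as pd

# Toy state-level POPPCT_UC values (made up): mostly low, one very high.
state_poppct = pd.Series(
    [5, 6, 7, 8, 9, 10, 11, 12, 35], name="POPPCT_UC", dtype=float
)

# Box-plot fence: points above Q3 + 1.5 * IQR are flagged as high outliers.
q1, q3 = state_poppct.quantile([0.25, 0.75])
iqr = q3 - q1
high_outliers = state_poppct[state_poppct > q3 + 1.5 * iqr]

# Sample skewness: positive means the long tail is on the right.
skewness = state_poppct.skew()
mean_above_median = state_poppct.mean() > state_poppct.median()
```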
This analysis provides valuable context for understanding the role of urbanization patterns in broader societal or economic outcomes. For instance, if the investigative question centers on the relationship between urbanization and resource distribution, these findings highlight the need to consider the uneven distribution of urban populations when formulating policies or conducting further analysis.
To analyze the relationship between the total real GSP (TOTAL.REALGSP) and residential sector sales (RES.SALES), we calculated the average values of these columns for each state. This aggregation was necessary to reduce bias introduced by states with multiple entries in the dataset, ensuring that each state is represented by a single average value for both variables.
We then visualized the relationship between these two columns using a scatter plot. The plot displays the average total real GSP (in billions of dollars) on the x-axis and the average residential sector sales (in megawatt-hours) on the y-axis.
The scatter plot highlights a clear positive linear trend between the two variables:
This analysis partially answers our investigative question by revealing that economic indicators like the total real GSP are positively associated with electricity usage in the residential sector: states with a higher total real GSP tend to have higher residential sector electricity sales. This is consistent with the hypothesis that economic output is a key driver of energy consumption patterns, which could help stakeholders better allocate resources for states with higher energy demands.
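The strength of a linear trend like this can be summarized with a Pearson correlation and a least-squares line fitted to the state averages. The values below are made up for illustration and are not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Toy state averages (made up): GSP in billions of dollars, sales in MWh.
state_means = pd.DataFrame({
    "TOTAL.REALGSP": [30.0, 150.0, 400.0, 900.0, 1300.0],
    "RES.SALES": [0.3e6, 1.4e6, 3.9e6, 8.2e6, 12.5e6],
})

# Pearson correlation between the two state-level averages.
r = state_means["TOTAL.REALGSP"].corr(state_means["RES.SALES"])

# Degree-1 least-squares fit: coefficients come back as (slope, intercept).
slope, intercept = np.polyfit(
    state_means["TOTAL.REALGSP"], state_means["RES.SALES"], deg=1
)
```

The slope reads as the change in average residential sales (MWh) per additional billion dollars of real GSP. It describes an association across states, not a causal effect.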
Upon plotting AREAPCT_UC vs. RES.SALES and POPPCT_UC vs. RES.SALES, we found that the relationships between these pairs of variables could potentially be non-linear.
For instance, in the trend between POPPCT_UC and RES.SALES, residential sales decrease over most of the bivariate distribution, but the rate of decrease itself seems to diminish. Furthermore, in the trend between AREAPCT_UC and RES.SALES, residential sector sales consistently increase with the percent area of a state designated as urban clusters, although the rate of increase appears to decrease slightly. For both relationships, we will likely have to linearize the relationship in order to fully utilize these features in a linear regressor.
The goal of this analysis is to predict the total residential electricity sales (RES.SALES) for a state based on its total gross state product (TOTAL.REALGSP), the proportion of its population in urban clusters (POPPCT_UC), and the percentage of its land area covered by urban clusters (AREAPCT_UC).
This is a regression problem because the response variable, RES.SALES, is a continuous numerical value representing the total residential electricity sales in megawatt-hours (MWh). In particular, we will construct a multiple linear regression model to relate our chosen features to residential electricity sales.
RES.SALES (Total Residential Electricity Sales in Megawatt-Hours)
The features used for prediction were carefully selected based on their relevance and availability at the time of prediction:
Features unavailable at the time of prediction, such as future consumption patterns or external economic changes, were excluded to ensure that the model aligns with realistic use cases. This approach ensures the predictions remain practical and deployable in real-world scenarios.
Our baseline model is a multiple linear regression model designed to predict residential electricity sales (RES.SALES) in megawatt-hours from the total real GSP of a state (TOTAL.REALGSP), the percent of a state’s population residing in urban clusters (POPPCT_UC), and the percent of a state’s total area designated as urban clusters (AREAPCT_UC).
TOTAL.REALGSP (Total Real Gross State Product):
POPPCT_UC (Percent Population in Urban Clusters):
AREAPCT_UC (Percentage of Urban Land Area):
Our baseline model is implemented as a multiple linear regression model, trained using sklearn’s Pipeline. The baseline applies no transformations, so we used the Pipeline purely as a matter of good practice; if transformations are added later, the Pipeline will streamline the preprocessing and modeling process. All features were numerical, so no additional encoding for categorical or ordinal variables was required.
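A baseline of this shape can be sketched as below. The data are synthetic, and the train/test split, its 25% test size, the random seeds, and the step name lin_reg are assumptions made for the example, since the report does not spell them out here.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Synthetic stand-in for the three features and the response.
rng = np.random.default_rng(42)
n = 200
X = pd.DataFrame({
    "TOTAL.REALGSP": rng.uniform(30, 2000, n),
    "POPPCT_UC": rng.uniform(2, 35, n),
    "AREAPCT_UC": rng.uniform(0.1, 5, n),
})
y = (
    5_000 * X["TOTAL.REALGSP"]
    - 20_000 * X["POPPCT_UC"]
    + 100_000 * X["AREAPCT_UC"]
    + rng.normal(0, 2e5, n)
)

# Assumed hold-out split; the report's exact evaluation setup may differ.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# Single-step pipeline: no preprocessing yet, only the linear regressor.
baseline = Pipeline([("lin_reg", LinearRegression())])
baseline.fit(X_train, y_train)

mae = mean_absolute_error(y_test, baseline.predict(X_test))
```

Keeping the regressor inside a Pipeline means a transformation step can later be inserted ahead of lin_reg without changing the fit and predict calls.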
Mean Absolute Error (MAE): 743,552.41
Our baseline model leaves considerable room for improvement. While it uses relevant features, the high MAE suggests that it fails to capture critical relationships or additional drivers of electricity consumption. The model’s performance is likely hindered by potential non-linear relationships between features and the response variable. Therefore, to improve the model, we explored non-linear models and additional feature engineering to better capture complex relationships. The baseline model serves as a starting point for further refinement and exploration in our predictive task.
Building on the limitations identified in the baseline model, the final model introduces feature transformations to better capture the underlying non-linear relationships between the predictors and residential electricity sales (RES.SALES). By leveraging transformations and hyperparameter optimization, the final model significantly improves predictive accuracy.
The final model retains the same three features as the baseline model (TOTAL.REALGSP, POPPCT_UC, and AREAPCT_UC) but applies data transformations to improve their representation in the model. These transformations address the non-linear relationships observed in the bivariate analysis between urban cluster features and residential electricity sales (RES.SALES).
TOTAL.REALGSP (Total Real Gross State Product):
The relationship between TOTAL.REALGSP and RES.SALES is largely linear, so this feature is included without transformation to directly model its influence on electricity sales.
POPPCT_UC (Percent Population in Urban Clusters):
In the plot of POPPCT_UC vs. RES.SALES, it appears that the trend in the data is positive and increasing but resembles the upper-left bulge in the Tukey-Mosteller Bulge Diagram. This suggests a non-linear relationship. To capture this behavior, we define a fractional polynomial transformation of the form \(x^b\), where \(b\) is optimized using GridSearchCV over the range \([0.1, 2.0]\) with steps of \(0.1\).
AREAPCT_UC (Urban Cluster Area Percentage):
The plot of AREAPCT_UC vs. RES.SALES shows a decreasing trend with a lower-left bulge, consistent with the Tukey-Mosteller Bulge Diagram. To account for this, we apply a negative exponential transformation of the form \(e^{-a \cdot x}\), where \(a\) is optimized using GridSearchCV over the range \([0.1, 2.0]\) with steps of \(0.1\).
These transformations align with the data-generating process, reflecting the expected non-linear relationships between geographic, population, and economic factors and residential electricity consumption.
The final model is implemented as a multiple linear regression model with feature transformations applied through FunctionTransformer() within a scikit-learn ColumnTransformer. To fine-tune the transformations, we employed GridSearchCV to optimize two hyperparameters:
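Rate parameter (a) for the transformation of AREAPCT_UC: values in \([0.1, 2.0]\) in steps of \(0.1\).
Exponent (b) for the transformation of POPPCT_UC: values in \([0.1, 2.0]\) in steps of \(0.1\).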
The GridSearchCV framework performed 4-fold cross-validation to evaluate all combinations of the hyperparameters, using mean absolute error (MAE) as the scoring metric. This ensured that the selected parameters minimized prediction errors on unseen data.
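One way to wire such a search together is sketched below. It assumes the transformation parameters are passed through FunctionTransformer's kw_args and tuned as whole dictionaries; the step and transformer names, the synthetic data, and the coarse STEP of 0.5 (used only to keep the example quick) are our own choices, not the report's.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer


def power(x, b=1.0):
    """Fractional polynomial transform x**b."""
    return np.power(x, b)


def neg_exp(x, a=1.0):
    """Negative exponential transform exp(-a * x)."""
    return np.exp(-a * x)


preprocess = ColumnTransformer(
    transformers=[
        ("pop_power", FunctionTransformer(power, kw_args={"b": 1.0}), ["POPPCT_UC"]),
        ("area_exp", FunctionTransformer(neg_exp, kw_args={"a": 1.0}), ["AREAPCT_UC"]),
    ],
    remainder="passthrough",  # TOTAL.REALGSP enters the regression untransformed
)
model = Pipeline([("preprocess", preprocess), ("lin_reg", LinearRegression())])

# Candidate values in [0.1, 2.0]; set STEP = 0.1 to match the grid in the text.
STEP = 0.5
grid_values = np.round(np.arange(0.1, 2.0 + 1e-9, STEP), 1)
param_grid = {
    "preprocess__pop_power__kw_args": [{"b": float(v)} for v in grid_values],
    "preprocess__area_exp__kw_args": [{"a": float(v)} for v in grid_values],
}

# 4-fold CV; scikit-learn maximizes scores, hence the negated MAE scorer.
search = GridSearchCV(model, param_grid, cv=4, scoring="neg_mean_absolute_error")

# Synthetic data, used only to show that the search runs end to end.
rng = np.random.default_rng(0)
n = 120
X = pd.DataFrame({
    "TOTAL.REALGSP": rng.uniform(30, 2000, n),
    "POPPCT_UC": rng.uniform(2, 35, n),
    "AREAPCT_UC": rng.uniform(0.1, 5, n),
})
y = (
    4_000 * X["TOTAL.REALGSP"]
    - 3e5 * X["POPPCT_UC"] ** 0.6
    - 2e6 * np.exp(-1.1 * X["AREAPCT_UC"])
    + rng.normal(0, 1e5, n)
)

search.fit(X, y)
best_mae = -search.best_score_  # flip the sign back to a plain MAE
```

After fitting, search.best_params_ holds the chosen kw_args dictionaries and search.best_estimator_ is the pipeline refit on all rows with those values.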
Mean Absolute Error (MAE): 565,446.19
The final model demonstrates a substantial reduction in MAE compared to the baseline model (743,552.41). With our final model, we achieved a roughly 24% reduction in MAE, a clear improvement in prediction accuracy.
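The relative improvement follows directly from the two reported MAE values, as this short calculation shows:

```python
baseline_mae = 743_552.41  # baseline model MAE reported above
final_mae = 565_446.19     # final model MAE reported above

# Relative reduction = (old - new) / old
reduction = (baseline_mae - final_mae) / baseline_mae
print(f"{reduction:.2%}")  # prints 23.95%
```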
The final model successfully addresses the limitations of the baseline by incorporating domain-specific feature transformations and leveraging hyperparameter optimization. The resulting decrease in mean absolute error confirms the effectiveness of these changes, establishing the final model as a more robust tool for predicting residential electricity sales. While further improvements may involve additional features or non-linear algorithms, this model provides a strong foundation for understanding the relationships between economic, geographic, and population factors in energy consumption.