Table of contents
- U.S. Energy Information Administration
- 1. Import libraries
- 2. Ingest data from EIA (Power Plants)
- 3. Store final dataframe in SQL Server
- 4. Query the data from SQL table
- 5. Save eia_powplt_filt as CSV file
- 6. Read from CSV file
- Conclusion
- Source: EIA [Link], [Link]
- Author: Dheeraj. Yss
This article demonstrates data ingestion and exploration for the Power Plants dataset provided by the U.S. EIA Open Data Portal.
U.S. Energy Information Administration
EIA Open Data
1. Import libraries
import requests
import pandas as pd
import copy
import pyodbc
from sqlalchemy import create_engine, text
from sqlalchemy.engine import URL
from sqlalchemy.orm import sessionmaker
from sqlalchemy.exc import OperationalError
import seaborn as sns
import matplotlib.pyplot as plt
import os
# Set the display option to show all max column width
pd.set_option('display.max_colwidth', None)
# Set the display option to show all columns
pd.set_option('display.max_columns', None)
2. Ingest data from EIA (Power Plants)
These data identify operable electric generating plants in the United States by energy source.
https://developers.arcgis.com/rest/services-reference/enterprise/query-feature-service-layer/
2.1 Get Power Plants data
2.1.1 URL connection setup
resultOffset = 0
resultRecordCount = 2000
def eia_powplt_url(resultOffset):
    # join/split collapses the readable multi-line template into a single URL
    return "".join(f"""
        https://services7.arcgis.com
        /FGr1D95XCGALKXqM/arcgis/rest/services
        /Power_Plants_Testing
        /FeatureServer
        /0
        /query
        ?where=1%3D1
        &outFields=*
        &outSR=4326
        &f=json
        &resultOffset={resultOffset}
        &resultRecordCount={resultRecordCount}
    """.split())
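The `"".join(...split())` idiom removes every run of whitespace, so the readable multi-line template collapses into a valid single-line URL. A minimal sketch of the idiom, using a hypothetical `example.com` endpoint rather than the EIA service:

```python
# a multi-line template with indentation and newlines
template = """
    https://example.com
    /path
    ?a=1
    &b=2
"""

# split() tokenizes on any whitespace; join() concatenates the tokens
url = "".join(template.split())
print(url)  # https://example.com/path?a=1&b=2
```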
2.1.2 Using a GET request to ingest from the URL:
Run the below cell only once
raw_data = []
# Make the initial HTTP request.
response = requests.get(eia_powplt_url(resultOffset))
# Check the status of the request
if response.status_code == 200:
    print("Request was successful:", response.headers['Content-Type'])
    while resultOffset < 12275:
        data = response.json()
        records = data.get("features", [])  # returns an empty list if 'features' is missing or empty
        raw_data.extend(records)
        # Check if there are more records to retrieve
        if len(records) < resultRecordCount or resultOffset > 12275:
            break
        # Update the offset for the next request
        resultOffset += resultRecordCount
        # Make requests until all records are retrieved
        response = requests.get(eia_powplt_url(resultOffset))
        if response.status_code != 200:
            print(f"Request failed with status code: {response.status_code} at resultOffset {resultOffset}")
            break
else:
    print(f"Request failed with status code: {response.status_code} at resultOffset {resultOffset}")
- Inspect the response headers and raw content:
response.headers
response.content
2.1.3 Exploring json response
type(data), len(data)
data
data.keys()
data['exceededTransferLimit']
pd.DataFrame(data=data['fields'])
data['features']
type(data['features']), len(data['features'])
raw_data # all records
json_eia_powplt = copy.deepcopy(raw_data)
type(json_eia_powplt), len(json_eia_powplt)
json_eia_powplt[0]
2.1.4 Destructuring json response
destrc_json_eia_powplt = []
for powplt in json_eia_powplt:
    emtdict = {}
    # each feature holds 'attributes' and 'geometry' sub-dicts; merge them into one flat record
    for data in powplt.values():
        emtdict.update(data)
    destrc_json_eia_powplt.append(emtdict)
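To make the flattening concrete, here is the same merge applied to a single hypothetical feature record (the field names are illustrative, not the actual EIA schema):

```python
# a hypothetical ArcGIS feature: an 'attributes' dict plus a 'geometry' dict
feature = {
    "attributes": {"OBJECTID": 1, "Plant_Name": "Example Plant"},
    "geometry": {"x": -77.03, "y": 38.89},
}

flat = {}
for part in feature.values():  # iterate over the two sub-dicts
    flat.update(part)          # merge their keys into one flat record

print(flat)  # {'OBJECTID': 1, 'Plant_Name': 'Example Plant', 'x': -77.03, 'y': 38.89}
```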
2.1.5 Use pandas DataFrame to convert the JSON data:
eia_powplt = pd.DataFrame(data=destrc_json_eia_powplt)
eia_powplt.dtypes, eia_powplt.shape
eia_powplt
2.1.6 Transformations
eia_powplt_filt = eia_powplt.copy()
2.1.6.1 Check for duplicates
eia_powplt.merge(
    eia_powplt.duplicated().to_frame().rename(columns={0: "Isduplicate"}),
    left_index=True,
    right_index=True
).query("Isduplicate == True")
eia_powplt_filt.loc[
eia_powplt_filt['OBJECTID'] == 23578
]
eia_powplt_filt
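If the duplicate check above does flag rows, pandas can remove them with drop_duplicates. A toy sketch on assumed data, keeping the first occurrence of each duplicated row:

```python
import pandas as pd

# toy frame with one exact duplicate row
df = pd.DataFrame({"OBJECTID": [1, 2, 2], "State": ["VA", "MD", "MD"]})

print(df.duplicated().tolist())  # [False, False, True]

# drop exact duplicates, keep the first occurrence, and renumber the index
deduped = df.drop_duplicates(keep="first").reset_index(drop=True)
print(len(deduped))  # 2
```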
3. Store final dataframe in SQL Server
3.1 Configuring Database connection setup
3.1.1 Check the driver
pyodbc.drivers()
3.1.2. Configure the connection string
connection_url = URL.create(
    "mssql+pyodbc",
    username = sql_login_name,
    password = sql_login_password,
    host = server_name,
    port = port_number,
    database = database_name,
    query = {
        "driver": "ODBC Driver 18 for SQL Server",
        "TrustServerCertificate": "yes", # When yes, the transport layer will use SSL to encrypt the channel and bypass walking the certificate chain to validate trust. Useful when using self-signed certificates or when the certificate chain cannot be validated.
        "authentication": "SqlPassword", # use SQL login credentials instead of Windows authentication
    },
)
3.1.3. Create an engine using the create_engine() function, specifying the database URL
engine = create_engine(connection_url, pool_size=1) # pool_size limits the pool to one session; it is a create_engine() argument, not an ODBC connection-string option
3.1.4 Create a session using sessionmaker
- only run this if you are not using pandas read_sql_query or to_sql, i.e. if you want to perform DDL or DML operations:
Session = sessionmaker(bind=engine)
session = Session()
3.2 Read the existing tables in the SQL Server Database
3.2.1 Using Pandas read_sql_query() method - DQL: Select
- first, confirm if the tables already exist in the database
qlist_tables = """
SELECT TOP 10000 *
FROM [dballpurpose].INFORMATION_SCHEMA.TABLES
WHERE TABLE_TYPE IN ('BASE TABLE')
ORDER BY TABLE_NAME ASC
"""
df_var = pd.read_sql_query(qlist_tables,engine)
df_var
table_name = 'Power Plants'
qlist_tables = f"""
SELECT TOP 10000 *
FROM [dballpurpose].INFORMATION_SCHEMA.TABLES
WHERE TABLE_TYPE IN ('BASE TABLE')
AND TABLE_NAME = '{table_name}'
ORDER BY TABLE_NAME ASC
"""
df_var = pd.read_sql_query(qlist_tables,engine)
if df_var.empty:
    print(f"Table [{table_name}] does not exist")
else:
    print(f"Table [{table_name}] exists")
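An alternative, database-agnostic existence check uses SQLAlchemy's inspector instead of querying INFORMATION_SCHEMA directly. Sketched here against an in-memory SQLite engine rather than the SQL Server instance above:

```python
from sqlalchemy import create_engine, inspect, text

# stand-in engine; with SQL Server you would reuse the engine built earlier
engine = create_engine("sqlite:///:memory:")
with engine.begin() as conn:
    conn.execute(text('CREATE TABLE "Power Plants" (OBJECTID INTEGER)'))

# has_table() works the same way across backends
print(inspect(engine).has_table("Power Plants"))  # True
print(inspect(engine).has_table("Missing Table"))  # False
```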
3.2.2 Using sqlalchemy Session() method - DDL: DROP
- Drop the table if and only if it exists
Caution: Below is a DROP SQL statement
# Define the DROP SQL statement
table_name = 'Power Plants'
qdrp_table = f"""
IF OBJECT_ID(N'[dbo].[{table_name}]', N'U') IS NOT NULL
BEGIN
DROP TABLE [dballpurpose].[dbo].[{table_name}]
END
"""
try:
    # Execute the SQL statement using the session's execute() method
    session.execute(text(qdrp_table))
    # Commit the changes
    session.commit()
    print(f"{table_name} dropped successfully!")
except OperationalError as e:
    # Roll back on failure
    session.rollback()
    print(f"An error occurred: {str(e)}")
finally:
    # Close the session
    session.close()
3.3 Send the ingested data in dataframes to SQL Server tables
3.3.1 Using Pandas to_sql() method - DDL: Create
Run the below cell only once
table_name = 'Power Plants'
eia_powplt_filt.to_sql(table_name, engine, if_exists='replace', index=False)
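With if_exists='replace', to_sql drops and recreates the table on every call rather than appending, which is why the cell only needs to run once. A sketch of that behavior against an in-memory SQLite engine (assumed stand-in for the SQL Server engine above):

```python
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("sqlite:///:memory:")
df = pd.DataFrame({"Plant_Name": ["A", "B"], "Total_MW": [10.5, 20.0]})

df.to_sql("Power Plants", engine, if_exists="replace", index=False)
# running the same call again replaces the table instead of appending rows
df.to_sql("Power Plants", engine, if_exists="replace", index=False)

back = pd.read_sql('SELECT * FROM "Power Plants"', engine)
print(len(back))  # 2, not 4
```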
4. Query the data from SQL table
- Read from your database Using Pandas read_sql_query() method - DQL: Select
table_name = 'Power Plants'
sql_string = f"""
SELECT TOP 100
*
FROM [dballpurpose].[dbo].[{table_name}]
"""
df_var = pd.read_sql(sql_string, engine)
df_var
- Viz
df_var = df_var.groupby(['State', 'sector_name'], as_index=False).agg({'Total_MW': 'mean'}) # as_index=False keeps State and sector_name as columns so seaborn can use them
fig, axs = plt.subplots(1, 1, figsize=(25, 8))
sns.set_theme(style="darkgrid")
sns.barplot(
data=df_var,
x= 'State',
y = 'Total_MW',
hue = 'sector_name',
palette = 'Pastel1',
ax=axs
)
# Get the current tick positions and labels
tick_positions = axs.get_xticks()
tick_labels = [label.get_text() for label in axs.get_xticklabels()]
# Set the tick positions and labels with rotation and Rotate x-axis labels by 90 degrees
axs.set_xticks(tick_positions)
axs.set_xticklabels(labels=tick_labels, rotation=90)
plt.title(label='Mean Total_MW of Power Plants by Sector Name', loc='center')
plt.show()
5. Save eia_powplt_filt as CSV file
5.1 Check if the file exists
os.getcwd()
# Define the file path
path = r'C:\Users\HP\Downloads\\'
file_name = 'Power Plants'
os.chdir(f'{path}')
csv_files = [file for file in os.listdir() if file.endswith('.csv')]
if any(file_name in file for file in csv_files):
    print(f'Found a file with {file_name} in the name.')
else:
    print(f'No {file_name} in any file name.')
5.2 Remove the file if it exists
Caution: Below is a remove file statement
# Define the file path
path = r'C:\Users\HP\Downloads\\'
file_name = 'Power Plants'
file_path = f"{path}{file_name}.csv"
try:
    # Check if the file exists
    if os.path.exists(file_path):
        # Delete the file
        os.remove(file_path)
        print(f"File {file_name} deleted successfully!")
    else:
        print(f"File {file_name} does not exist.")
except Exception as e:
    # Handle any exceptions that occur during file deletion
    print("An error occurred:", str(e))
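Since Python 3.8, pathlib can express the same delete-if-exists in a single call via unlink(missing_ok=True). A sketch using a temporary directory as a stand-in for the Downloads path:

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    file_path = Path(tmp) / "Power Plants.csv"
    file_path.write_text("col\n1\n")  # create a small stand-in file

    file_path.unlink(missing_ok=True)  # deletes the file if present
    file_path.unlink(missing_ok=True)  # no error when it is already gone
    print(file_path.exists())  # False
```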
5.3 Using pandas to_csv() to save dataframe as CSV file
Run the below cell only once
# Define the file path
path = r'C:\Users\HP\Downloads\\'
file_name = 'Power Plants'
file_path = f"{path}{file_name}.csv"
try:
    eia_powplt_filt.to_csv(file_path, index=False)
    print(f'CSV file {file_name}.csv is successfully saved')
except Exception as e:
    print(f"An error occurred while saving the CSV file: {e}")
6. Read from CSV file
6.1 Using pandas
# Define the file path
path = r'C:\Users\HP\Downloads\\'
file_name = 'Power Plants'
file_path = f"{path}{file_name}.csv"
try:
    df_var = pd.read_csv(filepath_or_buffer=file_path, delimiter=',')
    print(f'CSV file {file_name} is successfully read')
except Exception as e:
    print(f"An error occurred while reading the CSV file: {e}")
df_var
Conclusion
Learning objectives:
Python & Pandas: Import libraries and use Pandas for data manipulation and analysis.
Database Connectivity: Configure database connections and create engines with SQLAlchemy in Python.
Data Ingestion: Ingest data from the EIA Open Data Portal using Python requests and Pandas functions.
SQL Operations: Perform CRUD operations and query data from SQL Server using Python.
Data Visualization: Visualize data using Python libraries such as Matplotlib and Seaborn for insightful charts and graphs.