Table of contents
- United States Department of Agriculture (USDA)
- 1. Import libraries
- 2. Ingest records of wheat commodities from the USDA NASS Quick Stats API
- 2.1 USDA NASS connection setup
- 2.2 Using a GET request to ingest from the URL
- 2.3 Destructuring the JSON response
- 2.4 Use pandas DataFrame to convert the JSON data:
- 2.5 Exploring empty and unused columns
- 2.6 Dropping columns
- 2.7 Transforming a column
- 2.8 Sorting the filtered DataFrame
- 2.9 Reordering the columns
- 3. Configuring Database connection setup
- 4. Read the existing tables in the SQL Server Database
- 5. Send the ingested data in dataframes to SQL Server tables
- 6. Query the data from the SQL table
- Conclusion
- Source: USDA NASS [Link]
- Author: Dheeraj. Yss
- Connect with me:
This article shows how to ingest and transform wheat data obtained from the USDA NASS Quick Stats API.
United States Department of Agriculture (USDA)
National Agricultural Statistics Service (NASS)
1. Import libraries
import requests
import sqlalchemy, pyodbc, os
import pandas as pd
from sqlalchemy import create_engine, text
from sqlalchemy.engine import URL
from sqlalchemy.orm import sessionmaker
from sqlalchemy.exc import OperationalError
from IPython.display import Markdown, display
import seaborn as sns
import matplotlib.pyplot as plt
# Set the display option to show full column width
pd.set_option('display.max_colwidth', None)
# Set the display option to show all columns
pd.set_option('display.max_columns', None)
2. Ingest records of wheat commodities from the USDA NASS Quick Stats API
- The API returns a set of records for the WHEAT commodity with the corresponding commodity data:
2.1 USDA NASS connection setup
source_desc = 'CENSUS' #Program
sector_desc = 'CROPS' #Sector
group_desc = 'FIELD CROPS' #Group
commodity_desc = 'WHEAT' #Commodity
statisticcat_desc = 'PRODUCTION' #Category
agg_level_desc = 'STATE' #Geographic Level
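- Note: API_Key must already be defined when the URL is built below. A minimal sketch, assuming the key is kept in an environment variable (NASS_API_KEY is a hypothetical name; keys are issued at https://quickstats.nass.usda.gov/api):
API_Key = os.environ['NASS_API_KEY'] # hypothetical variable name; keeps the key out of the notebook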
cmdt_code_url = (
f"https://quickstats.nass.usda.gov/api/api_GET/?"
f"key={API_Key}"
f"&source_desc={source_desc}"
f"§or_desc={sector_desc}"
f"&group_desc={group_desc}"
f"&commodity_desc={commodity_desc}"
f"&statisticcat_desc={statisticcat_desc}"
f"&agg_level_desc={agg_level_desc}"
"&format=json"
)
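- Equivalently, requests can assemble and URL-encode the query string itself via its params argument; a sketch using the same parameter names:
params = {
    "key": API_Key,
    "source_desc": source_desc,
    "sector_desc": sector_desc,
    "group_desc": group_desc,
    "commodity_desc": commodity_desc,
    "statisticcat_desc": statisticcat_desc,
    "agg_level_desc": agg_level_desc,
    "format": "json",
}
response = requests.get("https://quickstats.nass.usda.gov/api/api_GET/", params=params)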
2.2 Using a GET request to ingest from the URL
# Make the HTTP request.
response = requests.get(cmdt_code_url)
# Check the status of the request
if response.status_code == 200:
    json_whtcmdt = response.json()
    print("Request was successful.", response.headers['Content-Type'])
    display(Markdown("- headers"))
    print(response.headers)
    display(Markdown("- contents"))
    print(response.content)
else:
    print(f"Request failed with status code: {response.status_code}")
2.3 Destructuring the JSON response
# The records live under the 'data' key of the response
raw_data = json_whtcmdt['data']
type(raw_data) # list
raw_data[0] # inspect the first record
2.4 Use pandas DataFrame to convert the JSON data:
whtcmdt = pd.DataFrame(data=raw_data)
whtcmdt.dtypes, whtcmdt.shape
- NASS may suppress individual table cells by explicitly replacing the cell value with an indicator identifying a suppression. NASS uses a (D) in place of the actual number in the affected cells to indicate the value is withheld to avoid disclosure of an individual operation (USDA NASS Data Lab Handbook).
whtcmdt
- Filter out rows whose Value contains the suppression indicator (D):
whtcmdt[~whtcmdt['Value'].str.contains('D')]
- Group by year to count records per census year:
whtcmdt.groupby('year').count().sort_index()
2.5 Exploring empty and unused columns
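- A minimal sketch of this exploration, assuming we look for columns that are entirely empty or hold a single constant value, since those carry no information for this extract:
# Count non-empty entries per column; columns with 0 are drop candidates
whtcmdt.replace('', pd.NA).notna().sum().sort_values()
# Columns with a single unique value add nothing either
whtcmdt.nunique().sort_values()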
2.6 Dropping columns
cols_to_drop = ['end_code', 'county_ansi', 'watershed_desc', 'region_desc', 'zip_5', 'asd_desc', 'county_code', 'country_code', 'congr_district_code', 'asd_code', 'watershed_code', 'state_fips_code', 'week_ending', 'county_name', 'begin_code']
whtcmdt_filter = whtcmdt.drop(columns=cols_to_drop)
2.7 Transforming a column
- To drop the rows in the DataFrame whose Value column contains the suppression indicator (D), use the str.contains() method with the ~ operator to negate the condition; then strip the thousands separators so Value can later be cast to a numeric type.
whtcmdt_filter = whtcmdt_filter[~whtcmdt_filter['Value'].str.contains('D')]
whtcmdt_filter['Value'] = whtcmdt_filter['Value'].str.replace(',', '') # remove thousands separators
whtcmdt_filter
2.8 Sorting the filtered DataFrame
whtcmdt_filter.sort_values(by=['state_ansi', 'year'], ascending=[True, False], inplace=True) #.query("state_ansi == '01' & year == 2022")
whtcmdt_filter.reset_index(drop=True, inplace=True)
2.9 Reordering the columns
whtcmdt_filter = whtcmdt_filter[['source_desc', 'sector_desc', 'group_desc', 'commodity_desc','class_desc', 'prodn_practice_desc', 'util_practice_desc', 'statisticcat_desc', 'short_desc', 'domain_desc', 'domaincat_desc', 'agg_level_desc', 'state_ansi', 'state_alpha', 'state_name', 'location_desc', 'country_name', 'year', 'freq_desc', 'reference_period_desc', 'load_time', 'Value', 'unit_desc', 'CV (%)']]
whtcmdt_filter
- Changing the data type:
whtcmdt_filter['Value'] = whtcmdt_filter['Value'].astype(int)
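- Note: depending on the platform, astype(int) can yield a 32-bit integer column, which to_sql() maps to SQL Server's INT; as the error in section 6.2 shows, AVG over INT can then overflow. Casting to pandas' nullable Int64 instead produces a BIGINT column and avoids the error (this is the fix noted at the end of the article):
whtcmdt_filter['Value'] = whtcmdt_filter['Value'].astype('Int64')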
3. Configuring Database connection setup
3.1. Check the driver
pyodbc.drivers()
3.2. Configure the connection string
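- The connection parameters below (server_name, port_number, database_name, sql_login_name, sql_login_password) are assumed to be defined earlier; a minimal sketch loading them from environment variables (the variable names are hypothetical):
server_name = os.environ['SQL_SERVER_HOST']
port_number = os.environ.get('SQL_SERVER_PORT', '1433')
database_name = 'dballpurpose' # the database queried later in this article
sql_login_name = os.environ['SQL_LOGIN_NAME']
sql_login_password = os.environ['SQL_LOGIN_PASSWORD']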
connection_url = URL.create(
    "mssql+pyodbc",
    username=sql_login_name,
    password=sql_login_password,
    host=server_name,
    port=port_number,
    database=database_name,
    query={
        "driver": "ODBC Driver 18 for SQL Server",
        "TrustServerCertificate": "yes", # When yes, the transport layer will use SSL to encrypt the channel and bypass walking the certificate chain to validate trust. Useful when using self-signed certificates or when the certificate chain cannot be validated.
        "authentication": "SqlPassword", # use SQL login credentials instead of Windows authentication.
    },
)
3.3. Create an engine using the create_engine() function, specifying the database URL
engine = create_engine(connection_url, pool_size=1) # pool_size is a create_engine() argument, not a connection-string attribute; it limits the pool to one session
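- A quick connectivity check using the text() construct and OperationalError imported in section 1; a minimal sketch:
try:
    with engine.connect() as conn:
        print(conn.execute(text("SELECT @@VERSION")).scalar())
except OperationalError as e:
    print(f"Connection failed: {e}")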
4. Read the existing tables in the SQL Server Database
4.1 Using Pandas read_sql_query() method - DQL: Select
- first, confirm if the tables already exist in the database
qlist_tables = """
SELECT TOP 10000 *
FROM [dballpurpose].INFORMATION_SCHEMA.TABLES
WHERE TABLE_TYPE IN ('BASE TABLE')
ORDER BY TABLE_NAME ASC
"""
df_var = pd.read_sql_query(qlist_tables,engine)
df_var
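- To test for one specific table rather than listing them all, the catalog view can be filtered by name; a sketch using the table created in section 5:
qcheck_table = """
SELECT COUNT(*) AS table_exists
FROM [dballpurpose].INFORMATION_SCHEMA.TABLES
WHERE TABLE_TYPE = 'BASE TABLE' AND TABLE_NAME = 'us wheat production'
"""
pd.read_sql_query(qcheck_table, engine)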
5. Send the ingested data in dataframes to SQL Server tables
5.1 Using Pandas to_sql() method - DDL: Create
whtcmdt_filter.to_sql('us wheat production', engine, if_exists='replace', index=False)
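- Note that if_exists='replace' drops and recreates the table on every run. To confirm the load, count the rows that landed; a sketch:
pd.read_sql_query("SELECT COUNT(*) AS row_count FROM [dballpurpose].[dbo].[us wheat production]", engine)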
6. Query the data from the SQL table
- Read from your database Using Pandas read_sql_query() method - DQL: Select
6.1 Wheat production by states
sql_string = """
SELECT TOP 5
*
FROM [dballpurpose].[dbo].[us wheat production]
"""
df_var = pd.read_sql(sql_string, engine)
df_var
- Viz
fig, axs = plt.subplots(1, 1, figsize=(25, 8))
sns.set_theme(style="darkgrid")
sns.barplot(
    data=df_var,
    x='state_name',
    y='Value',
    hue='commodity_desc',
    palette='Pastel1',
    ax=axs
)
# Get the current tick positions and labels
tick_positions = axs.get_xticks()
tick_labels = [label.get_text() for label in axs.get_xticklabels()]
# Set the tick positions and labels with rotation and Rotate x-axis labels by 90 degrees
axs.set_xticks(tick_positions)
axs.set_xticklabels(labels=tick_labels, rotation=90)
# Get the current tick positions and labels
tick_positions = axs.get_yticks()
tick_labels = [f'{int(tick/1e6)} M' for tick in tick_positions] # Custom labels in millions
# Set the tick positions and labels
axs.set_yticks(tick_positions)
axs.set_yticklabels(labels=tick_labels)
plt.title(label='Wheat production by state', loc='center')
plt.ylabel('Production value (BU)')
plt.show()
6.2 State-wise wheat production
sql_string = """
SELECT
'WHEAT' AS [commodity_desc],
location_desc,
year,
AVG(Value) AS [Avg area harvested],
'BU' AS [unit desc]
FROM [dballpurpose].[dbo].[us wheat production]
GROUP BY location_desc, year
ORDER BY location_desc ASC, year DESC
"""
df_var = pd.read_sql(sql_string, engine)
df_var
- Viz
states = df_var.groupby('location_desc').count().reset_index()['location_desc'].tolist()
# Create a 10x5 grid of subplots without a shared x- or y-axis, and set the figure size to 40x30
fig, axs = plt.subplots(10, 5, figsize=(40, 30), sharex=False, sharey=False)
sns.set_theme(style="darkgrid")
# Flatten the array of axes (subplots) for easier iteration
axs = axs.flatten()
# Initialize a counter variable
i = 0
# Loop over each state in the states list
for state in states:
    data = df_var.query(f"location_desc == '{state}'")
    # Plot the production data for the current state on the i-th subplot
    sns.barplot(
        data=data,
        x='year',
        y='Avg production',
        ax=axs[i]
    )
    # Get the current tick positions and labels
    tick_positions = axs[i].get_yticks()
    tick_labels = [f'{tick/1e6:.1f} M' for tick in tick_positions] # Custom labels in millions
    # Set the tick positions and labels
    axs[i].set_yticks(tick_positions)
    axs[i].set_yticklabels(labels=tick_labels)
    # Set the title of the i-th subplot to the current state
    axs[i].set_title(state)
    axs[i].set_ylabel('')
    # Increment the counter variable
    i += 1
# Remove any empty subplots
for j in range(i, len(axs)):
    fig.delaxes(axs[j])
axs[25].set_ylabel('Avg production (BU)')
# Adjust the padding between and around the subplots
plt.tight_layout()
# Set the title
fig.suptitle('State wise Wheat production', fontsize=14, fontweight='bold')
# Add a legend to the figure with the labels
fig.legend(labels = ['Wheat production trend in BU'])
# Display the figure with its subplots
plt.show()
Error:
Arithmetic overflow error converting expression to data type int.
- Running this query first raised the error above, because SQL Server's AVG over an INT column must itself fit in the INT range; casting Value as Int64 (stored as BIGINT) before loading resolves it.
Conclusion
In this article, we ingested wheat production records from the USDA NASS Quick Stats API, cleaned and transformed them with pandas, loaded them into a SQL Server table with SQLAlchemy, and queried and visualized the results.
Source: USDA NASS [Link]
Author: Dheeraj. Yss
Connect with me: