Ingest & transform wheat data from USDA NASS


This article describes how to ingest and transform wheat data obtained from USDA NASS.

United States Department of Agriculture (USDA)

National Agricultural Statistics Service (NASS)

1. Import libraries

import requests
import sqlalchemy, pyodbc, os
import pandas as pd
from sqlalchemy import create_engine, text
from sqlalchemy.engine import URL
from sqlalchemy.orm import sessionmaker
from sqlalchemy.exc import OperationalError
from IPython.display import Markdown, display
import seaborn as sns
import matplotlib.pyplot as plt

# Set the display option to show all max column width
pd.set_option('display.max_colwidth', None)

# Set the display option to show all columns
pd.set_option('display.max_columns', None)

2. Ingest records of wheat commodity data from USDA NASS

  • Returns a set of records for the WHEAT commodity, with the corresponding commodity data:

2.1 USDA NASS connection setup

# API_Key is assumed to be defined earlier with your Quick Stats API key
source_desc = 'CENSUS' #Program
sector_desc = 'CROPS' #Sector
group_desc = 'FIELD CROPS' #Group
commodity_desc = 'WHEAT' #Commodity
statisticcat_desc = 'PRODUCTION' #Category
agg_level_desc = 'STATE' #Geographic Level

cmdt_code_url = (
    f"https://quickstats.nass.usda.gov/api/api_GET/?"
    f"key={API_Key}"
    f"&source_desc={source_desc}"
    f"&sector_desc={sector_desc}"
    f"&group_desc={group_desc}"
    f"&commodity_desc={commodity_desc}"
    f"&statisticcat_desc={statisticcat_desc}"
    f"&agg_level_desc={agg_level_desc}"
    "&format=json"
)
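The hand-built query string above works, but the same request can also be expressed as a params dict and let the standard library do the URL encoding (spaces in values like FIELD CROPS are escaped automatically). A minimal sketch, with 'YOUR_API_KEY' as a placeholder for the real key:

```python
from urllib.parse import urlencode

# Same query parameters as above, expressed as a dict
params = {
    "key": "YOUR_API_KEY",  # placeholder for API_Key
    "source_desc": "CENSUS",
    "sector_desc": "CROPS",
    "group_desc": "FIELD CROPS",
    "commodity_desc": "WHEAT",
    "statisticcat_desc": "PRODUCTION",
    "agg_level_desc": "STATE",
    "format": "json",
}

# urlencode percent-escapes values (spaces become '+')
url = "https://quickstats.nass.usda.gov/api/api_GET/?" + urlencode(params)
print(url)
```

The same dict can be passed directly to `requests.get(base_url, params=params)`, which encodes it the same way.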

2.2 Using a GET request to ingest from the URL:

# Make the HTTP request.
response = requests.get(cmdt_code_url)

# Check the status of the request
if response.status_code == 200:
    json_whtcmdt = response.json()
    print("Request was successful.",response.headers['Content-Type'])
    display(Markdown("- headers"))
    print(response.headers)
    display(Markdown("- contents"))
    print(response.content)
else:
    print(f"Request failed with status code: {response.status_code}")

2.3 Destructuring the JSON response

raw_data = json_whtcmdt['data']
type(raw_data)
raw_data[0]
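For reference, the payload is a JSON object whose 'data' key holds a list of record dicts. A toy example shaped like the real response (the field values here are illustrative, not actual NASS figures):

```python
# Toy payload mimicking the Quick Stats response structure
json_whtcmdt = {
    "data": [
        {"year": "2022", "state_name": "KANSAS", "Value": "244,200,000"},
        {"year": "2022", "state_name": "ALABAMA", "Value": "(D)"},
    ]
}

# 'data' destructures into a list of dicts, one per record
raw_data = json_whtcmdt["data"]
print(type(raw_data).__name__)    # list
print(raw_data[0]["state_name"])  # KANSAS
```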

2.4 Use pandas DataFrame to convert the JSON data:

whtcmdt = pd.DataFrame(data=raw_data)
whtcmdt.dtypes, whtcmdt.shape

  • NASS may suppress individual table cells by replacing the cell value with an indicator identifying a suppression. NASS uses a (D) in place of the actual number in the affected cells to indicate the value is withheld to avoid disclosure of an individual operation. USDA NASS Data Lab Handbook

whtcmdt

  • Filter out the rows whose Value contains the suppression indicator (D):

whtcmdt[~whtcmdt['Value'].str.contains('D')]

  • Group by year to see how many records each census year contains:

whtcmdt.groupby('year').count().sort_index()

2.5 Exploring empty and unused columns

..
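A sketch of one way to do this exploration, on a toy frame (the column names are borrowed from the NASS schema; the emptiness pattern shown is illustrative): columns that are entirely empty are the natural candidates for the drop in the next step.

```python
import pandas as pd

# Toy frame: at the STATE aggregation level for CENSUS data, several
# NASS columns come back entirely empty
toy = pd.DataFrame({
    "state_name": ["KANSAS", "ALABAMA"],
    "week_ending": ["", ""],    # empty for annual census data
    "county_name": ["", ""],    # empty at STATE aggregation
    "Value": ["244,200,000", "(D)"],
})

# Treat empty strings as missing, then flag columns that are all-missing
empty_cols = [c for c in toy.columns if toy[c].replace("", pd.NA).isna().all()]
print(empty_cols)  # ['week_ending', 'county_name']
```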

2.6 Dropping columns

cols_to_drop = ['end_code', 'county_ansi', 'watershed_desc', 'region_desc', 'zip_5', 'asd_desc', 'county_code', 'country_code', 'congr_district_code', 'asd_code', 'watershed_code', 'state_fips_code', 'week_ending', 'county_name', 'begin_code']

whtcmdt_filter = whtcmdt.drop(columns=cols_to_drop)

2.7 Transformation on a column

  • To keep only the rows whose Value column does not contain the letter 'D' (the suppression indicator), use the str.contains() method with the ~ operator to negate the condition, then strip the thousands separators:
whtcmdt_filter = whtcmdt_filter[~whtcmdt_filter['Value'].str.contains('D')]

whtcmdt_filter['Value'] = whtcmdt_filter['Value'].str.replace(',', '')

whtcmdt_filter
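Putting the two transformations together, a self-contained sketch on toy values:

```python
import pandas as pd

# Toy 'Value' column: comma-grouped numbers plus a (D) suppression marker
df = pd.DataFrame({"Value": ["1,234,567", "(D)", "89,000"]})

df = df[~df["Value"].str.contains("D")].copy()  # drop suppressed cells
df["Value"] = df["Value"].str.replace(",", "")  # strip thousands separators
print(df["Value"].tolist())  # ['1234567', '89000']
```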

2.8 Sorting the filtered DataFrame

whtcmdt_filter.sort_values(by=['state_ansi', 'year'], ascending=[True, False], inplace=True) #.query("state_ansi == '01' & year == 2022")
whtcmdt_filter.reset_index(drop=True, inplace=True)

2.9 Reordering the columns

whtcmdt_filter = whtcmdt_filter[['source_desc', 'sector_desc', 'group_desc', 'commodity_desc','class_desc', 'prodn_practice_desc', 'util_practice_desc', 'statisticcat_desc', 'short_desc', 'domain_desc', 'domaincat_desc', 'agg_level_desc', 'state_ansi', 'state_alpha', 'state_name', 'location_desc', 'country_name', 'year', 'freq_desc', 'reference_period_desc', 'load_time', 'Value', 'unit_desc', 'CV (%)']]
whtcmdt_filter

  • Changing the data type:
whtcmdt_filter['Value'] = whtcmdt_filter['Value'].astype(int)

3. Configuring the database connection

3.1. Check the driver

pyodbc.drivers()

3.2. Configure the connection string

connection_url = URL.create(
    "mssql+pyodbc",
    username = sql_login_name,
    password = sql_login_password,
    host = server_name,
    port= port_number,
    database = database_name,
    query = {
        "driver": "ODBC Driver 18 for SQL Server",
         "TrustServerCertificate": "yes", # When yes, the transport layer will use SSL to encrypt the channel and bypass walking the certificate chain to validate trust. Useful when using self-signed certificates or when the certificate chain cannot be validated.
        "authentication": "SqlPassword", # use SQL login credentials instead of Windows authentication.
        "pool_size": "1", # to limit the number of sessions to one
    },
)
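To see what URL.create() renders without touching a live server, a sketch with placeholder credentials (the real values come from the variables used above):

```python
from sqlalchemy.engine import URL

# Placeholder credentials, just to inspect the rendered connection URL
u = URL.create(
    "mssql+pyodbc",
    username="sql_login",
    password="s3cret",
    host="localhost",
    port=1433,
    database="dballpurpose",
    query={"driver": "ODBC Driver 18 for SQL Server"},
)

# render_as_string masks the password, so the URL is safe to log
print(u.render_as_string(hide_password=True))
```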

3.3. Create an engine using the create_engine() function, specifying the database URL

engine = create_engine(connection_url)
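A quick way to smoke-test an engine before running real queries is a trivial SELECT. In this sketch SQLite stands in for SQL Server so it runs without a database server; against the connection_url above the pattern is identical:

```python
from sqlalchemy import create_engine, text

# In-memory SQLite stands in for the SQL Server engine here
engine = create_engine("sqlite:///:memory:")

# A trivial round trip confirms the engine can open a connection
with engine.connect() as conn:
    result = conn.execute(text("SELECT 1")).scalar()
print(result)  # 1
```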

4. Read the existing tables in the SQL Server Database

4.1 Using Pandas read_sql_query() method - DQL: Select

  • First, confirm whether the tables already exist in the database:
qlist_tables = """
    SELECT TOP 10000 *
    FROM [dballpurpose].INFORMATION_SCHEMA.TABLES
    WHERE TABLE_TYPE IN ('BASE TABLE')
    ORDER BY TABLE_NAME ASC
"""

df_var = pd.read_sql_query(qlist_tables,engine)
df_var

5. Send the ingested data in dataframes to SQL Server tables

5.1 Using Pandas to_sql() method - DDL: Create

whtcmdt_filter.to_sql('us wheat production', engine, if_exists='replace', index=False)
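The to_sql() round trip can be sketched end-to-end with SQLite standing in for SQL Server (toy data, illustrative values):

```python
import pandas as pd
from sqlalchemy import create_engine

# In-memory SQLite in place of the SQL Server engine
engine = create_engine("sqlite:///:memory:")

toy = pd.DataFrame({"state_name": ["KANSAS"], "Value": [244200000]})
toy.to_sql("us wheat production", engine, if_exists="replace", index=False)

# Read it back; the table name needs quoting because it contains spaces
back = pd.read_sql('SELECT * FROM "us wheat production"', engine)
print(back.shape)  # (1, 2)
```

if_exists='replace' drops and recreates the table on each run, which suits a refresh-style load like this one.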

6. Query the data from the SQL table

  • Read from your database Using Pandas read_sql_query() method - DQL: Select

6.1 Wheat production by states

sql_string = """
  SELECT TOP 5 
    *
  FROM [dballpurpose].[dbo].[us wheat production]
"""

df_var = pd.read_sql(sql_string, engine)
df_var
  • Viz 📉
fig, axs = plt.subplots(1, 1, figsize=(25, 8))

sns.set_theme(style="darkgrid")

sns.barplot(
    data=df_var,
    x= 'state_name',
    y = 'Value',
    hue = 'commodity_desc',
    palette = 'Pastel1',
    ax=axs
)


# Get the current tick positions and labels
tick_positions = axs.get_xticks()
tick_labels = [label.get_text() for label in axs.get_xticklabels()]

# Set the tick positions and labels with rotation and Rotate x-axis labels by 90 degrees
axs.set_xticks(tick_positions)
axs.set_xticklabels(labels=tick_labels, rotation=90)

# Get the current tick positions and labels
tick_positions = axs.get_yticks()
tick_labels = [f'{int(tick/1e6)} M' for tick in tick_positions]  # Custom labels in millions

# Set the tick positions and labels
axs.set_yticks(tick_positions)
axs.set_yticklabels(labels=tick_labels)

plt.title(label='Wheat production by state', loc='center')

plt.ylabel('Production value (BU)')


plt.show()

6.2 State-wise wheat production

sql_string = """
  SELECT 
    'WHEAT' AS [commodity_desc], 
    location_desc, 
    year,
    AVG(Value) AS [Avg area harvested],
    'BU' AS [unit desc]
  FROM [dballpurpose].[dbo].[us wheat production]
  GROUP BY location_desc, year
  ORDER BY location_desc ASC, year DESC
"""

df_var = pd.read_sql(sql_string, engine)
df_var
  • Viz 📉
states = df_var.groupby('location_desc').count().reset_index()['location_desc'].tolist()

# Create a 10x5 grid of subplots without shared axes, and set the figure size to 40x30
fig, axs = plt.subplots(10, 5, figsize = (40,30), sharex = False, sharey= False)

# Flatten the array of axes (subplots) for easier iteration
axs = axs.flatten()

# Initialize a counter variable
i = 0

sns.set_theme(style="darkgrid")

# Loop over each state in the states list
for state in states:
    data = df_var.query(f"location_desc == '{state}'")

    # Plot the wheat production data for the current state on the i-th subplot
    sns.barplot(
        data= data,
        x= 'year',
        y = 'Avg area harvested',
        ax=axs[i]
    )

    # Get the current tick positions and labels
    tick_positions = axs[i].get_yticks()

    tick_labels = [f'{float(tick/1e6)} M' for tick in tick_positions]  # Custom labels in millions

    # Set the tick positions and labels
    axs[i].set_yticks(tick_positions)
    axs[i].set_yticklabels(labels=tick_labels)

    # Set the title of the i-th subplot to the current state
    axs[i].set_title(state)

    axs[i].set_ylabel('')

    # Increment the counter variable
    i += 1

# Remove any empty subplots
for j in range(i, len(axs)):
    fig.delaxes(axs[j])

axs[25].set_ylabel('Avg area harvested (BU)')

# Adjust the padding between and around the subplots
plt.tight_layout()

# Set the title
fig.suptitle('State wise Wheat production', fontsize=14, fontweight='bold')

# Add a legend to the figure with the labels
fig.legend(labels = ['Wheat production trend in BU'])

# Display the figure with its subplots
plt.show()
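As an aside, the groupby-count round trip used above to build the states list can be shortened: Series.unique() preserves order of first appearance and skips the intermediate count frame. A toy sketch:

```python
import pandas as pd

# Toy frame with repeated state rows, as in df_var
df = pd.DataFrame({"location_desc": ["ALABAMA", "ALABAMA", "KANSAS"]})

# unique() deduplicates while keeping first-appearance order
states = df["location_desc"].unique().tolist()
print(states)  # ['ALABAMA', 'KANSAS']
```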

Error:

  • Arithmetic overflow error converting expression to data type int.

    • Fixed by casting Value to Int64, since the aggregated values exceed the range of SQL Server's 4-byte INT.
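A sketch of what the Int64 cast buys (the value below is illustrative): pandas' nullable Int64 dtype is 64-bit, so totals that overflow a 32-bit integer survive, and to_sql() should map it to a BIGINT column rather than a 4-byte INT.

```python
import pandas as pd

# A value larger than any 32-bit integer (2**32)
s = pd.Series([4294967296])

# Nullable 64-bit integer dtype; also tolerates missing values (pd.NA)
v = s.astype("Int64")
print(v.dtype)  # Int64
```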

Conclusion

Learning Objectives,

Source: USDA NASS [Link]

Author: Dheeraj. Yss

Connect with me:
