Ingest data from NYC OpenData FHVS & OPCV


This article demonstrates data ingestion and exploration for two datasets provided by NYC OpenData: For Hire Vehicles (FHVS) and Open Parking and Camera Violations (OPCV).

1. Import libraries

import requests
import sqlalchemy, pyodbc, os
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sqlalchemy import create_engine, text
from sqlalchemy.engine import URL
from sqlalchemy.orm import sessionmaker
from sqlalchemy.exc import OperationalError
from IPython.display import Markdown, display

2. Configure the database connection

i. Check that the ODBC driver is installed

pyodbc.drivers()  # should list 'ODBC Driver 18 for SQL Server'

ii. Configure the connection string

connection_url = URL.create(
    "mssql+pyodbc",
    username=sql_login_name,
    password=sql_login_password,
    host=server_name,
    port=port_number,
    database=database_name,
    query={
        "driver": "ODBC Driver 18 for SQL Server",
        # When yes, the transport layer uses SSL to encrypt the channel and skips
        # walking the certificate chain to validate trust. Useful when using
        # self-signed certificates or when the certificate chain cannot be validated.
        "TrustServerCertificate": "yes",
        # Use SQL login credentials instead of Windows authentication.
        "authentication": "SqlPassword",
    },
)

iii. Create an engine using the create_engine() function, specifying the database URL

# pool_size is an engine option, not an ODBC connection attribute,
# so it belongs here; 1 limits the pool to a single session.
engine = create_engine(connection_url, pool_size=1)
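A quick smoke test confirms the engine actually connects before any ingestion, using the OperationalError import from section 1. The in-memory SQLite URL below is only a stand-in so the sketch runs anywhere; point it at connection_url for the real check.

```python
from sqlalchemy import create_engine, text
from sqlalchemy.exc import OperationalError

# Stand-in engine so this sketch runs without a SQL Server instance;
# replace with create_engine(connection_url) for the real check.
demo_engine = create_engine("sqlite:///:memory:")

try:
    with demo_engine.connect() as conn:
        value = conn.execute(text("SELECT 1")).scalar()
    print("Connection OK:", value)
except OperationalError as exc:
    print("Connection failed:", exc)
```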

3. Ingest For Hire Vehicles (FHVS) data from NYC OpenData

# get data from open data endpoints
limit = 1000000

fhvs_url = (
    f"https://data.cityofnewyork.us/resource/8wbx-tsch.json?"
    f"$limit={limit}"
)

3.1 Using a GET request to ingest from the URL:

Run the cell below only once.

# Make the HTTP request.
response = requests.get(fhvs_url)

# Check the status of the request.
if response.status_code == 200:
    json_fhvs = response.json()
    print("Request was successful.", response.headers['Content-Type'])
else:
    print(f"Request failed with status code: {response.status_code}")

# Inspect the response metadata and raw payload.
response.headers
response.content
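The Socrata API behind NYC OpenData caps each response, so a single request with a large $limit may still truncate the dataset. A common approach is paging with $limit and $offset; the helper below (a hypothetical sketch, not part of the original notebook) just builds the page URLs, which can then each be fetched with requests.get() and concatenated.

```python
def build_page_urls(base_url, page_size, pages):
    """Build Socrata page URLs using $limit/$offset pagination."""
    return [
        f"{base_url}?$limit={page_size}&$offset={page * page_size}"
        for page in range(pages)
    ]

urls = build_page_urls(
    "https://data.cityofnewyork.us/resource/8wbx-tsch.json",
    page_size=50000,
    pages=3,
)
# Each URL can then be fetched and the JSON payloads
# concatenated into one DataFrame with pd.concat().
```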

3.2 Use a pandas DataFrame to convert the JSON data:

fhvs = pd.DataFrame(data=json_fhvs)
fhvs

fhvs.shape
fhvs.dtypes

4. Ingest Open Parking and Camera Violations (OPCV) data from NYC OpenData

# get data from open data endpoints
limit = 1000000

# only gets violations issued on 2023-12-25
violations_url = (
    f"https://data.cityofnewyork.us/resource/nc67-uf89.json?"
    f"issue_date=12/25/2023"
    f"&$limit={limit}"
)

4.1 Using pandas read_json() to ingest from the URL:

Run the cell below only once.

violations = pd.read_json(violations_url)

violations['issue_date'] = pd.to_datetime(violations['issue_date'])
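Since issue_date arrives as MM/DD/YYYY strings, passing an explicit format to to_datetime avoids day/month ambiguity and silent misparses. A minimal illustration on sample values:

```python
import pandas as pd

sample_dates = pd.Series(["12/25/2023", "01/05/2023"])

# An explicit format prevents pandas from guessing day-first vs month-first.
parsed = pd.to_datetime(sample_dates, format="%m/%d/%Y")
print(parsed.dt.month.tolist())  # [12, 1]
```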

4.1.1 Transforming the dataframe by changing the column data type

violations['summons_image'] = violations['summons_image'].astype(str)
violations.dtypes

The error faced during to_sql() occurs because the summons_image column holds a dict object in each row.

Solution:

violations['summons_image'] = violations['summons_image'].astype(str)

4.1.2 Transforming the dataframe by extracting a column's contents

The following steps are performed:

  • Create a copy of the 'violations' DataFrame

  • Extract values from 'summons_image' column and assign them to new columns

  • Remove 'summons_image' column from the DataFrame

violations_bronze = violations.copy()
violations_bronze[['summons_image_url', 'summons_image_description']] = violations_bronze['summons_image'].str.extract("url': '(.*?)', 'description': '(.*?)'")
violations_bronze = violations_bronze.drop('summons_image', axis=1)

violations_bronze
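The regex above parses the string representation of the dict, which is brittle if key order or quoting changes. If the raw (pre-astype) column still holds dicts, pd.json_normalize extracts the same fields directly. A sketch on hypothetical sample rows:

```python
import pandas as pd

# Hypothetical rows mirroring the raw 'summons_image' payload.
sample = pd.DataFrame({
    "summons_number": [1, 2],
    "summons_image": [
        {"url": "https://example.com/a", "description": "View Summons"},
        {"url": "https://example.com/b", "description": "View Summons"},
    ],
})

# Expand the dicts into proper columns, prefixed to match the article's names.
img = pd.json_normalize(sample["summons_image"].tolist()).add_prefix("summons_image_")
sample_bronze = sample.drop(columns="summons_image").join(img)
print(list(sample_bronze.columns))
```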

5. Read the existing tables in the SQL Server Database

5.1 Using Pandas read_sql_query() method - DQL: Select

  • First, confirm whether the tables already exist in the database:
qlist_tables = """
    SELECT TOP 10000 *
    FROM [dballpurpose].INFORMATION_SCHEMA.TABLES
    WHERE TABLE_TYPE IN ('BASE TABLE')
    ORDER BY TABLE_NAME ASC
"""

df_var = pd.read_sql_query(qlist_tables,engine)
df_var

6. Send the ingested data in dataframes to SQL Server tables

6.1 Using Pandas to_sql() method - DDL: Create

  • fhvs data to for_hire_vehicles sql table:
fhvs.to_sql('for_hire_vehicles', engine, if_exists='replace', index=False)
  • violations data to violations sql table:
violations.to_sql('violations', engine, if_exists='replace', index=False)

Error:

  • ProgrammingError: (pyodbc.ProgrammingError) ('Invalid parameter type. param-index=16 param-type=dict', 'HY105')

    • (Background on this error at: https://sqlalche.me/e/20/f405)

    • This means a column's row values are dict objects rather than scalars.

    • Fix: violations['summons_image'] = violations['summons_image'].astype(str)
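Besides casting dicts to strings, to_sql calls are less fragile with an explicit chunksize (and, where needed, a dtype mapping). The sketch below uses an in-memory SQLite engine purely so it runs anywhere; the same arguments apply unchanged to the SQL Server engine created earlier.

```python
import pandas as pd
from sqlalchemy import create_engine

# Stand-in engine; substitute the SQL Server engine for the real load.
demo_engine = create_engine("sqlite:///:memory:")

df = pd.DataFrame({"plate": ["ABC123", "XYZ789"], "fine_amount": [115.0, 65.0]})

# chunksize batches the INSERTs instead of sending one giant statement.
df.to_sql("violations_demo", demo_engine, if_exists="replace",
          index=False, chunksize=1000)

round_trip = pd.read_sql("SELECT * FROM violations_demo", demo_engine)
print(len(round_trip))  # 2
```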

In SQL Server:

for_hire_vehicles table:

SELECT TOP 10000 
FORMAT(COUNT(*), '###,### K') AS [Total no of rows]
FROM [dbo].[for_hire_vehicles]

SELECT TOP 10000 
*
FROM [dbo].[for_hire_vehicles]

violations table:

SELECT TOP 10000 
FORMAT(COUNT(*), '###,### K') AS [Total no of rows]
FROM [dballpurpose].[dbo].[violations]

SELECT TOP 10000 
*
FROM [dballpurpose].[dbo].[violations]

7. Query the data from SQL table

7.1 Read a SQL query out of the database into a pandas dataframe using the Pandas read_sql_query() method - DQL: Select

sql_string = """
  SELECT TOP 5 
    *
  FROM [dballpurpose].[dbo].[violations]
"""

df_var = pd.read_sql(sql_string, engine)
df_var.head(1000000)

8. Bonus data

The [Bonus] step below connects to the NYC OpenData platform to pull in data from Financial Plan Baseline & Initiatives By Funding.

# get data from open data endpoints
limit = 1000000

fbpi_url = (
    f"https://data.cityofnewyork.us/resource/e64w-ctmw.json?"
    f"$limit={limit}&"
)

8.1 Using pandas read_json() to ingest from the URL:

Run the cell below only once.

fbpi = pd.read_json(fbpi_url)
fbpi

fbpi.dtypes

  • First, confirm whether the table already exists in the database:
qcheck_tables = """
    SELECT TABLE_NAME
    FROM [dballpurpose].INFORMATION_SCHEMA.TABLES
    WHERE TABLE_TYPE IN ('BASE TABLE')
    AND TABLE_NAME = 'financial_plan'
"""

if (pd.read_sql_query(qcheck_tables,engine).empty):
    print("financial_plan table does not exist in the SQL Server")
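An alternative to querying INFORMATION_SCHEMA is SQLAlchemy's inspect() API, which works the same across dialects. Shown here against an in-memory SQLite engine so the sketch is runnable anywhere; swap in the SQL Server engine for the real check.

```python
from sqlalchemy import create_engine, inspect, text

# Stand-in engine with a demo table, so the check has something to find.
demo_engine = create_engine("sqlite:///:memory:")
with demo_engine.begin() as conn:
    conn.execute(text("CREATE TABLE financial_plan (id INTEGER)"))

# inspect() exposes a dialect-agnostic schema API.
inspector = inspect(demo_engine)
exists = inspector.has_table("financial_plan")
print(exists)  # True
```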

8.2. Send the ingested data in dataframes to SQL Server tables

8.2.1 Using Pandas to_sql() method - DDL: Create

  • fbpi data to the financial_plan sql table:
fbpi.to_sql('financial_plan', engine, if_exists='replace', index=False)

9. Query the data from the SQL table

9.1 Read a SQL query out of the database into a pandas dataframe using the Pandas read_sql_query() method - DQL: Select

sql_string = """
  SELECT TOP 5 
    *
  FROM [dballpurpose].[dbo].[financial_plan]
"""

df_var = pd.read_sql(sql_string, engine)
df_var.head(1000000)

10. Exploring violations data

10.1 Total no of Violations:

sql_string = """
SELECT TOP 10000 
FORMAT(COUNT(*), '###,### K') AS [Total no of rows]
FROM [dballpurpose].[dbo].[violations]
"""

df_var = pd.read_sql(sql_string, engine)
df_var.head(1000000)

  • Viz πŸ“‰
fig, axs = plt.subplots(1, 1, figsize=(5, 3))

sns.barplot(
    data=df_var,
    y='Total no of rows',
    hue = 'Total no of rows',
    palette = 'pastel',
    ax=axs
)

plt.xlabel('Total no of violations')
plt.ylabel('')

plt.show()

10.2 Retrieving all info about Fire Hydrant violation:

sql_string = """
SELECT
*
FROM [dballpurpose].[dbo].[violations]
WHERE violation = 'FIRE HYDRANT'
"""

df_var = pd.read_sql(sql_string, engine)
df_var.head(1000000)

  • Viz πŸ“‰
fig, axs = plt.subplots(1, 1, figsize=(15, 5))

sns.set_theme(style="darkgrid")

sns.scatterplot(
    data=df_var,
    x= 'state',
    y = 'payment_amount',
    hue = 'penalty_amount',
    palette = 'Pastel1',
    ax=axs
)

plt.title(label='State-wide distribution of payment & penalty amount', loc='center')

plt.show()

10.3 Display the no of violations, no of unique plates, and avg fine amount for each violation:

sql_string = """
SELECT
violation,
COUNT(*) AS [Total no of Violations],
COUNT(DISTINCT plate) AS [Unique no of plate],
AVG(fine_amount) AS [Avg fine]
FROM [dballpurpose].[dbo].[violations]
GROUP BY violation
HAVING COUNT(*) > 100
ORDER BY COUNT(*) DESC
"""

df_var = pd.read_sql(sql_string, engine)
df_var.head(1000000)
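The same aggregation can be reproduced in pandas as a sanity check on the SQL result: COUNT(*) maps to size, COUNT(DISTINCT plate) to nunique, and AVG(fine_amount) to mean. A sketch on hypothetical rows:

```python
import pandas as pd

# Hypothetical rows standing in for the violations table.
sample = pd.DataFrame({
    "violation": ["FIRE HYDRANT", "FIRE HYDRANT", "CROSSWALK"],
    "plate": ["A1", "A1", "B2"],
    "fine_amount": [115.0, 115.0, 65.0],
})

summary = (
    sample.groupby("violation")
    .agg(
        total_violations=("plate", "size"),      # COUNT(*)
        unique_plates=("plate", "nunique"),      # COUNT(DISTINCT plate)
        avg_fine=("fine_amount", "mean"),        # AVG(fine_amount)
    )
    .sort_values("total_violations", ascending=False)
)
print(summary)
```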

# Create a figure with two subplots sharing the x-axis
fig, axs = plt.subplots(2, 1, sharex=True, figsize=(20, 10))
fig.subplots_adjust(hspace=0.009)  # Adjust space between the two subplots

# Flatten the array of axes (subplots) for easier iteration
axs = axs.flatten()

# Plot the same data on both subplots
for ax in axs:
    sns.barplot(data=df_var, x='violation', y='Total no of Violations', ax=ax)

# Set the limits for the y-axis on both subplots to 'break' the axis
# For example, if you want to break between 100 and 200:
axs[0].set_ylim(df_var['Total no of Violations'].nlargest(2).iloc[-1] + 100, df_var['Total no of Violations'].max()+500)  # Upper part for outliers
axs[1].set_ylim(0, df_var['Total no of Violations'].nlargest(2).iloc[-1] + 100)  # Lower part for the rest

# Hide the spines between ax1 and ax2
axs[0].spines['bottom'].set_visible(False)
axs[1].spines['top'].set_visible(False)

# Add diagonal lines to indicate the break in the y-axis
kwargs = dict(transform=axs[0].transAxes, color='k', clip_on=False)
axs[0].plot((-0.005, 0.005), (-0.005, 0.005), **kwargs)  # top axes-left diagonal
axs[0].plot((0.995, 1.005), (-0.005, 0.005), **kwargs)  # top axes-right diagonal

kwargs.update(transform=axs[1].transAxes)  # switch to the bottom axes
axs[1].plot((-0.005, 0.005), (0.995, 1.005), **kwargs)  # bottom axes-left diagonal
axs[1].plot((0.995, 1.005), (0.995, 1.005), **kwargs)  # bottom axes-right diagonal

# Set the title
axs[0].set_title('Each violations distribution', loc='center')

# Set the labels
axs[0].set_ylabel('')
axs[1].set_xlabel('Violation')
axs[1].set_ylabel('Total no of Violations')

# Get the current tick positions and labels
tick_positions = axs[1].get_xticks()
tick_labels = [label.get_text() for label in axs[1].get_xticklabels()]

# Set the tick positions and labels with rotation and Rotate x-axis labels by 90 degrees
axs[1].set_xticks(tick_positions)
axs[1].set_xticklabels(labels=tick_labels, rotation=90)


plt.tight_layout()

plt.show()

10.4 Display the no of violations, no of unique for-hire vehicle plates, and avg fine amount for each violation:

sql_string = """
SELECT
    violation,
    CASE
        WHEN [for_hire_vehicles].[dmv_license_plate_number] IS NOT NULL THEN 'FHV'
        ELSE 'NOT FHV'
    END AS [Category],
    COUNT(DISTINCT summons_number) AS [Total no of Violations],
    COUNT(DISTINCT plate) AS [Unique no of plate],
    AVG(fine_amount) AS [Avg fine]
FROM [dballpurpose].[dbo].[violations]
LEFT JOIN [dballpurpose].[dbo].[for_hire_vehicles] ON [dballpurpose].[dbo].[for_hire_vehicles].[dmv_license_plate_number] = [dballpurpose].[dbo].[violations].[plate]
GROUP BY violation,
    CASE
        WHEN [for_hire_vehicles].[dmv_license_plate_number] IS NOT NULL THEN 'FHV'
        ELSE 'NOT FHV'
    END 
ORDER BY COUNT(*) DESC
"""

df_var = pd.read_sql(sql_string, engine)
df_var.head(1000000)
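The LEFT JOIN plus CASE logic above translates to a pandas left merge followed by a null check on the joined FHV plate column. A sketch on hypothetical plates:

```python
import pandas as pd
import numpy as np

# Hypothetical stand-ins for the two SQL tables.
violations_s = pd.DataFrame({
    "plate": ["A1", "B2"],
    "violation": ["CROSSWALK", "FIRE HYDRANT"],
})
fhv_s = pd.DataFrame({"dmv_license_plate_number": ["A1"]})

merged = violations_s.merge(
    fhv_s, how="left", left_on="plate", right_on="dmv_license_plate_number"
)
# Rows with no FHV match carry NaN in the joined key, mirroring the SQL CASE.
merged["Category"] = np.where(
    merged["dmv_license_plate_number"].notna(), "FHV", "NOT FHV"
)
print(merged["Category"].tolist())  # ['FHV', 'NOT FHV']
```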

  • Viz πŸ“‰
# Create a figure with two subplots sharing the x-axis
fig, axs = plt.subplots(2, 1, sharex=True, figsize=(20, 10))
fig.subplots_adjust(hspace=0.009)  # Adjust space between the two subplots

# Flatten the array of axes (subplots) for easier iteration
axs = axs.flatten()

# Plot the same data on both subplots
for ax in axs:
    sns.barplot(
        data=df_var, 
        x='violation', 
        y='Total no of Violations', 
        hue = 'Category',
        palette = 'Pastel1',
        ax=ax
    )

# Set the limits for the y-axis on both subplots to 'break' the axis
# For example, if you want to break between 100 and 200:
axs[0].set_ylim(df_var['Total no of Violations'].nlargest(3).iloc[-1] + 5, df_var['Total no of Violations'].max()+500)  # Upper part for outliers
axs[1].set_ylim(0, df_var['Total no of Violations'].nlargest(3).iloc[-1] + 5)  # Lower part for the rest

# Hide the spines between ax1 and ax2
axs[0].spines['bottom'].set_visible(False)
axs[1].spines['top'].set_visible(False)

# Add diagonal lines to indicate the break in the y-axis
kwargs = dict(transform=axs[0].transAxes, color='k', clip_on=False)
axs[0].plot((-0.005, 0.005), (-0.005, 0.005), **kwargs)  # top axes-left diagonal
axs[0].plot((0.995, 1.005), (-0.005, 0.005), **kwargs)  # top axes-right diagonal

kwargs.update(transform=axs[1].transAxes)  # switch to the bottom axes
axs[1].plot((-0.005, 0.005), (0.995, 1.005), **kwargs)  # bottom axes-left diagonal
axs[1].plot((0.995, 1.005), (0.995, 1.005), **kwargs)  # bottom axes-right diagonal

# Set the title
axs[0].set_title('Each violations distribution', loc='center')

# Set the labels
axs[0].set_ylabel('')
axs[1].set_xlabel('Violation')
axs[1].set_ylabel('Total no of Violations')

# Get the current tick positions and labels
tick_positions = axs[1].get_xticks()
tick_labels = [label.get_text() for label in axs[1].get_xticklabels()]

# Set the tick positions and labels with rotation and Rotate x-axis labels by 90 degrees
axs[1].set_xticks(tick_positions)
axs[1].set_xticklabels(labels=tick_labels, rotation=90)


plt.tight_layout()

plt.show()

11. Workshop Questions

11.1 Choose your favorite violation type and filter the violations table for it!

sql_string = """
SELECT
    *
FROM [dballpurpose].[dbo].[violations]
WHERE violation = 'CROSSWALK'
"""

df_var = pd.read_sql(sql_string, engine)
df_var

  • Viz πŸ“‰
fig, axs = plt.subplots(1, 1, figsize=(15, 5))

sns.set_theme(style="darkgrid")

sns.barplot(
    data=df_var,
    x= 'payment_amount',
    y = 'penalty_amount',
    hue = 'state',
    palette= 'Pastel1',
    ax=axs
)

plt.title(label='State-wide distribution of penalty amount', loc='center')

plt.show()

11.2 How many total violations are there in that category?

sql_string = """
SELECT
    COUNT(*) AS [Total no of violations]
FROM [dballpurpose].[dbo].[violations]
WHERE violation = 'CROSSWALK'
"""

df_var = pd.read_sql(sql_string, engine)
df_var
  • Viz πŸ“‰
fig, axs = plt.subplots(1, 1, figsize=(15, 5))

sns.set_theme(style="darkgrid")

sns.barplot(
    data=df_var,
    y = 'Total no of violations',
    ax=axs
)

plt.xlabel('CROSSWALK violation')

plt.title(label='Total no of violations for \'CROSSWALK\' violation', loc='center')

plt.show()

11.3 What state’s residents are the worst offenders (excluding NY)?

sql_string = """
SELECT
    state,
    COUNT(DISTINCT plate) AS [Total no of Violations]
FROM [dballpurpose].[dbo].[violations]
WHERE violation = 'CROSSWALK'
    AND state <> 'NY'
GROUP BY state
ORDER BY [Total no of Violations] DESC
"""

df_var = pd.read_sql(sql_string, engine)
df_var
  • Viz πŸ“‰
fig, axs = plt.subplots(1, 1, figsize=(15, 5))

sns.set_theme(style="darkgrid")

sns.barplot(
    data=df_var,
    x = 'state',
    y = 'Total no of Violations',
    ax=axs
)

plt.xlabel('State')

plt.title(label='Total no of violations for \'CROSSWALK\' violation for each state', loc='center')

plt.show()

11.4 How many of those violations came from FHVs?

sql_string = """
SELECT
    state,
    CASE
        WHEN [for_hire_vehicles].[dmv_license_plate_number] IS NOT NULL THEN 'FHV'
        ELSE 'NOT FHV'
    END AS [Category],
    COUNT(DISTINCT summons_number) AS [Total no of Violations],
    COUNT(DISTINCT plate) AS [Unique no of plate],
    AVG(fine_amount) AS [Avg fine]
FROM [dballpurpose].[dbo].[violations]
LEFT JOIN [dballpurpose].[dbo].[for_hire_vehicles] ON [dballpurpose].[dbo].[for_hire_vehicles].[dmv_license_plate_number] = [dballpurpose].[dbo].[violations].[plate]
WHERE violation = 'CROSSWALK'
    AND state <> 'NY'
GROUP BY state,
    CASE
        WHEN [for_hire_vehicles].[dmv_license_plate_number] IS NOT NULL THEN 'FHV'
        ELSE 'NOT FHV'
    END 
ORDER BY COUNT(*) DESC
"""

df_var = pd.read_sql(sql_string, engine)
df_var
  • Viz πŸ“‰
fig, axs = plt.subplots(1, 1, figsize=(15, 5))

sns.set_theme(style="darkgrid")

sns.barplot(
    data=df_var,
    x = 'state',
    y = 'Total no of Violations',
    hue= 'Category',
    palette='Pastel1',
    ax=axs
)

plt.xlabel('State')

plt.title(label='Total no of violations for \'CROSSWALK\' violation', loc='center')

plt.show()

Conclusion

Learning objectives:

  • Python & Pandas: Import libraries and use Pandas for data manipulation and analysis.

  • Database Connectivity: Configure database connections and create engines with SQLAlchemy in Python.

  • Data Ingestion: Ingest data from NYC OpenData using Python requests and Pandas functions.

  • SQL Operations: Perform CRUD operations and query data from SQL Server using Python.

  • Data Visualization: Visualize data using Python libraries such as Matplotlib and Seaborn for insightful charts and graphs.

Source: Meghan Maloy [Link], [Link], [Link]

Author: Dheeraj. Yss
