Python script to collect the last 30 days of temperature data from a web API for select cities in the US - Python 2 Version

The goal of the following code is to collect observed weather data from The Dark Sky Forecast API (https://developer.forecast.io/docs/v2#time_cal), extract the maximum daily temperatures for the last 30 days for select cities, store the extracted data in a SQLite database, perform summary calculations, visualize the results, and write them to a .csv file.

Start by importing the necessary Python libraries.

In [1]:
%matplotlib inline
import datetime
import requests
import json
import sqlite3 as lite
import pandas as pd
import seaborn as sns

Create cities dictionary to include cities of interest and their locations.

In [2]:
cities = {"Atlanta": '33.755960,-84.390304',
          "Austin": '30.265327,-97.743788',
          "Boston": '42.358056,-71.063611',
          "Denver": '39.761850,-104.881105',
          "Los Angeles": '34.050000,-118.250000',
          "Seattle": '47.609722,-122.333056'
          }
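Each dictionary value stores latitude and longitude as a single "lat,long" string; the summary step later splits these back into floats. A minimal sketch of that parsing (the helper name `parse_location` is ours, not part of the notebook):

```python
# Split a "lat,long" string into a (lat, long) tuple of floats.
def parse_location(lat_long):
    lat, lon = (float(x) for x in lat_long.split(','))
    return lat, lon

print(parse_location('33.755960,-84.390304'))  # Atlanta: (33.75596, -84.390304)
```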

Initialize a few key variables. The start date is 30 days before now; "num_days" holds this 30-day value.

In [3]:
num_days = 30
start_date = datetime.datetime.now() - datetime.timedelta(days=num_days)
APIKEY = "2afe41b558da0181d1a9483c41f42df8"

Create a connection object, "con", to the SQLite database "weather.db". (If weather.db does not exist, SQLite3 will create it.) Then use the connection's .cursor() method to create a cursor object, "cur", for manipulating tables in the database. If a "temperature" table already exists in "weather.db", drop it and recreate it so we start fresh with an empty table.

In [4]:
con = lite.connect('weather.db')
with con:
    cur = con.cursor()
    cur.execute("DROP TABLE IF EXISTS temperature")
    cur.execute("CREATE TABLE temperature \
    (city text, tmax float, tmax_time integer)")
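As a quick sanity check, the same DROP/CREATE pattern can be run against an in-memory database and the resulting schema inspected with PRAGMA table_info (a sketch; ':memory:' stands in for weather.db):

```python
import sqlite3 as lite

con = lite.connect(':memory:')  # stand-in for weather.db
with con:
    cur = con.cursor()
    cur.execute("DROP TABLE IF EXISTS temperature")
    cur.execute("CREATE TABLE temperature "
                "(city text, tmax float, tmax_time integer)")
    # PRAGMA table_info returns one row per column:
    # (cid, name, type, notnull, dflt_value, pk)
    cols = [row[1] for row in cur.execute("PRAGMA table_info(temperature)")]

print(cols)  # ['city', 'tmax', 'tmax_time']
```

Note that `with con:` wraps the statements in a transaction (committing on success, rolling back on error); it does not close the connection.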

Next, loop through all the cities in the "cities" dictionary, building a properly formatted API request for each city for each day of data desired (30 days in this example). Once the API request has been formatted, ping the API, extract the data of interest (here, the maximum temperature for that day), and store it both in the "weather.db" database and in a Pandas DataFrame, "df", to aid in later calculations. This requires a loop within a loop: the inner loop collects a single city's data for each day of the period of interest (the past 30 days in this case), and the outer loop cycles through the cities.

In [5]:
for city_var, latLong in cities.iteritems():
    
    date = start_date     # initialize the date variable
    
    for i in range(0, num_days):  
        date_str = date.strftime('%Y-%m-%dT%H:%M:%S')

        url = ('https://api.forecast.io/forecast/' + APIKEY + '/' + latLong +
               ',' + date_str)

        r = requests.get(url)  # issue the request
        response_dict = json.loads(r.text)

        tmax_var = response_dict['daily']['data'][0]['temperatureMax']
        tmax_time_var = response_dict['daily']['data'][0]['temperatureMaxTime']
        
        # convert time of max temperature from UNIX time to standard format
        tmax_time_var_std = datetime.datetime.fromtimestamp(
            int(tmax_time_var)).strftime('%Y-%m-%d %H:%M:%S')

        con = lite.connect('weather.db')
        with con:
            cur = con.cursor()
            cur.execute("INSERT INTO temperature (city, tmax, tmax_time) \
            VALUES (?, ?, ?)", (city_var, tmax_var, tmax_time_var_std))
            cur.execute("SELECT * FROM temperature")

            # store all rows of data in the "rows" variable
            rows = cur.fetchall()
            
            # get the column names from the cur.description attribute 
            # which are in the first position, the 0th spot.
            cols = [desc[0] for desc in cur.description] 
            
            df = pd.DataFrame(rows, columns=cols)

        date = date + datetime.timedelta(days=1)
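The parsing above assumes the 'daily' block is always present; if the API omits a field, a KeyError stops the whole run. A defensive helper could return None instead (a sketch using a hand-built response dict rather than a live API call; the helper name `extract_daily_max` is ours):

```python
def extract_daily_max(response_dict):
    """Return (temperatureMax, temperatureMaxTime), or None if absent."""
    try:
        day = response_dict['daily']['data'][0]
        return day['temperatureMax'], day['temperatureMaxTime']
    except (KeyError, IndexError):
        return None

# Hand-built stand-in for an API response:
sample = {'daily': {'data': [{'temperatureMax': 85.1,
                              'temperatureMaxTime': 1465322400}]}}
print(extract_daily_max(sample))  # (85.1, 1465322400)
print(extract_daily_max({}))      # None
```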

Let's look at the first and last five rows of the data in the "df" DataFrame. Each row is one day's worth of data; "tmax" is the maximum temperature for the day in degrees Fahrenheit; "tmax_time" is the date and time of the maximum temperature that day.

In [6]:
df.head()
Out[6]:
        city   tmax            tmax_time
0     Boston  85.10  2016-06-07 14:00:00
1     Boston  73.21  2016-06-08 12:00:00
2     Boston  66.70  2016-06-09 13:00:00
3     Boston  70.14  2016-06-10 16:00:00
4     Boston  67.71  2016-06-11 10:00:00
In [7]:
df.tail()
Out[7]:
       city   tmax            tmax_time
175  Austin  94.26  2016-07-02 15:00:00
176  Austin  95.80  2016-07-03 14:00:00
177  Austin  97.28  2016-07-04 13:00:00
178  Austin  96.47  2016-07-05 14:00:00
179  Austin  95.99  2016-07-06 14:00:00

Calculate summary statistics over the 30-day period and store them in a new DataFrame, "df_summary"; split the latitude/longitude string into two separate columns, converting the values from strings to floats; then write the DataFrame to a .csv file called "summary.csv".

In [8]:
df_summary = pd.DataFrame(columns=('city', 'long', 'lat', 'max_tmax',
                                   'min_tmax', 'range_tmax', 'mean_tmax',
                                   'sd_tmax'))

df_summary['max_tmax'] = df.groupby('city')['tmax'].max()
df_summary['min_tmax'] = df.groupby('city')['tmax'].min()
df_summary['mean_tmax'] = df.groupby('city')['tmax'].mean()
df_summary['sd_tmax'] = df.groupby('city')['tmax'].std()
df_summary['range_tmax'] = df_summary['max_tmax'] - df_summary['min_tmax']
df_summary['city'] = df_summary.index

for k, v in cities.iteritems():
    location = tuple(float(x) for x in v.split(','))
    df_summary.set_value(k, 'lat', location[0])
    df_summary.set_value(k, 'long', location[1])
    
df_summary.to_csv('summary.csv', index=False)
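The per-statistic groupby calls above can also be collapsed into a single pass with `agg`. A sketch on a toy DataFrame (column names match the real one; the numbers are made up):

```python
import pandas as pd

df_toy = pd.DataFrame({'city': ['A', 'A', 'B', 'B'],
                       'tmax': [80.0, 90.0, 60.0, 70.0]})

# One groupby pass computing all four statistics at once.
summary = df_toy.groupby('city')['tmax'].agg(['max', 'min', 'mean', 'std'])
summary['range'] = summary['max'] - summary['min']
print(summary)
```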

Let's look at the "df_summary" DataFrame. Since there are only six cities, we can look at all of the rows, not just the first five.

In [9]:
df_summary
Out[9]:
                    city      long      lat  max_tmax  min_tmax  range_tmax  mean_tmax   sd_tmax
city
Atlanta          Atlanta  -84.3903   33.756     97.33     80.02       17.31  90.414333  4.073621
Austin            Austin  -97.7438  30.2653     97.28     84.40       12.88  93.116000  2.512278
Boston            Boston  -71.0636  42.3581     88.96     66.70       22.26  77.679333  5.401550
Denver            Denver  -104.881  39.7619     96.47     72.37       24.10  84.813667  6.570845
Los Angeles  Los Angeles   -118.25    34.05     96.54     68.44       28.10  78.550667  6.939654
Seattle          Seattle  -122.333  47.6097     81.17     56.78       24.39  67.533333  5.285666

Create a box plot for each city using the seaborn library.

In [10]:
sns.set_style("whitegrid")
ax = sns.boxplot(x="city", y="tmax", data=df.sort_values(by='city'))
ax = sns.swarmplot(x="city", y="tmax", data=df.sort_values(by='city'),
                   color=".25")

If you have not re-run the code yourself, the plot shows data from the 30 days between June 7, 2016 and July 6, 2016, inclusive.

For this time period, Austin's maximum daily temperature fluctuates the least, changing only 12.88 degrees F over the 30 days, while Los Angeles has the largest range at 28.10 degrees F (including a couple of points shown as outliers in the tan box plot for Los Angeles).

Austin also has the highest mean maximum temperature over the period, indicating that not only is the maximum temperature relatively consistent, but it is also just plain hot there.

Seattle has the lowest maximum daily temperatures of the six cities considered, which is not surprising given the generally cloudier and rainier climate of the Pacific Northwest.
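Claims like these can be read straight off "df_summary" with idxmin/idxmax. A sketch against a toy frame in the same shape (only three of the six cities, using the values from the table above):

```python
import pandas as pd

toy = pd.DataFrame({'range_tmax': [12.88, 28.10, 24.39],
                    'mean_tmax': [93.116, 78.551, 67.533]},
                   index=['Austin', 'Los Angeles', 'Seattle'])

print(toy['range_tmax'].idxmin())  # city with the smallest range: Austin
print(toy['mean_tmax'].idxmax())   # hottest city on average: Austin
print(toy['mean_tmax'].idxmin())   # coolest city on average: Seattle
```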

Note: this code has been updated to run using Python 3 in this companion Notebook (https://github.com/hdavis/Temperature/blob/gh-pages/Temperature_Python3_Version.ipynb).