Python script to collect the last 30 days of temperature data from a web API for select cities in the US - Python 2 Version

The goal of the following code is to collect observed weather data from The Dark Sky Forecast API (https://developer.forecast.io/docs/v2#time_cal), extract the maximum daily temperatures for the last 30 days for select cities, store the extracted data in a SQLite database, perform summary calculations, visualize the results, and write them to a .csv file.

Start by importing the necessary Python libraries.

In [1]:
%matplotlib inline
import datetime
import requests
import json
import sqlite3 as lite
import pandas as pd
import seaborn as sns

Create cities dictionary to include cities of interest and their locations.

In [2]:
cities = {"Atlanta": '33.755960,-84.390304',
          "Austin": '30.265327,-97.743788',
          "Boston": '42.358056,-71.063611',
          "Denver": '39.761850,-104.881105',
          "Los Angeles": '34.050000,-118.250000',
          "Seattle": '47.609722,-122.333056'
          }
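Each dictionary value stores latitude and longitude as a single "lat,long" string; the summary step later splits these back into floats. A minimal sketch of that parsing (the helper name `parse_location` is ours, not part of the notebook):

```python
# Split a "lat,long" string into a (lat, long) tuple of floats.
def parse_location(lat_long):
    lat, lon = (float(x) for x in lat_long.split(','))
    return lat, lon

print(parse_location('33.755960,-84.390304'))  # Atlanta: (33.75596, -84.390304)
```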

Initialize a few key variables. The start date is 30 days before now; "num_days" holds this 30-day value.

In [3]:
num_days = 30
start_date = datetime.datetime.now() - datetime.timedelta(days=num_days)
APIKEY = "2afe41b558da0181d1a9483c41f42df8"

Create a connection object, "con", to the SQLite database "weather.db". (If weather.db does not exist, SQLite3 will create it.) Then use the connection's .cursor() method to create a cursor object, "cur", for manipulating tables in the database. If a "temperature" table already exists in "weather.db", drop it and recreate it so we start fresh with an empty table.

In [4]:
con = lite.connect('weather.db')
with con:
    cur = con.cursor()
    cur.execute("DROP TABLE IF EXISTS temperature")
    cur.execute("CREATE TABLE temperature \
    (city text, tmax float, tmax_time integer)")
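As a quick sanity check, the same DROP/CREATE pattern can be run against an in-memory database and the resulting schema inspected with PRAGMA table_info (a sketch; ':memory:' stands in for weather.db):

```python
import sqlite3 as lite

con = lite.connect(':memory:')  # stand-in for weather.db
with con:
    cur = con.cursor()
    cur.execute("DROP TABLE IF EXISTS temperature")
    cur.execute("CREATE TABLE temperature "
                "(city text, tmax float, tmax_time integer)")
    # PRAGMA table_info returns one row per column:
    # (cid, name, type, notnull, dflt_value, pk)
    cols = [row[1] for row in cur.execute("PRAGMA table_info(temperature)")]

print(cols)  # ['city', 'tmax', 'tmax_time']
```

Note that `with con:` wraps the statements in a transaction (committing on success, rolling back on error); it does not close the connection.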

Next, loop through all the cities in the "cities" dictionary, building a properly formatted API request for each city for each day of data desired (30 days in this example). Once the API request has been formatted, ping the API, extract the data of interest (here, the maximum temperature for that day), and store it both in the "weather.db" database and in a Pandas DataFrame, "df", to aid in later calculations. This requires a loop within a loop: the inner loop collects a single city's data for each day of the period of interest (the past 30 days in this case), and the outer loop cycles through the cities.

In [5]:
for city_var, latLong in cities.iteritems():
    
    date = start_date     # initialize the date variable
    
    for i in range(0, num_days):  
        date_str = date.strftime('%Y-%m-%dT%H:%M:%S')

        url = ('https://api.forecast.io/forecast/' + APIKEY + '/' + latLong +
               ',' + date_str)

        r = requests.get(url)  # issue the request
        response_dict = json.loads(r.text)

        tmax_var = response_dict['daily']['data'][0]['temperatureMax']
        tmax_time_var = response_dict['daily']['data'][0]['temperatureMaxTime']
        
        # convert time of max temperature from UNIX time to standard format
        tmax_time_var_std = datetime.datetime.fromtimestamp(
            int(tmax_time_var)).strftime('%Y-%m-%d %H:%M:%S')

        con = lite.connect('weather.db')
        with con:
            cur = con.cursor()
            cur.execute("INSERT INTO temperature (city, tmax, tmax_time) \
            VALUES (?, ?, ?)", (city_var, tmax_var, tmax_time_var_std))
            cur.execute("SELECT * FROM temperature")

            # store all rows of data in the "rows" variable
            rows = cur.fetchall()
            
            # get the column names from the cur.description attribute 
            # which are in the first position, the 0th spot.
            cols = [desc[0] for desc in cur.description] 
            
            df = pd.DataFrame(rows, columns=cols)

        date = date + datetime.timedelta(days=1)
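The parsing above assumes the 'daily' block is always present; if the API omits a field, a KeyError stops the whole run. A defensive helper could return None instead (a sketch using a hand-built response dict rather than a live API call; the helper name `extract_daily_max` is ours):

```python
def extract_daily_max(response_dict):
    """Return (temperatureMax, temperatureMaxTime), or None if absent."""
    try:
        day = response_dict['daily']['data'][0]
        return day['temperatureMax'], day['temperatureMaxTime']
    except (KeyError, IndexError):
        return None

# Hand-built stand-in for an API response:
sample = {'daily': {'data': [{'temperatureMax': 85.1,
                              'temperatureMaxTime': 1465322400}]}}
print(extract_daily_max(sample))  # (85.1, 1465322400)
print(extract_daily_max({}))      # None
```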

Let's look at the first and last five rows of the data in the "df" DataFrame. Each row is one day's worth of data; "tmax" is the maximum temperature for the day in degrees Fahrenheit; "tmax_time" is the date and time of the maximum temperature that day.

In [6]:
df.head()
Out[6]:
        city   tmax            tmax_time
0     Boston  85.10  2016-06-07 14:00:00
1     Boston  73.21  2016-06-08 12:00:00
2     Boston  66.70  2016-06-09 13:00:00
3     Boston  70.14  2016-06-10 16:00:00
4     Boston  67.71  2016-06-11 10:00:00
In [7]:
df.tail()
Out[7]:
       city   tmax            tmax_time
175  Austin  94.26  2016-07-02 15:00:00
176  Austin  95.80  2016-07-03 14:00:00
177  Austin  97.28  2016-07-04 13:00:00
178  Austin  96.47  2016-07-05 14:00:00
179  Austin  95.99  2016-07-06 14:00:00

Calculate summary statistics over the 30-day period and store them in a new DataFrame, "df_summary"; split the latitude/longitude string into two separate columns, converting the values from strings to floats; then write the DataFrame to a .csv file called "summary.csv".

In [8]:
df_summary = pd.DataFrame(columns=('city', 'long', 'lat', 'max_tmax',
                                   'min_tmax', 'range_tmax', 'mean_tmax',
                                   'sd_tmax'))

df_summary['max_tmax'] = df.groupby('city')['tmax'].max()
df_summary['min_tmax'] = df.groupby('city')['tmax'].min()
df_summary['mean_tmax'] = df.groupby('city')['tmax'].mean()
df_summary['sd_tmax'] = df.groupby('city')['tmax'].std()
df_summary['range_tmax'] = df_summary['max_tmax'] - df_summary['min_tmax']
df_summary['city'] = df_summary.index

for k, v in cities.iteritems():
    location = tuple(float(x) for x in v.split(','))
    df_summary.set_value(k, 'lat', location[0])
    df_summary.set_value(k, 'long', location[1])
    
df_summary.to_csv('summary.csv', index=False)
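The per-statistic groupby calls above can also be collapsed into a single pass with `agg`. A sketch on a toy DataFrame (column names match the real one; the numbers are made up):

```python
import pandas as pd

df_toy = pd.DataFrame({'city': ['A', 'A', 'B', 'B'],
                       'tmax': [80.0, 90.0, 60.0, 70.0]})

# One groupby pass computing all four statistics at once.
summary = df_toy.groupby('city')['tmax'].agg(['max', 'min', 'mean', 'std'])
summary['range'] = summary['max'] - summary['min']
print(summary)
```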

Let's look at the "df_summary" DataFrame. Since there are only six cities, we can look at all of the rows, not just the first five.

In [9]:
df_summary
Out[9]:
                    city      long      lat  max_tmax  min_tmax  range_tmax  mean_tmax   sd_tmax
city
Atlanta          Atlanta  -84.3903   33.756     97.33     80.02       17.31  90.414333  4.073621
Austin            Austin  -97.7438  30.2653     97.28     84.40       12.88  93.116000  2.512278
Boston            Boston  -71.0636  42.3581     88.96     66.70       22.26  77.679333  5.401550
Denver            Denver  -104.881  39.7619     96.47     72.37       24.10  84.813667  6.570845
Los Angeles  Los Angeles   -118.25    34.05     96.54     68.44       28.10  78.550667  6.939654
Seattle          Seattle  -122.333  47.6097     81.17     56.78       24.39  67.533333  5.285666

Create a box plot for each city using the seaborn library.

In [10]:
sns.set_style("whitegrid")
ax = sns.boxplot(x="city", y="tmax", data=df.sort_values(by='city'))
ax = sns.swarmplot(x="city", y="tmax", data=df.sort_values(by='city'),
                   color=".25")

If you have not re-run the code yourself, the plot shows data from the 30 days between June 7, 2016 and July 6, 2016, inclusive.

For this time period, Austin's maximum daily temperature fluctuates the least, changing only 12.88 degrees F over the 30 days, while Los Angeles has the largest range at 28.10 degrees F (including a couple of points shown as outliers in the tan box plot for Los Angeles).

Austin also has the highest mean maximum temperature over the period, indicating that not only is the maximum temperature relatively consistent, but it is also just plain hot there.

Seattle has the lowest maximum daily temperatures of the six cities considered, which is not surprising given the generally cloudier and rainier climate of the Pacific Northwest.
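Claims like these can be read straight off "df_summary" with idxmin/idxmax. A sketch against a toy frame in the same shape (only three of the six cities, using the values from the table above):

```python
import pandas as pd

toy = pd.DataFrame({'range_tmax': [12.88, 28.10, 24.39],
                    'mean_tmax': [93.116, 78.551, 67.533]},
                   index=['Austin', 'Los Angeles', 'Seattle'])

print(toy['range_tmax'].idxmin())  # city with the smallest range: Austin
print(toy['mean_tmax'].idxmax())   # hottest city on average: Austin
print(toy['mean_tmax'].idxmin())   # coolest city on average: Seattle
```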

Note: this code has been updated to run using Python 3 in this companion Notebook (https://github.com/hdavis/Temperature/blob/gh-pages/Temperature_Python3_Version.ipynb).