National Park Visitation: Contributing Factors to Park Popularity

Analysis conducted by Andrew Sumner

Published July 18th, 2019

Introduction

The National Park system contains some of the USA's most iconic treasures. Yosemite, the Grand Canyon, and Yellowstone are all household names, each receiving millions of visitors annually. The Grand Canyon is even frequently counted among the seven natural wonders of the world. Yet these undoubtedly beautiful places are only a small fraction of the 61 National Parks distributed throughout our massive country. For an avid lover of the outdoors, this raises a major question: what "hidden gems" exist within the park system?

In this analysis, I set out to determine whether the most visited parks are also the ones that provide the best experience for their visitors. I then narrow down which undervalued parks are most worth visiting and look at some other factors that could be driving visitation.

These insights will be particularly useful for any person seeking relatively unknown and crowd-free places to savor the natural world. Additionally, my research could assist the National Park Service in identifying where to expand their visitor resources and the factors that may cause a traveller to pick one park over another. This could potentially lead to increased traffic through some lesser-known parks, resulting in increased revenue for the park service. Using this data to lobby for increased NPS funding is also a possibility.

Let's start by scraping and cleaning the data we think might be useful.

Data Scraping, Cleaning, and Processing

In [1]:
# Remember to install these packages before running the code
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats
import requests
from bs4 import BeautifulSoup
from fuzzywuzzy import process
from fuzzywuzzy import fuzz
import re
import statsmodels.formula.api as smf

Our primary dataset is sourced from the National Park Service's database of use statistics. The data we want can be found at this link. Because of how this website is implemented, I had to open it manually in a browser and select options from a user interface before the page would load the necessary data. Once the options are submitted, a button allows the displayed data to be downloaded in various formats. For the sake of reproducibility, I will briefly summarize the menu options I selected.

Menu options: years = all from 1979 thru 2018, all months, all regions, park type = National Park, all parks, all fields, all additional fields, annual summary only = TRUE

I downloaded the corresponding data as an Excel file. If you're attempting to reproduce this analysis, place the file in the same folder as this Python script/notebook.

In [2]:
# Loading excel file as dataframe
nps_df = pd.read_excel("Query Builder for Public Use Statistics (1979 - Last Calendar Year).xlsx")
nps_df.head()
Out[2]:
NPS Public Use Statistics Query Builder Unnamed: 1 Unnamed: 2 Unnamed: 3 Unnamed: 4 Unnamed: 5 Unnamed: 6 Unnamed: 7 Unnamed: 8 Unnamed: 9 Unnamed: 10 Unnamed: 11 Unnamed: 12 Unnamed: 13 Unnamed: 14 Unnamed: 15 Unnamed: 16
0 Bookmark this report: https://irma.nps.gov/Sta... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 Park Unit Code Park Type Region State Year Recreation Visits Non-Recreation Visits Recreation Hours Non-Recreation Hours Concessioner\nLodging Concessioner\nCamping Tent Campers RV Campers Backcountry Campers Non-Recreation Overnight Stays Misc. Overnight Stays
3 Acadia NP ACAD National Park Northeast ME 1979 2787366 395913 17155530 198056 0 0 139708 74279 0 0 0
4 Acadia NP ACAD National Park Northeast ME 1980 2779666 493429 18671588 276719 0 0 155704 76124 0 0 0

As you can see, we need to tidy up this table to make it usable. Below, we set the column names to match the row at index 2 and drop the padding rows.

In [3]:
# Correctly set column names
nps_df.columns = (nps_df.iloc[2]).values
# Drop padding and unneeded notes
nps_df.drop([0, 1, 2], inplace = True)
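
Alternatively, the same cleanup can be done in one step by telling read_excel where the real header row is. A minimal sketch (not run here), assuming the spreadsheet layout shown above, with a title, bookmark note, and blank row above the column names:

# Read the file using the fourth spreadsheet row (index 3) as the header,
# skipping the title, bookmark, and blank rows above it
nps_df = pd.read_excel(
    "Query Builder for Public Use Statistics (1979 - Last Calendar Year).xlsx",
    header=3,
)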

A brief inspection of the data table (via the Excel file or the website) reveals that a few parks are recorded as having zero recreation visits in particular years.

In [4]:
nps_df.loc[(nps_df['Recreation Visits'] == 0)]
Out[4]:
Park Unit Code Park Type Region State Year Recreation Visits Non-Recreation Visits Recreation Hours Non-Recreation Hours Concessioner Lodging Concessioner Camping Tent Campers RV Campers Backcountry Campers Non-Recreation Overnight Stays Misc. Overnight Stays
1370 Katmai NP & PRES KATM National Park Alaska AK 1995 0 0 0 0 0 0 0 0 0 0 0
1502 Kobuk Valley NP KOVA National Park Alaska AK 2014 0 0 0 0 0 0 0 0 0 0 0
1503 Kobuk Valley NP KOVA National Park Alaska AK 2015 0 0 0 0 0 0 0 0 0 0 0
1705 National Park of American Samoa NPSA National Park Pacific West AS 2003 0 0 0 0 0 0 0 0 0 0 0

If you look at the surrounding data for these parks, it becomes apparent that the zero value is most likely due to temporary park closures, issues with data collection, or data loss. Since these data points are not representative of the public's desire to visit these parks during the years in question, we will exclude them from our analysis.

In [5]:
nps_df.drop([1370, 1502, 1503, 1705], inplace = True)
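
A slightly more robust alternative, in case the row indices shift when the query is re-downloaded, is to drop these rows using the same boolean mask we used to find them; a roughly equivalent sketch:

# Keep only park-year rows with a nonzero recreation visit count
nps_df = nps_df.loc[nps_df['Recreation Visits'] != 0]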

Next, we're going to scrape some carefully selected subjective rankings of the national parks and average them into an overall subjective ranking. I sought out rank lists created specifically by individuals, or panels of individuals, who had visited ALL of the national parks. I found five such lists suitable for use, and each selected website documents the reasoning behind its rankings. You can find these rank lists here: List 1, List 2, List 3, List 4, List 5. List 5 has separate documentation available here. You may notice that lists 2 and 5 are hosted on the same website, but their rankings were compiled by different panelists. It is also worth noting that list 3 scores the parks on five subjective criteria. I did not scrape that part of the data since it came from only a single source, but it may provide some insight into what makes a "good" park experience and would be worth further investigation.

Scraping Subjective Source 1

In [6]:
r = requests.get('https://blog.thediscoverer.com/every-us-national-park-ranked/')
root = BeautifulSoup(r.content, "lxml")

# By investigating the website's HTML code via a browser or print statement,
# we can see that every park ranking is contained in a header of consistent
# size.
#print(root.prettify())

headers_with_tags = root.findAll("h2")
headers = []
for header in headers_with_tags:
    h_text = header.getText()
    if '.' in h_text:
        headers.append(h_text)

# The following code is data cleaning necessary to separate the parks at rankings 16 and 17
headers.remove('16./17. Sequoia & Kings Canyon National Parks')
headers.append('16. Sequoia National Park')
headers.append('17. Kings Canyon National Park')

# Splitting rankings from names
temp_lst = []
for header in headers:
    for s in header.split('.', 1):
        temp_lst.append(s)

# Putting rankings and park names into a dataframe        
temp_arr = np.array(temp_lst)
reshaped = temp_arr.reshape(61, 2)
ranksV1_df = pd.DataFrame(reshaped, columns = ['Rank', 'Name'])
ranksV1_df['Rank'] = [float(val) for val in ranksV1_df['Rank'].values]
ranksV1_df.sort_values(by = 'Rank', inplace = True)
ranksV1_df.set_index('Name', inplace = True)
#ranksV1_df

Scraping Subjective Source 2

In [7]:
# Now for the second subjective source. The process is practically identical to the first.
r = requests.get('https://www.theactivetimes.com/travel/all-59-national-parks-ranked')
root = BeautifulSoup(r.content, "lxml")
#print(root.prettify())
headers_with_tags = root.findAll("h2")
headers = []
for header in headers_with_tags:
    h_text = header.getText()
    if '.' in h_text:
        headers.append(h_text)

temp_lst = []
for header in headers:
    for s in header.split('.', 1):
        temp_lst.append(s)
        
temp_arr = np.array(temp_lst)
reshaped = temp_arr.reshape(len(headers), 2)
ranksV2_df = pd.DataFrame(reshaped, columns = ['Rank', 'Name'])
ranksV2_df['Rank'] = [float(val) for val in ranksV2_df['Rank'].values]
ranksV2_df.sort_values(by = 'Rank', inplace = True)
ranksV2_df.set_index('Name', inplace = True)
#ranksV2_df

Scraping Subjective Source 3

In [8]:
# Now for the third subjective source. By chance, the process is practically identical to the first two.
# The only difference is that the headers appear twice, so duplicates must be removed.
r = requests.get('https://www.reneeroaming.com/americas-national-parks/')
root = BeautifulSoup(r.content, "lxml")
#print(root.prettify())
headers_with_tags = root.findAll("h2")
headers = []
for header in headers_with_tags:
    h_text = header.getText()
    if '.' in h_text:
        h_text = (h_text.split(',', 1))[0]
        headers.append(h_text)

headers = list(set(headers)) # removes duplicate entries
temp_lst = []
for header in headers:
    for s in header.split('.', 1):
        temp_lst.append(s)
        
temp_arr = np.array(temp_lst)
reshaped = temp_arr.reshape(len(headers), 2)
ranksV3_df = pd.DataFrame(reshaped, columns = ['Rank', 'Name'])
ranksV3_df['Rank'] = [float(val) for val in ranksV3_df['Rank'].values]
ranksV3_df.sort_values(by = 'Rank', inplace = True)
ranksV3_df.set_index('Name', inplace = True)
#ranksV3_df

Scraping Subjective Source 4

In [9]:
# Begin scraping 4th subjective source
r = requests.get('https://www.leeabbamonte.com/north-america/national-parks/all-60-us-national-parks-ranked.html')
root = BeautifulSoup(r.content, "lxml")
#print(root.prettify())
headers_with_tags = root.findAll("h1")
headers = []
for header in headers_with_tags:
    h_text = header.getText()
    headers.append(h_text)

headers.pop(0) # Removes title of article

temp_lst = []
for header in headers:
    for s in header.split(' ', 1): # No periods here, just spaces, unlike the previous sources
        temp_lst.append(s)
        
temp_arr = np.array(temp_lst)
reshaped = temp_arr.reshape(len(headers), 2)
ranksV4_df = pd.DataFrame(reshaped, columns = ['Rank', 'Name'])
ranksV4_df['Rank'] = [float(val) for val in ranksV4_df['Rank'].values]
ranksV4_df.sort_values(by = 'Rank', inplace = True)
ranksV4_df.set_index('Name', inplace = True)
#ranksV4_df

Scraping Subjective Source 5

In [10]:
# Begin scraping 5th (last) subjective source
# Note this list is hosted on the same site as the second, but consulted different panelists to determine rankings
# More info about criteria is available at https://www.theactivetimes.com/national-parks-ranked
r = requests.get('https://www.theactivetimes.com/59-national-parks-ranked')
root = BeautifulSoup(r.content, "lxml")
#print(root.prettify())
headers_with_tags = root.findAll("h2")
headers = []
for header in headers_with_tags:
    h_text = header.getText()
    if '#' in h_text:
        temp = h_text.split('#', 1) # this line and next line remove hashtag at beginning of rank number
        h_text = temp[len(temp) - 1]
        headers.append(h_text)

temp_lst = []
for header in headers:
    for s in header.split(' ', 1): # No periods here, just spaces, unlike the previous sources
        temp_lst.append(s)
        
temp_arr = np.array(temp_lst)
reshaped = temp_arr.reshape(len(headers), 2)
ranksV5_df = pd.DataFrame(reshaped, columns = ['Rank', 'Name'])
ranksV5_df['Rank'] = [float(val) for val in ranksV5_df['Rank'].values]
ranksV5_df.sort_values(by = 'Rank', inplace = True)
ranksV5_df.set_index('Name', inplace = True)
#ranksV5_df

Now we're going to consolidate the individual subjective ranking tables into one table. Since each list may write certain park names slightly differently, we'll standardize on the park naming convention from the official NPS dataset. This means we'll need fuzzy (approximate) string matching to match parks between dataframes. I chose the fuzzywuzzy package for this.
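
To illustrate how the matcher behaves, here is a toy example with a hypothetical candidate list (the real code below matches against the full array of park names from the NPS data):

from fuzzywuzzy import fuzz, process

# token_sort_ratio tokenizes, sorts, and compares the strings, so word order
# and formatting differences matter less; a loosely written name should still
# resolve to the NPS-style name.
candidates = ['Denali NP & PRES', 'Death Valley NP', 'Grand Canyon NP']
print(process.extractOne('Denali Preserve', candidates, scorer=fuzz.token_sort_ratio))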

In [11]:
# Identify a list of parks
parks = nps_df['Park'].unique()

all_ranks_df = pd.DataFrame(index = parks, columns = ['Ranks_V1', 'Ranks_V2', 'Ranks_V3', 'Ranks_V4', 'Ranks_V5'])
ranking_frames = [ranksV1_df, ranksV2_df, ranksV3_df, ranksV4_df, ranksV5_df]

i = 0
for frame in ranking_frames:
    i = i + 1
    for idx, row in frame.iterrows():
        # This section uses fuzzy string matching to match park names between lists
        name = row.name
        shortname = name.replace('National ', '') # This should prevent resolution issues with Natl. Park of Amer. Samoa
        shortname = shortname.replace('Park', '')
        shortname = shortname.replace('State ', '')
        shortname = shortname.replace('Preserve', '') # preventing resolution issues
        match = (process.extractOne(shortname, parks, scorer = fuzz.token_sort_ratio))[0] # fuzzywuzzy
        all_ranks_df.at[match, ('Ranks_V' + str(i))] = row['Rank']

# Creating new rankings from the rank averages
all_ranks_df['Mean_Ranks'] = all_ranks_df.mean(axis=1)
all_ranks_df.sort_values(by = 'Mean_Ranks', inplace = True)
# enumerating in order
all_ranks_df['Derived_Rank'] = [i for i in np.arange(1,62)]
all_ranks_df.head()
Out[11]:
Ranks_V1 Ranks_V2 Ranks_V3 Ranks_V4 Ranks_V5 Mean_Ranks Derived_Rank
Yosemite NP 1 1 6 1 9 3.6 1
Glacier NP 2 2 8 12 8 6.4 2
Zion NP 7 3 19 7 4 8.0 3
Yellowstone NP 5 7 16 13 1 8.4 4
Grand Canyon NP 15 6 18 6 2 9.4 5

Next, we'll scrape some additional basic park data compiled on the Wikipedia page for US national parks, which can be found here. The data of greatest interest to us are each park's latitude and longitude and its area in acres.

In [12]:
# This section scrapes the table structure from the site and reformats into dataframe
r = requests.get('https://en.wikipedia.org/wiki/List_of_national_parks_of_the_United_States')
root = BeautifulSoup(r.content, "lxml")
#print(root.prettify())
tables = root.findAll("table")
df_wrapped = pd.read_html(str(tables[1]))
wiki_df = df_wrapped[0]

wiki_df.drop(['Image', 'Description'], axis = 1, inplace = True)
wiki_df.columns = ['Name', 'Location', 'Established', 'Area_Acres_2017', 'Visits_2018']

#wiki_df
In [13]:
# Regular expression to strip non-numeric characters from lat and long values
regex = re.compile('[^0-9\.]')

wiki_df['Latitude_N'] = np.nan
wiki_df['Longitude_W'] = np.nan
wiki_df['Derived_rank'] = np.nan

for idx, row in wiki_df.iterrows():
    
    # split location into lat and long
    
    loc = (row['Location'].split('/ ', 1))[1]
    lat, long = loc.split(' ', 1)
    if 'N' in lat[-1]:
        lat = float(regex.sub('', lat))
    else:
        lat = 0.0 - float(regex.sub('', lat))
    if 'W' in long[-1]:
        long = float(regex.sub('', long))
    else:
        long = 0.0 - float(regex.sub('', long))
    wiki_df.at[row.name, 'Latitude_N'] = lat
    wiki_df.at[row.name, 'Longitude_W'] = long
    
    # Now deal with tidying area column into float values
    
    acres = (row['Area_Acres_2017'].split(' acres', 1))[0]
    acres = float(acres.replace(',', ''))
    wiki_df.at[row.name, 'Area_Acres_2017'] = acres
    
    # Fixing names to NPS dataset standard and adding in the derived ranks
    
    name = row['Name']
    match = (process.extractOne(name, parks, scorer = fuzz.token_sort_ratio))[0] # fuzzywuzzy
    wiki_df.at[row.name, 'Name'] = match
    wiki_df.at[row.name, 'Derived_rank'] = all_ranks_df.at[match, 'Derived_Rank']

wiki_df.head()
Out[13]:
Name Location Established Area_Acres_2017 Visits_2018 Latitude_N Longitude_W Derived_rank
0 Acadia NP Maine44°21′N 68°13′W / 44.35°N 68.21°W February 26, 1919 49075.3 3537575 44.35 68.21 11.0
1 National Park of American Samoa American Samoa14°15′S 170°41′W / 14.25°S 170... October 31, 1988 8256.67 28626 -14.25 170.68 46.0
2 Arches NP Utah38°41′N 109°34′W / 38.68°N 109.57°W November 12, 1971 76679 1663557 38.68 109.57 8.0
3 Badlands NP South Dakota43°45′N 102°30′W / 43.75°N 102.50°W November 10, 1978 242756 1008942 43.75 102.50 37.0
4 Big Bend NP Texas29°15′N 103°15′W / 29.25°N 103.25°W June 12, 1944 801163 440091 29.25 103.25 17.0

For our last bit of data scraping, we'll use the Missouri Census Data Center's Circular Area Profiles System (CAPS) to estimate the population within 400 miles of each park's coordinates. To produce this estimate, MCDC aggregates American Community Survey data from 2013 to 2017; more details about how this data is aggregated and weighted can be found here. I based the radius on an estimate of a single day's drive from the park: an average speed of 60 mph multiplied by 8 hours of driving, rounded down to the nearest hundred miles to account for slower back roads and pit stops. Since performing all of the CAPS queries takes roughly 10 minutes, I store the results in a .csv file and check for it when rerunning the code. In the output below, you may notice that the National Park of American Samoa and Virgin Islands National Park do not have a value for population within the radius. This is a limitation of the ACS data: since these parks are located in US territories rather than states, they are not covered by the survey, and the CAPS tool returns an error for their coordinates. I chose to exclude these two parks when examining the population variable.
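
For reference, the radius arithmetic is simply the following (a trivial sketch):

# One day's drive: roughly 60 mph for 8 hours, rounded down to the nearest hundred miles
radius_miles = (60 * 8) // 100 * 100  # 480 -> 400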

In [14]:
radius = '400' # mile radius around park coordinates
# If you change the radius, you must delete any csv file previously generated
# by this portion of the code.

wiki_df['Pop_in_radius'] = np.nan

# Begin population retrieval:
# I store the results in a csv file and retrieve them when the code re-runs
# due to the amount of time it takes to query the population data (~10 mins).
try:
    # if csv file does exist
    temp_df = pd.read_csv('pop_in_radius.csv')
    wiki_df['Pop_in_radius'] = temp_df['Pop_in_radius']
except FileNotFoundError:
    # if csv file does not exist
    # request radial data via lat and long of park
    for idx, row in wiki_df.iterrows():
        lat = str(row['Latitude_N'])
        long = str(row['Longitude_W'])
        link = 'http://mcdc.missouri.edu/cgi-bin/broker?_PROGRAM=apps.capsACS.sas&_SERVICE=MCDC_long&latitude=' 
        link = link + lat + '&longitude='
        link = link + long + '&radii='
        link = link + radius + '&dprofile=on&units=+'
        r = requests.get(link)
        root = BeautifulSoup(r.content, "lxml")
        #print(root.prettify())
        tables = root.findAll("table")
        # if coordinates did not throw an error on the site, format and store
        if len(tables) > 0:
            demo_df = (pd.read_html(str(tables[0])))[0]
            pop = float(demo_df.iat[1, 1])
            wiki_df.at[row.name, 'Pop_in_radius'] = pop
    # store population data for each park as csv
    wiki_df[['Name', 'Pop_in_radius']].to_csv('pop_in_radius.csv', index = False)
    
wiki_df.head()
Out[14]:
Name Location Established Area_Acres_2017 Visits_2018 Latitude_N Longitude_W Derived_rank Pop_in_radius
0 Acadia NP Maine44°21′N 68°13′W / 44.35°N 68.21°W February 26, 1919 49075.3 3537575 44.35 68.21 11.0 34113334.0
1 National Park of American Samoa American Samoa14°15′S 170°41′W / 14.25°S 170... October 31, 1988 8256.67 28626 -14.25 170.68 46.0 NaN
2 Arches NP Utah38°41′N 109°34′W / 38.68°N 109.57°W November 12, 1971 76679 1663557 38.68 109.57 8.0 18216153.0
3 Badlands NP South Dakota43°45′N 102°30′W / 43.75°N 102.50°W November 10, 1978 242756 1008942 43.75 102.50 37.0 10288277.0
4 Big Bend NP Texas29°15′N 103°15′W / 29.25°N 103.25°W June 12, 1944 801163 440091 29.25 103.25 17.0 10687922.0

Exploratory Analysis and Data Visualization

Now that we've acquired all of our data, let's start our investigation by producing a simple line plot of park visitor numbers over time, with each park plotted on the same graph.

In [15]:
# first graph: park visitor numbers over time

plt.figure(figsize=(10,8))
# plot a line for each park on the same graph
for park in parks:
    park_df = nps_df.loc[nps_df['Park'] == park]
    plt.plot(park_df['Year'], park_df['Recreation Visits'])

plt.ylabel('Annual Recreation Visits')
plt.xlabel('Year')
plt.show()

The above plot is good for visualizing visitor numbers over time, but it makes it hard to judge how popular each park is in a given year relative to the others. To mitigate this, we'll make a new plot standardized by each year's total national park visits, giving us the fraction of visitors who chose a particular park over the others.

First, let's find the total annual number of recreation visits and visualize its trend line for the sake of exploration.

In [16]:
# Group park data by year, then sum
sum_df = nps_df.groupby(['Year']).sum()
plt.figure()
plt.plot(sum_df.index.values, sum_df['Recreation Visits'])
plt.ylabel('Total Annual Recreation Visits for ALL Parks')
plt.xlabel('Year')
plt.show()

Now we'll use this sum data to compute the percentages.

In [17]:
nps_df['pct_annual_visits'] = np.nan
for idx, row in nps_df.iterrows():
    row['pct_annual_visits'] = float(row['Recreation Visits']) / float(sum_df.loc[row['Year'], 'Recreation Visits'])
    nps_df.loc[row.name] = row

#nps_df

plt.figure(figsize=(10,8))
for park in parks:
    park_df = nps_df.loc[nps_df['Park'] == park]
    plt.plot(park_df['Year'], park_df['pct_annual_visits'])

plt.ylabel('Fraction of total annual recreation visits')
plt.xlabel('Year')
plt.show()
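
As an aside, the explicit loop above could be replaced with a groupby transform, which is both shorter and much faster; a roughly equivalent sketch (not executed here), assuming the visit counts are converted to numeric first:

# Each park's share of that year's total recreation visits, computed without a loop
visits = pd.to_numeric(nps_df['Recreation Visits'])
nps_df['pct_annual_visits'] = visits / visits.groupby(nps_df['Year']).transform('sum')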

This line plot of park percentages gives a better idea of how the rankings fluctuate, but because so many parks are bunched at the lower percentages, picking out an individual park is difficult even if each trend line is labeled. Let's rectify this by simply assigning integer ranks from 1 to 61, in order from the greatest to the least percentage of total visitors.

In [18]:
nps_df['annual_rank'] = np.nan
for year in nps_df['Year'].unique():
    # enumerate visitation ranking values per year per park
    year_df = nps_df.loc[nps_df['Year'] == year]
    year_df_sorted = year_df.sort_values(by = 'pct_annual_visits', ascending = False)
    i = 0
    for idx, row in year_df_sorted.iterrows():
        i = i + 1
        nps_df.at[row.name, 'annual_rank'] = i
        
plt.figure(figsize=(12,14))
# graph line for each park over time
for park in parks:
    park_df = nps_df.loc[nps_df['Park'] == park]
    plt.plot(park_df['Year'], park_df['annual_rank'], label = park)
    # everything below this is used to add the text where the line ends
    last_year = np.amax(park_df['Year'].values)
    last_year_row = park_df.loc[park_df['Year'] == last_year]
    recent_rank = int(last_year_row['annual_rank'])
    plt.annotate((str(recent_rank) + '. ' + park), (last_year, recent_rank))

plt.ylabel('Popularity ranked by percentage of total annual visits')
plt.xlabel('Year')
plt.show()

That's much better. Now we can easily see the current popularity rankings for each park, as well as how they've changed over time.
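
Incidentally, pandas can produce these per-year ranks directly; a roughly equivalent sketch of the loop above (not executed here), assuming no ties in pct_annual_visits within a year:

# Rank parks within each year by their share of total visits (1 = most visited)
nps_df['annual_rank'] = (
    nps_df.groupby('Year')['pct_annual_visits']
          .rank(ascending=False, method='first')
)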

Now that we have the most recent ranks by popularity as well as a system of subjective ranking, let's compare the top 20 parks in each rank list.

In [19]:
# get the top 20 visit rankings for the year 2018
top_20_visits = (nps_df.loc[nps_df['Year'] == 2018]).sort_values(by = 'annual_rank')[['Park', 'annual_rank']].head(20).copy()
# get the top 20 subjective rankings
top_20_subjective = all_ranks_df[['Derived_Rank']].head(20).copy().reset_index()
print(top_20_visits)
print(top_20_subjective)
                          Park  annual_rank
1073  Great Smoky Mountains NP          1.0
913            Grand Canyon NP          2.0
1958         Rocky Mountain NP          3.0
2395                   Zion NP          4.0
2315            Yellowstone NP          5.0
2355               Yosemite NP          6.0
42                   Acadia NP          7.0
953             Grand Teton NP          8.0
1798                Olympic NP          9.0
873                 Glacier NP         10.0
1353            Joshua Tree NP         11.0
282            Bryce Canyon NP         12.0
556         Cuyahoga Valley NP         13.0
793            Gateway Arch NP         14.0
1273          Indiana Dunes NP         15.0
596            Death Valley NP         16.0
82                   Arches NP         17.0
1703          Mount Rainier NP         18.0
1233            Hot Springs NP         19.0
2078             Shenandoah NP         20.0
                           index  Derived_Rank
0                    Yosemite NP             1
1                     Glacier NP             2
2                        Zion NP             3
3                 Yellowstone NP             4
4                Grand Canyon NP             5
5                     Olympic NP             6
6               Denali NP & PRES             7
7                      Arches NP             8
8               Mount Rainier NP             9
9                Bryce Canyon NP            10
10                     Acadia NP            11
11  Wrangell-St. Elias NP & PRES            12
12               Death Valley NP            13
13                Grand Teton NP            14
14             Rocky Mountain NP            15
15             North Cascades NP            16
16                   Big Bend NP            17
17              Katmai NP & PRES            18
18         Glacier Bay NP & PRES            19
19                Canyonlands NP            20

Looking at these two distinct top 20 rankings, you'll notice that some parks appear on one list but not the other. To identify some undervalued parks that are worth visiting, let's isolate the rows that are present in the subjective top 20 but not in the popularity top 20.

In [20]:
# eliminate all parks from the top 20 subjective that are in the top 20 by visits
# then store the resulting subset in a new frame
undervalued = top_20_subjective[~top_20_subjective['index'].isin(top_20_visits['Park'].values)].copy()    
# find how the parks ranked by visits in 2018
undervalued['Visits_Rank_2018'] = np.nan
for idx, row in undervalued.iterrows():
    nps_row = nps_df.loc[(nps_df['Park'] == row['index']) & (nps_df['Year'] == 2018)]
    visits_rank = nps_row['annual_rank']
    undervalued.at[row.name, 'Visits_Rank_2018'] = visits_rank

# subtract subj. rank from visits rank to determine how undervalued a park is
undervalued['Num_Ranks_Undervalued_By'] = undervalued['Visits_Rank_2018'] - undervalued['Derived_Rank']
undervalued
Out[20]:
index Derived_Rank Visits_Rank_2018 Num_Ranks_Undervalued_By
6 Denali NP & PRES 7 35.0 28.0
11 Wrangell-St. Elias NP & PRES 12 53.0 41.0
15 North Cascades NP 16 56.0 40.0
16 Big Bend NP 17 43.0 26.0
17 Katmai NP & PRES 18 55.0 37.0
18 Glacier Bay NP & PRES 19 33.0 14.0
19 Canyonlands NP 20 28.0 8.0

From the rows above, we have identified some subjectively high-value parks to visit that the average traveller likely wouldn't prioritize. However, something worth noting is that most of these parks are in areas of low population density. We should probably investigate the relationship between visits and population density in the hypothesis testing segment below.

Additional Visualization and Analysis for Hypothesis Testing

Let's start by checking whether a significant relationship exists between our two ranking paradigms. This will tell us whether perceived visitor experience is associated with actual popularity (in general).

In [21]:
x_visits_rank_2018 = ((nps_df.loc[nps_df['Year'] == 2018]).sort_values(by = 'Park'))['annual_rank']
y_subjective_rank = (all_ranks_df.sort_index())['Derived_Rank']
plt.figure()
# calculate linear regression values
slope, intercept, r_value, p_value, std_err = scipy.stats.linregress(x_visits_rank_2018, y_subjective_rank)
# calculate y values for line
line = intercept + slope * x_visits_rank_2018
# plot points and reg line on same graph
plt.plot(x_visits_rank_2018, y_subjective_rank, 'o', x_visits_rank_2018, line)
plt.xlabel('Rank by visit count')
plt.ylabel('Subjective rank')
plt.show()

# print statistical significance
print(p_value)
0.00028002543606850714
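
As a quick cross-check, since both variables are integer ranks, this test is equivalent to Spearman's rank correlation, which should report essentially the same p-value; a sketch reusing the two series defined above:

# Spearman's rank correlation between 2018 visitation rank and subjective rank
rho, spearman_p = scipy.stats.spearmanr(x_visits_rank_2018, y_subjective_rank)
print(rho, spearman_p)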

Our p-value tells us that a statistically significant relationship does exist. This is not entirely surprising given that the two top-20 lists had 13 parks in common, but subjective quality is clearly not the only factor that affects visitation. Let's dig in further by looking at nearby population versus visitation.

In [22]:
# prevent regression errors by eliminating rows for which we could not find
# a radial population value
temp_reg_df = wiki_df.dropna()

# similar process to last plot
plt.figure()
slope, intercept, r_value, p_value, std_err = scipy.stats.linregress(temp_reg_df['Pop_in_radius'], temp_reg_df['Visits_2018'])
line = intercept + slope * wiki_df['Pop_in_radius']
plt.plot(wiki_df['Pop_in_radius'], wiki_df['Visits_2018'], 'o', wiki_df['Pop_in_radius'], line)
plt.xlabel('Population within 400 miles of park')
plt.ylabel('Recreation Visits in 2018')
plt.show()

print(p_value)
0.014596816541819078

Unsurprisingly, nearby population does correlate significantly with visits.

Now that we've established that multiple factors can play a role in visitation, let's see how well we can fit the visitation data using purely objective statistics. We will exclude the subjective ranking here, since the factors that shape a visitor's experience are still unknown to us. We will also exclude variables that depend directly on visitation, such as the "Recreation Hours" column of the NPS dataframe. Lastly, we will exclude categorical variables that could identify a single park, such as the state a park is located in (many states have only one park), due to the potential for overfitting. We will refrain from adding any interaction terms or ratios until we determine the quality of the fit without them. As such, we will initially look at the following variables: population within a 400-mile radius, region of the country, latitude and longitude, park area, and year. Let's begin by pulling all of the relevant data into the NPS dataframe so that we can easily use it in a statsmodels formula.

In [23]:
# Multi-variate regression (OLS):

# First we compile into one dataframe

nps_df['lat_N'] = np.nan
nps_df['long_W'] = np.nan
nps_df['recent_area_acres'] = np.nan
nps_df['recent_pop_in_radius'] = np.nan

# add data from wiki_df to nps_df
for idx, row in nps_df.iterrows():
    wiki_row = (wiki_df.loc[wiki_df['Name'] == row['Park']]).iloc[0]
    nps_df.at[row.name, 'lat_N'] = wiki_df.at[wiki_row.name, 'Latitude_N']
    nps_df.at[row.name, 'long_W'] = wiki_df.at[wiki_row.name, 'Longitude_W']
    nps_df.at[row.name, 'recent_pop_in_radius'] = wiki_df.at[wiki_row.name, 'Pop_in_radius']
    nps_df.at[row.name, 'recent_area_acres'] = wiki_df.at[wiki_row.name, 'Area_Acres_2017']
    
nps_df.head()
Out[23]:
Park Unit Code Park Type Region State Year Recreation Visits Non-Recreation Visits Recreation Hours Non-Recreation Hours ... RV Campers Backcountry Campers Non-Recreation Overnight Stays Misc. Overnight Stays pct_annual_visits annual_rank lat_N long_W recent_area_acres recent_pop_in_radius
3 Acadia NP ACAD National Park Northeast ME 1979 2787366 395913 17155530 198056 ... 74279 0 0 0 0.053578 2.0 44.35 68.21 49075.26 34113334.0
4 Acadia NP ACAD National Park Northeast ME 1980 2779666 493429 18671588 276719 ... 76124 0 0 0 0.052044 2.0 44.35 68.21 49075.26 34113334.0
5 Acadia NP ACAD National Park Northeast ME 1981 2997972 405001 20907596 202499 ... 76309 0 0 0 0.052961 2.0 44.35 68.21 49075.26 34113334.0
6 Acadia NP ACAD National Park Northeast ME 1982 3572114 183047 24505138 91512 ... 84841 0 0 0 0.062820 2.0 44.35 68.21 49075.26 34113334.0
7 Acadia NP ACAD National Park Northeast ME 1983 4124639 183024 27263187 91512 ... 72335 0 0 0 0.071116 2.0 44.35 68.21 49075.26 34113334.0

5 rows × 23 columns

With our dataframe prepared, we'll plug our variables into a statsmodels OLS formula and check for statistical significance.

In [24]:
# renaming a column to remove spaces so it can easily be used in a statsmodels OLS formula
nps_df = nps_df.rename(columns = {'Recreation Visits':'Recreation_Visits'})

# multivariate regression fit
reg_results = smf.ols(formula = 'Recreation_Visits ~ Region + Year + \
    lat_N + long_W + recent_area_acres + recent_pop_in_radius', data = nps_df).fit()
reg_results.summary()
Out[24]:
OLS Regression Results
Dep. Variable: Recreation_Visits R-squared: 0.289
Model: OLS Adj. R-squared: 0.286
Method: Least Squares F-statistic: 94.39
Date: Fri, 19 Jul 2019 Prob (F-statistic): 5.30e-164
Time: 23:40:45 Log-Likelihood: -36146.
No. Observations: 2335 AIC: 7.231e+04
Df Residuals: 2324 BIC: 7.238e+04
Df Model: 10
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
Intercept -2.902e+07 4.69e+06 -6.192 0.000 -3.82e+07 -1.98e+07
Region[T.Intermountain ] 4.972e+06 2.63e+05 18.894 0.000 4.46e+06 5.49e+06
Region[T.Midwest ] 4.748e+06 2.98e+05 15.937 0.000 4.16e+06 5.33e+06
Region[T.Northeast ] 6.027e+06 3.85e+05 15.650 0.000 5.27e+06 6.78e+06
Region[T.Pacific West ] 3.545e+06 2.18e+05 16.268 0.000 3.12e+06 3.97e+06
Region[T.Southeast ] 6.831e+06 3.71e+05 18.417 0.000 6.1e+06 7.56e+06
Year 8269.3172 2313.739 3.574 0.000 3732.109 1.28e+04
lat_N 6.357e+04 5031.990 12.633 0.000 5.37e+04 7.34e+04
long_W 5.365e+04 3958.233 13.555 0.000 4.59e+04 6.14e+04
recent_area_acres 0.1544 0.026 6.003 0.000 0.104 0.205
recent_pop_in_radius 0.0316 0.002 20.047 0.000 0.029 0.035
Omnibus: 1071.759 Durbin-Watson: 0.073
Prob(Omnibus): 0.000 Jarque-Bera (JB): 8173.232
Skew: 2.010 Prob(JB): 0.00
Kurtosis: 11.237 Cond. No. 5.93e+09


Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 5.93e+09. This might indicate that there are
strong multicollinearity or other numerical problems.

Remarkably, all of the variables we identified are statistically significant. Let's print the p-values with greater precision to see which variables show the strongest statistical significance.

In [25]:
reg_results.pvalues
Out[25]:
Intercept                   6.988523e-10
Region[T.Intermountain ]    3.472005e-74
Region[T.Midwest ]          2.397601e-54
Region[T.Northeast ]        1.450459e-52
Region[T.Pacific West ]     1.944716e-56
Region[T.Southeast ]        8.221521e-71
Year                        3.587218e-04
lat_N                       1.973711e-35
long_W                      2.436718e-40
recent_area_acres           2.240933e-09
recent_pop_in_radius        1.393774e-82
dtype: float64

This shows that the population within a 400-mile radius is easily the most statistically significant predictor of visitation, followed in order by the categorical regions, longitude, latitude, park area, and lastly year. Note, however, that our R-squared value indicates the model explains only about 29% of the variance in visitation. This suggests that there could be many more variables that play a role. Let's calculate a few more values to use in our fit. Whereas typical interaction terms multiply two variables together, we're going to see whether some ratios are useful: average time per visit (recreation hours divided by recreation visits), plus the fractions of a park's visitors in a given year who are tent campers, RV campers, and backcountry campers. We'll also add an ordinary interaction term between park area and population within the radius.

In [26]:
# Adding the ratio columns to our dataset
nps_df['pct_tent'] = nps_df['Tent Campers']/nps_df['Recreation_Visits']
nps_df['pct_backcountry'] = nps_df['Backcountry Campers']/nps_df['Recreation_Visits']
nps_df['pct_RV'] = nps_df['RV Campers']/nps_df['Recreation_Visits']
nps_df['mean_visit_length'] = nps_df['Recreation Hours']/nps_df['Recreation_Visits']

# Calculating the best fit for the new formula
reg_results = smf.ols(formula = 'Recreation_Visits ~ Region + Year + \
    lat_N + long_W + recent_area_acres + recent_pop_in_radius + \
    pct_tent + pct_backcountry + pct_RV + mean_visit_length + \
    recent_area_acres:recent_pop_in_radius', data = nps_df).fit()
reg_results.summary()
Out[26]:
OLS Regression Results
Dep. Variable: Recreation_Visits R-squared: 0.341
Model: OLS Adj. R-squared: 0.338
Method: Least Squares F-statistic: 109.4
Date: Fri, 19 Jul 2019 Prob (F-statistic): 2.33e-201
Time: 23:40:45 Log-Likelihood: -36057.
No. Observations: 2335 AIC: 7.214e+04
Df Residuals: 2323 BIC: 7.221e+04
Df Model: 11
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
Intercept 4.909e+04 3.14e+04 1.564 0.118 -1.25e+04 1.11e+05
Region[T.Intermountain ] 4.848e+06 2.57e+05 18.878 0.000 4.34e+06 5.35e+06
Region[T.Midwest ] 4.687e+06 2.91e+05 16.117 0.000 4.12e+06 5.26e+06
Region[T.Northeast ] 5.995e+06 3.73e+05 16.087 0.000 5.26e+06 6.73e+06
Region[T.Pacific West ] 3.31e+06 2.17e+05 15.263 0.000 2.88e+06 3.74e+06
Region[T.Southeast ] 6.697e+06 3.6e+05 18.597 0.000 5.99e+06 7.4e+06
Year -6376.6419 387.275 -16.465 0.000 -7136.084 -5617.200
lat_N 7.47e+04 4955.336 15.074 0.000 6.5e+04 8.44e+04
long_W 5.358e+04 3900.452 13.738 0.000 4.59e+04 6.12e+04
recent_area_acres 0.0285 0.031 0.923 0.356 -0.032 0.089
recent_pop_in_radius 0.0261 0.002 16.442 0.000 0.023 0.029
pct_tent -8.835e+05 3.82e+05 -2.315 0.021 -1.63e+06 -1.35e+05
pct_backcountry -1.242e+06 1.17e+05 -10.609 0.000 -1.47e+06 -1.01e+06
pct_RV -6.027e+06 7.23e+05 -8.330 0.000 -7.45e+06 -4.61e+06
mean_visit_length 2.856e+04 3567.417 8.005 0.000 2.16e+04 3.56e+04
recent_area_acres:recent_pop_in_radius 1.642e-08 1.57e-09 10.484 0.000 1.34e-08 1.95e-08
Omnibus: 1049.012 Durbin-Watson: 0.095
Prob(Omnibus): 0.000 Jarque-Bera (JB): 7755.859
Skew: 1.969 Prob(JB): 0.00
Kurtosis: 11.013 Cond. No. 4.46e+15


Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 4.46e+15. This might indicate that there are
strong multicollinearity or other numerical problems.

We get a better fit overall, but with the new terms included, park area by itself is no longer significant, and neither is the intercept. Let's remove both to tighten the fit. Note that dropping the intercept (the "- 1" in the formula) lets the Region dummies absorb it, which is why every region, including Alaska, receives its own coefficient in the summary below.

In [27]:
reg_results = smf.ols(formula = 'Recreation_Visits ~ Region + Year + \
    lat_N + long_W + recent_pop_in_radius + \
    pct_tent + pct_backcountry + pct_RV + mean_visit_length + \
    recent_area_acres:recent_pop_in_radius - 1', data = nps_df).fit()
reg_results.summary()
Out[27]:
OLS Regression Results
Dep. Variable: Recreation_Visits R-squared: 0.343
Model: OLS Adj. R-squared: 0.340
Method: Least Squares F-statistic: 110.3
Date: Fri, 19 Jul 2019 Prob (F-statistic): 9.54e-203
Time: 23:40:45 Log-Likelihood: -36053.
No. Observations: 2335 AIC: 7.213e+04
Df Residuals: 2323 BIC: 7.220e+04
Df Model: 11
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
Region[Alaska] -4.291e+06 2.26e+05 -19.018 0.000 -4.73e+06 -3.85e+06
Region[Intermountain ] 5.539e+05 5.88e+04 9.424 0.000 4.39e+05 6.69e+05
Region[Midwest ] 4.123e+05 7.97e+04 5.176 0.000 2.56e+05 5.68e+05
Region[Northeast ] 1.747e+06 1.67e+05 10.445 0.000 1.42e+06 2.08e+06
Region[Pacific West ] -1.012e+06 8.94e+04 -11.320 0.000 -1.19e+06 -8.37e+05
Region[Southeast ] 2.458e+06 1.4e+05 17.571 0.000 2.18e+06 2.73e+06
Year -4311.2735 274.833 -15.687 0.000 -4850.218 -3772.329
lat_N 7.617e+04 4953.399 15.377 0.000 6.65e+04 8.59e+04
long_W 5.501e+04 3930.593 13.994 0.000 4.73e+04 6.27e+04
recent_pop_in_radius 0.0260 0.002 16.561 0.000 0.023 0.029
pct_tent -9.587e+05 3.7e+05 -2.591 0.010 -1.68e+06 -2.33e+05
pct_backcountry -1.26e+06 1.16e+05 -10.820 0.000 -1.49e+06 -1.03e+06
pct_RV -5.863e+06 7.22e+05 -8.118 0.000 -7.28e+06 -4.45e+06
mean_visit_length 2.997e+04 3352.392 8.941 0.000 2.34e+04 3.65e+04
recent_area_acres:recent_pop_in_radius 1.705e-08 1.42e-09 11.968 0.000 1.43e-08 1.98e-08
Omnibus: 1033.820 Durbin-Watson: 0.095
Prob(Omnibus): 0.000 Jarque-Bera (JB): 7518.205
Skew: 1.940 Prob(JB): 0.00
Kurtosis: 10.888 Cond. No. 1.06e+16


Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.06e+16. This might indicate that there are
strong multicollinearity or other numerical problems.

Our model is now noticeably better than the one we started with, as reflected in the F-statistic, the R-squared value, and the overall p-value. As a final step, let's print the p-values with more precision to see how statistically significant each predictor is.

In [28]:
reg_results.pvalues
Out[28]:
Region[Alaska]                            4.550576e-75
Region[Intermountain ]                    1.012827e-20
Region[Midwest ]                          2.465954e-07
Region[Northeast ]                        5.494341e-25
Region[Pacific West ]                     5.894006e-29
Region[Southeast ]                        5.433301e-65
Year                                      8.637560e-53
lat_N                                     6.898967e-51
long_W                                    8.754834e-43
recent_pop_in_radius                      2.577726e-58
pct_tent                                  9.631872e-03
pct_backcountry                           1.184608e-26
pct_RV                                    7.605589e-16
mean_visit_length                         7.664054e-19
recent_area_acres:recent_pop_in_radius    4.504113e-32
dtype: float64

Final Insights

Region of the country and nearby population showed the strongest associations with visitation. While the effect of population is intuitive, the effect of region may require further investigation: the underlying cause could be climate, geography, accessibility, or even public interest, among other variables worth examining. The NPS could build on this analysis to identify the best ways to increase visitation to less-visited parks.

Additional Resources for Further Investigation

A better-fitting model could potentially be created by looking at additional predictors. I had considered using distance from high-traffic airports as a measure of accessibility. Data on airport traffic, along with a list of the airports considered hubs, can be found here. Airport locations by latitude and longitude can be found here. Other factors potentially worth investigating are the climate, biodiversity, and elevation variance near and within each park. Miscellaneous useful data might be found in the NPS datastore, located here.
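
As a starting point for the airport-distance idea, the great-circle distance between a park and an airport can be computed from their coordinates with the haversine formula. A minimal sketch, where the airport coordinates and the function name are illustrative placeholders:

import math

def haversine_miles(lat1, lon1, lat2, lon2):
    # Great-circle distance in miles between two points given in decimal degrees
    earth_radius_miles = 3958.8
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * earth_radius_miles * math.asin(math.sqrt(a))

# Example: Acadia NP (44.35 N, 68.21 W) to a hypothetical hub airport at (42.36 N, 71.01 W).
# Note that the Longitude_W column built above stores western longitudes as positive values,
# so they would need to be negated (or otherwise handled consistently) before use here.
print(haversine_miles(44.35, -68.21, 42.36, -71.01))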