The National Park system contains some of the USA's most iconic treasures. Yosemite, the Grand Canyon, and Yellowstone are all household names, each receiving millions of visitors annually. The Grand Canyon is even commonly counted among the seven natural wonders of the world. Yet these undoubtedly beautiful places are only a small fraction of the 61 National Parks distributed throughout our massive country. For an avid lover of the outdoors, this raises a major question: what "hidden gems" exist within the park system?
In this analysis, I seek to identify whether the most-visited parks are also the ones that provide the best experience for their visitors. I then narrow down which undervalued parks are most worth visiting and look at some other factors that could be driving visitation.
These insights will be particularly useful for any person seeking relatively unknown and crowd-free places to savor the natural world. Additionally, my research could assist the National Park Service in identifying where to expand their visitor resources and the factors that may cause a traveller to pick one park over another. This could potentially lead to increased traffic through some lesser-known parks, resulting in increased revenue for the park service. Using this data to lobby for increased NPS funding is also a possibility.
Let's start by scraping and cleaning the data we think might be useful.
# Remember to install these packages before running the code
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats
import requests
from bs4 import BeautifulSoup
from fuzzywuzzy import process
from fuzzywuzzy import fuzz
import re
import statsmodels.formula.api as smf
Our primary dataset will be sourced from the National Park Service's database of use statistics. The data we want can be found at this link. Because of how the website is implemented, I had to open it manually in a browser and select options from its user interface before the page would load the necessary data. Once the options are submitted, a button allows you to download the displayed data in several formats. For the sake of reproducibility, I will briefly summarize the menu options I selected.
Menu options: years = all from 1979 thru 2018, all months, all regions, park type = National Park, all parks, all fields, all additional fields, annual summary only = TRUE
I downloaded the corresponding data as an Excel file. If you're attempting to reproduce this analysis, place the file in the same folder as this Python script/notebook.
# Loading excel file as dataframe
nps_df = pd.read_excel("Query Builder for Public Use Statistics (1979 - Last Calendar Year).xlsx")
nps_df.head()
As you can see, we need to do some tidying of our table to make it usable. Below, we drop unneeded rows/columns and set the column names from the row at index 2.
# Correctly set column names
nps_df.columns = (nps_df.iloc[2]).values
# Drop padding and unneeded notes
nps_df.drop([0, 1, 2], inplace = True)
A brief inspection of the data table (via the Excel file or the website) reveals that a few parks are recorded as having zero recreation visits in particular years.
nps_df.loc[(nps_df['Recreation Visits'] == 0)]
If you look at the surrounding data for these parks, it becomes apparent that the zero value is most likely due to temporary park closures, issues with data collection, or data loss. Since these data points are not representative of the public's desire to visit these parks during the years in question, we will exclude them from our analysis.
nps_df.drop([1370, 1502, 1503, 1705], inplace = True)
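As an aside, if you'd rather not hard-code the row indices (they can shift if the source file is re-downloaded with different options), the same rows can be dropped programmatically. This is just an equivalent alternative to the line above:
# Drop every row whose recreation visit count is zero, without hard-coding indices
zero_visit_rows = nps_df.loc[nps_df['Recreation Visits'] == 0].index
nps_df.drop(zero_visit_rows, inplace = True)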
Next, we're going to scrape some carefully selected subjective rankings of the national parks so that we can average them into an overall subjective ranking. I sought out rank lists created specifically by individuals, or panels of individuals, who had visited ALL the national parks. I found five such lists suitable for use, and each selected website documents the reasoning behind its rankings. You can find these rank lists here: List 1, List 2, List 3, List 4, List 5. List 5 has separate documentation available here. You may notice that lists 2 and 5 are hosted on the same website, but they were compiled by different panelists. It is also worth noting that list 3 ranks the parks by five subjective criteria. I did not scrape that part of the data since it was only given by a single source, but it may provide some insight into what makes a "good" park experience; further investigation would be required.
r = requests.get('https://blog.thediscoverer.com/every-us-national-park-ranked/')
root = BeautifulSoup(r.content, "lxml")
# By investigating the website's HTML code via a browser or print statement,
# we can see that every park ranking is contained in a header of consistent
# size.
#print(root.prettify())
headers_with_tags = root.findAll("h2")
headers = []
for header in headers_with_tags:
    h_text = header.getText()
    if '.' in h_text:
        headers.append(h_text)
# The following code is data cleaning necessary to separate the parks at rankings 16 and 17
headers.remove('16./17. Sequoia & Kings Canyon National Parks')
headers.append('16. Sequoia National Park')
headers.append('17. Kings Canyon National Park')
# Splitting rankings from names
temp_lst = []
for header in headers:
    for s in header.split('.', 1):
        temp_lst.append(s)
# Putting rankings and park names into a dataframe
temp_arr = np.array(temp_lst)
reshaped = temp_arr.reshape(61, 2)
ranksV1_df = pd.DataFrame(reshaped, columns = ['Rank', 'Name'])
ranksV1_df['Rank'] = [float(val) for val in ranksV1_df['Rank'].values]
ranksV1_df.sort_values(by = 'Rank', inplace = True)
ranksV1_df.set_index('Name', inplace = True)
#ranksV1_df
# Now for the second subjective source. The process is practically identical to the first.
r = requests.get('https://www.theactivetimes.com/travel/all-59-national-parks-ranked')
root = BeautifulSoup(r.content, "lxml")
#print(root.prettify())
headers_with_tags = root.findAll("h2")
headers = []
for header in headers_with_tags:
    h_text = header.getText()
    if '.' in h_text:
        headers.append(h_text)
temp_lst = []
for header in headers:
    for s in header.split('.', 1):
        temp_lst.append(s)
temp_arr = np.array(temp_lst)
reshaped = temp_arr.reshape(len(headers), 2)
ranksV2_df = pd.DataFrame(reshaped, columns = ['Rank', 'Name'])
ranksV2_df['Rank'] = [float(val) for val in ranksV2_df['Rank'].values]
ranksV2_df.sort_values(by = 'Rank', inplace = True)
ranksV2_df.set_index('Name', inplace = True)
#ranksV2_df
# Now for the third subjective source. By chance, the process is practically identical to the first two.
# The only difference is that the headers appear twice, so duplicates must be removed.
r = requests.get('https://www.reneeroaming.com/americas-national-parks/')
root = BeautifulSoup(r.content, "lxml")
#print(root.prettify())
headers_with_tags = root.findAll("h2")
headers = []
for header in headers_with_tags:
    h_text = header.getText()
    if '.' in h_text:
        h_text = (h_text.split(',', 1))[0]
        headers.append(h_text)
headers = list(set(headers)) # removes duplicate entries
temp_lst = []
for header in headers:
    for s in header.split('.', 1):
        temp_lst.append(s)
temp_arr = np.array(temp_lst)
reshaped = temp_arr.reshape(len(headers), 2)
ranksV3_df = pd.DataFrame(reshaped, columns = ['Rank', 'Name'])
ranksV3_df['Rank'] = [float(val) for val in ranksV3_df['Rank'].values]
ranksV3_df.sort_values(by = 'Rank', inplace = True)
ranksV3_df.set_index('Name', inplace = True)
#ranksV3_df
# Begin scraping 4th subjective source
r = requests.get('https://www.leeabbamonte.com/north-america/national-parks/all-60-us-national-parks-ranked.html')
root = BeautifulSoup(r.content, "lxml")
#print(root.prettify())
headers_with_tags = root.findAll("h1")
headers = []
for header in headers_with_tags:
    h_text = header.getText()
    headers.append(h_text)
headers.pop(0) # Removes title of article
temp_lst = []
for header in headers:
    for s in header.split(' ', 1): # No periods here, just spaces, unlike the previous sources
        temp_lst.append(s)
temp_arr = np.array(temp_lst)
reshaped = temp_arr.reshape(len(headers), 2)
ranksV4_df = pd.DataFrame(reshaped, columns = ['Rank', 'Name'])
ranksV4_df['Rank'] = [float(val) for val in ranksV4_df['Rank'].values]
ranksV4_df.sort_values(by = 'Rank', inplace = True)
ranksV4_df.set_index('Name', inplace = True)
#ranksV4_df
# Begin scraping 5th (last) subjective source
# Note this list is hosted on the same site as the second, but consulted different panelists to determine rankings
# More info about criteria is available at https://www.theactivetimes.com/national-parks-ranked
r = requests.get('https://www.theactivetimes.com/59-national-parks-ranked')
root = BeautifulSoup(r.content, "lxml")
#print(root.prettify())
headers_with_tags = root.findAll("h2")
headers = []
for header in headers_with_tags:
    h_text = header.getText()
    if '#' in h_text:
        temp = h_text.split('#', 1) # this line and the next remove the hashtag at the beginning of the rank number
        h_text = temp[len(temp) - 1]
        headers.append(h_text)
temp_lst = []
for header in headers:
    for s in header.split(' ', 1): # No periods here, just spaces, unlike the previous sources
        temp_lst.append(s)
temp_arr = np.array(temp_lst)
reshaped = temp_arr.reshape(len(headers), 2)
ranksV5_df = pd.DataFrame(reshaped, columns = ['Rank', 'Name'])
ranksV5_df['Rank'] = [float(val) for val in ranksV5_df['Rank'].values]
ranksV5_df.sort_values(by = 'Rank', inplace = True)
ranksV5_df.set_index('Name', inplace = True)
#ranksV5_df
Now we're going to tidy the individual subjective ranking tables into one table. Since each list may write a given park's name slightly differently, we'll standardize on the park naming convention from the official NPS dataset. This means we'll need fuzzy string matching (approximate string matching) to match parks between dataframes; I chose the fuzzywuzzy package for this.
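Before diving into the full matching loop, here is a minimal, self-contained sketch of how fuzzywuzzy's process.extractOne behaves, using the process and fuzz modules imported at the top. The candidate strings below are made-up stand-ins rather than the actual NPS names, and the exact score may vary by library version.
# Minimal fuzzywuzzy sketch: extractOne returns the best (match, score) pair
candidates = ['Denali', 'Grand Canyon', 'Yosemite'] # hypothetical choices
best_match, score = process.extractOne('Denali National Park and Preserve',
                                       candidates, scorer = fuzz.token_sort_ratio)
print(best_match, score) # expected to pick 'Denali'; the score depends on the version installed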
# Identify a list of parks
parks = nps_df['Park'].unique()
all_ranks_df = pd.DataFrame(index = parks, columns = ['Ranks_V1', 'Ranks_V2', 'Ranks_V3', 'Ranks_V4', 'Ranks_V5'])
ranking_frames = [ranksV1_df, ranksV2_df, ranksV3_df, ranksV4_df, ranksV5_df]
i = 0
for frame in ranking_frames:
    i = i + 1
    for idx, row in frame.iterrows():
        # This section uses fuzzy string matching to match park names between lists
        name = row.name
        shortname = name.replace('National ', '') # This should prevent resolution issues with Natl. Park of Amer. Samoa
        shortname = shortname.replace('Park', '')
        shortname = shortname.replace('State ', '')
        shortname = shortname.replace('Preserve', '') # preventing resolution issues
        match = (process.extractOne(shortname, parks, scorer = fuzz.token_sort_ratio))[0] # fuzzywuzzy
        all_ranks_df.at[match, ('Ranks_V' + str(i))] = row['Rank']
# Creating new rankings from the rank averages
all_ranks_df['Mean_Ranks'] = all_ranks_df.mean(axis=1)
all_ranks_df.sort_values(by = 'Mean_Ranks', inplace = True)
# enumerating in order
all_ranks_df['Derived_Rank'] = [i for i in np.arange(1,62)]
all_ranks_df.head()
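One caveat: depending on your pandas version, the rank columns created above may carry an object dtype (they were initialized empty), which can make mean(axis=1) warn or misbehave. If that happens, casting the rank columns to float before averaging is a safe workaround (then re-run the sort and Derived_Rank steps):
# Defensive recomputation of the mean rank with an explicit float cast
rank_cols = ['Ranks_V1', 'Ranks_V2', 'Ranks_V3', 'Ranks_V4', 'Ranks_V5']
all_ranks_df['Mean_Ranks'] = all_ranks_df[rank_cols].astype(float).mean(axis = 1)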
Next, we'll scrape some additional basic park data compiled on the Wikipedia page for US national parks, which can be found here. The data of greatest interest to us are the latitude and longitude of each park and its size in acres.
# This section scrapes the table structure from the site and reformats into dataframe
r = requests.get('https://en.wikipedia.org/wiki/List_of_national_parks_of_the_United_States')
root = BeautifulSoup(r.content, "lxml")
#print(root.prettify())
tables = root.findAll("table")
df_wrapped = pd.read_html(str(tables[1]))
wiki_df = df_wrapped[0]
wiki_df.drop(['Image', 'Description'], axis = 1, inplace = True)
wiki_df.columns = ['Name', 'Location', 'Established', 'Area_Acres_2017', 'Visits_2018']
#wiki_df
# Regex expression to strip unnecessary characters from lat and long vals
regex = re.compile(r'[^0-9\.]')
wiki_df['Latitude_N'] = np.nan
wiki_df['Longitude_W'] = np.nan
wiki_df['Derived_rank'] = np.nan
for idx, row in wiki_df.iterrows():
    # split location into lat and long
    loc = (row['Location'].split('/ ', 1))[1]
    lat, long = loc.split(' ', 1)
    if 'N' in lat[-1]:
        lat = float(regex.sub('', lat))
    else:
        lat = 0.0 - float(regex.sub('', lat))
    if 'W' in long[-1]:
        long = float(regex.sub('', long))
    else:
        long = 0.0 - float(regex.sub('', long))
    wiki_df.at[row.name, 'Latitude_N'] = lat
    wiki_df.at[row.name, 'Longitude_W'] = long
    # Now deal with tidying the area column into float values
    acres = (row['Area_Acres_2017'].split(' acres', 1))[0]
    acres = float(acres.replace(',', ''))
    wiki_df.at[row.name, 'Area_Acres_2017'] = acres
    # Fixing names to the NPS dataset standard and adding in the derived ranks
    name = row['Name']
    match = (process.extractOne(name, parks, scorer = fuzz.token_sort_ratio))[0] # fuzzywuzzy
    wiki_df.at[row.name, 'Name'] = match
    wiki_df.at[row.name, 'Derived_rank'] = all_ranks_df.at[match, 'Derived_Rank']
wiki_df.head()
For our last bit of data scraping, we'll use the Missouri Census Data Center's Circular Area Profiles System (CAPS) to estimate the population within 400 miles of each park's coordinates. To produce this estimate, MCDC aggregates American Community Survey data from 2013 to 2017; more details about how this data is aggregated and weighted can be found here. I selected the radius based on an estimate of a single day's drive from the park, calculated by multiplying an average speed of 60 mph by 8 hours of driving time and then rounding down to the nearest hundred to account for slower back roads and pit stops. Since running all of the CAPS queries takes roughly 10 minutes, I store the results in a .csv file and check for it when rerunning the code. In the output below, you may notice that the National Park of American Samoa and Virgin Islands National Park do not have a value for population within the radius. This is due to a limitation of the ACS data: since these parks are located in US territories rather than states, their data is not collected as part of the survey, and the CAPS tool returns an error for their coordinates. I chose to exclude these two parks when examining the population variable.
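For clarity, the radius figure quoted above comes from this back-of-the-envelope arithmetic (nothing here depends on the dataset):
# One day's drive: ~60 mph for ~8 hours, rounded down to the nearest hundred miles
day_drive_miles = 60 * 8                      # 480
radius_miles = (day_drive_miles // 100) * 100 # 400
print(radius_miles)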
radius = '400' # mile radius around park coordinates
# If you change the radius, you must delete any csv file previously generated
# by this portion of the code.
wiki_df['Pop_in_radius'] = np.nan
# Begin population retrieval:
# I store the results in a csv file and retrieve them when the code re-runs
# due to the amount of time it takes to query the population data (~10 mins).
try:
    # if the csv file already exists, reuse it
    temp_df = pd.read_csv('pop_in_radius.csv')
    wiki_df['Pop_in_radius'] = temp_df['Pop_in_radius']
except FileNotFoundError:
    # if the csv file does not exist,
    # request radial data via the lat and long of each park
    for idx, row in wiki_df.iterrows():
        lat = str(row['Latitude_N'])
        long = str(row['Longitude_W'])
        link = 'http://mcdc.missouri.edu/cgi-bin/broker?_PROGRAM=apps.capsACS.sas&_SERVICE=MCDC_long&latitude='
        link = link + lat + '&longitude='
        link = link + long + '&radii='
        link = link + radius + '&dprofile=on&units=+'
        r = requests.get(link)
        root = BeautifulSoup(r.content, "lxml")
        #print(root.prettify())
        tables = root.findAll("table")
        # if the coordinates did not throw an error on the site, format and store the value
        if len(tables) > 0:
            demo_df = (pd.read_html(str(tables[0])))[0]
            pop = float(demo_df.iat[1, 1])
            wiki_df.at[row.name, 'Pop_in_radius'] = pop
    # store the population data for each park as csv
    wiki_df[['Name', 'Pop_in_radius']].to_csv('pop_in_radius.csv', index = False)
wiki_df.head()
Now that we've acquired all of our data, let's start our investigation by producing a simple line plot of park visitor numbers over time. We will plot each park on the same graph.
# first graph: park visitor numbers over time
plt.figure(figsize=(10,8))
# plot a line for each park on the same graph
for park in parks:
    park_df = nps_df.loc[nps_df['Park'] == park]
    plt.plot(park_df['Year'], park_df['Recreation Visits'])
plt.ylabel('Annual Recreation Visits')
plt.xlabel('Year')
plt.show()
The above plot is good for visualizing visitor numbers over time, but it's somewhat difficult to visualize how popular each park is in a given year relative to the other parks. To mitigate this, we'll make a new plot that is standardized by dividing by the total number of national park visitors for each year, thus giving us the percentage of visitors that choose a particular park over the others.
First, let's find the total annual number of recreation visits and visualize its trend line for the sake of exploration.
# Group park data by year, then sum
sum_df = nps_df.groupby(['Year']).sum()
plt.figure()
plt.plot(sum_df.index.values, sum_df['Recreation Visits'])
plt.ylabel('Total Annual Recreation Visits for ALL Parks')
plt.xlabel('Year')
plt.show()
Now we'll use this sum data to compute the percentages.
nps_df['pct_annual_visits'] = np.nan
for idx, row in nps_df.iterrows():
    row['pct_annual_visits'] = float(row['Recreation Visits']) / float(sum_df.loc[row['Year'], 'Recreation Visits'])
    nps_df.loc[row.name] = row
nps_df
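(As a side note, the row-by-row loop above is easy to read but slow; an equivalent vectorized version using groupby/transform is sketched below. It assumes the visit counts are numeric or coercible to numeric, and is offered only as an idiomatic alternative.)
# Equivalent vectorized computation of each park's share of that year's visits
visits = pd.to_numeric(nps_df['Recreation Visits'])
nps_df['pct_annual_visits'] = visits / visits.groupby(nps_df['Year']).transform('sum')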
plt.figure(figsize=(10,8))
for park in parks:
park_df = nps_df.loc[nps_df['Park'] == park]
plt.plot(park_df['Year'], park_df['pct_annual_visits'])
plt.ylabel('Fraction of total annual recreation visits')
plt.xlabel('Year')
plt.show()
This line plot of park percentages gives a better idea of how rankings fluctuate, but due to a high density of parks in the smaller percentages, picking out an individual park becomes difficult even if park labels are added for each trendline. Let's rectify this by simply assigning integer ranks from 1 to 61 in order of greatest to least percentage of total visitors.
nps_df['annual_rank'] = np.nan
for year in nps_df['Year'].unique():
    # enumerate visitation ranking values per year per park
    year_df = nps_df.loc[nps_df['Year'] == year]
    year_df_sorted = year_df.sort_values(by = 'pct_annual_visits', ascending = False)
    i = 0
    for idx, row in year_df_sorted.iterrows():
        i = i + 1
        nps_df.at[row.name, 'annual_rank'] = i
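# For reference, pandas can produce the same per-year ranking in one line.
# This is just an idiomatic alternative to the loop above; it overwrites
# 'annual_rank' with equivalent values (highest visit share = rank 1).
nps_df['annual_rank'] = (nps_df.groupby('Year')['pct_annual_visits']
                               .rank(ascending = False, method = 'first'))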
plt.figure(figsize=(12,14))
# graph line for each park over time
for park in parks:
    park_df = nps_df.loc[nps_df['Park'] == park]
    plt.plot(park_df['Year'], park_df['annual_rank'], label = park)
    # everything below this adds the text label where the line ends
    last_year = np.amax(park_df['Year'].values)
    last_year_row = park_df.loc[park_df['Year'] == last_year]
    recent_rank = int(last_year_row['annual_rank'])
    plt.annotate((str(recent_rank) + '. ' + park), (last_year, recent_rank))
plt.ylabel('Popularity ranked by percentage of total annual visits')
plt.xlabel('Year')
plt.show()
That's much better. Now we can easily see the current popularity rankings for each park, as well as how they've changed over time.
Now that we have the most recent ranks by popularity as well as a system of subjective ranking, let's compare the top 20 parks in each rank list.
# get the top 20 visit rankings for the year 2018
top_20_visits = (nps_df.loc[nps_df['Year'] == 2018]).sort_values(by = 'annual_rank')[['Park', 'annual_rank']].head(20).copy()
# get the top 20 subjective rankings
top_20_subjective = all_ranks_df[['Derived_Rank']].head(20).copy().reset_index()
print(top_20_visits)
print(top_20_subjective)
Looking at these two distinct top 20 rankings, you'll notice that some parks appear on one list but not the other. To identify some undervalued parks that are worth visiting, let's isolate the rows that are present in the subjective top 20 but not in the popularity top 20.
# eliminate all parks from the top 20 subjective that are in the top 20 by visits
# then store the resulting subset in a new frame
undervalued = top_20_subjective[~top_20_subjective['index'].isin(top_20_visits['Park'].values)].copy()
# find how the parks ranked by visits in 2018
undervalued['Visits_Rank_2018'] = np.nan
for idx, row in undervalued.iterrows():
    nps_row = nps_df.loc[(nps_df['Park'] == row['index']) & (nps_df['Year'] == 2018)]
    visits_rank = float(nps_row['annual_rank'].iloc[0]) # extract the scalar rank value
    undervalued.at[row.name, 'Visits_Rank_2018'] = visits_rank
# subtract subj. rank from visits rank to determine how undervalued a park is
undervalued['Num_Ranks_Undervalued_By'] = undervalued['Visits_Rank_2018'] - undervalued['Derived_Rank']
undervalued
From the rows above, we have identified some subjectively high-value parks to visit that the average traveller likely wouldn't prioritize. However, something worth noting is that most of these parks are in areas of low population density. We should probably investigate the relationship between visits and population density in the hypothesis testing segment below.
Let's start by checking if a significant relationship exists between our two ranking paradigms. This will tell us whether visitor experience has an effect on actual popularity (in general).
x_visits_rank_2018 = ((nps_df.loc[nps_df['Year'] == 2018]).sort_values(by = 'Park'))['annual_rank']
y_subjective_rank = (all_ranks_df.sort_index())['Derived_Rank']
plt.figure()
# calculate linear regression values
slope, intercept, r_value, p_value, std_err = scipy.stats.linregress(x_visits_rank_2018, y_subjective_rank)
# calculate y values for line
line = intercept + slope * x_visits_rank_2018
# plot points and reg line on same graph
plt.plot(x_visits_rank_2018, y_subjective_rank, 'o', x_visits_rank_2018, line)
plt.xlabel('Rank by visit count')
plt.ylabel('Subjective rank')
plt.show()
# print statistical significance
print(p_value)
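Since both variables here are ranks, Spearman's rank correlation is a natural cross-check on the same relationship; it should tell a similar story to the regression above (this is offered only as a sanity check, not part of the original analysis).
# Spearman rank correlation between 2018 visit rank and subjective rank
rho, spearman_p = scipy.stats.spearmanr(x_visits_rank_2018.astype(float),
                                        y_subjective_rank.astype(float))
print(rho, spearman_p)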
Our p-value tells us that a statistically significant relationship does exist. This is not entirely surprising given that the two top 20 lists had 13 parks in common, but visitor experience is clearly not the only factor that affects visitation. Let's dig in further by looking at nearby population versus visitation.
# prevent regression errors by eliminating rows for which we could not find
# a radial population value
temp_reg_df = wiki_df.dropna()
# similar process to last plot
plt.figure()
slope, intercept, r_value, p_value, std_err = scipy.stats.linregress(temp_reg_df['Pop_in_radius'], temp_reg_df['Visits_2018'])
line = intercept + slope * wiki_df['Pop_in_radius']
plt.plot(wiki_df['Pop_in_radius'], wiki_df['Visits_2018'], 'o', wiki_df['Pop_in_radius'], line)
plt.xlabel('Population within 400 miles of park')
plt.ylabel('Recreation Visits in 2018')
plt.show()
print(p_value)
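It's worth remembering that the p-value only speaks to significance; the r-value returned by linregress gives a sense of how strong the relationship actually is.
# Strength of the population/visits relationship (r and r-squared)
print(r_value, r_value ** 2)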
Unsurprisingly, nearby population does correlate significantly with visits.
Now that we've established that multiple factors can play a role in visitation, let's see if we can get a good fit of the visitation data using purely objective statistics. We will exclude the subjective ranking for this case due to the currently unknown factors that could affect a visitor's experience. We will also exclude a number of variables based on their direct dependency on visitation. A good example of this is the "Recreation Hours" column of the NPS dataframe. Lastly, we will exclude categorical variables that could be used to identify a single park, due to the potential for overfitting. An example of this would be the state a park is located in, since many states only have one park. We will refrain from adding any interaction terms or ratios until we determine the quality of the fit without them. As such, we will initially be looking at the following variables: population within 400 mile radius, region of country, latitude and longitude, park area, and year. Let's begin by throwing all relevant data into the NPS dataframe, so that we can easily use it with a StatsModels formula.
# Multi-variate regression (OLS):
# First we compile into one dataframe
nps_df['lat_N'] = np.nan
nps_df['long_W'] = np.nan
nps_df['recent_area_acres'] = np.nan
nps_df['recent_pop_in_radius'] = np.nan
# add data from wiki_df to nps_df
for idx, row in nps_df.iterrows():
    wiki_row = (wiki_df.loc[wiki_df['Name'] == row['Park']]).iloc[0]
    nps_df.at[row.name, 'lat_N'] = wiki_df.at[wiki_row.name, 'Latitude_N']
    nps_df.at[row.name, 'long_W'] = wiki_df.at[wiki_row.name, 'Longitude_W']
    nps_df.at[row.name, 'recent_pop_in_radius'] = wiki_df.at[wiki_row.name, 'Pop_in_radius']
    nps_df.at[row.name, 'recent_area_acres'] = wiki_df.at[wiki_row.name, 'Area_Acres_2017']
nps_df.head()
With our dataframe properly prepared, we'll throw our variables into the StatsModels formula to check for statistical significance.
# renaming a column to remove spaces so it can easily be used in a statsmodels OLS formula
nps_df = nps_df.rename(columns = {'Recreation Visits':'Recreation_Visits'})
# multivariate regression fit
reg_results = smf.ols(formula = 'Recreation_Visits ~ Region + Year + \
lat_N + long_W + recent_area_acres + recent_pop_in_radius', data = nps_df).fit()
reg_results.summary()
Remarkably, all of the variables we identified are statistically significant. Let's print the p-values with greater precision to see which variables show the strongest statistical relationships with visitation.
reg_results.pvalues
This shows that the population within a 400-mile radius is by far the most statistically significant predictor of visitation (keep in mind that a small p-value indicates strong evidence of a relationship, not necessarily a large effect), followed in order by the categorical regions, longitude, latitude, park area, and lastly the year. Note, however, that our R-squared value indicates that the model explains only about 29% of the variance around the mean, which suggests that many other variables play a role in visitation. Let's calculate a few more values to use in our fit. Whereas typical interaction terms multiply two variables together, we're going to see if we can create some useful ratios: average time per visit (recreation hours divided by recreation visits), and the percentages of a given park's visitors in a given year who are tent campers, RV campers, and backcountry campers. We'll also add a normal interaction term between park area and population within the radius.
# Adding the ratio columns to our dataset
nps_df['pct_tent'] = nps_df['Tent Campers']/nps_df['Recreation_Visits']
nps_df['pct_backcountry'] = nps_df['Backcountry Campers']/nps_df['Recreation_Visits']
nps_df['pct_RV'] = nps_df['RV Campers']/nps_df['Recreation_Visits']
nps_df['mean_visit_length'] = nps_df['Recreation Hours']/nps_df['Recreation_Visits']
# Calculating the best fit for the new formula
reg_results = smf.ols(formula = 'Recreation_Visits ~ Region + Year + \
lat_N + long_W + recent_area_acres + recent_pop_in_radius + \
pct_tent + pct_backcountry + pct_RV + mean_visit_length + \
recent_area_acres:recent_pop_in_radius', data = nps_df).fit()
reg_results.summary()
We get a better fit overall, but with the new interaction term included, park area by itself is no longer significant; neither is the constant term. Let's remove those terms and refit.
reg_results = smf.ols(formula = 'Recreation_Visits ~ Region + Year + \
lat_N + long_W + recent_pop_in_radius + \
pct_tent + pct_backcountry + pct_RV + mean_visit_length + \
recent_area_acres:recent_pop_in_radius - 1', data = nps_df).fit()
reg_results.summary()
Our model is now substantially better than the one we started with, as reflected in the F-statistic, the R-squared value, and the overall p-value. As our final step, let's print the p-values with more precision to see how strongly each term is associated with visitation.
reg_results.pvalues
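As a small readability tweak, the same p-values can be sorted so the most statistically significant terms appear first.
# Sort the p-values from most to least statistically significant
reg_results.pvalues.sort_values()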