Introduction
In the world of Data Science, I’m most strongly drawn to geographically-linked data. Choropleth maps, for example, are about the most powerful way to convey statistical information and get a point across, if you will forgive the intolerable pun. Attaching data to geography somehow makes information more relatable, more personal, and easier to absorb. It gives people a reference point, a “you are here”, or maybe a “There, but for the grace of God…”.
I am fortunate to live in New Mexico where I can take beautiful and varied walks, hikes and bike rides nearly every day, which I do. I record most of my excursions with an app on my phone, and I realized recently that I must have lots of data that I can play with to do mapping and analysis. I was pleased to discover that the simple app I use on my phone to track my excursions can easily export the data to GPX format. I had no idea what GPX format was then, but assumed it was some standard, so things looked promising, and off I went.
The data I use comes from an app called SportActive. Apps like AllTrails and Strava are wonderful, especially when exploring new places, but they are overkill IMO for simple tracking of daily, and largely repetitive, activities. SportActive simply records my walks and rides, without asking if I want to share my walk with my “friends” like a meal on Facebook. (Disclaimer: I have no financial relationship with SportActive, although if you could arrange such a thing I’d happily change this disclaimer.)
In this and two following articles, I will show how the data can be used with Python to map and analyze the GPS information. I will show how to do analyses such as profiling elevations, calculating speeds and durations, identifying pauses, and segmenting paths based on various criteria. This type of analysis, which I’m doing for fun, is the same kind that could be used to study, for example, bird migrations or the movement of container ships.
Python provides many libraries based around the pandas ecosystem which make working with geospatial data easy. GeoPandas extends pandas to incorporate geometries and coordinate reference systems. GPX data is a series of geolocated points, which is easily handled by GeoPandas. MovingPandas facilitates turning the point geometries into “trajectories”, allowing for calculations of speed, duration and direction.
This article will cover parsing the raw GPS data to a csv file, which I then import into a GeoPandas GeoDataFrame. From that I will create maps and generate some basic statistical information about the trek, such as distance and elevation profiles. These articles assume basic familiarity with Python; experience with Pandas is helpful. An expanded version of the code in the articles can be obtained from my GitHub repository at https://github.com/biscotty666/GPX.
XML to CSV
The GPX file turns out to be an XML file with standardized tags. Most importantly, it contains a list of tags called trkpt which, as the name implies, are points along the trek, each recording the geographic location in latitude and longitude, together with the elevation and the time of the recording. An example is:
<trkpt lat="41.043858" lon="-74.063712">
  <ele>79</ele>
  <time>2024-09-05T15:25:13.858Z</time>
</trkpt>
This format cannot be directly imported into a GeoDataFrame, but csv files can be. Fortunately, there is a Python library called BeautifulSoup that can not only efficiently parse HTML, it can also process XML, from which I can write a csv file. So I’ll start by importing it, along with the csv and os libraries, and “make the soup”:
from bs4 import BeautifulSoup
import csv
import os

# Read the raw GPX (XML) file and parse it
with open(gpx_file, 'r') as f:
    contents = f.read()

soup = BeautifulSoup(contents, 'xml')
I’ll grab the elements of interest, as well as some metadata, with:
tracks = soup.find_all('trkpt')            # all track points
elevations = soup.find_all('ele')          # elevation per point
times = soup.find_all('time')              # timestamp per point
temp = soup.find('s2t:temperature').text   # per-trek metadata
weather = soup.find('s2t:weather').text
name = soup.find('name').text
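A side note: the s2t: tags appear to be app-specific extensions rather than part of the GPX standard, so files from other apps may not contain them. If I wanted the parser to survive that, a small defensive variation (my own addition, not in the original script) would check each tag before dereferencing it:

# Defensive variant: the s2t: extension tags may be absent in GPX
# files from other apps, so guard against find() returning None
temp_tag = soup.find('s2t:temperature')
temp = temp_tag.text if temp_tag else ''
weather_tag = soup.find('s2t:weather')
weather = weather_tag.text if weather_tag else ''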
I expect to concatenate a bunch of these files, so I’ll also add the filename itself as a column:
sf_name = os.path.splitext(gpx_file)[0]
id = os.path.split(sf_name)[1]
csv_file = sf_name + '.csv'
Then I make a list of the points with each of these elements connected with their geographic coordinates:
data = []
for track, elevation, time in zip(tracks, elevations, times):
    latitude = track['lat']
    longitude = track['lon']
    elevation_value = elevation.text
    time = time.text
    data.append([id, latitude, longitude, elevation_value,
                 time, temp, weather])
And now I can write the CSV file:
with open(csv_file, 'w', newline='') as csvfile:
    writer = csv.writer(csvfile)
    writer.writerow(['Id', 'Lat', 'Lon', 'Elev', 'Time',
                     'Temp', 'Weather'])
    writer.writerows(data)
Since I want to process a lot of these files, as well as concatenate them, I put this together in a script for batch processing, which I’ll add at the end.
Importing into Geopandas
With the data in a csv file, I am ready to make the GeoDataFrame. First I’ll import the libraries I’ll be using in this section:
import pandas as pd
import geopandas as gpd
import contextily as ctx
import matplotlib.pyplot as plt
import seaborn as sns
from statsmodels.nonparametric.smoothers_lowess import lowess
from shapely import LineString
I’ll use pandas to import the csv file to a pandas DataFrame:

trek_df = pd.read_csv('data.csv')
trek_df.iloc[0]
Id wo-2024-09-05
Lat 41.043858
Lon -74.063712
Elev 79
Time 2024-09-05T15:25:13.004Z
Temp 20.51
Weather 0
Name: 0, dtype: object
Then I’ll create a GeoPandas GeoDataFrame using the lat/lon coordinates and an appropriate coordinate reference:

trek_gdf = gpd.GeoDataFrame(
    trek_df,
    geometry=gpd.points_from_xy(x=trek_df.Lon, y=trek_df.Lat)
).set_crs(4269)
trek_gdf.info()
<class 'geopandas.geodataframe.GeoDataFrame'>
RangeIndex: 650 entries, 0 to 649
Data columns (total 8 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Id 650 non-null object
1 Lat 650 non-null float64
2 Lon 650 non-null float64
3 Elev 650 non-null int64
4 Time 650 non-null object
5 Temp 650 non-null float64
6 Weather 650 non-null int64
7 geometry 650 non-null geometry
dtypes: float64(3), geometry(1), int64(2), object(2)
memory usage: 40.8+ KB
That looks good, with no null values. The Time column should be a datetime type, so I’ll go ahead and change that:

trek_gdf['Time'] = pd.to_datetime(trek_gdf.Time)
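Incidentally, the trailing Z in the timestamps means they are recorded in UTC, so pandas parses them as timezone-aware UTC values. If I wanted local wall-clock times for a New Jersey walk, a one-line conversion should do it (this is an aside of my own, not a step the rest of the workflow depends on):

# Optional: convert the timezone-aware UTC timestamps to US Eastern time
trek_gdf['Time'] = trek_gdf['Time'].dt.tz_convert('America/New_York')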
OK, let’s make the first map. I’ll use matplotlib together with contextily, a library which provides a wide variety of base maps with ease:

f, ax = plt.subplots(figsize=(10,10))
trek_gdf.plot(ax=ax, alpha=0.3, markersize=4, c='b')
ctx.add_basemap(ax, crs=trek_gdf.crs, source=ctx.providers.OpenStreetMap.Mapnik);
Sweet. In order to do calculations, however, I’ll need to project the data to a CRS appropriate for the location, which is New Jersey. A quick visit to epsg.io suggests EPSG:32111, so I’ll use that. It’s in meters, not feet, which I personally prefer anyway:

trek_proj = trek_gdf.to_crs(32111)
f, ax = plt.subplots(figsize=(10,10))
trek_proj.plot(ax=ax, alpha=0.3, markersize=4, c='b')
ctx.add_basemap(ax, crs=trek_proj.crs, source=ctx.providers.CartoDB.Voyager);
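As an aside, recent versions of GeoPandas can suggest a suitable projected CRS automatically, which saves the trip to epsg.io. A quick sketch (this picks a UTM zone rather than the state plane CRS used above):

# Let GeoPandas estimate a projected CRS (a UTM zone) for the data
utm_crs = trek_gdf.estimate_utm_crs()
trek_utm = trek_gdf.to_crs(utm_crs)
print(utm_crs.name)   # for New Jersey, this should be WGS 84 / UTM zone 18N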
Elevations
Let’s profile the elevations for the walk. First, some basic statistics:
line = LineString(trek_proj.geometry)
print(f'''
The walk was {line.length / 1000:.1f} kilometers
Elevations:
Initial: {trek_proj.iloc[0].Elev} meters
Final: {trek_proj.iloc[-1].Elev} meters
Highest: {trek_proj.Elev.max()} meters
Lowest: {trek_proj.Elev.min()} meters
''')
The walk was 7.0 kilometers
Elevations:
Initial: 79 meters
Final: 22 meters
Highest: 81 meters
Lowest: 11 meters
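The print-out gives the endpoints and extremes but not the total climbing. As a rough sketch of my own, cumulative ascent can be had by summing only the positive point-to-point elevation differences (raw GPS elevations are noisy, so this will overstate the true figure somewhat):

# Cumulative ascent: sum of the positive elevation changes between points
gain = trek_proj.Elev.diff().clip(lower=0).sum()
print(f'Total ascent: {gain:.0f} meters')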
And a basic plot:
f, ax = plt.subplots(figsize=(9,5))
ax = trek_proj.Elev.plot()
ax.set_title(f'''
Elevation profile for {trek_gdf.Id[0]} on {trek_gdf.Time[0].strftime('%x')}
''')
ax.set_ylabel('meters');
Using Seaborn, I can use regression to smooth this out as much or as little as I want. lmplot needs a numeric x variable, so I’ll first add a column numbering the points:

trek_proj['Point'] = range(len(trek_proj))   # point index for the x axis

sns.set_theme()
plot = sns.lmplot(data=trek_proj, x='Point', y='Elev',
                  order=4, truncate=False, height=8, aspect=1)
plot.figure.subplots_adjust(top=0.9, left=.11)
plot.fig.suptitle(f'''
Smoothed profile for {trek_gdf.Id[0]} on {trek_gdf.Time[0].strftime('%x')}
''', fontsize=18);
And here’s smoothing with lowess:

sns.set_theme(rc={'figure.figsize':(6,6)})
plot = sns.lmplot(data=trek_proj, x='Point', y='Elev',
                  lowess=True, truncate=False, height=8, aspect=1)
plot.figure.subplots_adjust(top=0.9, left=.11)
plot.fig.suptitle(f'''
Lowess smoothed profile for {trek_gdf.Id[0]} on {trek_gdf.Time[0].strftime('%x')}
''', fontsize=18);
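For comparison, plain pandas can smooth the profile too, with no regression machinery at all. A minimal sketch of my own, using an arbitrary 25-point centered window:

# Alternative smoothing: a centered rolling mean over 25 track points
ax = trek_proj.Elev.rolling(25, center=True).mean().plot()
ax.set_title('Rolling-mean elevation profile')
ax.set_ylabel('meters');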
Next steps
I naturally want to be able to make calculations of speed and distance, identify pauses, and do other exploration. Starting from discrete points, the steps to do so manually would be simple but exceedingly tedious. Fortunately there is a wonderful library called movingpandas which makes all of this very simple. I’ll explore that in the next articles.
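To give a flavor of what’s coming, here is a minimal movingpandas sketch, assuming the projected GeoDataFrame from above with its Time column used as the index (the details will get proper treatment in the next article):

import movingpandas as mpd

# Build a single trajectory from the trek's time-indexed points
traj = mpd.Trajectory(trek_proj.set_index('Time'), traj_id=1)
traj.add_speed(overwrite=True)   # adds a per-point speed column
print(traj.get_length(), traj.get_duration())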
I’ll go ahead and save the GeoDataFrames for future use:

trek_gdf.to_file('data/trek_gdf.gpkg', driver='GPKG')
trek_proj.to_file('data/trek_projected.gpkg', driver='GPKG')
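Reading them back in a later session is the mirror image; GeoPackage round-trips both the geometries and the CRS:

# Restore the saved layers
trek_gdf = gpd.read_file('data/trek_gdf.gpkg')
trek_proj = gpd.read_file('data/trek_projected.gpkg')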
Code
The code for this article is available in my GitHub repository in the form of a Jupyter notebook. Here is the final script for batch processing the files:
'''
Convert a directory of gpx files to csv.

The script captures trek points with coordinates,
elevation and time stamp as well as trek name and
weather conditions.

The script requires an input directory path.
'''
from bs4 import BeautifulSoup
import csv
import sys
import os

if len(sys.argv) < 2:
    sys.exit("Must supply an input directory")
inPath = sys.argv[1]

files = [os.path.join(inPath, f)
         for f in os.listdir(inPath)
         if f.endswith(".gpx")]

# The combined file collects the rows from every gpx file
cfile = os.path.join(inPath, 'combined.csv')

with open(cfile, 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['Id', 'Lat', 'Lon', 'Elev', 'Time',
                     'Temp', 'Weather'])

for gpx_file in files:
    with open(gpx_file, 'r') as f:
        contents = f.read()

    soup = BeautifulSoup(contents, 'xml')

    tracks = soup.find_all('trkpt')
    elevations = soup.find_all('ele')
    times = soup.find_all('time')
    temp = soup.find('s2t:temperature').text
    weather = soup.find('s2t:weather').text
    name = soup.find('name').text

    sf_name = os.path.splitext(gpx_file)[0]
    id = os.path.split(sf_name)[1]
    csv_file = sf_name + '.csv'
    data = []

    for track, elevation, time in zip(tracks, elevations, times):
        latitude = track['lat']
        longitude = track['lon']
        elevation_value = elevation.text
        time = time.text
        data.append([id, latitude, longitude, elevation_value,
                     time, temp, weather])

    # One csv per gpx file...
    with open(csv_file, 'w', newline='') as csvfile:
        writer = csv.writer(csvfile)
        writer.writerow(['Id', 'Lat', 'Lon', 'Elev', 'Time',
                         'Temp', 'Weather'])
        writer.writerows(data)

    # ...plus a copy of the rows appended to the combined file
    with open(cfile, 'a', newline='') as f:
        writer = csv.writer(f)
        writer.writerows(data)
Citation
@online{carey2024,
author = {Carey, Brian},
title = {Trail {Mapping} with {Python}},
date = {2024-02-28},
url = {https://biscotty666.github.io/biscottys-workshop/2024-02-28-trail-mapping-with-python},
langid = {en}
}