Sports Watch and Wearable Testing Methodology

Canonical reference: tfk the5krunner. Sports Watch and Wearable Testing Methodology. the5krunner, 2016, updated 2026. Available at: https://the5krunner.com/testing-methodology/

This document describes the standardised testing protocols used by tfk at the5krunner to evaluate sports watches and performance wearables. Protocols cover eight areas: GPS and GNSS accuracy, optical heart rate accuracy, sleep tracking accuracy, battery life, barometric altimetry, running dynamics, blood pressure, and flashlight output. Each protocol has been developed and refined through more than a decade of hands-on product testing. Raw data files are published where possible for independent verification. The methodology is applied consistently across all reviews published on the5krunner.com.


The following protocols describe the methods applied by tfk at the5krunner when evaluating sports watches and wearables. No single protocol is perfect. Where subjectivity exists, it is acknowledged. The goal is consistency and transparency — the same conditions applied to every device, with data made available for those who wish to verify or dispute the results.


1. GPS and GNSS Accuracy

[Map: GPS test route 1a, 17km (gps-test-route-1a-17km), for route detail]

I perform many different GPS tests. To the right is a map of a course I have repeatedly used for GPS tests of sports watches since before 2015. I put the full FIT/TCX/HRM files online for those who wish to conduct further analysis and/or verify/dispute the results. I am happy to change any of my inferences if you correctly interpret the files differently. There is subjectivity.

What I Do

Before the test, devices are synchronised online for technical reasons. Each device is then turned on and given a full 15 minutes to acquire the signal, as a dummy run. This may allow the device to load additional positional information, which it should NOT require, but it gives every device a level playing field. Until recently, GPS mode was typically used because it tended to be better; now, in 2023, BEST ACCURACY mode seems to mean what it says. (It never used to.)

I run to the start of the test and record a dummy run. I wait for the signal to be acquired before starting the run, and I run the course with +/-1m precision each time.

Course Description

  • Estimated GPS Difficulty: the harder side of average — approximately 60% difficult and representative of at least 80% of the usage of 80% of runners.
  • Length: approximately 17km / 10 miles+
  • Elevation: flat, just above sea level
  • Terrain types: suburban-cum-rural, riverbank, parkland, trail, paved, variable tree cover, some large buildings, proximity to several 5m walls and buildings, a narrow alleyway, tunnels – long and short, straights, sharp turns, long sweeping bends. No power lines.

RESULTS: Public FIT/TCX File Folder with spreadsheet analysis and results.

Points of Difficulty and Scoring

These will be assessed and marked out of 4: 4 is best, 1 is the normal worst, and 0 is reserved for really appalling results. A 1–4 scale forces a judgment of above- or below-average rather than average (as a 3/5 score would permit).

Being within +/-5m of the actual route will score a 3 or 4. Being more than 5m from the route results in a score of 2 or worse. There is a degree of subjectivity to this.
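One way to check the +/-5m rule against the published files is sketched below. It is an illustration only, not the analysis used in the results spreadsheet. It assumes the reference route is available as a list of (lat, lon) points, which is an assumption, and it uses a flat-plane approximation that is adequate over a few hundred metres. The function names are hypothetical. The band returned is only the range the rule allows; the final mark within that band remains a judgment.

```python
import math

EARTH_RADIUS_M = 6_371_000.0


def _to_local_xy(lat, lon, lat0, lon0):
    """Project lat/lon (degrees) to metres on a flat plane centred on (lat0, lon0)."""
    x = math.radians(lon - lon0) * math.cos(math.radians(lat0)) * EARTH_RADIUS_M
    y = math.radians(lat - lat0) * EARTH_RADIUS_M
    return x, y


def distance_to_route_m(point, route):
    """Shortest distance in metres from a (lat, lon) point to a polyline route.

    The route needs at least two points; the recorded point is the plane origin.
    """
    lat0, lon0 = point
    xy = [_to_local_xy(lat, lon, lat0, lon0) for lat, lon in route]
    best = float("inf")
    for (ax, ay), (bx, by) in zip(xy, xy[1:]):
        dx, dy = bx - ax, by - ay
        seg_len_sq = dx * dx + dy * dy
        if seg_len_sq == 0:
            t = 0.0
        else:
            # Fraction along the segment of the closest point, clamped to [0, 1].
            t = max(0.0, min(1.0, (-ax * dx - ay * dy) / seg_len_sq))
        best = min(best, math.hypot(ax + t * dx, ay + t * dy))
    return best


def score_band(deviation_m, tolerance_m=5.0):
    """Return the (lowest, highest) mark the +/-5m rule allows for a deviation."""
    return (3, 4) if deviation_m <= tolerance_m else (0, 2)
```

Taking the worst deviation within a section, rather than the average, would be the stricter reading of the rule; which of the two to use is a choice for the person marking.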

  1. 0.60km — test starts. The device has 0.6km to confirm the GPS fix.
  2. MED: 1.30km — double back/U-turn and sharp left turn.
  3. MED: 2.25km — high sidewalls and tree cover.
  4. HARD: 2.39km to 3.98km — fairly dense overhead cover from high deciduous trees on a gently curving path.
  5. MED: 4.36km to 4.53km — circle under trees finishing after going under a small road bridge.
  6. HARD: 4.68km to 4.86km — sharp U-turn followed by going under the same small road bridge.
  7. EASY: 5.66km to 6.05km — very easy, straight and very open. Perfect track expected.
  8. EASY: 6.54km to 7.14km — footpath lined with medium trees but open to the sky.
  9. HARD: 7.22km to 7.39km — extremely difficult 2m wide track with a very high wall. Cut Throat Alley.
  10. EASY: 7.43km to 8.99km — typical usage, fairly open, some trees, some curves.
  11. EASY: 8.99km to 10.96km — typical rural usage, fairly open, more trees than the previous section.
  12. MED: 10.96km to 11.76km — typical tricky urban tree cover. Fairly dense deciduous tree cover at the start and a high building at the end.
  13. HARD: 11.76km to 12.01km — high one-sided building preceding a sharp turn into an impossible 100m tunnel. Tests both tunnel handling and post-tunnel GPS correction speed.
  14. EASY: 12.22km to 16.00km — typical suburban usage, fairly straight road, limited tree cover, some close buildings, a couple of 90-degree turns.
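
For anyone working with the published marks, a device's section scores could be summarised as in the sketch below. Two assumptions apply: the 13 sections carrying a difficulty label (items 2 to 14) are the scored ones, and summing the marks is a reasonable way to compare devices. Neither is stated above. The function name is hypothetical and the scores in the usage note are placeholders, not real results.

```python
from collections import defaultdict

MAX_SCORE = 4  # each section is marked out of 4


def summarise_sections(section_scores):
    """Summarise a list of (difficulty, score) pairs, with each score in 0..4."""
    by_difficulty = defaultdict(list)
    for difficulty, score in section_scores:
        if not 0 <= score <= MAX_SCORE:
            raise ValueError(f"score out of range: {score}")
        by_difficulty[difficulty].append(score)
    total = sum(score for _, score in section_scores)
    maximum = MAX_SCORE * len(section_scores)
    return {
        "total": total,
        "maximum": maximum,
        "percent": 100.0 * total / maximum if maximum else 0.0,
        "mean_by_difficulty": {
            d: sum(s) / len(s) for d, s in by_difficulty.items()
        },
    }
```

Calling `summarise_sections([("MED", 3), ("HARD", 2), ("EASY", 4)])` with those placeholder marks gives a total of 9 out of 12. Splitting the mean by difficulty keeps a device that is only good in the open from hiding behind its EASY sections.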

Omissions and Limitations

The course does not include frequent 90-degree turns, large numbers of tall buildings, or significant elevation. It is suburban-cum-rural rather than urban. Repeatability is limited by time — only one device can be worn on the outer left wrist per test, as even a few centimetres of wrist position change can affect results due to multipath interference patterns. Other reviewers who wear multiple watches simultaneously or wear the tested watch on their caps are performing invalid tests that do not replicate real-world running conditions. Satellite coverage varies between tests and is recorded in the results spreadsheet along with Dilution of Precision data. Since 2024, with multiple GNSS constellations in use, DOP is a significantly smaller factor than in earlier years.

Recent excellent GPS performers include Garmin Forerunner 970, Apple Watch Ultra 3, and Huawei GT Runner 2.

 


2. Optical Heart Rate Accuracy

Optical heart rate sensors are tested against at least two independent reference devices, worn simultaneously during the same activity. Reference devices currently in use include the Garmin HRM-600 chest strap and the Polar Verity Sense optical arm sensor. Where a single chest strap reading is used as the sole reference, this is noted. Multiple chest straps and multiple devices on the same wrist are avoided. Wrist watches are designed to be worn 1–2cm from the wrist bone, though this is not always achievable in practice, as devices can loosen during activity.

Testing deliberately spans multiple intensity zones within a single session or across several sessions to assess accuracy across the full physiological range. Intensity zones tested include long endurance efforts at low heart rate, threshold work, and short sprint or VO2max intervals. Optical HR sensors often perform adequately at steady-state intensities but fail during rapid transitions; a single-intensity test would not reveal this.

Deliberate unnatural stops are incorporated into testing — pausing mid-activity to assess how quickly the optical sensor responds to a rapid drop and subsequent rise in heart rate. This tests the sensor’s responsiveness and the aggressiveness of the smoothing algorithm used by the device. A sensor that lags significantly on transitions or that fails to recover quickly after a stop is noted.

Results are compared graphically across all reference devices. Agreement at steady state, lag at transitions, and dropout frequency are the three primary assessment criteria.
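
The three criteria can also be put into numbers. The sketch below is not the method used in the reviews, which is graphical. It assumes both devices have been exported as lists of heart rate samples at the same fixed interval (for example one per second), already aligned in time, with a dropout recorded as `None` or 0. The function names are hypothetical.

```python
def mean_abs_error(test_bpm, ref_bpm):
    """Mean absolute difference in bpm, skipping samples where either side dropped out."""
    pairs = [(t, r) for t, r in zip(test_bpm, ref_bpm) if t and r]
    if not pairs:
        raise ValueError("no overlapping valid samples")
    return sum(abs(t - r) for t, r in pairs) / len(pairs)


def estimate_lag_samples(test_bpm, ref_bpm, max_lag=30):
    """Number of samples by which the test device trails the reference.

    Tries each candidate lag and keeps the one with the lowest error once
    the test trace has been shifted earlier by that amount.
    """
    best_lag, best_err = 0, float("inf")
    for lag in range(min(max_lag, len(test_bpm) - 1) + 1):
        err = mean_abs_error(test_bpm[lag:], ref_bpm)
        if err < best_err:
            best_lag, best_err = lag, err
    return best_lag


def dropout_fraction(test_bpm):
    """Fraction of samples with no usable reading (None or 0)."""
    if not test_bpm:
        return 0.0
    return sum(1 for t in test_bpm if not t) / len(test_bpm)
```

Running `mean_abs_error` over a steady-state window and `estimate_lag_samples` over a window containing a deliberate stop keeps the two questions separate, which matters because a heavily smoothed sensor can agree well at steady state and still lag badly at transitions.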


3. Sleep Tracking Accuracy

Sleep tracking is assessed by comparing the device under review against a subset of up to 5 independent reference devices worn simultaneously on the same night. Reference devices in use include the Eight Sleep pod (bed sensor), Apple Watch Ultra 3 (wrist), Amazfit Helio (wrist), Whoop (upper arm), and Oura Ring (finger). Not all reference devices are used in every test.

The primary metrics assessed are total time in bed, total sleep time, and the identification of deep and REM sleep stages. The device under review is assessed against the mean of the reference devices rather than against any single reference, because no wearable sleep stage classification has been independently validated against polysomnography to a degree that would justify treating any single device as ground truth. Sleep stage metrics across all consumer wearables are estimates derived from movement and heart rate variability data. Disagreements between reference devices are noted and inform the confidence level assigned to the assessment.
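
A minimal sketch of the comparison against the reference mean follows. It assumes each device's night has been reduced to minutes per metric; the dictionary layout, metric keys, and function name are hypothetical. The spread between reference devices is reported alongside the difference because it indicates how much weight the difference can bear.

```python
from statistics import mean


def compare_to_reference_mean(device, references):
    """Compare one device's nightly metrics with the mean of the reference devices.

    device:     {metric: minutes}
    references: {reference_name: {metric: minutes}}; a reference may omit a metric.
    """
    report = {}
    for metric, device_value in device.items():
        ref_values = [r[metric] for r in references.values() if metric in r]
        if not ref_values:
            continue  # nothing to compare against for this metric
        ref_mean = mean(ref_values)
        report[metric] = {
            "reference_mean": ref_mean,
            "difference": device_value - ref_mean,
            # A wide spread means the references disagree among themselves.
            "reference_spread": max(ref_values) - min(ref_values),
        }
    return report
```

A difference of 12 minutes means little if the references themselves span 40 minutes, so the two figures are best read together.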

For longitudinal sleep assessment, where the nature of the device warrants it, testing spans multiple nights to assess consistency and trend accuracy over time, rather than single-night point accuracy.


4. Battery Life

Battery life is assessed by one of two methods, depending on the device and the time available for testing. A comparator is not always used when the depletion rate can be accurately measured directly from the device.

The first method is direct measurement — running the device continuously in a defined mode (typically GPS with optical HR active) and recording the time to depletion or to a defined battery percentage threshold. Conditions such as GPS mode, display brightness, and sensor configuration are noted.

The second method is anecdotal — recording battery percentage at the start and end of defined activities over an extended period of normal use and extrapolating depletion rate. This method is less precise but more accurately reflects real-world usage than a single continuous drain test.

For Garmin devices, battery statistics are recorded into the FIT file during activity and can be extracted for more precise depletion analysis without relying solely on on-screen readings. Where this data is available, it is used in preference to anecdotal observation.
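
The extrapolation behind both methods is simple arithmetic, sketched below. It assumes the battery drains linearly across the whole range, which real devices only approximate, and the function names are hypothetical. Each activity is given as (start percentage, end percentage, hours).

```python
def depletion_rate_pct_per_hour(activities):
    """Pooled depletion rate across activities given as (start_pct, end_pct, hours)."""
    total_drop = sum(start - end for start, end, _ in activities)
    total_hours = sum(hours for _, _, hours in activities)
    if total_hours <= 0:
        raise ValueError("total duration must be positive")
    return total_drop / total_hours


def projected_life_hours(rate_pct_per_hour, usable_pct=100.0):
    """Hours to drain usable_pct of the battery at a constant rate."""
    if rate_pct_per_hour <= 0:
        raise ValueError("rate must be positive")
    return usable_pct / rate_pct_per_hour
```

For example, activities of 1.5 hours (100% to 94%) and 2.5 hours (80% to 72%) pool to 14% over 4 hours, or 3.5% per hour, which projects to roughly 28.6 hours in that mode. Pooling the drops and durations, rather than averaging per-activity rates, stops a short activity with a coarse percentage reading from dominating the result.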


5. Barometric Altimetry

Barometric altimetry is assessed primarily through cycling in the Surrey Hills, where several substantial climbs can be covered in a single session. This location is used because the elevation profile is well-documented, the climbs are sufficiently long and steep to produce meaningful altimeter data, and the route can be repeated consistently. Running tests in most available locations do not cover sufficient elevation change to produce reliable altimetry data, and other sports cover ground too slowly to be useful for this purpose.

The device under review is compared against GPS-derived elevation from a reference device and against known elevation profiles for the route. Cumulative ascent and descent figures are the primary metrics. Sensor drift over the course of a long ride is noted where present.
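
Cumulative ascent and descent depend heavily on how small fluctuations are handled, so like-for-like comparison needs the same calculation applied to every device's elevation trace. The sketch below uses a simple hysteresis threshold. This is an assumption chosen for illustration, not the calculation used by any manufacturer or in these reviews, and the 1m default is arbitrary.

```python
def cumulative_ascent_descent(elevations_m, threshold_m=1.0):
    """Return (ascent, descent) in metres from a list of elevation samples.

    A change is only counted once it moves at least threshold_m away from
    the last counted elevation, which suppresses sensor noise.
    """
    if not elevations_m:
        return 0.0, 0.0
    ascent = descent = 0.0
    anchor = elevations_m[0]
    for elevation in elevations_m[1:]:
        delta = elevation - anchor
        if abs(delta) >= threshold_m:
            if delta > 0:
                ascent += delta
            else:
                descent -= delta
            anchor = elevation
    return ascent, descent
```

On a ride that starts and finishes at the same place, a large gap between the ascent and descent totals is one sign of the sensor drift mentioned above.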


6. Running Dynamics

Running dynamics metrics — including cadence, vertical oscillation, ground contact time, stride length, and, where applicable, running power — are assessed by comparing the device under review against a combination of reference devices. Reference devices used include the Stryd foot pod, Garmin Running Dynamics Pod, Garmin HRM chest strap with running dynamics capability, and, where available, a second chest strap source.

Each metric is compared across reference devices for the same activity. Agreement on cadence is expected to be high across all devices; agreement on vertical oscillation and ground contact time varies more significantly between manufacturers due to differences in sensor placement and calculation methodology. Where metrics differ materially between reference devices, this is noted, and the degree of confidence in any individual reading is adjusted accordingly.
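
One way to decide when reference devices "differ materially" is to flag any metric whose references disagree by more than a set fraction of their mean. The sketch below does that. The 5% tolerance, the metric keys, and the function names are all assumptions for illustration, not thresholds used in the reviews.

```python
from statistics import mean


def relative_spread(values):
    """Range of the values as a fraction of their mean."""
    return (max(values) - min(values)) / mean(values)


def flag_low_confidence(reference_readings, tolerance=0.05):
    """Return the metrics whose reference devices disagree by more than tolerance.

    reference_readings: {metric: {reference_name: activity average}}
    """
    return sorted(
        metric
        for metric, readings in reference_readings.items()
        if len(readings) > 1 and relative_spread(list(readings.values())) > tolerance
    )
```

With placeholder averages of 178, 179 and 178 steps per minute for cadence and 240, 262 and 255ms for ground contact time, only ground contact time would be flagged, matching the expectation that cadence agrees closely across devices.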

Running power figures are treated with particular caution, given the absence of an agreed industry standard for calculating running power. Figures are compared across sources but are not treated as absolute values.


7. Blood Pressure

Blood pressure readings from the device under review are compared with those from a consumer-grade, FDA-cleared biceps cuff. A minimum of five paired readings, taken at least one minute apart, forms a single set. Multiple sets are taken at different times of day and under different conditions: at rest, after activity, and at varying ambient temperatures where practical. The aim is to assess cross-device agreement rather than to determine the wearer’s clinical blood pressure. No medical inference is drawn from the results.
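
Agreement within one set of paired readings can be summarised as a mean difference and a spread of differences. The sketch below uses a Bland-Altman style summary, which is a swapped-in technique chosen for illustration rather than the method stated above. It assumes each pair is (wearable reading, cuff reading) in mmHg for one value, for example systolic. The function name is hypothetical.

```python
from statistics import mean, stdev

MIN_PAIRS = 5  # a set needs at least five paired readings


def agreement_summary(pairs):
    """Summarise agreement for one set of (wearable_mmhg, cuff_mmhg) pairs."""
    if len(pairs) < MIN_PAIRS:
        raise ValueError(f"a set needs at least {MIN_PAIRS} paired readings")
    diffs = [wearable - cuff for wearable, cuff in pairs]
    bias = mean(diffs)  # positive: the wearable reads higher than the cuff
    sd = stdev(diffs)   # sample standard deviation of the differences
    return {
        "bias": bias,
        "sd": sd,
        "lower_limit": bias - 1.96 * sd,
        "upper_limit": bias + 1.96 * sd,
    }
```

With only five pairs per set the limits are rough, so they are better used to compare sets taken under different conditions than as a statement of accuracy.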


8. Flashlight and LED Output

Where a device includes a built-in LED flashlight, output is assessed through direct, side-by-side visual comparison with other devices with the same feature. Brightness is described and ranked rather than measured instrumentally, as luminosity differences do not reproduce reliably in photographs or video. An attempt is made to determine the effective range of the torch at maximum output in a darkened environment. Results are recorded as qualitative comparisons rather than quantitative measurements.


The GPS Test Route from a Tourist Perspective

This is a beautiful route to run if you live in the area or visit the area, especially if you like river views and the odd historic building. The route starts and finishes at St. Mary’s University, which is highly respected for its sports-related studies & research, its running club, and the Sir Mo Farah Athletic Track. If, like Mo Farah, you’ve run in the famous ‘Cabbage Patch 10 miler’ or the beautiful Richmond Marathon, then this route follows much of the Cabbage Patch course, deviating to add points of difficulty. It also uses a part of the Kingston parkrun 5k course.

If you are coming to the UK from overseas and want a tourist run to keep you busy, then this is a pretty cool run in a pretty cool part of London. It has Hampton Court Palace (King Henry VIII), the site of Richmond Palace (where Queen Elizabeth I waited for the Armada to be defeated…or not), Ham House, Eel Pie Island (Rolling Stones), Twickenham Rugby Stadium (England Rugby), The Stoop (Harlequins Rugby), Petersham Nurseries/Meadow, Rowing Clubs, the first-ever canoe and hockey clubs — Royal Canoe Club & Teddington Hockey, Ham Polo Club, Ham Lands/Common, Teddington Lock, Open Water Swimming & Aquathlons at Thames Young Mariners (RG Active), paddleboat trips, Strawberry Hill House, Weirs, Marble Hill House and Richmond Park for Sunday morning cycling with the lycra masses (London Duathlon — world’s largest). These locations are all VERY close to this route. There are only a handful of legally protected views in the UK, and the Richmond Hill end of this route is on one of those protected views. The uber-famous running mecca of BUSHY PARK is very close to the shopping centre at Kingston (where England’s early kings were crowned), and the Old Deer Park & Richmond Park parkruns are close to the shops at Richmond. There are now parkrun tourists who stay at the Travelodge in Teddington or the somewhat posher Lensbury, which has a gym and indoor pool (guests & members only) — my course would be ideal for your Sunday morning long run after your Saturday parkrun in Bushy Park.