Geopath Streetlytics Simplified Process
Streetlytics Process
Streetlytics is calculated using a five-step iterative approach. The first three steps use the data sources to build three types of movement: Reference, Sampled, and Modeled.
Reference Movement: The Reference Movement comprises the routable transportation system, hourly speeds, and traffic counts. These data go through a rigorous review process before they are used in the Reference Movement. They are as close as possible to measured traffic flows and congestion, which makes this dataset important for providing targets in the Streetlytics process.
Sampled Movement: The Sampled Movement for the Streetlytics process comes from AirSage. This data is aggregated from GPS location data to capture movement between and within block groups during the analysis time period. AirSage also provides the observed home locations (at the block group level) of transportation system users, which are used to determine the demographic mix of people moving through the transportation system. Streetlytics segments pedestrian movement using a method based on trip distance; it then sums these trips across the area being analyzed, compares the pedestrian modal share at the regional level, and adjusts in an iterative process.
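The distance-based pedestrian segmentation and its iterative adjustment could look roughly like the sketch below. The source does not describe the actual rule, so the single distance cutoff, the bisection search, the target share, and the sample trips are all assumptions made for illustration.

```python
from typing import List, Tuple

# Each trip is (distance_km, expanded_trip_count). Values are illustrative only.
Trip = Tuple[float, float]


def pedestrian_share(trips: List[Trip], max_walk_km: float) -> float:
    """Share of all trips classified as pedestrian under a distance cutoff."""
    total = sum(count for _, count in trips)
    walking = sum(count for dist, count in trips if dist <= max_walk_km)
    return walking / total if total else 0.0


def calibrate_cutoff(trips: List[Trip], target_share: float,
                     lo: float = 0.0, hi: float = 10.0,
                     tol: float = 1e-3, max_iter: int = 50) -> float:
    """Bisect on the cutoff until the regional share is close to the target."""
    for _ in range(max_iter):
        mid = (lo + hi) / 2
        share = pedestrian_share(trips, mid)
        if abs(share - target_share) <= tol:
            return mid
        if share < target_share:
            lo = mid   # too few pedestrian trips: allow longer walks
        else:
            hi = mid   # too many pedestrian trips: tighten the cutoff
    return (lo + hi) / 2


# Hypothetical regional trips and a hypothetical target modal share of 25%.
sample_trips = [(0.3, 100), (0.8, 150), (1.5, 250), (5.0, 300), (12.0, 200)]
cutoff_km = calibrate_cutoff(sample_trips, target_share=0.25)
```

Because the share is a step function of the cutoff, the search settles on any cutoff inside the interval that reproduces the target share rather than on a unique value.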
Modeled Movement: The Modeled Movement is created using a behavioral travel model that combines the routable transportation system, points of interest, demographic data, business data, and government surveys to predict the path of every vehicular and pedestrian trip nationwide.
Optimized Total Movement: The optimization applies varying weights to the millions of Reference, Sampled, and Modeled movements, based on their characteristics and the quality of the underlying data, to combine the independent views into an optimized understanding of total population movement nationwide. This process produces a robust and accurate understanding of the total moving population: where people are coming from and going to, what they pass by, when they travel, where they live and work, and what mode they are using.
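As a rough illustration of combining independent views with quality-based weights, the sketch below takes a weighted average of three volume estimates for a single segment. The actual optimization is not specified in this document, so the weighted-average form, the `Estimate` class, and all volumes and weights are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Estimate:
    volume: float  # estimated daily trips on one segment
    weight: float  # hypothetical quality weight (higher means more trusted)


def combine(estimates: List[Estimate]) -> float:
    """Weighted average of independent volume estimates for one segment."""
    total_weight = sum(e.weight for e in estimates)
    if total_weight <= 0:
        raise ValueError("at least one estimate needs a positive weight")
    return sum(e.volume * e.weight for e in estimates) / total_weight


# Hypothetical values: a count-based estimate is trusted most here.
reference = Estimate(volume=10_000, weight=3.0)
sampled = Estimate(volume=9_000, weight=1.5)
modeled = Estimate(volume=12_000, weight=0.5)
optimized = combine([reference, sampled, modeled])
```

In this toy case the combined value sits closest to the Reference estimate because it carries the largest weight.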
For the time period under analysis, the Optimized movement represents the total flow of trips within the analysis area at a block group level. This data is produced as a combination of three data structures: origin-destination matrices that represent the number of trips by purpose (i.e. home-based work, home-based school, home-based other, and non-home-based) between block groups; path files that store the routes these trips take within the transportation system; and roadway segment information including hourly traffic volumes (stratified by trip purpose and a variety of demographic attributes of roadway users) and operating speeds.
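The three data structures described above can be pictured with the following sketch. The field names, block group identifiers, and the assumption of a single path per origin-destination pair are illustrative and do not reflect the actual Streetlytics file layouts.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

TRIP_PURPOSES = ("home-based work", "home-based school",
                 "home-based other", "non-home-based")

# Origin-destination matrix: (origin block group, destination block group,
# trip purpose) -> number of trips.
ODMatrix = Dict[Tuple[str, str, str], float]


@dataclass
class Path:
    origin_bg: str
    destination_bg: str
    segment_ids: List[str]  # ordered roadway segments along the route


@dataclass
class Segment:
    segment_id: str
    hourly_volume: Dict[int, Dict[str, float]] = field(default_factory=dict)
    hourly_speed: Dict[int, float] = field(default_factory=dict)


def segment_volumes(od: ODMatrix, paths: List[Path]) -> Dict[str, Dict[str, float]]:
    """Accumulate OD trips onto every segment of the matching path, by purpose."""
    volumes: Dict[str, Dict[str, float]] = {}
    for path in paths:
        for purpose in TRIP_PURPOSES:
            trips = od.get((path.origin_bg, path.destination_bg, purpose), 0.0)
            if trips == 0:
                continue
            for seg in path.segment_ids:
                by_purpose = volumes.setdefault(seg, {})
                by_purpose[purpose] = by_purpose.get(purpose, 0.0) + trips
    return volumes


# Hypothetical inputs: two purposes between one pair of block groups.
od = {("BG_A", "BG_B", "home-based work"): 120.0,
      ("BG_A", "BG_B", "non-home-based"): 30.0}
paths = [Path("BG_A", "BG_B", ["seg_1", "seg_2"])]
volumes = segment_volumes(od, paths)
```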
Hourly traffic speeds on each roadway segment in the Streetlytics products come directly from HERE. Its probe source fleet is a mixture of connected vehicles, fleet telematics sources, in-vehicle navigation systems, and mobile apps. HERE uses a variety of techniques to validate and normalize probe data, casting out clearly invalid samples, but Traffic Analytics Speed Data passes through the sample variance to allow customized handling of the sampling distribution. The hourly speed product represents the average travel speed on each roadway segment during the observation period.
Pedestrian Process
Pedestrian travel behavior is fundamentally different from vehicle travel behavior because of short trip lengths and highly variable travel patterns. Because of this difference, an alternative method is used to develop the Streetlytics pedestrian product; however, this method leverages most of the same data sources.
To develop the Streetlytics pedestrian model, the pedestrian counts were divided into a training set, representing 80 percent of the counts, and a test set, representing 20 percent of the counts. Many sensitivity tests were used to identify the best algorithm for the model, select model training variables, and refine the model calibration. To evaluate the results of each test, the model was estimated with the training set and applied to the count locations included in the test set. The test set was validated with a scatter plot, with statistics such as R squared and % RMSE, and with a visual geospatial review.
During these sensitivity tests, the training set and test set were randomly redrawn to control for any bias introduced by how the training set was defined. These tests informed both the selection of the AdaBoost Regression ML method and the list of pedestrian variables provided above. The tests were also used to identify and review outliers in the pedestrian source datasets.
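The 80/20 split, its random redrawing, and the two validation statistics can be sketched as follows. The model-fitting step is deliberately left out; scikit-learn's `AdaBoostRegressor` could fill that role, although the source does not name the library used. The % RMSE formula shown (RMSE divided by the mean observed count, times 100) is the common definition and is assumed here, as are the function names.

```python
import math
import random
from typing import List, Sequence, Tuple


def split_counts(locations: List, train_fraction: float = 0.8,
                 seed: int = 0) -> Tuple[List, List]:
    """Randomly split count locations into training and test sets."""
    shuffled = locations[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(round(len(shuffled) * train_fraction))
    return shuffled[:cut], shuffled[cut:]


def r_squared(observed: Sequence[float], predicted: Sequence[float]) -> float:
    """Coefficient of determination of predictions against observed counts."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1 - ss_res / ss_tot


def percent_rmse(observed: Sequence[float], predicted: Sequence[float]) -> float:
    """Root mean squared error as a percentage of the mean observed count."""
    mse = sum((o - p) ** 2 for o, p in zip(observed, predicted)) / len(observed)
    mean_obs = sum(observed) / len(observed)
    return 100 * math.sqrt(mse) / mean_obs


# Redraw the split with several seeds, as in the sensitivity tests.
location_ids = list(range(10))  # hypothetical count location IDs
splits = [split_counts(location_ids, seed=s) for s in range(3)]
```

Changing the seed changes which locations land in the test set while keeping the 80/20 proportion, so the statistics can be compared across draws.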
The following figure shows the model estimate versus the observed count for the 20 percent of locations not used in training the model. This analysis provides an understanding of how well the model performs on locations where a count is not available.
Validation of Test Set (20% of Counts)
All counts greater than 100,000 pedestrians per day were included in the product directly and therefore were not included in this figure.
The overall R squared and % RMSE values for this model are 0.80 and 160, respectively.
Home Location Estimation
The Home Locations process is the step of Streetlytics that allocates a home location distribution, at the block group level, to every segment in the country. It is carried out for both vehicles and pedestrians, each with a different methodology, and it uses AirSage home location data.
AirSage sources its data from various data partners. These sources include carrier data, which AirSage has been able to source for many years, and GPS data from smartphone app SDKs. This data requires dedicated cleansing before it can be used in AirSage products. AirSage systematically evaluated more than 50 potential providers in the U.S. market before selecting the strongest for its data panel. AirSage uses metrics such as user persistency, temporal distribution, and geo-temporal distribution to identify patterns of movement that are consistent with a mobile person. AirSage screens out devices with an abnormal number of sightings and locations with an abnormal number of devices seen there. AirSage also blacklists or whitelists apps according to whether they produce a significant number of non-genuine readings, and uses data from devices that provide a continuous, trackable path throughout the day. After all of this screening, AirSage still ingests more than 120 billion records per month, which form the basis of a monthly nationwide trip matrix for Geopath.
AirSage uses its proprietary algorithm to analyze the GPS sightings generated by each sample device across the nation on a monthly basis. Using pattern recognition techniques, each device is assigned a robust home and work location, which is then tagged to a census block group (this data has been proven to be the most accurate in the market). The technique also checks home location correctness by comparing the data with census-based population. AirSage then uses the device sample and the census population count to compute device weights, which are used to represent the mobility pattern of the population. Trips are further weighted by the device penetration weight to represent population movement.
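AirSage's weighting algorithm is proprietary, so the following is only a sketch of one common expansion approach: each block group's weight is its census population divided by the number of sample devices whose home falls there. The function names and all figures are hypothetical.

```python
from typing import Dict, List, Tuple


def device_weights(census_population: Dict[str, int],
                   device_homes: Dict[str, int]) -> Dict[str, float]:
    """Weight per home block group = census population / sample devices."""
    weights = {}
    for block_group, population in census_population.items():
        devices = device_homes.get(block_group, 0)
        if devices > 0:  # block groups without sample devices get no weight
            weights[block_group] = population / devices
    return weights


def expand_trips(sample_trips: List[Tuple[str, float]],
                 weights: Dict[str, float]) -> float:
    """Scale sampled trips, keyed by the device's home block group."""
    return sum(trips * weights.get(home_bg, 0.0)
               for home_bg, trips in sample_trips)


# Hypothetical block groups, populations, and device counts.
census = {"BG_A": 1000, "BG_B": 600}
devices = {"BG_A": 50, "BG_B": 20}
weights = device_weights(census, devices)
population_trips = expand_trips([("BG_A", 3), ("BG_B", 2)], weights)
```

Here each device in BG_A stands for 20 residents and each device in BG_B for 30, so five sampled trips expand to 120 population trips.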
AirSage applies detailed quality assurance (QA) measures to check for reasonableness of trips at an aggregate level. The QA checks listed below represent a subset of the checks performed by AirSage.
Check for consistency in trip rates at a county and state geography
Check for distribution by trip purpose
Check for travel time distribution by trip purpose at a county geography
Check for travel distance distribution by trip purpose at a county geography
Check for trip distribution by time of day at a county geography
Check for trip magnitudes by day of week
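The first check in the list, consistency of trip rates between county and state geographies, might be sketched as below. The 25 percent tolerance, the function name, and the county figures are hypothetical placeholders and are not AirSage's actual criteria.

```python
from typing import Dict, List


def flag_inconsistent_counties(trips_by_county: Dict[str, float],
                               population_by_county: Dict[str, float],
                               tolerance: float = 0.25) -> List[str]:
    """Return counties whose trips per person deviate from the statewide
    rate by more than the given relative tolerance."""
    state_rate = (sum(trips_by_county.values())
                  / sum(population_by_county.values()))
    flagged = []
    for county, trips in trips_by_county.items():
        rate = trips / population_by_county[county]
        if abs(rate - state_rate) / state_rate > tolerance:
            flagged.append(county)
    return flagged


# Hypothetical counties: C3 has an implausibly high trip rate.
county_trips = {"C1": 4000, "C2": 8200, "C3": 900}
county_population = {"C1": 1000, "C2": 2000, "C3": 100}
flagged = flag_inconsistent_counties(county_trips, county_population)
```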
In the Streetlytics process for vehicles, home locations are calculated by joining the AirSage data with a path file generated by the trip model. Vehicle home locations are calculated for every link in the country where vehicles are allowed. Pedestrian home locations are based directly on the AirSage data. All trip modes from that data source are considered, under the assumption that every trip starts and ends with a pedestrian leg.
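The vehicle join can be pictured as follows: each trip carries the home block group of the traveler, and the path file says which links that trip uses, so every link accumulates a distribution of home block groups. The record layouts, the single path per origin-destination pair, and the identifiers are assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# (origin block group, destination block group, home block group, trips)
TripRecord = Tuple[str, str, str, float]


def link_home_distribution(
    trips: List[TripRecord],
    paths: Dict[Tuple[str, str], List[str]],
) -> Dict[str, Dict[str, float]]:
    """Join trips to their paths and return, for each link, the share of
    traffic contributed by each home block group."""
    totals: Dict[str, Dict[str, float]] = defaultdict(lambda: defaultdict(float))
    for origin, destination, home_bg, count in trips:
        for link in paths.get((origin, destination), []):
            totals[link][home_bg] += count
    distribution = {}
    for link, by_home in totals.items():
        link_total = sum(by_home.values())
        distribution[link] = {bg: c / link_total for bg, c in by_home.items()}
    return distribution


# Hypothetical trips and paths: link_2 is shared by both flows.
trips = [("BG_A", "BG_B", "BG_A", 30.0), ("BG_C", "BG_B", "BG_C", 10.0)]
paths = {("BG_A", "BG_B"): ["link_1", "link_2"], ("BG_C", "BG_B"): ["link_2"]}
home_shares = link_home_distribution(trips, paths)
```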
A few assumptions are used during this process, as listed below:
Because the Home Locations process is very computationally intensive, the roadway network used in the process is a simplification of the actual network. The simplified network is made up of “super links”, which assume that links next to each other have the same home location distribution.
For segments where there is insufficient AirSage or path file data, the home location distribution is based on the block groups that fall inside a pre-defined buffer around the corresponding super link. The radius of the buffer varies with the type of link. For example, a highway link has a larger buffer than a residential street. This assumption applies only to vehicle home locations.
Pedestrian home locations are only provided for segments where pedestrian volumes are higher than zero.
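The buffer-based fallback for vehicle home locations could be sketched as below. The radii per link type, the use of block group centroids in planar kilometer coordinates, and the population weighting are all assumptions; the source states only that the buffer radius depends on the link type.

```python
import math
from typing import Dict, Tuple

# Hypothetical buffer radii in kilometers, by link type.
BUFFER_RADIUS_KM = {"highway": 5.0, "arterial": 2.0, "residential": 0.5}

# Block group -> (x_km, y_km, population); centroid coordinates are assumed
# to be in a planar projection.
BlockGroups = Dict[str, Tuple[float, float, float]]


def fallback_home_distribution(link_xy: Tuple[float, float], link_type: str,
                               block_groups: BlockGroups) -> Dict[str, float]:
    """Distribute home locations across block groups whose centroid lies
    inside the link's buffer, in proportion to population."""
    radius = BUFFER_RADIUS_KM[link_type]
    inside = {
        bg: population
        for bg, (x, y, population) in block_groups.items()
        if math.hypot(x - link_xy[0], y - link_xy[1]) <= radius
    }
    total = sum(inside.values())
    if total == 0:
        return {}
    return {bg: population / total for bg, population in inside.items()}


# Hypothetical block groups at increasing distance from the link.
nearby = {"BG_A": (0.3, 0.0, 100), "BG_B": (1.5, 0.0, 300), "BG_C": (4.0, 0.0, 600)}
arterial_homes = fallback_home_distribution((0.0, 0.0), "arterial", nearby)
residential_homes = fallback_home_distribution((0.0, 0.0), "residential", nearby)
```

With these toy inputs the residential buffer reaches only the closest block group, while the arterial buffer also picks up the second one.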