Forecasting bus ridership with trip planner usage data: a machine learning application
Public transport operators currently pay close attention to environmental impact, costs and traveler satisfaction. Good short-term demand forecasting models can help improve these performance indicators. They can help prevent denied boarding and overcrowding in buses by detecting insufficient capacity beforehand. They could be used to operate more economically by decreasing the frequency or the size of the bus where there is overcapacity. Moreover, they could help operators plan their buses for incidental occasions such as big public events, for which little information is available. Finally, they could be used to reliably inform travelers about current crowdedness.
This study investigates the usefulness of a new data source: the usage data of a trip planner. In the Netherlands, multiple trip planners are available to help users find optimal (multimodal) journeys. These trip planners require a date, a time, an origin and a destination, which they use to construct multiple alternative journeys from which the user can choose. This study used the data of 9292, the major trip planner in the Netherlands, which covers all public transport modes.
We developed one model for forecasting the number of people boarding and another for forecasting the number of people alighting at a given stop. These forecasts are defined at the vehicle-stop level. The forecasted number of passengers on board after a stop is calculated by summing the forecasted boardings and subtracting the forecasted alightings over all stops of the trip up to and including that stop.
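As an illustration of how the two forecasts combine into a passenger load, here is a minimal Python sketch. The function name `forecast_load`, the `initial_load` parameter and the example numbers are assumptions made for illustration; they do not come from the study.

```python
from itertools import accumulate


def forecast_load(boardings, alightings, initial_load=0.0):
    """Return the forecasted number of passengers on board after each stop.

    boardings[i] and alightings[i] are the forecasts for stop i of one trip,
    in the order in which the vehicle serves the stops.
    """
    if len(boardings) != len(alightings):
        raise ValueError("boardings and alightings must have the same length")
    # Net change in passengers at each stop.
    net_change = [b - a for b, a in zip(boardings, alightings)]
    # Running total along the trip; drop the leading initial value.
    return list(accumulate(net_change, initial=initial_load))[1:]


# Hypothetical forecasts for a trip with four stops.
boardings = [12.0, 5.0, 3.0, 0.0]
alightings = [0.0, 4.0, 6.0, 10.0]
loads = forecast_load(boardings, alightings)
print(loads)  # [12.0, 13.0, 10.0, 0.0]
```

Each entry in `loads` can then be compared with the capacity of the vehicle to flag stops where overcrowding is expected.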
We compare five machine learning models: multiple linear regression, decision tree, random forest, neural network and support vector regression with a radial basis function kernel. We compare these models with two simple rules: (1) predict the same number as last week, and (2) predict the historic average. The models are implemented with the scikit-learn library for Python. The data is stored in a PostgreSQL database.
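The two simple rules can be sketched in a few lines of Python. This is an illustrative sketch only: the function names, the weekly example counts and the choice of mean absolute error as the score are assumptions, not details taken from the study.

```python
from statistics import mean


def last_week_baseline(history):
    """Rule 1: predict the count observed one week earlier."""
    return history[-1]


def historic_average_baseline(history):
    """Rule 2: predict the average of all earlier observations."""
    return mean(history)


def walk_forward_mae(series, rule):
    """Mean absolute error when each week is predicted from the weeks before it."""
    errors = [
        abs(series[t] - rule(series[:t]))
        for t in range(1, len(series))
    ]
    return mean(errors)


# Hypothetical boardings at one vehicle-stop (same weekday and departure time)
# over five consecutive weeks.
weekly_boardings = [8, 10, 9, 13, 11]

mae_last_week = walk_forward_mae(weekly_boardings, last_week_baseline)
mae_average = walk_forward_mae(weekly_boardings, historic_average_baseline)
print(mae_last_week, mae_average)  # 2.25 1.75
```

A machine learning model is only worth its added complexity if it scores better than both rules on the same data.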
The trip planner datasets and the smart card dataset are merged and preprocessed. The resulting dataset is rather sparse: many stops have zero passengers boarding or alighting, or zero trip planner requests suggesting to do so. We therefore investigated whether subsampling is needed. Useful data is selected from the datasets, features are constructed, and the features are standardized. Different numbers of features are tested; the features are selected by recursive elimination using a simple random forest model. Finally, the hyperparameters of the models are tuned and the optimal configurations are stored. The scores are validated using cross-validation.
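The steps of standardizing, selecting features by recursive elimination, tuning hyperparameters and cross-validating could be chained in scikit-learn roughly as follows. This is a sketch under stated assumptions, not the configuration used in the study: the data is synthetic, support vector regression stands in for any of the five models, and the grid values, the number of folds and the scoring metric are placeholders.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import RFE
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic stand-in for the merged dataset (assumption): 8 features,
# of which only the first two carry signal.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

pipeline = Pipeline([
    # Standardize every feature to zero mean and unit variance.
    ("scale", StandardScaler()),
    # Recursive feature elimination driven by a small random forest.
    ("select", RFE(RandomForestRegressor(n_estimators=10, random_state=0),
                   n_features_to_select=4)),
    # Final regressor; the other four model types could be swapped in here.
    ("model", SVR(kernel="rbf")),
])

# Placeholder grid: number of retained features and one SVR hyperparameter.
param_grid = {
    "select__n_features_to_select": [2, 4],
    "model__C": [1.0, 10.0],
}

# 3-fold cross-validation scores every configuration on held-out folds.
search = GridSearchCV(pipeline, param_grid, cv=3,
                      scoring="neg_mean_absolute_error")
search.fit(X, y)

best = search.best_params_
print(best, search.best_score_)
```

Placing the scaler and the feature selection inside the pipeline means they are refitted on the training folds only, so the cross-validated score is not influenced by the held-out data.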
Find more details in the following contributions by Jop van Roosmalen: Transit Data workshop presentation and MSc thesis