Analysis of Taxi Data for Understanding Urban Dynamics
Authors
Abstract
The growth of urban areas poses both challenges and opportunities. The challenges stem from the increased demand for resources and services. The opportunities lie in the development of new services and in the data that urban areas collectively produce, which can help to better understand urban mobility. The taxi can be perceived as a probe for traffic conditions. Additionally, its flexibility and ubiquity make it possible to collect large data sets that are essential for studying urban mobility. In this study we explore a data set of taxi-GPS traces, collected in Lisbon, Portugal, to understand to what extent taxi data can represent urban mobility. More specifically, we aimed to answer three research questions: (A) Is it possible to develop a model to estimate taxi demand throughout the city? (B) Are urban data sources correlated with one another? In particular, is taxi activity correlated with mobile phone activity, given that these are two of the major urban data sources? (C) Can taxi data be used as a probe to infer the concentrations of exhaust gases in urban areas? To aid the analysis, additional data sets covering the same spatiotemporal period were collected, regarding mobile phone activity, atmospheric pollutants and meteorological conditions.
In order to develop a model to estimate taxi demand, an exploratory analysis was first performed. The study was able to visualize the spatiotemporal variation of taxi activity, identifying the main pick-up and drop-off locations and the busiest hours, and to observe that trip distance and duration follow Gamma and Exponential distributions. The study was also able to identify the links between pick-up and drop-off locations, observing strong links between public transportation hubs. Additionally, an analysis of taxi driver behavior during downtime was performed. The analysis of taxi-GPS traces from top drivers has shown specific strategies used to maximize profit: either waiting for passengers at locations associated with the main public transportation hubs during specific hours of the day, or avoiding long trips to the next pick-up location. The inference analysis explored the possibility of estimating the next pick-up area given the current location (last drop-off), day of the week, hour, weather conditions and area type (characterized by points of interest). The inference engine is based on a naïve Bayesian classifier, which achieved an accuracy of 56.3% on the training sample. Current location turned out to be the main contributor to the algorithm, whereas weather conditions were the variable with the least weight in the calculation.
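The next pick-up inference described above can be illustrated with a minimal sketch of a categorical naïve Bayes classifier. This is not the study's implementation: the class name, the feature encoding, the toy records and the Laplace smoothing parameter `alpha` are all assumptions made for illustration only.

```python
import math
from collections import Counter, defaultdict


class CategoricalNaiveBayes:
    """Naive Bayes over categorical features with Laplace smoothing (illustrative)."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha  # smoothing strength; an assumption, not taken from the study

    def fit(self, rows, labels):
        self.class_counts = Counter(labels)
        n_features = len(rows[0])
        # counts[j][label][value] = occurrences of `value` for feature j within class `label`
        self.counts = [defaultdict(Counter) for _ in range(n_features)]
        # distinct values seen for each feature, used in the smoothing denominator
        self.values = [set() for _ in range(n_features)]
        for row, label in zip(rows, labels):
            for j, value in enumerate(row):
                self.counts[j][label][value] += 1
                self.values[j].add(value)
        return self

    def log_scores(self, row):
        """Return log P(class) + sum_j log P(feature_j | class) for every class."""
        total = sum(self.class_counts.values())
        scores = {}
        for label, n_c in self.class_counts.items():
            score = math.log(n_c / total)
            for j, value in enumerate(row):
                num = self.counts[j][label][value] + self.alpha
                den = n_c + self.alpha * len(self.values[j])
                score += math.log(num / den)
            scores[label] = score
        return scores

    def predict(self, row):
        scores = self.log_scores(row)
        return max(scores, key=scores.get)


# Hypothetical records: (last drop-off area, day type, hour band, weather, area type)
rows = [
    ("airport", "weekday", "morning", "clear", "transport"),
    ("airport", "weekday", "evening", "rain", "transport"),
    ("downtown", "weekend", "night", "clear", "leisure"),
    ("downtown", "weekday", "morning", "rain", "business"),
    ("station", "weekday", "morning", "clear", "transport"),
    ("station", "weekend", "evening", "clear", "transport"),
]
# Hypothetical target: the area of the next pick-up
labels = ["airport", "airport", "downtown", "downtown", "station", "station"]

model = CategoricalNaiveBayes(alpha=1.0).fit(rows, labels)
predicted = model.predict(("airport", "weekday", "morning", "clear", "transport"))

# Accuracy measured on the training sample, the same kind of figure the abstract reports
training_accuracy = sum(
    model.predict(row) == label for row, label in zip(rows, labels)
) / len(rows)
```

Because the classifier multiplies per-feature likelihoods, a feature whose values separate the classes sharply (here the last drop-off area) dominates the score, which is consistent with the observation that current location contributes most and weather least.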
The investigation of the relationship between taxi and mobile phone activity started with an exploratory analysis of mobile phone call intensity. The study showed a fairly regular pattern, consistent throughout the day and across the entire time series. During data analysis, a significant correlation between taxi volume and mobile phone call intensity was found, with a coefficient of determination of 0.8047. The strongest correlation was observed during the active hours of the day (8 AM-10 PM) and the active days of the week (weekdays), in areas with medium and high taxi activity. Moreover, mobile phone call intensity had a significant correlation with the taxi volume of the previous two hours. Furthermore, we found that this inter-predictability could be modeled with a linear function and that it varied across different times of the day.
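The lagged relationship between taxi volume and call intensity can be sketched as follows. This is a minimal illustration under stated assumptions, not the study's procedure: the function names, the hourly toy series and the one-hour lag are hypothetical, and the toy call series is constructed to depend linearly on the previous hour's taxi volume.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant series")
    return sxy / math.sqrt(sxx * syy)


def lagged_r2(taxi, calls, lag):
    """R^2 between calls at hour t and taxi volume at hour t - lag."""
    if lag < 0 or lag >= len(taxi) - 1:
        raise ValueError("lag must be non-negative and leave at least two points")
    return pearson_r(taxi[: len(taxi) - lag], calls[lag:]) ** 2


def fit_line(x, y):
    """Ordinary least-squares slope and intercept for y = slope * x + intercept."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum(
        (a - mx) ** 2 for a in x
    )
    return slope, my - slope * mx


# Hypothetical hourly taxi volumes for one area
taxi = [5, 9, 14, 20, 17, 12, 8, 6, 11, 15]
# Hypothetical call intensity built as 3 * (previous hour's taxi volume) + 10
calls = [40, 25, 37, 52, 70, 61, 46, 34, 28, 43]

r2_same_hour = lagged_r2(taxi, calls, lag=0)
r2_one_hour_lag = lagged_r2(taxi, calls, lag=1)
slope, intercept = fit_line(taxi[:-1], calls[1:])
```

Comparing `lagged_r2` across several lags, and repeating the fit separately for different hour bands, is one simple way to examine how such a linear relationship varies through the day.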
To model and estimate the concentration of exhaust gases, taxi activity and meteorological conditions (temperature, wind, humidity, and weather conditions) were considered. The study revealed the daily and seasonal patterns of exhaust gases, how they are correlated with the weather conditions, and how nitrogen dioxide, a marker for atmospheric pollution, is strongly correlated with other exhaust gases. Using a multilayer perceptron with 15 hidden layers and a sigmoid activation function, we were able to estimate nitrogen dioxide concentrations with a coefficient of correlation of 0.7869, showing a relationship between exhaust gas concentration and other urban variables, especially at traffic stations. A multicollinearity analysis was applied to ensure that the predictor variables were not correlated with one another and to avoid overfitting the model.
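The abstract does not state which multicollinearity test was used. As one simple possibility, the sketch below screens predictor variables by pairwise Pearson correlation and flags pairs above a threshold. The function names, the 0.8 threshold and the toy measurements are assumptions made for illustration; a variance inflation factor analysis would be a common alternative.

```python
import math
from itertools import combinations


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant series")
    return sxy / math.sqrt(sxx * syy)


def correlated_pairs(predictors, threshold=0.8):
    """Return (name_a, name_b, r) for every predictor pair with |r| above threshold."""
    flagged = []
    for a, b in combinations(predictors, 2):
        r = pearson_r(predictors[a], predictors[b])
        if abs(r) > threshold:
            flagged.append((a, b, r))
    return flagged


# Hypothetical hourly predictor values; humidity is constructed as 100 - 2 * temperature
predictors = {
    "taxi_volume": [30, 42, 55, 61, 48, 37, 29, 33],
    "temperature": [12.0, 14.5, 17.0, 21.0, 19.5, 15.0, 13.0, 11.5],
    "humidity": [76.0, 71.0, 66.0, 58.0, 61.0, 70.0, 74.0, 77.0],
    "wind_speed": [3.1, 5.4, 2.2, 4.8, 6.0, 1.9, 3.7, 5.1],
}

flagged = correlated_pairs(predictors, threshold=0.8)
```

When a pair is flagged, one of the two variables would typically be dropped or the pair combined before training, so that the network does not receive redundant inputs.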
This study contributes to a better understanding of the complex interactions among diverse urban data sources. Our findings unveil, to some extent, the relationships between different urban data sources, especially the role of taxi service as a predictor variable for other urban variables.