Monthly Archives: October 2012

Making GTFS query more convenient

I have been spending a lot of time parsing the GTFS database. On the surface it is just a simple CSV files. But to extract useful information from GTFS is often unexpected difficult. For example, find the stops from a bus line in sequential order might sounds like basic thing to do. But it is actually non-trivial with GTFS.

One reason is transit service is more complex it seems. It might seems a bus service just hit all the stops in sequence. But the actual service has a lot of variables. The schedule is often different in weekend compare to weekdays. And so does the exact route that it covers. Sometimes a bus is scheduled to run a short route rather than covering the whole length. In more complex case there can be branching where there is a common main trunk and then the buses split to serve two or more alternative destination.

This is the reason why in GTFS one “route” may associate with multiple “shapes”. To find out what shapes are associate with a route, we will have to make a query like this

SELECT
 shape_id
FROM route
 JOIN trips
 JOIN shape
GROUP BY shape_id;

To find out the stops is even more complex. Here we need to join one more table the stop_times. It is also the biggest tables in the GTFS. So this is also the most computation intensive query to do.

SELECT
 shape_id, stop_id
FROM route
 JOIN trips
 JOIN stop_times
 JOIN stops
GROUP BY shape_id, stop_id;

Still most people have a clear concept of what a transit line is where it runs. It shouldn’t be such a pain to compute. A more useful structure should look like below.

    GTFS             More Useful
  Structure           Structure

    route              line
     |                   |
     |                   V
     |                 route*
     |                   | \
     |    shape          |  +-> route_shape
     |     ^             |  |
     |    /              |  +-> route_stops*
     |   /               |
     V  /                V
    trips              trips
     |                   |
     |        stops      |          stops
     |        ^          |
     |       /           |
     V      /            V
    stop_times         stop_times

Here a shift the terminology a bit. The top level entity is a line (i.e. GTFS’ route). This is service that people know of, like a numbered bus line or a metro line. Below that is routes. These are the collection of alternative routes a line may run. The routes are not explicitly represented in GTFS. You can find that by querying all unique shape_id using the first SQL. Another missing piece is the stops. If we can pre-compute all the route_stops using the second SQL once, for the most part we don’t need the giant stop_times table. For applications that do not deal with scheduled time, this is a huge saver. The is one assumption my structure makes though. It is that different lines do not shape that same route. If should be a reasonable assumption. And if there is indeed share route and shape, we should just replicated them as two separate entities.

The original GTFS structure seems to have a transit operator centric view. It allows them maximum flexibility to author and publish their service data. But for application developers, it is not structured for easy traversal. By adding the route and route_stops tables as indicated, it will greatly facilitate the query and operation of transit information.

Leave a comment

Filed under Uncategorized

Expected Wait Time and Variance in Headway

Consider that there are 6 buses in an hour. The time between the arrival of the next bus is called headway. If there are evenly spaced out, the headway will be 10,10,10,10,10,10 minutes. But in a city with unpredictable traffic, the arrival time is often disturbed. What will be the perceived performance in those situation? For example, if the only second bus is delayed by 5 minutes, the headway will be 10,15,5,10,10,10. Or consider a more extreme case where the last three bus bunches together so that after a long gap, they all arrive together within a minutes. That gives 10,10,10,28,1,1. If we just takes the arithmetic mean, it will be 10 in all cases. This number of course does not describe that disturbance it causes. Instead we will calculate the expected wait time over the entire period of time.

Define W(t) as the wait time when a person arrive the station at time t. The W(t) chart looks like a sawtooth. It is highest when the passenger arrive just after a bus has left, and it will be 0 when the passenger arrive . Here t_{i} is when bus i arrives., h_{i} is the headway to the next bus and T is the sum of t_{i} .

The expected wait time is given by this formula

\int_{0}^{T} W(t)P(t)dt

P(t) is the probability density function of people arrive at time t. We assume people come uniformly. So P(t) is simply \frac{1}{T} .

\frac{1}{T} \sum_{i=1}^{n} \int_{t_{i-1}}^{t_{i}} W(t)dt

= \frac{1}{T} \sum_{i=1}^{n} \frac{ {h_{i}}^{2}}{2}

= \frac{ \sum h_{i}^{2}} {2T}

In general communication, it is often more natural to talk about frequency than expected wait time. For example we often say the bus arrive every 10 minutes. But it requires a little bit of math to translate it to 5 minutes of expected wait time. Since the expected wait time is just half of the headway, we can multiple the formula above by to get a effective frequency. We can implement a simple function to calculate the effective frequency below.

>>> def effective_frequency(times):
...         h2 = sum(t*t for t in times)
...         T = sum(times)
...         return float(h2)/T
...
>>> effective_frequency([10,10,10,10,10,10])
10.0
>>> effective_frequency([10,15,5,10,10,10])
10.833333333333334
>>> effective_frequency([10,10,10,28,1,1])
18.1

As expected the last example give a much worst effective frequency 18.1 minutes than the baseline of 10 minutes.

Relation with Variance

It is interesting to study the relationship between expected wait time and the variance of the headway. This is given by the formula.

\sigma^2 = \frac {\sum h_i^2 - (\sum h_i)^2/N}{N}

\frac { \sigma^2 N } {T} = \frac {\sum h_i^2}{T} - \frac {T^2/N}{T}

\frac { \sigma^2 N } {T} + \frac {T}{N} = \frac {\sum h_i^2}{T}

\frac { \sigma^2 }{ \mu } + \mu = \frac {\sum h_i^2}{T} where \mu = \frac{T}{N}

or \frac {\sigma^2}{\mu} + \mu = 2 × Expected Wait Time = effective frequency

There is a very nice interpretation of effective frequency. First of all it is smallest when \sigma^2 is 0. That is when the headway are evenly distributed, it will achieve the optimal frequency μ. Secondly effective frequency is larger than μ by \frac {\sigma^2}{\mu} . So the variance can be interpret as the penalty above the baseline frequency. In order to minimize wait time, it is necessary to minimize variance.

One implication is how to mitigate missed run. If it is necessary to skip a bus run due to equipment break down or availability of drivers, it is better to redistribute the remaining run evenly than to simply miss a run.

>>> effective_frequency([10,20,10,10,10])
13.333333333333334
>>> effective_frequency([12,12,12,12,12])
12.0

Using the formula above, dropping a run with headway 10,20,10,10,10 gives an effective frequency of 13.3. This is worst than if we can redistribute the remain 5 buses to get a frequency of 12.

Leave a comment

Filed under Uncategorized