How to crunch 93,000,000 numbers

How to crunch 93,000,000 numbers

Glen McGregor is a tech-savvy journalist who, more than most Ottawa reporters, has embraced open data as a reporting platform. Today, the Ottawa Citizen published McGregor's analysis of a full year of OC Transpo's GPS data.

(If you're unclear about what exactly we're talking about when we refer to OC Transpo's GPS data, check out our report on it from late last year.)

That GPS data reveals how long it took for every bus on every route to reach its destination. McGregor's analysis found that 47 percent for all buses between April 2009 and March 2010 showed up within five minutes of their expected arrival times. He also found that 41 percent of buses arrived too early.

McGregor took some time today to tell OpenFile what it took to acquire and manage such an immense set of data. He referred to some fairly technical concepts, so novices beware.

OpenFile: Did you get the OC Transpo GPS data before it was released publicly through the open data pilot project?

McGregor: Yes, I requested the data in the spring of 2010 through the Municipal Freedom of Information and Protection of Privacy Act. I had to go in to meet with the city's data guy, because they were baffled by my request. The city showed me the structure of the database that captures the transmissions from the buses via cellular network. Once we figured out what was there, it took a few weeks for them to output a year's worth of data.

How long did it take to get the data, and what were obstacles in collecting it?

The big problem with the data was the sheer volume of it. The city gave me the data on five DVD-roms. Every bus uploads its data to the server every two minutes, so over the full year there were 93 million records. This is the largest dataset I've ever worked with. I used [open-source database] MySQL to crunch the numbers. It can handle huge datasets where Access and Excel would collapse.

But the size of the data made it difficult to work with, even in MySQL. Queries that would normally take a split second would run for three, four or even ten minutes, depending on their complexity. I had to abandon using the GUI [graphical user interface, the visual program a user typically works through] and run all the queries from the much more stable UNIX command line.

The big problem analyzing it was accounting for a hugely distorted number of buses that were recorded as arriving exactly on-time. There were about 100 times as many buses that registered an adherence figure of 0 seconds than there should have been, based on the distribution of other adherences (I put them all on a curve). It turns out the GPS units start returning zeroes about 10 minutes before a bus goes into service. So I had to interpolate the correct number of zeroes.

Was the OC Transpo GPS Application Programming Interface (or API, the interface that allows programs to share information) usable?

I didn't use the API. It wouldn't have been suitable for analyzing data across the entire route, all day, for a full year. It would have taken way too many API requests. I'm hopeful that the city will proceed with the API [that was briefly released to the public in December], though. Local developers have done some great work creating apps to use the API and I want one for my iPhone.

Looking at so much data, I have both sympathy and respect for Vincent Patterson, OC Transpo's reliability guy. The complexity of keeping so many buses on schedule, against such wildly varying potential delays, is astounding. I think it's to OC Transpo's credit they are trying to measure performance using GPS, even though the numbers may not be flattering.

Blog photo by mellyjean via Flickr.

Open Blog