2 Finding Data in the Real World

You’re probably a data scientist. A data analyst? A data-type person… so you’ll hopefully get me when I say that the foundation of data projects is, unsurprisingly, the data! Okay. In the context of flight data, there’s a whole bunch of sources and, uh, well, they’re not all super conducive to open-source.

I like the analogy of boxes.¹ Flight data, like a bunch of types of data, are served in nice boxes whose modern development stand-in is something like an API. An API is a nice way of letting you sift through someone’s data and pick out of the box what you like.

These flight data boxes are generally super nice. But super nice things are generally not cheap. So, in my case, it took some searching out in the world to identify some open-source boxes that are maintained by some very dedicated and praiseworthy people:

- The OpenSky Network for open air traffic data.
- ADSBDB for a bunch of aircraft registration info.
- adsb.lol for aircraft route info.
- The United Fleet Website, maintained by enthusiasts who just like to track United’s mainline fleet.

It’s worth noting here that people who put free boxes out into the world with high quality data often have some stipulations about how you use the stuff in their box. For example, it’s not very cool of you to take stuff from their box and then decide to sell the items around the corner for $5.99. That’s to say that you should have a look at the terms and conditions for publicly available APIs; for example, OpenSky’s contributor data is for research purposes and probably not suitable for commercial use.

A lot of that searching meant going down a series of Google rabbit holes. Google’s integration of LLMs into its search also sped up the process significantly, since I could ask Google in plain English for some examples of open-source APIs that provide flight data.

Essentially, the data hookups for this project are as follows:

1. Query the Google Sheet maintained by the United Fleet Website to find out the registrations for active aircraft in United’s mainline fleet.
2. Run the registrations through an endpoint provided by ADSBDB that will convert registrations into icao24 (mode-S) information for use with OpenSky.
3. Pull a random sample from point 2 (why? See Section 2.1.1).
4. Grab the tracks of the sampled aircraft, and fail gracefully when that aircraft is not actually in the air.
5. Do point 4 but for state vector information (i.e., information about the plane’s current position, its velocity, altitude, so forth).
6. Get the route information based on the callsign contained in the track/state vector information.
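To make the hookups a bit more concrete, here is a minimal Python sketch of points 2 through 6. It is not the project's actual code: `run_pipeline` and the four injected callables (`to_icao24`, `fetch_track`, `fetch_state`, `fetch_route`) are hypothetical names standing in for the ADSBDB, OpenSky, and route lookups, and the sketch assumes each one returns `None` when it has nothing to give. Point 1 (reading the Google Sheet) is left out; the registrations are simply passed in.

```python
import random
from typing import Callable, Optional


def run_pipeline(
    registrations: list[str],
    to_icao24: Callable[[str], Optional[str]],     # hypothetical: registration -> icao24
    fetch_track: Callable[[str], Optional[dict]],  # hypothetical: icao24 -> track or None
    fetch_state: Callable[[str], Optional[dict]],  # hypothetical: icao24 -> state or None
    fetch_route: Callable[[str], Optional[dict]],  # hypothetical: callsign -> route or None
    sample_size: int = 10,
    rng: Optional[random.Random] = None,
) -> list[dict]:
    """Points 2-6: convert, sample, fetch, and skip aircraft with no data."""
    rng = rng or random.Random()

    # Point 2: registration -> icao24, dropping registrations that don't resolve.
    icao24s = [code for code in map(to_icao24, registrations) if code]

    # Point 3: random sample, capped at the number of aircraft available.
    sample = rng.sample(icao24s, min(sample_size, len(icao24s)))

    results = []
    for icao24 in sample:
        # Point 4: no track means the aircraft is probably on the ground; skip it
        # instead of raising, and don't spend another query on its state vector.
        track = fetch_track(icao24)
        if track is None:
            continue

        # Point 5: same graceful handling for the state vector.
        state = fetch_state(icao24)
        if state is None:
            continue

        # Point 6: route lookup keyed on the callsign (assumed field name).
        callsign = (state.get("callsign") or "").strip()
        route = fetch_route(callsign) if callsign else None
        results.append(
            {"icao24": icao24, "track": track, "state": state, "route": route}
        )
    return results
```

Passing the lookups in as functions keeps the orchestration testable without touching the network, and it means any one box can be swapped out without rewriting the loop.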

2.1 Boxes and Their Pros and Cons

There may be a bunch of alluring boxes for your project. If you’re like me, you’ll continue along your project and notice that there’s something not-so-perfect about some of the boxes you’re pulling from. Documenting what boxes you pull from is important (point 1 from the preface), but why you pull from them is just as important; this means analyzing and writing down why you chose certain data sources to pull from.²

2.1.1 Trade-Off 1: API Credits and Claw Machines

There is a reason why FlightRadar24, FlightAware, Flighty, etc. charge for flight tracking services, and why the free alternatives that you or I might use to check on a given flight are backed by big players like Google or Apple: reliable flight data at the scale that regular consumers demand is not cheap. It’s generally backed by data brokers like OAG and Cirium, whose purpose (at least partially) is to maintain a proverbial box of good information.

The OpenSky Network is an open-source alternative backed by contributors of aircraft data (ADS-B, Mode-S, FLARM, VHF, all the alphabet soup). While it won’t provide all the data that the paid alternatives do (route data specifically; I’ll get to the challenges there in a moment), it provides a lot of value to the project, specifically the mapping component.

Now, one trade-off you’ll get with some boxes is being charged a toll to look inside; that is, usage of the API is capped by a number of credits. One reason (among others) to cap how often an API can be requested is what economics, political philosophy, or whichever field wants to claim credit calls the tragedy of the commons.³ Basically, when everyone has access to a shared resource, they end up overusing it. If everyone has access to a flight data API, too many people will try to query the API for every United Airlines flight in the world every 15 minutes.⁴ Hence the idea of limiting how many times someone can rummage around inside your box in a given period of time.

For OpenSky, the credit limit is 400 credits per day for anonymous (unauthenticated) users, 4,000 a day for registered users - the account is free! - and 8,000 a day for accounts that contribute data to the network.

After all that exposition comes the challenge and the trade-off: when you query OpenSky, OpenSky doesn’t care if you queried an aircraft that’s currently on the ground. You’ll spend 4 credits regardless of whether you receive a track/state vector or not.⁵ It’s kind of like playing the claw machine game - you put in a quarter, you might get something, you might not. But it counts against your budget.
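The claw machine can be sketched as a tiny credit ledger. This is purely local bookkeeping, not anything OpenSky provides: `CreditBudget` is a hypothetical helper, the 4,000 and 4 defaults are the figures quoted above, and `fetch` stands in for whatever function actually calls the API (assumed to return `None` for an aircraft with no data).

```python
from typing import Callable, Optional


class CreditBudget:
    """Hypothetical local ledger: every query is charged, hit or miss."""

    def __init__(self, daily_limit: int = 4000, cost_per_query: int = 4):
        self.daily_limit = daily_limit
        self.cost_per_query = cost_per_query
        self.spent = 0

    @property
    def remaining(self) -> int:
        return self.daily_limit - self.spent

    def query(
        self, fetch: Callable[[str], Optional[dict]], icao24: str
    ) -> Optional[dict]:
        if self.remaining < self.cost_per_query:
            raise RuntimeError("daily credit budget exhausted")
        # Charge before looking at the result: the quarter goes into the
        # machine whether or not the claw comes back with anything.
        self.spent += self.cost_per_query
        return fetch(icao24)


budget = CreditBudget()
budget.query(lambda _icao24: None, "abc123")  # grounded aircraft: no data, still charged
print(budget.remaining)  # 3996
```

Keeping a ledger like this on your side means you find out you are near the cap before the API tells you so.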

Thus, we’re at one of our major decision points with several considerations before us:

- We can’t query the entire fleet due to the credit limit (we might be able to manage one pull a day, but the next bullet point exhibits why this is problematic).
- Planes are constantly on the move, so there’s not much utility in displaying information that’s quickly outdated.
- How do we actually get information from planes that are in the air?

After some thinking, the solution I came up with was the following: take a random sample of United’s active mainline fleet (say, 10 aircraft as of the time of writing this) at a given time interval (say, 30 minutes), and query OpenSky for that random sample.

This approach has the following pros:

- Provides a different snapshot of the fleet throughout the day, giving users something new to look at every time they visit.
- Offers relatively up-to-date information about the aircraft selected, within the time interval (i.e., 30 minutes).
- Depending on the sample size, increases the likelihood that aircraft actually in the air are queried.
- Sample sizes and time intervals can be tuned up to the credit limit: 4,000 credits per day / 4 credits per pull = 1,000 aircraft per day, and 1,000 aircraft per day / 24 hours ≈ 41 aircraft per hour.
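The arithmetic in that last point can be written out as a short Python sketch. The credit figures are the ones quoted in this section; `daily_cost` is a hypothetical helper, and it assumes every pull costs the same 4 credits.

```python
DAILY_CREDITS = 4_000   # free registered account, per the figures above
CREDITS_PER_PULL = 4    # cost of one aircraft query, per the figures above

pulls_per_day = DAILY_CREDITS // CREDITS_PER_PULL  # 1000 aircraft per day
pulls_per_hour = pulls_per_day // 24               # 41 aircraft per hour, rounded down


def daily_cost(sample_size: int, interval_minutes: int,
               credits_per_pull: int = CREDITS_PER_PULL) -> int:
    """Credits spent per day when sampling `sample_size` aircraft every interval."""
    runs_per_day = (24 * 60) // interval_minutes
    return runs_per_day * sample_size * credits_per_pull


print(daily_cost(10, 30))  # 1920
```

So 10 aircraft every 30 minutes uses a little under half of the 4,000-credit budget, which leaves room to raise the sample size or shorten the interval.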

With the following cons:

- There’s still the possibility that no aircraft in the air will be pulled.
- Doesn’t track the same aircraft through time.
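The first con can be put into numbers. If the sample is drawn without replacement, the chance of pulling zero airborne aircraft follows a hypergeometric distribution: C(N - K, n) / C(N, n), where N is the fleet size, K the number airborne, and n the sample size. Below is a minimal Python sketch; `prob_no_airborne` is a hypothetical helper, and the fleet size and airborne count in the example call are made-up placeholders, not measured values.

```python
from math import comb


def prob_no_airborne(fleet_size: int, airborne: int, sample_size: int) -> float:
    """P(a random sample without replacement contains no airborne aircraft)."""
    grounded = fleet_size - airborne
    # Ways to pick the whole sample from grounded aircraft only,
    # divided by all ways to pick the sample from the fleet.
    return comb(grounded, sample_size) / comb(fleet_size, sample_size)


# Placeholder inputs for illustration only: 1,000 aircraft, 400 in the air.
p_empty = prob_no_airborne(fleet_size=1000, airborne=400, sample_size=10)
print(f"{p_empty:.4%}")
```

The useful part is how the number moves: a bigger sample shrinks the probability quickly, while a quiet hour with few aircraft airborne pushes it back up.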

To me, this seems like a valid approach, and it generally satisfied my goals for the project. Will it necessarily satisfy your goals? Maybe not. It probably won’t satisfy the goals of every user base. The purpose of this section, though, is to illustrate working through the assumptions mentally as well as documenting them. Why? At posit::conf(2025), there was a lot of talk about work being reproducible. We can reproduce code. It’s often harder to reproduce mental models. But if we write them down, it makes it easier to go back and rework our project if we find that demand for these priorities (i.e., coverage vs. refresh rate) shifts over time.

2.1.2 Trade-Off 2: Route Information

Sometimes you only recognize obstacles when you’re in the middle of development. That’s pretty much what happened to me when I was querying route data using the ADSBDB API. The ADSBDB API pulls route information from a database maintained by volunteer contributors. The problem is that operations staff at United Airlines change their minds quicker than volunteer contributors can keep up, so said route information often falls out of date. That’s fine, but it would be nice to mitigate this somehow. So I went back into the world and searched for an API with more information.

I ended up finding adsb.lol’s API, which provides similar route information, this time from a GitHub repo with (ostensibly) more regular updates than ADSBDB. One key addition in this API is a pair of binary columns, origin_plausible and destination_plausible, indicating whether a given origin and destination airport make sense in the context of a given aircraft’s activity. Despite this, the data still isn’t perfect.

As such, the trade-off is whether to omit data entirely due to lack of reliability. Sometimes route information is correct, sometimes it isn’t. Sometimes implausible route information is correct, and sometimes plausible route information is incorrect, making the errant information hard to isolate.

Luckily, adsb.lol provided somewhat of an off-ramp to this concern: given the plausible columns, I can serve different versions of a table based on whether the route information is flagged as plausible or not. Given that even plausible information can still be incorrect, I’ll provide a disclaimer for that information as well. The technical implementation of this is demonstrated in Section 5.1.
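As a rough illustration of the idea (not the implementation from Section 5.1), here is a minimal Python sketch that splits route records into two tables. The `origin_plausible` and `destination_plausible` keys follow the column names mentioned above; `split_routes`, the `DISCLAIMER` wording, and the assumption that each route is a plain dict are all hypothetical.

```python
# Hypothetical disclaimer text; both tables carry it, since even
# plausible routes can still be wrong.
DISCLAIMER = "Route data is volunteer-maintained and may be out of date."


def split_routes(routes: list[dict]) -> dict[str, list[dict]]:
    """Partition routes by the plausibility flags, attaching a disclaimer to each."""
    tables: dict[str, list[dict]] = {"plausible": [], "flagged": []}
    for route in routes:
        # A missing flag is treated as "not plausible" to stay on the safe side.
        is_plausible = bool(route.get("origin_plausible")) and bool(
            route.get("destination_plausible")
        )
        key = "plausible" if is_plausible else "flagged"
        tables[key].append({**route, "disclaimer": DISCLAIMER})
    return tables
```

Each table can then be rendered with its own framing: one as "probably right", the other as "take this with a grain of salt".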

¹ My father, who is really good at analogies (and I like to hope that I inherited the gift a little), came up with this one, so let this citation serve as proof that I’m not plagiarizing.
² I’m speaking to myself here more than anything. I acknowledge the habit of thinking “I’ll document it later” - no you won’t. You’ll find a new package to learn about or some other way to procrastinate. Write the docs. Your future self or someone else will thank you. By the way, I fully acknowledge the limited power this footnote is going to have on changing behavior. One of those things about being easier said than done.
³ In computer science speak, I believe the analogous law is Cheng’s Law of Why We Can’t Have Nice Things.
⁴ I can’t imagine anyone who would do such a thing.
⁵ To be fair, you do receive information that said plane is not in the air, so it technically is a data point.