Finding Multi-Stop Flights Using Neo4j & Python 🐍 Part #1
As you’d probably figured out by now, traveling is a big passion of Markus and mine. However, as students, our travel budget has some serious constraints.
That’s why I started to learn about some great and fun ways to save big bucks while still seeing the world. I’d learned to master the Miles and Points game, Fuel Dumping and Hidden City Ticketing.
For a while, I’d been doing my research manually with tools like ITA Matrix or a standard web browser on sites like Momondo or Kayak. Which is frankly speaking quite tedious and time-consuming. It’s okay if you have an itinerary with only a few stops, but not very convenient if it’s anything bigger or even an around the world itinerary.
My idea
I had the idea of creating a web app that allows you to enter a few destinations and your timeframe. The application should combine those airports in a very cost efficient manner and yield a list of possible flights. I wanted the web app to be able to also include hidden city tickets and multi-stop itineraries with mixed tickets (like mixing a RyanAir with a Delta Ticket). Another thing I wanted to try out for almost forever is Neo4j; a Graph Database. The hype about Base-Databases also left its marks on me. So I was desperately looking for a way where I could use a graph database… And the rest is history.
My planned Technology Stack
I haven’t really settled on my whole technology stack. As mentioned earlier, I’m currently using Neo4j as Database. The reason why I’d settled on a graph database in comparison to a relational database is that’s far easier to check for the shortest path and also much faster. For populating the database I am currently using a python script. I haven’t really settled on the frontend nor the middleware (like a tiny API)…
Airline Routes
‘Airplane, Airplane sorry I am late…🎶’. OK, I gonna stop now. To get a better understanding of the route network - offered by airlines from around the world- I started off with visualizing the airline routes in Tableau. I got the data from Openflights.org, a platform which lets you track flights you’d taken and has some neat statistics like your northern most airport or your longest flight. Openflights.org publishes a list of Routes as open data which can easily be imported into Tableau. Below you can see a complete visualization of the routes on Openflights. Pretty impressive, Ayh?
Looking for something leaner? Here is a route network of Alaska Airlines (AS) and Virgin America (VX).
And here’s one of Lufthansa’s pretty impressive network.
The Lufthansa Network brings me to a point worth mentioning. The data from openflights.org is sometimes not the best. Some flights have been entered incorrectly. Just look at those random flights in South America and Africa. As far as I am aware Lufthansa doesn’t have that man 5th Freedom flights.
In total it’s a bit more than 36000 point-to-point connections if you don’t take the operating airline or code-share flights into account, just the fact that a connection exists between two airports.
You may wonder why I need those flights at all? My plan was to do an initial search of all possible combinations (the ones from Openflights) and add those to my neo4j database. I’ll later consult the neo4j database for a first best guess and query the exact prices in real time through an API.
A first glimpse on the schema of the Neo4j database
I have to admit when I started with this project I had absolutely no clue about how this DB, the query language, and concept works. After playing around for a while I soon felt more comfortable.
Coming up with a bulletproof database schematic is just as important as with a traditional relational database. In my first attempt, I entered all the flights as constraints and added the date, price as properties. While I’d worked for a few flights it soon got a mess. It’s also against the codex of neo4j to have hundreds of relationships between two nodes.
That’s why I went back to the scratchboard and did some further research. I actually found a pretty cool site by Philippe Khin, in his posts he’d described the pros and cons of various models. I decided to pick up the idea of having separate nodes for months, days, airports and flights. That way I can fairly easily find out if there is a flight from a specific date and see where it’s going. Also, the runtime of the query is pretty impressive. Below you can see what it looks like, once some data has been added to the database.
As far the properties go I decide to take a slightly different approach than Phillipe.
My Airports have the following properties:
{
“name”: “Atmautluak Airport”,
“country”: “US”,
“code”: “ATT”,
“lng”: “-162.272995”,
“lat”: “60.866699”
}
While the flight nodes are fairly simple
{
“month”: 6,
“day”: 1,
“flight_number”: “F8310”
}
The actually interesting properties are assigned to the TO-relationships
{
“duration”: 2460,
“arrival_time”: 1527887460,
“distance”: 287.54,
“month”: 6,
“price”: 37,
“day”: 1,
“departure_time”: 1527885000
}
It contains the arrival and departure time formatted as Unix timestamp and most importantly the price in EUR. I’ve also added the distance as a property, that way I can fairly easy search for long-haul flights which have a good price per distance ratio.
The query I’d executed above gives back all possible combinations between two specified airports Whitehorse (YXY)
and Ningbo (NGB)
on any day in June. As you can see, you could connect in Vancouver (YVR)
and Shenyang (SHE)
and fly directly from there to NGB. Or you could take a little detour through Chengdu (CTU)
and Hainan (HAK)
. The red circles represent the specific flight numbers. This is just a visualization of one specific trip, though there many more possible combinations.
That’s it for this post. In the next post, I’ll be talking about which flight search APIs are freely available and how I imported the data into neo4j using a python script.
Cheers Alex