The Iterative Problems With Classification
(The results shown here have been sent to a journal and the link to that paper will be stuck here once published.)
“Machine learning” is a great set of tools for quickly building models without having to think too much about what a model is, what it looks like, or how it works. You don’t have to worry about things like physics, or complex equations. Just throw enough raw data at a problem and you’ll get a black box out that will tell you something. It might not be a particularly accurate or meaningful something but since science is an iterative process, something “okay’ish” now leads to better “okay” things later.
I’ve been playing around with building a classification algorithm for determining vessel class from AIS data. It takes a bunch of AIS messages that come from the same vessel and guesses which AIS class the vessel should be reporting as. Cargo to Cargo, Tugs to Tugs. This is the kind of classification job that could be done by a human, if they’re trained on how ships move and have enough time to look at hundreds of thousands of data points. An algorithm that can make a guess nearly as good as a human’s, but much, much faster, would be useful for applications such as measuring fishing effort, finding anomalous behaviour, or identifying vessels that are in distress.
There was once a tanker vessel that lost engine power near the coast of South Africa. The crew started drifting at some distance from the coast but didn’t want to radio in a distress call due to the cost of a rescue. They kept trying to fix the engine without notifying the South African Maritime Safety Authority (SAMSA) or their parent company. They finally notified SAMSA, and were rescued by an emergency tug, when they were within several kilometres of a rocky coast that formed part of a nature reserve. Having an automated tool that could identify the odd behaviour of this vessel would have reduced the stress of a tug boat captain, and prevented a risky situation from forming.
There has been work done on this problem, but often only for small areas or narrow scenarios. Researchers tend to pick a small area or filter out ports because the gamut of behaviour and the expanding list of edge cases become untenable. It’s also more satisfying to solve a very specific problem than to “do okay” on a more general problem. In this case we’re going to:
- Look at billions of AIS messages from a large region (about 1/3 of the globe)
- Classify all AIS classes (or at least group them into super-groups and then classify those)
- Have a look at what this does to “unknown” vessel classes
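To make the “super-groups” idea concrete, here is a minimal sketch of how AIS ship type codes could be collapsed into a handful of coarse labels. The numeric codes are the standard AIS ship type values (30 for fishing, 60-69 passenger, 70-79 cargo, 80-89 tanker and so on), but the grouping itself, the `PORT_CODES` set, and the `super_group` function are assumptions made for illustration and are not necessarily the groups used in this work.

```python
# Hypothetical grouping of AIS ship type codes into coarse super-groups.
# The codes are standard AIS values; the grouping is an illustrative assumption.

# 31/32 towing, 33 dredging or underwater ops, 50 pilot, 52 tug, 53 port tender
PORT_CODES = {31, 32, 33, 50, 52, 53}
# 36 sailing, 37 pleasure craft
RECREATIONAL_CODES = {36, 37}


def super_group(ship_type: int) -> str:
    """Map an AIS ship type code (0-99) to a coarse super-group label."""
    if ship_type == 0:
        return "Not Available"
    if ship_type == 30:
        return "Fishing"
    if 60 <= ship_type <= 69:
        return "Passenger"
    if 70 <= ship_type <= 79:
        return "Cargo"
    if 80 <= ship_type <= 89:
        return "Tanker"
    if ship_type in PORT_CODES:
        return "Port"
    if ship_type in RECREATIONAL_CODES:
        return "Recreational"
    return "Other"


if __name__ == "__main__":
    for code in (0, 30, 52, 74, 85):
        print(code, super_group(code))
```

Keeping the mapping in one small function makes it easy to try a different grouping later without touching the rest of the pipeline.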
LightGBM is a very nifty tool that has seen lots of success in Kaggle competitions and is available to run within Postgres via PostgresML. I’m busy working on updating the open-ais Docker image to include PostgresML and to expose machine learning results via a standardised API…
The features used to train the algorithm are a combination of Voyage Reports (static data of variable reliability), Position Reports (GPS data), and trajectory data (GPS over time). Below is a quick look at some of the trajectory-derived features… Pretty dry stuff.
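For a feel of what trajectory-derived features can look like, here is a small sketch that turns a time-ordered list of GPS fixes into a few summary numbers. The specific features (total distance, mean and maximum speed, and a “straightness” ratio) and the function names are assumptions chosen for illustration; they are not the exact feature set used for the model.

```python
import math
from statistics import mean

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def trajectory_features(track):
    """Summarise a track given as [(unix_seconds, lat, lon), ...] sorted by time."""
    total_km = 0.0
    speeds = []
    for (t0, lat0, lon0), (t1, lat1, lon1) in zip(track, track[1:]):
        step_km = haversine_km(lat0, lon0, lat1, lon1)
        total_km += step_km
        hours = (t1 - t0) / 3600.0
        if hours > 0:  # skip duplicate timestamps
            speeds.append(step_km / hours)
    if len(track) > 1:
        straight_km = haversine_km(track[0][1], track[0][2], track[-1][1], track[-1][2])
    else:
        straight_km = 0.0
    return {
        "distance_km": total_km,
        "mean_speed_kmh": mean(speeds) if speeds else 0.0,
        "max_speed_kmh": max(speeds, default=0.0),
        # 1.0 means a dead-straight transit; values near 0 mean lots of looping
        "straightness": straight_km / total_km if total_km > 0 else 0.0,
    }


if __name__ == "__main__":
    demo_track = [(0, 0.0, 0.0), (3600, 0.0, 1.0), (7200, 0.0, 2.0)]
    print(trajectory_features(demo_track))
```

A cargo vessel in transit would tend towards a straightness close to 1, while a vessel that loops back over its own track would score much lower, which is the kind of behavioural signal a tree-based model can pick up on.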
Taking these input features, and some others, along with AIS class labels and feeding them into LightGBM produced some impressive results. There were some issues with overfitting (it’s interesting how a ship’s behaviour doesn’t change THAT much over time) that were corrected by using a different train/test split methodology.
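The exact split isn’t spelled out here, but one common way to avoid this kind of leakage is to split by vessel rather than by message, so that the same ship never appears in both the training and the test set. The sketch below assumes each row carries an `mmsi` field and uses a hash of that identifier to assign vessels to a split; the function names and the 20% test fraction are illustrative assumptions, not the method actually used.

```python
import hashlib


def assign_split(mmsi: int, test_fraction: float = 0.2) -> str:
    """Deterministically assign a vessel to 'train' or 'test' based on its MMSI."""
    digest = hashlib.sha256(str(mmsi).encode("utf-8")).digest()
    # Turn the first 8 bytes of the hash into a number in [0, 1)
    bucket = int.from_bytes(digest[:8], "big") / 2 ** 64
    return "test" if bucket < test_fraction else "train"


def split_by_vessel(rows, test_fraction: float = 0.2):
    """Split rows (dicts with an 'mmsi' key) so that no vessel spans both sets."""
    train, test = [], []
    for row in rows:
        if assign_split(row["mmsi"], test_fraction) == "test":
            test.append(row)
        else:
            train.append(row)
    return train, test


if __name__ == "__main__":
    # Hypothetical rows: 50 made-up vessel IDs with three trajectory samples each
    rows = [{"mmsi": m, "sample": i} for m in range(100000000, 100000050) for i in range(3)]
    train, test = split_by_vessel(rows)
    print(len(train), "training rows,", len(test), "test rows")
```

Because the assignment depends only on the identifier, re-running the pipeline on new data keeps every vessel in the same split.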
It looks like this algorithm is very good at classifying fishing, cargo and tanker vessels while being less good at passenger, port (tugs, dredgers, etc.) and recreational vessels. It turns out that those last few classes form a tiny part of the dataset, though, so being less accurate on them might be forgivable:
But what’s that big “Not Available” class I hear you say…
AIS vessel classes could be better. There is a limited selection of classes to pick from in the protocol, and they don’t line up too well with the real-world behaviour or physical construction of a vessel. There is only one AIS class for fishing vessels, and it covers everything from small-scale artisanal boats to large commercial ships, long liners to purse-seiners, and all gear in between. Should buoys equipped with AIS transmitters that are used to mark net locations be “fishing vessels”?
Let’s look at the distribution of AIS data coloured by the self-reported class:
The dataset looks pretty satisfying when plotted like this. Nice and semi-global. There is lots of data, and Datashader does a fantastic job of plotting billions of AIS points.
What would happen if we took the “Unknown” class and ran it through the trained classifier?
It seems that most of them are in the same locations as fishing vessels, have the same kind of spatial patterns, and are being classified as “Fishing” vessels. A quick look at a few samples leads me to believe that these are mostly net markers: a vessel with a name like “12.2V Battery” invites some assumptions about its nature. There are a couple of interesting takeaways. Vessels in shipping lanes are getting classified as cargo or tanker, and vessels in fishing zones are mostly getting classified as fishing vessels. There are also a few weird happenings, like a group of tanker vessels hanging around in the Southern Ocean fishing zones.
Either way, vessel classes need an improvement. A possible solution would be an authoritative open dataset of vessel classes, linked to something like an ontology whose vocabulary describes the different physical and behavioural classes. That would be great!