How the Twitraffic algorithm works
Twitraffic streams data from Twitter, pulling in every tweet that contains the name of a UK motorway or popular A-road. This gives our algorithm a large set of data to analyse and process.
A sample of what a stream of "M5" tweets may return
We can see most of the above have nothing to do with traffic. The algorithm needs to learn what is junk and more importantly what key characteristics form a reliable traffic tweet. This is done by:
- Comparing each word within a tweet to a set of approved keywords. These include terms such as "roadworks", "fire", "incident", "northbound", "junction" and "delays". Each keyword has a specific weight that contributes to the tweet's score. For a tweet to be considered genuine, it needs to reach a pre-defined score.
- Learning and identifying patterns within the tweets considered to be junk. There tend to be specific trends for each search term, e.g. "M8" often referring to friends (mates).
- Disregarding calculated spammers or junk users entirely.
- Identifying the location the tweet was posted from (UK) and the language the tweet was written in (English).
- Using sentiment analysis to understand whether the tweet was positive or negative. The algorithm treats positive tweets about traffic with suspicion.
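To make the steps above more concrete, here is a minimal sketch of how such a filter could be put together. It is not the actual Twitraffic code: the keyword weights, the pass score, the junk pattern for "M8", the blocked user list, the `Tweet` fields and the function names are all assumptions chosen for illustration, and the sentiment step is left out.

```python
import re
from dataclasses import dataclass

# Hypothetical keyword weights and pass score; the real values are not published.
KEYWORD_WEIGHTS = {
    "roadworks": 3.0,
    "fire": 2.0,
    "incident": 3.0,
    "northbound": 2.0,
    "junction": 2.0,
    "delays": 3.0,
}
PASS_SCORE = 4.0

# Hypothetical junk patterns per search term, e.g. "M8" used as slang for "mate".
JUNK_PATTERNS = {
    "M8": re.compile(r"\b(cheers|thanks|alright|nice one)\s+m8\b", re.IGNORECASE),
}

# Hypothetical list of users already judged to be spammers.
BLOCKED_USERS = {"spam_bot_123"}


@dataclass
class Tweet:
    user: str
    text: str
    country: str  # assumed ISO country code, e.g. "GB"
    lang: str     # assumed language code, e.g. "en"


def keyword_score(text: str) -> float:
    """Sum the weights of every approved keyword found in the text."""
    words = re.findall(r"[a-z]+", text.lower())
    return sum(KEYWORD_WEIGHTS.get(word, 0.0) for word in words)


def is_traffic_tweet(tweet: Tweet, road: str) -> bool:
    """Apply the user, location/language, junk-pattern and keyword checks in turn."""
    if tweet.user in BLOCKED_USERS:
        return False
    if tweet.country != "GB" or tweet.lang != "en":
        return False
    pattern = JUNK_PATTERNS.get(road)
    if pattern is not None and pattern.search(tweet.text):
        return False
    return keyword_score(tweet.text) >= PASS_SCORE
```

In this sketch the cheap checks (blocked user, location, language) run first so that the keyword scoring only happens for tweets that could still qualify. For example, `is_traffic_tweet(Tweet("driver1", "Long delays on the M5 northbound near junction 12", "GB", "en"), "M5")` passes, while a tweet reading "Cheers m8, see you later" is rejected for the "M8" search term.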
We should now only be left with tweets relating to road traffic. These are then added to our database to be processed, sorted and organised, ready for the smartphone app to take the data it needs. The API then communicates this data to your phone.
The algorithm is straightforward, although there are some more complex components to ensure the validity of the tweets that are stored. All this requires a great deal of computational power, and as it runs 24/7 I'm constantly trying to strip it down to make it simpler and more efficient.
Add to this the various components to manage the service reliability and features within the API and you can see Twitraffic becomes a complex tool to manage.
The full component diagram for Twitraffic