What Is Geoclustering? How to Handle 200,000 Points on the Map
While making an application for truck drivers, we faced an engineering challenge plotting location data on a map in a way that overwhelming volumes wouldn’t crash users’ phones. We had more than 200,000 geopoints and needed them to be grouped into certain categories so that a driver would operate within these categories instead of seeing all the points at once.
We solved this issue by applying a particular clustering technique. Let’s now get a grasp on what geoclustering is and which methods of grouping location data are out there.
Crafting a Cluster Management Tool for an App
A geoclustering tool, along with a powerful map are the fundamentals of any location-based or navigation application. When you use an application which deals with lots of data on a daily basis, you need this data to be loud and simple, divided into visually acceptable parts. And this is what clustering serves to do; structuring data and making it understandable.
If you want to develop an app with some mapping functions, you should know how many geopoints will be included. Our guess is that there will be a lot. Now, with each geopoint being shown as a separate marker, imagine a map filled with markers overlapping each other. Messy, right? To handle thousands of points and make your service effective but not disturbing is quite a piece of work.
As we’ve mentioned, we needed to handle 200,000 points in the Atlas project. The app’s end users were truck drivers who needed an easy-to-use tool for building the delivery route. So, the way geopoints were displayed and managed on the map was essential.
Clustering tools have been out there for years. The complication is which one to choose. Apart from different algorithms, there are different programming languages such a tool can be developed with, and two different approaches: server-side and client-side clustering.
There are several techniques of data clustering which can be easily applied to geopoints:
- K-means: every geopoint is attached to a group with the closest centre, and centres are defined as the means of all vectors in the group
- Mean-shift clustering: centres are the means of points situated within a certain dense area
- DBSCAN: outliers aren’t classified into clusters but identified as noises
- Gaussian Mixture Models: geopoints are attributed to a cluster if they are close to the Gaussian centre
- Hierarchical clustering: while each geopoint is taken as a cluster, the average distance between them is calculated, then points with the smallest average linkage are combined
At MadAppGang we utilised hierarchical greedy clustering based on KD-tree geospatial indexes.
Why We Needed a New Geoclustering Solution
There are lots of clustering tools likeMarkerClustererfor Google Maps API, a Leaflet’s pluginMarker Cluster, Mapbox’s Supercluster, orAnycluster. But we needed another, much more optimizable solution. Instead of searching for the best cluster management software, you can always write your own code for clustering according to your project’s goals, and this is exactly what we did.
Why We Chose Server-Side Clustering
First of all, it’s best doing clustering at the server’s side. There are several easy to implement cluster management software solutions but they are managed on the client’s side. It increases response payload significantly and wastes time and memory on the client’s side.
While grouping points on the server’s side, there’s no need to build clusters each time on a client’s screen. Let’s say you have 100,000 points categorised in 100 clusters; with a server-side approach, clusters will be calculated one time on the server, while client-side clustering will require performing the same process by each client.
Server-side clustering will create a cache which will be transformed each time points are added or deleted. You can even pre-cache what’s kept on the server on a content delivery network to decrease the server’s load and reduce latency. Loading large volumes of data without managing clusters on the server slows down the app’s performance.
Why We Used Golang
Aiming for the fastest solution possible, we developed a clustering library using Golang. In most cases, Go is the fastest programming language which is helpful for processing many geopoints and combining them by certain criteria.
Clusters developed with other languages − be it Node, PHP, Python, or Ruby − are slow compared to Golang-based solutions. And some of those open source solutions for spatial data that are written in Golang (geo.go, tile38, rtreego) were not suitable for our project as they were too complicated, slowed down performance, or consumed too much memory.
With that said, we developed a library for building KD trees that worked with only 2D points. A library that also works with polygons and processes any quantity of coordinates in many-dimensional spaces is much less efficient.
Apps With Geodata
Travelling and delivery applications are flourishing, with more solutions being developed all the time and the most demanded ones getting millions of downloads worldwide. When it comes to making a popular application with navigational functions, a map with apprehensible data chunks is a pressing need.
Making clusters is what helps projects deal with massive amounts of data. And making a proper cluster monitoring and management tool is a challenge you will face while working on a project with a geo-powered map. Feel free to contact us to discuss your ideas or if you’d like help with geoclustering.
10 July 2019