What Is Geoclustering? How to Handle 200,000 Points on the Map

Anastasia Osypenko
Market Researcher

While making an application for truck drivers, we faced an engineering challenge plotting location data on a map in a way that overwhelming volumes wouldn’t crash users’ phones. We had more than 200,000 geopoints and needed them to be grouped into certain categories so that a driver would operate within these categories instead of seeing all the points at once.

We solved this issue by applying a particular clustering technique. Let’s now get a grasp on what geoclustering is and which methods of grouping location data are out there.

Crafting a Cluster Management Tool for an App

A geoclustering tool, along with a powerful map are the fundamentals of any location-based or navigation application. When you use an application which deals with lots of data on a daily basis, you need this data to be loud and simple, divided into visually acceptable parts. And this is what clustering serves to do; structuring data and making it understandable.

If you want to develop an app with some mapping functions, you should know how many geopoints will be included. Our guess is that there will be a lot. Now, with each geopoint being shown as a separate marker, imagine a map filled with markers overlapping each other. Messy, right? To handle thousands of points and make your service effective but not disturbing is quite a piece of work.

As we’ve mentioned, we needed to handle 200,000 points in the Atlas project. The app’s end users were truck drivers who needed an easy-to-use tool for building the delivery route. So, the way geopoints were displayed and managed on the map was essential.

Clustering Approaches

Clustering tools have been out there for years. The complication is which one to choose. Apart from different algorithms, there are different programming languages such a tool can be developed with, and two different approaches: server-side and client-side clustering.

There are several techniques of data clustering which can be easily applied to geopoints:

K-means: every geopoint is attached to a group with the closest centre, and centres are defined as the means of all vectors in the group
Mean-shift clustering: centres are the means of points situated within a certain dense area
DBSCAN: outliers aren’t classified into clusters but identified as noises
Gaussian Mixture Models: geopoints are attributed to a cluster if they are close to the Gaussian centre
Hierarchical clustering: while each geopoint is taken as a cluster, the average distance between them is calculated, then points with the smallest average linkage are combined

At MadAppGang we utilised hierarchical greedy clustering based on KD-tree geospatial indexes.

Why We Needed a New Geoclustering Solution

There are lots of clustering tools likeMarkerClustererfor Google Maps API, a Leaflet’s pluginMarker Cluster, Mapbox’s Supercluster, orAnycluster. But we needed another, much more optimizable solution. Instead of searching for the best cluster management software, you can always write your own code for clustering according to your project’s goals, and this is exactly what we did.

Why We Chose Server-Side Clustering

First of all, it’s best doing clustering at the server’s side. There are several easy to implement cluster management software solutions but they are managed on the client’s side. It increases response payload significantly and wastes time and memory on the client’s side.

While grouping points on the server’s side, there’s no need to build clusters each time on a client’s screen. Let’s say you have 100,000 points categorised in 100 clusters; with a server-side approach, clusters will be calculated one time on the server, while client-side clustering will require performing the same process by each client.

Server-side clustering will create a cache which will be transformed each time points are added or deleted. You can even pre-cache what’s kept on the server on a content delivery network to decrease the server’s load and reduce latency. Loading large volumes of data without managing clusters on the server slows down the app’s performance.

Why We Used Golang

Aiming for the fastest solution possible, we developed a clustering library using Golang. In most cases, Go is the fastest programming language which is helpful for processing many geopoints and combining them by certain criteria.

Clusters developed with other languages − be it Node, PHP, Python, or Ruby − are slow compared to Golang-based solutions. And some of those open source solutions for spatial data that are written in Golang (geo.go, tile38, rtreego) were not suitable for our project as they were too complicated, slowed down performance, or consumed too much memory.

Finally, we optimised the kdbush library, building our own library for clustering geopoints called Gocluster. Since kdbush was written in JavaScript, we wrote a framework for implementing the same algorithm in Golang.

With that said, we developed a library for building KD trees that worked with only 2D points. A library that also works with polygons and processes any quantity of coordinates in many-dimensional spaces is much less efficient.

Our Gocluster, based on KD-tree geospatial index, groups points using Mercator projection at different scales. Comparing our solution with Mapbox’s Supercluster written in JavaScript, we discovered that it worked 40 times faster and took 20 times less memory. It works this fast as you can’t change the index in the process but only rewrite it. It took Gocluster 0.5 sec and 10 MB of memory to process 21 map layers with 1,000,000 geopoints. Owing to such speed, we can update all indexes while adding or replacing points in real time.

Apps With Geodata

Travelling and delivery applications are flourishing, with more solutions being developed all the time and the most demanded ones getting millions of downloads worldwide. When it comes to making a popular application with navigational functions, a map with apprehensible data chunks is a pressing need.

Making clusters is what helps projects deal with massive amounts of data. And making a proper cluster monitoring and management tool is a challenge you will face while working on a project with a geo-powered map. Feel free to contact us to discuss your ideas or if you’d like help with geoclustering.

10 July 2019

Also on Madappgang

ᐸ

How Adopting Site Reliability Engineering Transforms Your Digital Infrastructure into a Reliability Powerhouse

Are You Still Making These Common Code Review Mistakes?

10 Innovative Startup Ideas from Reddit in 2024

Slaying the White Screen Dragon: How to Defeat Procrastination

How to stop finding time for refactoring

ᐳ