**The current landscape of IoT (internet of things) devices, low-cost GNSS (global navigation satellite system) receivers, and omnipresent wireless networks produces large amounts of data containing geospatial information. This blog post introduces the basics of the geolocated k-nearest neighbors (k-NN) model and its applications in product campaign targeting.**

### Distance metrics

Geolocation means assigning real-world positions to the objects in a dataset. The position is usually represented as latitude, longitude (and elevation). k-NN models are built on the ability to calculate the distance between two sets of coordinates. The distance metric itself can be defined in multiple ways.

The most common metric is the Euclidean distance, sometimes called the Pythagorean formula. It is important to note that the Euclidean distance is defined in terms of Cartesian coordinates, while geographic coordinates are spherical. Using this metric with spherical coordinates is an approximation suitable only in certain situations (e.g. when the data points are contained in a small region).

When a precise distance is required, it is necessary to use the great-circle distance, defined as the shortest path between two points on the surface of a sphere. The distance is usually calculated using the haversine formula with the Earth radius derived from the WGS84 ellipsoid.
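The haversine formula can be sketched in a few lines of NumPy. The radius below is the WGS84 mean Earth radius in metres, and the Prague and Brno coordinates are only illustrative:

```python
import numpy as np

def haversine(lat1, lon1, lat2, lon2, radius=6371008.8):
    """Great-circle distance in metres between two points given in degrees.

    radius defaults to the WGS84 mean Earth radius (~6371.0088 km).
    """
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return 2 * radius * np.arcsin(np.sqrt(a))

# Prague to Brno, roughly 185 km along the great circle
print(haversine(50.0755, 14.4378, 49.1951, 16.6068))
```

Because the function is vectorized, the inputs can also be NumPy arrays, which makes it easy to compute the distance from one point to a whole dataset at once.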

Cars and trains don’t travel along straight lines, and some applications demand the actual travel distance or even the travel time. For large datasets, however, the travel distance calculation can be technically challenging, as it involves solving the shortest path problem for many different pairs of data points.

The SciPy documentation offers a list of many more distance metrics. However, some may not be suitable for use with geographic coordinates.

It is important to properly analyze the problem before selecting a metric. For example, the Euclidean distance might correlate well with the travel distance in a small region with a road network homogeneous in all directions.

### k-nearest neighbors

k-NN is one of the simplest forms of supervised learning. The algorithm predicts the target value of a data point using the spatial proximity of the surrounding training samples.

The figure below shows an example of k-NN classification: the predicted value for *k = 5* would be 0, as it is the most common label in the set of the five nearest points. In the case of k-NN regression, the target value is defined as the mean of the k nearest labels.
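Both variants can be sketched in a few lines of NumPy. The toy points and labels below are hypothetical; the majority vote implements classification, while the commented-out mean would give regression:

```python
import numpy as np

# toy training set: 2-D points with binary labels (hypothetical data)
points = np.array([[0.0, 0.0], [0.2, 0.1], [0.1, 0.3],
                   [1.0, 1.0], [0.9, 1.1], [1.2, 0.8]])
labels = np.array([0, 0, 0, 1, 1, 1])

def knn_predict(x, k=5):
    # Euclidean distances from the query point to every training sample
    dists = np.sqrt(((points - x) ** 2).sum(axis=1))
    nearest = labels[np.argsort(dists)[:k]]
    # classification: majority vote; regression would return nearest.mean()
    return np.bincount(nearest).argmax()

print(knn_predict(np.array([0.1, 0.1])))  # 0: three of the five nearest are 0
```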

The naive implementation of the k-NN algorithm requires calculating the distance to every training sample, which might not be feasible for large datasets. A more efficient approach is to use a **space-partitioning structure** to index the data. Ball trees and k-d trees are examples of such data structures.

A k-d tree is a binary search tree whose nodes represent k-dimensional points. Each level of the tree splits one of the dimensions into half-spaces. The example below is a 2-d tree constructed out of 7 points:

*data* = {(1, 0), (1, 4), (2, 1), (4, 7), (7, 4), (6, 3), (5, 6)}.

The tree root is selected using the median of the first dimension and splits the rest of the data into two subtrees: points with lower values in the first dimension belong to the left subtree, points with higher values to the right. The process is then applied recursively to the next dimension at the second level of the tree.

A comprehensive overview of the k-d tree algorithm can be found on Wikipedia or in this PhD thesis extract.

The following Python code compares the naive implementation with the k-d tree approach. The test problem consists of finding the 5 nearest neighbors in a training dataset of 100,000 samples for 1,000 random points.

```python
import numpy as np
from scipy.spatial import cKDTree

# data
training = np.random.rand(100000, 2)
k = 5

# naive approach
def get_euclid_dists(training, x):
    return np.sqrt(((training - x) ** 2).sum(axis=1))

%%timeit
dists = [get_euclid_dists(training, np.random.rand(1, 2)) for _ in range(1000)]
results = np.argpartition(dists, k)[:, :k]
>> 7.65 s ± 871 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

# k-d tree
%%timeit
tree = cKDTree(training)
results = tree.query(np.random.rand(1000, 2), k)
>> 46.6 ms ± 653 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
```

The results show that, on average, the k-d tree performed more than a hundred times faster than the naive implementation of the algorithm.

The k-NN model setup involves finding an appropriate value for *k*. This is usually done using a parameter search that optimizes an appropriate performance metric of the model. Some applications might warrant using different *k* values for different geographic regions of the dataset.
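One possible sketch of such a parameter search uses scikit-learn's cross-validated grid search. The dataset, the candidate *k* values, and the choice of ROC AUC as the metric are all illustrative assumptions:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# hypothetical labelled dataset: 2-D coordinates and a binary response
rng = np.random.default_rng(0)
X = rng.random((1000, 2))
y = (X[:, 0] + X[:, 1] > 1).astype(int)

# cross-validated search over candidate k values, optimizing ROC AUC
search = GridSearchCV(
    KNeighborsClassifier(),
    param_grid={"n_neighbors": [3, 5, 11, 21, 51]},
    scoring="roc_auc",
    cv=5,
)
search.fit(X, y)
print(search.best_params_)
```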

### Marketing applications

Geolocated nearest neighbors models can be applied to enhance the performance of direct marketing campaigns in situations with no or very limited behavioral data. Instead of random targeting, the audience is selected using a propensity score based on data from the local neighborhood of each client. The model can then be further improved using additional available client data (e.g. sex, age) as well as demographic data describing the neighborhood (e.g. median income, median age).
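A minimal sketch of this idea scores each client by the share of responders among their k nearest neighbors. The client table, response rate, neighborhood size, and audience cutoff below are all hypothetical:

```python
import numpy as np
from scipy.spatial import cKDTree

# hypothetical client table: projected coordinates and past campaign response
rng = np.random.default_rng(42)
coords = rng.random((5000, 2))
responded = rng.random(5000) < 0.1          # ~10 % historical response rate

# propensity score: share of responders among each client's k nearest neighbors
k = 50
tree = cKDTree(coords)
_, idx = tree.query(coords, k + 1)          # +1 because each point finds itself
propensity = responded[idx[:, 1:]].mean(axis=1)

# target the clients with the highest scores instead of a random sample
audience = np.argsort(propensity)[::-1][:500]
```

Additional client or demographic attributes would enter as extra feature columns, with coordinates typically projected to a planar system so that Euclidean distance is meaningful.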

To evaluate the model, the common approach is to define a set of test samples and calculate the usual metrics such as precision, recall, or area under the ROC (receiver operating characteristic) curve. An audience targeted with a well-performing model shows a higher response rate to the campaign than randomly selected clients. This effect is called model lift.
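These metrics are straightforward to compute with scikit-learn; lift can be estimated by comparing the response rate of the top-scored clients with the overall base rate. The test labels and scores below are made up for illustration:

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score, roc_auc_score

# hypothetical test labels and model propensity scores
y_true = np.array([0, 0, 1, 0, 1, 1, 0, 0, 1, 0])
scores = np.array([0.1, 0.3, 0.8, 0.2, 0.7, 0.9, 0.4, 0.2, 0.6, 0.3])
y_pred = (scores >= 0.5).astype(int)

print(precision_score(y_true, y_pred))
print(recall_score(y_true, y_pred))
print(roc_auc_score(y_true, scores))

# lift: response rate among the top-scored 30 % vs. the overall base rate
top = y_true[np.argsort(scores)[::-1][:3]]
print(top.mean() / y_true.mean())
```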

The first step before implementing the pipeline is to analyze whether the data are suitable for this form of predictive modeling. For example, the product that the campaign promotes might be evenly distributed across the entire geographic area, or the client base might be too sparse to build any meaningful neighborhoods.
