# **Fisher’s Natural Breaks Classification complexity proof**

An **O(k × n × log(n))** algorithm is presented here, with proof of its validity and complexity, for the classification of an array of *n* numeric values into *k* classes such that the sum of the squared deviations from the class means is minimal, known as Fisher's Natural Breaks Classification. This algorithm is an improvement of Jenks' Natural Breaks Classification Method, which is a (re)implementation of the algorithm described by Fisher within the context of choropleth maps and which has time complexity *O(k × n*^{2}*)*. Finding breaks for 15 classes for a data array of 7 million unique values now takes 20 seconds on a desktop PC.

# definitions

Given a sequence of strictly increasing values *v*_{i} and positive weights *w*_{i} for *i* ∈ {1..*n*}, and a given number of classes *k* with *k* ≤ *n*, choose a classification function *I(x)*: {1..*n*} → {1..*k*} such that *SSD*_{n,k}, the sum of the squares of the deviations from the class means, is minimal, where:

\[SSD_{n,k} := \sum_{j=1}^{k} ssd(S_j) \quad\text{with}\quad S_j := \{i \mid I(i) = j\}, \qquad ssd(S) := \sum_{i \in S} w_i \left(v_i - m(S)\right)^2, \qquad m(S) := \frac{\sum_{i \in S} w_i v_i}{\sum_{i \in S} w_i}\]

Note that any array of *n* unsorted and not necessarily unique values can be sorted and made into unique *v*_{i}, with *w*_{i} representing the occurrence frequency of each value, by removing duplicates, in *O(n log(n))* time.
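The deduplication step just described can be sketched as follows (a minimal Python illustration, not the GeoDms implementation; the function name `prepare` is invented here):

```python
# Sketch: turn an unsorted array with duplicates into strictly increasing
# values v with occurrence-frequency weights w, in O(n log n) time.
from collections import Counter

def prepare(values):
    counts = Counter(values)         # O(n): frequency of each value
    v = sorted(counts)               # O(n log n): strictly increasing uniques
    w = [counts[x] for x in v]       # positive weights
    return v, w
```

For example, `prepare([3, 1, 3, 2, 1, 1])` yields `([1, 2, 3], [3, 1, 2])`.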

# characteristics

One can derive the following: *ssd(S) = sqr(S) - ssm(S)*, with \(sqr(S) := \sum_{i \in S} w_i v_i^2\) and \(ssm(S) := w(S) \times m(S)^2\), where \(w(S) := \sum_{i \in S} w_i\). Summing over all classes gives *SSD = SSV - SSM*, with \(SSV := sqr(\{1..n\})\) and \(SSM := \sum_{j} ssm(S_j)\).

- Since *SSV* is independent of the chosen partitioning *S*_{j}, minimizing *SSD* is equivalent to maximizing *SSM*.

- Since the sequence *v*_{i} is sorted and all unique values are given, the *v*_{i}'s that belong to one class *S*_{j} must be consecutive (for proof, see: Fisher, On Grouping for Maximum Homogeneity, 1958), and if equal values were allowed, these would end up in the same class, provided that at least *k* different values are given. Consequently, the *S*_{j}'s can be 1-to-1 related to a strictly increasing sequence of class-break-indices *i*_{j} := min *S*_{j} for *j* = 1..*k* with *i*_{1} = 1, and also to a strictly increasing sequence of class-break-values \(v_{i_j}\).

- With *n* values and *k* classes, one can choose \(\binom{n-1}{k-1} = \frac{(n-1)!}{(k-1)!\,(n-k)!}\) different sets of *k* - 1 elements as first value of a class (the minimum value must be the minimum of the lowest class).

- Given indices *i*_{1} ≤ *i*_{2} and *i*_{3} ≤ *i*_{4}: \(i_1 \le i_3 \wedge i_2 \le i_4 \Rightarrow m(i_1..i_2) \le m(i_3..i_4)\), where \(m(i_1..i_2)\) denotes the weighted mean of the consecutive values \(v_{i_1}..v_{i_2}\).

# dynamic programming approach

Take *SSM*_{i,j} as the maximum obtainable sum of *ssm* values for the first *i* values classified into *j* classes, and *CB*_{i,j} as the value index of the last class-break for such a classification. Since a maximal sum is a sum, it is easy to see that *SSM*_{i,j} is equal to the maximum value for *p* ∈ {*j*..*i*} of *SSM*_{p-1,j-1} + *ssm*({*p..i*}). Thus, the *SSM*_{p-1,j-1} provide Overlapping Sub-Problems with an Optimal Sub-Structure, and the following dynamic program can be used to find a solution.

Dynamic rule for *i* ≥ *j* > 1:

\[\begin{align} SSM_{i,j} &:= \max_{p \in \{j..i\}} \left[ SSM_{p-1,j-1} + ssm(p..i) \right] \\ CB_{i,j} &:= \operatorname{arg\,max}_{p \in \{j..i\}} \left[ SSM_{p-1,j-1} + ssm(p..i) \right] \end{align}\]

Start conditions:

\[\begin{align} SSM_{i,1} &:= ssm(1..i) \\ CB_{i,1} &:= 1 \end{align}\]

Thus:

\[\begin{align} SSM_{i,i} &:= sqr(\{1..i\}) \\ CB_{i,i} &:= i \end{align}\]

Note that *CB*_{i,j} and *SSM*_{i,j} are only defined for *j* ≤ *i*, thus *i* - *j* ≥ 0. Furthermore, for finding the values with indices (*n*, *k*), only elements with *i* - *j* ≤ *n* - *k* are relevant.

- If *i* = *CB*_{n,k}, then the corresponding previous class-break index has been stored as *CB*_{i - 1, k - 1}.

- Fisher described an algorithm that goes through all i and, for each i, finds and stores all *SSD*_{i,j} and *CB*_{i,j} for each subsequent j, which is the approach taken in the found implementations (see links). This requires *O(k × n)* working memory and *O(k × n*^{2}*)* time.

- Dynamically going through all j and, for each j, finding all *SSM*_{i,j} for each i only requires *O(n - k)* working memory and *O(k × n × log(n))* time by exploiting the no-crossing-paths characteristic, as proven below, but only provides the last class-break for each i.

- No crossing paths property: *CB*_{i, j} is monotonically increasing with i, thus *i1* ≤ *i2* ⇒ *CB*_{i1, j} ≤ *CB*_{i2, j} (see proof below).

- To obtain the complete chain of *CB*_{i,j} back from *CB*_{n, k}, one needs to maintain (*n - k + 1*) × (*k - 2*) intermediate CB's, unless the *O(k × n × log(n))* algorithm is reapplied k times.
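As an illustration of the recurrence above, here is a minimal Python sketch of the straightforward *O(k × n*^{2}*)* dynamic program (a full *SSM*/*CB* table with backtracking, i.e. the quadratic variant, not the *O(k × n × log(n))* GeoDms implementation; the function name `fisher_breaks` is invented here):

```python
# Quadratic-time sketch of the dynamic program: SSM[j][i] is the maximum
# sum of ssm values for the first i values in j classes; CB[j][i] stores
# the index of the last class-break. ssm(p..i) is evaluated in O(1) from
# cumulative weights W and cumulative weighted values WV.

def fisher_breaks(v, w, k):
    n = len(v)
    W = [0.0] * (n + 1)   # W[i]  = w_1 + ... + w_i
    WV = [0.0] * (n + 1)  # WV[i] = w_1*v_1 + ... + w_i*v_i
    for i in range(1, n + 1):
        W[i] = W[i - 1] + w[i - 1]
        WV[i] = WV[i - 1] + w[i - 1] * v[i - 1]

    def ssm(p, i):  # ssm({p..i}) = (WV[i] - WV[p-1])^2 / (W[i] - W[p-1])
        d = WV[i] - WV[p - 1]
        return d * d / (W[i] - W[p - 1])

    SSM = [[float("-inf")] * (n + 1) for _ in range(k + 1)]
    CB = [[0] * (n + 1) for _ in range(k + 1)]
    for i in range(1, n + 1):        # start conditions: a single class
        SSM[1][i] = ssm(1, i)
        CB[1][i] = 1
    for j in range(2, k + 1):        # dynamic rule for i >= j > 1
        for i in range(j, n + 1):
            for p in range(j, i + 1):
                cand = SSM[j - 1][p - 1] + ssm(p, i)
                if cand > SSM[j][i]:
                    SSM[j][i], CB[j][i] = cand, p
    breaks = []                      # chain the class-breaks back from CB[k][n]
    i, j = n, k
    while j >= 1:
        breaks.append(CB[j][i])
        i, j = CB[j][i] - 1, j - 1
    return breaks[::-1]              # class-break indices i_1..i_k, i_1 == 1
```

For the five-value example used further below, `fisher_breaks([1, 8, 9, 10, 16], [1, 1, 1, 1, 1], 2)` returns `[1, 2]`, i.e. Class 1 = {1} and Class 2 = {8, 9, 10, 16}.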

# complexity

- Each *ssm*(*i*_{1}*..i*_{2}) can be calculated in constant time as \(\frac{\left(WV[i_2] - WV[i_1-1]\right)^2}{W[i_2] - W[i_1-1]}\) after a single *O(n)* initialization of a buffer of *O(n)* size containing the series of cumulative weights, *W*, and cumulative weighted values, *WV*.

- During the calculation of the *n - k + 1* relevant values of SSM and CB for a row *j* ∈ {2..*k* - 1}, using the *n - k + 1* values that were calculated for row *j* - 1, one can divide and conquer recursively: calculate *CB*_{i, j} for *i* in the middle of a range for which a left and right boundary are given, and use the "no crossing paths" property to cut the remaining number of options in half, sectioning the lower indices to the left-or-equal options and the higher indices to the greater-or-equal options. Let *C(n, m)* denote the required processing-time budget for processing *n* class-breaks using *m* values for previous class-breaks; it satisfies *C(n, m)* = *c* + max_{i}: [*C(n/2, i)* + *C(n/2, m + 1 - i)*]. A budget of *C(n, m)* := *c* × (*m* + *n*) × (log_{2}(*n*) + 1) is sufficient; thus the calculation of all *CB*_{i, j} given all *CB*_{i, j - 1} requires processing time *C(n, n)*, which is *O(n × log(n))*.
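The divide-and-conquer step for one row can be sketched as follows (a Python illustration under assumed names, not the GeoDms code; `score(p, i)` stands for *SSM*_{p-1, j-1} + *ssm*(*p..i*), and `cb_row` receives the *CB*_{i, j} of the current row):

```python
# Fill cb_row[i] for i in lo..hi, given that the optimal break positions lie
# in p_lo..p_hi and that CB is monotonically increasing in i (no crossing
# paths): after solving the middle i by a linear scan, the lower half only
# needs positions <= best_p and the upper half only positions >= best_p.

def fill_row(lo, hi, p_lo, p_hi, score, cb_row):
    if lo > hi:
        return
    mid = (lo + hi) // 2
    best_p, best = p_lo, float("-inf")
    for p in range(p_lo, min(p_hi, mid) + 1):  # a break must satisfy p <= i
        s = score(p, mid)
        if s > best:
            best, best_p = s, p
    cb_row[mid] = best_p
    fill_row(lo, mid - 1, p_lo, best_p, score, cb_row)
    fill_row(mid + 1, hi, best_p, p_hi, score, cb_row)
```

Each recursion level scans at most *m* + *n* candidate positions in total, which gives the *C(n, m)* budget derived above.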

# proof of the no crossing paths property

Addition: the no crossing paths property might be equivalent to what elsewhere is called the convex Monge condition, which allows for this kind of speed-up when solving a dynamic programming problem.

- Given indices *p*_{1}, *p*_{2}, *i*_{1}, *i*_{2} such that *p*_{1} < *p*_{2} ≤ *i*_{1} < *i*_{2}, write *A* := {*p*_{1}..*p*_{2} - 1}, *B* := {*p*_{2}..*i*_{1}} and *C* := {*i*_{1} + 1..*i*_{2}}; these three ranges are non-empty,

- it follows from the mean-monotonicity characteristic that \(m(A) \le m(B) \le m(B \cup C)\),

- and it follows that the following quantities are positive: \(w(A)\), \(w(B)\) and \(w(C)\),

- note that merging two disjoint groups *X* and *Y* reduces the sum of their *ssm* values by a known non-negative amount: \(ssm(X) + ssm(Y) - ssm(X \cup Y) = \frac{w(X)\,w(Y)}{w(X) + w(Y)}\left(m(X) - m(Y)\right)^2\),

- and therefore: \(ssm(p_2..i_1) - ssm(p_1..i_1) = \frac{w(A)\,w(B)}{w(A) + w(B)}\left(m(A) - m(B)\right)^2 - ssm(A)\) and \(ssm(p_2..i_2) - ssm(p_1..i_2) = \frac{w(A)\,w(B \cup C)}{w(A) + w(B \cup C)}\left(m(A) - m(B \cup C)\right)^2 - ssm(A)\),

- since \(w(B \cup C) \ge w(B)\) and \(\frac{w(A)\,x}{w(A) + x}\) is increasing in *x*, while \(m(A) \le m(B) \le m(B \cup C)\) implies \(\left(m(A) - m(B \cup C)\right)^2 \ge \left(m(A) - m(B)\right)^2\),

- thus \(ssm(p_2..i_2) - ssm(p_1..i_2) \ge ssm(p_2..i_1) - ssm(p_1..i_1)\),

- which can be restated as \(ssm(p_1..i_1) + ssm(p_2..i_2) \ge ssm(p_1..i_2) + ssm(p_2..i_1)\),

- given that \(SSM_{i_1, j} = SSM_{p_2-1, j-1} + ssm(p_2..i_1) \ge SSM_{p_1-1, j-1} + ssm(p_1..i_1) \text{ for all } p_1 < p_2 \text{ when } p_2 \text{ is taken as } CB_{i_1, j}\),

- adding the last two inequalities results in: \(SSM_{p_2-1, j-1} + ssm(p_2..i_2) \ge SSM_{p_1-1, j-1} + ssm(p_1..i_2)\),

- given the definition of *CB*, this implies that *CB*_{i2, j} cannot be smaller than *p*_{2}, and therefore *CB*_{i2, j} ≥ *p*_{2} = *CB*_{i1, j}, QED.
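The key inequality can also be checked numerically (a brute-force Python sanity check on random weighted data; this is an illustration, not part of the proof):

```python
# Brute-force check of ssm(p2..i2) - ssm(p1..i2) >= ssm(p2..i1) - ssm(p1..i1)
# over all quadruples p1 < p2 <= i1 < i2 on random strictly increasing values.
import random

random.seed(7)
n = 12
v = sorted(random.sample(range(1000), n))      # strictly increasing values
w = [random.randint(1, 5) for _ in range(n)]   # positive weights
W = [0.0] * (n + 1)
WV = [0.0] * (n + 1)
for i in range(1, n + 1):
    W[i] = W[i - 1] + w[i - 1]
    WV[i] = WV[i - 1] + w[i - 1] * v[i - 1]

def ssm(p, i):
    d = WV[i] - WV[p - 1]
    return d * d / (W[i] - W[p - 1])

ok = True
for p1 in range(1, n + 1):
    for p2 in range(p1 + 1, n + 1):
        for i1 in range(p2, n + 1):
            for i2 in range(i1 + 1, n + 1):
                if ssm(p2, i2) - ssm(p1, i2) < ssm(p2, i1) - ssm(p1, i1) - 1e-9:
                    ok = False   # would contradict the proof above
```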

# further improvements

Note that:

- For each *i* and *j* < *i*: *SSM*_{i,j} < *SSM*_{i,j + 1}, since splitting up any class makes the total *SSM* larger.
- It is not always true that *SSM*_{i,j} ≤ *SSM*_{i + 1,j}; for example, take *v*_{1} = -10 and *v*_{2} = 0 with unit weights, then *SSM*_{1, 1} = 1 × (-10)^{2} = 100 and *SSM*_{2, 1} = 2 × (-5)^{2} = 50.
- From the proof of the no-crossing-paths property, it follows that: *ssm*({*p2..i2*}) - *ssm*({*p1..i2*}) ≥ *ssm*({*p2..i1*}) - *ssm*({*p1..i1*}).
- For each *i1* ≥ *i2* and *j* < *i2*: *SSM*_{i1, j + 1} - *SSM*_{i1, j} ≥ *SSM*_{i2, j + 1} - *SSM*_{i2, j}, which can be proven by induction.

Following the lines of the latter induction proof, it can be shown (elaborate!) that for each *i* and *j* < *i*: *CB*_{i,j} ≤ *CB*_{i,j + 1}; thus *CB*_{i,j} can be used as a lower bound when calculating *CB*_{i, j + 1}.
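The counterexample above can be verified with a few lines of Python (unit weights, so *ssm(S)* = *w(S)* × *m(S)*^{2} reduces to the count times the squared mean):

```python
# SSM_{i,1} with unit weights: ssm of a single class holding all i values.
def ssm_of(values):
    m = sum(values) / len(values)       # class mean
    return len(values) * m * m          # w(S) * m(S)^2

ssm_1 = ssm_of([-10])       # SSM_{1,1} = 1 * (-10)^2 = 100
ssm_2 = ssm_of([-10, 0])    # SSM_{2,1} = 2 * (-5)^2  = 50 < 100
```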

# previously known algorithms

- Iteratively moving values from classes with high variation to classes with low variation is discussed on Wikipedia; this does not always result in an optimal solution. Start, for example, with Class 1 = (1, 4) and Class 2 = (99, 100). The described algorithm prescribes that 4 needs to go to Class 2, since Class 1 has the highest variation, even though 4 is much closer to 1 than to 99.

- Local moves as long as the total sum of squared deviations decreases (as described in the ESRI FAQ). Note that this risks getting stuck in a local optimum. For example, start with Class 1 = {1, 8, 9, 10} and Class 2 = {16}, with SSD = 50. Moving the 10 to Class 2 increases the SSD to 56, yet the global minimum is reached with Class 1 = {1} and Class 2 = {8, 9, 10, 16}, with an SSD of 38.75.
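The SSD figures in the local-optimum example above are easy to verify (a quick Python check, assuming unit weights):

```python
# Sum of squared deviations from the class mean, unit weights.
def ssd(values):
    m = sum(values) / len(values)
    return sum((x - m) ** 2 for x in values)

start = ssd([1, 8, 9, 10]) + ssd([16])       # 50.0
after_move = ssd([1, 8, 9]) + ssd([10, 16])  # 56.0: the local move hurts
optimum = ssd([1]) + ssd([8, 9, 10, 16])     # 38.75: the global minimum
```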

# source code

Source code is part of the GeoDms and can be found here or in our GitHub Repository; look for the function ClassifyJenksFisher().

# acknowledgements

This algorithm and the proof of its validity and complexity were found by Maarten Hilferink, who is also the author of this page. © Object Vision BV. The work on developing this method was partly made possible by a software development request from the PBL Netherlands Environmental Assessment Agency (http://www.pbl.nl).

# references

- Fisher, W. D. (1958). On grouping for maximum homogeneity. American Statistical Association Journal, 53, 789-798.
- Jenks, G. F. (1977). Optimal data classification for choropleth maps, Occasional paper No. 2. Lawrence, Kansas: University of Kansas, Department of Geography.
- Evans, I.S. (1977). The selection of class intervals. Transactions of the Institute of British Geographers, 2:98-124.
- https://arxiv.org/pdf/1701.07204.pdf
- https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5148156/
- https://link.springer.com/content/pdf/10.1007%2FBF02574380.pdf
- https://en.wikipedia.org/wiki/SMAWK_algorithm

# links

- http://en.wikipedia.org/wiki/Jenks_natural_breaks_optimization (Wikipedia)
- http://danieljlewis.org/2010/06/07/jenks-natural-breaks-algorithm-in-python
- http://danieljlewis.org/files/2010/06/Jenks.pdf
- http://www.biomedware.com/files/documentation/spacestat/interface/map/classify/About_natural_breaks.htm
- http://marc.info/?l=r-sig-geo&m=118536881101239
- http://code.google.com/p/geoviz/source/browse/trunk/common/src/main/java/geovista/common/classification/ClassifierJenks.java?spec=svn634&r=634
- http://www.tandfonline.com/doi/abs/10.1080/01621459.1958.10501479
- http://www.casa.ucl.ac.uk/martin/msc_gis/evans_class_intervals.pdf
- http://cran.r-project.org/web/packages/classInt/classInt.pdf
- http://www.srnr.arizona.edu/rnr/rnr419/publications/jenks_caspall_1971.pdf