Fisher’s Natural Breaks Classification complexity proof

An O(k × n × log(n)) algorithm is presented here, with proof of its validity and complexity, for the classification of an array of n numeric values into k classes such that the sum of the squared deviations from the class means is minimal, known as Fisher’s Natural Breaks Classification. This algorithm is an improvement of Jenks’ Natural Breaks Classification Method, which is a (re)implementation of the algorithm described by Fisher within the context of choropleth maps, which has time complexity O(k × n ²).

Finding breaks for 15 classes for a data array of 7 million unique values now takes 20 seconds on a desktop PC.

definitions

Given a sequence of strictly increasing values v_i and positive weights w_i for i ? {1..n}, and a given number of classes k with k = n,

choose classification function I(x) : {1..n} ? {1..k}, such that ***SSD_{n, k***}, the sum of the squares of the deviations from the class means, is minimal, where:

\[\begin{align} SSD_{n,k} &:= \min\limits_{I: \{1..n\} \to \{1..k\}} \sum\limits^k_{j=1} ssd(S_j) & & \text{minimal sum of sum of squared deviations } \\\\ S_j &:= \{i\in \{1..n\}\|I(i) = j\} & & \text{set of indices that map to class}\,j \\\\ ssd(S) &:= { \sum\limits_{i \in S} {w_i (v_i - m(S))^2} } & & \text{the sum of squared deviations of the values of any index set}\,S \\\\ m(S) &:= { s(S) \over w(S) } & & \text{the weighted mean of the values of any index set}\,S \\\\ s(S) &:= { \sum\limits_{i \in S} {w_i v_i} } & & \text{the weighted sum of the values of any index set}\,S \\\\ w(S) &:= \sum\limits_{i \in S} {w_i} & & \text{the sum of the value-weights of any index set}\,S \end{align}\]

Note that any array of n unsorted and not necessarily unique values can be sorted and made into unique ***v_{i** by removing duplicates with **w_{i* representing the occurrence frequency of each value in **O(nlog(n)) time.}}

characteristics

One can derive the following:

ssd(S) = sqr(S) - wsm(S) with

\[\begin{align} sqr(S) &:= \sum\limits_{i \in S} {w_i {v_i}^2} \\\\ wsm(S) &:= {w(S) {m(S)}^2} = s(S) m(S) & &\text{the weighted square of the mean of S} \end{align}\]

SSD = SSV - WSM with

\[\begin{align} SSV &:= \sum\limits_{j} {sqr(S_j)} = sqr(\{1..n\}) \\\\ WSM &:= \sum\limits_{j} {wsm(S_j)} = \sum\limits_{j} {w(S_j) {m(S_j)}^2} = \sum\limits_{j} { {s(S_j)}^2 \over w(S_j)} \end{align}\]

Since SSV is independent of the chosen partitioning S_j, minimizing SSD is equivalent to maximizing:

\[WSM = \sum\limits_{j} { {[\sum\limits_{i \in S_j} {w_i v_i}]^2} \over {\sum\limits_{i \in S_j} {w_i}} }\]

Since the sequence v is sorted, all v_i***’s that belong to one class ***S_j*** must be consecutive and if equal values would be allowed, these would end up in the same class provided that at least ***k unique values are given (for proof, see: Fisher, On Grouping for Maximum Homogeneity, 1958). Consequently, the S_j***’s can be 1-to-1 related to a strictly increasing sequence of class-break-indices *i_**j** := min **S_j*** for **j* = 1..k with i₁ = 1 and also to a strictly increasing sequence of class-break-values v_{i_j}.

With n values and k classes, one can choose \(\binom{n-1}{k-1} = \frac{(n-1)!}{(k-1)! (n-k)!}\) elements as first value of a class (the minimum value must be the minimum of the lowest class).

given indices i₁ = i₂ and i₃ = i₄: i₁ = i₃ ? i₂ = i₄ ? m(i₁..i₂) = m(i₃..i₄)

dynamic programming approach

Take SSM_i,j*** as the maximum sum of squared deviations for the first i values classified to j classes and ***CB_i,j*** as the value index of the last class-break for such classification. Since a maximal sum is a sum, it is easy to see that ***SSM_i,j*** is equal to the maximum value for ***p ? {j..i} of SSM_{p*** - 1, ***j*** - 1} + **ssm({p..i}) . Thus, the *SSM_{p - 1, j - 1} provide Overlapping Sub Problems with an Optimal Sub Structure and the following dynamic program can be used to find a solution.

Dynamic rule for i = j > 1:

\[\begin{align} SSM_{i,j} &:= \max\limits_{p \in \{j..i\}} SSM_{p-1, j-1}+ssm(\{p..i\}) \\\\ CB_{i,j} &:= arg\max\limits_{p \in \{j..i\}} SSM_{p-1, j-1}+ssm(\{p..i\}) \end{align}\]

Start conditions:

\[\begin{align} SSM_{i,1} &:= ssm(1..i) CB_{i,1} &:= 1 \end{align}\]

Thus

\[\begin{align} SSM_{i,i} &:= sqr(\{1..i\}) \\\\ CB_{i,i} &:= i \\\\ \end{align}\]

Note that CB_i,j and SSM_i,j are only defined for j = i thus i - j = 0. Furthermore, for finding values with indices (n,k), only elements with i - j = n - k are relevant.

If i = CB_{n, k} then the corresponding previous class-break index has been stored as CB_{i - 1, k - 1}.

Fisher described an algorithm that goes through all i and for each i finding and storing all SSD_i,j and CB_i,j for each subsequent j, which is the approach taken in the found implementations (see links). This requires O(k × n) working memory and O(k × n ²) time.

Dynamically going through all j and for each j finding all SSM_i,j*** for each i only requires **O(n-k)* working memory and O(k × n × log(n)) time by exploiting the no-crossing-paths characteristic, as proven below, but only provides the last ClassBreak for each i.

no crossing paths property: CB_{i, j}is monotonous increasing with i, thus i1 = i2 ? CB_{i1, j} = CB_{i2, j} (see proof below).

To obtain a chain of CB_i,j back from CB_{n, k}, one needs to maintain (n - k + 1) × (k - 2) intermediate CB’s, unless the O(k × nlog(n)) algorithm is reapplied k times.

complexity

Each ssm(i₁..i_2***) can be calculated in constant time as \(\frac{\left(WV[i_2] - WV[i_1-1]\right)^2}{W[i_2] - W[i_1-1]}\) after a single initialization of **O(n)* time and keeping a buffer of O(n) size containing series of cumulative weights, W, and cumulative weighted values, WV.

During the calculation of the n - k + 1 relevant values of SSM and CB for a row j ? {2..k - 1} using the n - k + 1 values that were calculated for row j - 1, one can divide and conquer recursively by calculating CB_{i***, ***j***} for i in the middle of a range for which a left and right boundary are given and using the “no crossing paths” property to cut the remaining number of options in half by sectioning the lower indices to the left or equal options and the higher indices to the greater or equal options. The required processing time budget **C(n,m)* for processing n class-breaks using m values for previous class breaks is c × m + max i : [C(n/2,i)+C(n/2,m+1-i)]. A budget of C(n,m) := c × (m + n ) × (log₂(n)+1) is sufficient, thus the calculation of CB_{i, j***} given all ***CB_{i, j - 1***} requires processing time of ***C(n,n), which is O(n * logn).

proof of the no crossing paths property

Addition: the no crossing paths property might be equivalent to what elsewhere is called the convex Monge condition, which allows for the found speed-up of solving a dynamic programming problem.

Given indices p1, p2, i1, i2 such that p1 < p2 = i1 < i2,

\[\begin{align} m_p &:= m(\{p1 .. p2-1\}) &w_p &:= w(\{p1 .. p2-1\}) \\\\ m_j &:= m(\{p2 .. i1 \}) &w_j &:= w(\{p2 .. i1 \}) \\\\ m_n &:= m(\{i1+1 .. i2\}) &w_n &:= w(\{i1+1 .. i2\}) \\\\ m_{pj} &:= m(\{p1 .. i1 \}) &w_{pj} &:= w(\{p1 .. i1\}) \\\\ m_{jn} &:= m(\{p2 .. i2 \}) &w_{jn} &:= w(\{p2 .. i2\}) \\\\ m_{pjn} &:= m(\{p1 .. i2 \}) &w_{pjn} &:= w(\{p1 .. i2\}) \\\\ \end{align}\]

it follows that

\[v_{p1} < v_{p2} \le v_{i1} < v_{i2} \\\\ m_p < m_{pj} < m_j < m_{jn} < m_n \\\\ m_{pj} < m_{pjn} < m_{jn}\] \[\begin{align} w_{pj} &= w_p + w_j \\\\ w_{jn} &= w_j + w_n \\\\ w_{pjn} &= w_p + w_j + w_n = w_{pj}+w_{jn} - w_j \\\\ \end{align}\] \[\begin{align} w_{pj} \times m_{pj} &= w_p \times m_p + w_j \times m_j \\\\ w_{jn} \times m_{jn} &= w_j \times m_j + w_n \times m_n \\\\ w_{pjn} \times m_{pjn} &= w_j \times m_j + w_n \times m_n + w_n \times m_n \\\\ w_{pj} \times m_{pj}+w_{jn} \times m_{jn} &= w_j \times m_j + w_{pjn} \times m_{pjn} \end{align}\] \[\begin{align} ssm(\{p1 .. i1\}) &= w_{pj} \times m_{pj}^2 \\\\ ssm(\{p2 .. i1\}) &= w_{j} \times m_{j}^2 \\\\ ssm(\{p1 .. i2\}) &= w_{pjn} \times m_{pjn}^2 \\\\ ssm(\{i1 .. i2\}) &= w_{jn} \times m_{jn}^2 \end{align}\]

and it follows that the following quantities are positive:

\[\begin{align} w_{pj1} &:= w_j \times (m_{jn} - m_{j}) / (m_{jn} - m_{pj}) &> 0 \\\\ w_{jn1} &:= w_j \times (m_{j} - m_{pj}) / (m_{jn} - m_{pj}) &> 0 \\\\ w_{pj3} &:= w_{pj} - w_{pj1} &> 0\\\\ w_{jn3} &:= w_{jn} - w_{jn1} &> 0\\\\ \end{align}\]

note that

\[\begin{align} w_{pj1}+w_{jn1} &= w_j \\\\ w_{pj3}+w_{jn3} &= w_{pjn} \\\\ w_{pj1}+w_{pj3} &= w_{pj} \\\\ w_{jn1}+w_{jn3} &= w_{jn} \\\\ w_{pj1} \times m_{pj} + w_{jn1} \times m_{jn} &= w_j \times m_j \\\\ w_{pj3} \times m_{pj} + w_{jn3} \times m_{jn} &= w_{pj} \times m_{pj}+w_{jn} \times m_{jn}-w_j \times aj = w_{pjn} \times m_{pjn} \end{align}\]

and therefore

\[\begin{align} w_{pj1} \times (m_{pj} - m_j)^2 &+ w_{jn1} \times (m_{jn} - m_j)^2 &= w_{pj1} \times m_{pj}^2 &+ w_{jn1} \times m_{jn}^2 - w_{j} \times m_{j}^2 &> 0 \\\\ w_{pj3} \times (m_{pj} - m_{pjn})^2 &+ w_{jn3} \times (m_{jn} - m_{pjn})^2 &= w_{pj3} \times m_{pj}^2 &+ w_{jn3} \times m_{jn}^2 - w_{pjn} \times m_{pjn}^2 &> 0 \end{align}\]

summation of these two inequalities gives:

\[w_{pj} × m_{pj}^2 + w_{jn} × m_{jn}^2 - w_j × m_j^2 - w_{pjn} × m_{pj}^2 > 0\]

thus

\[w_{jn} × m_{jn}^2 - w_{j} × m_{j}^2 > w_{pjn} × m_{pjn}^2 - w_{pj} × m_{pj}^2\]

which can be restated as

\[ssm({p2..i2}) - ssm({p2..i1}) > ssm({p1..i2}) - ssm({p1..i1})\]

given that :

\(SSM_{i_{1}, j} = SSM_{p_{2}-1, j-1} + ssm(p2..i1) = SSM_{p_{1}-1, j-1} + ssm(p_{1}..i_{1}) \text{ for all } p_{1} < p_{2} \text{ when } p_{2} \text{ is taken as } CB_{i_{1}, j}\) ,

adding the last two inequalities results in:

\[SSM_{p2-1,j-1} + ssm(p2..i2) > SSM_{p1-1,j-1} + ssm(p1..i2) \text{ for all } n = i\]

given the definition for CB, this implies that CB_{*i2, j*} cannot be p1 and therefore *CB_{i2, j***} = ***p2 = **CB_{i1, j***}, QED.

further improvements

Note that

for each i, and j < i: SSM_i,j < SSM_{i,j + 1} since splitting up any class would make the total SSM larger.
it is not always true that SSM_i,j < = SSM_{i + 1,j}, for example, take v₁ = - 10 and v₂ = 0, then SSM_{1, 1} = 1 × (-10)² = 100 and SSM_{2, 1} = 2 × (-5)² = 50
From the previous paragraph, it follows that:

ssm({p2..i2}) - ssm({p1..i2}) > ssm({p2..i1}) - ssm({p1..i1})

for each i1, i2 > i1 and j < i1:

SSM_{i2, j + 1} - SSM_{i2, j} = SSM_{i1, j + 1} - SSM_{i1, j}, which can be proven by induction.

following the lines of the latter induction proof it can be shown (elaborate!) that for each i and j < i: CB_i,j = CB_{i,j + 1}, thus CB_i,j can be used as a lower bound when calculating CB_{i, j + 1}.

previously known algorithms

Iteratively moving values from classes with high variation to classes with low variation is discussed on Wikipedia, which does not always result in an optimal solution. Start for example with Class 1 with (1 4) and Class 2 with (99 100). The described algorithms prescribes that 4 needs to go to Class 2 since Class 1 has the highest variation.

Local moves as long as the total sum of squared deviations decreases (as described in the ESRI FAQ). Note that this has the risk of sticking to a local optimum. For example when starting with Class 1 = {1, 8, 9, 10} and Class 2 = {16}, the SSD = 50. When moving the 10 to class 2, the SSD increases to 56, but the global minimum is reached when Class 1 = (1) and Class 2 = (8, 9, 10, 16) with an SSD of 38.75.

source code

Source code is part of the GeoDms and can be found here or in our GitHub Repository, look for the function ClassifyJenksFisher().

acknowledgements

This algorithm and proof of its validity and complexity is found by Maarten Hilferink, also being the author of this page. © Object Vision BV. The work for developing this method has partly been made possible by a software development request from the http://www.pbl.nl PBL Netherlands Environmental Assessment Agency.

references

Fisher, W. D. (1958). On grouping for maximum homogeneity. American Statistical Association Journal, 53, 789-798.
Jenks, G. F. (1977). Optimal data classification for choropleth maps, Occasional paper No. 2. Lawrence, Kansas: University of Kansas, Department of Geography.
Evans, I.S. (1977). The selection of class intervals. Transactions of the Institute of British Geographers, 2:98-124.
https://arxiv.org/pdf/1701.07204.pdf
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5148156/
https://link.springer.com/content/pdf/10.1007%2FBF02574380.pdf
https://en.wikipedia.org/wiki/SMAWK_algorithm