An extended form of Bram Cohen's patience diff algorithm.
This implementation was derived by using the 4 rules that are outlined in
Bram Cohen's blog,
and then was further extended to support low-occurrence common elements.
The basic idea of the algorithm is to create a histogram of occurrences for
each element of sequence A. Each element of sequence B is then considered in
turn. If the element also exists in sequence A, and has a lower occurrence
count, the positions are considered as a candidate for the longest common
subsequence (LCS). After scanning of B is complete the LCS that has the
lowest number of occurrences is chosen as a split point. The region is split
around the LCS, and the algorithm is recursively applied to the sections
before and after the LCS.
By always selecting a LCS position with the lowest occurrence count, this
algorithm behaves exactly like Bram Cohen's patience diff whenever there is a
unique common element available between the two sequences. When no unique
elements exist, the lowest occurrence element is chosen instead. This offers
more readable diffs than simply falling back on the standard Myers' O(ND)
algorithm would produce.
To prevent the algorithm from having an O(N^2) running time, an upper limit
on the number of unique elements in a histogram bucket is configured by
#setMaxChainLength(int). If sequence A has more than this many
elements that hash into the same hash bucket, the algorithm passes the region
to
#setFallbackAlgorithm(DiffAlgorithm). If no fallback algorithm is
configured, the region is emitted as a replace edit.
During scanning of sequence B, any element of A that occurs more than
#setMaxChainLength(int) times is never considered for an LCS match
position, even if it is common between the two sequences. This limits the
number of locations in sequence A that must be considered to find the LCS,
and helps maintain a lower running time bound.
So long as
#setMaxChainLength(int) is a small constant (such as 64),
the algorithm runs in O(N * D) time, where N is the sum of the input lengths
and D is the number of edits in the resulting EditList. If the supplied
org.eclipse.jgit.diff.SequenceComparator has a good hash function,
this implementation typically out-performs
org.eclipse.jgit.diff.MyersDiff, even though its theoretical running
time is the same.
This implementation has an internal limitation that prevents it from handling
sequences with more than 268,435,456 (2^28) elements.