An efficient (in both memory and time) implementation of a Set used to
verify that a given word is contained within the set. The general usage
pattern is expected to be such that most checks are positive, i.e. that
the word is indeed contained in the set.
Performance of the set is comparable to that of java.util.TreeSet
for Strings, i.e. 2-3x slower than java.util.HashSet when
using pre-constructed Strings. This is generally a result of the
algorithmic complexity of the structures: Word and Tree sets are roughly
logarithmic in the size of the whole data, whereas a Hash set is linear
in the length of the key.
However:
- WordSet can use char arrays as keys, without constructing Strings.
In cases where there are no Strings (or no need for them), WordSet seems to be
about twice as fast, even without considering additional GC caused
by temporary String instances.
- WordSet is more compact in its memory representation; if Strings are
shared, its size is comparable to an optimally filled HashSet, and if
no such Strings exist, it is much more compact (relatively speaking).
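
The first point above, membership checks directly against a char buffer,
can be illustrated with a minimal sketch. It does not show WordSet's
internal layout: SortedWordLookup and its methods are hypothetical names,
and a plain binary search over sorted char arrays stands in for the real
structure. It only shows how a slice of a buffer can be tested without
allocating a temporary String, at a cost that is logarithmic in the
number of words.

```java
import java.util.Arrays;

// Hypothetical illustration; NOT the actual WordSet implementation or API.
final class SortedWordLookup {
    // Words stored as char arrays, in String natural (lexicographic) order.
    private final char[][] words;

    SortedWordLookup(String... input) {
        String[] sorted = input.clone();
        Arrays.sort(sorted);
        words = new char[sorted.length][];
        for (int i = 0; i < sorted.length; i++) {
            words[i] = sorted[i].toCharArray();
        }
    }

    // Checks the slice buf[start .. start+len) without constructing a String.
    boolean contains(char[] buf, int start, int len) {
        int lo = 0;
        int hi = words.length - 1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            int cmp = compare(words[mid], buf, start, len);
            if (cmp == 0) {
                return true;
            }
            if (cmp < 0) {
                lo = mid + 1;   // stored word sorts before the key
            } else {
                hi = mid - 1;   // stored word sorts after the key
            }
        }
        return false;
    }

    // Same ordering as String.compareTo: first differing char, then length.
    private static int compare(char[] word, char[] buf, int start, int len) {
        int n = Math.min(word.length, len);
        for (int i = 0; i < n; i++) {
            int diff = word[i] - buf[start + i];
            if (diff != 0) {
                return diff;
            }
        }
        return word.length - len;
    }
}
```

A parser holding its input in a char[] could call contains(buf, start, len)
for each token it scans, so no String is created for the lookup.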
Although this is an efficient set for a specific set of usage patterns,
one restriction is that the full set of words to include has to be
known before constructing the set. Also, the size of the set is
limited to a total word content of about 20k characters; the factory
method verifies the limit and indicates if an instance cannot be created.
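
The two restrictions (all words known up front, and a bounded total
size) can be sketched as a factory method. Everything here is an
assumption made for illustration: the class name BoundedWordList, the
tryCreate method, the exact limit of 20,000 characters, and the use of an
empty Optional to signal failure. The text above does not say how the
real factory method reports that an instance cannot be created.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.Optional;

// Hypothetical illustration; NOT the actual WordSet factory method.
final class BoundedWordList {
    // Assumed value; the source only says "about 20k characters".
    static final int MAX_TOTAL_CHARS = 20_000;

    private final String[] sorted;

    private BoundedWordList(String[] sorted) {
        this.sorted = sorted;
    }

    // Builds an immutable set from the complete word list, or returns an
    // empty Optional when the combined length exceeds the assumed limit.
    static Optional<BoundedWordList> tryCreate(Collection<String> words) {
        long total = 0;
        for (String w : words) {
            total += w.length();
            if (total > MAX_TOTAL_CHARS) {
                return Optional.empty();
            }
        }
        String[] copy = words.toArray(new String[0]);
        Arrays.sort(copy);
        return Optional.of(new BoundedWordList(copy));
    }

    boolean contains(String word) {
        return Arrays.binarySearch(sorted, word) >= 0;
    }
}
```

Because the instance is built once from the full word list and exposes
no way to add words, callers have to fall back to another Set
implementation when creation fails.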