A concrete implementation class for
Collation.
RuleBasedCollator has the following restrictions for efficiency
(other subclasses may be used for more complex languages):
- If a French secondary ordering is specified it applies to the whole
collator object.
- All non-mentioned Unicode characters are at the end of the collation
order.
- If a character is not located in the
RuleBasedCollator, the
default Unicode Collation Algorithm (UCA) rule-based table is automatically
searched as a backup.
The collation table is composed of a list of collation rules, where each rule
takes one of three forms:
<modifier>
<relation> <text-argument>
<reset> <text-argument>
The rule elements are defined as follows:
- Modifier: There is a single modifier which is used to
specify that all accents (secondary differences) are backwards:
- '@' : Indicates that accents are sorted backwards, as in French.
- Relation: The relations are the following:
- '<' : Greater, as a letter difference (primary)
- ';' : Greater, as an accent difference (secondary)
- ',' : Greater, as a case difference (tertiary)
- '=' : Equal
- Text-Argument: A text-argument is any sequence of
characters, excluding special characters (that is, common whitespace
characters [0009-000D, 0020] and rule syntax characters [0021-002F,
003A-0040, 005B-0060, 007B-007E]). If those characters are desired, you can
put them in single quotes (for example, use '&' for ampersand). Note that
unquoted white space characters are ignored; for example,
b c is
treated as
bc.
- Reset: There is a single reset which is used primarily
for contractions and expansions, but which can also be used to add a
modification at the end of a set of rules:
- '&' : Indicates that the next rule follows the position where the reset
text-argument would be sorted.
This sounds more complicated than it is in practice. For example, the
following are equivalent ways of expressing the same thing:
a < b < c
a < b & b < c
a < c & a < b
Notice that the order is important, as the subsequent item goes immediately
after the text-argument. The following are not equivalent:
a < b & a < c
a < c & a < b
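The effect of reset placement can be checked by sorting a few letters. The following is a minimal sketch, assuming the java.text.RuleBasedCollator API of a standard JDK; the class ResetOrderDemo and its helper sortWith are hypothetical names used only for illustration. Each rule string starts with "<" because a complete rule set must begin with a relation.

```java
import java.text.ParseException;
import java.text.RuleBasedCollator;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical demo class; not part of the collation API.
class ResetOrderDemo {

    // Sorts the letters a, b and c with a collator built from the given rules.
    static List<String> sortWith(String rules) {
        try {
            RuleBasedCollator collator = new RuleBasedCollator(rules);
            List<String> letters = new ArrayList<>(Arrays.asList("c", "b", "a"));
            letters.sort(collator); // a Collator is also a Comparator
            return letters;
        } catch (ParseException e) {
            throw new IllegalArgumentException("Invalid rules: " + rules, e);
        }
    }

    public static void main(String[] args) {
        // Three equivalent spellings of the same order.
        System.out.println(sortWith("< a < b < c"));
        System.out.println(sortWith("< a < b & b < c"));
        System.out.println(sortWith("< a < c & a < b"));
        // Not equivalent: c is placed immediately after a, ahead of b.
        System.out.println(sortWith("< a < b & a < c"));
    }
}
```

The first three calls should yield the order a, b, c, while the last one should yield a, c, b, because the item after a reset goes immediately after the reset text-argument.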
Either the text-argument must already be present in the sequence, or some
initial substring of the text-argument must be present. For example
"a < b & ae < e" is valid since "a" is present in the sequence before
"ae" is reset. In this latter case, "ae" is not entered and treated as a
single character; instead, "e" is sorted as if it were expanded to two
characters: "a" followed by an "e". This difference appears in natural
languages: in traditional Spanish "ch" is treated as if it contracts to a
single character (expressed as
"c < ch < d"), while in traditional
German a-umlaut is treated as if it expands to two characters (expressed as
"a,A < b,B ... & ae;\u00e4 & AE;\u00c4", where \u00e4 and \u00c4
are the escape sequences for a-umlaut).
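The contraction case can be illustrated with a reduced alphabet. This is a sketch under stated assumptions: it uses java.text.RuleBasedCollator from a standard JDK, the rule strings cover only the letters a to i, and the class ContractionDemo and its compare helper are hypothetical. The expansion case is not shown here.

```java
import java.text.ParseException;
import java.text.RuleBasedCollator;

// Hypothetical demo class; not part of the collation API.
class ContractionDemo {

    // Reduced alphabet without a contraction.
    static final String PLAIN = "< a < b < c < d < e < f < g < h < i";

    // Same alphabet, but "ch" is a single collation element between c and d.
    static final String WITH_CH = "< a < b < c < ch < d < e < f < g < h < i";

    // Compares two strings with a collator built from the given rules.
    static int compare(String rules, String left, String right) {
        try {
            return new RuleBasedCollator(rules).compare(left, right);
        } catch (ParseException e) {
            throw new IllegalArgumentException("Invalid rules: " + rules, e);
        }
    }

    public static void main(String[] args) {
        // Without the contraction, "ch" sorts before "ci" because h < i.
        System.out.println(compare(PLAIN, "ch", "ci") < 0);
        // With the contraction, "ch" sorts after every other string starting with c ...
        System.out.println(compare(WITH_CH, "ch", "ci") > 0);
        // ... but still before d.
        System.out.println(compare(WITH_CH, "ch", "d") < 0);
    }
}
```

Each comparison in main is expected to print true: once "ch" is listed as its own text-argument, it is matched as a unit before the individual letters are considered.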
Ignorable Characters
For ignorable characters, the first rule must start with a relation (the
examples we have used above are really fragments; "a < b" really should be
"< a < b"). If, however, the first relation is not "<", then all
text-arguments up to the first "<" are ignorable.
Normalization and Accents
RuleBasedCollator automatically processes its rule table to include
both pre-composed and combining-character versions of accented characters.
Even if the provided rule string contains only base characters and separate
combining accent characters, the pre-composed accented characters matching
all canonical combinations of characters from the rule string will be entered
in the table.
This allows you to use a RuleBasedCollator to compare accented strings even
when the collator is set to NO_DECOMPOSITION. However, if the strings to be
collated contain combining sequences that may not be in canonical order, you
should set the collator to CANONICAL_DECOMPOSITION to enable sorting of
combining sequences. For more information, see The Unicode Standard, Version 3.0.
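As an illustration of the decomposition setting, the sketch below compares a pre-composed character with its combining sequence, and two combining sequences whose accents appear in different orders. It assumes the standard java.text.Collator API; the class DecompositionDemo and the helper compareDecomposed are hypothetical, and the default en_US collator is used only as a convenient rule table.

```java
import java.text.Collator;
import java.util.Locale;

// Hypothetical demo class; not part of the collation API.
class DecompositionDemo {

    // Compares two strings with canonical decomposition switched on.
    static int compareDecomposed(String left, String right) {
        Collator collator = Collator.getInstance(Locale.US);
        collator.setDecomposition(Collator.CANONICAL_DECOMPOSITION);
        return collator.compare(left, right);
    }

    public static void main(String[] args) {
        // Pre-composed e-acute versus e followed by a combining acute accent.
        System.out.println(compareDecomposed("\u00e9", "e\u0301"));
        // The same two accents (acute, dot below) in different orders.
        System.out.println(compareDecomposed("a\u0301\u0323", "a\u0323\u0301"));
    }
}
```

Both comparisons should return 0, since canonical decomposition reduces each pair to the same canonically ordered sequence before the collation elements are compared.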
Errors
The following rules are not valid:
- A text-argument contains unquoted punctuation symbols, for example
"a < b-c < d".
- A relation or reset character is not followed by a text-argument, for
example
"a < , b".
- A reset where the text-argument (or an initial substring of the
text-argument) is not already in the sequence or allocated in the default UCA
table, for example
"a < b & e < f".
If you produce one of these errors,
RuleBasedCollator throws a
ParseException.
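A rule string can be validated before use by attempting to construct a collator from it. The following is a minimal sketch, assuming java.text.RuleBasedCollator from a standard JDK; RuleValidationDemo and isValid are hypothetical names. The third error case above depends on whether the implementation falls back to a default table, so it is printed but not relied on.

```java
import java.text.ParseException;
import java.text.RuleBasedCollator;

// Hypothetical demo class; not part of the collation API.
class RuleValidationDemo {

    // Returns true if a collator can be built from the rules, false on a ParseException.
    static boolean isValid(String rules) {
        try {
            new RuleBasedCollator(rules);
            return true;
        } catch (ParseException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(isValid("< a < b < c"));     // well-formed
        System.out.println(isValid("< a < b-c < d"));   // unquoted punctuation
        System.out.println(isValid("< a < , b"));       // relation without a text-argument
        // Reset to a text-argument that is not in the rules; the outcome may
        // differ between implementations with and without a fallback table.
        System.out.println(isValid("< a < b & e < f"));
    }
}
```

Catching ParseException at the point of construction keeps malformed rules from propagating; the exception's getErrorOffset method can be consulted when a position in the rule string is reported.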
Examples
Normally, to create a rule-based collator object, you will use
Collator's factory method
getInstance. However, to create a
rule-based collator object with specialized rules tailored to your needs, you
construct the
RuleBasedCollator with the rules contained in a
String object. For example:
String Simple = "< a < b < c < d";
RuleBasedCollator mySimple = new RuleBasedCollator(Simple);
Or:
String Norwegian = "< a,A< b,B< c,C< d,D< e,E< f,F< g,G< h,H< i,I"
+ "< j,J< k,K< l,L< m,M< n,N< o,O< p,P< q,Q< r,R"
+ "< s,S< t,T< u,U< v,V< w,W< x,X< y,Y< z,Z"
+ "< \u00E5=a\u030A,\u00C5=A\u030A"
+ ";aa,AA< \u00E6,\u00C6< \u00F8,\u00D8";
RuleBasedCollator myNorwegian = new RuleBasedCollator(Norwegian);
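Once constructed, a rule-based collator can be passed wherever a Comparator is accepted. The sketch below assumes the standard java.text and java.util APIs; the class CustomSortDemo, the helper sortReversed, and the deliberately reversed four-letter rule string are made up for illustration.

```java
import java.text.ParseException;
import java.text.RuleBasedCollator;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical demo class; not part of the collation API.
class CustomSortDemo {

    // Illustrative rules that order the letters d, c, b, a (reverse alphabet).
    static final String REVERSED = "< d < c < b < a";

    // Returns a sorted copy of the words, using the reversed rules.
    static List<String> sortReversed(List<String> words) {
        try {
            RuleBasedCollator collator = new RuleBasedCollator(REVERSED);
            List<String> sorted = new ArrayList<>(words);
            sorted.sort(collator);
            return sorted;
        } catch (ParseException e) {
            throw new IllegalStateException("Unexpected invalid rules", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(sortReversed(Arrays.asList("ab", "ba", "cd", "dc")));
    }
}
```

With these rules the words should come out as dc, cd, ba, ab, since the first letters are compared in the order d, c, b, a.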
Combining
Collators is as simple as concatenating strings. Here is
an example that combines two
Collators from two different locales:
// Create an en_US Collator object
RuleBasedCollator en_USCollator = (RuleBasedCollator)Collator
.getInstance(new Locale("en", "US", ""));
// Create a da_DK Collator object
RuleBasedCollator da_DKCollator = (RuleBasedCollator)Collator
.getInstance(new Locale("da", "DK", ""));
// Combine the two collators
// First, get the collation rules from en_USCollator
String en_USRules = en_USCollator.getRules();
// Second, get the collation rules from da_DKCollator
String da_DKRules = da_DKCollator.getRules();
RuleBasedCollator newCollator = new RuleBasedCollator(en_USRules + da_DKRules);
// newCollator has the combined rules
The next example shows how to modify an existing table to create a new
Collator object. For example, add
"& C < ch, cH, Ch, CH" to
the
en_USCollator object to create your own:
// Create a new Collator object with additional rules
String addRules = "& C < ch, cH, Ch, CH";
RuleBasedCollator myCollator = new RuleBasedCollator(en_USCollator.getRules() + addRules);
// myCollator contains the new rules
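The effect of the added rule can be observed by comparing strings before and after the change. This is a sketch, assuming that Collator.getInstance returns a RuleBasedCollator for en_US (as it does on a standard JDK, hence the instanceof check); TailoredEnglishDemo and withChContraction are hypothetical names.

```java
import java.text.Collator;
import java.text.ParseException;
import java.text.RuleBasedCollator;
import java.util.Locale;

// Hypothetical demo class; not part of the collation API.
class TailoredEnglishDemo {

    // Builds an en_US-based collator in which "ch" sorts as a unit after C.
    static RuleBasedCollator withChContraction() {
        Collator base = Collator.getInstance(Locale.US);
        if (!(base instanceof RuleBasedCollator)) {
            throw new IllegalStateException("en_US collator is not rule-based");
        }
        String rules = ((RuleBasedCollator) base).getRules() + "& C < ch, cH, Ch, CH";
        try {
            return new RuleBasedCollator(rules);
        } catch (ParseException e) {
            throw new IllegalStateException("Combined rules are invalid", e);
        }
    }

    public static void main(String[] args) {
        Collator plain = Collator.getInstance(Locale.US);
        RuleBasedCollator tailored = withChContraction();
        // Plain English order: "ch" before "cz".
        System.out.println(plain.compare("ch", "cz") < 0);
        // Tailored order: "ch" after "cz", but still before "d".
        System.out.println(tailored.compare("ch", "cz") > 0);
        System.out.println(tailored.compare("ch", "d") < 0);
    }
}
```

Note that the rule string is obtained with getRules(); concatenating the collator object itself would append its toString() value rather than its rules.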
The following example demonstrates how to change the order of non-spacing
accents:
// old rule
String oldRules = "= \u00a8 ; \u00af ; \u00bf" + "< a , A ; ae, AE ; \u00e6 , \u00c6"
+ "< b , B < c, C < e, E & C < d, D";
// change the order of accent characters
String addOn = "& \u00bf ; \u00af ; \u00a8";
RuleBasedCollator myCollator = new RuleBasedCollator(oldRules + addOn);
The last example shows how to put new primary ordering in before the default
setting. For example, in the Japanese
Collator, you can sort
English characters either before or after Japanese characters:
// get en_US Collator rules
RuleBasedCollator en_USCollator = (RuleBasedCollator)
Collator.getInstance(Locale.US);
// add a few Japanese characters to sort before English characters
// suppose the last character before the first base letter 'a' in
// the English collation rule is \u30A2
String jaString = "& \u30A2 , \u30FC < \u30C8";
RuleBasedCollator myJapaneseCollator =
new RuleBasedCollator(en_USCollator.getRules() + jaString);