java.text.BreakIterator.
_usage_
A class that locates boundaries in text. This class defines a protocol for
objects that break up a piece of natural-language text according to a set
of criteria. Instances or subclasses of BreakIterator can be provided, for
example, to break a piece of text into words, sentences, or logical characters
according to the conventions of some language or group of languages.
We provide five built-in types of BreakIterator:
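- getTitleInstance() returns a BreakIterator that locates title boundaries,
that is, the positions at which title casing restarts. (This factory method is
defined by ICU's BreakIterator; the JDK's java.text.BreakIterator does not
provide it.)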
- getSentenceInstance() returns a BreakIterator that locates boundaries
between sentences. This is useful for triple-click selection, for example.
- getWordInstance() returns a BreakIterator that locates boundaries between
words. This is useful for double-click selection or "find whole words" searches.
This type of BreakIterator makes sure there is a boundary position at the
beginning and end of each legal word. (Numbers count as words, too.) Whitespace
and punctuation are kept separate from real words.
- getLineInstance() returns a BreakIterator that locates positions where it is
legal for a text editor to wrap lines. This is similar to word breaking, but
not the same: punctuation and whitespace are generally kept with words (you don't
want a line to start with whitespace, for example), and some special characters
can force a position to be considered a line-break position or prevent a position
from being a line-break position.
- getCharacterInstance() returns a BreakIterator that locates boundaries between
logical characters. Because of the structure of the Unicode encoding, a logical
character may be stored internally as more than one Unicode code point. (A with an
umlaut may be stored as an a followed by a separate combining umlaut character,
for example, but the user still thinks of it as one character.) This iterator allows
various processes (especially text editors) to treat as characters the units of text
that a user would think of as characters, rather than the units of text that the
computer sees as "characters".
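As a minimal sketch of the difference between stored units and logical characters, the following counts the boundaries reported by the character iterator for a string that contains a decomposed "a with umlaut". The class and method names (LogicalCharacterCount, countLogicalCharacters) are illustrative assumptions, not part of the API.

```java
import java.text.BreakIterator;

final class LogicalCharacterCount {
    // Counts user-perceived characters by counting the boundaries after first().
    static int countLogicalCharacters(String text) {
        BreakIterator it = BreakIterator.getCharacterInstance();
        it.setText(text);
        int count = 0;
        it.first();
        while (it.next() != BreakIterator.DONE) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // 'a' followed by U+0308 COMBINING DIAERESIS, then "bc"
        String decomposed = "a\u0308bc";
        System.out.println("stored chars:       " + decomposed.length());
        System.out.println("logical characters: " + countLogicalCharacters(decomposed));
    }
}
```

The stored length counts the combining mark separately, while the character iterator is expected to report the base letter and its mark as a single unit.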
The text boundary positions are found according to the rules
described in Unicode Standard Annex #29, Text Boundaries, and
Unicode Standard Annex #14, Line Breaking Properties. These
are available at http://www.unicode.org/reports/tr29/ and
http://www.unicode.org/reports/tr14/, respectively.
BreakIterator's interface follows an "iterator" model (hence the name), meaning it
has a concept of a "current position" and methods like first(), last(), next(),
and previous() that update the current position. All BreakIterators uphold the
following invariants:
- The beginning and end of the text are always treated as boundary positions.
- The current position of the iterator is always a boundary position
(random-access methods such as following() and preceding() move the iterator
to the nearest boundary position before or after the specified position, not
_to_ the specified position).
- DONE is used as a flag to indicate when iteration has stopped. DONE is only
returned when the current position is the end of the text and the user calls next(),
or when the current position is the beginning of the text and the user calls
previous().
- Break positions are numbered by the positions of the characters that follow
them. Thus, under normal circumstances, the position before the first character
is 0, the position after the first character is 1, and the position after the
last character is 1 plus the index of the last character (that is, the length
of the string).
- The client can change the position of an iterator, or the text it analyzes,
at will, but cannot change its behavior. A client that wants different behavior
must instantiate a new iterator.
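The invariants above can be checked directly against a word iterator. This is a small sketch; the class name BoundaryInvariants, the helper wordIterator, and the sample text are assumptions made for illustration.

```java
import java.text.BreakIterator;
import java.util.Locale;

final class BoundaryInvariants {
    // Hypothetical helper: a word iterator already positioned over the given text.
    static BreakIterator wordIterator(String text) {
        BreakIterator it = BreakIterator.getWordInstance(Locale.US);
        it.setText(text);
        return it;
    }

    public static void main(String[] args) {
        String text = "Hello world";
        BreakIterator it = wordIterator(text);

        // The beginning and end of the text are always boundaries.
        System.out.println(it.first() == 0);                  // true
        System.out.println(it.last() == text.length());       // true

        // DONE is returned only when stepping past either end.
        System.out.println(it.next() == BreakIterator.DONE);  // true
        it.first();
        System.out.println(it.previous() == BreakIterator.DONE); // true

        // Random access lands on a boundary, not on the requested offset.
        System.out.println(it.following(2)); // 5, the end of "Hello"
        System.out.println(it.current());    // 5
        System.out.println(it.preceding(2)); // 0
    }
}
```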
BreakIterator accesses the text it analyzes through a CharacterIterator, which makes
it possible to use BreakIterator to analyze text in any text-storage vehicle that
provides a CharacterIterator interface.
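A minimal sketch of supplying text through a CharacterIterator rather than a String follows. It uses StringCharacterIterator, the JDK's String-backed implementation; any other CharacterIterator implementation could be passed the same way. The class and method names are illustrative assumptions.

```java
import java.text.BreakIterator;
import java.text.CharacterIterator;
import java.text.StringCharacterIterator;
import java.util.ArrayList;
import java.util.List;

final class CharacterIteratorSource {
    // Collects every word boundary reported for the text behind the iterator.
    static List<Integer> wordBoundaries(CharacterIterator source) {
        BreakIterator it = BreakIterator.getWordInstance();
        it.setText(source);
        List<Integer> boundaries = new ArrayList<>();
        for (int b = it.first(); b != BreakIterator.DONE; b = it.next()) {
            boundaries.add(b);
        }
        return boundaries;
    }

    public static void main(String[] args) {
        CharacterIterator source = new StringCharacterIterator("one two");
        System.out.println(wordBoundaries(source)); // [0, 3, 4, 7]
    }
}
```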
Note: Some types of BreakIterator can take a long time to create, and
instances of BreakIterator are not currently cached by the system. For
optimal performance, keep instances of BreakIterator around as long as makes
sense. For example, when word-wrapping a document, don't create and destroy a
new BreakIterator for each line. Create one break iterator for the whole document
(or whatever stretch of text you're wrapping) and use it to do the whole job of
wrapping the text.
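One way to follow this advice is to hold a single iterator and call setText() for each new stretch of text. The sketch below assumes a hypothetical helper class, ReusedLineBreaker, that counts line-break opportunities per paragraph.

```java
import java.text.BreakIterator;

final class ReusedLineBreaker {
    // Created once and reused for every paragraph.
    private final BreakIterator lineBreaks = BreakIterator.getLineInstance();

    // Counts the boundaries after the start of the paragraph, including the final one.
    int countBreakOpportunities(String paragraph) {
        lineBreaks.setText(paragraph); // re-targets the existing iterator
        int count = 0;
        lineBreaks.first();
        while (lineBreaks.next() != BreakIterator.DONE) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        ReusedLineBreaker breaker = new ReusedLineBreaker();
        String[] paragraphs = { "The quick fox", "aa bb" };
        for (String p : paragraphs) {
            System.out.println(p + " -> " + breaker.countBreakOpportunities(p));
        }
    }
}
```

Because the iterator keeps mutable state (its text and current position), a shared instance like this is best confined to one caller at a time.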
Examples:
Creating and using text boundaries
public static void main(String[] args) {
    if (args.length == 1) {
        String stringToExamine = args[0];

        // print each word in order
        BreakIterator boundary = BreakIterator.getWordInstance();
        boundary.setText(stringToExamine);
        printEachForward(boundary, stringToExamine);

        // print each sentence in reverse order
        boundary = BreakIterator.getSentenceInstance(Locale.US);
        boundary.setText(stringToExamine);
        printEachBackward(boundary, stringToExamine);
        printFirst(boundary, stringToExamine);
        printLast(boundary, stringToExamine);
    }
}
Print each element in order
public static void printEachForward(BreakIterator boundary, String source) {
    int start = boundary.first();
    for (int end = boundary.next();
         end != BreakIterator.DONE;
         start = end, end = boundary.next()) {
        System.out.println(source.substring(start, end));
    }
}
Print each element in reverse order
public static void printEachBackward(BreakIterator boundary, String source) {
    int end = boundary.last();
    for (int start = boundary.previous();
         start != BreakIterator.DONE;
         end = start, start = boundary.previous()) {
        System.out.println(source.substring(start, end));
    }
}
Print first element
public static void printFirst(BreakIterator boundary, String source) {
    int start = boundary.first();
    int end = boundary.next();
    System.out.println(source.substring(start, end));
}
Print last element
public static void printLast(BreakIterator boundary, String source) {
    int end = boundary.last();
    int start = boundary.previous();
    System.out.println(source.substring(start, end));
}
Print the element at a specified position
public static void printAt(BreakIterator boundary, int pos, String source) {
    int end = boundary.following(pos);
    int start = boundary.previous();
    System.out.println(source.substring(start, end));
}
Find the next word
public static int nextWordStartAfter(int pos, String text) {
    BreakIterator wb = BreakIterator.getWordInstance();
    wb.setText(text);
    int last = wb.following(pos);
    int current = wb.next();
    while (current != BreakIterator.DONE) {
        // a span containing at least one letter is treated as a word
        for (int p = last; p < current; p++) {
            if (Character.isLetter(text.charAt(p))) {
                return last;
            }
        }
        last = current;
        current = wb.next();
    }
    return BreakIterator.DONE;
}
(The iterator returned by BreakIterator.getWordInstance() is unique in that
the break positions it returns don't represent both the start and end of the
thing being iterated over. That is, a sentence-break iterator returns breaks
that each represent the end of one sentence and the beginning of the next.
With the word-break iterator, the characters between two boundaries might be a
word, or they might be the punctuation or whitespace between two words. The
above code uses a simple heuristic to determine which boundary is the beginning
of a word: If the characters between this boundary and the next boundary
include at least one letter (this can be an alphabetical letter, a CJK ideograph,
a Hangul syllable, a Kana character, etc.), then the text between this boundary
and the next is a word; otherwise, it's the material between words.)
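The same heuristic can be applied to collect only the words from a piece of text and skip the material between them. This is a sketch under the assumption described above (a span is a word if it contains at least one letter); the class WordExtractor and its methods are hypothetical names. Note that under this rule a span made only of digits is not collected.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;

final class WordExtractor {
    // True if text[start, end) contains at least one letter.
    static boolean containsLetter(String text, int start, int end) {
        for (int p = start; p < end; p++) {
            if (Character.isLetter(text.charAt(p))) {
                return true;
            }
        }
        return false;
    }

    // Returns the spans between word boundaries that the heuristic classifies as words.
    static List<String> words(String text) {
        BreakIterator wb = BreakIterator.getWordInstance();
        wb.setText(text);
        List<String> result = new ArrayList<>();
        int start = wb.first();
        for (int end = wb.next();
             end != BreakIterator.DONE;
             start = end, end = wb.next()) {
            if (containsLetter(text, start, end)) {
                result.add(text.substring(start, end));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(words("Hello, world!")); // [Hello, world]
    }
}
```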