Classify the contents of a String into labeled character-offset spans.
Plain text or XML input is expected, and the
PlainTextDocumentReaderAndWriter is used by default. The output is a
(possibly empty, but never null) List of Triples. Each Triple is an entity
name, followed by beginning and ending character offsets in the original
String. Character offsets can be thought of as fenceposts between the
characters, or, like certain methods in the Java String class, as character
positions, numbered starting from 0, with the end index pointing to the
position AFTER the entity ends. That is, end - start is the length of the
entity in characters.
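
The offset convention above can be illustrated with a minimal sketch. It does not call the classifier: the Span class is a hypothetical stand-in for a Triple of (entity name, begin, end), and the labels and offsets are written by hand for a made-up sentence. It only shows that half-open offsets line up with String.substring and that end - begin is the entity length.

```java
import java.util.ArrayList;
import java.util.List;

class OffsetSpans {
    /** Hypothetical stand-in for a Triple of (entity name, begin offset, end offset). */
    static final class Span {
        final String label;
        final int begin; // index of the first character of the entity
        final int end;   // index of the position AFTER the last character

        Span(String label, int begin, int end) {
            this.label = label;
            this.begin = begin;
            this.end = end;
        }

        int length() {
            return end - begin;
        }
    }

    /** Recover the entity text from the original String using the half-open offsets. */
    static String entityText(String source, Span span) {
        return source.substring(span.begin, span.end);
    }

    public static void main(String[] args) {
        String text = "Jane Doe visited Paris.";

        // Hand-written spans for illustration only; a real classifier would produce these.
        List<Span> spans = new ArrayList<>();
        spans.add(new Span("PERSON", 0, 8));
        spans.add(new Span("LOCATION", 17, 22));

        for (Span s : spans) {
            System.out.println(s.label + " [" + s.begin + "," + s.end + ") -> "
                    + entityText(text, s) + " (length " + s.length() + ")");
        }
    }
}
```

Because the end offset is exclusive, the spans can be passed straight to substring without any off-by-one adjustment.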
Fine points: Token offsets are accurate with respect to the source text, even though
the tokenizer may internally normalize certain tokens to String
representations of different lengths (e.g., " becoming `` or ''). When a
period serves both as part of an abbreviation and as an end-of-sentence
marker, and that abbreviation is part of a named entity, the reported
entity string excludes the period.