In fairly rare cases, a PDF's XMP will contain a string that
has incorrectly been encoded with PDFEncoding: octal escapes for non-ASCII bytes
and literal ASCII for ASCII bytes, e.g. "\376\377\000M\000i\000c\000r\000o\000s\000o\000f\000t\000".
This class can be used to decode those strings.
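As a rough illustration of the decoding described above, here is a minimal sketch. It is not this class's actual implementation; the names OctalEscapeSketch, unescapeOctal, and decode are hypothetical. It assumes that a backslash followed by one to three octal digits stands for a single byte, and that a leading FE FF pair marks the remaining bytes as UTF-16BE.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch, not the real decoder.
class OctalEscapeSketch {

    static boolean isOctal(char c) {
        return c >= '0' && c <= '7';
    }

    // Converts backslash-octal escapes (e.g. \376) to raw bytes;
    // every other character is written as a single byte.
    static byte[] unescapeOctal(String s) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int i = 0;
        while (i < s.length()) {
            char c = s.charAt(i);
            if (c == '\\' && i + 1 < s.length() && isOctal(s.charAt(i + 1))) {
                i++; // skip the backslash
                int value = 0;
                int digits = 0;
                // Assumption: an escape carries at most three octal digits.
                while (i < s.length() && digits < 3 && isOctal(s.charAt(i))) {
                    value = value * 8 + (s.charAt(i) - '0');
                    i++;
                    digits++;
                }
                out.write(value & 0xFF);
            } else {
                out.write(c & 0xFF);
                i++;
            }
        }
        return out.toByteArray();
    }

    // Decodes only when the unescaped bytes start with the UTF-16BE BOM;
    // otherwise the input is returned unchanged.
    static String decode(String s) {
        byte[] bytes = unescapeOctal(s);
        if (bytes.length >= 2
                && (bytes[0] & 0xFF) == 0xFE
                && (bytes[1] & 0xFF) == 0xFF) {
            // Skip the BOM and ignore a dangling odd byte at the end.
            int len = (bytes.length - 2) & ~1;
            return new String(bytes, 2, len, StandardCharsets.UTF_16BE);
        }
        return s;
    }

    public static void main(String[] args) {
        String raw =
            "\\376\\377\\000M\\000i\\000c\\000r\\000o\\000s\\000o\\000f\\000t\\000";
        System.out.println(decode(raw)); // prints "Microsoft"
    }
}
```

The trailing \000 in the example is the first half of a character that was cut off, which is why the sketch drops an unpaired final byte before decoding.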
See TIKA-1678. Many thanks to Andrew Jackson for raising this issue
and Tilman Hausherr for the solution.
As of this writing, we are only handling strings that start with
an encoded BOM. Andrew Jackson found a handful of other examples (e.g.
this ISO-8859-7 string:
"Microsoft Word - \\323\\365\\354\\354\\345\\364\\357\\367\\336
\\364\\347\\362 PRAKSIS \\363\\364\\357")
that we aren't currently handling.
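The BOM restriction described above can be pictured as a simple prefix check. This is only a sketch under the assumption that the encoded BOM appears as the literal text \376\377 at the start of the string; EncodedBomCheck and startsWithEncodedBom are hypothetical names, not part of this class.

```java
// Hypothetical sketch of the "starts with an encoded BOM" guard.
class EncodedBomCheck {

    // Octal-escaped form of the UTF-16BE BOM bytes 0xFE 0xFF.
    static final String ENCODED_BOM = "\\376\\377";

    static boolean startsWithEncodedBom(String s) {
        return s != null && s.startsWith(ENCODED_BOM);
    }

    public static void main(String[] args) {
        // BOM-prefixed string: would be decoded.
        System.out.println(startsWithEncodedBom("\\376\\377\\000M\\000i"));
        // Octal-escaped string without a BOM: would be left alone.
        System.out.println(startsWithEncodedBom("Microsoft Word - \\323\\365\\354"));
    }
}
```

Under such a check, the ISO-8859-7 example is left as-is because nothing in the string itself identifies its charset.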