How to use setOcrStrategy method in org.apache.tika.parser.pdf.PDFParserConfig

Best Java code snippets using org.apache.tika.parser.pdf.PDFParserConfig.setOcrStrategy (Showing top 8 results out of 315)

@Field
public void setOcrStrategy(String ocrStrategyString) {
  defaultConfig.setOcrStrategy(ocrStrategyString);
}

    isCatchIntermediateIOExceptions()));
setOcrStrategy(OCR_STRATEGY.parse(props.getProperty("ocrStrategy")));
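The fragment above reads the strategy name from a properties object and turns it into an enum value before passing it to setOcrStrategy. A minimal self-contained sketch of that pattern follows. The enum `OcrStrategy` and the class `OcrStrategyFromProperties` are hypothetical stand-ins, not Tika classes, and the parsing rules (a missing key falls back to NO_OCR, names match case-insensitively) are assumptions for illustration rather than Tika's actual implementation.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Locale;
import java.util.Properties;

class OcrStrategyFromProperties {

    /** Hypothetical stand-in for PDFParserConfig.OCR_STRATEGY; not Tika's enum. */
    enum OcrStrategy {
        NO_OCR, OCR_ONLY, OCR_AND_TEXT_EXTRACTION;

        /** Assumed rule: a missing value means NO_OCR; names match case-insensitively. */
        static OcrStrategy parse(String name) {
            if (name == null) {
                return NO_OCR;
            }
            // valueOf throws IllegalArgumentException for an unrecognised name.
            return valueOf(name.trim().toUpperCase(Locale.ROOT));
        }
    }

    /** Loads properties from text and parses the "ocrStrategy" key. */
    static OcrStrategy load(String propertiesText) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(propertiesText));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return OcrStrategy.parse(props.getProperty("ocrStrategy"));
    }

    public static void main(String[] args) {
        System.out.println(load("ocrStrategy=ocr_only"));   // prints OCR_ONLY
        System.out.println(load("sortByPosition=true"));    // prints NO_OCR (key absent)
    }
}
```

With the real class, the parsed value would be handed to PDFParserConfig.setOcrStrategy instead of being printed.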

/**
 * Create a new extractor, which will OCR images by default if Tesseract is available locally, extract inline
 * images from PDF files and OCR them and use PDFBox's non-sequential PDF parser.
 */
public Extractor() {
  // Calculate the SHA256 digest by default.
  setDigestAlgorithms(DigestAlgorithm.SHA256);
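  // OCR the images embedded in PDFs rather than rendered pages: inline images are
  // extracted (and handed on for OCR), while NO_OCR turns off OCR of whole pages.
  pdfConfig.setExtractInlineImages(true);
  pdfConfig.setOcrStrategy(PDFParserConfig.OCR_STRATEGY.NO_OCR);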
  // By default, only the object IDs are used for determining uniqueness.
  // In scanned documents under test from the Panama registry, different embedded images had the same ID, leading to incomplete OCRing when uniqueness detection was turned on.
  pdfConfig.setExtractUniqueInlineImagesOnly(false);
  // Set a long OCR timeout by default, because Tika's is too short.
  setOcrTimeout(Duration.ofDays(1));
  ocrConfig.setEnableImageProcessing(0); // See TIKA-2167. Image processing causes OCR to fail.
  // English text recognition by default.
  ocrConfig.setLanguage("eng");
}

  pdfConfig.setExtractInlineImages(true);
} else {
  pdfConfig.setOcrStrategy(PDFParserConfig.OCR_STRATEGY.NO_OCR);

Javadoc

Which strategy to use for OCR
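As a rough end-to-end illustration, the sketch below sets the strategy on a PDFParserConfig and passes that config to the parser through a ParseContext. It assumes the Tika parsers are on the classpath. The class name `PdfOcrStrategyExample` and its helper methods are hypothetical, and whether OCR actually runs depends on the Tika version and on Tesseract being installed locally.

```java
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.parser.pdf.PDFParserConfig;
import org.apache.tika.sax.BodyContentHandler;

class PdfOcrStrategyExample {

    /** Builds a PDF config that asks for OCR of rendered pages only. */
    static PDFParserConfig ocrOnlyConfig() {
        PDFParserConfig pdfConfig = new PDFParserConfig();
        pdfConfig.setOcrStrategy(PDFParserConfig.OCR_STRATEGY.OCR_ONLY);
        return pdfConfig;
    }

    /** Parses one file and returns the extracted text. */
    static String extractText(Path pdf) throws Exception {
        Parser parser = new AutoDetectParser();
        ParseContext context = new ParseContext();
        // The PDF parser looks up its per-parse settings in the context.
        context.set(PDFParserConfig.class, ocrOnlyConfig());
        // Registering the parser lets embedded content be parsed recursively.
        context.set(Parser.class, parser);

        BodyContentHandler handler = new BodyContentHandler(-1); // -1: no write limit
        try (InputStream in = Files.newInputStream(pdf)) {
            parser.parse(in, handler, new Metadata(), context);
        }
        return handler.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(extractText(Paths.get(args[0])));
    }
}
```

Swapping OCR_ONLY for NO_OCR or OCR_AND_TEXT_EXTRACTION changes only the one setOcrStrategy call; the rest of the wiring stays the same.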

Popular methods of PDFParserConfig

setExtractInlineImages
If true, extract inline embedded OBXImages. Beware: some PDF documents of modest size (~4MB) can cont
<init>
Loads properties from InputStream and then tries to close InputStream. If there is an IOException, t
setExtractUniqueInlineImagesOnly
Multiple pages within a PDF file might refer to the same underlying image. If #extractUniqueInlineIm
setSuppressDuplicateOverlappingText
If true, the parser should try to remove duplicated text over the same region. This is needed for so
configure
Configures the given pdf2XHTML.
setEnableAutoSpace
If true (the default), the parser should estimate where spaces should be inserted between words. For
setExtractAcroFormContent
If true (the default), extract content from AcroForms at the end of the document. If an XFA is found
setExtractAnnotationText
If true (the default), text in annotations will be extracted.
setSortByPosition
If true, sort text tokens by their x/y position before extracting text. This may be necessary for so
getAccessChecker
getAverageCharTolerance
getBooleanProp
