org.apache.tika.parser.pdf.PDFParserConfig.setExtractInlineImages Java code examples

@Field
void setExtractInlineImages(boolean extractInlineImages) {
  defaultConfig.setExtractInlineImages(extractInlineImages);
}

private void extractInlineImagesFromPDFs() {
  if (configFilePath == null && context.get(PDFParserConfig.class) == null) {
    PDFParserConfig pdfParserConfig = new PDFParserConfig();
    pdfParserConfig.setExtractInlineImages(true);
    String warn = "As a convenience, TikaCLI has turned on extraction of\n" +
        "inline images for the PDFParser (TIKA-2374).\n" +
        "Aside from the -z option, this is not the default behavior\n" +
        "in Tika generally or in tika-server.";
    LOG.info(warn);
    context.set(PDFParserConfig.class, pdfParserConfig);
  }
}

setExtractBookmarksText(
    getBooleanProp(props.getProperty("extractBookmarksText"),
        getExtractBookmarksText()));
setExtractInlineImages(
    getBooleanProp(props.getProperty("extractInlineImages"),
        getExtractInlineImages()));
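
The fragment above reads each setting from a properties object and falls back to the current value when the key is missing. A minimal sketch of that pattern follows, using only the JDK. The class name BooleanPropSketch and the body of getBooleanProp are assumptions made for illustration; the snippet above does not show how Tika implements the helper, and the real one may differ.

```java
import java.util.Properties;

// Hypothetical helper class for illustration; not part of Tika.
class BooleanPropSketch {

    // Returns the parsed value of raw, or defaultValue when the property is absent.
    static boolean getBooleanProp(String raw, boolean defaultValue) {
        if (raw == null) {
            return defaultValue;
        }
        // parseBoolean is case-insensitive and yields false for anything but "true".
        return Boolean.parseBoolean(raw.trim());
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        props.setProperty("extractInlineImages", "true");

        // Stands in for the current getter value, e.g. getExtractInlineImages().
        boolean current = false;
        boolean inline = getBooleanProp(props.getProperty("extractInlineImages"), current);
        // This key is not set, so the supplied default is kept.
        boolean bookmarks = getBooleanProp(props.getProperty("extractBookmarksText"), true);

        System.out.println("extractInlineImages=" + inline);
        System.out.println("extractBookmarksText=" + bookmarks);
    }
}
```

Passing the getter's current value as the default means a properties file only needs to list the settings it wants to change.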

/**
 * Disable OCR. This method only has an effect if Tesseract is installed.
 */
public void disableOcr() {
  if (!ocrDisabled) {
    excludeParser(TesseractOCRParser.class);
    ocrDisabled = true;
    pdfConfig.setExtractInlineImages(false);
  }
}

  pdfParserConfig.setExtractInlineImages((Boolean) extractInlineImages);
} else {
  pdfParserConfig.setExtractInlineImages(true);

pdfConfig.setExtractInlineImages(true);

setExtractAcroFormContent(
    getBooleanProp(props.getProperty("extractAcroFormContent"),
        getExtractAcroFormContent()));
setExtractInlineImages(
    getBooleanProp(props.getProperty("extractInlineImages"),
        getExtractInlineImages()));

/**
 * Create a new extractor that, by default, OCRs images if Tesseract is available locally, extracts inline
 * images from PDF files and OCRs them, and uses PDFBox's non-sequential PDF parser.
 */
public Extractor() {
  // Calculate the SHA256 digest by default.
  setDigestAlgorithms(DigestAlgorithm.SHA256);
  // Run OCR on images contained within PDFs and not on pages.
  pdfConfig.setExtractInlineImages(true);
  pdfConfig.setOcrStrategy(PDFParserConfig.OCR_STRATEGY.NO_OCR);
  // By default, only the object IDs are used for determining uniqueness.
  // In scanned documents under test from the Panama registry, different embedded images
  // had the same ID, leading to incomplete OCRing when uniqueness detection was turned on.
  pdfConfig.setExtractUniqueInlineImagesOnly(false);
  // Set a long OCR timeout by default, because Tika's is too short.
  setOcrTimeout(Duration.ofDays(1));
  ocrConfig.setEnableImageProcessing(0); // See TIKA-2167. Image processing causes OCR to fail.
  // English text recognition by default.
  ocrConfig.setLanguage("eng");
}
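
The comments in the constructor above describe why uniqueness detection is switched off: when only object IDs decide uniqueness, two different images that share an ID are treated as one, and the second is never OCRed. The sketch below models that effect with plain collections. It is an illustration under stated assumptions, not Tika or PDFBox code; the class UniqueImageSketch, the method select, and the sample IDs and labels are all hypothetical.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical model of ID-based uniqueness filtering; not Tika or PDFBox code.
class UniqueImageSketch {

    // Returns the labels of the images that would be processed.
    // Each image is a two-element array: {objectId, label}.
    static List<String> select(List<String[]> images, boolean uniqueOnly) {
        Set<String> seenIds = new HashSet<>();
        List<String> selected = new ArrayList<>();
        for (String[] image : images) {
            // Set.add returns false when the ID has been seen before.
            boolean firstTime = seenIds.add(image[0]);
            if (!uniqueOnly || firstTime) {
                selected.add(image[1]);
            }
        }
        return selected;
    }

    public static void main(String[] args) {
        List<String[]> images = new ArrayList<>();
        images.add(new String[] {"7", "scan-page-1"});
        images.add(new String[] {"7", "scan-page-2"}); // different image, same ID
        images.add(new String[] {"9", "logo"});

        // With uniqueness on, scan-page-2 is skipped because ID 7 was already seen.
        System.out.println(select(images, true));
        // With uniqueness off, every image is processed.
        System.out.println(select(images, false));
    }
}
```

The trade-off is that turning uniqueness off can process the same image many times when a document reuses it legitimately on every page.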

Parser parser = new AutoDetectParser();
BodyContentHandler handler = new BodyContentHandler(Integer.MAX_VALUE);

TesseractOCRConfig config = new TesseractOCRConfig();
PDFParserConfig pdfConfig = new PDFParserConfig();
pdfConfig.setExtractInlineImages(true);

ParseContext parseContext = new ParseContext();
parseContext.set(TesseractOCRConfig.class, config);
parseContext.set(PDFParserConfig.class, pdfConfig);
// Register the parser in the context so that embedded content (the inline images) is parsed recursively.
parseContext.set(Parser.class, parser);

parser.parse(stream, handler, new Metadata(), parseContext);

Parser parser = new AutoDetectParser();
BodyContentHandler handler = new BodyContentHandler(Integer.MAX_VALUE);
TesseractOCRConfig config = new TesseractOCRConfig();
config.setTesseractPath(tPath);
PDFParserConfig pdfConfig = new PDFParserConfig();
pdfConfig.setExtractInlineImages(true);
// false: extract every occurrence of an inline image, not only the first one seen per object ID.
pdfConfig.setExtractUniqueInlineImagesOnly(false);
ParseContext parseContext = new ParseContext();
parseContext.set(TesseractOCRConfig.class, config);
parseContext.set(PDFParserConfig.class, pdfConfig);
// Register the parser in the context so that embedded content (the inline images) is parsed recursively.
parseContext.set(Parser.class, parser);

      || contentType.matches(ocrConfig.getContentTypes()))) {
  context.set(TesseractOCRConfig.class, ocrTesseractConfig);
  pdfConfig.setExtractInlineImages(true);
} else {
  pdfConfig.setOcrStrategy(PDFParserConfig.OCR_STRATEGY.NO_OCR);

Javadoc

If true, extract inline embedded OBXImages. Beware: some PDF documents of modest size (~4MB) can contain thousands of embedded images totaling > 2.5 GB. Also, at least as of PDFBox 1.8.5, there can be surprisingly large memory consumption and/or out of memory errors. Set to true with caution.

The default is false.

Popular methods of PDFParserConfig

<init>: Loads properties from InputStream and then tries to close InputStream. If there is an IOException, t
setExtractUniqueInlineImagesOnly: Multiple pages within a PDF file might refer to the same underlying image. If #extractUniqueInlineIm
setOcrStrategy: Which strategy to use for OCR
setSuppressDuplicateOverlappingText: If true, the parser should try to remove duplicated text over the same region. This is needed for so
configure: Configures the given pdf2XHTML.
setEnableAutoSpace: If true (the default), the parser should estimate where spaces should be inserted between words. For
setExtractAcroFormContent: If true (the default), extract content from AcroForms at the end of the document. If an XFA is found
setExtractAnnotationText: If true (the default), text in annotations will be extracted.
setSortByPosition: If true, sort text tokens by their x/y position before extracting text. This may be necessary for so
getAccessChecker
getAverageCharTolerance
getBooleanProp

