How to use CmsExtractorMsWord in org.opencms.search.extractors

Best Java code snippets using org.opencms.search.extractors.CmsExtractorMsWord (Showing top 3 results out of 315)

  /** 
   * @see org.opencms.search.extractors.I_CmsTextExtractor#extractText(java.io.InputStream, java.lang.String)
   */
  @Override
  public I_CmsExtractionResult extractText(InputStream in, String encoding) throws Exception {

    String rawContent = "";
    try {
      // first extract the text using the text abstraction library
      WordExtractor wordExtractor = new WordExtractor();
      rawContent = wordExtractor.extractText(getStreamCopy(in));
      rawContent = removeControlChars(rawContent);

      // now extract the meta information using POI 
      POIFSReader reader = new POIFSReader();
      reader.registerListener(this);
      reader.read(getStreamCopy(in));
    } catch (Exception e) {
      if (LOG.isErrorEnabled()) {
        LOG.error(Messages.get().container(Messages.LOG_EXTRACT_TEXT_ERROR_0), e);
      }
    }
    // combine the meta information with the content and create the result
    return createExtractionResult(rawContent);
  }
}
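
The snippet above passes getStreamCopy(in) to two different readers, which only works if the helper can hand out the same bytes more than once. The following is a minimal sketch of one way such a helper could behave, using only java.io. The class StreamCopySketch, its cached field, and the toBytes and readAll helpers are hypothetical names for illustration and are not taken from the OpenCms source.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

/** Hypothetical illustration; not the OpenCms implementation. */
final class StreamCopySketch {

    /** Bytes read from the source stream on the first call (assumed caching strategy). */
    private byte[] cached;

    /** Returns a fresh stream over the cached bytes; the source is consumed only once. */
    InputStream getStreamCopy(InputStream in) throws IOException {
        if (cached == null) {
            cached = toBytes(in);
        }
        return new ByteArrayInputStream(cached);
    }

    /** Drains a stream into a byte array. */
    static byte[] toBytes(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[4096];
        int n;
        while ((n = in.read(buffer)) != -1) {
            out.write(buffer, 0, n);
        }
        return out.toByteArray();
    }

    /** Demo helper: reads a stream fully as UTF-8 text. */
    static String readAll(InputStream in) throws IOException {
        return new String(toBytes(in), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        StreamCopySketch sketch = new StreamCopySketch();
        InputStream source =
            new ByteArrayInputStream("word bytes".getBytes(StandardCharsets.UTF_8));
        // Both calls see the full content, although 'source' is exhausted after the first.
        System.out.println(readAll(sketch.getStreamCopy(source)));
        System.out.println(readAll(sketch.getStreamCopy(source)));
    }
}
```

Buffering the whole document in memory keeps the sketch simple; for very large files a temporary file might be a better fit.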

/**
 * Returns the raw text content of a given vfs resource containing MS Word data.<p>
 * 
 * @see org.opencms.search.documents.I_CmsSearchExtractor#extractContent(CmsObject, CmsResource, CmsSearchIndex)
 */
public I_CmsExtractionResult extractContent(CmsObject cms, CmsResource resource, CmsSearchIndex index)
throws CmsIndexException, CmsException {
  CmsFile file = readFile(cms, resource);
  try {
    return CmsExtractorMsWord.getExtractor().extractText(file.getContents());
  } catch (Exception e) {
    throw new CmsIndexException(
      Messages.get().container(Messages.ERR_TEXT_EXTRACTION_1, resource.getRootPath()),
      e);
  }
}

  textExtractor = CmsExtractorPdf.getExtractor();
} else if (path1.endsWith(".doc") && path2.endsWith(".doc")) {
  textExtractor = CmsExtractorMsWord.getExtractor();
} else if (path1.endsWith(".xls") && path2.endsWith(".xls")) {
  textExtractor = CmsExtractorMsExcel.getExtractor();

Javadoc

Extracts the text from an MS Word document.

Most used methods

createExtractionResult
getExtractor
Returns an instance of this text extractor.
getStreamCopy
removeControlChars
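
As an illustration of what a cleanup step such as removeControlChars might do to raw extracted text, here is a minimal sketch. It assumes the goal is to replace control characters and runs of whitespace with a single space; the actual OpenCms behavior may differ, and the class name ControlCharSketch is hypothetical.

```java
/** Hypothetical illustration; the real removeControlChars may behave differently. */
final class ControlCharSketch {

    /**
     * Replaces each run of control or whitespace characters with one space
     * and drops leading and trailing blanks (assumed behavior).
     */
    static String removeControlChars(String content) {
        if (content == null || content.isEmpty()) {
            return content;
        }
        StringBuilder result = new StringBuilder(content.length());
        boolean lastWasSpace = true; // true at the start, so leading blanks are skipped
        for (int i = 0; i < content.length(); i++) {
            char c = content.charAt(i);
            boolean blank = Character.isISOControl(c) || Character.isWhitespace(c);
            if (blank) {
                if (!lastWasSpace) {
                    result.append(' ');
                    lastWasSpace = true;
                }
            } else {
                result.append(c);
                lastWasSpace = false;
            }
        }
        // Remove the single trailing space left by a blank run at the end.
        int len = result.length();
        if (len > 0 && result.charAt(len - 1) == ' ') {
            result.setLength(len - 1);
        }
        return result.toString();
    }

    public static void main(String[] args) {
        // A bell character, a line break, and a tab stand in for Word control codes.
        System.out.println(removeControlChars("Hello\u0007\r\n  World\t"));
    }
}
```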

