How to use BlockInvertedIndexBuilder in org.terrier.structures.indexing.classical

Best Java code snippets using org.terrier.structures.indexing.classical.BlockInvertedIndexBuilder (Showing top 2 results out of 315)

invertedIndexBuilder = new BlockInvertedIndexBuilder(currentIndex, "inverted", compressionInvertedConfig);
invertedIndexBuilder.createInvertedIndex();
this.finishedInvertedIndexBuild();

    results = scanLexiconForPointers(
        numberOfPointersPerIteration, lexiconStream,
        codesHashMap, tmpStorageStorage);
    results = scanLexiconForTerms(processTerms, lexiconStream,
        codesHashMap, tmpStorage);
  traverseDirectFile(codesHashMap, tmpStorage);
  logger.info("time to traverse direct file: " + ((System.currentTimeMillis() - startTraversingDirectFile) / 1000D));
  numberOfTokens += writeInvertedFilePart(dos, tmpStorage,
      processTerms);
  logger.info("time to write inverted file: " + ((System.currentTimeMillis() - startWritingInvertedFile) / 1000D));
LexiconOutputStream<String> los = getLexOutputStream("tmplexicon");

Javadoc

Builds an inverted index saving term-block information. It optionally saves term-field information as well.

Algorithm:

While there are terms left:
  1. Read M term ids from the lexicon, in lexicographical order
  2. Read the occurrences of these M terms into memory from the direct file
  3. Write the occurrences of these M terms to the inverted file
Rewrite the lexicon, removing block frequencies and adding inverted file offsets
Write the collection statistics
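
The loop above can be sketched on a toy in-memory structure. This is a minimal illustration, not Terrier's implementation: the class BatchInversionSketch and its invert method are hypothetical, the direct index is modelled as an int[][] (docid -> term ids), term ids are assumed to be 0..numTerms-1 in lexicon order, and block (position) information, the lexicon rewrite and the collection statistics are left out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch class; not part of Terrier.
class BatchInversionSketch {

    /**
     * Inverts a toy direct index (docid -> term ids) in batches of at most
     * termsPerIteration terms, returning termid -> docids in ascending order.
     */
    static Map<Integer, List<Integer>> invert(int[][] directIndex, int numTerms, int termsPerIteration) {
        Map<Integer, List<Integer>> inverted = new TreeMap<>();
        for (int start = 0; start < numTerms; start += termsPerIteration) {
            int end = Math.min(start + termsPerIteration, numTerms);

            // Step 1: select the next batch of term ids [start, end).
            Map<Integer, List<Integer>> batch = new TreeMap<>();
            for (int termid = start; termid < end; termid++) {
                batch.put(termid, new ArrayList<>());
            }

            // Step 2: one full pass over the direct index per batch,
            // keeping only occurrences of terms in the current batch.
            for (int docid = 0; docid < directIndex.length; docid++) {
                for (int termid : directIndex[docid]) {
                    List<Integer> postings = batch.get(termid);
                    if (postings != null) {
                        postings.add(docid);
                    }
                }
            }

            // Step 3: "write" the batch (here: append to the in-memory result).
            inverted.putAll(batch);
        }
        return inverted;
    }

    public static void main(String[] args) {
        int[][] direct = { {0, 2}, {1, 2}, {0, 1, 3} };
        // Two terms per iteration, so the direct index is scanned twice.
        System.out.println(invert(direct, 4, 2)); // {0=[0, 2], 1=[1, 2], 2=[0, 1], 3=[2]}
    }
}
```

The result does not depend on the batch size; a smaller batch only increases the number of passes over the direct index.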

Lexicon term selection: There are two strategies for selecting the number of terms to read from the lexicon in each iteration. The trade-off is that the number of terms must be small enough for all of their occurrences from the direct file to fit in memory, while reading fewer terms per iteration means more iterations, which is I/O expensive because the entire direct file has to be read on every iteration.
The two strategies are:

1. Read a fixed number of terms on each iteration. This corresponds to the property invertedfile.processterms.
2. Read a fixed number of occurrences (pointers) on each iteration. The number of pointers can be determined from the sum of the frequencies of each term in the lexicon. This corresponds to the property invertedfile.processpointers.

By default, the second strategy is chosen, unless invertedfile.processpointers is set to zero.
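
As an illustration of the two strategies, the sketch below computes how many lexicon entries one iteration would take. The class TermSelectionSketch and both method names are hypothetical, and the lexicon is reduced to an array of term frequencies in lexicon order. Whether Terrier stops before or after crossing the pointer budget is not stated here; this sketch assumes it stops before, but always takes at least one term so that the loop makes progress.

```java
// Hypothetical sketch class; not part of Terrier.
class TermSelectionSketch {

    /** Strategy 1: a fixed number of terms per iteration. */
    static int countTermsFixed(int[] termFrequencies, int from, int termsPerIteration) {
        return Math.max(0, Math.min(termsPerIteration, termFrequencies.length - from));
    }

    /**
     * Strategy 2: take terms starting at index from until the sum of their
     * frequencies (pointers) would exceed pointersPerIteration.
     */
    static int countTermsForPointers(int[] termFrequencies, int from, long pointersPerIteration) {
        long pointers = 0;
        int count = 0;
        for (int i = from; i < termFrequencies.length; i++) {
            // Assumption: stop before exceeding the budget, except for the first term.
            if (count > 0 && pointers + termFrequencies[i] > pointersPerIteration) {
                break;
            }
            pointers += termFrequencies[i];
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        int[] freqs = { 5, 3, 4, 10, 1 };
        System.out.println(countTermsForPointers(freqs, 0, 8)); // 2
        System.out.println(countTermsFixed(freqs, 3, 4));       // 2
    }
}
```

With a pointer budget, frequent terms produce small batches and rare terms produce large ones, which keeps the memory needed per iteration roughly constant.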

Properties:

invertedfile.processterms - the number of terms to process in each iteration. Defaults to 25,000.
invertedfile.processpointers - the number of pointers to process in each iteration. Defaults to 2,000,000. A value of 0 specifies that invertedfile.processterms terms should be read from the lexicon on each iteration, regardless of the number of pointers.
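
The way the two properties interact can be sketched as follows. This is an assumption-laden illustration: Terrier reads these values through its own configuration mechanism, whereas the hypothetical class InvertedFilePropertiesSketch below uses java.util.Properties as a stand-in. Only the property names, the defaults and the zero-value rule come from the Javadoc above.

```java
import java.util.Properties;

// Hypothetical sketch class; not part of Terrier.
class InvertedFilePropertiesSketch {
    static final String PROCESS_TERMS = "invertedfile.processterms";
    static final String PROCESS_POINTERS = "invertedfile.processpointers";

    /** Returns a short description of the strategy the given properties select. */
    static String describeStrategy(Properties props) {
        // Defaults as documented: 25,000 terms and 2,000,000 pointers.
        int terms = Integer.parseInt(props.getProperty(PROCESS_TERMS, "25000"));
        long pointers = Long.parseLong(props.getProperty(PROCESS_POINTERS, "2000000"));
        // A zero pointer budget falls back to a fixed number of terms.
        if (pointers == 0) {
            return "terms:" + terms;
        }
        return "pointers:" + pointers;
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        System.out.println(describeStrategy(props)); // pointers:2000000
        props.setProperty(PROCESS_POINTERS, "0");
        System.out.println(describeStrategy(props)); // terms:25000
    }
}
```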

Most used methods

<init>
constructor
getLexOutputStream
scanLexiconForPointers
scanLexiconForTerms
traverseDirectFile
Traverses the direct file, recording all occurrences of the terms noted in codesHashMap into tmpStorage.
writeInvertedFilePart
Writes the section of the inverted file
