@Override
protected void setup(Context context) throws IOException, InterruptedException {
  super.setup(context);
  Configuration conf = context.getConfiguration();
  // read the TF-IDF job parameters from the configuration
  vectorCount = conf.getLong(TFIDFConverter.VECTOR_COUNT, 1);
  featureCount = conf.getLong(TFIDFConverter.FEATURE_COUNT, 1);
  minDf = conf.getInt(TFIDFConverter.MIN_DF, 1);
  maxDf = conf.getLong(TFIDFConverter.MAX_DF, -1);
  sequentialAccess = conf.getBoolean(PartialVectorMerger.SEQUENTIAL_ACCESS, false);
  namedVector = conf.getBoolean(PartialVectorMerger.NAMED_VECTOR, false);

  // locate the document-frequency file shipped via the DistributedCache
  URI[] localFiles = DistributedCache.getCacheFiles(conf);
  Path dictionaryFile = HadoopUtil.findInCacheByPartOfFilename(TFIDFConverter.FREQUENCY_FILE, localFiles);

  // key is feature, value is the document frequency
  for (Pair<IntWritable,LongWritable> record
       : new SequenceFileIterable<IntWritable,LongWritable>(dictionaryFile, true, conf)) {
    dictionary.put(record.getFirst().get(), record.getSecond().get());
  }
}
@Override
protected void setup(Context context) throws IOException, InterruptedException {
  super.setup(context);
  Configuration conf = context.getConfiguration();
  // read the vectorizer parameters from the configuration
  dimension = conf.getInt(PartialVectorMerger.DIMENSION, Integer.MAX_VALUE);
  sequentialAccess = conf.getBoolean(PartialVectorMerger.SEQUENTIAL_ACCESS, false);
  namedVector = conf.getBoolean(PartialVectorMerger.NAMED_VECTOR, false);
  maxNGramSize = conf.getInt(DictionaryVectorizer.MAX_NGRAMS, maxNGramSize);

  // locate the dictionary file shipped via the DistributedCache
  URI[] localFiles = DistributedCache.getCacheFiles(conf);
  Path dictionaryFile = HadoopUtil.findInCacheByPartOfFilename(DictionaryVectorizer.DICTIONARY_FILE, localFiles);

  // key is word, value is its id
  for (Pair<Writable, IntWritable> record
       : new SequenceFileIterable<Writable, IntWritable>(dictionaryFile, true, conf)) {
    dictionary.put(record.getFirst().toString(), record.getSecond().get());
  }
}
@Test
public void nonExistingFile() {
  Path path = HadoopUtil.findInCacheByPartOfFilename("no such file", DISTRIBUTED_CACHE_FILES);
  assertNull(path);
}
@Test
public void existingFile() {
  Path path = HadoopUtil.findInCacheByPartOfFilename("want_to_find", DISTRIBUTED_CACHE_FILES);
  assertNotNull(path);
  assertEquals(FILE_I_WANT_TO_FIND.getName(), path.getName());
}
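The two tests above pin down the contract of HadoopUtil.findInCacheByPartOfFilename: given a filename fragment and the array of DistributedCache URIs, it returns a Path whose file name contains the fragment, or null when nothing matches. The following is a minimal sketch of a helper with that behavior; the loop-and-contains matching, the first-match-wins rule, and the null return for a null URI array are assumptions for illustration, not necessarily the actual Mahout implementation.

import java.net.URI;
import org.apache.hadoop.fs.Path;

// Hypothetical sketch of a cache-lookup helper with the behavior the tests expect:
// scan the cached URIs and return the first Path whose file name contains the fragment.
public final class CacheLookupSketch {

  private CacheLookupSketch() { }

  public static Path findInCacheByPartOfFilename(String partOfFilename, URI[] cacheFiles) {
    if (cacheFiles == null) {
      return null;                      // nothing was shipped via the DistributedCache
    }
    for (URI cacheFile : cacheFiles) {
      Path path = new Path(cacheFile.getPath());
      if (path.getName().contains(partOfFilename)) {
        return path;                    // first match wins
      }
    }
    return null;                        // no cached file matched the fragment
  }
}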