/**
 * Looks up the location for each record key and returns <record_key, location> pairs for all
 * record keys already present in the table; record keys that are not present are dropped.
 */
private JavaPairRDD<String, String> lookupIndex(
    JavaPairRDD<String, String> partitionRecordKeyPairRDD, final JavaSparkContext jsc,
    final HoodieTable hoodieTable) {
  // Step 1: Obtain the number of incoming records per partition
  Map<String, Long> recordsPerPartition = partitionRecordKeyPairRDD.countByKey();
  List<String> affectedPartitionPathList = new ArrayList<>(recordsPerPartition.keySet());

  // Step 2: Load all involved files as <Partition, filename> pairs
  List<Tuple2<String, BloomIndexFileInfo>> fileInfoList =
      loadInvolvedFiles(affectedPartitionPathList, jsc, hoodieTable);
  final Map<String, List<BloomIndexFileInfo>> partitionToFileInfo =
      fileInfoList.stream().collect(groupingBy(Tuple2::_1, mapping(Tuple2::_2, toList())));

  // Step 3: Obtain an RDD that, for each incoming record already present in the table,
  // carries the id of the file containing it.
  int parallelism = autoComputeParallelism(recordsPerPartition, partitionToFileInfo,
      partitionRecordKeyPairRDD);
  return findMatchingFilesForRecordKeys(partitionToFileInfo, partitionRecordKeyPairRDD,
      parallelism, hoodieTable.getMetaClient());
}
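The partitionToFileInfo map above is built with a groupingBy/mapping/toList collector over the (partition, file) tuples returned by loadInvolvedFiles. Below is a minimal, self-contained sketch of that same collector pattern; it uses Map.Entry in place of Spark's Tuple2 so it runs without any Spark or Scala dependency, and the class name and sample values are hypothetical.

import static java.util.stream.Collectors.groupingBy;
import static java.util.stream.Collectors.mapping;
import static java.util.stream.Collectors.toList;

import java.util.AbstractMap.SimpleEntry;
import java.util.Arrays;
import java.util.List;
import java.util.Map;

public class GroupPairsSketch {
  public static void main(String[] args) {
    // Hypothetical (partition, fileId) pairs standing in for the Tuple2 list;
    // Map.Entry replaces Spark's scala.Tuple2 to keep the sketch dependency-free.
    List<Map.Entry<String, String>> pairs = Arrays.asList(
        new SimpleEntry<>("2016/01/21", "file-1"),
        new SimpleEntry<>("2016/01/21", "file-2"),
        new SimpleEntry<>("2016/04/01", "file-3"));

    // Same groupingBy/mapping/toList pipeline used to build partitionToFileInfo above
    Map<String, List<String>> byPartition = pairs.stream()
        .collect(groupingBy(Map.Entry::getKey, mapping(Map.Entry::getValue, toList())));

    // Iteration order is unspecified; contents are
    // {2016/01/21=[file-1, file-2], 2016/04/01=[file-3]}
    System.out.println(byPartition);
  }
}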
/**
 * Load all involved files as <Partition, filename> pairs from all partitions in the table.
 */
@Override
@VisibleForTesting
List<Tuple2<String, BloomIndexFileInfo>> loadInvolvedFiles(List<String> partitions,
    final JavaSparkContext jsc, final HoodieTable hoodieTable) {
  HoodieTableMetaClient metaClient = hoodieTable.getMetaClient();
  try {
    // Ignore the partitions passed in and recompute the full partition list,
    // so files from every partition in the table become lookup candidates.
    List<String> allPartitionPaths = FSUtils.getAllPartitionPaths(metaClient.getFs(),
        metaClient.getBasePath(), config.shouldAssumeDatePartitioning());
    return super.loadInvolvedFiles(allPartitionPaths, jsc, hoodieTable);
  } catch (IOException e) {
    throw new HoodieIOException("Failed to load all partitions", e);
  }
}
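The override discards the partitions argument deliberately: a global bloom index has to consider that a record key may live in any partition, so candidate files are loaded from the whole table. One consequence, sketched below under the same setup as the test that follows (the partition path literal is hypothetical), is that the argument has no effect on the result.

// Sketch: both calls scan every partition in the table, because the override
// recomputes the full partition list itself via FSUtils.getAllPartitionPaths.
List<Tuple2<String, BloomIndexFileInfo>> fromOne =
    index.loadInvolvedFiles(Collections.singletonList("2016/01/21"), jsc, table);
List<Tuple2<String, BloomIndexFileInfo>> fromNone =
    index.loadInvolvedFiles(Collections.emptyList(), jsc, table);
// fromOne and fromNone contain the same <partition, file> pairs.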
HoodieTableMetaClient metadata = new HoodieTableMetaClient(jsc.hadoopConfiguration(), basePath);
HoodieTable table = HoodieTable.getHoodieTable(metadata, config, jsc);

// The global index loads files from every partition, regardless of the
// 'partitions' argument; the fixture is expected to contain four files in total.
List<Tuple2<String, BloomIndexFileInfo>> filesList = index.loadInvolvedFiles(partitions, jsc, table);
assertEquals(4, filesList.size());
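To see which partitions contributed files, the returned tuples can be inspected directly. A small follow-on sketch (output depends on the test fixture, so it is not shown):

// Sketch: list the distinct partitions covered by the returned file infos,
// using the same Tuple2::_1 accessor as in lookupIndex.
filesList.stream()
    .map(Tuple2::_1)
    .distinct()
    .forEach(partition -> System.out.println("involved partition: " + partition));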