@SuppressWarnings("unchecked") public static JavaRDD<HoodieRecord> dropDuplicates(JavaSparkContext jssc, JavaRDD<HoodieRecord> incomingHoodieRecords, Map<String, String> parameters) throws Exception { HoodieWriteConfig writeConfig = HoodieWriteConfig .newBuilder() .withPath(parameters.get("path")) .withProps(parameters).build(); return dropDuplicates(jssc, incomingHoodieRecords, writeConfig); } }
public Path makeTempPath(String partitionPath, int taskPartitionId, String fileName, int stageId,
    long taskAttemptId) {
  // Temp data files are written under the table's temp folder beneath the base path.
  Path path = new Path(config.getBasePath(), HoodieTableMetaClient.TEMPFOLDER_NAME);
  return new Path(path.toString(), FSUtils.makeTempDataFileName(partitionPath, commitTime,
      taskPartitionId, fileName, stageId, taskAttemptId));
}
@Before
public void start() throws ConfigurationException {
  // Use an in-memory reporter so the test needs no external metrics backend.
  HoodieWriteConfig config = mock(HoodieWriteConfig.class);
  when(config.isMetricsOn()).thenReturn(true);
  when(config.getMetricsReporterType()).thenReturn(MetricsReporterType.INMEMORY);
  metrics = new HoodieMetrics(config, "raw_table");
}
@VisibleForTesting
HoodieWriteClient(JavaSparkContext jsc, HoodieWriteConfig clientConfig, boolean rollbackInFlight,
    HoodieIndex index) {
  this.fs = FSUtils.getFs(clientConfig.getBasePath(), jsc.hadoopConfiguration());
  this.jsc = jsc;
  this.config = clientConfig;
  this.index = index;
  this.metrics = new HoodieMetrics(config, config.getTableName());
  this.rollbackInFlight = rollbackInFlight;
}
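This test-only constructor exists so a custom index can be injected. A minimal sketch, assuming Mockito is available (mirroring the mocking style of the metrics test above) and that `jsc` and `clientConfig` are already set up:

// Sketch: inject a mocked index through the @VisibleForTesting constructor.
// Assumes an existing JavaSparkContext (jsc) and HoodieWriteConfig (clientConfig).
HoodieIndex index = mock(HoodieIndex.class);
HoodieWriteClient client = new HoodieWriteClient(jsc, clientConfig, /* rollbackInFlight */ false, index);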
private JavaRDD<WriteStatus> updateIndexAndCommitIfNeeded(JavaRDD<WriteStatus> writeStatusRDD,
    HoodieTable<T> table, String commitTime) {
  // Update the index back
  JavaRDD<WriteStatus> statuses = index.updateLocation(writeStatusRDD, jsc, table);
  // Trigger the insert and collect statuses
  statuses = statuses.persist(config.getWriteStatusStorageLevel());
  commitOnAutoCommit(commitTime, statuses,
      new HoodieTableMetaClient(jsc.hadoopConfiguration(), config.getBasePath(), true)
          .getCommitActionType());
  return statuses;
}
@Test
public void testPropertyLoading() throws IOException {
  Builder builder = HoodieWriteConfig.newBuilder().withPath("/tmp");
  Map<String, String> params = Maps.newHashMap();
  params.put(HoodieCompactionConfig.MAX_COMMITS_TO_KEEP, "5");
  params.put(HoodieCompactionConfig.MIN_COMMITS_TO_KEEP, "2");
  ByteArrayOutputStream outStream = saveParamsIntoOutputStream(params);
  ByteArrayInputStream inputStream = new ByteArrayInputStream(outStream.toByteArray());
  try {
    builder = builder.fromInputStream(inputStream);
  } finally {
    outStream.close();
    inputStream.close();
  }
  HoodieWriteConfig config = builder.build();
  assertEquals(5, config.getMaxCommitsToKeep());
  assertEquals(2, config.getMinCommitsToKeep());
}
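The `saveParamsIntoOutputStream` helper is referenced but not shown. A plausible sketch (hypothetical, not necessarily the project's actual helper) that serializes the map in java.util.Properties format, which is what `fromInputStream` reads back:

// Hypothetical helper: writes the params map out in java.util.Properties format.
private ByteArrayOutputStream saveParamsIntoOutputStream(Map<String, String> params) throws IOException {
  Properties properties = new Properties();
  properties.putAll(params);
  ByteArrayOutputStream outStream = new ByteArrayOutputStream();
  properties.store(outStream, "Test properties");
  return outStream;
}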
@Test
public void testArchiveEmptyDataset() throws IOException {
  HoodieWriteConfig cfg = HoodieWriteConfig.newBuilder().withPath(basePath)
      .withSchema(HoodieTestDataGenerator.TRIP_EXAMPLE_SCHEMA).withParallelism(2, 2)
      .forTable("test-trip-table").build();
  HoodieCommitArchiveLog archiveLog = new HoodieCommitArchiveLog(cfg,
      new HoodieTableMetaClient(dfs.getConf(), cfg.getBasePath(), true));
  // Archiving an empty dataset should be a no-op that still reports success.
  boolean result = archiveLog.archiveIfRequired(jsc);
  assertTrue(result);
}
public Timer.Context getCompactionCtx() {
  // Lazily create a dedicated compaction timer. Assumes a compactionTimerName field
  // defined alongside deltaCommitTimerName; passing commitTimerName here would
  // mislabel compaction timings as commit timings.
  if (config.isMetricsOn() && compactionTimer == null) {
    compactionTimer = createTimer(compactionTimerName);
  }
  return compactionTimer == null ? null : compactionTimer.time();
}
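A minimal sketch of how such a timer context is typically consumed (standard Dropwizard Metrics semantics; `runCompaction` is a placeholder). The same pattern applies to getDeltaCommitCtx below.

// Time a compaction run; ctx may be null when metrics are disabled.
Timer.Context ctx = metrics.getCompactionCtx();
try {
  runCompaction(); // placeholder for the actual compaction work
} finally {
  if (ctx != null) {
    ctx.stop(); // records the elapsed time into the underlying timer
  }
}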
// Disjoint excerpts; the intervening code was lost in extraction. The first call's
// head is reconstructed from the identical pattern in the zero-partition test below.
HoodieTable table = HoodieTable.getHoodieTable(
    new HoodieTableMetaClient(jsc.hadoopConfiguration(), config.getBasePath(), true), config, jsc);
Optional<HoodieInstant> cleanInstant = table.getCompletedCleanTimeline().lastInstant();
// ...
    config.shouldAssumeDatePartitioning()))
    .mapToPair((PairFunction<String, String, List<String>>) partitionPath -> {
private Stream<HoodieInstant> getInstantsToArchive(JavaSparkContext jsc) {
  int maxCommitsToKeep = config.getMaxCommitsToKeep();
  int minCommitsToKeep = config.getMinCommitsToKeep();
  // ...
// Disjoint excerpts from compaction plan generation; elided spans are marked.
List<String> partitionPaths = FSUtils.getAllPartitionPaths(metaClient.getFs(),
    metaClient.getBasePath(), config.shouldAssumeDatePartitioning());
partitionPaths = config.getCompactionStrategy().filterPartitionPaths(config, partitionPaths);
// ...
        config.getCompactionStrategy().captureMetrics(config, dataFile, partitionPath, logFiles));
    })
    .filter(c -> !c.getDeltaFilePaths().isEmpty())
// ...
HoodieCompactionPlan compactionPlan = config.getCompactionStrategy().generateCompactionPlan(config,
    operations,
    CompactionUtils.getAllPendingCompactionPlans(metaClient).stream().map(Pair::getValue).collect(toList()));
Preconditions.checkArgument(compactionPlan.getOperations().stream()
List<HoodieRollbackStat> stats = jsc.parallelize(FSUtils.getAllPartitionPaths(metaClient.getFs(),
        getMetaClient().getBasePath(), config.shouldAssumeDatePartitioning()))
    .map((Function<String, HoodieRollbackStat>) partitionPath -> {
/**
 * Performs cleaning of partition paths according to the cleaning policy and returns the number
 * of files cleaned. Handles skew across partitions by using the file, rather than the
 * partition, as the unit of task distribution.
 *
 * @throws IllegalArgumentException if an unknown cleaning policy is provided
 */
@Override
public List<HoodieCleanStat> clean(JavaSparkContext jsc) {
  try {
    FileSystem fs = getMetaClient().getFs();
    List<String> partitionsToClean = FSUtils.getAllPartitionPaths(fs, getMetaClient().getBasePath(),
        config.shouldAssumeDatePartitioning());
    logger.info("Partitions to clean up : " + partitionsToClean + ", with policy " + config.getCleanerPolicy());
    if (partitionsToClean.isEmpty()) {
      logger.info("Nothing to clean here. It is already clean");
      return Collections.emptyList();
    }
    return cleanPartitionPaths(partitionsToClean, jsc);
  } catch (IOException e) {
    throw new HoodieIOException("Failed to clean up after commit", e);
  }
}
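A brief usage sketch, assuming `table` was obtained via HoodieTable.getHoodieTable as in the zero-partition test below:

// Sketch: trigger cleaning and report how many partitions produced clean stats.
List<HoodieCleanStat> cleanStats = table.clean(jsc);
logger.info("Cleaned " + cleanStats.size() + " partitions");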
public SparkBoundedInMemoryExecutor(final HoodieWriteConfig hoodieConfig,
    BoundedInMemoryQueueProducer<I> producer, BoundedInMemoryQueueConsumer<O, E> consumer,
    Function<I, O> bufferedIteratorTransform) {
  super(hoodieConfig.getWriteBufferLimitBytes(), producer, Optional.of(consumer), bufferedIteratorTransform);
  // Capture the Spark task context of the constructing thread so it can be
  // made available to the producer/consumer threads.
  this.sparkThreadTaskContext = TaskContext.get();
}
public static MetricsReporter createReporter(HoodieWriteConfig config, MetricRegistry registry) {
  MetricsReporterType type = config.getMetricsReporterType();
  MetricsReporter reporter = null;
  switch (type) {
    case GRAPHITE:
      reporter = new MetricsGraphiteReporter(config, registry);
      break;
    case INMEMORY:
      reporter = new InMemoryMetricsReporter();
      break;
    default:
      logger.error("Reporter type[" + type + "] is not supported.");
      break;
  }
  return reporter;
}
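A hedged usage sketch: the null check matters because unsupported reporter types fall through to null, and `start()` is assumed to be a lifecycle method on the MetricsReporter base class.

// Sketch: create and start a reporter, guarding against unsupported types.
MetricRegistry registry = new MetricRegistry();
MetricsReporter reporter = createReporter(writeConfig, registry);
if (reporter != null) {
  reporter.start(); // assumed lifecycle method on MetricsReporter
}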
private List<HoodieCompactionOperation> createCompactionOperations(HoodieWriteConfig config,
    Map<Long, List<Long>> sizesMap, Map<Long, String> keyToPartitionMap) {
  // newArrayListWithCapacity: plain Lists.newArrayList(int) has no capacity overload
  // and would not build an empty list of the intended size.
  List<HoodieCompactionOperation> operations = Lists.newArrayListWithCapacity(sizesMap.size());
  sizesMap.forEach((k, v) -> {
    HoodieDataFile df = TestHoodieDataFile.newDataFile(k);
    String partitionPath = keyToPartitionMap.get(k);
    List<HoodieLogFile> logFiles = v.stream().map(TestHoodieLogFile::newLogFile).collect(Collectors.toList());
    operations.add(new HoodieCompactionOperation(df.getCommitTime(),
        logFiles.stream().map(s -> s.getPath().toString()).collect(Collectors.toList()),
        df.getPath(), df.getFileId(), partitionPath,
        config.getCompactionStrategy().captureMetrics(config, Optional.of(df), partitionPath, logFiles)));
  });
  return operations;
}
/**
 * Test cleaner stats when there are no partition paths.
 */
@Test
public void testCleaningWithZeroPartitionPaths() throws IOException {
  HoodieWriteConfig config = HoodieWriteConfig.newBuilder().withPath(basePath).withAssumeDatePartitioning(true)
      .withCompactionConfig(HoodieCompactionConfig.newBuilder()
          .withCleanerPolicy(HoodieCleaningPolicy.KEEP_LATEST_COMMITS).retainCommits(2).build())
      .build();
  // Make a commit, although there are no partitionPaths. An example use-case is a client
  // creating a table with just some commit metadata, but no data/partitionPaths yet.
  HoodieTestUtils.createCommitFiles(basePath, "000");
  HoodieTable table = HoodieTable.getHoodieTable(
      new HoodieTableMetaClient(jsc.hadoopConfiguration(), config.getBasePath(), true), config, jsc);
  List<HoodieCleanStat> hoodieCleanStatsOne = table.clean(jsc);
  assertTrue("HoodieCleanStats should be empty for a table with empty partitionPaths",
      hoodieCleanStatsOne.isEmpty());
}
public Timer.Context getDeltaCommitCtx() {
  if (config.isMetricsOn() && deltaCommitTimer == null) {
    deltaCommitTimer = createTimer(deltaCommitTimerName);
  }
  return deltaCommitTimer == null ? null : deltaCommitTimer.time();
}