Implementation of a CleanableDataset that uses a
org.apache.gobblin.data.management.retention.version.finder.VersionFinder to find dataset versions and a
org.apache.gobblin.data.management.retention.policy.RetentionPolicy to select the deletable versions, and then deletes
the files of those versions and any newly empty parent directories.
Concrete subclasses should implement #getVersionFinder and #getRetentionPolicy.
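The division of labor between the base class and its subclasses can be sketched as follows. This is a minimal illustration only: the VersionFinder and RetentionPolicy interfaces, the CleanableBase class, the KeepNewest subclass, and every method name other than getVersionFinder and getRetentionPolicy are simplified, hypothetical stand-ins, not the Gobblin types. Versions are plain strings and nothing is deleted.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.Comparator;
import java.util.List;

public class RetentionSketch {

    /** Hypothetical stand-in for a version finder: lists the version names of one dataset. */
    public interface VersionFinder {
        Collection<String> findVersions();
    }

    /** Hypothetical stand-in for a retention policy: picks the versions to delete. */
    public interface RetentionPolicy {
        Collection<String> selectDeletable(List<String> allVersions);
    }

    /** Hypothetical base class: subclasses only supply the finder and the policy. */
    public abstract static class CleanableBase {
        public abstract VersionFinder getVersionFinder();
        public abstract RetentionPolicy getRetentionPolicy();

        /** Returns the versions a clean run would delete; performs no I/O. */
        public List<String> versionsToDelete() {
            List<String> versions = new ArrayList<>(getVersionFinder().findVersions());
            // Newest first, assuming version names sort chronologically.
            versions.sort(Comparator.reverseOrder());
            return new ArrayList<>(getRetentionPolicy().selectDeletable(versions));
        }
    }

    /** Example subclass: a fixed list of versions, keeping only the newest `keep`. */
    public static class KeepNewest extends CleanableBase {
        private final List<String> names;
        private final int keep;

        public KeepNewest(List<String> names, int keep) {
            this.names = names;
            this.keep = keep;
        }

        @Override
        public VersionFinder getVersionFinder() {
            return () -> names;
        }

        @Override
        public RetentionPolicy getRetentionPolicy() {
            // Everything after the first `keep` entries is deletable.
            return all -> all.subList(Math.min(keep, all.size()), all.size());
        }
    }
}
```

Keeping the finder and the policy behind two separate methods lets a subclass change how versions are discovered without touching the rule that decides which ones to keep.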
Datasets are directories in the filesystem containing data files organized in version-like directory structures.
Example datasets:
For snapshot-based datasets, with the directory structure:
/path/to/table/
  snapshot1/
    dataFiles...
  snapshot2/
    dataFiles...
snapshot1 and snapshot2 are each a dataset version.
For tracking datasets, with the directory structure:
/path/to/tracking/data/
  2015/
    06/
      01/
        dataFiles...
      02/
        dataFiles...
2015/06/01 and 2015/06/02 are each a dataset version.
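One way to picture how version directories relate to the dataset root is to list every directory at a fixed depth below it. The sketch below does that with java.nio.file. The class name VersionListingSketch, the method listVersions, and the fixed-depth rule itself are assumptions made for illustration; an actual version finder may locate versions differently.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class VersionListingSketch {

    /**
     * Lists the directories exactly `depth` levels below `datasetRoot`, as
     * '/'-separated paths relative to the root, in sorted order.
     * Hypothetical helper: depth 1 matches the snapshot layout, depth 3 the
     * year/month/day tracking layout.
     */
    public static List<String> listVersions(Path datasetRoot, int depth) throws IOException {
        try (Stream<Path> walk = Files.walk(datasetRoot, depth)) {
            return walk
                .filter(Files::isDirectory)
                // Skip the root itself: it relativizes to an empty path whose name count is 1.
                .filter(p -> !p.equals(datasetRoot))
                .map(datasetRoot::relativize)
                .filter(rel -> rel.getNameCount() == depth)
                .map(rel -> rel.toString().replace('\\', '/'))
                .sorted()
                .collect(Collectors.toList());
        }
    }
}
```

Called with depth 3 on the tracking layout above, this returns the year/month/day directories and ignores the intermediate year and month levels as well as the data files.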
CleanableDatasetBase uses a
org.apache.gobblin.data.management.version.finder.DatasetVersionFinder to find all
subdirectories that are versions of this dataset. It then uses a
org.apache.gobblin.data.management.retention.policy.RetentionPolicy to decide which versions of the dataset should be
deleted. For each version deleted, if #deleteEmptyDirectories is set, it also looks at all parent directories
and deletes those that are now empty, up to but not including the dataset root.
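The empty-parent cleanup can be sketched with java.nio.file as below. This is an illustrative approximation on a local filesystem, not the class's actual code: the names EmptyParentCleanupSketch, isEmpty, and deleteEmptyParents are hypothetical, and the sketch assumes the version directory has already been deleted and that its parents still exist.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;

public class EmptyParentCleanupSketch {

    /** Returns true if the directory contains no entries. */
    static boolean isEmpty(Path dir) throws IOException {
        try (DirectoryStream<Path> entries = Files.newDirectoryStream(dir)) {
            return !entries.iterator().hasNext();
        }
    }

    /**
     * After `deletedVersion` has been removed, deletes each ancestor that is now
     * empty. Stops at the first non-empty ancestor, and never deletes datasetRoot.
     */
    public static void deleteEmptyParents(Path datasetRoot, Path deletedVersion) throws IOException {
        Path root = datasetRoot.toAbsolutePath().normalize();
        Path current = deletedVersion.toAbsolutePath().normalize().getParent();
        // Walk upward while still strictly below the dataset root.
        while (current != null && current.startsWith(root) && !current.equals(root)) {
            if (!isEmpty(current)) {
                break; // A sibling version or file remains, so every ancestor is non-empty too.
            }
            Files.delete(current);
            current = current.getParent();
        }
    }
}
```

Stopping at the first non-empty ancestor is sufficient, because any directory above it contains at least that ancestor.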