Abstract base class of all file systems used by Flink. This class may be extended to implement
distributed file systems or local file systems. The abstraction provided by this class is very simple,
and the set of available operations quite limited, to support the common denominator of a wide
range of file systems. For example, appending to or mutating existing files is not supported.
Flink implements and supports some file system types directly (for example the default
machine-local file system). Other file system types are accessed by an implementation that bridges
to the suite of file systems supported by Hadoop (such as HDFS).
Scope and Purpose
The purpose of this abstraction is to expose a common and well-defined interface for
access to files. This abstraction is used both by Flink's fault tolerance mechanism (storing
state and recovery data) and by reusable built-in connectors (file sources / sinks).
The purpose of this abstraction is not to give user programs an abstraction with
extreme flexibility and control across all possible file systems. That mission would be a folly,
as the differences in characteristics of even the most common file systems are already quite
large. It is expected that user programs that need specialized functionality of certain file systems
in their functions, operations, sources, or sinks instantiate the specialized file system adapters
directly.
Data Persistence Contract
The FileSystem's output streams (FSDataOutputStream) are used to persistently store data,
both for results of streaming applications and for fault tolerance and recovery. It is therefore
crucial that the persistence semantics of these streams are well defined.
Definition of Persistence Guarantees
Data written to an output stream is considered persistent if two requirements are met:
- Visibility Requirement: It must be guaranteed that all other processes, machines,
virtual machines, containers, etc. that are able to access the file see the data consistently
when given the absolute file path. This requirement is similar to the close-to-open
semantics defined by POSIX, but restricted to the file itself (by its absolute path).
- Durability Requirement: The file system's specific durability/persistence requirements
must be met. These are specific to the particular file system. For example, the
LocalFileSystem does not provide any durability guarantees against crashes of either the
hardware or the operating system, while replicated distributed file systems (like HDFS)
typically guarantee durability in the presence of fewer than n concurrent node failures,
where n is the replication factor.
Updates to the file's parent directory (such that the file shows up when
listing the directory contents) are not required to be complete for the data in the file stream
to be considered persistent. This relaxation is important for file systems where updates to
directory contents are only eventually consistent.
The FSDataOutputStream has to guarantee data persistence for the written bytes
once the call to FSDataOutputStream#close() returns.
Examples
- For fault-tolerant distributed file systems, data is considered persistent once
it has been received and acknowledged by the file system, typically by having been replicated
to a quorum of machines (durability requirement). In addition, the absolute file path
must be visible to all other machines that will potentially access the file (visibility
requirement).
Whether data has hit non-volatile storage on the storage nodes depends on the specific
guarantees of the particular file system.
The metadata updates to the file's parent directory are not required to have reached
a consistent state. It is permissible that some machines see the file when listing the parent
directory's contents while others do not, as long as access to the file by its absolute path
is possible on all nodes.
- A local file system must support the POSIX close-to-open semantics.
Because the local file system does not have any fault tolerance guarantees, no further
requirements exist.
The above implies specifically that data may still be in the OS cache when considered
persistent from the local file system's perspective. Crashes that cause the OS cache to lose
data are considered fatal to the local machine and are not covered by the local file system's
guarantees as defined by Flink.
That means that computed results, checkpoints, and savepoints that are written only to
the local file system are not guaranteed to be recoverable after a failure of the local machine,
making local file systems unsuitable for production setups.
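The local file system case can be sketched with plain JDK file APIs. This is only an illustration of the "persistent once close() returns" rule, not Flink's LocalFileSystem; the class name CloseToOpenSketch and the file name part-0 are hypothetical.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical helper (not part of Flink) illustrating "persistent once close() returns". */
final class CloseToOpenSketch {

    /** Writes the payload to a new temp file, closes the stream, then re-opens the file by absolute path. */
    static String writeThenReadBack(String payload) {
        try {
            Path dir = Files.createTempDirectory("close-to-open");
            Path file = dir.resolve("part-0").toAbsolutePath();
            // try-with-resources ensures close() has returned before the read below.
            try (OutputStream out = Files.newOutputStream(file)) {
                out.write(payload.getBytes(StandardCharsets.UTF_8));
            }
            // Readers may rely on seeing the bytes only after close() has returned.
            return new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(writeThenReadBack("checkpoint-bytes"));
    }
}
```

Note that nothing here forces the bytes to disk: as described above, the data may still sit in the OS cache when the read succeeds.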
Updating File Contents
Many file systems either do not support overwriting contents of existing files at all, or do
not support consistent visibility of the updated contents in that case. For that reason,
Flink's FileSystem does not support appending to existing files, or seeking within output streams
so that previously written data could be overwritten.
Overwriting Files
Overwriting files is in general possible. A file is overwritten by deleting it and creating
a new file. However, certain file systems cannot make that change synchronously visible
to all parties that have access to the file.
For example, Amazon S3 historically guaranteed only
eventual consistency in the visibility of the file replacement: some machines could see
the old file while others saw the new file.
To avoid these consistency issues, the implementations of failure/recovery mechanisms in
Flink strictly avoid writing to the same file path more than once.
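One way to follow the "never write the same path twice" rule is to give every write attempt a fresh file name. The sketch below uses only JDK APIs and a random UUID suffix; the class WriteOncePathsSketch and its naming scheme are assumptions for illustration, not how Flink names its checkpoint files.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.UUID;

/** Hypothetical helper (not part of Flink): every write goes to a path that is never reused. */
final class WriteOncePathsSketch {

    /** Writes data under a fresh name; CREATE_NEW fails instead of overwriting an existing file. */
    static Path writeOnce(Path dir, String logicalName, String data) {
        // A random suffix makes each attempt land on its own path.
        Path target = dir.resolve(logicalName + "-" + UUID.randomUUID());
        try {
            Files.write(target, data.getBytes(StandardCharsets.UTF_8), StandardOpenOption.CREATE_NEW);
            return target;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Reads a file back as UTF-8 text. */
    static String read(Path file) {
        try {
            return new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Creates a scratch directory for the example. */
    static Path newTempDir() {
        try {
            return Files.createTempDirectory("write-once");
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Path dir = newTempDir();
        Path first = writeOnce(dir, "chk", "v1");
        Path second = writeOnce(dir, "chk", "v2");
        System.out.println(first.getFileName() + " / " + second.getFileName());
    }
}
```

Because the old file is never replaced, a reader holding the first path keeps seeing "v1" regardless of how quickly the file system propagates changes.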
Thread Safety
Implementations of FileSystem must be thread-safe: the same instance of FileSystem
is frequently shared across multiple threads in Flink and must be able to concurrently
create input/output streams and list file metadata.
The FSDataInputStream and FSDataOutputStream implementations are strictly
not thread-safe. Instances of the streams should also not be passed between threads
in between read or write operations, because there are no guarantees about the visibility of
operations across threads (many operations do not create memory fences).
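The division of responsibilities can be sketched as follows: one shared, thread-safe object hands out streams, and each stream is created, written, and closed by a single thread. InMemoryFs is a hypothetical stand-in built on JDK classes, not a Flink FileSystem implementation.

```java
import java.io.ByteArrayOutputStream;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Hypothetical sketch: a shared thread-safe "file system" with thread-confined streams. */
final class StreamPerThreadSketch {

    /** Stand-in for a thread-safe file system; concurrent create() calls are allowed. */
    static final class InMemoryFs {
        private final ConcurrentHashMap<String, ByteArrayOutputStream> files = new ConcurrentHashMap<>();

        ByteArrayOutputStream create(String path) {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            if (files.putIfAbsent(path, out) != null) {
                throw new IllegalStateException("already exists: " + path);
            }
            return out;
        }

        int fileCount() {
            return files.size();
        }
    }

    /** Starts the given number of writers on one shared InMemoryFs and returns the file count. */
    static int writeConcurrently(int writers) {
        InMemoryFs fs = new InMemoryFs(); // shared by all tasks
        ExecutorService pool = Executors.newFixedThreadPool(writers);
        try {
            List<Future<Integer>> pending = new ArrayList<>();
            for (int i = 0; i < writers; i++) {
                final int id = i;
                pending.add(pool.submit(() -> {
                    // The stream never leaves the task that created it.
                    try (ByteArrayOutputStream out = fs.create("part-" + id)) {
                        out.write(id);
                    }
                    return id;
                }));
            }
            for (Future<Integer> f : pending) {
                f.get(); // propagate any failure from the writer tasks
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e.getCause());
        } finally {
            pool.shutdown();
        }
        return fs.fileCount();
    }

    public static void main(String[] args) {
        System.out.println(writeConcurrently(4));
    }
}
```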
Streams Safety Net
When application code obtains a FileSystem (via FileSystem#get(URI) or via
Path#getFileSystem()), the FileSystem instantiates a safety net for that FileSystem.
The safety net ensures that all streams created from the FileSystem are closed when the
application task finishes (or is canceled or fails). That way, the task's threads do not
leak connections.
Internal runtime code can explicitly obtain a FileSystem that does not use the safety
net via FileSystem#getUnguardedFileSystem(URI).
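The general idea of such a safety net can be sketched as a registry that tracks open streams and closes whatever is left when the task ends. This is a simplified illustration under that assumption; the class StreamSafetyNetSketch and its methods are hypothetical and do not reflect Flink's actual implementation.

```java
import java.io.Closeable;
import java.io.IOException;
import java.util.Collections;
import java.util.IdentityHashMap;
import java.util.Set;

/** Hypothetical registry that closes all still-open streams when the owning task ends. */
final class StreamSafetyNetSketch implements Closeable {

    // Identity-based set: two distinct streams are tracked separately even if they are equal().
    private final Set<Closeable> open = Collections.newSetFromMap(new IdentityHashMap<>());
    private boolean closed;

    /** Tracks a newly created stream; rejects registration after the task has ended. */
    synchronized void register(Closeable stream) {
        if (closed) {
            throw new IllegalStateException("safety net already closed");
        }
        open.add(stream);
    }

    /** Called when a stream is closed normally by application code. */
    synchronized void unregister(Closeable stream) {
        open.remove(stream);
    }

    synchronized int openCount() {
        return open.size();
    }

    /** Invoked once at task end: closes every stream that was not closed by the application. */
    @Override
    public synchronized void close() {
        closed = true;
        for (Closeable stream : open) {
            try {
                stream.close();
            } catch (IOException ignored) {
                // Best effort: keep closing the remaining streams.
            }
        }
        open.clear();
    }
}
```

A task would call register when it opens a stream, unregister when it closes one itself, and close() on the registry in its cleanup path, so that leaked streams do not outlive the task.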