Experimental: a PTransform for reading from and writing to Google Cloud Spanner.
Reading from Cloud Spanner
To read from Cloud Spanner, apply the SpannerIO.Read transform. It returns a
PCollection of Struct, where each element represents an individual row returned
by the read operation. Both the Query and Read APIs are supported. See the Cloud
Spanner documentation for more information about reading.
To execute a query, specify it using SpannerIO.Read#withQuery(Statement) or
SpannerIO.Read#withQuery(String) during construction of the transform. For
example (the instance, database, and query are illustrative):

PCollection<Struct> rows = p.apply(
    SpannerIO.read()
        .withInstanceId(instanceId)
        .withDatabaseId(databaseId)
        .withQuery("SELECT id, name, email FROM users"));
To use the Read API, specify a table name with SpannerIO.Read#withTable(String)
and the columns to read with SpannerIO.Read#withColumns(List). For example
(table and column names are illustrative):

PCollection<Struct> rows = p.apply(
    SpannerIO.read()
        .withInstanceId(instanceId)
        .withDatabaseId(databaseId)
        .withTable("users")
        .withColumns("id", "name", "email"));
To read optimally using an index, specify the index name using
SpannerIO.Read#withIndex(String).
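An index-based read can be sketched as follows, assuming a hypothetical secondary index named "users_by_email" on the illustrative users table:

```java
// Read through the secondary index; the requested columns must be
// part of the index key, stored in the index, or covered by it.
PCollection<Struct> rows = p.apply(
    SpannerIO.read()
        .withInstanceId(instanceId)
        .withDatabaseId(databaseId)
        .withTable("users")
        .withIndex("users_by_email")
        .withColumns("id", "email"));
```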
The transform is guaranteed to be executed on a consistent snapshot of data, utilizing the
power of read-only transactions. Staleness of data can be controlled using the
SpannerIO.Read#withTimestampBound or
SpannerIO.Read#withTimestamp(Timestamp) methods. Read more about transactions in the
Cloud Spanner documentation.
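For example, a read at a fixed staleness can be sketched as follows (the 15-second staleness is an arbitrary illustrative value):

```java
// Allow Cloud Spanner to serve data that is at most 15 seconds stale,
// which can reduce latency and load on the database.
PCollection<Struct> rows = p.apply(
    SpannerIO.read()
        .withInstanceId(instanceId)
        .withDatabaseId(databaseId)
        .withQuery("SELECT id, name FROM users")
        .withTimestampBound(TimestampBound.ofExactStaleness(15, TimeUnit.SECONDS)));
```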
It is possible to read several PCollections within a single transaction. Apply the
SpannerIO#createTransaction() transform, which lazily creates a transaction. The
result of this transform can be passed to a read operation using
SpannerIO.Read#withTransaction(PCollectionView).
SpannerConfig spannerConfig = ...

// Create a transaction that all subsequent reads will share.
PCollectionView<Transaction> tx = p.apply(
    SpannerIO.createTransaction()
        .withSpannerConfig(spannerConfig)
        .withTimestampBound(TimestampBound.strong()));

PCollection<Struct> users = p.apply(
    SpannerIO.read()
        .withSpannerConfig(spannerConfig)
        .withQuery("SELECT name, email FROM users")
        .withTransaction(tx));

PCollection<Struct> tweets = p.apply(
    SpannerIO.read()
        .withSpannerConfig(spannerConfig)
        .withQuery("SELECT user, tweet, ts FROM tweets")
        .withTransaction(tx));
Writing to Cloud Spanner
The Cloud Spanner SpannerIO.Write transform writes to Cloud Spanner by executing a
collection of input row Mutations. The mutations are grouped into batches for
efficiency.
To configure the write transform, create an instance using #write() and then specify
the destination Cloud Spanner instance (Write#withInstanceId(String)) and destination
database (Write#withDatabaseId(String)). For example:
// Earlier in the pipeline, create a PCollection of Mutations to be written to Cloud Spanner.
PCollection<Mutation> mutations = ...;
// Write mutations.
SpannerWriteResult result = mutations.apply(
    "Write", SpannerIO.write()
        .withInstanceId(instanceId)
        .withDatabaseId(databaseId));
SpannerWriteResult
The SpannerWriteResult object contains the results of the transform,
including a PCollection of MutationGroups that failed to write, and a
PCollection that can be used as a completion signal.
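Consuming the result can be sketched as follows, assuming the write is configured with FailureMode.REPORT_FAILURES so that failed mutations are reported rather than failing the pipeline (downstreamInput is an illustrative name):

```java
SpannerWriteResult result = mutations.apply(
    SpannerIO.write()
        .withInstanceId(instanceId)
        .withDatabaseId(databaseId)
        .withFailureMode(SpannerIO.FailureMode.REPORT_FAILURES));

// MutationGroups that could not be written, e.g. for logging or retry.
PCollection<MutationGroup> failed = result.getFailedMutations();

// Use the output PCollection as a completion signal for a downstream step.
downstreamInput.apply(Wait.on(result.getOutput()));
```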
Batching
To reduce the number of transactions sent to Spanner, the Mutations are
grouped into batches. The default maximum size of a batch is 1MB or 5000 mutated cells.
To override this, use
Write#withBatchSizeBytes(long) and
Write#withMaxNumMutations(long). Setting either to a small value or zero
disables batching.
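For example, the defaults can be overridden as follows (the 64KB and 1000-cell limits are arbitrary illustrative values):

```java
SpannerWriteResult result = mutations.apply(
    SpannerIO.write()
        .withInstanceId(instanceId)
        .withDatabaseId(databaseId)
        .withBatchSizeBytes(64 * 1024)  // at most 64KB of mutations per batch
        .withMaxNumMutations(1000));    // at most 1000 mutated cells per batch
```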
Note that the maximum
size of a single transaction is 20,000 mutated cells, including cells in indexes. If you
have a large number of indexes and are getting exceptions with the message
"INVALID_ARGUMENT: The transaction contains too many mutations", you will need to
specify a smaller value for MaxNumMutations.
The batches written are obtained by grouping enough Mutations from the
Bundle provided by Beam to form (by default) 1000 batches. This group of
Mutations is then sorted by Key, and the batches are created from the sorted group, so that
each batch has keys that are 'close' to each other to optimise write performance. This
grouping factor (number of batches) is controlled by the parameter
Write#withGroupingFactor(int).
Note that each worker will need enough memory to hold
GroupingFactor x MaxBatchSizeBytes of Mutations, so if you have a large
MaxBatchSizeBytes you may need to reduce
GroupingFactor.
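Reducing memory pressure can be sketched as follows (100 is an arbitrary illustrative value):

```java
SpannerWriteResult result = mutations.apply(
    SpannerIO.write()
        .withInstanceId(instanceId)
        .withDatabaseId(databaseId)
        // Sort and group only 100 batches' worth of mutations at a time,
        // trading write locality for a smaller per-worker memory footprint.
        .withGroupingFactor(100));
```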
Database Schema Preparation
The Write transform reads the database schema on pipeline start. If the schema is created as
part of the same pipeline, this transform needs to wait until this has happened. Use
Write#withSchemaReadySignal(PCollection) to pass a signal
PCollection which will be used
with
Wait.OnSignal to prevent the schema from being read until it is ready. The Write
transform will be paused until the signal
PCollection is closed.
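A sketch, assuming an earlier pipeline step that creates the database schema and produces a signal PCollection on completion (ddlResult is an illustrative name):

```java
// ddlResult is the output of the pipeline step that creates the schema.
SpannerWriteResult result = mutations.apply(
    SpannerIO.write()
        .withInstanceId(instanceId)
        .withDatabaseId(databaseId)
        // The schema is read only after ddlResult is closed.
        .withSchemaReadySignal(ddlResult));
```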
Transactions
The transform does not provide the same transactional guarantees as Cloud Spanner. In particular:
- Individual Mutations are submitted atomically, but all Mutations are not submitted in the
same transaction.
- A Mutation is applied at least once.
- If the pipeline was unexpectedly stopped, mutations that were already applied will not get
rolled back.
Use
MutationGroup with the
WriteGrouped transform to ensure
that a small set of mutations is bundled together. It is guaranteed that mutations in a
MutationGroup are submitted in the same transaction. Note that a MutationGroup must not exceed
the Spanner transaction limits.
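Constructing a MutationGroup can be sketched as follows, with illustrative mutations where an attendant delete must commit atomically with the primary insert (table and column names are assumptions):

```java
// The first mutation is the primary; all attendant mutations are
// submitted in the same transaction as the primary.
MutationGroup group = MutationGroup.create(
    Mutation.newInsertOrUpdateBuilder("users")
        .set("id").to(1L)
        .set("name").to("alice")
        .build(),
    Mutation.delete("pending_users", KeySet.singleKey(Key.of(1L))));
```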
// Earlier in the pipeline, create a PCollection of MutationGroups to be written to Cloud Spanner.
PCollection<MutationGroup> mutationGroups = ...;
// Write mutation groups.
SpannerWriteResult result = mutationGroups.apply(
    "Write", SpannerIO.write()
        .withInstanceId(instanceId)
        .withDatabaseId(databaseId)
        .grouped());
Streaming Support
SpannerIO.Write can be used as a streaming sink, however as with batch mode note that
the write order of individual
Mutation/
MutationGroup objects is not guaranteed.