PTransforms for reading and writing BigQuery tables.
Table References
A fully-qualified BigQuery table name consists of three components:
- projectId: the Cloud project id (defaults to GcpOptions#getProject()).
- datasetId: the BigQuery dataset id, unique within a project.
- tableId: a table id, unique within a dataset.
BigQuery table references are stored as a TableReference, which comes from the BigQuery Java Client API. Tables can be referred to as Strings, with or without the projectId. A helper function is provided (BigQueryHelpers#parseTableSpec(String)) that parses the following string forms into a TableReference:
- [project_id]:[dataset_id].[table_id]
- [dataset_id].[table_id]
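For instance, a minimal sketch of turning a fully-qualified table spec into a TableReference; the table name here is a placeholder:

// Parses "my-project:my_dataset.my_table" into projectId, datasetId and tableId.
TableReference ref =
    BigQueryHelpers.parseTableSpec("my-project:my_dataset.my_table");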
Reading
Reading from BigQuery is supported by #read(SerializableFunction), which parses records in AVRO format into a custom type using a specified parse function, and by #readTableRows, which parses them into TableRow; the latter may be more convenient but has lower performance.
Both functions support reading either from a table or from the result of a query, via TypedRead#from(String) and TypedRead#fromQuery respectively. Exactly one of these must be specified.
Example: Reading rows of a table as TableRow.

PCollection<TableRow> weatherData = pipeline.apply(
    BigQueryIO.readTableRows().from("clouddataflow-readonly:samples.weather_stations"));
Example: Reading rows of a table and parsing them into a custom type.

PCollection<WeatherRecord> weatherData = pipeline.apply(
    BigQueryIO
        .read(new SerializableFunction<SchemaAndRecord, WeatherRecord>() {
          public WeatherRecord apply(SchemaAndRecord schemaAndRecord) {
            return new WeatherRecord(...);
          }
        })
        .from("clouddataflow-readonly:samples.weather_stations")
        .withCoder(SerializableCoder.of(WeatherRecord.class)));
Note: When using #read(SerializableFunction), you may sometimes need to use TypedRead#withCoder(Coder) to specify a Coder for the result type, if Beam fails to infer it automatically.
Example: Reading results of a query as TableRow.

PCollection<TableRow> meanTemperatureData = pipeline.apply(BigQueryIO.readTableRows()
    .fromQuery("SELECT year, mean_temp FROM [samples.weather_stations]"));
Users can optionally specify a query priority using TypedRead#withQueryPriority(TypedRead.QueryPriority) and a geographic location where the query will be executed using TypedRead#withQueryLocation(String). Query location must be specified for jobs that are not executed in US or EU. See BigQuery Jobs: query.
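For illustration, a minimal sketch combining a query read with both options; the query, priority, and location below are placeholders:

PCollection<TableRow> rows = pipeline.apply(
    BigQueryIO.readTableRows()
        .fromQuery("SELECT year, mean_temp FROM [samples.weather_stations]")
        // Run as an interactive (rather than batch-priority) query.
        .withQueryPriority(BigQueryIO.TypedRead.QueryPriority.INTERACTIVE)
        // Required when the query runs outside the US or EU.
        .withQueryLocation("asia-northeast1"));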
Writing
To write to a BigQuery table, apply a BigQueryIO.Write transformation. This consumes a PCollection of a user-defined type when using BigQueryIO#write() (recommended), or a PCollection of TableRow as input when using BigQueryIO#writeTableRows() (not recommended). When using a user-defined type, a function must be provided to turn this type into a TableRow using BigQueryIO.Write#withFormatFunction(SerializableFunction).
class Quote { Instant timestamp; String exchange; String symbol; double price; }

PCollection<Quote> quotes = ...

quotes.apply(BigQueryIO
    .<Quote>write()
    .to("my-project:my_dataset.my_table")
    .withSchema(new TableSchema().setFields(
        ImmutableList.of(
            new TableFieldSchema().setName("timestamp").setType("TIMESTAMP"),
            new TableFieldSchema().setName("exchange").setType("STRING"),
            new TableFieldSchema().setName("symbol").setType("STRING"),
            new TableFieldSchema().setName("price").setType("FLOAT"))))
    .withFormatFunction(quote -> new TableRow().set(..set the columns..))
    .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_TRUNCATE));
See BigQueryIO.Write for details on how to specify if a write should append to an existing table, replace the table, or verify that the table is empty. Note that the dataset being written to must already exist. Unbounded PCollections can only be written using Write.WriteDisposition#WRITE_EMPTY or Write.WriteDisposition#WRITE_APPEND.
Loading historical data into time-partitioned BigQuery tables
To load historical data into a time-partitioned BigQuery table, specify BigQueryIO.Write#withTimePartitioning with a TimePartitioning#setField(String) used for column-based partitioning. For example:

PCollection<Quote> quotes = ...;
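A minimal sketch of such a write, reusing the Quote type and table from the earlier example; the schema variable and the choice of "timestamp" as the partitioning column are assumptions for illustration:

quotes.apply(BigQueryIO.<Quote>write()
    .to("my-project:my_dataset.my_table")
    .withSchema(schema)  // the TableSchema built in the earlier write example
    .withFormatFunction(quote -> new TableRow()
        .set("timestamp", quote.timestamp.toString())
        .set("exchange", quote.exchange)
        .set("symbol", quote.symbol)
        .set("price", quote.price))
    // Partition the destination table by the TIMESTAMP column "timestamp".
    .withTimePartitioning(new TimePartitioning().setField("timestamp")));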
Writing different values to different tables
A common use case is to dynamically generate BigQuery table names based on the current value. To support this, BigQueryIO.Write#to(SerializableFunction) accepts a function mapping the current element to a tablespec. For example, here's code that outputs quotes of different stocks to different tables:

PCollection<Quote> quotes = ...;
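A minimal sketch of such a per-element destination function, assuming the Quote type exposes its symbol field and reusing the schema placeholder from the earlier example:

quotes.apply(BigQueryIO.<Quote>write()
    .withSchema(schema)
    .withFormatFunction(quote -> new TableRow()
        .set("timestamp", quote.timestamp.toString())
        .set("exchange", quote.exchange)
        .set("symbol", quote.symbol)
        .set("price", quote.price))
    // Route each quote to a per-symbol table.
    .to((ValueInSingleWindow<Quote> quote) -> {
      String symbol = quote.getValue().symbol;
      return new TableDestination(
          "my-project:my_dataset.quotes_" + symbol,  // table spec
          "Quotes of stock " + symbol);              // table description
    }));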
Per-table schemas can also be provided using BigQueryIO.Write#withSchemaFromView. This allows the schemas to be calculated based on a previous pipeline stage or statically via a org.apache.beam.sdk.transforms.Create transform. This method expects to receive a map-valued PCollectionView, mapping table specifications (project:dataset.table-id) to JSON-formatted TableSchema objects. All destination tables must be present in this map, or the pipeline will fail to create tables. Care should be taken if the map value is based on a triggered aggregation over an unbounded PCollection; the side input will contain the entire history of all table schemas ever generated, which might blow up memory usage. This method can also be useful when writing to a single table, as it allows a previous stage to calculate the schema (possibly based on the full collection of records being written to BigQuery).
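A minimal sketch of supplying a statically created schema map as a side input when writing to a single table; the table spec and the single-column JSON schema below are placeholders:

// Map of table spec to JSON-formatted TableSchema.
String jsonSchema = "{\"fields\": [{\"name\": \"price\", \"type\": \"FLOAT\"}]}";
PCollectionView<Map<String, String>> schemaView = pipeline
    .apply("SchemaMap", Create.of(KV.of("my-project:my_dataset.my_table", jsonSchema)))
    .apply(View.asMap());

quotes.apply(BigQueryIO.<Quote>write()
    .withFormatFunction(quote -> new TableRow().set("price", quote.price))
    .to("my-project:my_dataset.my_table")
    .withSchemaFromView(schemaView));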
For the most general form of dynamic table destinations and schemas, look at BigQueryIO.Write#to(DynamicDestinations).
Insertion Method
BigQueryIO.Write supports two methods of inserting data into BigQuery, specified using BigQueryIO.Write#withMethod. If no method is supplied, then a default method will be chosen based on the input PCollection. See BigQueryIO.Write.Method for more information about the methods. The different insertion methods provide different tradeoffs of cost, quota, and data consistency; please see the BigQuery documentation for more information about these tradeoffs.
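For illustration, a minimal sketch that forces batch loads instead of letting the default be chosen; the table spec, schema, and format function are placeholders from the earlier examples:

quotes.apply(BigQueryIO.<Quote>write()
    .to("my-project:my_dataset.my_table")
    .withSchema(schema)
    .withFormatFunction(quote -> new TableRow().set("price", quote.price))
    // Explicitly select file loads rather than streaming inserts.
    .withMethod(BigQueryIO.Write.Method.FILE_LOADS));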
Usage with templates
When using #read or #readTableRows() in a template, it is required to specify Read#withTemplateCompatibility(). Specifying this in a non-template pipeline is not recommended because it has somewhat lower performance.
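A minimal sketch of a template-compatible read, assuming a hypothetical getInputTable() ValueProvider option on the template's pipeline options:

PCollection<TableRow> rows = pipeline.apply(
    BigQueryIO.readTableRows()
        .withTemplateCompatibility()
        // Table spec supplied at template invocation time via a ValueProvider.
        .from(options.getInputTable()));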
When using #write() or #writeTableRows() with batch loads in a template, it is recommended to specify Write#withCustomGcsTempLocation. Writing to BigQuery via batch loads involves writing temporary files to this location, so the location must be accessible at pipeline execution time. By default, this location is captured at pipeline construction time and may be inaccessible if the template is reused from a different project or at a time when the original location no longer exists. Write#withCustomGcsTempLocation(ValueProvider) allows specifying the location as an argument to the template invocation.
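A minimal sketch, assuming a hypothetical getCustomGcsTempLocation() ValueProvider option on the template's pipeline options and the placeholders from the earlier write examples:

quotes.apply(BigQueryIO.<Quote>write()
    .to("my-project:my_dataset.my_table")
    .withSchema(schema)
    .withFormatFunction(quote -> new TableRow().set("price", quote.price))
    // GCS temp location resolved at template invocation time via a ValueProvider.
    .withCustomGcsTempLocation(options.getCustomGcsTempLocation()));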
Permissions
Permission requirements depend on the PipelineRunner that is used to execute the pipeline. Please refer to the documentation of the corresponding PipelineRunners for more details.
Please see BigQuery Access Control for security and permission-related information specific to BigQuery.