PTransforms for reading and writing BigQuery tables.
Table References
A fully-qualified BigQuery table name consists of three components:
- projectId: the Cloud project id (defaults to GcpOptions#getProject()).
- datasetId: the BigQuery dataset id, unique within a project.
- tableId: a table id, unique within a dataset.
BigQuery table references are stored as a TableReference, which comes from the BigQuery Java Client API. Tables can be referred to as Strings, with or without the projectId. A helper function is provided (BigQueryHelpers#parseTableSpec(String)) that parses the following string forms into a TableReference:
- [project_id]:[dataset_id].[table_id]
- [dataset_id].[table_id]
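For instance, a minimal sketch of turning a fully-qualified table spec into a TableReference; the table name here is a placeholder:

// Parses "my-project:my_dataset.my_table" into projectId, datasetId and tableId.
TableReference ref =
    BigQueryHelpers.parseTableSpec("my-project:my_dataset.my_table");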
Reading
Reading from BigQuery is supported by #read(SerializableFunction), which parses records in AVRO format into a custom type using a specified parse function, and by #readTableRows, which parses them into TableRow; the latter may be more convenient but has lower performance.
Both functions support reading either from a table or from the result of a query, via TypedRead#from(String) and TypedRead#fromQuery respectively. Exactly one of these must be specified.
Example: Reading rows of a table as TableRow.

PCollection<TableRow> weatherData = pipeline.apply(
    BigQueryIO.readTableRows().from("clouddataflow-readonly:samples.weather_stations"));
Example: Reading rows of a table and parsing them into a custom type.

PCollection<WeatherRecord> weatherData = pipeline.apply(
    BigQueryIO
        .read(new SerializableFunction<SchemaAndRecord, WeatherRecord>() {
          public WeatherRecord apply(SchemaAndRecord schemaAndRecord) {
            return new WeatherRecord(...);
          }
        })
        .from("clouddataflow-readonly:samples.weather_stations")
        .withCoder(SerializableCoder.of(WeatherRecord.class)));
Note: When using #read(SerializableFunction), you may sometimes need to use TypedRead#withCoder(Coder) to specify a Coder for the result type, if Beam fails to infer it automatically.
Example: Reading results of a query as TableRow.

PCollection<TableRow> meanTemperatureData = pipeline.apply(BigQueryIO.readTableRows()
    .fromQuery("SELECT year, mean_temp FROM [samples.weather_stations]"));
Users can optionally specify a query priority using TypedRead#withQueryPriority(TypedRead.QueryPriority) and a geographic location where the query will be executed using TypedRead#withQueryLocation(String). Query location must be specified for jobs that are not executed in US or EU. See BigQuery Jobs: query.
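For illustration, a minimal sketch combining a query read with both options; the query, priority, and location below are placeholders:

PCollection<TableRow> rows = pipeline.apply(
    BigQueryIO.readTableRows()
        .fromQuery("SELECT year, mean_temp FROM [samples.weather_stations]")
        // Run as an interactive (rather than batch-priority) query.
        .withQueryPriority(BigQueryIO.TypedRead.QueryPriority.INTERACTIVE)
        // Required when the query runs outside the US or EU.
        .withQueryLocation("asia-northeast1"));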
Writing
To write to a BigQuery table, apply a BigQueryIO.Write transformation. This consumes a PCollection of a user-defined type when using BigQueryIO#write() (recommended), or a PCollection of TableRow as input when using BigQueryIO#writeTableRows() (not recommended). When using a user-defined type, a function must be provided to turn this type into a TableRow using BigQueryIO.Write#withFormatFunction(SerializableFunction).
class Quote { Instant timestamp; String exchange; String symbol; double price; }

PCollection<Quote> quotes = ...

quotes.apply(BigQueryIO
    .<Quote>write()
    .to("my-project:my_dataset.my_table")
    .withSchema(new TableSchema().setFields(
        ImmutableList.of(
            new TableFieldSchema().setName("timestamp").setType("TIMESTAMP"),
            new TableFieldSchema().setName("exchange").setType("STRING"),
            new TableFieldSchema().setName("symbol").setType("STRING"),
            new TableFieldSchema().setName("price").setType("FLOAT"))))
    .withFormatFunction(quote -> new TableRow().set(..set the columns..))
    .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_TRUNCATE));
See BigQueryIO.Write for details on how to specify if a write should append to an existing table, replace the table, or verify that the table is empty. Note that the dataset being written to must already exist. Unbounded PCollections can only be written using Write.WriteDisposition#WRITE_EMPTY or Write.WriteDisposition#WRITE_APPEND.
Loading historical data into time-partitioned BigQuery tables
To load historical data into a time-partitioned BigQuery table, specify BigQueryIO.Write#withTimePartitioning with a TimePartitioning#setField(String) used for column-based partitioning. For example:

PCollection<Quote> quotes = ...;
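A minimal sketch of such a write, reusing the Quote type and table from the earlier example; the schema variable and the choice of "timestamp" as the partitioning column are assumptions for illustration:

quotes.apply(BigQueryIO.<Quote>write()
    .to("my-project:my_dataset.my_table")
    .withSchema(schema)  // the TableSchema built in the earlier write example
    .withFormatFunction(quote -> new TableRow()
        .set("timestamp", quote.timestamp.toString())
        .set("exchange", quote.exchange)
        .set("symbol", quote.symbol)
        .set("price", quote.price))
    // Partition the destination table by the TIMESTAMP column "timestamp".
    .withTimePartitioning(new TimePartitioning().setField("timestamp")));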
Writing different values to different tables
A common use case is to dynamically generate BigQuery table names based on the current value. To support this, BigQueryIO.Write#to(SerializableFunction) accepts a function mapping the current element to a tablespec. For example, here's code that outputs quotes of different stocks to different tables:

PCollection<Quote> quotes = ...;
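A minimal sketch of such a per-element destination function, assuming the Quote type exposes its symbol field and reusing the schema placeholder from the earlier example:

quotes.apply(BigQueryIO.<Quote>write()
    .withSchema(schema)
    .withFormatFunction(quote -> new TableRow()
        .set("timestamp", quote.timestamp.toString())
        .set("exchange", quote.exchange)
        .set("symbol", quote.symbol)
        .set("price", quote.price))
    // Route each quote to a per-symbol table.
    .to((ValueInSingleWindow<Quote> quote) -> {
      String symbol = quote.getValue().symbol;
      return new TableDestination(
          "my-project:my_dataset.quotes_" + symbol,  // table spec
          "Quotes of stock " + symbol);              // table description
    }));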
Per-table schemas can also be provided using BigQueryIO.Write#withSchemaFromView. This allows the schemas to be calculated based on a previous pipeline stage or statically via a org.apache.beam.sdk.transforms.Create transform. This method expects to receive a map-valued PCollectionView, mapping table specifications (project:dataset.table-id) to JSON-formatted TableSchema objects. All destination tables must be present in this map, or the pipeline will fail to create tables. Care should be taken if the map value is based on a triggered aggregation over an unbounded PCollection; the side input will contain the entire history of all table schemas ever generated, which might blow up memory usage. This method can also be useful when writing to a single table, as it allows a previous stage to calculate the schema (possibly based on the full collection of records being written to BigQuery).
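A minimal sketch of supplying a statically created schema map as a side input when writing to a single table; the table spec and the single-column JSON schema below are placeholders:

// Map of table spec to JSON-formatted TableSchema.
String jsonSchema = "{\"fields\": [{\"name\": \"price\", \"type\": \"FLOAT\"}]}";
PCollectionView<Map<String, String>> schemaView = pipeline
    .apply("SchemaMap", Create.of(KV.of("my-project:my_dataset.my_table", jsonSchema)))
    .apply(View.asMap());

quotes.apply(BigQueryIO.<Quote>write()
    .withFormatFunction(quote -> new TableRow().set("price", quote.price))
    .to("my-project:my_dataset.my_table")
    .withSchemaFromView(schemaView));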
For the most general form of dynamic table destinations and schemas, look at BigQueryIO.Write#to(DynamicDestinations).
Insertion Method
BigQueryIO.Write supports two methods of inserting data into BigQuery, specified using BigQueryIO.Write#withMethod. If no method is supplied, then a default method will be chosen based on the input PCollection. See BigQueryIO.Write.Method for more information about the methods. The different insertion methods provide different tradeoffs of cost, quota, and data consistency; please see the BigQuery documentation for more information about these tradeoffs.
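For illustration, a minimal sketch that forces batch loads instead of letting the default be chosen; the table spec, schema, and format function are placeholders from the earlier examples:

quotes.apply(BigQueryIO.<Quote>write()
    .to("my-project:my_dataset.my_table")
    .withSchema(schema)
    .withFormatFunction(quote -> new TableRow().set("price", quote.price))
    // Explicitly select file loads rather than streaming inserts.
    .withMethod(BigQueryIO.Write.Method.FILE_LOADS));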
Usage with templates
When using #read or #readTableRows() in a template, it is required to specify Read#withTemplateCompatibility(). Specifying this in a non-template pipeline is not recommended because it has somewhat lower performance.
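A minimal sketch of a template-compatible read, assuming a hypothetical getInputTable() ValueProvider option on the template's pipeline options:

PCollection<TableRow> rows = pipeline.apply(
    BigQueryIO.readTableRows()
        .withTemplateCompatibility()
        // Table spec supplied at template invocation time via a ValueProvider.
        .from(options.getInputTable()));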
When using #write() or #writeTableRows() with batch loads in a template, it is recommended to specify Write#withCustomGcsTempLocation. Writing to BigQuery via batch loads involves writing temporary files to this location, so the location must be accessible at pipeline execution time. By default, this location is captured at pipeline construction time and may be inaccessible if the template is reused from a different project or at a time when the original location no longer exists. Write#withCustomGcsTempLocation(ValueProvider) allows specifying the location as an argument to the template invocation.
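A minimal sketch, assuming a hypothetical getCustomGcsTempLocation() ValueProvider option on the template's pipeline options and the placeholders from the earlier write examples:

quotes.apply(BigQueryIO.<Quote>write()
    .to("my-project:my_dataset.my_table")
    .withSchema(schema)
    .withFormatFunction(quote -> new TableRow().set("price", quote.price))
    // GCS temp location resolved at template invocation time via a ValueProvider.
    .withCustomGcsTempLocation(options.getCustomGcsTempLocation()));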
Permissions
Permission requirements depend on the PipelineRunner that is used to execute the pipeline. Please refer to the documentation of the corresponding PipelineRunners for more details.
Please see BigQuery Access Control for security and permission-related information specific to BigQuery.