BigQuery Storage API: Avro

Previously we had an introduction on the BigQuery Storage API and we proceeded reading data using the Arrow format.
In this tutorial we shall read Data using the Avro format.

What applied on the previous tutorial applies here too.

We shall create a BigQuery Storage Client, create a ReadSession using the Avro format and iterate the data on each stream.

Let’s get started by importing the dependencies, we do import the Avro library needed.

    <dependencyManagement>
        <dependencies>
            <dependency>
                <groupId>com.google.cloud</groupId>
                <artifactId>libraries-bom</artifactId>
                <version>20.5.0</version>
                <type>pom</type>
                <scope>import</scope>
            </dependency>
        </dependencies>
    </dependencyManagement>

    <dependencies>
        <dependency>
            <groupId>com.google.cloud</groupId>
            <artifactId>google-cloud-bigquerystorage</artifactId>
        </dependency>
        <dependency>
            <groupId>org.apache.avro</groupId>
            <artifactId>avro</artifactId>
            <version>1.9.2</version>
        </dependency>
    </dependencies>

Our next step would be to create an Avro Data Reader for our rows that have the schema of col1:string, col2:int. In our case we shall just print the data through sys.out

package com.gkatzioura.bigquery.storage.api.avro;

import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryDecoder;
import org.apache.avro.io.DecoderFactory;

import com.google.cloud.bigquery.storage.v1.AvroSchema;
import com.google.protobuf.ByteString;

public class AvroReader {

    private final GenericDatumReader<GenericRecord> datumReader;

    public AvroReader(AvroSchema arrowSchema) {
        Schema schema = new Schema.Parser().parse(arrowSchema.getSchema());
        this.datumReader = new GenericDatumReader<>(schema);
    }

    public void processRows(ByteString avroRows) throws IOException {
        try(InputStream inputStream = new ByteArrayInputStream(avroRows.toByteArray())) {
            BinaryDecoder decoder =DecoderFactory.get().binaryDecoder(inputStream, null);

            while (!decoder.isEnd()) {
                GenericRecord item = datumReader.read(null, decoder);

                System.out.println(item.get("col1")+","+item.get("col2"));
            }
        }
    }

}

Then on to our main class which is the one with any BigQuery logic needed.

package com.gkatzioura.bigquery.storage.api.avro;

import org.apache.arrow.util.Preconditions;

import com.google.api.gax.rpc.ServerStream;
import com.google.cloud.bigquery.storage.v1.BigQueryReadClient;
import com.google.cloud.bigquery.storage.v1.CreateReadSessionRequest;
import com.google.cloud.bigquery.storage.v1.DataFormat;
import com.google.cloud.bigquery.storage.v1.ReadRowsRequest;
import com.google.cloud.bigquery.storage.v1.ReadRowsResponse;
import com.google.cloud.bigquery.storage.v1.ReadSession;

public class AvroMain {

    public static void main(String[] args) throws Exception {

        String projectId = System.getenv("PROJECT_ID");

        try (BigQueryReadClient client = BigQueryReadClient.create()) {
            String parent = String.format("projects/%s", projectId);

            String srcTable =
                    String.format(
                            "projects/%s/datasets/%s/tables/%s",
                            projectId, System.getenv("DATASET"), System.getenv("TABLE"));

            ReadSession.Builder sessionBuilder =
                    ReadSession.newBuilder()
                               .setTable(srcTable)
                               .setDataFormat(DataFormat.AVRO);


            CreateReadSessionRequest.Builder builder =
                    CreateReadSessionRequest.newBuilder()
                                            .setParent(parent)
                                            .setReadSession(sessionBuilder)
                                            .setMaxStreamCount(1);
            ReadSession session = client.createReadSession(builder.build());

            Preconditions.checkState(session.getStreamsCount() > 0);

            String streamName = session.getStreams(0).getName();

            ReadRowsRequest readRowsRequest =
                    ReadRowsRequest.newBuilder().setReadStream(streamName).build();

            ServerStream<ReadRowsResponse> stream = client.readRowsCallable().call(readRowsRequest);

            for (ReadRowsResponse response : stream) {
                new AvroReader(session.getAvroSchema()).processRows(response.getAvroRows().getSerializedBinaryRows());
            }
        }
    }

}

A BigQuery client is created. Then we create a session request with a max number of streams. We did specify that the format to be used will be Avro.
Once we get a Response, the response will contain the initiated the Session, the Avro schema and the streams that we shall use to retrieve the Data.
For each stream there has to be a ReadRowsRequest in order to fetch the data.
Then we pass data to our Avro decoder.

That’s it we just read data from the BigQuery Storage API using Avro and Arrow!