BigQuery Storage API: Arrow

Previously we had an introduction on the BigQuery Storage API. As explained the storage API of BigQuery supports two formats. For this tutorial we will choose the Arrow Format.

First let’s import the dependencies. The BigQuery storage API binary does not come with a library to parse Arrow. This way the consumer receives the binaries in an Arrow format, and it’s up to the consumer on how to consume the binaries and what libraries to use.


    <dependencyManagement>
        <dependencies>
            <dependency>
                <groupId>com.google.cloud</groupId>
                <artifactId>libraries-bom</artifactId>
                <version>20.5.0</version>
                <type>pom</type>
                <scope>import</scope>
            </dependency>
        </dependencies>
    </dependencyManagement>

    <dependencies>
        <dependency>
            <groupId>com.google.cloud</groupId>
            <artifactId>google-cloud-bigquerystorage</artifactId>
        </dependency>
        <dependency>
            <groupId>org.apache.arrow</groupId>
            <artifactId>arrow-vector</artifactId>
            <version>4.0.0</version>
        </dependency>
        <dependency>
            <groupId>org.apache.arrow</groupId>
            <artifactId>arrow-memory-netty</artifactId>
            <version>4.0.0</version>
        </dependency>
    </dependencies>

As mentioned before, when we use Arrow we need to import a library for the memory allocation Arrow needs.

We shall create first a plain Arrow Reader.
This Reader will be BigQuery agnostic. This is one of the benefits when we use a platform-language independent format.

An Arrow Binary shall be submitted to the reader with the schema and the rows shall be printed in CSV format.

package com.gkatzioura.bigquery.storage.api.arrow;

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.arrow.memory.BufferAllocator;
import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.util.Preconditions;
import org.apache.arrow.vector.FieldVector;
import org.apache.arrow.vector.VectorLoader;
import org.apache.arrow.vector.VectorSchemaRoot;
import org.apache.arrow.vector.ipc.ReadChannel;
import org.apache.arrow.vector.ipc.message.MessageSerializer;
import org.apache.arrow.vector.types.pojo.Field;
import org.apache.arrow.vector.types.pojo.Schema;
import org.apache.arrow.vector.util.ByteArrayReadableSeekableByteChannel;

import com.google.cloud.bigquery.storage.v1.ArrowRecordBatch;
import com.google.cloud.bigquery.storage.v1.ArrowSchema;

public class ArrowReader implements AutoCloseable {

    private final BufferAllocator allocator = new RootAllocator(Long.MAX_VALUE);

    private final VectorSchemaRoot root;
    private final VectorLoader loader;

    public ArrowReader(ArrowSchema arrowSchema) throws IOException {
        Schema schema =
                MessageSerializer.deserializeSchema(
                        new ReadChannel(
                                new ByteArrayReadableSeekableByteChannel(
                                        arrowSchema.getSerializedSchema().toByteArray())));

        Preconditions.checkNotNull(schema);
        List<FieldVector> vectors = new ArrayList<>();
        for (Field field : schema.getFields()) {
            vectors.add(field.createVector(allocator));
        }

        root = new VectorSchemaRoot(vectors);
        loader = new VectorLoader(root);
    }

    public void processRows(ArrowRecordBatch batch) throws IOException {
        org.apache.arrow.vector.ipc.message.ArrowRecordBatch deserializedBatch =
                MessageSerializer.deserializeRecordBatch(
                        new ReadChannel(
                                new ByteArrayReadableSeekableByteChannel(
                                        batch.getSerializedRecordBatch().toByteArray())),
                        allocator);

        loader.load(deserializedBatch);
        deserializedBatch.close();
        System.out.println(root.contentToTSVString());
        root.clear();
    }

    @Override
    public void close() throws Exception {
        root.close();
        allocator.close();
    }

}

The constructor will have the schema injected, then the schema root shall be created.
Pay attention that we receive the schema in a binary form, it’s up to us and our library on how to read it.


        Schema schema =
                MessageSerializer.deserializeSchema(
                        new ReadChannel(
                                new ByteArrayReadableSeekableByteChannel(
                                        arrowSchema.getSerializedSchema().toByteArray())));

You can find more on reading Arrow data on this tutorial.

Then on to our main class which is the one with any BigQuery logic needed.

package com.gkatzioura.bigquery.storage.api.arrow;

import org.apache.arrow.util.Preconditions;

import com.google.api.gax.rpc.ServerStream;
import com.google.cloud.bigquery.storage.v1.BigQueryReadClient;
import com.google.cloud.bigquery.storage.v1.CreateReadSessionRequest;
import com.google.cloud.bigquery.storage.v1.DataFormat;
import com.google.cloud.bigquery.storage.v1.ReadRowsRequest;
import com.google.cloud.bigquery.storage.v1.ReadRowsResponse;
import com.google.cloud.bigquery.storage.v1.ReadSession;

public class ArrowMain {

    public static void main(String[] args) throws Exception {

        String projectId = System.getenv("PROJECT_ID");

        try (BigQueryReadClient client = BigQueryReadClient.create()) {
            String parent = String.format("projects/%s", projectId);

            String srcTable =
                    String.format(
                            "projects/%s/datasets/%s/tables/%s",
                            projectId, System.getenv("DATASET"), System.getenv("TABLE"));

            ReadSession.Builder sessionBuilder =
                    ReadSession.newBuilder()
                               .setTable(srcTable)
                               .setDataFormat(DataFormat.ARROW);

            CreateReadSessionRequest.Builder builder =
                    CreateReadSessionRequest.newBuilder()
                                            .setParent(parent)
                                            .setReadSession(sessionBuilder)
                                            .setMaxStreamCount(1);
            ReadSession session = client.createReadSession(builder.build());

            try (ArrowReader reader = new ArrowReader(session.getArrowSchema())) {
                Preconditions.checkState(session.getStreamsCount() > 0);

                String streamName = session.getStreams(0).getName();

                ReadRowsRequest readRowsRequest =
                        ReadRowsRequest.newBuilder().setReadStream(streamName).build();

                ServerStream<ReadRowsResponse> stream = client.readRowsCallable().call(readRowsRequest);
                for (ReadRowsResponse response : stream) {
                    Preconditions.checkState(response.hasArrowRecordBatch());
                    reader.processRows(response.getArrowRecordBatch());
                }
            }
        }
    }

}

A BigQuery client is created. Then we create a session request with a max number of streams. We do have to specify that the format to be used will be Arrow.
Once we get a Response, the response will contain the initiated the Session, the Arrow schema and the streams that we shall use to retrieve the Data.
For each stream there has to be a ReadRowsRequest in order to fetch the data.
Our next example will focus on reading data in Avro format.