Extracting Data from a BigQuery Table to Parquet in GCS using Cloud Dataflow and Apache Beam

Satyasheel
3 min read · May 29, 2019

Cloud Dataflow is an excellent tool for moving data within GCP, and several blog posts have been dedicated to performing ETL with Dataflow and Apache Beam. Very few of those articles, however, focus on how to move data from BigQuery to Cloud Storage in a specific format.

For the problem I was working on, we had an internal ML framework which only reads data in Parquet format. Given that BigQuery did not yet support exporting data as Parquet, we came up with a Dataflow pipeline that reads data from BigQuery, converts it to Parquet, and writes it to a GCS bucket.

Some things to note before reading the article:

  • This solution is based on the Apache Beam Python SDK (2.12.0) running on Python 3.x
  • Beam natively supports Parquet reads and writes, but it depends on Apache Arrow (pyarrow) to define the schema — see the sketch after this list
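To make that concrete, here is a minimal sketch of how such a schema can be declared with pyarrow. The column names and types below are hypothetical and would need to match the columns you select from your BigQuery table:

```python
import pyarrow

# Hypothetical columns used for illustration; Beam's WriteToParquet expects a
# pyarrow.Schema (not a BigQuery schema), so each BigQuery column must be
# mapped to an Arrow type by hand.
parquet_schema = pyarrow.schema([
    ('user_id', pyarrow.string()),
    ('event_count', pyarrow.int64()),
    ('created_at', pyarrow.timestamp('ms')),
])
```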

AIM: Extract data from a BQ table to Parquet files in GCS

I expect readers to be familiar with Apache Beam and the columnar data format Parquet. This guide also assumes some familiarity with the Beam programming guide. If not, I would highly recommend following this link to get an understanding of Beam programming.

Let’s get started
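Before diving into the details, here is a minimal sketch of what such a pipeline can look like. The project, dataset, table, bucket, and schema below are all placeholders, and the source/sink arguments reflect the 2.12.0 SDK:

```python
import apache_beam as beam
import pyarrow
from apache_beam.options.pipeline_options import PipelineOptions

# Placeholder schema: must mirror the columns selected from BigQuery.
parquet_schema = pyarrow.schema([
    ('user_id', pyarrow.string()),
    ('event_count', pyarrow.int64()),
])

# Placeholder pipeline options; swap in your own project and bucket.
options = PipelineOptions(
    runner='DataflowRunner',
    project='my-gcp-project',
    temp_location='gs://my-bucket/tmp',
)

with beam.Pipeline(options=options) as pipeline:
    (
        pipeline
        # BigQuery rows arrive as Python dicts keyed by column name.
        | 'ReadFromBQ' >> beam.io.Read(beam.io.BigQuerySource(
            query='SELECT user_id, event_count '
                  'FROM `my-gcp-project.my_dataset.my_table`',
            use_standard_sql=True))
        # WriteToParquet serialises each dict using the pyarrow schema above.
        | 'WriteToParquet' >> beam.io.WriteToParquet(
            file_path_prefix='gs://my-bucket/output/events',
            schema=parquet_schema,
            file_name_suffix='.parquet')
    )
```

It is worth running this with `runner='DirectRunner'` on a small query first to validate the BigQuery-to-Arrow schema mapping before submitting the job to Dataflow.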

