Cleaning & Preparing Data for Google BigQuery

Liz Sanderson
Liz Sanderson
  • Updated

FME Version

Introduction

FME comes equipped with lots of different tools to help clean, prepare, and validate datasets to make sure the data that is migrated or enriched within FME meets all the requirements for an existing system, sharing with stakeholders, or further processing in subsequent steps.

In fact, FME carries out some of the most crucial aspects to get the data ready for Google BigQuery automatically. However, it is important to know the requirements for Google BigQuery tables and how to quickly transform any data so it can be successfully loaded into Google BigQuery.

In this article, we will take a look at 3 different requirements for Google BigQuery that are the most common causes of issues when loading data and we will discuss how to resolve any issues related to them including transformer recommendations.
 

Data Integrity Requirements

1. Geography in Google BigQuery

In Google BigQuery location information is represented as a GEOGRAPHY data type. This data type is based on the OGC Simple Features Specification (SFS) and is a collection of points, lines, or polygons that form a “simple” arrangement on the WGS84 reference ellipsoid.

The Google BigQuery Writer will automatically reproject all geometry and geography features to WGS84 as long as the coordinate system of the features entering the Writer is known. To make sure the coordinate system features of your source data are known you can either specify the coordinate system in the Reader or use the Reprojector transformer to reproject the data to a valid coordinate system.

To make sure all geometries and geographies are written to a Google BigQuery table successfully with FME the geometries can be validated with the GeometryValidator and the OGC Simple Compliant validation test.

1584047442140.png

This simple validation will eliminate the most common geometry-related scenarios that can cause issues when loading your geospatial data into Google BigQuery. If issues are detected and not automatically repaired (e.g. self-intersecting features), additional transformers like the SpikeRemover or Generalizer can help to resolve the problems. See the Data Validation and QA within FME tutorial for more information.

The Generalizer can also help reduce the size of the geospatial datasets loaded into Google BigQuery by eliminating vertices in the geometries and therefore the resulting geography stored in Google BigQuery.

When writing geographies that cover a very large area (eg. multiple countries or continents), it can also be useful to increase the number of vertices to guarantee the integrity of spherical geodesic lines that are part of the geographies. The Densifier is the tool of choice for this task.
 

2. Date & Time in Google BigQuery

Google BigQuery supports 4 different date & time types. With the DateTimeConverter transformer in FME, any automatically recognized or custom date time format can be easily transformed into a format supported by Google BigQuery:

Google BigQuery data type

DateTimeConverter Output Format

DATE

%Y-%m-%d

TIME

%H:%M:%S

DATETIME

%Y-%m-%d %H:%M:%S

TIMESTAMP

%Y-%m-%d %H:%M:%S %z

 

3. Column names in BigQuery

Google BigQuery has specific requirements for column names. As the BigQuery will alter or reject features with unsupported column names, it is recommended to use the AttributeManager, AttributeRenamer, or BulkAttributeRenamer to modify any attribute names to match the requirements for Google BigQuery columns.
 

Was this article helpful?

Comments

0 comments

Please sign in to leave a comment.