Reading/Writing to Cosmos DB with FME

Connecting to Cosmos DB

Cosmos DB is a NoSQL database on the Microsoft Azure platform. It supports indexing and querying of geospatial data represented using the GeoJSON specification. As you will see below, writing large volumes of data to Cosmos DB in bulk is slower than you might expect. However, simple spatial queries against very large datasets are extremely fast.

Writing Data

The SQL API that FME uses to write to Cosmos DB is not well suited to very large datasets. Azure offers several other methods, both online and offline, for bulk uploading data into Cosmos DB. Refer to the Azure documentation for additional information.

If you want to use FME to write data to Cosmos DB, here are a few tips:

Concurrent Requests Parameter

If you are writing more than 20,000 records, leave the Concurrent Requests parameter at its default value of 4. At higher values the API appears to throttle requests, which causes FME to error. If you are writing a smaller number of features, you can experiment with increasing this value, which essentially parallelizes the requests.

Creating the Collection

Currently, FME doesn't create a spatial index when you write geometry. I recommend creating the database and collection manually in the Azure portal, rather than through FME, and then setting up the indexing policy yourself. The Azure documentation outlines the values that need to be added to support spatial indexing. Setting up the collection manually also lets you configure features such as Autopilot mode, which is not available in FME.
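As a rough illustration of what such an indexing policy can look like, the sketch below builds one as a Python dictionary and prints it as JSON that could be pasted into the collection's indexing policy editor in the Azure portal. The path /geom/* is an assumption based on the geom property used in the query example in this article, and the list of geometry types is only illustrative. Check the Azure documentation for the exact values your collection needs.

```python
import json


def spatial_indexing_policy(geometry_path="/geom/*"):
    """Return an indexing policy dict with a spatial index on geometry_path.

    geometry_path is an assumption: point it at the property where your
    documents store their GeoJSON geometry.
    """
    return {
        "indexingMode": "consistent",
        "automatic": True,
        "includedPaths": [{"path": "/*"}],
        "excludedPaths": [{"path": "/\"_etag\"/?"}],
        "spatialIndexes": [
            {
                "path": geometry_path,
                # Illustrative list; keep only the geometry types you store.
                "types": ["Point", "LineString", "Polygon", "MultiPolygon"],
            }
        ],
    }


if __name__ == "__main__":
    # Print the policy as JSON for pasting into the portal.
    print(json.dumps(spatial_indexing_policy(), indent=2))
```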

Scaling Writing

If you want to scale the amount of data FME can write to Cosmos DB, you can use FME Flow and scale the number of FME Engines. For example, I have a PostGIS table with 10 million points in it that I want to write to Cosmos DB. The PostGIS reader has Start Feature and Max Features to Read parameters. I submitted 10 jobs using the parameters below, which read the data in blocks from PostGIS and wrote the data in parallel to Cosmos DB.

FME Flow Job	Start Feature	Max Features to Read
1	0	999,999
2	1,000,000	999,999
3	2,000,000	999,999
4	3,000,000	999,999
5	4,000,000	999,999
6	5,000,000	999,999
7	6,000,000	999,999
8	7,000,000	999,999
9	8,000,000	999,999
10	9,000,000	999,999
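The block parameters for each job can also be generated rather than typed by hand. The sketch below is one way to do that, assuming Start Feature behaves as a zero-based offset (the number of features to skip); that is an assumption, so confirm it against the reader documentation. The function name feature_blocks is hypothetical. Note that this sketch makes each block exactly as large as the step between start values, so that no features fall between consecutive blocks, which differs slightly from the 999,999 used in the table above.

```python
def feature_blocks(total_features, jobs):
    """Split total_features into contiguous blocks, one per job.

    Returns a list of (job_number, start_feature, max_features) tuples.
    start_feature is treated as a zero-based offset (an assumption).
    """
    if jobs <= 0:
        raise ValueError("jobs must be a positive integer")
    base, remainder = divmod(total_features, jobs)
    blocks = []
    start = 0
    for job in range(1, jobs + 1):
        # Spread any remainder over the first few jobs.
        size = base + (1 if job <= remainder else 0)
        blocks.append((job, start, size))
        start += size
    return blocks


if __name__ == "__main__":
    for job, start, size in feature_blocks(10_000_000, 10):
        print(f"{job}\t{start}\t{size}")
```

Each tuple could then be supplied as the Start Feature and Max Features to Read values when submitting the corresponding job.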

Reading Data

Cosmos DB supports indexing and querying of geospatial data. The following geospatial functions are supported: ST_DISTANCE, ST_INTERSECTS, ST_WITHIN, ST_ISVALID.

The Cosmos DB reader supports passing in a WHERE clause on the feature type. FME passes the query text in that field directly to Cosmos DB without modification. The Azure documentation stipulates that the query should use the root keyword as a prefix, which confused me, as you don't need to do this in the Azure portal. The following WHERE clause worked for me; it retrieves all points within 300 m of the given point that have a path_id of 1049.

root.path_id = '1049' AND ST_DISTANCE(root.geom, {'type': 'Point', 'coordinates':[-123.01663220581631, 49.26604953933307]}) < 300
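If the clause needs to be generated for different points or distances, it can be assembled from parameters. The following is a minimal sketch, not part of FME: the function name build_where and its arguments are hypothetical, and the path_id and geom property names are taken from the example above. GeoJSON coordinates are ordered longitude first, then latitude, and ST_DISTANCE compares against a distance in metres.

```python
def build_where(path_id, lon, lat, radius_m, alias="root", geom_field="geom"):
    """Build a WHERE clause like the example above from parameters.

    build_where is a hypothetical helper; alias and geom_field default to
    the names used in the example clause.
    """
    path_id = str(path_id)
    if "'" in path_id:
        # Keep the sketch simple: refuse values that would break the quoting.
        raise ValueError("path_id must not contain single quotes")
    # GeoJSON order is [longitude, latitude].
    point = f"{{'type': 'Point', 'coordinates':[{float(lon)!r}, {float(lat)!r}]}}"
    return (
        f"{alias}.path_id = '{path_id}' "
        f"AND ST_DISTANCE({alias}.{geom_field}, {point}) < {radius_m}"
    )


if __name__ == "__main__":
    print(build_where("1049", -123.01663220581631, 49.26604953933307, 300))
```

The resulting string can be pasted into the WHERE clause field on the reader feature type.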

Gotchas

Currently, using Cosmos DB as a source in either the SQLExecutor or FeatureReader does not work, as there is an issue with cross-partition support.
