Connecting to Cosmos DB
Cosmos DB is a NoSQL database on the Microsoft Azure platform. It supports indexing and querying of geospatial data represented using the GeoJSON specification. As you will see below, writing large volumes of data to Cosmos DB in bulk is slower than you might expect. However, simple spatial queries against huge datasets are extremely fast.
Writing Data
The main SQL API that we use to write to Cosmos DB is not suitable for very large datasets. There are various other methods for bulk uploading data into Cosmos DB, both online and offline. Refer to the Azure documentation for additional information.
If you want to use FME to write data to Cosmos DB, here are a few tips:
Concurrent Requests Parameter
If you are writing more than 20,000 records, leave Concurrent Requests at its default of 4. Any higher than this, and the API seems to throttle requests, causing FME to error. If you are writing fewer features, you can experiment with increasing this number, which essentially parallelizes the requests.
Creating the Collection
Currently, FME doesn't create a spatial index when you write geometry. I recommend creating the database and collection manually in the Azure portal, rather than using FME, and then setting up the indexing policy yourself. The Azure documentation outlines the values that need to be added to support spatial indexing. By setting up the collection manually, you can also configure features like Autopilot mode (now called autoscale in Azure), which is not available in FME.
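As a rough illustration of what such an indexing policy might look like, the sketch below builds one as a Python dictionary and prints it as JSON that could be pasted into the collection's indexing policy editor in the Azure portal. It assumes the geometry is stored in a top-level property named geom (as in the query example later in this article); the list of geometry types is also an assumption, so trim it to what your data contains and check the Azure documentation for the current policy format.

```python
import json

# Assumption: each document stores its GeoJSON geometry in a top-level
# property called "geom". Change the path if your property is named differently.
spatial_indexing_policy = {
    "indexingMode": "consistent",
    "automatic": True,
    "includedPaths": [{"path": "/*"}],
    "excludedPaths": [{"path": "/\"_etag\"/?"}],
    "spatialIndexes": [
        {
            "path": "/geom/*",
            # Assumption: index every geometry type; keep only the ones you store.
            "types": ["Point", "LineString", "Polygon", "MultiPolygon"],
        }
    ],
}

# Serialize to JSON text suitable for pasting into the portal.
policy_json = json.dumps(spatial_indexing_policy, indent=2)
print(policy_json)
```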
Scaling Writing
If you want to scale the amount of data FME can write to Cosmos DB, you can use FME Flow and scale the number of FME Engines. For example, I have a PostGIS table with 10 million points in it that I want to write to Cosmos DB. The PostGIS reader has Start Feature and Max Features to Read parameters. I submitted 10 jobs using the parameters below, which read the data in blocks from PostGIS and wrote the data in parallel to Cosmos DB.
| FME Flow Job | Start Feature | Max Features to Read |
|---|---|---|
| 1 | 0 | 999,999 |
| 2 | 1 million | 999,999 |
| 3 | 2 million | 999,999 |
| 4 | 3 million | 999,999 |
| 5 | 4 million | 999,999 |
| 6 | 5 million | 999,999 |
| 7 | 6 million | 999,999 |
| 8 | 7 million | 999,999 |
| 9 | 8 million | 999,999 |
| 10 | 9 million | 999,999 |
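If you need to generate this kind of parameter table for a different feature count or number of jobs, a small script can do the arithmetic. The sketch below is only an illustration: block_parameters is a hypothetical helper, not part of FME, and it assumes Start Feature counts from 0. It makes each block's size equal to the distance between start values so that consecutive blocks are contiguous; adjust it if your reader numbers features differently.

```python
def block_parameters(total_features, jobs):
    """Return one (start_feature, max_features_to_read) pair per job.

    Assumption: the reader's Start Feature value is zero-based.
    """
    block_size = -(-total_features // jobs)  # ceiling division
    blocks = []
    for job in range(jobs):
        start = job * block_size
        if start >= total_features:
            break  # nothing left for the remaining jobs
        # The last block may be smaller than the others.
        blocks.append((start, min(block_size, total_features - start)))
    return blocks


# Print the parameters for 10 jobs over 10 million features.
for job, (start, count) in enumerate(block_parameters(10_000_000, 10), start=1):
    print(f"job {job}: Start Feature = {start}, Max Features to Read = {count}")
```

Each pair could then be supplied as published parameter values when submitting the jobs to FME Flow.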
Reading Data
Cosmos DB supports indexing and querying of geospatial data. The following geospatial functions are supported: ST_DISTANCE, ST_INTERSECTS, ST_WITHIN, ST_ISVALID.
The Cosmos DB reader supports a WHERE clause on the feature type. FME passes the query text in this field directly to Cosmos DB without modification. The Azure documentation stipulates that property references should be prefixed with the root keyword, which confused me, as you don't need to do this in the Azure portal. Here is a WHERE clause that worked for me; it retrieves all points within 300 m of the given point that have a path_id of 1049:
root.path_id = '1049' AND ST_DISTANCE(root.geom, {'type': 'Point', 'coordinates':[-123.01663220581631, 49.26604953933307]}) < 300
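If you build clauses like this one programmatically (for example, to fill a published parameter), a small helper keeps the formatting consistent. The following is a minimal sketch: within_distance_clause is a hypothetical function, and the property names path_id and geom are taken from the example above. Remember that GeoJSON coordinates are ordered longitude first, then latitude, and that ST_DISTANCE returns metres. The values are interpolated without escaping, so only pass trusted input.

```python
def within_distance_clause(path_id, lon, lat, metres, geom_property="geom"):
    """Build a WHERE clause matching documents near a point.

    Hypothetical helper: property names follow the example in this article.
    """
    # GeoJSON point literal; coordinates are [longitude, latitude].
    point = f"{{'type': 'Point', 'coordinates':[{lon!r}, {lat!r}]}}"
    return (
        f"root.path_id = '{path_id}' AND "
        f"ST_DISTANCE(root.{geom_property}, {point}) < {metres}"
    )


# Rebuild a clause equivalent to the one shown above.
print(within_distance_clause(1049, -123.01663220581631, 49.26604953933307, 300))
```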
Gotchas
Currently, using Cosmos DB as a source in either the SQLExecutor or the FeatureReader does not work, because of an issue with cross-partition support.