FME Version
Connecting to Cosmos DB
CosmosDB is a NoSQL database on the Microsoft Azure platform, it supports indexing and querying of geospatial data that's represented using the GeoJSON specification. As you will see below, writing large volumes of data to Cosmos DB in bulk isn't as performant as we might expect. However, performing simple spatial queries against huge datasets is extremely fast.
Writing Data
If you are writing a large amount of data up to Cosmos DB, the main SQL API we are using is not suitable for very large datasets. There are various ways to bulk upload data into CosmosDB both online and offline, see the Azure documentation for further information.
If you want to use FME to write data to Cosmos DB, here are a few tips:
Concurrent Requests Parameter
If you are writing more than 20,000 records leave the default concurrent requests set to 4, any higher than this and the API seems to throttle requests causing FME to error. If you are writing smaller numbers of features you can play around with increasing this number, it essentially parallelizes the requests.
Creating the collection
Currently, FME doesn't create a spatial index if you are writing geometry out. I would recommend creating the database and collection manually in the Azure portal rather than using FME and then setting up indexing policy manually. The Azure documentation outlines the values that need to be added to support spatial indexing. By setting the collection up manually, you can also configure features like Autopilot mode which you can't do in FME.
Scaling Writing
If you want to scale the amount of data FME can write to Cosmos DB, you can use FME Server and scale the number of FME Engines. For example, I have a PostGIS table with 10 million points in it that I want to write to CosmosDB. On the PostGIS reader, there is a Start Feature Parameter and Max Features to Read parameter. I set 10 jobs off using the parameters below which read the data in blocks from PostGIS and wrote the data in parallel to Cosmos DB.
FME Server Job | Start Feature | Max Features to Read |
---|---|---|
1 | 0 | 999,999 |
2 | 1 million | 999,999 |
3 | 2 million | 999,999 |
4 | 3 million | 999,999 |
5 | 4 million | 999,999 |
6 | 5 million | 999,999 |
7 | 6 million | 999,999 |
8 | 7 million | 999,999 |
9 | 8 million | 999,999 |
10 | 9 million | 999,999 |
Reading Data
Cosmos DB supports indexing and querying of geospatial data. The following geospatial functions are supported: ST_DISTANCE, ST_INTERSECTS, ST_WITHIN, ST_ISVALID.
The Cosmos DB reader supports passing in a WHERE clause on the feature type. We simply pass-through whatever query is written in the text field directly through to Cosmos DB without modifying. The Azure doc stipulates that request should be prefixed with the root keyword which confused me as you don't need to do this in the Azure portal. This is what a successful WHERE clause looked like for me to get all of the points within 300m of the point passed that have a path_id of 1049.
root.path_id = '1049' AND ST_DISTANCE(root.geom, {'type': 'Point', 'coordinates':[-123.01663220581631, 49.26604953933307]}) < 300
Gotchas
Currently, using Cosmos DB as a source in either the SQLExecutor or FeatureReader does not work as there is an issue with cross partition support.
Comments
0 comments
Please sign in to leave a comment.