Introduction
This article provides an overview of writing data when running in stream mode. It focuses on support for databases and data lakes.
Writers
Two parameters should be considered when writing to databases in stream mode: Bulk Insert and Features per Transaction.
Bulk Insert
Due to the high volumes often associated with streams, bulk insert (or equivalent) needs to be set to Yes. Setting bulk insert to yes means the SQL COPY (which is more performant) command is used instead of SQL INSERT.
Features per Transaction
Features per transaction are the number of features that FME places in each transaction before committing them to the database. This parameter is critical when writing to streams. If you make the number too high and the number of features written is low, the data won't get written. If the value is too low and the feature output is too high, the database will be overwhelmed with too many transactions.
For example, if you have a workflow with 20,000 messages per minute, if you set the features per transaction to be 1000, the data is committed roughly every three seconds, which is probably too often. If you set the features per transaction to be 500,000, the data is committed roughly every 25 minutes, which is likely not often enough.
| Bulk Insert | Features per Transaction | Other | |
|---|---|---|---|
| Amazon Aurora | N/A | Set based on message rate | |
| Amazon DynamoDB | N/A | Called Batch Size, it should be set. The maximum value is 25. | |
| Amazon Redshift | N/A | Set based on message rate. | |
| Azure Cosmos | Yes | N/A | Concurrent Requests need to be set based upon the number of features. More info here. |
| Azure SQL Database | Yes | Set based on message rate. | |
| Databricks | N/A | Set based on message rate. | |
| Google BigQuery | Yes | Called Table Splitting, it should be set based on the message rate. | The number of errors allowed should also be set if errors are expected. |
| MariaDB/SkySQL | Yes | Set based on message rate. | |
| Oracle | Yes | Set based on message rate. | |
| PostgreSQL | Yes | Set based on message rate. | |
| Snowflake | N/A | Set based on message rate. | |
| SQL Server | N/A | Set based on message rate. |
FeatureWriter
Writing via the FeatureWriter in stream mode to databases is supported. Parameters and considerations are the same as defined in the table above.
The main reason to use a FeatureWriter is if you wish to leverage the Group By to write features based on when the group changes. For example, in combination with the TimeWindower transformer, you could write data out whenever the window ends. Group By is not supported by writers. Since the FeatureWriter is a feature-holding transformer, features will not be released through its output ports in stream mode unless Group By is enabled and set to "When Group Changes (Advanced)".
Other Transformers
There are several other transformers you can use to write data out when running in stream mode.
- AmazonAthenaConnector
- AzureQueueStorageConnector
- SQLExecutor
- GoogleBigQueryConnector
Additional Resources
Webinar: Powering Real-Time Decisions with Continuous Data Streams