Feature Caching and Performance

Question

How does Feature Caching affect the performance of an FME workspace?

Answer

Feature Caching affects workspace performance in several ways, some positive and some negative.

Note: For a refresher on what Feature Caching functionality does, please check the FME Workbench documentation under About Feature Inspection.

Creating Cached Data

Features are cached when a workspace is run in FME Workbench with the Feature Caching option active. Caches are indicated by green icons, which are animated while the workspace is running and features are being cached.

Above, the workspace has been run and the data cached at every step.

Performance Effects: Performance suffers when a workspace is initially run with Feature Caching activated. This is because it takes both time and system resources to create those caches and fill them with data.

However, the intention is that the time lost to caching data is more than compensated for by the time saved on later runs, when cached features are read as the source data instead of re-running the entire workspace.

Reading Cached Data

During design and testing of a workspace, it is often not necessary to re-run the entire translation. When features have been cached, they can be used as a data source, so that earlier transformers unaffected by an edit are skipped.

Above, the StatisticsCalculator has been edited in this workspace. The change in icon color shows that subsequent transformers are also affected by the change. To create the correct output, rather than run the entire workspace again, the features cached in the AreaCalculator can be used as a data source. This is achieved by running the workspace from the StatisticsCalculator onwards:

Performance Effects: Because the part of the workspace before the edited StatisticsCalculator does not need to be re-run, the run completes much faster. The improvement is directly related to the amount of work the skipped transformers would have carried out. For example, much time can be saved by caching the results of a Geocoder transformer, to avoid making subsequent calls to the same web service.

The intention is that the time gained during "partial runs" offsets the time lost during the initial execution of the workspace.
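The trade-off described above can be put into rough numbers. The following is a minimal sketch, not anything provided by FME: the function name and all timings are hypothetical, and it assumes every re-run without caching costs the same as a full run. It estimates how many partial runs are needed before the initial caching overhead has been paid back.

```python
import math


def break_even_reruns(full_run_s, caching_overhead_s, partial_run_s):
    """Smallest number of re-runs after which caching has saved time overall.

    full_run_s:         time for a full run without caching (hypothetical)
    caching_overhead_s: extra time the first run spends writing caches
    partial_run_s:      time for one partial run that reads from a cache
    Returns None when a partial run is no faster than a full run.
    """
    saving_per_rerun = full_run_s - partial_run_s
    if saving_per_rerun <= 0:
        return None  # caching never pays back under these assumptions
    # Caching wins once n * saving_per_rerun exceeds the overhead.
    return math.floor(caching_overhead_s / saving_per_rerun) + 1


# Illustrative numbers only, not measurements from a real workspace.
print(break_even_reruns(120, 30, 20))  # 1
print(break_even_reruns(60, 90, 30))   # 4
print(break_even_reruns(60, 10, 60))   # None
```

In the first case a single partial run already recovers the overhead; in the second, caching only pays off from the fourth re-run onwards.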

Collapsed Bookmarks

When a workspace is being run, and data is cached, each output port of a feature type or transformer is cached.

Above, for example, every transformer in this expanded bookmark caches its features.

However, during design and testing of a workspace, not all features need to be cached. Sections that have already been tested satisfactorily will not be edited further and so do not need to be cached. In that scenario, caching at every transformer can be prevented by collapsing the bookmark in which those transformers reside.

Now when the workspace is run, only the output port of the collapsed bookmark is cached. Features are not cached for the transformers within it.

Performance Effects: Collapsing bookmarks to exclude sections of the workspace reduces the number of caches created, so the run is quicker and uses fewer system resources.
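To illustrate the effect, here is a toy model of the rule described above: one cache per output port, except that a collapsed bookmark is cached only at its own output ports. It is a sketch under stated assumptions, not FME's implementation; the transformer names, port counts, and the `bookmark_exits` mapping are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Transformer:
    name: str                       # hypothetical transformer name
    output_ports: int               # number of output ports it exposes
    bookmark: Optional[str] = None  # bookmark it sits in, if any


def count_caches(transformers, collapsed, bookmark_exits):
    """Count caches written in one run under the simplified rule.

    collapsed:      set of bookmark names that are collapsed
    bookmark_exits: output ports exposed by each collapsed bookmark
    """
    total = 0
    for t in transformers:
        if t.bookmark in collapsed:
            continue  # transformers inside a collapsed bookmark are skipped
        total += t.output_ports
    for name in collapsed:
        total += bookmark_exits[name]  # only the bookmark's own outputs
    return total


workspace = [
    Transformer("StepA", 1, bookmark="Prep"),
    Transformer("StepB", 2, bookmark="Prep"),
    Transformer("StepC", 2),
]

print(count_caches(workspace, set(), {}))           # 5
print(count_caches(workspace, {"Prep"}, {"Prep": 1}))  # 3
```

With the "Prep" bookmark expanded, all five ports are cached; collapsing it leaves three caches in this model.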

Unconnected Output Ports

Some transformers in FME are designed to create multiple outputs. In general, FME creates output only for connected output ports.

Above, for example, a workspace turns contours into a TIN surface using the SurfaceModeller transformer. Because only the TINSurface output port is connected, only a TIN Surface is created.

However, when Feature Caching is activated, all output ports are treated as connected, and all outputs are created.

Above, Feature Caching has caused all outputs to be generated, even though only TINSurface is connected.

Performance Effects: On the one hand, this is a useful feature that allows the user to inspect different results without having to connect their output port. On the other hand, the process takes longer and uses more system resources; not just because more data is being cached, but because more calculations are taking place to generate all of the different outputs.

Parallel Processing

Parallel processing is a technique for using multiple processes to improve performance. This is often achieved using a custom transformer. However, parallel processing in a custom transformer is not permitted when Feature Caching is activated.

Performance Effects: Although Feature Caching can save time using "partial runs", it causes the loss of any performance gains achieved through parallel processing. The workspace author must balance the benefits of each technique and choose which produces the best performance.

Large Dataset Caches

The reduction in performance when caching is especially noticeable when the amount of data involved is very large. For example, given multiple raster files, each very large in size, caching all that data can consume a considerable amount of system resources. So, consider whether caching is truly required before activating it. Also, follow general best practices for attributes: do not read attributes that are not needed, use the AttributeRemover to remove excess attributes as early in the workspace as possible, and in particular remove list attributes once they are no longer required.

Another suggestion is to design and test a workspace using just a small subset of the data, and to turn off caching when the workspace is ready to be run on the full dataset in production.

Performance Effects: The idea behind Feature Caching is to save time and resources through the use of "partial runs". If the time and resources taken to cache the data initially are greater than the time saved by partial runs, then Feature Caching should not be applied. Large datasets make caching expensive, so caching them should be avoided where possible.

Conclusion

By permitting partial runs of a workspace, Feature Caching can bring substantial performance benefits. However, there are also occasions where it has a negative impact. Care must be taken to avoid caching more data than is necessary, using collapsed bookmarks where appropriate. Pay particular attention to transformers with multiple unused output ports, which might generate more data than expected when Feature Caching is active.
