Introduction
If you need more advanced statistics than the StatisticsCalculator transformer provides, the RCaller transformer lets you run R scripts in FME, which makes much more advanced statistical analysis possible.
Getting Started
Before you can use the RCaller you need to install the R interpreter and the appropriate R packages. See the section on Installing the R Interpreter in the FME User Documentation.
Before you get going with the RCaller, there are a couple of useful things to remember:
- R doesn't like UNC path names, so you can't run an FME workspace stored on a UNC path, e.g. \\myprojects\fmeWorkspaces. You have to run your FME workspaces from a mapped drive, e.g. f:\myprojects\fmeWorkspaces
- Do a bit of reading about R, if you're not already familiar with the concepts. This is a pretty good tutorial. More resources are listed at the end of this article.
Source Data
FME adds new ports to the RCaller as you connect transformers or feature types to the Connect Input port. Each new input port inherits its name from the source object (i.e. the transformer name or the feature type name).
The port names are used as the data frame names in R, so rename the ports to something you'll be able to use in your R scripts.
FME loads your data into a temporary SQLite database, so for both performance and clarity, only select the attributes you're going to use in your R scripts. Make sure the data types are correct.
FME transfers the data into R as an R data frame. You can access the data frame or its columns in your R script by dragging items from the Data Frames menu.
So, to access the vector of Estimated values, drag the Data - Estimated item into the script window and you'll see Data$Estimated in your R script window.
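As a rough illustration of what the script sees, the sketch below builds a stand-in for the data frame that FME would supply. The name Data and the columns Estimated and Actual follow the example used in this article; the values are invented, and in the RCaller the data frame would already exist under the name of the input port.

```r
# Stand-in for the data frame FME would hand to the script (values are made up).
Data <- data.frame(
  Code      = c("ABC", "ABD", "TXU"),
  Estimated = c(9.30, 43.89, 46.14),
  Actual    = c(20.3, 34.1, 59.5),
  stringsAsFactors = FALSE
)

# Check that each column arrived with the type you expect.
columnTypes <- sapply(Data, class)
print(columnTypes)

# Data$Estimated is a plain numeric vector, ready for functions such as mean().
estimated <- Data$Estimated
meanEst <- mean(estimated)
```

Checking the column classes first is a cheap way to catch a numeric attribute that arrived as text.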
Building an R Script
This is not an R tutorial. To learn more about R, please see the resources section at the end of this article. If you're new to R, I'd recommend that you use the R Console to develop and debug your scripts: you'll get better feedback and it's a little easier to see the intermediate results. Build your script incrementally in the R Console so it's clear where any issues arise, then copy and paste the script into the RCaller. You can load a sample of your data using the R readers, for example:
Data = read.csv("D:/tmp/SampleData.csv")
# Note R uses UNIX paths, i.e. '/' not '\'.
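If you copy a path from Windows Explorer it will contain backslashes. A minimal sketch of two ways to end up with forward slashes, using only base R functions; the path itself is the sample path from this article:

```r
# A Windows-style path, written with escaped backslashes in R source.
windowsPath <- "D:\\tmp\\SampleData.csv"

# Swap every backslash for a forward slash.
rPath <- gsub("\\\\", "/", windowsPath)
print(rPath)

# file.path() joins path parts with forward slashes.
samePath <- file.path("D:", "tmp", "SampleData.csv")
print(samePath)
```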
Getting the Results out of R...
... can be tricky! RCaller passes data back to FME via a data frame called 'fmeOutput'. Each row in the data frame becomes a separate output feature in FME. If you know how to build and append to R data frames you can probably skip this section.
To populate the fmeOutput data frame, you can simply pass back a list of values (vectors of length one), for example:
> Data = read.csv("D:/tmp/SampleData.csv")
> meanAct = mean(Data$Actual)
> meanEst = mean(Data$Estimated)
> fmeOutput = data.frame(meanEst, meanAct)
But many R functions return more complex results, for example a linear regression solving for y = mx + k:
lm.linear <- lm(Data$Actual ~ Data$Estimated)
Use the R summary() function to see the results:
> summary(lm.linear)

Call:
lm(formula = Data$Actual ~ Data$Estimated)

Residuals:
    Min      1Q  Median      3Q     Max
-9.9667 -2.1022  0.2679  2.3813  8.3354

Coefficients:
                Estimate Std. Error t value Pr(>|t|)
(Intercept)    12.051001   9.149612   1.317    0.211
Data$Estimated -0.009291   0.861531  -0.011    0.992

Residual standard error: 5.045 on 13 degrees of freedom
Multiple R-squared: 8.946e-06, Adjusted R-squared: -0.07691
F-statistic: 0.0001163 on 1 and 13 DF, p-value: 0.9916
How to get that back into FME?
The names() function returns the names of the components in the summary:
> names(summary(lm.linear))
 [1] "call"          "terms"         "residuals"     "coefficients"  "aliased"
 [6] "sigma"         "df"            "r.squared"     "adj.r.squared" "fstatistic"
[11] "cov.unscaled"
But... some of these are more complex objects in their own right, like the "coefficients":
> summary(lm.linear)$coefficients
                   Estimate Std. Error     t value  Pr(>|t|)
(Intercept)    12.051001492  9.1496116  1.31710525 0.2105480
Data$Estimated -0.009291111  0.8615308 -0.01078442 0.9915592
So what do you do if you want to return to FME the common characteristics of the y = mx + k analysis, such as the r squared value, m and k?
You have to pick out the values you need and pass them to the fmeOutput data frame. In the above example, m is given by the Data$Estimated Estimate (-0.009291111), k (the y intercept) is given by the (Intercept) Estimate (12.051001492), and the r squared result is a simple value: r.squared. So you can use:
k <- summary(lm.linear)$coefficients[1,1]   # first row, first column: (Intercept) Estimate
m <- summary(lm.linear)$coefficients[2,1]   # second row, first column: Data$Estimated Estimate
r2 <- summary(lm.linear)$r.squared
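The steps above can be pulled together into one short script. This is a sketch rather than the contents of the sample workspace: the data values are invented, and it looks the coefficients up by row and column name instead of by position, which is an alternative to the indices used above. Because the model is written as Actual ~ Estimated with a data argument, the slope row is named "Estimated" here rather than "Data$Estimated".

```r
# Stand-in data (invented values); in the RCaller this would be the input data frame.
Data <- data.frame(
  Estimated = c(1, 2, 3, 4, 5),
  Actual    = c(2.2, 3.9, 6.1, 8.0, 9.8)
)

lm.linear <- lm(Actual ~ Estimated, data = Data)
lm.summary <- summary(lm.linear)

# Look the coefficients up by row and column name instead of by position.
k  <- lm.summary$coefficients["(Intercept)", "Estimate"]  # y intercept
m  <- lm.summary$coefficients["Estimated", "Estimate"]    # slope
r2 <- lm.summary$r.squared

# One row of scalar results, which becomes a single output feature.
fmeOutput <- data.frame(m, k, r2)
print(fmeOutput)
```

Name-based lookup keeps working if the order of the terms in the model changes, at the cost of having to know the exact row names.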
That was easy!
The workspace rcallerlinearregression.fmwt illustrates the example described above.
One final tip: expose the result variables in the RCaller so that they are available as attributes, which makes life easier in Workbench.
Exporting R Results to a Temporary File
In some cases it may not be appropriate to use a data frame for your R results, e.g. for a large raster or an image. In this case you can export your R results to a temporary data file and have FME re-read those results. The article RCaller: Interpolate Points to Raster Through Kriging illustrates how you can do this.
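A generic sketch of the idea, not the method used in the kriging article: write the result to a temporary file and pass only the file path back. The matrix, the CSV format, and the column name resultFile are all assumptions made for illustration; a real raster would need a suitable raster writer.

```r
# Stand-in for a large result (invented values): a small grid as a matrix.
grid <- matrix(seq_len(12), nrow = 3, ncol = 4)

# Write the result to a temporary file instead of returning it row by row.
outFile <- tempfile(pattern = "r_result_", fileext = ".csv")
write.csv(as.data.frame(grid), outFile, row.names = FALSE)

# Assumption: the path is handed back as a single attribute so that a
# downstream reader in the workspace can pick the file up.
fmeOutput <- data.frame(resultFile = outFile, stringsAsFactors = FALSE)
```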
Grouping with looping
For many statistical problems, you have a qualitative value, e.g. the codes ABC, ABD and TXU, that has some bearing on the quantitative values. So simple grouping makes a lot of sense.
For example, you might want to calculate the mean for each Code value:
Date        Code  Estimated  Actual
11/29/2016  TXU   46.14      59.5
11/28/2016  ABD   43.89      34.1
11/27/2016  TXU   42.15      25.8
11/27/2016  ABC   9.3        20.3
11/26/2016  ABD   42.15      50.6
11/25/2016  ABC   11.04      11.7
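For the per-Code mean itself, base R's aggregate() does the grouping without an explicit loop. A minimal sketch using the sample rows above; the data frame name Data is the one assumed throughout this article:

```r
# The sample rows above as a data frame.
Data <- data.frame(
  Date      = c("11/29/2016", "11/28/2016", "11/27/2016",
                "11/27/2016", "11/26/2016", "11/25/2016"),
  Code      = c("TXU", "ABD", "TXU", "ABC", "ABD", "ABC"),
  Estimated = c(46.14, 43.89, 42.15, 9.30, 42.15, 11.04),
  Actual    = c(59.5, 34.1, 25.8, 20.3, 50.6, 11.7),
  stringsAsFactors = FALSE
)

# Mean of Estimated and Actual for each Code; one row per group.
fmeOutput <- aggregate(cbind(Estimated, Actual) ~ Code, data = Data, FUN = mean)
print(fmeOutput)
```

Each row of the result is one Code, so FME would receive one output feature per group.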
You can put your analysis in a loop, subset the data by Code, and then calculate the regression for each subset. Something like:
for ( currentCode in unique(Data$Code)) {   # assuming the input data.frame is 'Data'
  tmpData = Data[Data$Code == currentCode,]
  lm.linear = lm(tmpData$Actual ~ tmpData$Estimated)   # linear model on y = mx+k
}
To return the results to FME, collect them in vectors as the loop runs and build fmeOutput once the loop has finished:
# initialize vectors to hold results
r2 <- c()
m <- c()
k <- c()
Code <- character()
# y = Actual x = Estimated
for ( currentCode in unique(Data$Code)) {
  tmpData = Data[Data$Code == currentCode,]
  # linear regression for y = mx+k
  lm.linear = lm(tmpData$Actual ~ tmpData$Estimated)
  # linear model result vectors y = mx+k
  r2 = c(r2, summary(lm.linear)$r.squared)
  k = c(k, summary(lm.linear)$coefficients[1,1])
  m = c(m, summary(lm.linear)$coefficients[2,1])
  Code = c(Code, currentCode)
}
fmeOutput<-data.frame(Code, m, k, r2)
You could also assign the results directly to a data frame, which would be more efficient than growing the vectors inside the loop.
The workspace rcallerlinearregressionwithgroups.fmwt illustrates the example described above.
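One possible way to build the data frame directly, sketched with invented data: split the input by Code, fit each group in a small function that returns a one-row data frame, and stack the rows. The helper name fitGroup is hypothetical, and this is not taken from the sample workspace.

```r
# Stand-in data (invented values) with two groups of four points each.
Data <- data.frame(
  Code      = rep(c("ABC", "ABD"), each = 4),
  Estimated = c(1, 2, 3, 4, 1, 2, 3, 4),
  Actual    = c(2.1, 3.9, 6.2, 7.8, 1.2, 1.9, 3.1, 3.8),
  stringsAsFactors = FALSE
)

# Fit y = mx + k for one group and return the results as a one-row data frame.
fitGroup <- function(groupData) {
  fit <- summary(lm(Actual ~ Estimated, data = groupData))
  data.frame(
    Code = groupData$Code[1],
    m    = fit$coefficients[2, 1],
    k    = fit$coefficients[1, 1],
    r2   = fit$r.squared,
    stringsAsFactors = FALSE
  )
}

# split() makes one data frame per Code; rbind stacks the per-group rows.
fmeOutput <- do.call(rbind, lapply(split(Data, Data$Code), fitGroup))
rownames(fmeOutput) <- NULL
print(fmeOutput)
```

Summarising each model once and reusing the summary also avoids calling summary() three times per group.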
Debugging your R Script
If you are relatively new to R, I recommend that you first develop your script in the R Console and then transfer it to the RCaller; it's a lot easier to debug there (see the section Building an R Script above). You may encounter the following RCaller error:
ERROR |RCaller(InlineQueryFactory): InlineQueryFactory failed with exit code 1 when executing R script. Output was:
Loading required package: gsubfn
Loading required package: proto
Loading required package: RSQLite
This seems to be a common error response if there is a syntax error in your script, or an undefined variable reference, so carefully check your script for unassigned variables or misspellings.
Remember, like FME, R is case sensitive.
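A small guard at the top of a script can catch misspelled or wrongly cased column names before the analysis runs. This is a sketch with invented data; the helper missingColumns is hypothetical, and how the message from stop() appears in the FME log has not been verified here.

```r
# Stand-in data (invented values); in the RCaller this would be the input data frame.
Data <- data.frame(Estimated = c(1, 2, 3), Actual = c(2, 4, 6))

# Hypothetical helper: list the columns a script needs that are absent.
missingColumns <- function(df, required) {
  setdiff(required, names(df))
}

# R is case sensitive, so "estimated" does not match the column "Estimated".
wrongCase <- missingColumns(Data, c("estimated", "Actual"))
rightCase <- missingColumns(Data, c("Estimated", "Actual"))
print(wrongCase)

# Fail early with a descriptive message if a required column is absent.
if (length(rightCase) > 0) {
  stop("Missing columns: ", paste(rightCase, collapse = ", "))
}
```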
Additional Resources
Here are some useful resources around using R in FME:
FME RCaller documentation
'R' tutorials: http://www.r-tutor.com/ and here
Extracting 'summary' information using summary(): example here
Appending to a data frame examples
Knowledge Center RCaller articles: