RCaller: Ins and outs of using R in FME

Liz Sanderson

FME Version

  • FME 2016.x

Introduction

If you need more advanced statistics than the StatisticsCalculator transformer provides, the RCaller transformer lets you run R scripts inside FME, which makes much more sophisticated statistical analysis possible.

 

Getting Started

Before you can use the RCaller you need to install the appropriate R packages. See the section on Installing the R Interpreter in the FME User Documentation.

Before you get going with the RCaller, there are a couple of useful things to remember:

  • R doesn't like UNC path names, so you can't run an FME workspace stored on a UNC path, e.g. \\myprojects\fmeWorkspaces. You have to run your FME workspaces from a mapped drive, e.g. f:\myprojects\fmeWorkspaces
  • Do a bit of reading about R, if you're not already familiar with the concepts. This is a pretty good tutorial. More resources are listed at the end of this article.

 

Source Data

FME adds new ports to the RCaller as you connect transformers or feature types to the Connect Input port. Each new input port inherits its name from the source object (i.e., the transformer name or the feature type name).

rcaller.png

The port names are used as the data frame names in R, so rename the ports to something you'll be able to use in your R scripts.

FME loads your data into a temporary SQLite database, so for both performance and clarity, only select the attributes you're going to use in your R scripts. Make sure the data types are correct.

rcallertables.png

FME transfers the data into R as an R data frame. You can access the data frame or its columns in your R script by dragging items from the Data Frames menu:

rscript1.png

So to access the vector of Estimated values drag the Data - Estimated item into the script window and you'll see Data$Estimated in your R script window.
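
As a rough sketch of what this looks like from inside a script: the data frame below is built by hand only so the snippet runs outside FME. In the RCaller, a data frame named after the input port (assumed here to be Data, with Estimated and Actual attributes) would already exist.

```r
# Stand-in for the data frame FME would supply; in the RCaller this object
# already exists and is named after the input port (assumed to be 'Data').
Data <- data.frame(
  Estimated = c(46.14, 43.89, 42.15),
  Actual    = c(59.5, 34.1, 25.8)
)

# Check that the columns arrived with the expected types before analysing them.
str(Data)

# A single column is a plain vector, addressed as <port name>$<attribute name>.
estimated <- Data$Estimated
is.numeric(estimated)
```

Running str() first is a cheap way to spot a numeric attribute that arrived as text.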

 

Building an R Script

This is not an R tutorial. To learn more about R, please see the resources section at the end of this article. If you're new to R, I'd recommend that you use the R Console to develop and debug your scripts: you'll get better feedback and it's a little easier to see the intermediate results. Build your script incrementally in the R Console so it's clear where any issues arise, then copy and paste the script into the RCaller. You can load a sample of your data using the R readers, e.g.:

Data = read.csv("D:/tmp/SampleData.csv")

# Note: use '/' in R paths rather than '\' (a backslash would have to be escaped as '\\').
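
If a path reaches you in Windows form, one possible way to convert it is sketched below. The variable names are made up for illustration; the path is the sample path used above.

```r
# A Windows-style path; backslashes must be doubled inside an R string.
win_path <- "D:\\tmp\\SampleData.csv"

# Replace every backslash with a forward slash, which R accepts in file paths.
r_path <- gsub("\\", "/", win_path, fixed = TRUE)
print(r_path)
```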

rconsole.png

 

Getting the Results out of R...

... can be tricky! RCaller passes data back to FME via a data frame called 'fmeOutput'. Each row in the data frame will become a separate output feature in FME. If you know how to build and append to R data frames you can probably skip this section.

To populate the fmeOutput data frame, you can simply pass back a list of values (vectors of length one), e.g.:

> Data = read.csv("D:/tmp/SampleData.csv") 
> meanAct = mean(Data$Actual) 
> meanEst = mean(Data$Estimated) 
> fmeOutput = data.frame(meanEst, meanAct) 

This will result in a single FME feature carrying the two mean values in attributes named meanEst and meanAct.
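
Because each row becomes its own feature, a data frame with several rows should produce several features. The sketch below assumes an input data frame called Data with Estimated and Actual columns. The Difference column is a made-up name for illustration.

```r
# Stand-in input; in the RCaller 'Data' would come from the input port.
Data <- data.frame(
  Estimated = c(46.14, 43.89, 42.15, 9.30),
  Actual    = c(59.5, 34.1, 25.8, 20.3)
)

# One output row per input row: four rows here, so four output features,
# each carrying the attributes Estimated, Actual and Difference.
fmeOutput <- data.frame(
  Estimated  = Data$Estimated,
  Actual     = Data$Actual,
  Difference = Data$Actual - Data$Estimated
)
nrow(fmeOutput)
```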

But many R functions return more complex results. For example a linear regression function solving for y=mx+k:

lm.linear <- lm(Data$Actual ~ Data$Estimated) 

Use the R summary() function to see the results:

> summary(lm.linear)

Call:
lm(formula = Data$Actual ~ Data$Estimated)

Residuals:
    Min      1Q  Median      3Q     Max 
-9.9667 -2.1022  0.2679  2.3813  8.3354 

Coefficients:
                Estimate Std. Error t value Pr(>|t|)
(Intercept)    12.051001   9.149612   1.317    0.211
Data$Estimated -0.009291   0.861531  -0.011    0.992

Residual standard error: 5.045 on 13 degrees of freedom
Multiple R-squared:  8.946e-06, Adjusted R-squared:  -0.07691 
F-statistic: 0.0001163 on 1 and 13 DF,  p-value: 0.9916 

How to get that back into FME?

The names() function will give you back the variable names in the summary, i.e.:

> names(summary(lm.linear))
[1] "call"          "terms"         "residuals"     "coefficients"  "aliased"      
[6] "sigma"         "df"            "r.squared"     "adj.r.squared" "fstatistic"  
[11] "cov.unscaled"  

But... some of these are more complex objects in their own right, like the "coefficients":

> summary(lm.linear)$coefficients
                   Estimate Std. Error     t value  Pr(>|t|)
(Intercept)    12.051001492  9.1496116  1.31710525 0.2105480
Data$Estimated -0.009291111  0.8615308 -0.01078442 0.9915592

So what to do if you want to return to FME the common characteristics of the y=mx+k analysis such as the r squared value, m & k?

You have to pick out the values you need and pass them to the fmeOutput data frame. In the above example, m is given by the Data$Estimated Estimate = -0.009291111, k (the y intercept) is given by the (Intercept) Estimate = 12.051001492, and the r squared result is a simple value: r.squared. So you can use:

k  <- summary(lm.linear)$coefficients[1,1]  # first row, first column: (Intercept) Estimate
m  <- summary(lm.linear)$coefficients[2,1]  # second row, first column: Data$Estimated Estimate
r2 <- summary(lm.linear)$r.squared
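
Putting the pieces together, a minimal end-to-end sketch might look like the following. The data are synthetic and made up for illustration. The model is fitted with the data argument, so the slope row is named "Estimated" rather than "Data$Estimated". Indexing by name is an alternative to the positional indices above, not something the RCaller requires.

```r
# Synthetic stand-in data lying close to y = 2x + 1 (made up for illustration).
Data <- data.frame(
  Estimated = c(1, 2, 3, 4, 5),
  Actual    = c(3.1, 4.9, 7.2, 8.8, 11.0)
)

lm.linear <- lm(Actual ~ Estimated, data = Data)
fit <- summary(lm.linear)   # compute the summary once and reuse it

# Address the coefficient matrix by row and column name instead of position.
k  <- fit$coefficients["(Intercept)", "Estimate"]
m  <- fit$coefficients["Estimated", "Estimate"]
r2 <- fit$r.squared

# One row, so FME receives a single feature with attributes m, k and r2.
fmeOutput <- data.frame(m, k, r2)
print(fmeOutput)
```

Named indexing is a little more verbose, but it keeps working if the order of the model terms changes.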

That was easy!

The workspace rcallerlinearregression.fmwt illustrates the example described above.

One final tip: expose the result variables in the RCaller to make life easier in Workbench:

rcalleroutputattributes.png

 

Exporting R Results to a Temporary File

In some cases it may not be appropriate to use a data frame for your R results, e.g. for a large raster or an image. In this case you can export your R results to a temporary data file and have FME re-read those results. The article RCaller: Interpolate Points to Raster Through Kriging illustrates how you can do this.
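
The general pattern can be sketched as follows. This is not the method from the linked article: the CSV format, the file name pattern and the resultPath attribute are all assumptions made for illustration, and how FME re-reads the file (for example with a reader placed after the RCaller) is left to the workspace.

```r
# Stand-in result; in practice this would be a large grid or similar object.
result <- data.frame(x = c(0, 1, 2), y = c(0, 1, 2), value = c(1.5, 2.5, 3.5))

# Write the result to a temporary file (hypothetical name pattern and format).
outFile <- tempfile(pattern = "rcaller_result_", fileext = ".csv")
write.csv(result, outFile, row.names = FALSE)

# Hand only the file path back to FME as a one-row data frame.
# winslash = "/" keeps the path free of backslashes on Windows.
fmeOutput <- data.frame(
  resultPath = normalizePath(outFile, winslash = "/"),
  stringsAsFactors = FALSE
)
```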

 

Grouping with looping

For many statistical problems, you have a qualitative value (e.g. the codes ABC, ABD and TXU) that has some bearing on the quantitative values, so simple grouping makes a lot of sense.

For example, you might want to calculate the mean of each Code value:

Date	Code	Estimated	Actual
11/29/2016	TXU	46.14	59.5
11/28/2016	ABD	43.89	34.1
11/27/2016	TXU	42.15	25.8
11/27/2016	ABC	9.3	20.3
11/26/2016	ABD	42.15	50.6
11/25/2016	ABC	11.04	11.7
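
For a plain per-group mean, base R's aggregate() may be all that is needed. The sketch below types the sample table in by hand (leaving out the Date column) so it runs on its own. In the RCaller, the data frame would come from the input port instead.

```r
# The sample table above, typed in by hand so the snippet is self-contained.
Data <- data.frame(
  Code      = c("TXU", "ABD", "TXU", "ABC", "ABD", "ABC"),
  Estimated = c(46.14, 43.89, 42.15, 9.30, 42.15, 11.04),
  Actual    = c(59.5, 34.1, 25.8, 20.3, 50.6, 11.7)
)

# Mean of Estimated and Actual within each Code; the result has one row per Code.
fmeOutput <- aggregate(cbind(Estimated, Actual) ~ Code, data = Data, FUN = mean)
print(fmeOutput)
```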

For analyses that go beyond a simple mean, such as a regression per Code, you can put the analysis in a loop, subset the data by Code and then calculate the regression. Something like:

for (currentCode in unique(Data$Code)) {    # assuming the input data frame is 'Data'
   tmpData <- Data[Data$Code == currentCode, ]
   lm.linear <- lm(tmpData$Actual ~ tmpData$Estimated)  # linear model on y = mx+k
}

But you can't just use:

r2 <- summary(lm.linear)$r.squared

to return the result, since you'll just return the last r2 out of the three calculated values.

One approach is to build vectors for each result and then copy these result vectors to the fmeOutput data frame.

# initialize vectors to hold results
r2 <- c()
m <- c()
k <- c()
Code <- character() 
# y = Actual  x = Estimated 
for ( currentCode in unique(Data$Code)) {
  tmpData = Data[Data$Code == currentCode,]
  # linear regression for y = mx+k
  lm.linear = lm(tmpData$Actual ~ tmpData$Estimated)
  # linear model result vectors y = mx+k
  r2 = c(r2, summary(lm.linear)$r.squared)
  k = c(k, summary(lm.linear)$coefficients[1,1])
  m = c(m, summary(lm.linear)$coefficients[2,1])
  Code = c(Code, currentCode)
}

fmeOutput<-data.frame(Code, m, k, r2) 

It would be more efficient to assign the results directly to a data frame rather than building separate vectors first.
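
One possible way to build the data frame directly, using only base R, is to split the data by Code, fit each piece and stack the one-row results. The sample data and the helper name fitGroup are made up for illustration.

```r
# Made-up sample data (three rows per Code) so the snippet runs on its own.
Data <- data.frame(
  Code      = rep(c("ABC", "ABD", "TXU"), each = 3),
  Estimated = c(1, 2, 3, 10, 20, 30, 5, 6, 8),
  Actual    = c(2.1, 3.9, 6.2, 9.5, 21.0, 29.0, 4.0, 7.5, 7.9)
)

# Fit y = mx + k for one subset and return the results as a one-row data frame.
fitGroup <- function(tmpData) {
  fit <- summary(lm(Actual ~ Estimated, data = tmpData))
  data.frame(
    Code = tmpData$Code[1],
    m    = fit$coefficients["Estimated", "Estimate"],
    k    = fit$coefficients["(Intercept)", "Estimate"],
    r2   = fit$r.squared
  )
}

# Split by Code, fit each piece, then stack the one-row results.
fmeOutput <- do.call(rbind, lapply(split(Data, Data$Code), fitGroup))
rownames(fmeOutput) <- NULL
print(fmeOutput)
```

This avoids growing four vectors inside a loop, and the summary is computed once per group instead of three times.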

The workspace rcallerlinearregressionwithgroups.fmwt illustrates the example described above.

 

Debugging your R Script

If you are relatively new to R, I recommend that you first develop your script in the R Console and then transfer it to the RCaller. It's a lot easier to debug there; see the section Building an R Script above. You may encounter the following RCaller error:

ERROR |RCaller(InlineQueryFactory): InlineQueryFactory failed with exit code 1 when executing R script. Output was: Loading required package: gsubfn Loading required package: proto Loading required package: RSQLite

This seems to be a common error response if there is a syntax error in your script, or an undefined variable reference, so carefully check your script for unassigned variables or misspellings.
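
To narrow down this kind of problem in the R Console before returning to the RCaller, something like the sketch below may help. The script text and variable names are hypothetical, and the typo is deliberate. This is ordinary base R, not an RCaller feature.

```r
# Hypothetical script text with a deliberate typo: 'Dat' instead of 'Data'.
script <- 'meanAct <- mean(Dat$Actual)'
Data <- data.frame(Actual = c(59.5, 34.1, 25.8))

# parse() alone reports syntax errors without running anything.
syntaxOk <- !inherits(try(parse(text = script), silent = TRUE), "try-error")

# Evaluating inside tryCatch() turns a run-time failure into a readable message.
runMessage <- tryCatch({
  eval(parse(text = script))
  "ok"
}, error = function(e) conditionMessage(e))

print(syntaxOk)
print(runMessage)
```

Here the script parses cleanly, so the syntax check passes, and the message from tryCatch() points at the misspelled variable name.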

Remember, like FME, R is case sensitive.

 

Additional Resources

Here are some useful resources around using R in FME:

FME RCaller documentation

'R' tutorials: http://www.r-tutor.com/ and here

Extracting 'summary' information using summary(): example here

Appending to a data frame examples

Knowledge Center RCaller articles:

RCaller: Interpolate Points to Raster Through Kriging

RCaller: Is Tree Height and Tree Width Correlated?
