RCaller: Ins and outs of using R in FME

Files

rcallerlinearregressionwithgroups.fmwt (20 KB)
rcallerlinearregression.fmwt (10 KB)

Introduction

If you need to perform statistics beyond what is available in the StatisticsCalculator transformer, the RCaller transformer enables much more advanced statistical analysis in FME. The RCaller lets you run R scripts in FME.

Getting Started

Before you can use the RCaller, you need to install the R interpreter and the R packages the RCaller depends on. See the section on Installing the R Interpreter in the FME User Documentation.

Before you get going with the RCaller, there are a couple of useful things to remember:

R doesn't like UNC path names, so you can't run an FME workspace stored on a UNC path, e.g., \\myprojects\fmeWorkspaces. You have to run your FME workspaces from a mapped drive, e.g., f:\myprojects\fmeWorkspaces.
Do a bit of reading about R, if you're not already familiar with the concepts. This is a pretty good tutorial. More resources are listed at the end of this article.

Source Data

FME adds new ports to the RCaller as you connect transformers or feature types to the Connect Input port. Each new input port inherits its name from the source object (i.e., the transformer name or the feature type name).

The port names are used as the data frame names in R, so rename the ports to something you'll be able to use in your R scripts.

FME loads your data into a temporary SQLite database, so for both performance and clarity, only select the attributes you're going to use in your R scripts. Make sure the data types are correct.

FME transfers the data into R as an R data frame. You can access the data frame or its columns in your R script by dragging items from the Data Frames menu.

For example, to access the vector of estimated values, drag the Data - Estimated item into the script window, and Data$Estimated appears in your R script.
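As a minimal sketch of what the script sees, the snippet below builds a stand-in data frame named Data and checks the column types before using a column. The values are made up for illustration; in the RCaller, FME creates the data frame from the input port.

```r
# Stand-in for the data frame FME would create from an input port named "Data".
# The values are hypothetical and only for illustration.
Data <- data.frame(
  Estimated = c(46.14, 43.89, 42.15),
  Actual    = c(59.5, 34.1, 25.8)
)

# Check that each column arrived with the type you expect.
colTypes <- sapply(Data, class)
print(colTypes)

# Data$Estimated is a plain numeric vector, one value per input feature.
estimated <- Data$Estimated
print(length(estimated))
```

If a numeric attribute shows up as "character" here, fix the data type in FME before it reaches the RCaller.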

Building an R Script

This is not an R tutorial. To learn more about R, please see the resources section at the end of this article. If you're new to R, I'd recommend that you use the R Console to develop and debug your scripts: you'll get better feedback, and it's a little easier to see the intermediate results. Build your script incrementally in the R Console so it's clear where any issues arise, then copy and paste the finished script into the RCaller. You can load a sample of your data using the R readers, e.g., Data = read.csv("D:/tmp/SampleData.csv")

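R uses UNIX-style paths, i.e., '/', not '\'. A backslash is an escape character in R strings, so a Windows path must be written with forward slashes or with doubled backslashes ('\\').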
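The following sketch shows one way to try this in the R Console. It writes a small, made-up sample to a temporary CSV file (a stand-in for D:/tmp/SampleData.csv), converts any backslashes in the path to forward slashes, and reads the file back.

```r
# Write a small, made-up sample to a temporary CSV file.
sampleData <- data.frame(Estimated = c(9.3, 11.04), Actual = c(20.3, 11.7))
csvPath <- tempfile(fileext = ".csv")
write.csv(sampleData, csvPath, row.names = FALSE)

# Convert any backslashes to forward slashes before using the path in R.
# The pattern "\\\\" is a regular expression matching one literal backslash.
csvPath <- gsub("\\\\", "/", csvPath)

# Load the sample the same way you would load your own test file.
Data <- read.csv(csvPath)
print(Data)
```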

Getting the Results out of R...

... can be tricky! RCaller returns data to FME via a data frame named 'fmeOutput'. Each row in the data frame will become a separate output feature in FME. If you know how to build and append to R data frames, you can probably skip this section.

To populate the fmeOutput data frame, you can simply pass back a list of values (vectors of length one), e.g.:

> Data = read.csv("D:/tmp/SampleData.csv") 
> meanAct = mean(Data$Actual) 
> meanEst = mean(Data$Estimated) 
> fmeOutput = data.frame(meanEst, meanAct)

This will result in a single FME feature carrying the two mean values in FME attributes named meanEst and meanAct.
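To check what the RCaller will hand back, you can inspect the data frame in the R Console. This is a minimal sketch using made-up values in place of the CSV file.

```r
# Made-up stand-in for the data frame FME passes in.
Data <- data.frame(Estimated = c(10, 20, 30), Actual = c(12, 18, 33))

meanAct <- mean(Data$Actual)
meanEst <- mean(Data$Estimated)

# Each row of fmeOutput becomes one output feature;
# the column names become the attribute names.
fmeOutput <- data.frame(meanEst, meanAct)
print(names(fmeOutput))
print(nrow(fmeOutput))
```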

But many R functions return more complex results. For example, a linear regression function solving for y=mx+k:

lm.linear <- lm(Data$Actual ~ Data$Estimated)

Use the R summary() function to see the results:

> summary(lm.linear)
Call:
lm(formula = Data$Actual ~ Data$Estimated)
Residuals:
    Min      1Q  Median      3Q     Max 
-9.9667 -2.1022  0.2679  2.3813  8.3354 
Coefficients:
                Estimate Std. Error t value Pr(>|t|)
(Intercept)    12.051001   9.149612   1.317    0.211
Data$Estimated -0.009291   0.861531  -0.011    0.992
Residual standard error: 5.045 on 13 degrees of freedom
Multiple R-squared:  8.946e-06, Adjusted R-squared:  -0.07691 
F-statistic: 0.0001163 on 1 and 13 DF,  p-value: 0.9916

How to get that back into FME?

The names() function returns the names of the components in the summary, e.g.:

> names(summary(lm.linear))
[1] "call"          "terms"         "residuals"     "coefficients"  "aliased"      
[6] "sigma"         "df"            "r.squared"     "adj.r.squared" "fstatistic"  
[11] "cov.unscaled"

But some of these are more complex objects in their own right, like the "coefficients":

> summary(lm.linear)$coefficients
                   Estimate Std. Error     t value  Pr(>|t|)
(Intercept)    12.051001492  9.1496116  1.31710525 0.2105480
Data$Estimated -0.009291111  0.8615308 -0.01078442 0.9915592

So what to do if you want to return to FME the common characteristics of the y=mx+k analysis such as the r squared value, m & k?

You have to pick out the values you need and pass them to the fmeOutput data frame. In the above example, m is given by the Data$Estimated Estimate = -0.009291111, and k (y intercept) is given by (Intercept) Estimate = 12.051001492, and the r-squared result is a simple value: r.squared. So you can use:

k  <- summary(lm.linear)$coefficients[1,1]  # row 1 (Intercept), column 1 (Estimate)
m  <- summary(lm.linear)$coefficients[2,1]  # row 2 (Data$Estimated), column 1 (Estimate)
r2 <- summary(lm.linear)$r.squared
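Indexing by position is easy to get wrong, so here is a hedged alternative that looks the values up by row and column name and then builds fmeOutput. It is a sketch on made-up data, not necessarily how the example workspace does it. Note that it passes data = Data to lm(), so the slope row is named "Estimated" rather than "Data$Estimated".

```r
# Made-up stand-in data; in the RCaller, Data comes from the input port.
Data <- data.frame(
  Estimated = c(1, 2, 3, 4, 5),
  Actual    = c(2.1, 3.9, 6.2, 7.8, 10.1)
)

# Using data = Data gives the slope row the plain name "Estimated".
lm.linear <- lm(Actual ~ Estimated, data = Data)
coefs <- summary(lm.linear)$coefficients

k  <- coefs["(Intercept)", "Estimate"]  # y intercept
m  <- coefs["Estimated", "Estimate"]    # slope
r2 <- summary(lm.linear)$r.squared

# A single-row data frame: one output feature with attributes m, k and r2.
fmeOutput <- data.frame(m, k, r2)
print(fmeOutput)
```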

The workspace rcallerlinearregression.fmwt illustrates the example described above.

One final tip: expose the result variables as attributes in the RCaller to make life easier in FME Workbench.

Exporting R Results to a Temporary File

In some cases, it may not be appropriate to use a data frame for your R results, e.g., for a large raster or an image. In this case, you can export your R results to a temporary data file and have FME re-read those results. The article "RCaller: Interpolate Points to Raster Through Kriging" illustrates how to do this.
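The R side of that pattern might look like the sketch below. The file name, the output location, and the attribute name resultPath are all assumptions made for illustration, and the FME side (re-reading the file) is not shown here. The file is written to the parent of R's session temporary directory because the session directory itself is deleted when R exits.

```r
# Made-up stand-in for a result that is too large or awkward for a data frame.
bigResult <- matrix(runif(100), nrow = 10)

# Hypothetical output location: the parent of R's per-session temp directory.
# Backslashes are converted so the path uses forward slashes throughout.
outDir <- gsub("\\\\", "/", dirname(tempdir()))
resultPath <- file.path(outDir, "rcaller_result.csv")
write.csv(bigResult, resultPath, row.names = FALSE)

# Return only the file location; "resultPath" is a hypothetical attribute name.
fmeOutput <- data.frame(resultPath, stringsAsFactors = FALSE)
print(fmeOutput)
```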

Grouping with looping

For many statistical problems, you have a qualitative value (e.g., the codes ABC, ABD, and TXU) that has some bearing on the quantitative values, so simple grouping makes a lot of sense.

For example, you might want to calculate the mean for each Code value:

Date	Code	Estimated	Actual
11/29/2016	TXU	46.14	59.5
11/28/2016	ABD	43.89	34.1
11/27/2016	TXU	42.15	25.8
11/27/2016	ABC	9.3	20.3
11/26/2016	ABD	42.15	50.6
11/25/2016	ABC	11.04	11.7
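For a simple statistic such as the mean, base R can do the grouping in a single call. The sketch below enters the Code, Estimated, and Actual columns of the sample table as a data frame and uses aggregate() to produce one row per Code.

```r
# The sample table above, entered as a data frame (Date column omitted).
Data <- data.frame(
  Code      = c("TXU", "ABD", "TXU", "ABC", "ABD", "ABC"),
  Estimated = c(46.14, 43.89, 42.15, 9.3, 42.15, 11.04),
  Actual    = c(59.5, 34.1, 25.8, 20.3, 50.6, 11.7)
)

# Mean of Estimated and Actual for each Code; one row (output feature) per Code.
fmeOutput <- aggregate(cbind(Estimated, Actual) ~ Code, data = Data, FUN = mean)
print(fmeOutput)
```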

You can put your analysis in a loop, subset the data by Code, and then calculate the regression. Something like:

for (currentCode in unique(Data$Code)) {    # assuming the input data frame is 'Data'
   tmpData <- Data[Data$Code == currentCode, ]
   lm.linear <- lm(tmpData$Actual ~ tmpData$Estimated)  # linear model on y = mx+k
}

But you can't just use r2 <- summary(lm.linear)$r.squared after the loop to return the result, since you'd only return the last r2 out of the three calculated values.

One approach is to build vectors for each result and then copy them into the fmeOutput data frame.

# initialize vectors to hold one result per Code
r2 <- numeric()
m <- numeric()
k <- numeric()
Code <- character()
# y = Actual, x = Estimated
for (currentCode in unique(Data$Code)) {
  tmpData <- Data[Data$Code == currentCode, ]
  # linear regression for y = mx+k
  lm.linear <- lm(tmpData$Actual ~ tmpData$Estimated)
  lm.summary <- summary(lm.linear)  # compute the summary once per group
  # append this group's results to the result vectors
  r2 <- c(r2, lm.summary$r.squared)
  k <- c(k, lm.summary$coefficients[1,1])
  m <- c(m, lm.summary$coefficients[2,1])
  Code <- c(Code, currentCode)
}

# one output feature per Code
fmeOutput <- data.frame(Code, m, k, r2)

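Alternatively, you can build the data frame directly, for example by creating one row per group and stacking the rows. This avoids growing several vectors one element at a time and keeps each group's results together.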
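One way to do that is sketched below using split(), lapply(), and rbind() from base R. The data is made up, and fitGroup is a hypothetical helper name introduced only for this example.

```r
# Made-up stand-in data with four points per Code.
Data <- data.frame(
  Code      = rep(c("ABC", "ABD"), each = 4),
  Estimated = c(1, 2, 3, 4, 1, 2, 3, 4),
  Actual    = c(3.1, 4.9, 7.2, 8.8, 10.2, 7.9, 6.1, 3.8)
)

# Fit one model for a single group and return its results as a one-row data frame.
fitGroup <- function(tmpData) {
  lm.linear <- lm(Actual ~ Estimated, data = tmpData)
  data.frame(
    Code = as.character(tmpData$Code[1]),
    m    = coef(lm.linear)[["Estimated"]],
    k    = coef(lm.linear)[["(Intercept)"]],
    r2   = summary(lm.linear)$r.squared
  )
}

# split() makes one data frame per Code; rbind stacks the per-group rows.
fmeOutput <- do.call(rbind, lapply(split(Data, Data$Code), fitGroup))
rownames(fmeOutput) <- NULL
print(fmeOutput)
```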

The workspace rcallerlinearregressionwithgroups.fmwt illustrates the example described above.

Debugging your R Script

If you are relatively new to R, I would recommend developing your script in the R Console first and then transferring it to the RCaller. It's a lot easier to debug there; see the section "Building an R Script" above. You may encounter the following RCaller error:

ERROR |RCaller(InlineQueryFactory): InlineQueryFactory failed with exit code 1 when executing R script. Output was: Loading required package: gsubfn Loading required package: proto Loading required package: RSQLite

This seems to be a common error response if there is a syntax error in your script, or an undefined variable reference, so carefully check your script for unassigned variables or misspellings.

Remember, like FME, R is case sensitive.
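Two quick checks you can run in the R Console before pasting a script into the RCaller are sketched below: parse() tests the syntax without running the code, and a name lookup shows the effect of case sensitivity. The script fragment and data are made up for illustration.

```r
# A script fragment with a missing closing parenthesis (deliberate error).
script <- "meanAct = mean(Data$Actual"

# parse() checks the syntax without executing the script.
syntaxCheck <- tryCatch(
  {
    parse(text = script)
    "syntax ok"
  },
  error = function(e) "syntax error"
)
print(syntaxCheck)

# Names are case sensitive: "Actual" is a column, "actual" is not.
Data <- data.frame(Actual = c(1, 2, 3))
print("Actual" %in% names(Data))
print("actual" %in% names(Data))
```

Be aware that Data$actual does not raise an error for a misspelled column; it quietly returns NULL, so the problem only surfaces later in the script.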