As of FME 2022.0, Python 2.7 has been deprecated and is no longer available within FME. Please see the Python 2.7 Deprecation article. This article has not yet been updated to use Python 3+; to follow it as written, please use FME 2020 or older.
Introduction
The Shapiro-Wilk test assesses whether a random sample of data comes from a normal distribution. When the p-value is less than or equal to 0.05 (a 5% significance level, often described as a 95% confidence level), the null hypothesis of normality is rejected and the data is treated as not normal. A p-value above 0.05 does not prove normality; it only means the test found no significant departure from it.
This tutorial will go into the details of how to set up a reusable custom transformer to perform a statistical test using R or Python. If you create your own custom transformer using a different statistical test, we encourage you to publish it to the FME Hub.
Requirements
If using R:
R Installed - How to install R instructions
sqldf package installed
If using Python:
Scipy package installed
Step-by-step Instructions
1. Add Data
In a blank workspace read in a dataset to test. For this example, we are going to use cat.csv which is just a randomly generated CSV file containing 1000 rows with values between -1.7824 and 1.1977. The data is normally distributed. You could use any dataset as long as the value you are testing is numerical.
Add a CSV Reader to the canvas and browse to the cat.csv file; the default parameters are fine.
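If you do not have cat.csv to hand, a comparable test file can be produced with a few lines of Python. This is only a sketch: the function name write_normal_csv, the column name X (taken from step 10), and the mean, standard deviation and seed are all assumptions, not the settings used to create the original file.

```python
import csv
import io
import random


def write_normal_csv(stream, rows=1000, column="X", mean=0.0, stdev=0.5, seed=42):
    """Write `rows` normally distributed values under one header column."""
    rng = random.Random(seed)  # fixed seed so the file is reproducible
    writer = csv.writer(stream)
    writer.writerow([column])
    for _ in range(rows):
        writer.writerow([round(rng.gauss(mean, stdev), 4)])


# Write to an in-memory buffer here; swap in a real file to use it in FME.
buffer = io.StringIO()
write_normal_csv(buffer)
lines = buffer.getvalue().splitlines()
print(len(lines))  # header plus 1000 data rows
```

To create a file on disk instead, open it with open("cat.csv", "w", newline="") and pass the file object as stream.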
2. Create Custom Transformer
Right-click anywhere on the canvas and select Create Custom Transformer. Name the transformer ShapiroWilkCalculator-R or ShapiroWilkCalculator-Py. You can enter description details if you wish.
3. Create Input Parameter
We will need to use an attribute from our input data source throughout the custom transformer so let’s make a published parameter to do that easily. In the custom transformer tab, create a new published parameter. Then set the following:
Type | Attribute Name |
Name | input_data |
Prompt | Attribute to Test: |
Published | Yes |
Optional | No |
Attribute Assignment | Off |
input_data Published Parameter setup
4. Create Attribute
To be able to reuse this custom transformer easily, we need an attribute whose name stays the same regardless of what the attribute being evaluated is called. Add an AttributeManager transformer and connect it to the Input port inside the custom transformer. Add a New Attribute named shapiro.x and set its Attribute Value to:
@real64(@Value($(input_data)))
We’ve enclosed the parameter value in @real64() to ensure that our values are the float data type which is required for our statistical calculation.
AttributeManager parameters to create the constant attribute shapiro.x
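To see why the float conversion matters, here is a rough Python analogue of the same coercion. It is not how FME implements @real64(); the helper name coerce_to_floats and the sample values are hypothetical. Values read from a CSV typically arrive as text, so anything that cannot be converted is set aside rather than passed to the statistical test.

```python
def coerce_to_floats(raw_values):
    """Split raw values into converted floats and values that failed to convert."""
    values, rejected = [], []
    for raw in raw_values:
        try:
            values.append(float(raw))
        except (TypeError, ValueError):
            # None raises TypeError, non-numeric text raises ValueError
            rejected.append(raw)
    return values, rejected


values, rejected = coerce_to_floats(["0.25", "-1.7824", 1, None, "n/a"])
print(values)    # [0.25, -1.7824, 1.0]
print(rejected)  # [None, 'n/a']
```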
If using R, follow steps 5-6; if using Python, follow steps 7-9.
RCaller
Before continuing with the RCaller, please ensure that you have R installed on your computer, as well as the R package sqldf. See the RCaller documentation for instructions on how to do this.
5. Set up Shapiro-Wilk Test using R
Now that we have cleaned up the data and created a constant attribute we can set up the RCaller transformer to perform the Shapiro-Wilk Test.
Add an RCaller transformer to the canvas and connect it to the AttributeManager. In the parameters, change the Input Table name to R, then under Columns change the Type for shapiro.x to float. Click OK to accept the parameters. You will need to reconnect the RCaller to the AttributeManager after the table name has changed.
Open the RCaller parameters again and paste the following code:
shapiro <- shapiro.test(R$shapiro.x)
fmeOutput <- data.frame(shapiro$statistic, shapiro$p.value)
The first line is creating an object named shapiro and is performing the function shapiro.test (which is the Shapiro-Wilk Test) on the R table and shapiro.x column. This function results in a list object, so shapiro becomes a list.
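Line two builds a data frame from the statistic and p.value elements of shapiro and assigns it to fmeOutput, which is the object the RCaller sends back to FME. R uses the $ character to access elements of objects; in this case, it is accessing the elements of a list. R names the resulting columns shapiro.statistic and shapiro.p.value.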
The last parameter to set in the RCaller is Attributes to Expose. Click the ellipsis and add shapiro.statistic and shapiro.p.value as attributes to expose. This allows these attributes to be used in the FME workspace after the RCaller.
RCaller parameters for the Shapiro-Wilk Test
6. Finish Custom Transformer
To finish off the custom transformer, connect the RCaller Output port to the Output port for the custom transformer. Then continue with step 10.
ShapiroWilk-R custom transformer workspace
PythonCaller
Before continuing please ensure that you have Scipy installed using the same version of Python that you are using in FME.
7. Keep shapiro.x
We only need the shapiro.x attribute, so add an AttributeKeeper transformer to the canvas and connect it to the AttributeManager. This will remove all the attributes from the schema except the one we are interested in. In the parameters, select shapiro.x as the Attribute to Keep.
8. Set up the Shapiro-Wilk Test using Python
Now that we have cleaned up the data and created a constant attribute we can set up the PythonCaller transformer to perform the Shapiro-Wilk Test.
Add a PythonCaller transformer to the canvas and connect it to the AttributeKeeper. In the parameters, paste the following code:
import fme
import fmeobjects
import scipy.stats

class FeatureProcessor(object):
    def __init__(self):
        # Collects the shapiro.x value from every incoming feature
        self.x = []

    def input(self, feature):
        self.x.append(float(feature.getAttribute('shapiro.x')))

    def close(self):
        # Run the test once, after all features have been read
        results = scipy.stats.shapiro(self.x)
        feature = fmeobjects.FMEFeature()
        feature.setAttribute('shapiro.result', results[0])
        feature.setAttribute('shapiro.pvalue', results[1])
        self.pyoutput(feature)
The first three lines import the required packages. import fme and import fmeobjects will already be in the PythonCaller when you first open it, so you only need to add import scipy.stats to be able to use the Shapiro-Wilk test.
self.x.append(float(feature.getAttribute('shapiro.x'))) reads the shapiro.x attribute from each incoming feature, converts it to a float, and adds it to the list of values to test.
results = scipy.stats.shapiro(self.x) calls the shapiro function from the scipy.stats package. It returns the test statistic followed by the p-value.
Finally, the last four lines create a new feature, store the statistic and p-value on it as attributes, and output it so the results can be used within FME.
Before closing the PythonCaller, click the ellipsis next to Attributes to Expose and add shapiro.result and shapiro.pvalue, then for Attributes to Hide select shapiro.x.
Click OK to close the PythonCaller.
PythonCaller parameters for the Shapiro-Wilk Test
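The collect-then-emit flow of the FeatureProcessor above can be tried outside FME with stand-in classes. The sketch below is an illustration only: StubFeature, StubProcessor and fake_test are hypothetical names that are not part of FME or SciPy, and fake_test returns fixed placeholder numbers so the example runs without SciPy installed.

```python
class StubFeature:
    """Hypothetical stand-in for fmeobjects.FMEFeature (not part of FME)."""

    def __init__(self, attributes=None):
        self._attributes = dict(attributes or {})

    def getAttribute(self, name):
        return self._attributes.get(name)

    def setAttribute(self, name, value):
        self._attributes[name] = value


class StubProcessor:
    """Mirrors the input()/close() flow of the PythonCaller class."""

    def __init__(self, test_func):
        self.test_func = test_func  # function returning (statistic, p-value)
        self.x = []
        self.emitted = []           # stands in for self.pyoutput()

    def input(self, feature):
        # Called once per feature: collect the value as a float
        self.x.append(float(feature.getAttribute('shapiro.x')))

    def close(self):
        # Called once at the end: run the test and emit a single feature
        statistic, p_value = self.test_func(self.x)
        out = StubFeature()
        out.setAttribute('shapiro.result', statistic)
        out.setAttribute('shapiro.pvalue', p_value)
        self.emitted.append(out)


def fake_test(values):
    # Placeholder numbers, NOT a real Shapiro-Wilk calculation
    return 0.99, 0.5


processor = StubProcessor(fake_test)
for raw in ["0.1", "-0.4", "0.7"]:
    processor.input(StubFeature({'shapiro.x': raw}))
processor.close()
result = processor.emitted[0]
print(result.getAttribute('shapiro.pvalue'))  # 0.5
```

With SciPy installed, passing scipy.stats.shapiro in place of fake_test should give real results, since its return value unpacks into the statistic and the p-value.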
9. Finish Custom Transformer
To finish off the custom transformer, connect the PythonCaller Output port to the Output port for the custom transformer.
ShapiroWilk-Py custom transformer workspace
Both R and Python
10. Run translation
Switch back to the Main tab and add an Inspector to the Output port of your custom transformer. Open the parameters for the custom transformer and set Attribute to Test to X, then run the translation.
The final results:
R:
Python:
11. Interpretation
If the p-value is less than or equal to the significance level (in this case 0.05, corresponding to a 95% confidence level), the null hypothesis that the data is normally distributed can be rejected. Put in plain language, if the p-value is at or below 0.05, we treat the data as not normally distributed. For our data, the p-value reported in the results above is greater than 0.05, so we do not reject the null hypothesis and can treat the data as normally distributed.
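The decision rule can be written as a small helper. This is a minimal sketch: the function name interpret_shapiro and the sample p-value are hypothetical, and the 0.05 default simply follows the significance level used in this article.

```python
def interpret_shapiro(p_value, alpha=0.05):
    """Turn a Shapiro-Wilk p-value into a plain-language decision."""
    if not 0.0 <= p_value <= 1.0:
        # A p-value is a probability, so anything outside 0..1 is an error
        raise ValueError("a p-value must lie between 0 and 1")
    if p_value <= alpha:
        return "reject normality"
    return "fail to reject normality"


print(interpret_shapiro(0.2))  # fail to reject normality
```

The range check is a deliberate choice: a value outside 0 to 1 usually means the wrong attribute was read or a number was mistyped, and it is better to stop than to report a conclusion.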
Additional Resources
RCaller: Ins and outs of using R in FME
Tutorial: Python and FME Basics
Shapiro-Wilk Test R Documentation