Introduction
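The Shapiro-Wilk test assesses whether a random sample of data is consistent with having been drawn from a normal distribution. The null hypothesis is that the data is normally distributed. When the p-value is less than or equal to 0.05 (a 5% significance level, corresponding to a 95% confidence level), you reject that hypothesis and conclude that the data does not follow a normal distribution. A larger p-value does not prove normality; it only means the test found no evidence against it.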
This tutorial will explain how to set up a reusable custom transformer to perform a statistical test using R or Python. If you create your own custom transformer using a different statistical test, we encourage you to publish it to FME Hub.
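For reference, the statistic behind the test can be written as follows. This is the standard textbook form of the Shapiro-Wilk statistic, given here as background; it is not taken from the FME, R, or SciPy documentation.

```latex
W = \frac{\left( \sum_{i=1}^{n} a_i \, x_{(i)} \right)^{2}}
         {\sum_{i=1}^{n} \left( x_i - \bar{x} \right)^{2}}
```

Here $x_{(i)}$ is the $i$-th smallest value in the sample, $\bar{x}$ is the sample mean, and the constants $a_i$ are derived from the expected values and covariances of the order statistics of a standard normal sample. Values of $W$ close to 1 are consistent with normality; the p-value reports how unusual the observed $W$ would be if the data really were normal.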
Requirements
If using R:
R Installed - How to install R
sqldf package installed
If using Python:
SciPy package installed
Step-by-Step Instructions
1. Add Data
In a blank workspace, read in a dataset to test. For this example, we are going to use cat.csv, a randomly generated CSV file containing 1000 rows with values between -1.7824 and 1.1977. The data is normally distributed. You could use any dataset as long as the value you are testing is numerical.
Add a CSV Reader to the canvas and browse to the cat.csv file; the default parameters are ok here. Click OK.
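If you do not have cat.csv to hand, a comparable test file can be generated. The sketch below is an assumption about how such a file might be produced, not how cat.csv was actually created: the file name sample_normal.csv, the column name X, the mean, the standard deviation, and the seed are all illustrative choices.

```python
import csv
import io
import random


def make_normal_rows(n=1000, mean=0.0, stdev=0.5, seed=42):
    """Return n pseudo-random values drawn from a normal distribution."""
    rng = random.Random(seed)  # fixed seed so the output is reproducible
    return [rng.gauss(mean, stdev) for _ in range(n)]


def write_csv(values, handle, column="X"):
    """Write one value per row under a single header column."""
    writer = csv.writer(handle)
    writer.writerow([column])
    for value in values:
        writer.writerow([f"{value:.4f}"])


values = make_normal_rows()

# An in-memory buffer is used here; replace it with
# open("sample_normal.csv", "w", newline="") to create a real file.
buffer = io.StringIO()
write_csv(values, buffer)
lines = buffer.getvalue().splitlines()
print(lines[0], len(lines) - 1)  # header name and number of data rows
```

Any numeric column would work equally well; the only requirement from the tutorial is that the attribute being tested is numerical.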
2. Create Custom Transformer
Right-click anywhere on the canvas and select Create Custom Transformer. Name the transformer ShapiroWilkCalculator-R or ShapiroWilkCalculator-Py. You can enter description details if you wish.
3. Create Input Parameter
We will need to use an attribute from our input data source throughout the custom transformer, so let’s make a published parameter to do that easily. In the custom transformer tab, in the Navigator pane, create a new user parameter (right-click User Parameters > Manage User Parameters...). Then click the green plus sign (Insert) and set the following:
Type | Attribute Name |
Parameter Identifier | inputData |
Prompt | Attribute to Test |
Published | Checked |
Required | Checked |
Disable Attribute Assignment | Checked |
4. Create Attribute
To be able to reuse this custom transformer easily, we need an attribute whose name stays the same regardless of what the attribute being evaluated is called. Add an AttributeManager transformer and connect it to the Input port inside the custom transformer. Name the new attribute shapiro.x and set its Attribute Value to:
@real64(@Value($(inputData)))
We’ve enclosed the parameter value in @real64() to ensure that our values are the float data type which is required for our statistical calculation.
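To illustrate why the cast matters: values read from a text format such as CSV may arrive as strings, while a statistical routine needs numbers. The Python sketch below is only an analogy for what the cast achieves, not FME's implementation of @real64(); the helper name to_float and the sample values are hypothetical.

```python
def to_float(value):
    """Convert a raw attribute value to float, or return None if it cannot be converted."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return None


raw = ["0.25", "-1.7824", 1, "n/a", None]
converted = [to_float(v) for v in raw]
print(converted)  # [0.25, -1.7824, 1.0, None, None]
```

Text that looks like a number and integers both become floats, while missing or non-numeric values are flagged rather than silently passed on to the test.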
If using R, follow steps 5-6; if using Python, follow steps 7-9.
RCaller
Before you continue with the RCaller, please ensure that you have R installed on your computer and the R Package sqldf. See the RCaller documentation for instructions on how to do this.
5. Set up the Shapiro-Wilk Test using R
Now that we have cleaned up the data and created a constant attribute, we can set up the RCaller transformer to perform the Shapiro-Wilk Test.
Add an RCaller transformer to the canvas and connect it to the AttributeManager. In the parameters, change the Input Table name to R, and for Query Columns, change the Type for shapiro.x to float.
Paste the following code:
shapiro <- shapiro.test(R$shapiro.x)
fmeOutput <- data.frame(shapiro$statistic, shapiro$p.value)
The first line creates an object named shapiro by running the function shapiro.test (the Shapiro-Wilk test) on the shapiro.x column of the R table. The function returns a list object, so shapiro is a list.
The second line builds a data frame from the statistic and p.value elements of shapiro and assigns it to fmeOutput. R uses the $ character to access elements of objects; here it accesses elements of a list. The resulting columns are named shapiro.statistic and shapiro.p.value.
The last parameter to set in the RCaller is Attributes to Expose. Click the ellipsis and add shapiro.statistic and shapiro.p.value as attributes to expose. Leave the Type as default. This allows these attributes to be used in the FME workspace after the RCaller.
Click OK twice to accept these parameters.
6. Finish Custom Transformer
To finish off the custom transformer, connect the RCaller Output port to the Output port for the Custom Transformer. Then continue with step 10.
PythonCaller
Before continuing, please ensure that you have SciPy installed for the same version of Python that FME is using (for this example, scipy 1.11.2 with FME 2024.x).
7. Keep shapiro.x
We only need the shapiro.x attribute, so add an AttributeKeeper transformer to the canvas and connect it to the AttributeManager. This removes all attributes from the schema except the one we are interested in. In the parameters, select shapiro.x as the Attribute to Keep.
8. Set up the Shapiro-Wilk Test using Python
Now that we have cleaned up the data and created a constant attribute, we can set up the PythonCaller transformer to perform the Shapiro-Wilk Test.
Add a PythonCaller transformer to the canvas and connect it to the AttributeKeeper. In the parameters, paste the following code:
import fme
from fme import BaseTransformer
import fmeobjects
import scipy.stats


class FeatureProcessor(BaseTransformer):
    """Template Class Interface:
    When using this class, make sure its name is set as the value of the 'Class to Process Features'
    transformer parameter.

    This class inherits from 'fme.BaseTransformer'. For full details of this class, its methods, and
    expected usage, see https://docs.safe.com/fme/html/fmepython/api/fme.html#fme.BaseTransformer.
    """

    def __init__(self):
        """Base constructor for class members."""
        self.x = []

    def has_support_for(self, support_type: int):
        """This method is called by FME to determine if the PythonCaller supports Bulk mode,
        which allows for significant performance gains when processing large numbers of features.
        """
        return support_type == fmeobjects.FME_SUPPORT_FEATURE_TABLE_SHIM

    def input(self, feature: fmeobjects.FMEFeature):
        """This method is called for each feature which enters the PythonCaller."""
        self.x.append(feature.getAttribute('shapiro.x'))

    def close(self):
        """This method is called once all the FME Features have been processed from input()."""
        results = scipy.stats.shapiro(self.x)
        feature = fmeobjects.FMEFeature()
        feature.setAttribute('shapiro.result', results[0])
        feature.setAttribute('shapiro.pvalue', results[1])
        self.pyoutput(feature)

    def process_group(self):
        """This method is called by FME for each group when group processing mode is enabled."""
        pass

    def reject_feature(self, feature: fmeobjects.FMEFeature, code: str, message: str):
        """This method can be used to output a feature to the <Rejected> port."""
        feature.setAttribute("fme_rejection_code", code)
        feature.setAttribute("fme_rejection_message", message)
        self.pyoutput(feature, output_tag="<Rejected>")
The first four lines import the required packages. When you first open the PythonCaller, import fme, from fme import BaseTransformer, and import fmeobjects are already present, so you only need to add import scipy.stats to use the Shapiro-Wilk test.
In the input() method, self.x.append(feature.getAttribute('shapiro.x')) appends each incoming feature's shapiro.x value to the list self.x.
In the close() method, results = scipy.stats.shapiro(self.x) calls the shapiro function from the scipy.stats package on the collected values. It returns the test statistic followed by the p-value.
Finally, the last four lines of close() create a new feature, set the shapiro.result and shapiro.pvalue attributes on it, and output it for use within FME.
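The test is run in close() rather than input() because Shapiro-Wilk needs the whole sample at once. The plain-Python sketch below mimics that collect-then-compute pattern without FME or SciPy. The class name CollectThenSummarise is hypothetical, statistics.mean stands in for scipy.stats.shapiro, and the skipping of missing values is an added assumption that is not part of the transformer code.

```python
import statistics


class CollectThenSummarise:
    """Stand-in for the PythonCaller pattern: gather in input(), compute in close()."""

    def __init__(self):
        self.values = []
        self.skipped = 0

    def input(self, value):
        # Called once per record: only store the value, do not compute yet.
        if value is None:
            self.skipped += 1  # missing values are counted, not stored
            return
        self.values.append(float(value))

    def close(self):
        # Called once at the end, when the whole sample is available.
        return {
            "count": len(self.values),
            "skipped": self.skipped,
            "mean": statistics.mean(self.values),
        }


collector = CollectThenSummarise()
for raw in [1.0, 2.0, None, 3.0]:
    collector.input(raw)
summary = collector.close()
print(summary)  # {'count': 3, 'skipped': 1, 'mean': 2.0}
```

The same shape applies to any statistic that depends on the full dataset: per-feature work stays cheap, and the single result is emitted once at the end.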
Before closing the PythonCaller, click the ellipsis next to Attributes to Expose and add shapiro.result and shapiro.pvalue. For Attributes to Hide select shapiro.x.
Click OK twice to close the PythonCaller.
9. Finish Custom Transformer
To finish off the custom transformer, connect the PythonCaller Output port to the Output port for the Custom Transformer.
Both R and Python
10. Run translation
Switch back to the Main tab and add an Inspector to the Output port of the ShapiroWilkCalculator-R or ShapiroWilkCalculator-Py transformer. Open the parameters for the custom transformer and set the Attribute to Test to X, then run the translation.
The final results:
R:
Python:
11. Interpretation
If the p-value is less than or equal to the significance level (in this case 0.05, corresponding to a 95% confidence level), the null hypothesis that the data is normally distributed can be rejected. Put in plain language, if the p-value is 0.05 or less, we conclude that the data is not normally distributed. For our data, the p-values reported by R and by Python differ only slightly, and both are greater than 0.05, so we fail to reject the null hypothesis: the data is consistent with a normal distribution.
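The decision rule can be written as a small helper. This is a minimal sketch: the function name interpret_shapiro, the wording of the returned messages, and the two example p-values are illustrative and are not results from this tutorial.

```python
def interpret_shapiro(p_value, alpha=0.05):
    """Turn a Shapiro-Wilk p-value into a plain-language conclusion."""
    if not 0.0 <= p_value <= 1.0:
        raise ValueError("a p-value must lie between 0 and 1")
    if p_value <= alpha:
        return "reject normality"
    return "no evidence against normality"


print(interpret_shapiro(0.30))  # no evidence against normality
print(interpret_shapiro(0.01))  # reject normality
```

The range check guards against a misread or mistyped value, since a probability outside 0 to 1 cannot be a valid p-value.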
Additional Resources
RCaller: Ins and outs of using R in FME
Tutorial: Python and FME Basics
Shapiro-Wilk Test R Documentation