Perform a Shapiro-Wilk Statistical Test using R or Python

Liz Sanderson

FME Version

As of FME 2022.0, Python 2.7 has been deprecated and is no longer available within FME. Please see the Python 2.7 Deprecation article. This article has not yet been updated to use Python 3+. To follow it as written, please use FME 2020 or older.

Introduction

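The Shapiro-Wilk test assesses whether a random sample of data comes from a normal distribution. Its null hypothesis is that the data is normally distributed. When the p-value is less than or equal to 0.05 (a significance level of 0.05, corresponding to a 95% confidence level), the null hypothesis is rejected and the data is treated as not normal. A p-value above 0.05 means the test found no significant evidence against normality; it does not prove that the data is normal.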

This tutorial will go into the details of how to set up a reusable custom transformer to perform a statistical test using R or Python. If you create your own custom transformer using a different statistical test, we encourage you to publish it to the FME Hub.

 

Requirements

If using R:

R Installed - How to install R instructions

sqldf package installed

 

If using Python:

SciPy package installed

 

Step-by-step Instructions

1. Add Data

In a blank workspace, read in a dataset to test. For this example, we are going to use cat.csv, a randomly generated CSV file containing 1000 rows with values between -1.7824 and 1.1977. The data is normally distributed. You could use any dataset as long as the value you are testing is numerical.

Add a CSV Reader to the canvas and browse to the cat.csv file. The default parameters are fine.
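If you do not have cat.csv at hand, a comparable test file can be generated. The following is only a sketch of a stand-in, not the file used in this article: the column name X is inferred from step 10, and the mean, standard deviation, and seed are assumptions chosen for illustration.

```python
import csv
import io
import random


def make_sample(n=1000, mean=0.0, sd=0.5, seed=42):
    """Return n pseudo-random draws from a normal distribution.

    mean, sd and seed are illustrative assumptions, not values from cat.csv.
    """
    rng = random.Random(seed)
    return [rng.gauss(mean, sd) for _ in range(n)]


def write_sample(stream, values, column="X"):
    """Write the values to a file-like object as a one-column CSV with a header."""
    writer = csv.writer(stream)
    writer.writerow([column])
    for value in values:
        writer.writerow([value])


# Demonstration: write into an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
write_sample(buffer, make_sample())
print(buffer.getvalue().splitlines()[:3])
```

To produce a file for the CSV Reader, pass a file opened with open(path, "w", newline="") to write_sample instead of the in-memory buffer.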

 

2. Create Custom Transformer

Right-click anywhere on the canvas and select Create Custom Transformer. Name the transformer ShapiroWilkCalculator-R or ShapiroWilkCalculator-Py. You can enter description details if you wish.

 

3. Create Input Parameter

We will need to use an attribute from our input data source throughout the custom transformer so let’s make a published parameter to do that easily. In the custom transformer tab, create a new published parameter. Then set the following:

Type: Attribute Name
Name: input_data
Prompt: Attribute to Test:
Published: Yes
Optional: No
Attribute Assignment: Off

inputparam.png

input_data Published Parameter setup

 

4. Create Attribute

To be able to reuse this custom transformer easily, we need to create an attribute whose name doesn’t change regardless of what the attribute we are evaluating is named. Add an AttributeManager transformer and connect it to the Input port inside the custom transformer. Set New Attribute to shapiro.x, then set Attribute Value to:

@real64(@Value($(input_data)))

We’ve enclosed the parameter value in @real64() to ensure that our values are the float data type which is required for our statistical calculation.
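To see why the numeric type matters, here is a small Python analogy. It is not how FME evaluates @real64() internally; it only illustrates that values read from a CSV typically arrive as text, and that text does not behave like numbers. The helper name to_float is made up for this example.

```python
def to_float(value):
    """Convert a raw attribute value (often text read from a CSV) to a float.

    Python's float is a 64-bit IEEE 754 double, comparable to a real64 value.
    """
    return float(value)


raw_values = ["0.25", "-1.7824", "1.1977"]
numeric = [to_float(v) for v in raw_values]
print(numeric)

# Text compares character by character, numbers compare by magnitude.
print("10" < "9")                      # comparison as text
print(to_float("10") < to_float("9"))  # comparison as numbers
```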

attributeman.png

AttributeManager parameters to create the constant attribute shapiro.x

 

If using R, follow steps 5-6; if using Python, follow steps 7-9.

 

RCaller

Before continuing with the RCaller, please ensure that you have R installed on your computer, as well as the R package sqldf. See the RCaller documentation for instructions on how to do this.

 

5. Set up Shapiro-Wilk Test using R

Now that we have cleaned up the data and created a constant attribute we can set up the RCaller transformer to perform the Shapiro-Wilk Test.

Add an RCaller transformer to the canvas and connect it to the AttributeManager. In the parameters, change the Input Table name to R, then under Columns change the Type for shapiro.x to float. Click OK to accept the parameters. You will need to reconnect the RCaller to the AttributeManager after changing the table name.

 

Open the RCaller parameters again and paste the following code:

shapiro <- shapiro.test(R$shapiro.x)

fmeOutput<-data.frame(shapiro$statistic, shapiro$p.value)

The first line is creating an object named shapiro and is performing the function shapiro.test (which is the Shapiro-Wilk Test) on the R table and shapiro.x column. This function results in a list object, so shapiro becomes a list.

 

Line two builds a data frame from the statistic and p.value elements of shapiro and assigns it to fmeOutput. R uses the $ character to access elements of objects; in this case, it is accessing the elements of a list. The resulting columns are named shapiro.statistic and shapiro.p.value.

 

The last parameter to set in the RCaller is Attributes to Expose. Click the ellipsis and add shapiro.statistic and shapiro.p.value as attributes to expose. This allows these attributes to be used in the FME workspace after the RCaller.

rcaller.png

RCaller parameters for the Shapiro-Wilk Test

 

6. Finish Custom Transformer

To finish off the custom transformer, connect the RCaller Output port to the Output port for the custom transformer. Then continue with step 10.

rworkflow.png

ShapiroWilkCalculator-R custom transformer workspace

 

PythonCaller

Before continuing, please ensure that you have SciPy installed for the same version of Python that FME is using.
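One way to check this is to ask the interpreter itself. The sketch below assumes Python 3.4 or newer (for importlib.util.find_spec) and uses a made-up helper name, describe_environment. It only reports on the interpreter that runs it, so to learn about FME's Python it would have to be run from inside FME.

```python
import importlib.util
import sys


def describe_environment(package="scipy"):
    """Report the running interpreter and whether a package can be imported."""
    return {
        "python_version": "%d.%d.%d" % sys.version_info[:3],
        "executable": sys.executable,
        "package": package,
        # find_spec returns None when a top-level package cannot be found
        "available": importlib.util.find_spec(package) is not None,
    }


info = describe_environment()
print(info)
```

If "available" is False, SciPy is missing from that particular interpreter, even if it is installed for another Python on the same machine.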

 

7. Keep shapiro.x

We only need the shapiro.x attribute, so add an AttributeKeeper transformer to the canvas and connect it to the AttributeManager. This will remove all the attributes from the schema except the one we are interested in. In the parameters, select shapiro.x as the Attribute to Keep.

 

8. Set up the Shapiro-Wilk Test using Python

Now that we have cleaned up the data and created a constant attribute we can set up the PythonCaller transformer to perform the Shapiro-Wilk Test.

 

Add a PythonCaller transformer to the canvas and connect it to the AttributeKeeper. In the parameters, paste the following code:

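import fme
import fmeobjects
import scipy.stats

class FeatureProcessor(object):
    def __init__(self):
        # Values of shapiro.x collected from every incoming feature
        self.x = []

    def input(self, feature):
        # Called once per feature: store the value as a float
        self.x.append(float(feature.getAttribute('shapiro.x')))

    def close(self):
        # Called once after the last feature: run the test on all values
        results = scipy.stats.shapiro(self.x)

        # Output a single new feature carrying the statistic and the p-value
        feature = fmeobjects.FMEFeature()
        feature.setAttribute('shapiro.result', results[0])
        feature.setAttribute('shapiro.pvalue', results[1])

        self.pyoutput(feature)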

The first three lines import different packages. The import fme and import fmeobjects statements will already be in the PythonCaller when you first open it, so you only need to add import scipy.stats to be able to use the Shapiro-Wilk test.

self.x.append(float(feature.getAttribute('shapiro.x'))) reads the attribute shapiro.x from each incoming feature, converts it to a float, and adds it to the list of values to test.

results = scipy.stats.shapiro(self.x) calls the shapiro function from the scipy.stats package. It returns the test statistic followed by the p-value.
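To try the call outside FME, a minimal sketch such as the following can be run in any Python environment with SciPy installed. The sample is generated on the spot with an arbitrary seed and is not the cat.csv data; the two returned values correspond to results[0] and results[1] in the PythonCaller code.

```python
import random

from scipy import stats

# Illustrative sample: 200 draws from a standard normal distribution.
rng = random.Random(0)
sample = [rng.gauss(0.0, 1.0) for _ in range(200)]

# shapiro returns the W statistic first and the p-value second.
statistic, p_value = stats.shapiro(sample)
print("W = %.4f, p-value = %.4f" % (statistic, p_value))
```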

Finally, the last four lines create a new feature, set the result attributes on it, and output it so the attributes can be used within FME.

Before closing the PythonCaller, click the ellipsis next to Attributes to Expose and add shapiro.result and shapiro.pvalue. Then, for Attributes to Hide, select shapiro.x.

Click OK to close the PythonCaller.
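The collect-then-summarize pattern used by the PythonCaller code can be exercised outside FME with stand-in objects. Everything in the sketch below is hypothetical scaffolding: StubFeature imitates only the two FMEFeature methods used above, pyoutput simply records what would be emitted, and fake_test is a placeholder that returns a count rather than a real statistic. With SciPy available, scipy.stats.shapiro could be passed in its place.

```python
class StubFeature(object):
    """Hypothetical stand-in for fmeobjects.FMEFeature, for use outside FME."""

    def __init__(self, attributes=None):
        self._attributes = dict(attributes or {})

    def getAttribute(self, name):
        return self._attributes.get(name)

    def setAttribute(self, name, value):
        self._attributes[name] = value


class GroupProcessor(object):
    """Collects one value per feature, then emits a single summary feature."""

    def __init__(self, test_func):
        self.x = []
        self.test_func = test_func
        self.emitted = []

    def input(self, feature):
        # Called once per feature: accumulate the value as a float
        self.x.append(float(feature.getAttribute('shapiro.x')))

    def close(self):
        # Called once at the end: compute on everything collected so far
        results = self.test_func(self.x)
        feature = StubFeature()
        feature.setAttribute('shapiro.result', results[0])
        feature.setAttribute('shapiro.pvalue', results[1])
        self.pyoutput(feature)

    def pyoutput(self, feature):
        # In FME the PythonCaller provides pyoutput; here we only record it
        self.emitted.append(feature)


def fake_test(values):
    """Placeholder, NOT a statistical test: returns (number of values, 1.0)."""
    return (len(values), 1.0)


processor = GroupProcessor(fake_test)
for raw in ["0.1", "-0.4", "0.3"]:
    processor.input(StubFeature({'shapiro.x': raw}))
processor.close()
print(len(processor.emitted))
```

The point of the design is that no result can be produced until every feature has been seen, which is why the test runs in close() rather than in input().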

pythoncaller.png

PythonCaller parameters for the Shapiro-Wilk Test

 

9. Finish Custom Transformer

To finish off the custom transformer, connect the PythonCaller Output port to the Output port for the custom transformer.

pythonworkflow.png

ShapiroWilkCalculator-Py custom transformer workspace

 

Both R and Python

10. Run translation

Switch back to the Main tab and add an Inspector to the Output port of the ShapiroWilkCalculator-R or ShapiroWilkCalculator-Py transformer. Open the parameters for the custom transformer and set the Attribute to Test to X, then run the translation.

The final results:

R:

rfinal.png

Python:

pythonfinal.png

 

11. Interpretation

If the p-value is less than or equal to the significance level (in this case 0.05, which corresponds to a 95% confidence level), the null hypothesis that the data is normally distributed can be rejected. Put in plain language, if the p-value is 0.05 or less, we can conclude that the data is not normally distributed. For our data, the p-value shown in the results above is greater than 0.05, so we cannot reject the null hypothesis and can treat the data as normally distributed.
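The decision rule can be written down as a small function. This is only a sketch: the function name and the wording of the returned messages are made up for illustration, and the default threshold of 0.05 is the significance level used in this article.

```python
def interpret_shapiro(p_value, alpha=0.05):
    """Turn a Shapiro-Wilk p-value into a plain-language decision."""
    # A p-value is a probability, so anything outside 0..1 signals a mistake.
    if not 0.0 <= p_value <= 1.0:
        raise ValueError("a p-value must lie between 0 and 1")
    if p_value <= alpha:
        return "reject normality"
    return "fail to reject normality"


print(interpret_shapiro(0.01))
print(interpret_shapiro(0.44))
```

Note that "fail to reject normality" is weaker than "the data is normal": it only says the test found no significant evidence of a departure from the normal distribution.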

 

Additional Resources

RCaller: Ins and outs of using R in FME

Tutorial: Python and FME Basics

Shapiro-Wilk Test R Documentation

Shapiro-Wilk Test Scipy Python Documentation

Publishing an Item to the FME Hub
