Using the XMLXQueryExtractor to Extract from XML or JSON

Files

bedrock_geology_dcat.xml (10 KB)
xmlXQueryExtractor_Part2_Completed.fmwt (60 KB)
xmlXQueryExtractor_Part1_Completed.fmwt (60 KB)
bedrock_geology_dcat.jsonld (10 KB)

Introduction

XQuery is a query and functional programming language designed to retrieve, transform, and manipulate XML data, much like SQL for relational databases. It is built on XPath (XML Path Language) and allows users to navigate XML structures, filter nodes, join data from multiple XML documents, and construct new XML or other text-based outputs.

The XMLXQueryExtractor transformer allows you to use XQuery to extract portions of text from XML or other XML-based formats such as Extensible HyperText Markup Language (XHTML), Geography Markup Language (GML), or Keyhole Markup Language (KML). It is also possible to use this transformer to extract from JSON or JSON-based documents using the JSONiq extension to XQuery.

It is possible to use XMLFragmenter and XMLFlattener transformers to work with XML documents or JSONFragmenter and JSONFlattener to work with JSON in FME. However, leveraging the XQuery or JSONiq extension via the XMLXQueryExtractor allows for greater control and customization over what and how information is extracted from XML- or JSON-based documents.

This is especially useful when working with complex or large XML or JSON documents, where flattening every XML element or JSON object into attributes is unnecessary and can slow down processing.

Part 1 contains examples of using the XMLXQueryExtractor to extract information from an XML-based document with XQuery. A basic familiarity with XML, XPath, and XQuery is recommended.

Part 2 contains examples of using the XMLXQueryExtractor to extract text from JSON documents by using the JSONiq extension to XQuery in the XQuery Expression parameter. A basic familiarity with JSON and XQuery is recommended.

The completed workspace for part 1, xmlXQueryExtractor_Part1_Completed.fmwt, and for part 2, xmlXQueryExtractor_Part2_Completed.fmwt, are attached to the article and can be downloaded from the Files section.

The workspaces were created in FME 2025.2.1 build 25815. However, the article's steps should work with any FME version.

Data Source

bedrock_geology_dcat.xml is a Data Catalog Vocabulary (DCAT) catalog file containing descriptive metadata for a bedrock geology dataset of British Columbia, Canada, serialized in RDF/XML syntax.

bedrock_geology_dcat.jsonld is a DCAT catalog file containing descriptive metadata for the same bedrock geology dataset, but serialized in JavaScript Object Notation for Linked Data (JSON-LD) syntax.

Step-by-Step Instructions

In this article, we use the XMLXQueryExtractor transformer to extract portions of text from an XML or JSON document using different XQuery expressions (with the JSONiq extension to XQuery for JSON), then store the extracted text in various forms of record attributes.

Part 1: Extract Information from XML

If you would like to create your own workspace following the instructions from the tutorial, download the source XML data, bedrock_geology_dcat.xml, from the Files section before starting. Otherwise, you can follow along with the completed workspace for part 1, xmlXQueryExtractor_Part1_Completed.fmwt, also available from the Files section.

1. Configure XMLXQueryExtractor to Extract XML Array to an Attribute

Start FME Workbench, and click on New to open a blank workspace. Add an XMLXQueryExtractor transformer to the canvas.

If you are using FME 2025.1 and older, an input feature is required to trigger XMLXQueryExtractor transformers to run. Add a Creator to the workspace and connect the output port of the Creator to the input port of the XMLXQueryExtractor.

For more information on optional input ports, please see Transformers with an Optional Input Port.

We are interested in extracting the information stored in the dcat:keyword elements. To do this, set the following parameters in the XMLXQueryExtractor parameters dialog:

XQuery Type
- XQuery Expression:
  declare namespace dcat="http://www.w3.org/ns/dcat#";
  declare namespace rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#";
  /rdf:RDF/dcat:Dataset/dcat:keyword
XML Source
- XML File: ./bedrock_geology_dcat.xml
  - Click on the ellipsis and browse to the downloaded dataset

For this article, only change the parameters listed in the bulleted lists, then exit the parameters dialog by clicking OK. All parameters not mentioned can remain at their default values.

An XQuery expression consists of two parts: an optional prolog and a body. In the XQuery expression above, the prolog is the part where each namespace prefix is declared and bound to a URI (the two lines starting with “declare namespace”), and the body is the line starting with /rdf:RDF/dcat:Dataset/…. The prolog declaring the namespaces is required in this case because the source XML uses namespace prefixes (dcat, dct, foaf, etc.).

Since we are only interested in extracting text stored in the dcat:keyword elements, only the prefixes referenced in the XPath in the body of the expression (dcat and rdf) need to be declared in the prolog.

The body consists of the XPath expression used to select the node in the XML document. The path expression is the basis of an XQuery expression in the XMLXQueryExtractor, as a valid XQuery expression may consist of only the XPath and nothing else.

The values we are interested in are contained within dcat:keyword elements of the dcat:Dataset node, which in itself is part of the rdf:RDF node at the root. So the body of the expression is the following XPath:
/rdf:RDF/dcat:Dataset/dcat:keyword
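
To make the prolog and body split concrete outside FME, here is a minimal Python sketch using the standard library's xml.etree.ElementTree in place of the XQuery engine. The sample XML is a hypothetical, trimmed-down stand-in for bedrock_geology_dcat.xml, and the keyword values are invented for illustration. The prefix-to-URI dictionary plays the role of the prolog, and the path passed to findall() plays the role of the body.

```python
import xml.etree.ElementTree as ET

# Hypothetical, trimmed-down stand-in for bedrock_geology_dcat.xml.
SAMPLE = """<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:dcat="http://www.w3.org/ns/dcat#"
    xmlns:dct="http://purl.org/dc/terms/">
  <dcat:Dataset>
    <dct:title xml:lang="en">Bedrock Geology</dct:title>
    <dcat:keyword xml:lang="en">geology</dcat:keyword>
    <dcat:keyword xml:lang="en">lithology</dcat:keyword>
    <dcat:keyword xml:lang="fr-t-en">lithologie</dcat:keyword>
  </dcat:Dataset>
</rdf:RDF>"""

# Plays the role of the prolog: only the prefixes used in the path are bound.
NAMESPACES = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "dcat": "http://www.w3.org/ns/dcat#",
}

root = ET.fromstring(SAMPLE)  # the root element is rdf:RDF

# Plays the role of the body: walk from the root to the keyword elements.
keyword_nodes = root.findall("dcat:Dataset/dcat:keyword", NAMESPACES)
print(len(keyword_nodes), "keyword elements found")
```

Note that dct is used in the sample document but is not bound in the dictionary, because the path never references it.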

The XQuery expression is specified directly in the XMLXQueryExtractor transformer as in this example. XQuery expressions can also be specified using a file or via an attribute by setting the XQuery Input parameter to the appropriate value. This can be useful if the XQuery expression frequently changes or should differ for each record.

The XMLXQueryExtractor is set to retrieve the XML source from the specified XML file. The XMLXQueryExtractor also allows the XML source to be read from an attribute or specified in the XQuery expression itself.

Add an Inspector transformer and connect the QueryResults output port to the input port of the Inspector transformer.

XMLXQueryExtractor's QueryResults output port connected to an Inspector input port

Run the workspace and inspect the output. Inspect the results of the query via the _result attribute on the output feature in the FME Data Inspector or Data Preview.

Output results of XQuery expression containing namespaces, attributes, start and end tags

2. Refine XQuery Expression to Extract Content of XML Array

The XQuery expression from step 1 extracts a block of text from the XML document, including the start and end tags, namespaces, and attributes. The result includes unnecessary text that makes it difficult to read.

We want to extract only the content between the dcat:keyword start and end tags, so we need to add the XQuery function, fn:data(), to our XQuery expression. This function extracts the actual value of an XML node rather than the entire node. You can apply fn:data() to elements or attributes, where it returns their typed value (usually text, numbers, or dates) and XML tags are stripped away.

Change the XMLXQueryExtractor to:

XQuery Type
- XQuery Expression:
  declare namespace dcat="http://www.w3.org/ns/dcat#";
  declare namespace rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#";
  fn:data(/rdf:RDF/dcat:Dataset/dcat:keyword)

XPath and XQuery include a number of built-in functions, such as fn:data(), that can enhance your queries. You can view a number of available functions at https://www.w3schools.com/xml/xsl_functions.asp

Rerun the workspace and inspect the output. The query results are now much shorter and do not include start or end tags, attributes, or namespaces. Since the dcat:keyword element repeats within the dataset, 14 keywords have been extracted to the _result attribute, with a comma separating each keyword, as set by the default values in the Results parameter section.
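
The difference between returning whole nodes and returning only their values can be sketched in Python as follows. This is an illustration under assumptions, not FME's implementation: the sample XML and its keyword values are hypothetical, and the comma delimiter is hard-coded here rather than read from a transformer parameter.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample with invented keyword values.
SAMPLE = """<dcat:Dataset xmlns:dcat="http://www.w3.org/ns/dcat#">
  <dcat:keyword xml:lang="en">geology</dcat:keyword>
  <dcat:keyword xml:lang="en">lithology</dcat:keyword>
  <dcat:keyword xml:lang="fr-t-en">lithologie</dcat:keyword>
</dcat:Dataset>"""

NAMESPACES = {"dcat": "http://www.w3.org/ns/dcat#"}
nodes = ET.fromstring(SAMPLE).findall("dcat:keyword", NAMESPACES)

# Whole node, comparable to step 1: tags, namespace declaration, and
# attributes are all part of the serialized text.
as_markup = ET.tostring(nodes[0], encoding="unicode")

# Value only, comparable to wrapping the path in fn:data() in step 2.
values = [node.text for node in nodes]

# One delimited string, comparable to a single result attribute.
joined = ",".join(values)
print(joined)
```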

Output of step 2's XQuery expressions with fourteen dataset keywords stored in _result attribute and delimited by commas

3. Use Predicates and Extract Content to List Attribute

It is possible to filter results based on criteria using XQuery predicates. Predicates are enclosed in square brackets and immediately follow the node test or step to which they apply.

Predicates are based on the current item, also known as the context item, which is the specific item currently being processed. When filtering, the current item represents each individual item in the sequence being checked against the criteria specified in the predicate. Only items that evaluate to true based on the predicate are kept in the result.

Predicate Type	Example
Value	`.../dcat:keyword[@xml:lang="en"]`
Position	`.../dcat:keyword[fn:position() > 2]`
Function	`.../dcat:keyword[fn:starts-with(.,'li')]`
Multiple	`.../dcat:keyword[@xml:lang="en"][fn:position() > 2]`

Examples of various types of XQuery predicates
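
To show what each kind of predicate keeps, the following Python sketch applies equivalent filters to a small, hypothetical keyword sequence (the values are invented). List comprehensions stand in for predicates here. The last case illustrates that when predicates are chained, the second one sees positions within the sequence already filtered by the first.

```python
import xml.etree.ElementTree as ET

XML_NS = "{http://www.w3.org/XML/1998/namespace}"  # namespace behind the xml: prefix

# Hypothetical sample with invented keyword values.
SAMPLE = """<dcat:Dataset xmlns:dcat="http://www.w3.org/ns/dcat#">
  <dcat:keyword xml:lang="en">geology</dcat:keyword>
  <dcat:keyword xml:lang="en">lithology</dcat:keyword>
  <dcat:keyword xml:lang="en">bedrock</dcat:keyword>
  <dcat:keyword xml:lang="fr-t-en">lithologie</dcat:keyword>
</dcat:Dataset>"""

nodes = ET.fromstring(SAMPLE).findall("{http://www.w3.org/ns/dcat#}keyword")

# [@xml:lang="en"]: keep items whose attribute equals the test value.
english = [n.text for n in nodes if n.get(XML_NS + "lang") == "en"]

# [fn:position() > 2]: positions are one-based, so the first two items are dropped.
after_second = [n.text for pos, n in enumerate(nodes, start=1) if pos > 2]

# [fn:starts-with(., 'li')]: test the value of the current item itself.
starts_with_li = [n.text for n in nodes if n.text.startswith("li")]

# [@xml:lang="en"][fn:position() > 2]: the position test is applied to the
# sequence that survived the language test, not to the original sequence.
english_nodes = [n for n in nodes if n.get(XML_NS + "lang") == "en"]
chained = [n.text for pos, n in enumerate(english_nodes, start=1) if pos > 2]

print(english, after_second, starts_with_li, chained)
```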

The XQuery expression from step 2 extracts the value of dcat:keyword nodes. There is an attribute named xml:lang associated with the dcat:keyword node. The attribute values are “en”, representing English keywords, and “fr-t-en”, representing French keywords.

We want to apply a criterion and return only French-language keywords as query results. We will add a predicate to the XQuery expression so it only returns nodes where the xml:lang attribute has a value of “fr-t-en”.

To do this, we will specify @xml:lang="fr-t-en" inside the square brackets of the predicate. The at sign ( @ ) is the operator used to select or reference attribute(s) of an element in XQuery, so this symbol prefixes the attribute name. The equal to operator ( = ) is used to compare the attribute value of the current item to the test value, "fr-t-en".

In addition, instead of adding all returned keywords to a single attribute, we want to store them as a list attribute. We can do this by changing the value set in the Return Value parameter.

Open the XMLXQueryExtractor and change the following parameters:

XQuery Type
- XQuery Expression:
  declare namespace dcat="http://www.w3.org/ns/dcat#";
  declare namespace rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#";
  fn:data(/rdf:RDF/dcat:Dataset/dcat:keyword[@xml:lang="fr-t-en"])
Return Value: List attribute

Rerun the workspace and view the output. Since the keywords are stored in the _results list attribute, you will have to view the results by opening FME Data Inspector or the Data Preview window of FME Workbench, clicking the output record, and navigating to the Exposed Attributes section of the Record Information pane. Right-click on _results{} and select "Expand All Under This Node".

Click on output record in Data Inspector, right-click on _results{} list and select "Expand All Under This Node" to see list attribute in Record Information window

View of seven list attributes the XQuery expression in step 3 created on the output record

4. Enhance XQuery Expressions with FME XQuery Functions

FME provides a variety of XQuery functions, including ones that allow access to and manipulation of record attribute values and geometry data. The XQuery functions available in FME are listed in the FME XQuery functions documentation (see Additional Resources). These FME XQuery functions allow you to enhance the query results returned, especially when used in combination with FLWOR expressions and built-in XQuery functions.

FLWOR is an acronym for For, Let, Where, Order by, and Return, and is an XQuery structure that allows you to select, filter, sort, and format XML data in a clear, readable way. It is similar to SELECT-FROM-WHERE statements in SQL for relational databases. You can read more about FLWOR expressions in the XQuery tutorial listed under Additional Resources.

The XQuery expression from step 3 extracts the query results and assigns them to a list attribute. However, list attributes require more steps to view, and not all formats support list attributes. By using a FLWOR expression and the FME XQuery function fme:set-attribute() to set a record attribute, you can return query results with customized attribute names and make it easier to manipulate the results in downstream transformers.

Change the following XMLXQueryExtractor parameters:

XQuery Type
- XQuery Expression:
  declare namespace dcat="http://www.w3.org/ns/dcat#";
  declare namespace rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#";
  let $keywords := fn:data(/rdf:RDF/dcat:Dataset/dcat:keyword[@xml:lang="fr-t-en"])
  for $keyword at $n in $keywords
  return {fme:set-attribute("keyword"||fn:string($n),$keyword)}
Expose Attributes
- Attributes to Expose:
  Output Attribute	Type
  keyword1	buffer
  keyword2	buffer
  keyword3	buffer
  keyword4	buffer
  keyword5	buffer
  keyword6	buffer
  keyword7	buffer

The prolog remains the same as before, as all changes are in the body of the XQuery expression. We’ll go through the changes line by line.

The third line binds the $keywords variable to the same sequence of results returned in step 3 using the let clause and the assignment operator, a colon followed by an equal sign ( := ). Variable names in XQuery expressions must start with a dollar sign ($).

Then, the fourth line iterates over each $keyword in the $keywords variable using the for clause, while keeping track of the sequence index position with the $n variable.

The last line uses the return clause to dynamically create an attribute by calling the FME XQuery function, fme:set-attribute(). This function accepts two required arguments and an optional third argument. The first argument is the attribute name, the second is the attribute value, and the third optional argument is the delimiter.

The double pipe operator ( || ) in XQuery is used for string concatenation. The attribute name is dynamically created by concatenating the text “keyword” with the position number of the sequence cast to a string using the double pipe operator. The position number needs to be cast to a string because the double pipe operator only works with strings. The attribute value is set as the corresponding $keyword variable.

Since the return clause is evaluated for each item in the sequence, each keyword will be stored in a different attribute. Since attribute names are dynamically generated, the attributes must be exposed before they appear in the Data Inspector or Data Preview table view. The XMLXQueryExtractor allows you to expose attributes within the transformer itself in the Expose Attributes section. Alternatively, an AttributeExposer transformer can be used.
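
The for ... at ... return pattern can be sketched in plain Python as a loop with a one-based counter. In this illustration, the keyword values are hypothetical and a dictionary named feature stands in for the record's attributes; it is not how FME stores attributes internally.

```python
# Hypothetical French keywords, standing in for the $keywords sequence.
french_keywords = ["lithologie", "roche", "socle rocheux"]

# Stand-in for the attributes of a single FME record.
feature = {}

# "for $keyword at $n in $keywords": enumerate from 1 to mirror
# XQuery's one-based positions.
for n, keyword in enumerate(french_keywords, start=1):
    # "keyword" || fn:string($n): build the attribute name from text plus position.
    attribute_name = "keyword" + str(n)
    # fme:set-attribute(name, value): store the value under that name.
    feature[attribute_name] = keyword

print(feature)
```

Because the names are only known once the loop has run, a tool that needs the schema up front has to be told about keyword1, keyword2, and so on, which is the purpose of exposing the attributes.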

Rerun the workspace and view the output.

Now the French keywords are stored in separate attributes named “keyword1”, “keyword2”, etc. XQuery is a one-based indexing language where the first item in a sequence is at index 1, not 0.

Output of step 4's XQuery expressions where each dataset keyword is stored in a uniquely named attribute

Now you have successfully used the XMLXQueryExtractor transformer to pull information from an XML document and store it on a record as a single attribute, a list attribute, and individually named attributes.

Part 2: Extract Information from JSON

The XMLXQueryExtractor is not limited to XML documents; it can extract information from JSON documents using the JSONiq extension of XQuery. The same applies to all transformers that accept XQuery, such as XMLXQueryExploder, JSONTemplater, etc. The exceptions are the XMLTemplater and XMLUpdater transformers.

We will repeat the same exercises as part 1 of this article, but use a JSON-formatted version of the source descriptive metadata dataset instead.

If you would like to create your own workspace following the instructions from the tutorial, download the source JSON data, bedrock_geology_dcat.jsonld, from the Files section before starting. Otherwise, you can follow along with the completed workspace for part 2, xmlXQueryExtractor_Part2_Completed.fmwt, also available from the Files section.

1. Configure XMLXQueryExtractor to Extract JSON Array to an Attribute

Start FME Workbench, and click on New to open a blank workspace. Add an AttributeFileReader transformer to the canvas and set the following parameters:

General
- Source Filename: ./bedrock_geology_dcat.jsonld
  - Click on the ellipsis and browse to the downloaded dataset
- Source File Character Encoding: Unicode 8-bit (utf-8)

If you are using FME 2025.1 and older, an input feature is required to trigger AttributeFileReader transformers to run. Add a Creator to the workspace and connect the output port of the Creator to the input port of the AttributeFileReader.

For more information on optional input ports, please see Transformers with an Optional Input Port.

Add an XMLXQueryExtractor transformer to the canvas and add a connection from the output port of the AttributeFileReader to the input port of the XMLXQueryExtractor.

We are interested in extracting information from the "dcat:keyword" array. To do this, set the following parameters in the XMLXQueryExtractor parameters dialog:

XQuery Type
- XQuery Expression:
  let $jsondoc := fme:get-json-attribute("_file_contents")
  let $keywords := $jsondoc("@graph")(1)("dcat:keyword")
  return $keywords
XML Source
- XML Input: None (File is specified in query)

The input JSON document is read into the workspace and stored within an attribute using the AttributeFileReader transformer. The string representing the JSON data is fetched from the _file_contents attribute within the XQuery expression using the fme:get-json-attribute() FME XQuery function in the first line of the XQuery. This function retrieves attribute values containing valid JSON. FME XQuery functions are listed in the FME XQuery functions documentation (see Additional Resources).

The JSON data fetched from the attribute using the FME XQuery function is bound to a $jsondoc variable using the let clause and the assignment operator ( := ). As in XQuery, variable names in the JSONiq extension must start with a dollar sign ($).

Since the JSONiq extension to XQuery is built on XQuery, many XQuery structures remain the same, including FLWOR, variable naming syntax, and operators. However, there are differences as well.

The second line of the expression accesses object content from the JSON document bound to the $jsondoc variable. Object lookup for JSONiq is not done with XPath, since the data is not XML. Object lookup in JSONiq extension to XQuery is done with parentheses notation, with the object key as a string or an integer for an array lookup inside the parentheses.

The keywords we want to access are stored under the "dcat:keyword" key, which is part of the first member of the "@graph" array at the root level of the JSON document. This results in the following object lookup: $jsondoc("@graph")(1)("dcat:keyword"). Like XQuery, the JSONiq extension to XQuery uses one-based indexing, so the first array value is accessed with 1 and not 0.

This differs from the standalone JSONiq syntax, where object lookup uses dot notation and array lookup uses double square brackets (e.g., $jsondoc."@graph"[[1]]."dcat:keyword"). But since FME uses the JSONiq extension to XQuery and not the standalone JSONiq syntax, only the parentheses notation is supported in FME transformers. In both cases, the first array member is at index 1.

Finally, the third and last line returns results from the second line of the expression to the default attribute name specified in the Result Attribute parameter.
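
For readers more used to general-purpose languages, the same lookup chain can be sketched in Python with the standard json module. The JSON below is a hypothetical, trimmed-down stand-in for bedrock_geology_dcat.jsonld with invented keyword values. The main thing to notice is the index: Python lists are zero-based, so the lookup written as (1) in the JSONiq extension corresponds to [0] here.

```python
import json

# Hypothetical, trimmed-down stand-in for bedrock_geology_dcat.jsonld.
SAMPLE = """{
  "@context": {"dcat": "http://www.w3.org/ns/dcat#"},
  "@graph": [
    {
      "@type": "dcat:Dataset",
      "dcat:keyword": [
        {"@language": "en", "@value": "geology"},
        {"@language": "en", "@value": "lithology"},
        {"@language": "fr-t-en", "@value": "lithologie"}
      ]
    }
  ]
}"""

# Comparable to binding $jsondoc from the attribute holding the file contents.
jsondoc = json.loads(SAMPLE)

# $jsondoc("@graph")(1)("dcat:keyword") in the JSONiq extension;
# index 0 in Python selects the same first member.
keywords = jsondoc["@graph"][0]["dcat:keyword"]
print(json.dumps(keywords))
```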

Add an Inspector transformer and connect the QueryResults output port to the input port of the Inspector transformer.

Workflow where AttributeFileReader's output port is connected to XMLXQueryExtractor input port. The QueryResults output port of the XMLXQueryExtractor is connected to an Inspector

Run the workspace and inspect the output. Inspect the results of the query via the _result attribute on the output feature in the FME Data Inspector or Data Preview window of FME Workbench.

Results of JSONiq extension to XQuery expression in step 1 where the "dcat:keyword" array is extracted

2. Refine JSONiq Expression to Extract Content of JSON Array

The JSONiq extension to XQuery expression from step 1 extracts the entire JSON array from the JSON document, including the square brackets for arrays, the curly braces for objects, and the key names. This extra text is not necessary and makes the extracted output difficult to read.

We want to extract only the value from key-value pairs in the array, so we need to use the JSONiq function, jn:members(), in the second line of our JSONiq expression. This function can be used to unpack arrays, returning all members of one or more JSON arrays while preserving order.

JSONiq includes a number of built-in functions, such as jn:members(), that can enhance your query. You can view a number of available functions at https://www.jsoniq.org/docs/JSONiqExtensionToXQuery/html/section-builtin-functions.html

After unpacking the array, object lookup can be used to fetch the value of the keyword stored under the "@value" key of each $keywords member.

Change the XMLXQueryExtractor to:

XQuery Type
- XQuery Expression:
  let $jsondoc := fme:get-json-attribute("_file_contents")
  let $keywords := jn:members($jsondoc("@graph")(1)("dcat:keyword"))
  return $keywords("@value")

Rerun the workspace and inspect the output. The query results now contain only the 14 dataset keywords, with a comma separating each, as set by the default values in the Results parameter section.
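
The effect of unpacking the array and then looking up one key per member can be sketched in Python as below. The array contents are hypothetical, and the comma delimiter is hard-coded for illustration rather than taken from a transformer parameter.

```python
import json

# Hypothetical "dcat:keyword" array with invented values.
keyword_array = json.loads("""[
  {"@language": "en", "@value": "geology"},
  {"@language": "en", "@value": "lithology"},
  {"@language": "fr-t-en", "@value": "lithologie"}
]""")

# Comparable to step 1: the whole array, brackets, braces, and key names included.
whole_array_text = json.dumps(keyword_array)

# Comparable to jn:members(...) followed by the ("@value") lookup:
# visit each member in order and keep only the keyword text.
values = [member["@value"] for member in keyword_array]

# One delimited string, comparable to a single result attribute.
joined = ",".join(values)
print(joined)
```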

Results of JSONiq extension to XQuery expression in step 2 where the "@value" values are extracted from the "dcat:keyword" array

3. Use Predicates and Extract Content to List Attribute

Similar to XQuery, it is possible to filter results based on criteria using predicates in the JSONiq extension to XQuery expressions, keeping only items that pass the criteria. Predicates are enclosed in square brackets and should immediately follow the item they apply to.

The following table contains example syntax for various predicates in the JSONiq extension to XQuery.

Predicate Type	Example
Value	`…"dcat:keyword"))[.("@language")="en"]`
Existence	`…"dcat:keyword"))[.("@language")]`
Position	`…"dcat:keyword"))[position() > 9]`
Multiple	`…"dcat:keyword"))[.("@language")="en"][position() > 1]`

Examples of various types of JSONiq extension to XQuery predicates

Predicates depend on the current item, also known as the context item, which is the specific item currently being processed in the sequence. When filtering, the current item represents each item in the sequence being checked against the specified condition. In the JSONiq extension to XQuery, the current item is represented by the standard XQuery dot (.) rather than the pure JSONiq syntax of a double dollar sign ($$).

The JSONiq expression from step 2 extracts the value of each member of the "dcat:keyword" array. In each member, there is a key-value pair representing the language associated with the "dcat:keyword" member. Like the XML dataset, values for the "@language" key are “en”, representing keywords in English, and “fr-t-en”, representing keywords in French.

We want to apply a criterion to the query and return only the French language keywords as list attribute results. To do this, open the XMLXQueryExtractor and change the following parameters:

XQuery Type
- XQuery Expression:
  let $jsondoc := fme:get-json-attribute("_file_contents")
  let $keywords := jn:members($jsondoc("@graph")(1)("dcat:keyword"))[.("@language")="fr-t-en"]
  return $keywords("@value")
Return Value: List attribute

The first line of the expression remains the same. The second has a predicate added to filter for any "dcat:keyword" array members whose "@language" key has a value of “fr-t-en”. The dot prefixing the object lookup within the predicate means the query will compare the value of the "@language" key within the current item to the test value.
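
As an illustration of how these predicates select members, the Python sketch below filters a hypothetical list of keyword objects. The values, and the member without an "@language" key, are invented to make the existence test visible. Python's truthiness check is used as an approximation of how a predicate treats a lookup that returns nothing.

```python
# Hypothetical members of the "dcat:keyword" array.
members = [
    {"@language": "en", "@value": "geology"},
    {"@language": "fr-t-en", "@value": "lithologie"},
    {"@value": "bedrock"},  # invented member with no "@language" key
    {"@language": "fr-t-en", "@value": "roche"},
]

# [.("@language")="fr-t-en"]: compare the current member's language to the test value.
french = [m["@value"] for m in members if m.get("@language") == "fr-t-en"]

# [.("@language")]: keep members where the lookup yields a usable value.
# dict.get() returns None for a missing key, which Python treats as false.
has_language = [m["@value"] for m in members if m.get("@language")]

# [.("@language")="fr-t-en"][position() > 1]: the position test runs on the
# already filtered members, counting from 1.
french_members = [m for m in members if m.get("@language") == "fr-t-en"]
french_after_first = [
    m["@value"] for pos, m in enumerate(french_members, start=1) if pos > 1
]

print(french, has_language, french_after_first)
```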

In addition, instead of adding all returned keywords to a single attribute, we want to store them as a list attribute. We can do this by changing the value set in the Return Value parameter.

Rerun the workspace and view the output. Since the keywords are stored in the _results list attribute, you will have to view the results by opening FME Data Inspector, clicking the output record, and navigating to the Exposed Attributes section of the Record Information pane. Right-click on _results{} and select "Expand All Under This Node".

4. Enhance JSONiq Expressions with FME XQuery Functions

FME provides a variety of XQuery functions, including ones that allow access to and manipulation of record attribute values and geometry data. Many XQuery functions also work with the JSONiq extension. The XQuery functions available in FME are listed in the FME XQuery functions documentation (see Additional Resources). These FME XQuery functions allow you to enhance the query results returned, especially when used in combination with FLWOR expressions and built-in functions of XQuery.

The XQuery expression from step 3 extracts the query results and assigns them to a list attribute. However, list attributes require more steps to view, and not all formats support list attributes. By using the FME XQuery function fme:set-attribute() to set a record attribute, you can return query results with customized attribute names and make it easier to manipulate the results in downstream transformers.

Change the following XMLXQueryExtractor parameters:

XQuery Type
- XQuery Expression:
  let $jsondoc := fme:get-json-attribute("_file_contents")
  let $keywords := jn:members($jsondoc("@graph")(1)("dcat:keyword"))[.("@language")="fr-t-en"]
  for $keyword at $n in $keywords
  return {fme:set-attribute("keyword"||xs:string($n),$keyword("@value"))}
Expose Attributes
- Attributes to Expose:
  
  Output Attribute Type
  
  keyword1 buffer
  
  keyword2 buffer
  
  keyword3 buffer
  
  keyword4 buffer
  
  keyword5 buffer
  
  keyword6 buffer
  
  keyword7 buffer

The first and second lines remain the same, while the third and fourth lines are new or modified. We’ll go through the changes line by line.

The third line iterates over each $keyword in the $keywords variable using the for clause, while keeping track of the sequence index position with the $n variable.

The fourth line uses the return clause to call fme:set-attribute() for each member. The double pipe (||) in the JSONiq extension to XQuery is used for string concatenation, as in XQuery. The attribute name is dynamically created by concatenating the text “keyword” with the position number of the sequence cast to a string using the double pipe operator, as the double pipe operator only works with strings. The attribute value is set from the "@value" key of the current $keyword variable.

Since the return clause is evaluated for each member of the sequence, each keyword will be stored in a different attribute. Because attribute names are dynamically created, the attributes must be exposed before they appear in the Data Inspector table view or the Data Preview window of FME Workbench. The XMLXQueryExtractor allows you to expose attributes within the transformer itself in the Expose Attributes section, or you can use an AttributeExposer transformer.
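
Because the attribute names depend on how many members pass the filter, it can help to see how the names and values line up. The Python sketch below is a hypothetical illustration: the member values are invented, and the resulting list of names simply shows what would need to be entered under Attributes to Expose for this sample.

```python
# Hypothetical French-language members after filtering.
french_members = [
    {"@language": "fr-t-en", "@value": "lithologie"},
    {"@language": "fr-t-en", "@value": "roche"},
    {"@language": "fr-t-en", "@value": "socle rocheux"},
]

# One attribute per member: the name comes from the one-based position,
# the value from the member's "@value" key.
attributes = {
    f"keyword{n}": member["@value"]
    for n, member in enumerate(french_members, start=1)
}

# The names that would have to be exposed for these attributes to be visible.
names_to_expose = list(attributes)
print(names_to_expose)
```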

Rerun the workspace and view the output. Now the French keywords are stored in separate attributes named “keyword1”, “keyword2”, etc. Remember, XQuery is a one-based indexing language, where the first member of a sequence is at index 1, not 0, and the same applies to the JSONiq extension to XQuery.

Output of step 4's JSONiq extension to XQuery expression where each dataset keyword is stored in a uniquely named attribute

Now you have successfully used the XMLXQueryExtractor to retrieve information from a JSON document using the JSONiq extension to XQuery and store it on a record as a single attribute, a list attribute, and individually named attributes.

Additional Resources

XPath tutorial: https://www.w3schools.com/xml/xpath_intro.asp

XQuery tutorial: https://www.w3schools.com/xml/xquery_intro.asp

FME XQuery functions: https://docs.safe.com/fme/html/FME-Form-Documentation/FME-Transformers/XQuery/XQuery_functions.htm

Introduction to JSONiq: https://www.jsoniq.org/docs/Introduction_to_JSONiq/html/

Built-in functions for JSONiq extension to XQuery: https://www.jsoniq.org/docs/JSONiqExtensionToXQuery/html/section-builtin-functions.html

Data Attribution

The data used here originates from open data made available by the Government of Canada. It contains information licensed under the Open Government Licence - Canada.
