Reading Complex XML or GML using the XMLFlattener

Files

XMLFlattener_Demo1.zip (30 KB)

Introduction

Many users have problems reading complex XML or GML. In the past, you had to write xfMap or XQuery scripts. To address this, we recently added the concept of XML flattening to our XML and GML readers. However, sometimes you have XML elements or attributes that need to be processed within a workspace, such as the result from a URL request. For this, you can use the new XMLFlattener transformer.

With XMLFlattener, all you have to do is feed it XML on an attribute or from a file and specify the node you want to query. The XMLFlattener transformer will then make an FME feature for each occurrence of that node in your XML and flatten all the elements nested within that node into simple FME attributes. This will allow you to read virtually any XML data and extract the information you want. Below, we describe how to use it, its limitations, and give you suggestions on how to work around some of them.

Note that the same approach can now be used within the standard FME XML reader by using the reader parameter configuration type = 'Feature Paths' and flatten options set to enable flattening. Also note that the XMLFlattener in the current version of FME only flattens XML attributes into FME attributes on the same feature. To explode XML into separate features based on multiple occurrences of an XML element, use the XMLFragmenter with flattening instead.

XMLFlattener

So, given the following input, xml_string =

<Feature>
  <attribute1>John</attribute1>
  <attribute2>Vancouver</attribute2>
  <activeDate>
     <from>11-22-99</from>
     <to>12-11-09</to>
  </activeDate>
</Feature>

Set the query node to 'Feature' and the XML input field to 'xml_string', and you will get:

attribute1 = John
attribute2 = Vancouver
activeDate_from = 11-22-99
activeDate_to = 12-11-09

Examples

Example Source XML

The easiest way to understand how this works is to look at a complete example of source.xml and the output that is generated. Suppose we want to read the source XML below:

<?xml version="1.0" encoding="UTF-8"?>
<FeatureCollection>
  <Feature>
    <attribute1>John</attribute1>
    <attribute2>Vancouver</attribute2>
    <activeDate>
      <from>11-22-99</from>
      <to>12-11-09</to>
    </activeDate>
    <Coordinate_BOX id="101">
      <coords>-123.1,49.25 -122.9,49.15</coords>
    </Coordinate_BOX>
  </Feature>
  <Feature>
    <attribute1>June</attribute1>
    <attribute2>Surrey</attribute2>
    <activeDate>
      <from>02-25-05</from>
      <to>9-15-10</to>
    </activeDate>
    <Coordinate_BOX id="102">
      <coords>-122.8,49.12 -122.5,49.0</coords>
    </Coordinate_BOX>
  </Feature>
</FeatureCollection>

Example Output FME Features

The XMLFlattener transformer allows us to read all the attributes below the <Feature> tag, and generates the following 2 output records:

attribute1 = John
attribute2 = Vancouver
activeDate_from = 11-22-99
activeDate_to = 12-11-09
Coordinate_BOX_id = 101
Coordinate_BOX_coords = -123.1,49.25 -122.9,49.15

attribute1 = June
attribute2 = Surrey
activeDate_from = 02-25-05
activeDate_to = 9-15-10
Coordinate_BOX_id = 102
Coordinate_BOX_coords = -122.8,49.12 -122.5,49.0
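
To make the flattening idea concrete, here is a minimal Python sketch that mimics the result shown above using only the standard library. It is not how FME implements XMLFlattener. The function names `flatten` and `flatten_at` are hypothetical, the underscore-joined names simply follow the example output, and the sketch assumes that the query node's own name and attributes are left out of the attribute names.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the example source XML (illustration only).
SOURCE = """<FeatureCollection>
  <Feature>
    <attribute1>John</attribute1>
    <activeDate><from>11-22-99</from><to>12-11-09</to></activeDate>
    <Coordinate_BOX id="101">
      <coords>-123.1,49.25 -122.9,49.15</coords>
    </Coordinate_BOX>
  </Feature>
  <Feature>
    <attribute1>June</attribute1>
    <activeDate><from>02-25-05</from><to>9-15-10</to></activeDate>
    <Coordinate_BOX id="102">
      <coords>-122.8,49.12 -122.5,49.0</coords>
    </Coordinate_BOX>
  </Feature>
</FeatureCollection>"""


def flatten(element, prefix=""):
    """Collapse everything nested under `element` into one flat dict."""
    flat = {}
    for child in element:
        name = prefix + child.tag
        # XML attributes of a nested element become <element>_<attribute>.
        for key, value in child.attrib.items():
            flat[f"{name}_{key}"] = value
        if len(child):
            # Nested elements: recurse with a longer name prefix.
            flat.update(flatten(child, name + "_"))
        else:
            flat[name] = (child.text or "").strip()
    return flat


def flatten_at(xml_text, query_node):
    """Return one flat record per occurrence of `query_node`."""
    root = ET.fromstring(xml_text)
    return [flatten(node) for node in root.iter(query_node)]


features = flatten_at(SOURCE, "Feature")
for feature in features:
    print(feature)
```

Each dictionary plays the role of one output feature. Repeated child elements are not handled in this sketch; a later value would simply overwrite an earlier one.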

Feature Types

Remember, you can always have as many queries as you want. Therefore, there is no need to capture all the information in one XMLFlattener query. Look for repeating structures. If you have more than one repeating structure, consider querying each separately and making a separate feature type for each. An example of this is provided in XMLFlattener_Demo1.fmw, attached to this article. This workspace has 2 XMLFlatteners. One queries at node = <Feature>, which gives you the features above. The second one queries at node <Coordinate_BOX>, which just gives you the records:

Coordinate_BOX_id = 101
Coordinate_BOX_coords = -123.1,49.25 -122.9,49.15
Coordinate_BOX_id = 102
Coordinate_BOX_coords = -122.8,49.12 -122.5,49.0

So you could use this approach to create different feature types based on the same source dataset. You can have one XML reader with a feature path that extracts from the root node, you can have multiple XML readers with a different path expression on each, or you can just use AttributeFileReaders to read your XML onto a Creator feature and then process that with separate XMLFlatteners.

Interpreting Geometry from XMLFlattener's Results

So XMLFlattener works fine to extract XML into features and fields, but how do we generate the feature geometry? If all you had was a point stored as x and y values for each feature, then the easiest thing to do would be to just use a 2DPointAdder. Typically, with xfmaps, geometry rendering is done by employing special xfmap methods to match geometry features and interpret them as points, lines, arcs, polygons, etc. This is probably the most efficient approach if you are going to work with more complex geometry structures. But if you just want to read basic geometries, here is an example that shows how you can do this within Workbench for things more complex than just points.

Out of the box, we can see from the output above that the field created by XMLFlattener, which holds our geometry information, is:

Coordinate_BOX_coords = -123.1,49.25 -122.9,49.15

To build geometry from this we need to go through the following steps:

1. Use an AttributeExposer to expose the geometry field: 'Coordinate_BOX_coords'.
2. Get the coordinates into a form that can be read by GeometryReplacer. One of the easiest methods to use is GeoRSS, which is just space-delimited coordinates, so use a StringReplacer to replace all the commas with spaces.
3. Again, to comply with the GeoRSS requirements, make the coordinate list into a valid GeoRSS expression. To do this, concatenate <georss:line xmlns:georss="http://www.georss.org/georss"> + Coordinate_BOX_coords + </georss:line>
4. Point the GeometryReplacer at the _concatenated output and choose the GeoRSS method.
5. Do a little cleanup. The X and Y axes are in the wrong order, so use an Affiner with 0,1,0,1,0,0, which will swap the axes on all the vertices.
6. Finally, use a BoundingBoxReplacer to convert the lines into polygons, because the original intent of the data was to store boxes, not lines.
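
The steps above can be traced outside of Workbench with a small Python sketch. It only illustrates the data manipulation and is not what the transformers do internally. All function names are hypothetical. It assumes that the GeoRSS line is read as "lat lon" pairs, and that the six Affiner values map to x' = A*x + B*y + C and y' = D*x + E*y + F.

```python
import xml.etree.ElementTree as ET

GEORSS_NS = "http://www.georss.org/georss"


def to_georss_line(coords):
    # Steps 2-3: commas become spaces, then the list is wrapped in georss:line.
    spaced = coords.replace(",", " ")
    return f'<georss:line xmlns:georss="{GEORSS_NS}">{spaced}</georss:line>'


def read_georss_line(georss_xml):
    # Step 4 (assumed behaviour): the first number of each pair is taken as
    # latitude (y) and the second as longitude (x).
    values = [float(v) for v in ET.fromstring(georss_xml).text.split()]
    return [(lon, lat) for lat, lon in zip(values[0::2], values[1::2])]


def affine(vertices, a, b, c, d, e, f):
    # Step 5: x' = a*x + b*y + c and y' = d*x + e*y + f for every vertex.
    return [(a * x + b * y + c, d * x + e * y + f) for x, y in vertices]


def bounding_box(vertices):
    # Step 6: replace the line with the closed ring of its bounding box.
    xs = [x for x, _ in vertices]
    ys = [y for _, y in vertices]
    lo_x, hi_x, lo_y, hi_y = min(xs), max(xs), min(ys), max(ys)
    return [(lo_x, lo_y), (hi_x, lo_y), (hi_x, hi_y), (lo_x, hi_y), (lo_x, lo_y)]


coords = "-123.1,49.25 -122.9,49.15"
line = read_georss_line(to_georss_line(coords))  # axes are swapped here
swapped = affine(line, 0, 1, 0, 1, 0, 0)         # x and y exchanged back
box = bounding_box(swapped)
print(box)
```

Because the source stores x before y while GeoRSS expects latitude first, the vertices come out with their axes exchanged, which is why the 0,1,0,1,0,0 transformation is needed before the bounding box is built.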

Considerations

The underlying xfmap structure approach that is used builds attribute names of the form parent.childAttribute for everything below the query node. Also, where 'childAttribute' occurs multiple times, you get a list construct in FME: childAttribute{0}, childAttribute{1}. You could then decide to match the tag at the childAttribute level rather than the parent level, or you could use a ListExploder within FME to create individual features for every list element. If you do this, make sure you preserve the parent and grandparent IDs in the advanced settings.
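
As a rough illustration of the list construct, the Python sketch below numbers repeated child elements as name{0}, name{1} and then "explodes" one list into separate records. The functions `flatten_with_lists` and `explode` are hypothetical stand-ins, not FME APIs, and the choice to copy all other attributes onto every exploded record is an assumption made for the example.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def flatten_with_lists(element, prefix=""):
    """Flatten children, numbering any tag that repeats under one parent."""
    flat = {}
    counts = Counter(child.tag for child in element)
    seen = Counter()
    for child in element:
        name = prefix + child.tag
        if counts[child.tag] > 1:
            # Repeated element: append a list index such as e{0}, e{1}.
            name = f"{name}{{{seen[child.tag]}}}"
            seen[child.tag] += 1
        if len(child):
            flat.update(flatten_with_lists(child, name + "."))
        else:
            flat[name] = (child.text or "").strip()
    return flat


def explode(flat, list_name):
    """Make one record per element of the list called `list_name`."""
    start = list_name + "{"
    elements = {k: v for k, v in flat.items()
                if k.startswith(start) and k.endswith("}")}
    others = {k: v for k, v in flat.items() if k not in elements}
    return [{**others, list_name: value} for value in elements.values()]


node = ET.fromstring("<d><e>1</e><f>a</f><e>2</e><f>b</f></d>")
flat = flatten_with_lists(node)
records = explode(flat, "e")
print(flat)
print(records)
```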

This is a quick way to read XML data into FME, particularly if it is attribute-only. If there are geometry attributes, then you will need to build the geometry using FME's geometry builder transformers, such as 2DPointAdder, PointConnector, AreaBuilder, and GeometryReplacer. GeometryReplacer is particularly useful in that it allows you to build points, lines, or entire polygons directly from the kind of large point lists that you see when extracting XML and GML - see Interpreting Geometry from XMLFlattener's Results above.

As stated above, to actually make use of the fields created, if you use the XMLFlattener and not the XML reader, you have to either use AttributeExposer to list the fields created or you have to import the new flat schema onto your destination feature type using the Writer - Import Feature Types command with the XML Reader. Set up the XML reader with the feature path set to the XML node from which you want to flatten. Note that you can always use your Viewer to explore what fields you have available.

This process does not preserve the original XML structure. By its nature, 'flattening' means you are coercing the structure down to a flat relational or table-like structure. This is irreversible, as the original structure is not stored anywhere. If you want to read XML and just update a few fields, you may be better off using XMLUpdater or XQueryUpdater with XQuery commands. The alternative is to use xfmaps to get at the field values you want and the XMLTemplates to rebuild the structure, but this may be more work.

Field names can get pretty long, depending on which query node you choose. If you have a structure that looks like:

<a><b><c><d>
   <e>1</e>
   <f>a</f>
   <e>2</e>
   <f>b</f>
</d></c></b></a>

Flattening at node <a> will give you fields that look like:

a_b_c_d_e=1
a_b_c_d_f=a
a_b_c_d_e=2
a_b_c_d_f=b

whereas if you flatten at d you will get:

d_e=1
d_f=a
d_e=2
d_f=b

If you choose node = e you will only get:

e=1
e=2

Choose your query node high enough to retrieve the data you want, but deep enough to minimize your field lengths. Note, you can always remove long prefixes later with the AttributeExpressionRemover.
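
If the attribute names have already been generated, shortening them is a simple renaming step. The Python sketch below only illustrates the idea on a dictionary of attributes; `strip_prefix` is a hypothetical helper and not an FME transformer.

```python
def strip_prefix(feature, prefix):
    """Return a copy of `feature` with `prefix` removed from matching names."""
    return {
        (name[len(prefix):] if name.startswith(prefix) else name): value
        for name, value in feature.items()
    }


feature = {"a_b_c_d_e": "1", "a_b_c_d_f": "a", "id": "7"}
short = strip_prefix(feature, "a_b_c_")
print(short)
```

Names that do not start with the prefix are left unchanged.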