Extracting Text and Tabular data from PDF

Liz Sanderson

FME Version

  • FME 2023.0

Introduction

Transportation agencies often receive Traffic Speed Reports from consultants for multiple locations across their municipalities. Many important types of information can be derived from these reports. The most common is Average Daily Traffic (ADT) data, which supports transportation planning, roadway design and construction, and the operation of a city’s road network. Traffic Speed Reports usually have a consistent PDF format containing a mixture of text and tables, all of which can be inspected and extracted in FME Form.

This article demonstrates how the Adobe Geospatial PDF Reader can read in these fairly complex PDFs and extract important information including Report Location and Collection Date, Travel Direction, and ADT counts. Please note that while we are using PDFs for this demo, the workflow can also be accomplished with any other format supported by FME like Word, Excel, CSV, JSON, etc. (500+ formats). 


image19.png

(Viewing the PDF document in Adobe Acrobat Reader. This sample Traffic Speed Report was recreated from an actual report from San Jose DOT.)
 

Scenario

You are a GIS Specialist at a local Department of Transportation. Your DOT has consultants collecting traffic counts at different locations across the municipality. You receive dozens of new reports every week and have to manually extract the traffic counts and load them into a file geodatabase before publishing to the City’s open data portal. This process is not only time-consuming but also prone to manual-entry errors. To address this, you are creating an FME Workspace to automatically read the PDF reports and extract the information of interest.
 

Step-by-Step Instructions

Part 1: Read the PDF and extract important attributes

1. Add an Adobe Geospatial PDF Reader

Add a new Adobe Geospatial PDF Reader to FME Workbench. Set the Source Adobe Geospatial PDF to one of the Traffic Speed Reports. (ADTintersection.pdf)

Next, click on the Parameters button to open up the Format Parameters dialog. Now, we need to inspect the format of our input PDF before populating all the parameters. Overall, this Traffic Speed Report has a consistent format containing several lines of text describing the collection site, and two tables showing hourly traffic counts for multiple speed ranges, and their respective collection dates. Since the PDF does not have any Spatial Data (maps), make sure that the Spatial box is unchecked and the Non-Spatial box is checked. Under the Non-Spatial drop-down, change Read Non-Spatial Text to Yes.

All other parameters can be left as default. Click OK twice to add the AdobeGeospatialPDFReader to the canvas. 

image8.png

After the Reader Feature Type has been added to the canvas, click the cogwheel on the right of the feature type, switch to the Format Attributes tab, and type "pdf_page" in the filter bar to find the following attributes: pdf_page_number and pdf_page_text. Make sure they are selected by checking the blue boxes under the Exposed column before clicking OK.

image56.png

Next, run the workspace and confirm that all 3 pages of the PDF are read into the workspace. To inspect, click on the Table icon in the Visual Preview pane and make sure that the table shows the two attributes we selected previously, with 3 rows of data, one for each page of the PDF. You can also check the content read from each page by clicking the ellipsis button at the end of each row. At this point, every page is represented as an individual feature, with all of its text stored in the pdf_page_text attribute. Contents read from the first page will look like the screenshot below.

2. Parse the PDF and Explode Content to Individual Features
Next, we will add two transformers to the canvas. First, an AttributeSplitter to split the text attribute line by line, followed by a ListExploder to explode these lines to individual features. 

Connect AdobeGeospatialPDFReader with the AttributeSplitter then click on the cogwheel to open up its parameters dialog. 

For Attribute to Split, select pdf_page_text. For the second parameter, Delimiter or Format String, click on the arrow at the end and choose Open Text Editor. In the Text Editor window, press Enter/Return once to use a newline as the list delimiter, double-check that the text cursor is flashing on the second row, then click OK. For the third parameter, Trim Whitespace, choose Both to trim any whitespace before and after every line. List Name and Drop Empty Parts can be left as default. See the screenshot below to double-check all the parameters.

image54.png


*Tip: In the Text Editor window, you can also click Options next to the Help button and check Show Spaces/Tab to expose these special characters. As a result, the Enter/Return character here is shown as LF.

image25.png


Next, connect the ListExploder transformer to the output port of the AttributeSplitter. Enter the ListExploder Parameters as follows: 

  • List Attribute: _list{}
  • Accumulation Mode: Merge List Attributes
  • Conflict Resolution: Use List Attribute Values
  • Element Index: (Delete the default text and leave it blank)
image2.png
 

Save the workspace and run it. Now look at the attribute table in Visual Preview: every line of each page is now a single feature. In coding terms, each page's text has been split into a list of lines, and each line has become its own record. Confirm that the Elements port of the ListExploder cached 160 features.

image33.png
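For readers who think in code, the two steps above (AttributeSplitter, then ListExploder) can be sketched in plain Python. This is only an illustration of the idea, not how FME implements it: the function name `explode_pages`, the `drop_empty` flag, and the layout of the sample page text are assumptions made for this example.

```python
def explode_pages(pages, drop_empty=False):
    """Split each page's text on newlines and emit one record per line.

    `pages` is a list of (pdf_page_number, pdf_page_text) tuples.
    The name and the drop_empty flag are hypothetical, chosen for this sketch.
    """
    rows = []
    for page_number, page_text in pages:
        for part in page_text.split("\n"):      # newline used as the delimiter
            line = part.strip()                 # like Trim Whitespace: Both
            if drop_empty and not line:
                continue
            rows.append({"pdf_page_number": page_number, "_list": line})
    return rows


# Sample text assembled from lines quoted in this article; the layout is assumed.
sample_pages = [
    (1, "Site: [14266NN] 1ST ST N OF MAIN ST.\n"
        "Direction: North (bound)\n"
        "* Wednesday, May 24, 2023"),
]
lines = explode_pages(sample_pages)
```

Each dictionary in `lines` plays the role of one feature leaving the Elements port, with the line text stored under `_list`.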
 
3. Understand the Attributes to be Extracted
One best practice when working with any kind of data is to understand your data thoroughly before performing any analysis. Let’s get familiarized with all the important information from the PDF before finding ways to extract them. For data collection and mapping purposes, pay attention to the following information: 

  • Site: [14266NN] 1ST ST N OF MAIN ST. Let’s pay attention to the string inside the brackets, 14266NN. The first part of this string is a number representing the Intersection ID, which will be used for a table join later on. The second part indicates the relative location of this collection point. NN actually means North of the intersection, vehicles traveling Northbound. These acronyms differ across different speed reports depending on the collection point and travel direction. However, we will only need to understand the very first letter of the entire string, which indicates the relative location of our collection point (the letter N in this case means N of the Intersection). 
  • Direction: North (bound). This indicates the vehicles’ travel direction being collected. There is only one travel direction for each report. 
  • * Wednesday, May 24, 2023. This is the traffic counts’ collection date. There are at least 2 continuous collection dates in every report. 
  • Traffic counts - all the attributes within the traffic count table, right below each collection date. Each row of the table contains traffic counts recorded every 30 minutes for different speed ranges (e.g. the column name Vbin 35-40 means a speed range of 35-40 mph) throughout the entire collection day. The last column is the total for each half-hour period. Pay attention to the number of columns and their order; we will need this information at a later step. There are at least two traffic count tables in every report, corresponding to the collection dates.
image38.png
(Sample Traffic Speed report with all important information highlighted) 


4. Add a TestFilter for the Four Important Attributes
Now that we are quite familiar with all these important attributes and the PDF has already been parsed into individual features, let’s have each attribute filtered to a separate stream. To start, add a TestFilter to the canvas, and connect it to the Elements port of the ListExploder. Then, open up its parameters and format the port definitions to filter the four attributes we need.

First, for Traffic Counts, we need to filter all the lines that start with 4 digits. As you may have noticed, the 4 digits at the beginning of each row indicate the collection time in 24-hour format (e.g. 0000 is 00:00, midnight). Therefore, configuring the filter statement to match only features starting with 4 digits will leave us with the rows of the traffic count table. From the TestFilter Parameters window, double-click on the first If statement to open up its Test Conditions dialog box and fill in the following:

image15.png

  • Left Value: click on the arrow at the right end, point to Attribute Value then select the _list attribute. 
  • Operator: from the dropdown list, select Contains Regex.
  • Right Value: click on the arrow and select Open Regex Editor then paste the following expression to the Regular Expression box
^\d{4}
  • Output Port: change to TrafficCounts

image42.png
The regular expression might look intimidating at first, but the Quick Reference can help you break down what each part of the REGEX means. 

^ Start of line
\d Any digit
{4} Exactly 4 repetitions of the preceding token (here, a digit)

Putting it all together, this expression simply means “Match any string (line) that starts with 4 digits”. You can confirm that our regular expression is correct by typing any 4 digits into the Test String box and see if all digits are highlighted in yellow. 

Click OK to close the Regular Expression Editor and OK again to exit the Test Conditions dialog. Next, we will create another test to filter the second piece of information, Collection Dates. Click on the Else If statement on the second row to open its Test Conditions dialog box, then format the Test Clauses as follows:

  • _list Begins With * 
  • AND _list Contains Regex \d{4}
  • Output Port: CollectionDate

image39.png
This entire Test Clause simply means “Match any string that begins with an asterisk (*) AND contains 4 digits”. If you can recall the collection date from our input PDF (* Wednesday, May 24, 2023), that string starts with an asterisk and contains four digits corresponding to the collection year. If configured correctly, this test clause will output at least two collection dates from each report. 

Now move on to the third attribute, Collection Point. Taking another look at the sample PDF, we can see that the line indicating our Collection Point starts with the word “Site” followed by a colon (Site:). Thus, this Test Clause is much more straightforward. Click on the next Else If statement (third row) and format its Test Clause as follows:

  • _list Begins with Site:
  • Output Port: CollectionPoint

image32.png
For the final attribute, Travel Direction, its Test clause will be very similar to the previous one, as the line containing this information starts with the word “Direction” followed by a colon (Direction:). 

  • _list Begins with Direction:
  • Output Port: TravelDirection

image27.png

Click OK twice to close the Test Conditions and the TestFilterParameters dialog boxes. Save the workspace then click Run.
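The four tests configured above can also be tried outside FME. The sketch below mirrors the TestFilter's If / Else If order in plain Python; the function name `classify`, the fallback label "Unfiltered", and the numbers in the sample count row are placeholders for illustration, not values from the report.

```python
import re


def classify(line):
    """Return the name of the output port a line would be routed to.

    Tests run in order and the first match wins, as in an If / Else If chain.
    """
    if re.search(r"^\d{4}", line):                          # starts with 4 digits
        return "TrafficCounts"
    if line.startswith("*") and re.search(r"\d{4}", line):  # asterisk + a 4-digit run
        return "CollectionDate"
    if line.startswith("Site:"):
        return "CollectionPoint"
    if line.startswith("Direction:"):
        return "TravelDirection"
    return "Unfiltered"                                      # placeholder for everything else


samples = [
    "0000 0 0 1 2 3",                       # hypothetical count row
    "* Wednesday, May 24, 2023",
    "Site: [14266NN] 1ST ST N OF MAIN ST.",
    "Direction: North (bound)",
    "Traffic Speed Report",                 # hypothetical header line
]
ports = [classify(s) for s in samples]
```

Because the tests are evaluated in order, a line is only checked against a later clause if every earlier clause failed.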

Double-check that the numbers of features at each output port are similar to the screenshot below and make sure you thoroughly understand these numbers by inspecting all the cached features from each port.

Feel free to organize your transformers, then add some bookmarks and annotations to document your workspace. This reminds you what each step does and keeps everything organized as the workspace becomes more complex.

image7.png
   
5. Extract Traffic Counts
Now that we have all the important attributes filtered to their individual output port, let’s extract each of them, starting with the Traffic Counts. To do so, we will extract individual count values inside each line feature using the AttributeSplitter again. Add another AttributeSplitter to the canvas and connect it to the TrafficCounts port from the TestFilter. The parameters will be populated as follows: 

  • Attribute to Split: _list
  • Delimiter or Format String: space character (highlighted in blue in the screenshot)
  • Trim Whitespace: Both
  • List Name: change to a more intuitive name (i.e. _countvalue)
image4.png


Click Run To This.

If you check the AttributeSplitter_2’s output port, there are still 96 features, just like from the TestFilter; however, each of these features now contains a list attribute holding the data from each column in the original table. We can confirm that our AttributeSplitter was configured correctly by reviewing the Feature Information pane on the right of the canvas. Under the Attributes section, there are 18 list elements called _countvalue{} in each feature. Can you guess which _countvalue element corresponds to the collection time? We will need to revisit these list attributes at a later step.

(Hint: _countvalue index starts from 0 and this value corresponds to the first column from our traffic count table)

image29.png
To access more training on list attributes handling, please refer to the following module Work With Multiple Data Models Using Lists. It is important to familiarize yourself with this concept as we will revisit it a few more times throughout this demo series. 

Our next task is to create groups of 48 half-hour increments for each collection day, using a custom transformer called Grouper. In this case, two groups are created from the 96 features. Add the Grouper transformer, connect it to the AttributeSplitter_2’s output port, and set the Sampling Rate to 48 before clicking OK. After running the workspace, take a look at the Visual Preview table and note the new field called _group_index: the first 48 features have an index of 0 and the rest have an index of 1. Keep these numbers in mind; we will use them as the join key to match all Traffic Count features with their respective Collection Dates. Save and document the workspace before moving on to the next step, Extract Collection Dates.

image36.png
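As a rough Python analogy for this step, splitting a count row on spaces gives the 18-element list, and integer division of the row position by 48 gives the group index. The helper names are hypothetical, and the speed-bin values in the sample row are invented; only the time (0000) and the total (16) come from the article.

```python
def split_counts(line):
    """Split one traffic count row into its column values (like _countvalue{})."""
    return line.strip().split(" ")


def add_group_index(rows, group_size=48):
    """Tag each row with the index of the block of `group_size` rows it falls in."""
    return [dict(row, _group_index=i // group_size) for i, row in enumerate(rows)]


# Hypothetical row: time, 16 speed bins (made-up values), then the total.
sample_row = "0000 0 0 0 0 1 3 5 4 2 1 0 0 0 0 0 0 16"
countvalue = split_counts(sample_row)

# 96 placeholder rows standing in for two days of half-hour counts.
grouped = add_group_index([{"row": i} for i in range(96)])
```

Here `countvalue[0]` is the collection time and `countvalue[17]` is the total, which is the answer to the question posed above.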

6. Extract Collection Dates
To extract Collection Dates, we will use a StringReplacer to remove all characters before the collection month, including the asterisk, the weekday, and the comma and space after it (* Wednesday, ). Add a StringReplacer transformer to the canvas and connect it to the CollectionDate port from the TestFilter. Fill in the following parameters:

  • Attributes: _list
  • Mode: Replace Regular Expression
  • Case Sensitive: No
  • Text To Replace: Open the Regular Expression Editor and paste in the following expression
.*y,\s
  • Replacement Text: (leave blank)
  • Set Attribute Value To: <No Action>
image13.png

Here is a breakdown of the regular expression provided, in case you are not comfortable with Regex yet: 
. Any single character
* Zero or more of the preceding token, so .* matches any run of characters
y, The literal letter y followed by a comma (e.g. the end of Wednesday,)
\s Any whitespace character

image40.png
This entire expression simply means “Match everything up to and including the letter y, the comma after it, and the whitespace that follows”.

After running the workspace, confirm that the StringReplacer outputs only two date features with the following format: Month DD, YYYY.

Now, we will convert our date features into a numeric format (FME format) using a transformer called DateTimeConverter. Add the DateTimeConverter to the canvas and connect it to the StringReplacer. Use the following information to fill in the parameters:

  • DateTime Attributes: _list
  • Input Format: %B %d, %Y
  • Output Format: FME
  • Repair Overflow: Yes
  • Passthrough Nulls, Empties or Missing: No
image11.png


Preview the output in the Preview Data section right under General parameters.

Click OK then Run To This. We now have our date string in a numeric format. (Any other date/time format, including custom formats, could be used if needed; we just use the FME format for this demo.)
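The StringReplacer and DateTimeConverter steps can be approximated in Python with `re.sub` and `datetime.strptime`, which accept the same regular expression and the same `%B %d, %Y` format string. This is a sketch under the assumption that the FME date format corresponds to `%Y%m%d`; the function name `to_fme_date` is made up for this example.

```python
import re
from datetime import datetime


def to_fme_date(line):
    """Strip the '* Weekday, ' prefix and convert 'Month DD, YYYY' to YYYYMMDD."""
    # Same pattern as the StringReplacer; IGNORECASE mirrors Case Sensitive: No.
    month_day_year = re.sub(r".*y,\s", "", line, flags=re.IGNORECASE)
    parsed = datetime.strptime(month_day_year, "%B %d, %Y")
    return parsed.strftime("%Y%m%d")


fme_date = to_fme_date("* Wednesday, May 24, 2023")
```

For the sample line, `fme_date` is the string 20230524. Note that `%B` relies on English month names, so the sketch assumes the default (C/English) locale.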

However, our field name needs to be more intuitive than just “_list”. We also want to remove the other two fields (pdf_page_number and pdf_page_text) at this point. So let’s add an AttributeManager and connect it to the DateTimeConverter. Refer to the following screenshot to configure its parameters.

image31.png

Later on, in order to join our Collection Dates with the Traffic Counts extracted in the previous part, we will need a join key. Remember the _group_index attribute from the Grouper earlier? We will replicate these group indexes for each date extracted so that the first Collection Date has an index of 0 and the second Collection Date has an index of 1. These correspond to the indexes of the two groups of 48 traffic count features we extracted previously. Add the Counter transformer to the canvas and connect it to the AttributeManager. Make sure the Counter parameters are set as follows:

  • Count Start: 0
  • Output Attribute Names
    • Count: _groupindex
    • Group ID: (leave blank)

image9.png

After running the workspace, double-check the Counter’s cached features and confirm that each collection date has a group index. We will perform a table join at the final step, when we are done extracting all the important attributes. Now, save and document the workspace before extracting Collection Point.
 
image47.png
7. Extract Collection Point and Intersection ID
For the third important attribute, Collection Point, we will also extract IntersectionID from the same string feature. This IntersectionID will serve as the join key in the second demo of this series, when we intend to join all the extracted attributes to a point geometry, the Intersection point.

image34.png
To start, we will first extract the entire string between the square brackets using a StringSearcher, paired with another advanced Regular Expression. Add the StringSearcher to the canvas and connect it with the CollectionPoint output port from the TestFilter. Fill in the parameters as follows:

  •  Search In: _list
  • Contains Regular Expression: open the Regex Editor and paste in the following expression 
(?<=\[).*?(?=\])
  •  Matched Result: _first_match

image46.png
Don’t be intimidated by this complex-looking expression, here is a breakdown of it: 
(?<=\[) A positive lookbehind to find any text string preceded by an opening square bracket [.
.*? Any character or sequence of characters, as few as possible. The question mark ? makes it non-greedy, meaning it will match as few characters as possible. This ensures that the match does not extend beyond the closing square bracket. 
(?=\]) A positive lookahead to check if the text string is followed by a closing square bracket ].
Putting it all together, this expression simply means “Match any string that is enclosed within the square brackets”. It will capture the content inside the brackets, but not the brackets themselves. 

After running the workspace, double-check our output feature to see if the StringSearcher correctly matched the text string we wanted (14266NN). Next, we will add two more StringSearchers to separate the IntersectionID and the Collection Point (the letters after the IntersectionID) from this string. Connect the second StringSearcher to the first one. The parameters for the second StringSearcher are as follows:

  • Search In: _first_match
  • Contains Regular Expression: .\d+
  • Matched Result: IntersectionID
image35.png


The regular expression .\d+ is a pattern used to match a character followed by one or more digits. Breaking it down:
. Any single character.
\d+ One or more digits. (The \d represents any digit from 0 to 9, and the + quantifier indicates that there must be at least one digit and allows for additional digits).
Putting it together, the regular expression means “Match any character followed by one or more digits”. When applied to the bracketed string (14266NN), which starts with a digit, the leading . matches the first digit and \d+ matches the rest, so the result is all the digits within our string.

Make sure the Matched Result parameter is changed to IntersectionID before clicking OK. Next, connect the second StringSearcher to the third one. The third StringSearcher should be configured as follows:

  • Search In: _first_match
  • Contains Regular Expressions: [a-z,A-Z]*$
  • Case Sensitive: Yes
  • Matched Result: CollectionPoint 
image43.png


As you are a bit more comfortable with Regex by now, you might figure out that the regular expression [a-z,A-Z]*$ simply means “Match the run of letters at the very end of the string” (the comma inside the brackets is also treated as an allowed character, which is harmless here). This extracts the last two characters (NN), which indicate the Collection Point.

Click OK then run the workspace and confirm that it only extracts letters out of the given string and your output feature now has two attributes: IntersectionID and CollectionPoint. 

image14.png
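The three StringSearcher patterns from this step can be tested quickly in Python. The sketch below assumes that the second and third patterns are applied to the bracketed string (the _first_match value) rather than to the full line; the function name `parse_site` is hypothetical.

```python
import re


def parse_site(line):
    """Extract the bracketed code, then split it into ID digits and trailing letters."""
    # Lookbehind/lookahead: text between [ and ], brackets excluded.
    first_match = re.search(r"(?<=\[).*?(?=\])", line).group()
    # Any character followed by digits; on a string that starts with a digit
    # this returns the leading run of digits.
    intersection_id = re.search(r".\d+", first_match).group()
    # Letters (or commas) anchored at the end of the string.
    collection_point = re.search(r"[a-z,A-Z]*$", first_match).group()
    return {
        "_first_match": first_match,
        "IntersectionID": intersection_id,
        "CollectionPoint": collection_point,
    }


site = parse_site("Site: [14266NN] 1ST ST N OF MAIN ST.")
```

If the second pattern were run against the full line instead, its first match would begin at the opening bracket, which is why the sketch searches the bracketed string.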
Ensure that the workspace is Saved and properly documented like the screenshot below before moving on to the next step.    

image5.png
 
8. Extract Travel Direction 
Now we will extract the last attribute needed from the PDF, Travel Direction. Add another StringSearcher to the canvas and connect it to the TravelDirection port from the TestFilter. Trust me, this will be the last StringSearcher we need for this workspace! Open up its parameters and specify the following values:

  •  Search In: _list
  • Contains Regular Expressions: (?<=\:).*?(?=\()  

It is just a slightly different version of the Regex we used to extract the string between square brackets in a previous step. Here, the opening square bracket is replaced with a colon, and the closing square bracket is replaced with an opening parenthesis. Our goal is to extract the word “North” that is located after "Direction:" and before "(bound)". This expression will also work for any other directions found in different reports.

  • Matched Result: TravelDirection

Click OK, then Run the workspace. If you take a closer look at the output feature, you can tell that the expression did not do a perfect job, as the result also includes a space character before and after the word North.

image41.png

Therefore, we need to add an AttributeTrimmer and connect it to StringSearcher_4 to get rid of those space characters. Configure the parameters as follows then click Run:

  • Attribute to Trim: TravelDirection
  • Trim Type: Both
  • Trim Characters: (space) 

image49.png

Next, let’s use a FeatureMerger to merge this TravelDirection feature with the one from our previous step that contains both Intersection ID and Collection Point. Connect the Matched port from the third StringSearcher to the FeatureMerger’s Requestor port, and the Output port from the AttributeTrimmer to the Supplier port. Set the same Join On value for the Requestor and Supplier (i.e. 1). Leave other parameters as default, click OK then Run the workspace.

image1.png
 
To elaborate, this specific configuration enables data joining even when there is no common key attribute. By setting both Join On parameters to the same constant, we can join a single-feature Supplier to each feature in the Requestor, so that every incoming Requestor feature picks up the attributes and corresponding values from the Supplier’s single feature. This approach should only be used with a Supplier that contains exactly one feature. Otherwise, you will need a key attribute to perform the join.
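A minimal Python sketch of this step is shown here: the direction is pulled out with the same regular expression, trimmed, and then copied onto every requestor record. The function names are hypothetical, and the sample records contain no overlapping attribute names, so the sketch does not model how FME resolves attribute conflicts.

```python
import re


def extract_direction(line):
    """Return the text between ':' and '(' with surrounding spaces removed."""
    matched = re.search(r"(?<=\:).*?(?=\()", line).group()  # ' North ' for the sample
    return matched.strip(" ")                                # like Trim Type: Both


def merge_single_supplier(requestors, supplier):
    """Copy the single supplier's attributes onto every requestor record.

    Equivalent in spirit to joining on a constant key (1 == 1).
    """
    return [{**requestor, **supplier} for requestor in requestors]


direction = extract_direction("Direction: North (bound)")
merged = merge_single_supplier(
    [{"IntersectionID": "14266", "CollectionPoint": "NN"}],
    {"TravelDirection": direction},
)
```

With more than one supplier record, a constant key would pair every requestor with every supplier, which is why a real key attribute is needed in that case.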

After extracting all the important attributes from the PDF, your workspace should look similar to the screenshot below. The last section of this demo will focus on merging all the extracted attributes into the same feature and summarizing hourly counts to Average Daily Traffic counts.   

image55.png

Part 2: Merge attributes and Calculate Average Daily Traffic 

1. Merge Attributes
Remember what we did at the end of the Extract Collection Dates step? We used a Counter to replicate the group index to serve as the join key. Now, we will join all the Traffic Counts with their respective Collection Date.

Let’s add a FeatureMerger to the canvas and connect the Grouper’s Output port to the FeatureMerger’s Requestor port, and the Counter to the Supplier port. 
image20.png

Fill in the FeatureMerger_2 parameters as follows: 

  • Supplier First: No
  • Requestor: _group_index (from Traffic Counts)
  • Supplier: _groupindex (from Collection Dates) 
  • Comparison Mode: Automatic

image30.png

Click OK then Run To This and confirm that 96 features came out of the Merged port. 
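Conceptually, this merge is a keyed lookup: each count record finds the date whose counter value equals its group index. The Python sketch below illustrates that idea only; the function name, the `Time` and `Date` attribute names, and the second collection date are assumptions made for the example.

```python
def join_dates(count_rows, date_rows):
    """Attach the matching Date to each count row, keyed on the group index."""
    dates_by_index = {d["_groupindex"]: d["Date"] for d in date_rows}
    merged = []
    for row in count_rows:
        key = row["_group_index"]
        if key in dates_by_index:            # unmatched rows are left out
            merged.append({**row, "Date": dates_by_index[key]})
    return merged


# Placeholder records: 48 half-hour rows per day, two days.
count_rows = [{"_group_index": i // 48, "Time": f"{(i % 48) // 2:02d}{(i % 2) * 30:02d}"}
              for i in range(96)]
date_rows = [
    {"_groupindex": 0, "Date": "20230524"},
    {"_groupindex": 1, "Date": "20230525"},  # hypothetical second date
]
merged_counts = join_dates(count_rows, date_rows)
```

All 96 placeholder rows find a match, which corresponds to the 96 features expected at the Merged port.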

Next, we will output all the traffic counts to a tabular format that can be easily read and used by other teams or departments if needed. To do so, we need to map each of the list attributes to the corresponding column from the Traffic Count table schema. Add another AttributeManager to the canvas, connect it to the Merged port of the FeatureMerger_2 and start populating its parameters as follows: 

image45.png

To explain, we are creating 18 new fields in our feature and populating them with the data extracted from our Traffic Count table. The value of each field comes from the corresponding position in the _countvalue list created when each line was split in Part 1. For example, the Time value for the first feature comes from the first list element (_countvalue{0}), indicating that the collection time was 00:00. Similarly, the Total count value for the first feature comes from the 18th list element (_countvalue{17}), indicating that the total traffic count at 00:00 was 16. The same mapping is applied to every feature extracted from the Traffic Count table (96 features).

Run the workspace and check the table in Visual Preview. By now, we should have all the traffic counts organized by collection date and collection time across the different speed ranges.

Congratulations! You successfully extracted the traffic count table from the PDF and output it to a new feature type in FME. 

image52.png

You might want to export the table and share it with other teams or departments, using a CSV or MS Excel Writer at this point.   

2. Calculate Average Daily Traffic
However, since we want the Daily Traffic counts, we need another transformer to sum all the half-hour counts for each collection date. Let’s add a StatisticsCalculator to the canvas and connect it to the AttributeManager_2. For its parameters, enter the following:

  • Group Processing box: Checked
  • Group By: Date. 
  • Calculation Method: Numeric
  • Statistics to Calculate: 
    • TotalCount: Sum

image6.png

Run the workspace and check the output table to see if there are summed total counts for each of the collection dates. These are Daily Traffic counts. 

image53.png

Now for the Average Daily Traffic counts, we want to calculate the mean counts of both dates and keep only the first Collection Date. This can be done using another StatisticsCalculator. Add another StatisticsCalculator and connect it to the Summary Port from the first one. 
Set the parameters as follows:

  • Group Processing: Unchecked
  • Statistics to Calculate:
    • Date: Min 
    • TotalCount.sum: Mean 
image24.png


Save the workspace then click Run and ensure the visual preview table looks similar to the following screenshot, with the collection date showing 20191113 and ADT counts showing 2229. Don’t worry about the attribute names for now, we will change to more intuitive names after merging them with the Collection Point and Travel Direction attributes. 


image28.png
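The two StatisticsCalculator steps amount to a grouped sum followed by a mean. The Python sketch below shows that arithmetic on a few made-up rows; the function name and the counts are hypothetical, and the output keys simply reuse the attribute names mentioned in this article.

```python
from collections import defaultdict


def average_daily_traffic(rows):
    """Sum TotalCount per Date, then average the daily sums."""
    daily_totals = defaultdict(int)
    for row in rows:
        daily_totals[row["Date"]] += int(row["TotalCount"])   # first calculator: Sum by Date
    return {
        # YYYYMMDD strings sort chronologically, so min() gives the first date.
        "Date.min": min(daily_totals),
        "TotalCount.sum.mean": sum(daily_totals.values()) / len(daily_totals),
    }


# Made-up half-hour totals for two days.
sample_rows = [
    {"Date": "20230524", "TotalCount": "10"},
    {"Date": "20230524", "TotalCount": "20"},
    {"Date": "20230525", "TotalCount": "30"},
    {"Date": "20230525", "TotalCount": "40"},
]
adt = average_daily_traffic(sample_rows)
```

For these sample rows the daily sums are 30 and 70, so the average daily traffic is 50.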

If you recall from the previous step, we already merged the data from IntersectionID, CollectionPoint and TravelDirection into one single feature. Now we will use another FeatureMerger to merge that feature with the one we just created, putting all the important attributes together into one feature. Add another FeatureMerger to the canvas, connect the Summary port from the StatisticsCalculator_2 to its Requestor port and the Merged port from FeatureMerger to its Supplier port. Set the same Join On value for the Requestor and Supplier parameters (i.e. 1), click OK then Run.

image16.png
Check the output port from FeatureMerger_3 and make sure the feature contains all four important attributes we needed. 

image10.png

Finally, to polish the attribute names and remove unnecessary attributes, add another AttributeManager and configure its Parameters as follows: 

image51.png

Let’s break down all these parameters for your reference:

  • Change Date.min to Date.
  • Remove unnecessary attributes including pdf_page_number, pdf_page_text, _list, _first_match, and TotalCount.sum.mean
  • CollectionPoint values are formatted to more intuitive descriptions as we only need the very first letter to indicate its relative location from an intersection. (For example, either NN or NS will be set to N of). To do this, click on the ellipsis button in the column value to open up the Conditional Value Definition window. 

image22.png

 
Next, click on the first If Statement and populate the following test clause: 
  • Left Value: CollectionPoint
  • Operators: Begins With
  • Right Value: Click on the Ellipsis button to open up the Text Editor window and key in “N”. 
  • Value: N of
 
image3.png
 
The subsequent Else If statements for other directions are set up in a similar manner, with only minor changes in the Right Value (S, E, W), and Value (S of, E of, W of). Confirm that the Conditional Value Definition window looks like the screenshot above before moving on to the next parameter. 
  • Add the word “bound” after the TravelDirection value (i.e. North -> Northbound). To do this, click on the ellipsis button under the Value column to open up the Text Editor window. Under the FME Feature Attributes dropdown, select TravelDirection. The @Value(TravelDirection) will be automatically added to the Editor whiteboard; you only need to key in the word “bound” right after it. Make sure there is no space in between.

image26.png
 

  • Add two new attributes named ADTOne and ADTTwo; their conditional values are explained below.
  • Ensure that the TotalCount.sum.mean attribute is moved below ADTOne and ADTTwo. We can only remove it after it has been used to set the conditional values of ADTOne and ADTTwo.

Regarding the ADTOne and ADTTwo attributes, each of them represents the ADT counts from a specific Travel Direction extracted from the report. At a given collection point, there are typically two Traffic Speed Reports that gather data from opposite travel directions. Therefore, it is crucial for our ADT data to indicate its corresponding travel direction. ADTOne indicates the count for Northbound or Eastbound travel, while ADTTwo indicates the count for Southbound or Westbound travel. Set up the Conditional Value for ADTOne as follows:
  • TravelDirection Contains North
  • OR TravelDirection Contains East
  • Value: TotalCount.sum.mean 
image37.png
 
Similarly, the Conditional Value for ADTTwo is as follows:
  • TravelDirection Contains South
  • OR TravelDirection Contains West
  • Value: TotalCount.sum.mean 
 
image12.png


Save the workspace then click Run Entire Workspace. Double-check that your output table looks like the screenshot below.   

image17.png

As you can observe, this report pertains to the Northbound travel direction, and therefore the ADT data will be recorded in the ADTOne field. There is no need to be concerned about ADTTwo at this moment as it will be filled with data from the opposite direction, which will be extracted from another Traffic Speed Report. In an upcoming article of this series, we will demonstrate how to aggregate data from multiple reports and explore methods to automate the entire process using FME Flow.
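To summarize the conditional values configured in this AttributeManager, here is a small Python sketch of the same logic. The function name `finalize` is hypothetical, `None` stands in for an attribute left without a value, and the sample feature reuses values quoted in this article.

```python
def finalize(feature):
    """Apply the renames and conditional values described for the final AttributeManager."""
    point = feature["CollectionPoint"]
    # The first letter decides the relative location: NN or NS both become "N of".
    location = next((f"{d} of" for d in "NSEW" if point.startswith(d)), point)
    direction = feature["TravelDirection"]
    adt = feature["TotalCount.sum.mean"]
    return {
        "IntersectionID": feature["IntersectionID"],
        "Date": feature["Date.min"],
        "CollectionPoint": location,
        "TravelDirection": direction + "bound",   # no space in between
        "ADTOne": adt if ("North" in direction or "East" in direction) else None,
        "ADTTwo": adt if ("South" in direction or "West" in direction) else None,
    }


result = finalize({
    "IntersectionID": "14266",
    "Date.min": "20191113",
    "CollectionPoint": "NN",
    "TravelDirection": "North",
    "TotalCount.sum.mean": 2229,
})
```

Reading TotalCount.sum.mean before building the output mirrors the ordering note above: the value has to be available when ADTOne and ADTTwo are evaluated, and only then can it be dropped.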

 

Part 3: Join Extracted Data to an Intersection Point 

1. Add a Shapefile Reader 
 Our last step is to join all the extracted data to an intersection point, using the IntersectionID as the key attribute. To start, let’s add the Street Intersections shapefile using the Esri Shapefile Reader. Click Add Reader and make sure the Format is set to Esri Shapefile. For Dataset, browse to the demo folder and select the Street_Intersections.shp file.  

image21.png

2. Add an AttributeKeeper
Click OK then Run the workspace and double-check that the 22,660 intersections were read in. As you may notice, this feature type contains 36 attributes/fields in total. We only need a few of them, so let’s add an AttributeKeeper to remove the irrelevant attributes. Connect the AttributeKeeper to the Shapefile Reader. For the Attributes to Keep parameter, click on the ellipsis button near the end and check the following attributes:

  • ASTREETDIR (Street A Direction)
  • ASTREETNAM (Street A Name)
  • BSTREETDIR (Street B Direction)
  • BSTREETNAM (Street B Name)
  • FACILITYID (Facility ID)
  • INTNAME (Intersection Name)
  • INTNUM (Intersection Number which is IntersectionID)
  • LATITUDE 
  • LONGITUDE  

image57.png

3. Add a FeatureJoiner
Next, we will perform a table join between the output from the previous step with the Street Intersections using the FeatureJoiner transformer. Our key attribute will be IntersectionID, which is the same as the INTNUM attribute from the input Street_Intersections shapefile. Add a FeatureJoiner to the canvas, connect its Right port to the AttributeManager_3, and its Left port to the AttributeKeeper.  

image23.png

When prompted, set the Left Join On to INTNUM and the Right to IntersectionID. 

image48.png

Click OK then Run the workspace. Ensure that only one feature is cached at the output port of the FeatureJoiner and that the Intersection Point is displayed in the Graphics window.

image50.png
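The join in this step behaves like an inner join on INTNUM = IntersectionID, which is why only one intersection survives. The Python sketch below illustrates this with placeholder records; the function name, the intersection names, and the decision to compare keys as strings (in case INTNUM is stored as a number) are assumptions for this example.

```python
def inner_join(left_rows, right_rows, left_key, right_key):
    """Keep only left rows whose key matches a right row, merging their attributes."""
    right_by_key = {}
    for right in right_rows:
        right_by_key.setdefault(str(right[right_key]), []).append(right)
    joined = []
    for left in left_rows:
        for right in right_by_key.get(str(left[left_key]), []):
            joined.append({**left, **right})
    return joined


# Placeholder intersections; only INTNUM 14266 appears in the article.
intersections = [
    {"INTNUM": 14266, "INTNAME": "PLACEHOLDER A"},
    {"INTNUM": 99999, "INTNAME": "PLACEHOLDER B"},
]
extracted = [{"IntersectionID": "14266", "TravelDirection": "Northbound", "ADTOne": 2229}]
joined = inner_join(intersections, extracted, "INTNUM", "IntersectionID")
```

In FME, the geometry of the matching intersection comes along with the Left feature, which is what makes the point appear in the Graphics window.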

Lastly, save and properly document your workspace with annotations and bookmarks. You can use the following screenshot for reference. 

image44.png

Congratulations! You have successfully completed an FME workspace that is capable of reading a complex PDF file, extracting relevant data, and configuring attributes tailored to your project needs. By now, you should feel more confident working with list attributes and creating Regular Expressions for text extraction. Well done! 

 


Data Attribution

The data used is made available by the City of San Jose’s Department of Transportation.
