Extracting Text and Tabular Data from PDF

Files

SJDOT_ADT_Workspaces.zip (200 KB)
SJDOT_ADT_Output.zip (60 KB)
Street_Intersections.zip (2 MB)
Streets.zip (8 MB)
TrafficSpeedReports.zip (10 MB)

Introduction

Transportation agencies often receive Traffic Speed Reports from consultants at multiple locations across their municipalities. These reports can provide many important types of information. The most common one is Average Daily Traffic (ADT) data, which supports transportation planning, roadway design and construction, and the operation of a city’s road network. Traffic Speed Reports usually have a consistent PDF format containing a mixture of text and tables, all of which can be easily inspected and extracted in FME Form.

This article demonstrates how the Adobe Geospatial PDF Reader can read these fairly complex PDFs and extract important information, including Report Location and Collection Date, Travel Direction, and ADT counts. Please note that while we are using PDFs for this demo, the workflow can also be accomplished with any other format supported by FME, like Word, Excel, CSV, JSON, etc. (500+ formats).

Viewing the PDF document in Adobe Acrobat Reader. This sample Traffic Speed report was recreated from an actual one from the San Jose DOT

Scenario

You are a GIS Specialist at a local Department of Transportation. Your DOT has its consultants collecting traffic counts at different locations across the municipality. You receive dozens of new reports every week and have to manually extract the traffic counts and then populate them into a file geodatabase, before publishing to the City’s open data portal. This process is not only time-consuming but also prone to errors from manual entry. Thus, you are creating an FME Workspace to automatically read the PDF reports and extract any information of interest.

Step-by-Step Instructions

Part 1: Read the PDF and extract important attributes

1. Add an Adobe Geospatial PDF Reader
Add a new Adobe Geospatial PDF Reader to FME Workbench. Set the Source Adobe Geospatial PDF to one of the Traffic Speed Reports. (ADTintersection.pdf)

Next, click on the Parameters button to open up the Format Parameters dialog. Now, we need to inspect the format of our input PDF before populating all the parameters. Overall, this Traffic Speed Report has a consistent format containing several lines of text describing the collection site, and two tables showing hourly traffic counts for multiple speed ranges and their respective collection dates. Since the PDF does not have any Spatial Data (maps), make sure that the Spatial box is unchecked and the Non-Spatial box is checked. Under the Non-Spatial drop-down, change Read Non-Spatial Text to Yes.

All other parameters can be left as default. Click OK twice to add the AdobeGeospatialPDFReader to the canvas.

After the Reader Feature Type has been added to the canvas, click the cogwheel on the right of the feature type, then switch to the Format Attributes tab and type "pdf_page" in the filter bar to look for the following attributes:

pdf_page_number
pdf_page_text

Make sure they are selected by checking the blue boxes under the Exposed column before clicking OK.

Next, run the workspace and confirm that all 3 pages of the PDF are read into the workspace. To inspect, click on the Table icon in the Visual Preview pane and make sure that the table shows the two attributes we selected previously, with 3 rows of data, one for each page of the PDF. You can also check the content being read from each page by clicking the ellipsis button at the end of each row. At this point, every page is represented as an individual feature, with all of its text stored in the pdf_page_text attribute. Contents read from the first page will look like the screenshot below.

2. Parse the PDF and Explode Content to Individual Features
Next, we will add two transformers to the canvas. First, an AttributeSplitter to split the text attribute line by line, followed by a ListExploder to explode these lines into individual features.

Connect AdobeGeospatialPDFReader with the AttributeSplitter, then click on the cogwheel to open up its parameters dialog.

Attribute to Split: pdf_page_text
Delimiter or Format String: a newline character
- click on the arrow at the end and choose Open Text Editor. In the Text Editor window, press Enter/Return once to insert a newline as the list delimiter and double-check that the text cursor is flashing on the second row, then click OK.
Trim Whitespace: Both

See the screenshot below to double-check all the parameters.

Tip: On the Text Editor Window, you can also click Options next to the Help button and check Show Spaces/Tab to expose these special characters. As a result, the Enter/Return character here is shown as LF.

Next, connect the ListExploder transformer to the output port of the AttributeSplitter. Enter the ListExploder Parameters as follows:

List Attribute: _list{}
Accumulation Mode: Merge List Attributes
Conflict Resolution: Use List Attribute Values
Element Index: Delete the default text and leave it blank

Save the workspace and run it. Now, look at the attribute table in Visual Preview. You can see that every line of the page represents a single feature. In coding language, this means our lists (pages) have been parsed into individual strings (lines). Confirm that the Elements port of the ListExploder cached 160 features.
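For readers who think in code, the AttributeSplitter and ListExploder pair behaves roughly like the Python sketch below. It is only an illustration outside FME: the page text is a made-up sample, and the dictionary keys simply mirror the attribute names used above.

```python
# Hypothetical page feature; in FME this comes from the PDF reader.
pages = [
    {
        "pdf_page_number": 1,
        "pdf_page_text": "Site: [14266NN] 1ST ST N OF MAIN ST\n Direction: North (bound) \n0000 1 2 3",
    },
]

features = []
for page in pages:
    # AttributeSplitter: split on the newline and trim whitespace on both sides.
    lines = [line.strip() for line in page["pdf_page_text"].split("\n")]
    # ListExploder: emit one feature per list element, keeping the page number.
    for line in lines:
        features.append({"pdf_page_number": page["pdf_page_number"], "_list": line})

for feature in features:
    print(feature)
```

Each printed dictionary corresponds to one line of the page, which is the shape of data the later steps filter on.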

3. Understand the Attributes to be Extracted
One of the best practices when working with any kind of data is to understand your data thoroughly before performing any analysis. Let’s get familiarized with all the important information from the PDF before finding ways to extract it. For data collection and mapping purposes, pay attention to the following information:

Site: [14266NN] 1ST ST N OF MAIN ST. Let’s pay attention to the string inside the brackets, 14266NN. The first part of this string is a number representing the Intersection ID, which will be used for a table join later on. The second part indicates the relative location of this collection point. NN actually means North of the intersection, vehicles traveling Northbound. These acronyms differ across different speed reports depending on the collection point and travel direction. However, we will only need to understand the very first letter of the entire string, which indicates the relative location of our collection point (the letter N in this case means N of the Intersection).
Direction: North (bound). This indicates the vehicles’ travel direction being collected. Each report has only one travel direction.
* Wednesday, May 24, 2023. This is the traffic count’s collection date. There are at least 2 continuous collection dates in every report.
Traffic counts - all the attributes within the traffic count table, right below each collection date. Each row of the table contains travel counts recorded every 30 minutes for different speed ranges (e.g., column name Vbin 35-40 means the speed range is 35-40 mph) throughout the entire collection day. The last column is the total number for each half-hour period. Pay attention to the total number of columns and their order; we will need this information at a later step. There are at least two traffic count tables in every report, corresponding to the collection dates.

Sample Traffic Speed report with all important information highlighted

4. Add a TestFilter for the Four Important Attributes
Now that we are quite familiar with all these important attributes and the PDF has already been parsed into individual features, let’s filter each attribute to a separate stream. To start, add a TestFilter to the canvas and connect it to the Elements port of the ListExploder. Then, open up its parameters and format the port definitions to filter the four attributes we need.

First, for Traffic Counts, we need to filter all the lines that start with 4 digits. As you may have noticed, the 4 digits at the beginning of each row indicate the collection time in 24-hour format (e.g., 0000 is 00:00, or midnight). Therefore, configuring the filter statement to match only features starting with 4 digits will leave us with the entire traffic count table itself. From the TestFilter Parameters window, double-click on the first If statement to open up its Test Conditions dialog box and fill in the following:

Left Value: _list
Operator: Contains Regex
Right Value: ^\d{4}
Output Port: TrafficCounts

The regular expression might look intimidating at first, but the Quick Reference can help you break down what each part of the REGEX means.

^ Start of line
\d Any digit
{4} Exactly 4 repetitions of the preceding token (here, 4 digits)

Putting it all together, this expression simply means “Match any string (line) that starts with 4 digits”. You can confirm that our regular expression is correct by typing any 4 digits into the Test String box and seeing if all digits are highlighted in yellow.
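If you would like to try the pattern outside the Regular Expression Editor, the following Python sketch applies the same expression to a few sample lines. The sample strings are illustrative only; the first one is a made-up table row.

```python
import re

# Same pattern as in the TestFilter: the line must start with 4 digits.
starts_with_time = re.compile(r"^\d{4}")

samples = [
    "0000 0 1 2 16",                        # hypothetical table row
    "Site: [14266NN] 1ST ST N OF MAIN ST",  # collection site line
    "* Wednesday, May 24, 2023",            # collection date line
]

matches = [bool(starts_with_time.search(s)) for s in samples]
print(matches)  # [True, False, False]
```

Only the table row matches, even though the date line also contains four digits, because the ^ anchor requires the digits to be at the start of the line.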

Click OK to close the first Regular Expression Editor, and OK again to exit the test conditions. Next, we will create another If statement to filter the second information, Collection Dates. Click on the Else If statement on the second row to open its Test Conditions dialog box, then format the Test Clauses as follows:

Left Value: _list
Operator: Begins With
Right Value: *
Logic: AND
Left Value: _list
Operator: Contains Regex
Right Value: \d{4}
Output Port: CollectionDate

This entire Test Clause simply means “Match any string that begins with an asterisk (*) AND contains 4 consecutive digits”. If you recall the collection date from our input PDF (* Wednesday, May 24, 2023), that string starts with an asterisk and contains four digits corresponding to the collection year. If configured correctly, this test clause will output at least two collection dates from each report.

Now move on to the third attribute, Collection Point. Taking another look at the sample PDF, we can see that the line indicating our Collection Point starts with the word “Site” followed by a colon (Site:). Thus, this Test clause will be much more straightforward. Click on the next Else If statement (third row) and format its Test clause as follows:

Left Value: _list
Operator: Begins With
Right Value: Site:
Output Port: CollectionPoint

The final attribute, Travel Direction, will have a Test clause very similar to the previous one. The line containing this information starts with the word “Direction” followed by a colon (Direction:).

Left Value: _list
Operator: Begins With
Right Value: Direction:
Output Port: TravelDirection

Click OK twice to close the Test Conditions and the TestFilterParameters dialog boxes. Save the workspace, then click Run.
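The four test clauses behave like an if / else-if chain in which each line goes to the first port whose test passes. The Python sketch below mirrors that logic under the assumption that the second clause uses a regular expression test for the 4 digits; the function name route and the fallback name Unfiltered are illustrative, not part of FME.

```python
import re

def route(line):
    """Return the name of the first port whose test passes (hypothetical helper)."""
    if re.search(r"^\d{4}", line):
        return "TrafficCounts"
    elif line.startswith("*") and re.search(r"\d{4}", line):
        return "CollectionDate"
    elif line.startswith("Site:"):
        return "CollectionPoint"
    elif line.startswith("Direction:"):
        return "TravelDirection"
    return "Unfiltered"  # everything else falls through

lines = [
    "0000 0 1 2 16",  # hypothetical table row
    "* Wednesday, May 24, 2023",
    "Site: [14266NN] 1ST ST N OF MAIN ST",
    "Direction: North (bound)",
    "Some other line",
]
ports = [route(line) for line in lines]
print(ports)
```

Because the tests are evaluated in order, a line is never sent to more than one port.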

Double-check that the number of features at each output port is similar to the screenshot below, and make sure you thoroughly understand these numbers by inspecting all the cached features from each port.

Feel free to organize your transformers, then add bookmarks and annotations to document your workspace. This will help you remember what you did for each step and keep everything organized, as our workspace is becoming increasingly complex.

AI Assist for Regular Expressions

Starting in FME Form 2023.1, Artificial Intelligence (AI) Assist is available in the Regular Expression editor to help create search patterns for your use case. Click the AI Assist button at the bottom of the Regular Expression Editor dialog to open the AI Assist dialog.

Type in a prompt in English in the Regular Expression Description field and click Generate. The AI service will then attempt to generate a regular expression search pattern based on the Regular Expression Description field. An explanation of the generated expression and test strings will also be provided in the Explanation and Test String fields, respectively. You can also enter your own test strings to test the regular expression.

Create a regular expression with the help of Artificial Intelligence (AI) Assist by typing in your regular expression description and selecting Generate. Optionally, enter your own test strings to confirm the regex works as expected. Hit Apply to apply the generated regular expression to the Regular Expression dialog.

If the generated regular expression does not match your test cases, continue to refine and modify the prompt. Once you are satisfied with the regular expression match pattern that has been generated, select Apply to apply the generated regular expression to the Regular Expression editor dialog.

Tip: When using AI Assist, input your own test string. Try experimenting with different regular expressions or rephrasing the description until the desired result is highlighted in the test string dialog.

5. Extract Traffic Counts
Now that we have all the important attributes filtered to their individual output port, let’s extract each of them, starting with the Traffic Counts. To do so, we will extract individual count values inside each line feature using the AttributeSplitter again. Add another AttributeSplitter to the canvas and connect it to the TrafficCounts port from the TestFilter. The parameters will be populated as follows:

Attribute to Split: _list
Delimiter or Format String: space character (highlighted in blue in the screenshot)
Trim Whitespace: Both
List Name: change to a more intuitive name (e.g., _countvalue)

Click Run To This.

If you check the AttributeSplitter_2’s output port, there are still 96 features, just like at the TrafficCounts port of the TestFilter; however, each of these features now contains a list attribute whose elements hold the data from each column in the original table. We can confirm that our AttributeSplitter was configured correctly by reviewing the Feature Information pane on the right of the canvas. Under the Attributes section, there are 18 list elements called _countvalue{} in each feature. Can you guess which _countvalue element corresponds to the collection time? We will need to revisit these list attributes at a later step.

(Hint: _countvalue index starts from 0, and this value corresponds to the first column from our traffic count table)

To access more training on handling list attributes, please refer to the following module: Work With Multiple Data Models Using Lists. It is important to familiarize yourself with this concept, as we will revisit it a few more times throughout this demo series.

Our next task is to create groups of 48 half-hour increments for each collection day, using a custom transformer called Grouper. In this case, two groups will be created from the 96 features. Add the Grouper transformer, connect it to the AttributeSplitter_2’s output port, and set the Sampling Rate to 48 before clicking OK. After running the workspace, take a look at the Visual Preview table and note the new field called _group_index: the first 48 features have an index of 0 and the rest have an index of 1. Keep these numbers in mind, as we will use them as the join key to match all Traffic Count features with their respective Collection Dates. Save and document the workspace before moving on to the next step, Extract Collection Dates.
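The grouping described here amounts to integer division of each feature's position by the group size. The Python sketch below reproduces the resulting _group_index values for 96 placeholder rows; it illustrates the outcome only and is not how the Grouper is implemented internally.

```python
GROUP_SIZE = 48  # half-hour rows per collection day

# 96 placeholder rows standing in for the TrafficCounts features.
rows = [f"row{i}" for i in range(96)]

grouped = [
    {"row": row, "_group_index": position // GROUP_SIZE}
    for position, row in enumerate(rows)
]

print(grouped[0]["_group_index"], grouped[47]["_group_index"])   # 0 0
print(grouped[48]["_group_index"], grouped[95]["_group_index"])  # 1 1
```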

6. Extract Collection Dates
To extract Collection Dates, we will use a StringReplacer and remove all characters before the collection month, including the asterisk, the weekday, a comma, and a space after it ( *Wednesday, ). Add a StringReplacer transformer to the canvas and connect it to the CollectionDate port from the TestFilter. Fill in the following parameters:

Attributes: _list
Mode: Replace Regular Expression
Case Sensitive: No
Text To Replace: .*y,\s
Set Attribute Value To: <No Action>

Here is a breakdown of the regular expression provided, in case you are not comfortable with Regex yet:
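.* Any sequence of characters (the dot matches any single character, and the asterisk repeats it zero or more times)
y, The letter y followed by a comma (i.e., the end of the weekday name, as in Wednesday,)
\s Any whitespace character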

This entire expression simply means “Match all characters up until the letter y, followed by a comma and a white space”.
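To see the replacement in isolation, the following Python sketch applies the same expression to the sample date line and replaces the match with an empty string. Python's re module stands in for the StringReplacer here.

```python
import re

raw = "* Wednesday, May 24, 2023"

# Remove everything up to and including "y, " (the end of the weekday name).
cleaned = re.sub(r".*y,\s", "", raw, flags=re.IGNORECASE)

print(cleaned)  # May 24, 2023
```

Note that the y in May is not followed by a comma, so the match stops at the end of the weekday and the month is left intact.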

After running the workspace, confirm that the StringReplacer only outputs two date features with the following format: Month DD, YYYY.
Now, we will convert our date features into a numeric format (FME format) using a transformer called DateTimeConverter. Add the DateTimeConverter to the canvas and connect it to the StringReplacer. Use the following information to fill in the parameters:

DateTime Attributes: _list
Input Format: %B %d, %Y
Output Format: FME
Repair Overflow: Yes
Passthrough Nulls, Empties or Missing: No

Preview the output in the Preview Data section right under General parameters.

Click OK, then Run To This. Now, we have our date-time string in a numeric format (it can be any other date-time format or even custom formats if needed). We just use the FME format for this demo.
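The same conversion can be sketched in Python, which uses the same %B %d, %Y format codes for parsing. The sketch assumes an English locale for month names and writes the result as YYYYMMDD, which is how an FME date value is laid out.

```python
from datetime import datetime

date_text = "May 24, 2023"

# Parse "Month DD, YYYY", then format as YYYYMMDD.
parsed = datetime.strptime(date_text, "%B %d, %Y")
fme_date = parsed.strftime("%Y%m%d")

print(fme_date)  # 20230524
```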

However, our field name needs to be a bit more intuitive rather than just “_list.” At this point, we also want to remove the other two fields (pdf_page_number and pdf_page_text). So, let’s add an AttributeManager and connect it to the DateTimeConverter. Refer to the following screenshot to configure its parameters.

Later on, in order to join our Collection Dates with the Traffic Counts extracted from the previous part, we will need a join key. Remember the _group_index attribute from the Grouper earlier? We will replicate these group indexes for each date extracted so that the first Collection Date has an index of 0 and the second collection date has an index of 1. These correspond to the indexes of the two groups of 48 traffic count features we extracted previously. Add the Counter transformer to the canvas and connect it with the AttributeManager. Make sure the Counter parameters are set as follows:

Count start: 0
Output Attribute Names
- Count: _groupindex

After running the workspace, double-check the Counter’s cached features and confirm that each collection date has a group index. When we are done extracting all the important attributes, we will perform a table join at the final step. Now, save and document the workspace before extracting the Collection Point.

7. Extract Collection Point and Intersection ID
For the third important attribute, Collection Point, we will also extract IntersectionID from the same string feature. This IntersectionID will serve as the join key in the second demo of this series, when we intend to join all the extracted attributes to a point geometry, the Intersection point.

To start, we will first extract the entire string between the square brackets using a StringSearcher, paired with another advanced Regular Expression. Add the StringSearcher to the canvas and connect it with the CollectionPoint output port from the TestFilter. Fill in the parameters as follows:

Search In: _list
Contains Regular Expression: (?<=\[).*?(?=\])
Matched Result: _first_match

Don’t be intimidated by this complex-looking expression. Here is a breakdown of it:
(?<=\[) A positive lookbehind to find any text string preceded by an opening square bracket [.
.*? Any character or sequence of characters, as few as possible. The question mark ? makes it non-greedy, meaning it will match as few characters as possible. This ensures that the match does not extend beyond the closing square bracket.
(?=\]) A positive lookahead to check if the text string is followed by a closing square bracket ].
Putting it all together, this expression simply means “Match any string that is enclosed within the square brackets”. It will capture the content inside the brackets, but not the brackets themselves.
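Here is a small Python sketch of the same lookbehind and lookahead pattern applied to the sample Site line, in case you want to experiment with it outside FME.

```python
import re

site_line = "Site: [14266NN] 1ST ST N OF MAIN ST"

# Match the text between "[" and "]" without including the brackets.
match = re.search(r"(?<=\[).*?(?=\])", site_line)
first_match = match.group() if match else None

print(first_match)  # 14266NN
```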

After running the workspace, double-check our output feature to see if the StringSearcher correctly matched the text string we wanted (14266NN). Next, we will add two more StringSearchers to split the IntersectionID and Collection Point (first letter after IntersectionID) from this string. Connect the second StringSearcher to the first one. The parameters for the second StringSearcher are as follows:

Search In: _first_match
Contains Regular Expression: .\d+
Matched Result: IntersectionID

The regular expression .\d+ is a pattern used to match a character followed by one or more digits. Breaking it down:
. Any single character. Because our string (14266NN) begins with a digit, the dot matches that first digit.
\d+ One or more digits. The \d represents any digit from 0 to 9, and the + quantifier indicates that there must be at least one digit and allows for additional digits. The regular expression means “Match any character followed by one or more digits.” Applied to 14266NN, this extracts all the leading digits (14266).

Make sure the Matched Result parameter is changed to IntersectionID before clicking OK. Next, connect the second StringSearcher to the third one. The third StringSearcher should be configured as follows:

Search In: _first_match
Contains Regular Expression: [a-z,A-Z]*$
Case Sensitive: Yes
Matched Result: CollectionPoint

As you are a bit more comfortable with Regex by now, you might figure out that the regular expression [a-z,A-Z]*$ simply means “Match the run of letters at the very end of the string.” (The comma inside the brackets is matched literally and is not needed here.) This extracts the last two characters (NN), which indicate the Collection Point.

Click OK, then run the workspace and confirm that it only extracts letters from the given string. Your output feature now has two attributes: IntersectionID and CollectionPoint.
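The two expressions can be checked together with a short Python sketch. It applies both patterns to the bracketed string 14266NN, which is an assumption about the input each StringSearcher receives.

```python
import re

first_match = "14266NN"

# Any character followed by one or more digits: the leading digits.
intersection_id = re.search(r".\d+", first_match).group()

# The run of letters at the end of the string.
collection_point = re.search(r"[a-z,A-Z]*$", first_match).group()

print(intersection_id)   # 14266
print(collection_point)  # NN
```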

Ensure that the workspace is Saved and properly documented like the screenshot below before moving on to the next step.

8. Extract Travel Direction
Now we will extract the last attribute needed from the PDF, Travel Direction. Add another StringSearcher to the canvas and connect it with the TravelDirection port from the TestFilter. Trust me, this will be the last StringSearcher we need for this workspace! Open up its parameters and specify the following values:

Search In: _list
Contains Regular Expression: (?<=\:).*?(?=\()
Matched Result: TravelDirection

It is just a slightly different version of the Regex we used to extract the string between square brackets in a previous step. The opening bracket is replaced with a colon (:), and the closing bracket is replaced with an opening parenthesis. Our goal is to extract the word “North” that is located after "Direction:" and before "(bound)". This expression will also work for any other directions found in different reports.

Click OK, then run the workspace. If you take a closer look at the output feature, you can see that the expression did not do a perfect job, as it also includes two space characters, one before and one after the word North.

Therefore, we need to add an AttributeTrimmer and connect it with the StringSearcher_4 to get rid of those space characters. Configure the parameters as follows, then click Run:

Attribute to Trim: TravelDirection
Trim Type: Both
Trim Characters: (space)
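The search and trim can be reproduced in a few lines of Python, which also shows the surrounding spaces that make the trim necessary. The input string follows the Direction line described earlier.

```python
import re

direction_line = "Direction: North (bound)"

# Everything after the colon and before the opening parenthesis.
raw_direction = re.search(r"(?<=\:).*?(?=\()", direction_line).group()

# Equivalent of the AttributeTrimmer with Trim Type: Both.
travel_direction = raw_direction.strip(" ")

print(repr(raw_direction))     # ' North '
print(repr(travel_direction))  # 'North'
```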

Next, let’s use a FeatureMerger to merge this TravelDirection feature with the one from our previous step that contains both Intersection ID and Collection Point. Connect the Matched port from StringSearcher_3 to the FeatureMerger’s Requestor port, and the Output port from the AttributeTrimmer to the Supplier port. Set the same constant Join On value for the Requestor and Supplier (e.g., 1). Leave other parameters as default, click OK, then run the workspace.

To elaborate, this specific configuration enables data joining even when there is no common key attribute. By setting the Join On parameters this way, we can join a single-feature Supplier to each feature in the Requestor. By doing so, incoming feature(s) from the Requestor will pick up the attributes and corresponding values from the Supplier’s single feature. This should only be done with a Supplier that contains exactly one feature. Otherwise, you will need a key attribute to perform any form of table join.

After extracting all the important attributes from the PDF, your workspace should look similar to the screenshot below. The last section of this demo will focus on merging all the extracted attributes into the same feature and summarizing hourly counts to Average Daily Traffic counts.

Part 2: Merge Attributes and Calculate Average Daily Traffic

1. Merge Attributes
Remember what we did at the end of the Extract Collection Dates step? We used a Counter to replicate the group index, which serves as the join key for a table join. Now, we will join all the Traffic Counts with their respective Collection Dates.

Let’s add a FeatureMerger to the canvas and connect the Grouper’s Output port to the FeatureMerger’s Requestor port, and the Counter to the Supplier port.

Fill in the FeatureMerger_2 parameters as follows:

Supplier First: No
Requestor: _group_index
- from Traffic Counts
Supplier: _groupindex
- from Collection Dates
Comparison Mode: Automatic

Click OK, then Run To This, and confirm that 96 features came out of the Merged port.
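Conceptually, this merge is a lookup of each traffic count's _group_index in the collection dates keyed by _groupindex. The Python sketch below shows the idea with two placeholder count rows per day; the second date and all count values are hypothetical.

```python
# Requestor side: traffic count rows with their group index (placeholder values).
counts = [
    {"Time": "0000", "TotalCount": 16, "_group_index": 0},
    {"Time": "0030", "TotalCount": 9, "_group_index": 0},
    {"Time": "0000", "TotalCount": 12, "_group_index": 1},
    {"Time": "0030", "TotalCount": 7, "_group_index": 1},
]

# Supplier side: one feature per collection date (second date is hypothetical).
dates = [
    {"Date": "20230524", "_groupindex": 0},
    {"Date": "20230525", "_groupindex": 1},
]

# Index the suppliers by key, then attach the matching date to each count row.
dates_by_index = {d["_groupindex"]: d for d in dates}
merged = [
    {**dates_by_index[c["_group_index"]], **c}
    for c in counts
    if c["_group_index"] in dates_by_index
]

for feature in merged:
    print(feature["Date"], feature["Time"], feature["TotalCount"])
```

Every count row keeps its own attributes and gains the Date of its group, so the number of merged features equals the number of count rows.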

Next, we will output all the traffic counts to a tabular format that can be easily read and used by other teams or departments if needed. To do so, we need to map each of the list attributes to the corresponding column from the Traffic Count table schema. Add another AttributeManager to the canvas, connect it to the Merged port of the FeatureMerger_2, and start populating its parameters as follows:

To explain, we are creating 18 new fields in our feature and populating them with the data extracted from our Traffic Count table. The value of each field corresponds to the list index position from each of the strings being parsed at the beginning of the demo. For example, the Time value for the first feature corresponds to its first list element (_countvalue{0}), indicating that the collection time was 00:00 (midnight). Similarly, the Total count value for the first feature corresponds to the 18th list element (_countvalue{17}), indicating that the total traffic count at 00:00 was 16. This same pattern is repeated until the very last feature we extracted from the Traffic Count table (96 features).

Run the workspace and check out the Table in Visual Preview. By now, we should have all the traffic counts organized by collection date and collection time across different speed ranges.

Congratulations! You successfully extracted the traffic count table from the PDF and output it to a new feature type in FME.

At this point, you might want to export the table and share it with other teams or departments using a CSV or MS Excel Writer.

2. Calculate Average Daily Traffic
However, since we want the Daily Traffic counts, we would need another transformer to summarize all the half-hour counts and group them together by each collection date. Let’s add a StatisticsCalculator to the canvas and connect it with the AttributeManager_2. For its parameters, enter the following:

Group Processing box: Checked
Group By: Date
Calculation Method: Numeric
Statistics to Calculate:
- TotalCount: Sum

Run the workspace and check the output table to see if there are summed total counts for each of the collection dates. These are Daily Traffic counts.

Now, for the Average Daily Traffic counts, we want to calculate the mean counts of both dates and keep only the first Collection Date. This can be done using another StatisticsCalculator. Add another StatisticsCalculator and connect it to the Summary Port from the first one.
Set the parameters as follows:

Group Processing: Unchecked
Statistics to Calculate:
- Date: Min
- TotalCount.sum: Mean

Save the workspace, then click Run. Ensure the visual preview table looks similar to the following screenshot, with the collection date showing 20191113 and ADT counts showing 2229. Don’t worry about the attribute names for now; we will change them to more intuitive names after merging them with the Collection Point and Travel Direction attributes.
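The two StatisticsCalculator steps boil down to a sum per date followed by a mean across dates. The Python sketch below shows that arithmetic on a handful of hypothetical half-hour totals; the numbers are not taken from the sample report.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical (Date, TotalCount) pairs; a real report has 48 rows per date.
rows = [
    ("20230524", 10),
    ("20230524", 20),
    ("20230525", 30),
    ("20230525", 50),
]

# First StatisticsCalculator: sum TotalCount grouped by Date.
daily_totals = defaultdict(int)
for date, total in rows:
    daily_totals[date] += total

# Second StatisticsCalculator: earliest date and mean of the daily sums.
first_date = min(daily_totals)
adt = mean(list(daily_totals.values()))

print(dict(daily_totals))  # {'20230524': 30, '20230525': 80}
print(first_date, adt)     # 20230524 55
```

Taking the minimum of the dates works because the YYYYMMDD format sorts chronologically.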

If you recall from the previous step, we already merged the data from IntersectionID, CollectionPoint, and TravelDirection into one single feature. Now we will use another FeatureMerger to merge that feature with the one we just created, putting all the important attributes together into one feature. Add another FeatureMerger to the canvas, connect the Summary Port from the StatisticsCalculator_2 to its Requestor port and the Merged port from FeatureMerger to its Supplier Port. Set the same Join On value for the requestor and supplier parameters (i.e., 1), click OK, then Run.

Check the output port from FeatureMerger_3 and make sure the feature contains all four important attributes we need.

Finally, to polish the attribute names and remove unnecessary attributes, add another AttributeManager and configure its Parameters as follows:

Let’s break down all these parameters for your reference:

Change Date.min to Date.
Remove unnecessary attributes including pdf_page_number, pdf_page_text, _list, _first_match, and TotalCount.sum.mean
CollectionPoint values are formatted to more intuitive descriptions, as we only need the very first letter to indicate their relative location from an intersection. (For example, either NN or NS will be set to N of). To do this, click on the ellipsis button in the column value to open up the Conditional Value Definition window.

Next, click on the first If Statement and populate the following test clause:

Left Value: CollectionPoint
Operator: Begins With
Right Value: N
Value: N of

The subsequent Else If statements for other directions are set up in a similar manner, with only minor changes in the Right Value (S, E, W), and Value (S of, E of, W of). Confirm that the Conditional Value Definition window looks like the screenshot above before moving on to the next parameter.

Add the word “bound” after the TravelDirection value (i.e., North -> Northbound). To do this, click on the ellipsis button under the Value column to open up the Text Editor window. Under the FME Feature Attributes dropdown, select TravelDirection. The @Value(TravelDirection) will be automatically added to the Editor whiteboard; you only need to type in the word “bound” right after it. Make sure there is no space in between.

Add two new attributes named ADTOne and ADTTwo; their conditional values are explained below.
Ensure that the TotalCount.sum.mean attribute is moved below ADTOne and ADTTwo, because we can only remove it after it has been used to set the conditional values of ADTOne and ADTTwo.

Regarding the ADTOne and ADTTwo attributes, each of them represents the ADT counts from a specific Travel Direction extracted from the report. At a given collection point, there are typically two Traffic Speed Reports that gather data from opposite travel directions. Therefore, it is crucial for our ADT data to indicate its corresponding travel direction. ADTOne indicates the count for Northbound or Eastbound travel, while ADTTwo indicates the count for Southbound or Westbound travel. Set up the Conditional Value for ADTOne as follows:

Left Value: TravelDirection
Operator: Contains
Right Value: North
Logic: OR
Left Value: TravelDirection
Operator: Contains
Right Value: East
Value: TotalCount.sum.mean

Similarly, the Conditional Value for ADTTwo is as follows:

Left Value: TravelDirection
Operator: Contains
Right Value: South
Logic: OR
Left Value: TravelDirection
Operator: Contains
Right Value: West
Value: TotalCount.sum.mean
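Taken together, the conditional values in this AttributeManager can be summarized with the Python sketch below. The function name finalize is hypothetical, and the sample values (NN, North, 2229) are the ones used in this demo.

```python
def finalize(collection_point, travel_direction, adt):
    """Apply the conditional values described above (illustrative helper)."""
    relative = {"N": "N of", "S": "S of", "E": "E of", "W": "W of"}
    northish = "North" in travel_direction or "East" in travel_direction
    southish = "South" in travel_direction or "West" in travel_direction
    return {
        # Only the first letter decides the relative location.
        "CollectionPoint": relative.get(collection_point[:1], collection_point),
        "TravelDirection": travel_direction + "bound",
        # Northbound/Eastbound counts go to ADTOne, Southbound/Westbound to ADTTwo.
        "ADTOne": adt if northish else None,
        "ADTTwo": adt if southish else None,
    }

result = finalize("NN", "North", 2229)
print(result)
```

For this Northbound report, only ADTOne is populated and ADTTwo stays empty.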

Save the workspace, then click Run Entire Workspace. Double-check that your output table looks like the screenshot below.

As you can observe, this report pertains to the Northbound travel direction, and therefore the ADT data will be recorded in the ADTOne field. There is no need to be concerned about ADTTwo at this moment as it will be filled with data from the opposite direction, which will be extracted from another Traffic Speed Report. In an upcoming article of this series, we will demonstrate how to aggregate data from multiple reports and explore methods to automate the entire process using FME Flow.

Part 3: Join the Extracted Data to an Intersection Point

1. Add a Shapefile Reader
Our last step is to join all the extracted data to an intersection point, using the IntersectionID as the key attribute. To start, let’s add the Street Intersections shapefile using the Esri Shapefile Reader. Click Add Reader and make sure the Format is set to Esri Shapefile. For Dataset, browse to the demo folder and select the Street_Intersections.shp file.

2. Add an AttributeKeeper
Click OK, then run the workspace and double-check that the 22,660 intersections were read in. As you may notice, this feature type contains 36 attributes/fields in total. We only need a few of them, so let’s add an AttributeKeeper to remove the irrelevant attributes. Connect the AttributeKeeper to the Shapefile Reader. For the Attributes to Keep parameter, click on the ellipsis button near the end and check the following attributes:

ASTREETDIR (Street A Direction)
ASTREETNAM (Street A Name)
BSTREETDIR (Street B Direction)
BSTREETNAM (Street B Name)
FACILITYID (Facility ID)
INTNAME (Intersection Name)
INTNUM (Intersection Number, which is IntersectionID)
LATITUDE
LONGITUDE

3. Add a FeatureJoiner
Next, we will perform a table join between the output from the previous step and the Street Intersections using the FeatureJoiner transformer. Our key attribute will be IntersectionID, which is the same as the INTNUM attribute from the input Street_Intersections shapefile. Add a FeatureJoiner to the canvas, connect its Right port to the AttributeManager_3, and its Left port to the AttributeKeeper.

When prompted, set the Left Join On to INTNUM and the Right to IntersectionID.

Click OK, then Run the workspace. Ensure that only one feature is cached at the output port of the FeatureJoiner and that the Intersection Point is displayed in the Graphics window.
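As a rough analogy, the join keeps only the intersections whose INTNUM equals the report's IntersectionID and attaches the report attributes to them. In the Python sketch below, the intersection records are placeholders with made-up names and coordinates, and keys are compared as strings on the assumption that the two sources may store the ID with different types.

```python
# Left side: intersection points (placeholder records, not real data).
intersections = [
    {"INTNUM": 14266, "INTNAME": "PLACEHOLDER A", "LATITUDE": 0.0, "LONGITUDE": 0.0},
    {"INTNUM": 99999, "INTNAME": "PLACEHOLDER B", "LATITUDE": 1.0, "LONGITUDE": 1.0},
]

# Right side: the single feature extracted from the report.
report = {"IntersectionID": "14266", "TravelDirection": "Northbound", "ADTOne": 2229}

# Keep only matching intersections and attach the report attributes.
joined = [
    {**point, **report}
    for point in intersections
    if str(point["INTNUM"]) == str(report["IntersectionID"])
]

print(len(joined))           # 1
print(joined[0]["INTNAME"])  # PLACEHOLDER A
```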

Lastly, save and properly document your workspace with annotations and bookmarks. You can use the following screenshot for reference.

Congratulations! You have successfully completed an FME workspace that is capable of reading a complex PDF file, extracting relevant data, and configuring attributes tailored to your project needs. By now, you should feel more confident working with list attributes and creating Regular Expressions for text extraction. Well done!