Getting Started with AI in FME: Classifying Unstructured PDF Files

Files

FileClassification.fmw
- 80 KB
- Download

Introduction

Organizations often manage large volumes of unstructured documents—PDFs, Word files, plain text, and more—that contain valuable information but are difficult to organize or analyze using traditional tools. With the OpenAIConnector transformer (or any of our other AI Transformers) in FME, you can now harness the power of large language models to understand and process this kind of content at scale.

By leveraging this functionality, you can build workflows for:

Document classification: Automatically categorize files into types such as contracts, invoices, reports, or proposals based on their content.
Content summarization: Extract key points, summaries, or abstracts from long documents to make them easier to digest.
Data validation and cleanup: Flag documents with missing information or inconsistencies before they enter downstream systems.
Search optimization: Tag documents with keywords or metadata to improve searchability in document management systems.
Sensitive content detection: Flag documents that contain potentially sensitive or confidential information.

By combining these capabilities with FME’s powerful data integration and automation tools, you can build intelligent workflows that not only move and transform data, but also understand it.

In this tutorial, we’ll walk through a simple but powerful use case: classifying PDF documents and renaming them based on their content. This is an excellent first step toward building more advanced AI-powered document processing pipelines.

AI Disclaimer
The results generated by the OpenAIConnector are based on predictions from a large language model and may contain inaccuracies, misinterpretations, or omissions. Always review AI-generated outputs before relying on them for decision-making or reporting. For critical use cases, validate insights against source data or consult a subject matter expert.

Requirements

FME Workbench 2025.0.0.0 or later (Build 25208+)
Access to the OpenAI API
A directory of unclassified PDFs, or the sample PDFs downloaded here (the link downloads 1000 unclassified PDFs). For more information on the source of these PDFs, see Digital Corpora.

Step-by-Step Instructions

In this section, you’ll build an FME workspace that classifies PDF documents using the OpenAIConnector. The workflow will:

Read a directory of PDF files
Send each file to OpenAI for classification
Suggest a new filename based on content
Copy and rename the files into a new, organized folder

Follow the steps below to get started.

1. Get an OpenAI API Key

If you haven’t already, sign up for the OpenAI API, then generate an API key. You’ll need this key to authenticate requests made by the OpenAIConnector in FME.

2. Create a New FME Workspace

Open FME Workbench and create a blank workspace.

3. Add a Reader

Use the Directory and File Pathnames Reader to scan the folder that contains your PDF files, which can be downloaded from here.

Click Add Reader and set the following parameters:

Format: Directory and File Pathnames
Parameters:
- Dataset: <folder>
- Select Parameters and set Path Filter: *.pdf

Click OK twice to close both dialog boxes.

This reader outputs a list of full file paths.
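
For readers who think in code, the reader's behavior can be approximated outside FME. The following is a minimal Python sketch, not what FME runs internally. The function name list_pdf_paths is hypothetical, and the sketch assumes only the top level of the folder is scanned and that the extension match is case-insensitive.

```python
from pathlib import Path


def list_pdf_paths(folder: str) -> list[str]:
    """Return full paths of the PDF files directly inside `folder`, sorted by name."""
    root = Path(folder)
    # Compare suffixes case-insensitively so "REPORT.PDF" is also included.
    pdfs = [p for p in root.iterdir() if p.is_file() and p.suffix.lower() == ".pdf"]
    # as_posix() yields forward-slash paths, similar in spirit to path_unix.
    return [p.resolve().as_posix() for p in sorted(pdfs)]
```

Each returned string plays the role of the path attribute that later steps pass to the OpenAIConnector and the writer.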

4. Add a Sampler Transformer

To avoid excessive token usage while testing, add the Sampler transformer. Attach the Sampler to the output of the Directory and File Pathnames reader feature type. Double-click the Sampler to edit the following parameters:

Sampling Rate (N): 10
Randomize Sampling: No

Click OK.

This helps you test your workflow on a manageable sample size.
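
The effect of sampling can be sketched in Python. The tutorial does not state which sampling type is selected, so the two functions below (hypothetical names first_n and every_nth) illustrate two plausible, non-randomized interpretations of N = 10 rather than the Sampler's confirmed behavior.

```python
from itertools import islice
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")


def first_n(features: Iterable[T], n: int = 10) -> Iterator[T]:
    """Return an iterator over at most the first `n` features, in original order."""
    return islice(features, n)


def every_nth(features: Iterable[T], n: int = 10) -> Iterator[T]:
    """Return an iterator over every `n`-th feature (the n-th, 2n-th, and so on)."""
    return islice(features, n - 1, None, n)
```

With 1000 input PDFs, the first interpretation sends 10 files to OpenAI and the second sends 100, so it is worth checking the feature count on the Sampled port before running the full workflow.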

5. Add an OpenAIConnector

Tips for prompt generation:

• Be clear and specific in your prompt. The more context you provide, the better the model performs.
• Role-based prompts (e.g., “You are a document classification assistant…”) help set expectations for the model.
• Use structured output whenever possible. This makes downstream processing much easier, especially in FME, where structured JSON can be parsed into attributes using the JSONFlattener.
• Enumerate options. If you want consistent classification (e.g., fixed categories), provide a defined list the model can choose from.
• Ask for justification. Including fields like explanation or confidence_score can help with auditing and quality assurance.
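
One way to apply the "enumerate options" and "structured output" tips together is to generate both the prompt text and the schema's enum from a single list of categories, so the two cannot drift apart. The Python sketch below is illustrative only: the names CATEGORIES, build_prompt, and build_classification_property are hypothetical, and the category descriptions are abbreviated versions of the ones used in this step.

```python
# Category names match the tutorial; descriptions are abbreviated for the sketch.
CATEGORIES = {
    "research paper": "Documents that present scientific, academic, or experimental work.",
    "financial report": "Documents that contain financial statements or fiscal data.",
    "legal document": "Contracts, court rulings, statutes, or agreements.",
    "other": "If none of the above apply.",
}


def build_prompt(categories: dict[str, str]) -> str:
    """Build a role-based prompt that lists every allowed category."""
    lines = [
        "You are a document classification assistant. Analyze the content of the "
        "following document and classify it into one of the following categories:",
        "",
    ]
    lines += [f"- {name}: {description}" for name, description in categories.items()]
    lines += ["", "Also suggest a new filename for the file based on the contents."]
    return "\n".join(lines)


def build_classification_property(categories: dict[str, str]) -> dict:
    """Build the schema fragment for the classification field from the same list."""
    return {"type": "string", "enum": list(categories)}
```

Adding a category then means editing one dictionary, and both the prompt and the enum pick up the change.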

Add the OpenAIConnector and attach it to the Sampler Sampled output port. Double-click the OpenAIConnector to edit the following parameters:

API Key: <Your API Key>
Action: File Search
File to Upload: path_windows or path_unix (depending on your OS)

You can select path_windows or path_unix from the drop-down menu beside the parameter field. If you don't see it listed, try running the workspace with feature caching enabled once to populate available attributes, then return to the transformer settings to select it.

User Prompt:

You are a document classification assistant. Analyze the content of the following document and classify it into one of the following categories based on its purpose and structure:

- research paper: Documents that present scientific, academic, or experimental work, often with sections like abstract, introduction, methodology, results, and references.

- financial report: Documents that contain financial statements, earnings, expenses, projections, or fiscal data for organizations.

- legal document: Contracts, court rulings, statutes, agreements, or other formal/legal language with references to laws or legal processes.

- other: If none of the above apply.

Also suggest a new filename for the file based on the contents.

Structured Output: Checked

{
  "additionalProperties": false,
  "properties": {
    "classification": {
      "enum": [
        "research paper",
        "financial report",
        "legal document",
        "other"
      ],
      "type": "string"
    },
    "confidence_score": {
      "type": "number"
    },
    "detected_keywords": {
      "items": {
        "type": "string"
      },
      "type": "array"
    },
    "explanation": {
      "type": "string"
    },
    "filename": {
      "type": "string"
    },
    "language": {
      "type": "string"
    }
  },
  "required": [
    "classification",
    "confidence_score",
    "explanation",
    "detected_keywords",
    "language",
    "filename"
  ],
  "type": "object"
}

Click OK.
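
If you want to audit responses outside FME, the schema's rules can be checked with a few lines of Python. This is a hand-rolled sketch covering only the constraints in the schema above (required fields, types, the classification enum, and no extra properties); it is not a general JSON Schema validator, and the function name validate_response is hypothetical.

```python
import json

ALLOWED_CLASSIFICATIONS = {"research paper", "financial report", "legal document", "other"}

# Field name -> expected Python type(s), mirroring the schema's "type" entries.
EXPECTED_TYPES = {
    "classification": str,
    "confidence_score": (int, float),
    "detected_keywords": list,
    "explanation": str,
    "filename": str,
    "language": str,
}


def validate_response(response_json: str) -> list[str]:
    """Return a list of problems; an empty list means the response conforms."""
    data = json.loads(response_json)
    if not isinstance(data, dict):
        return ["top-level value is not an object"]
    problems = []
    for field, expected in EXPECTED_TYPES.items():
        if field not in data:
            problems.append(f"missing required field: {field}")
        # bool is a subclass of int in Python, so reject it explicitly.
        elif isinstance(data[field], bool) or not isinstance(data[field], expected):
            problems.append(f"wrong type for field: {field}")
    # additionalProperties is false, so unknown keys are reported as well.
    for field in sorted(data.keys() - EXPECTED_TYPES.keys()):
        problems.append(f"unexpected field: {field}")
    classification = data.get("classification")
    if isinstance(classification, str) and classification not in ALLOWED_CLASSIFICATIONS:
        problems.append("classification is not one of the allowed categories")
    keywords = data.get("detected_keywords")
    if isinstance(keywords, list) and not all(isinstance(k, str) for k in keywords):
        problems.append("detected_keywords must contain only strings")
    return problems
```

A check like this is mainly useful for spot-auditing saved responses, since the structured output setting is intended to keep the model within the schema in the first place.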

6. Add a JSONFlattener

After the OpenAIConnector returns its structured response, the data is stored in a single JSON-formatted attribute named Response. To work with the individual fields (such as classification, filename, or confidence_score), you need to parse that JSON into FME attributes. Add a JSONFlattener and attach it to the OpenAIConnector output port.

Double-click the JSONFlattener and edit the following parameters:

JSON Attribute to Flatten: Response
Recursively Flatten Objects/Arrays: No
Attributes to Expose: classification, confidence_score, detected_keywords, explanation, language, filename

You can select Response from the drop-down menu beside the parameter field. If you don't see it listed, try running the workspace with feature caching enabled once to populate available attributes, then return to the transformer settings to select it.

Click OK.
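
Conceptually, flattening copies the top-level keys of the Response JSON onto the feature as attributes. The Python sketch below illustrates the idea using a plain dictionary as a stand-in for an FME feature. The function name flatten_response is hypothetical, and the handling of nested values (kept as JSON text when flattening is not recursive) is an assumption about the behavior, not a statement of how the JSONFlattener is implemented.

```python
import json


def flatten_response(feature: dict, json_attr: str = "Response") -> dict:
    """Copy the top-level keys of the JSON text in `json_attr` onto the feature."""
    flattened = dict(feature)  # leave the input feature untouched
    parsed = json.loads(feature[json_attr])
    for key, value in parsed.items():
        # Non-recursive: nested arrays and objects are kept as JSON text
        # instead of being expanded into further attributes (assumption).
        if isinstance(value, (dict, list)):
            flattened[key] = json.dumps(value)
        else:
            flattened[key] = value
    return flattened
```

After this step, classification, language, and filename are ordinary attributes that the following transformers can test and copy.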

7. Add a Tester

Add a Tester and connect it to the output port of the JSONFlattener. The Tester keeps only English-language documents that were assigned one of the three named categories. Documents classified as other, or written in another language, are sent to the Failed port.

Configure the Tester with the following logic:

Test Clauses:

Logic	Left Value	Operator	Right Value
(	language	=	en
OR	language	=	English
) AND (	classification	!=	other
)

Comparison Mode: Case Insensitive

Click OK.
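
The test clauses above amount to a single boolean expression. As a rough Python equivalent (the function name passes_tester is hypothetical, and a dictionary stands in for an FME feature):

```python
def passes_tester(feature: dict) -> bool:
    """Mirror the clauses: (language = en OR language = English) AND classification != other."""
    # casefold() gives the case-insensitive comparison set by Comparison Mode.
    language = str(feature.get("language", "")).casefold()
    classification = str(feature.get("classification", "")).casefold()
    is_english = language in {"en", "english"}
    is_named_category = classification != "other"
    return is_english and is_named_category
```

Note that in this sketch a feature with no classification value at all still passes the second clause, because an empty value is not equal to other. If that matters for your data, consider adding an explicit check that classification has a value.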

8. Add an AttributeCreator

Add an AttributeCreator and attach it to the Passed port of the Tester. It defines the source and destination attributes for each renamed PDF file, which are required by the File Copy writer added in the next step.

Double-click the AttributeCreator to edit the following parameters:

Output Attribute	Value
filecopy_source_dataset	path_unix
filecopy_dest_filename	filename

When setting the values, you can select existing attributes like path_unix and filename by clicking the drop-down arrow and choosing Attribute Value.

path_unix is the full path to the original file (provided by the Directory and File Pathnames Reader), and filename is the suggested new name generated by the OpenAIConnector.

Click OK to save your changes.
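
The attribute mapping can be sketched in Python as follows. The function add_filecopy_attributes mirrors the two rows in the table above. The helper safe_pdf_name is a hypothetical precaution that is not part of this tutorial's workspace: a model-suggested filename is not guaranteed to end in .pdf or to avoid characters that are invalid in file names, so the sketch shows one way such a name could be cleaned up.

```python
import re


def safe_pdf_name(suggested: str) -> str:
    """Replace characters that are invalid in file names and ensure a .pdf extension."""
    # Characters reserved on Windows, path separators, and control characters.
    cleaned = re.sub(r'[<>:"/\\|?*\x00-\x1f]', "_", suggested).strip(" .")
    if not cleaned:
        cleaned = "unnamed"
    if not cleaned.lower().endswith(".pdf"):
        cleaned += ".pdf"
    return cleaned


def add_filecopy_attributes(feature: dict, sanitize: bool = False) -> dict:
    """Set the source and destination attributes, as in the AttributeCreator table."""
    out = dict(feature)
    out["filecopy_source_dataset"] = feature["path_unix"]
    name = feature["filename"]
    # sanitize=False reproduces the tutorial's mapping exactly.
    out["filecopy_dest_filename"] = safe_pdf_name(name) if sanitize else name
    return out
```

Inside FME, a similar cleanup could be done with string functions in the AttributeCreator value, but the tutorial's workspace uses the suggested filename as returned.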

9. Copy and Rename Files

The final step is to write the original PDF files to a new location, using the new filenames suggested by the OpenAIConnector.

FME uses the File Copy writer for this purpose: it copies files from a source path to a destination path and can rename them in the process.

From the menu, go to Writers > Add Writer...

In the dialog that appears, configure the following:
- Format: File Copy
- Dataset: Select a local folder
- Click OK

In the Feature Type Parameters dialog that appears:
- Subfolder Name: Categorized PDFs
- Click OK

Connect the output port of the AttributeCreator to the File Copy writer feature type to complete the workflow. The final workspace should look like this.

Once complete, running the workspace will copy each original PDF to its new location and rename it using the filename suggested by the OpenAIConnector.
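
To make the copy step concrete, here is a minimal Python sketch of the same idea using the standard library. It is not how the File Copy writer is implemented. The function name copy_and_rename is hypothetical, dictionaries stand in for FME features, and the sketch assumes the subfolder is created under the chosen dataset folder.

```python
import shutil
from pathlib import Path


def copy_and_rename(features: list[dict], dest_root: str,
                    subfolder: str = "Categorized PDFs") -> list[Path]:
    """Copy each source file into dest_root/subfolder under its new filename."""
    target_dir = Path(dest_root) / subfolder
    target_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for feature in features:
        destination = target_dir / feature["filecopy_dest_filename"]
        # copy2 preserves timestamps; the original file is left in place.
        shutil.copy2(feature["filecopy_source_dataset"], destination)
        written.append(destination)
    return written
```

In this sketch, two documents that receive the same suggested filename would overwrite each other, so duplicate names are worth checking for when you move from the sample to the full set of PDFs.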
