Filtering Unbounded Data Streams

Liz Sanderson

FME Version

  • FME 2021.0

Introduction

As sensors and devices become more ubiquitous, the volumes of data we have to handle will continue to increase. In some situations, storing all of this data (usually in a data lake) will be a requirement for downstream analysis. However, it is often the case that storing all the data coming from a stream is either not required for analysis or the costs to do so would be prohibitively expensive.

The primary goal of filtering unbounded data streams is therefore to reduce data volumes in memory before committing data to disk.

There are several ways in which data can be filtered: attribute values, location, or time.

 

Examples

Filter on Attribute Values

Read the attribute values on the incoming data, apply a filter condition, and then discard data that doesn’t meet the condition.

Streams Diagrams - Filtering - Discard Based upon Attributes

Scenario

A city is monitoring the location of all of its operational vehicles (e.g. SUVs, refuse vehicles, maintenance vans) so it can better understand vehicle movement. This particular study looks at trucks only, so the city wishes to identify vehicles of type truck and commit those records to a database.
 

Workflow

The workspace connects to the message broker; in this case, we are using Kafka. The JSON-formatted payload, which represents the location of a vehicle, is then flattened into individual attributes. The vehicle type attribute on each message/feature is then tested to see whether the message originated from a truck. From there, the attributes are cleaned up in the AttributeManager and the features are pushed to the output destination.
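Outside FME, the same attribute filter can be sketched in plain Python. This is only an illustration: the payload field names (`vehicle_id`, `vehicle_type`) are assumptions, not the city's actual schema.

```python
import json

def flatten_and_filter(raw_messages, wanted_type="truck"):
    """Parse JSON payloads and keep only messages for the wanted vehicle type."""
    kept = []
    for raw in raw_messages:
        msg = json.loads(raw)  # "flatten" the payload into a flat dict of attributes
        if msg.get("vehicle_type") == wanted_type:
            kept.append(msg)   # everything else is discarded before it reaches storage
    return kept

stream = [
    '{"vehicle_id": "V1", "vehicle_type": "truck", "lat": 49.28, "lon": -123.12}',
    '{"vehicle_id": "V2", "vehicle_type": "suv",   "lat": 49.26, "lon": -123.10}',
]
trucks = flatten_and_filter(stream)
# only the V1 (truck) message survives the filter
```

In a real deployment the `stream` list would be replaced by an ongoing consumer loop reading from the broker, but the filter condition itself is unchanged.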

 

Filter on Location

Compare the location of incoming messages to a geofence and discard the message if it doesn’t fall within the geofence.

Streams Diagrams - Filtering - Discard Based Upon Location

Scenario

A city is discussing adding tolls for vehicles to gain access to the downtown core. They wish to analyze how much time their vehicle fleet spends inside the downtown core so they can assess the impact on budgets.
 

Workflow

The workspace connects to the message broker; in this case, we are using Kafka. The JSON-formatted payload, which represents the location of a vehicle, is then flattened into individual attributes, and the attribute containing the WKT geometry is set as the geometry of the feature. Using the Clipper transformer, each vehicle is then compared to a geofence defined by the administrative boundary of the City of Vancouver (COV) to see whether or not it lies within the geofence. If the vehicle falls within the COV, the attributes are cleaned up in the AttributeManager and the features are pushed to the output destination.
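The spatial test at the heart of this workflow is a point-in-polygon check. As a minimal sketch (not the Clipper's actual implementation), the standard ray-casting algorithm decides whether a vehicle location falls inside a geofence; the rectangle below is an illustrative stand-in for the real COV boundary.

```python
def point_in_polygon(x, y, polygon):
    """Ray-casting test: is the point (x, y) inside the polygon?

    `polygon` is a list of (x, y) vertices. Counts how many polygon edges a
    horizontal ray from the point crosses; an odd count means "inside".
    """
    inside = False
    j = len(polygon) - 1
    for i in range(len(polygon)):
        xi, yi = polygon[i]
        xj, yj = polygon[j]
        # does edge (j -> i) straddle the point's y, and lie to its right?
        if (yi > y) != (yj > y) and x < (xj - xi) * (y - yi) / (yj - yi) + xi:
            inside = not inside
        j = i
    return inside

# simplified rectangular "downtown core" geofence (illustrative coordinates only)
geofence = [(-123.15, 49.26), (-123.10, 49.26), (-123.10, 49.30), (-123.15, 49.30)]

point_in_polygon(-123.12, 49.28, geofence)  # inside the geofence -> True
point_in_polygon(-123.20, 49.28, geofence)  # outside the geofence -> False
```

Messages whose coordinates return False would be discarded in memory, exactly as in the FME workflow.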

 

Filter Data Within a Period of Time

Grouping data in batch processing is simple: you can read all of the data into memory and then run a query across it. With unbounded data streams, you never stop reading the data, so to perform stateful operations on data with the same key (e.g. aggregations or joins) you have to simulate a grouping. To do this, a technique called windowing is used.

Windowing uses time to group features; stateful operations can then be performed on the data within each window.
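As a minimal sketch, a fixed (tumbling) window simply maps each event timestamp to a window index; all features sharing an index belong to the same window.

```python
WINDOW_SECONDS = 30

def window_id(timestamp):
    """Assign an epoch timestamp (in seconds) to a 30-second tumbling window."""
    return int(timestamp // WINDOW_SECONDS)

window_id(1000)  # -> 33  (window covering seconds 990-1019)
window_id(1019)  # -> 33  (same window)
window_id(1020)  # -> 34  (next window)
```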
Streams Diagrams - Filtering - Windowing

Scenario

An insurance company is trialing a new program where the driver pays for insurance based upon the distance they travel. They have 50,000 people in the trial and each participant has a sensor in their vehicle which tracks their location. The sensors report their location every five seconds. For this analysis a lower resolution is acceptable, so to save storage costs we will only record the location at a 30-second interval.
 

Workflow

The workspace connects to the message broker; in this case, we are using Kafka. The JSON-formatted payload, which represents the location of a vehicle, is then flattened into individual attributes.

The TimeWindower transformer then groups the messages into 30-second windows. All messages in each window are sorted by timestamp, and a Sampler with a Group By is used to keep only the last location for each vehicle in the window. This reduces data storage volumes by roughly 6x (one record every 30 seconds instead of one every 5).
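The window-then-sample step can be sketched in Python as follows. The message dicts and their `vehicle_id`/`timestamp` field names are illustrative assumptions, not the trial's real schema.

```python
from itertools import groupby

WINDOW = 30  # window length in seconds

def last_per_vehicle(messages):
    """Within each 30-second window, keep only the latest message per vehicle.

    `messages` is an iterable of dicts carrying `vehicle_id` and `timestamp`
    (epoch seconds); all other fields ride along untouched.
    """
    def key(m):
        # (window index, vehicle) plays the role of the TimeWindower + Group By
        return (int(m["timestamp"] // WINDOW), m["vehicle_id"])

    ordered = sorted(messages, key=lambda m: (key(m), m["timestamp"]))
    # one group per (window, vehicle); keep the last message in each group
    return [max(group, key=lambda m: m["timestamp"])
            for _, group in groupby(ordered, key=key)]

stream = [
    {"vehicle_id": "V1", "timestamp": 100, "lat": 49.1},
    {"vehicle_id": "V1", "timestamp": 105, "lat": 49.2},  # same window, later fix
    {"vehicle_id": "V1", "timestamp": 130, "lat": 49.3},  # next 30-second window
    {"vehicle_id": "V2", "timestamp": 101, "lat": 49.4},
]
sampled = last_per_vehicle(stream)
# three messages survive: V1 at t=105, V1 at t=130, V2 at t=101
```

With sensors reporting every 5 seconds, keeping one record per vehicle per 30-second window is what yields the roughly 6x reduction described above.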
