February 24, 2023

RTDIP Ingestion Pipeline Framework

RTDIP has been built to simplify ingesting and querying time series data. One of the most anticipated features of the Real Time Data Ingestion Platform for 2023 is the ability to create streaming and batch ingestion pipelines according to the requirements of the data source and the needs of the data consumer. Of equal importance is the need to query this data; an article focusing on egress will follow in due course.

Overview

The goals of the RTDIP Ingestion Pipeline framework are to:

  1. Support Python and PySpark for building pipeline components
  2. Enable execution of source, transformer, sink/destination and utility components in a framework that runs them in a defined order
  3. Create modular components that can be leveraged as steps in a pipeline task, using object-oriented programming techniques including interfaces and implementations per component type
  4. Deploy pipelines to popular orchestration engines
  5. Ensure pipelines can be constructed and executed using the RTDIP SDK and REST APIs

Pipeline Jobs

The RTDIP Data Ingestion Pipeline Framework will follow the typical job convention that users will be familiar with if they have used orchestration engines such as Apache Airflow or Databricks Workflows.

A pipeline job consists of a list of tasks, each task consists of a list of steps, and each step consists of a component and the set of parameters that are passed to that component. Dependency injection will ensure that each component is instantiated with the correct parameters.
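
To make this structure concrete, the sketch below shows one way a job definition could be modelled in Python. The class and field names here are illustrative assumptions, not the final RTDIP SDK API.

```python
# Illustrative sketch only: a pipeline job holds tasks, each task holds
# ordered steps, and each step names a component plus the parameters used
# to instantiate it via dependency injection.
from dataclasses import dataclass, field


@dataclass
class PipelineStep:
    name: str
    component: type                                    # component class to instantiate
    component_parameters: dict = field(default_factory=dict)
    depends_on: list = field(default_factory=list)     # names of steps this step depends on


@dataclass
class PipelineTask:
    name: str
    step_list: list = field(default_factory=list)      # list of PipelineStep


@dataclass
class PipelineJob:
    name: str
    task_list: list = field(default_factory=list)      # list of PipelineTask
```

A dependency injection container can then walk this structure, instantiate each step's component with its parameters and execute the steps of each task in dependency order.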

Pipeline Runtime Environments


Pipelines will be able to run in multiple environment types. These will include:

  • Python: Components will be written in Python and executed on a Python runtime
  • PySpark: Components will be written in PySpark and executed on an open source Apache Spark runtime
  • Databricks: Components will be written in PySpark and executed on a Databricks runtime
  • Delta Live Tables: Components will be written in PySpark, executed on a Databricks runtime and will write to Delta Live Tables

The runtime selected for a pipeline task depends on the components it contains, with the following order of precedence (a simple illustration of the rule follows this list):

  • Pipelines with at least one Databricks or Delta Live Tables component will be executed in a Databricks environment
  • Pipelines with at least one PySpark component will be executed in a PySpark environment
  • Pipelines with only Python components will be executed in a Python environment
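
The sketch below expresses this rule as a small resolution function; it is an illustration of the precedence described above, not the framework's actual implementation.

```python
# Illustrative only: the most demanding component in a task decides the
# runtime for the whole task (Delta Live Tables components require Databricks).
RUNTIME_PRECEDENCE = ["PYTHON", "PYSPARK", "DATABRICKS"]


def resolve_runtime(component_runtimes):
    """Return the runtime required by the most demanding component."""
    return max(component_runtimes, key=RUNTIME_PRECEDENCE.index)


assert resolve_runtime(["PYTHON", "PYTHON"]) == "PYTHON"
assert resolve_runtime(["PYTHON", "PYSPARK"]) == "PYSPARK"
assert resolve_runtime(["PYSPARK", "DATABRICKS"]) == "DATABRICKS"
```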

Pipeline Clouds


Certain components are associated with specific cloud providers, and the tables below indicate which cloud provider each such component relates to. This does not mean the component can only run in that cloud; it simply highlights that the component is related to that cloud provider.

Cloud | Target
Azure | Q1-Q2 2023
AWS | Q2-Q4 2023
GCP | 2024

Pipeline Orchestration


Pipelines will be deployable to orchestration engines so that users can schedule and execute jobs using their preferred engine.

Orchestration Engine | Target
Databricks Workflows | Q2 2023
Airflow | Q2 2023
Delta Live Tables | Q3 2023
Dagster | Q4 2023

Pipeline Components

The Real Time Data Ingestion Pipeline Framework will support the following components (an illustrative interface sketch follows this list):

  • Sources – connectors to source systems
  • Transformers – perform transformations on data, including data cleansing, enrichment, aggregation, masking, encryption, decryption, validation, conversion, normalization, de-normalization, partitioning, etc.
  • Destinations – connectors to sink/destination systems
  • Utilities – components that perform utility functions such as logging, error handling, data object creation, authentication, maintenance, etc.
  • Edge – components that will provide edge functionality, such as connectors to protocols like OPC
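
To illustrate the "interfaces and implementations per component type" idea from the goals above, the sketch below shows what such interfaces could look like. The names and method signatures are assumptions for illustration, not the final SDK contract.

```python
# Illustrative sketch of per-component-type interfaces. Concrete sources,
# transformers, destinations and utilities would implement the matching
# interface so the framework can execute any of them through a common contract.
from abc import ABC, abstractmethod


class SourceInterface(ABC):
    @abstractmethod
    def read_batch(self):
        """Read data from the source system in batch mode."""

    @abstractmethod
    def read_stream(self):
        """Read data from the source system as a stream."""


class TransformerInterface(ABC):
    @abstractmethod
    def transform(self):
        """Apply the transformation and return the transformed data."""


class DestinationInterface(ABC):
    @abstractmethod
    def write_batch(self):
        """Write data to the destination system in batch mode."""

    @abstractmethod
    def write_stream(self):
        """Write data to the destination system as a stream."""


class UtilityInterface(ABC):
    @abstractmethod
    def execute(self):
        """Run the utility function."""
```

Each concrete component, such as a Delta source or an Eventhub destination, would then implement the interface matching its type, which is what allows the framework to execute any mix of components as steps in a defined order.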

Pipeline Component Types


Component Types determine the system requirements for executing the component:

  • Python – components that are written in Python and can be executed on a Python runtime
  • PySpark – components that are written in PySpark and can be executed on an open source Apache Spark runtime
  • Databricks – components that require a Databricks runtime

Sources

Sources are components that connect to source systems and extract data from them. These will typically be real time data sources, but batch components will also be supported, as batch ingestion remains an important and necessary data source in a number of real-world circumstances.

Source TypePythonApache SparkDatabricksAzureAWSTarget
Delta*✔✔✔✔✔Q1 2023
Delta Sharing*✔✔✔✔✔Q1 2023
Autoloader✔✔✔Q1 2023
Eventhub*✔✔✔✔Q1 2023
IoT Hub*✔✔✔✔Q2 2023
Kafka✔✔✔✔✔Q2 2023
Kinesis✔✔✔Q2 2023
IoT Core✔✔✔Q2 2023
SSIP PI Connector✔✔✔✔Q2 2023
Rest API✔✔✔✔✔Q2 2023
MongoDB✔✔✔✔✔Q3 2023

*✔ – target to deliver in the following quarter
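
As an example of what a source implementation could look like, the sketch below reads a Delta table with PySpark in both batch and streaming mode. The class name and constructor parameters are illustrative assumptions rather than the final component.

```python
# Illustrative sketch of a PySpark Delta source implementing the
# SourceInterface outlined earlier.
from pyspark.sql import DataFrame, SparkSession


class SparkDeltaSource:
    def __init__(self, spark: SparkSession, table_name: str, options: dict = None):
        self.spark = spark
        self.table_name = table_name
        self.options = options or {}

    def read_batch(self) -> DataFrame:
        # Read the Delta table as a static DataFrame.
        return self.spark.read.format("delta").options(**self.options).table(self.table_name)

    def read_stream(self) -> DataFrame:
        # Read the Delta table as a streaming DataFrame.
        return self.spark.readStream.format("delta").options(**self.options).table(self.table_name)
```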

Transformers

Transformers are components that perform transformations on data. These will target certain data models and the common transformations that source or destination components require to be performed on data before it can be ingested or consumed.

Transformer TypePythonApache SparkDatabricksAzureAWSTarget
Eventhub Body✔✔✔Q1 2023
OPC UA✔✔✔✔✔Q2 2023
OPC AE✔✔✔✔✔Q2 2023
SSIP PI✔✔✔✔✔Q2 2023
OPC DA✔✔✔✔✔Q3 2023

*✔ – target to deliver in the following quarter

This list will dynamically change as the framework is developed and new components are added.
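
The Eventhub Body transformer in the table above is a good example of a small, focused transformation: the Event Hubs message payload typically arrives as a binary body column that needs to be cast to text before downstream steps can parse it. The sketch below is an illustrative PySpark version, not the final component.

```python
# Illustrative only: cast the binary Event Hubs "body" column to a string so
# that later steps can parse the payload (for example as JSON).
from pyspark.sql import DataFrame
from pyspark.sql.functions import col


class EventhubBodyTransformer:
    def __init__(self, data: DataFrame):
        self.data = data

    def transform(self) -> DataFrame:
        return self.data.withColumn("body", col("body").cast("string"))
```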

Destinations

Destinations are components that connect to sink/destination systems and write data to them.

Destination TypePythonApache SparkDatabricksAzureAWSTarget
Delta Append*✔✔✔✔✔Q1 2023
Eventhub*✔✔✔✔Q1 2023
Delta Merge✔✔✔✔Q1 2023
Kafka✔✔✔✔✔Q2 2023
Kinesis✔✔✔Q2 2023
Rest API✔✔✔✔✔Q2 2023
MongoDB✔✔✔✔✔Q3 2023
Polygon Blockchain✔✔✔Q3 2023

*✔ – target to deliver in the following quarter
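
A Delta Append destination, for example, could reduce to a thin wrapper around PySpark's Delta writer, as in the illustrative sketch below (class name, parameters and options are assumptions).

```python
# Illustrative sketch of a PySpark Delta Append destination implementing the
# DestinationInterface outlined earlier.
from pyspark.sql import DataFrame


class SparkDeltaDestination:
    def __init__(self, data: DataFrame, table_name: str, checkpoint_location: str = None):
        self.data = data
        self.table_name = table_name
        self.checkpoint_location = checkpoint_location  # required for streaming writes

    def write_batch(self) -> None:
        # Append the batch DataFrame to the Delta table.
        self.data.write.format("delta").mode("append").saveAsTable(self.table_name)

    def write_stream(self):
        # Continuously append the streaming DataFrame to the Delta table.
        writer = self.data.writeStream.format("delta").outputMode("append")
        if self.checkpoint_location:
            writer = writer.option("checkpointLocation", self.checkpoint_location)
        return writer.toTable(self.table_name)
```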

Utilities

Utilities are components that perform utility functions such as logging, error handling, data object creation, authentication and maintenance, and can normally be executed either as part of a pipeline or standalone.

Utility TypePythonApache SparkDatabricksAzureAWSTarget
Delta Table Create*✔✔✔✔✔Q1 2023
Delta Optimize✔✔✔✔Q2 2023
Delta Vacuum*✔✔✔✔✔Q2 2023
Set ADLS Gen2 ACLs✔✔Q2 2023
Set S3 ACLs✔✔Q2 2023
Great Expectations✔✔✔✔✔Q3 2023

*✔ – target to deliver in the following quarter
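
Maintenance utilities such as Delta Optimize and Delta Vacuum could each wrap a single Delta Lake SQL statement, as in the illustrative sketch below (class names and parameters are assumptions).

```python
# Illustrative sketch of Delta maintenance utilities run as SQL statements,
# implementing the UtilityInterface outlined earlier.
from pyspark.sql import SparkSession


class DeltaTableOptimizeUtility:
    def __init__(self, spark: SparkSession, table_name: str):
        self.spark = spark
        self.table_name = table_name

    def execute(self) -> None:
        # Compact small files in the Delta table.
        self.spark.sql(f"OPTIMIZE {self.table_name}")


class DeltaTableVacuumUtility:
    def __init__(self, spark: SparkSession, table_name: str, retention_hours: int = 168):
        self.spark = spark
        self.table_name = table_name
        self.retention_hours = retention_hours

    def execute(self) -> None:
        # Remove files no longer referenced by the Delta table.
        self.spark.sql(f"VACUUM {self.table_name} RETAIN {self.retention_hours} HOURS")
```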

Edge

Edge components are designed to provide a lightweight, low latency, low resource consumption data ingestion framework for edge devices such as Raspberry Pi, Jetson Nano, etc. For cloud providers, these components will target AWS Greengrass and Azure IoT Edge.

Edge Type | Azure IoT Edge | AWS Greengrass | Target
OPC Publisher | ✔ | | Q3-Q4 2023
Greengrass OPC UA | | ✔ | Q4 2023

Conclusion

This is a very high level overview of the framework and the components that will be developed. As the framework is open source, the lists and timelines defined above can change depending on circumstances and resource availability. 2023 is an exciting year for the Real Time Data Ingestion Platform. Check back regularly for updates and new features! If you would like to contribute, please visit our repository on GitHub and connect with us on our Slack channel in the LF Energy Foundation Slack workspace.

This article was originally posted on RTDIP.