Mastering Data Pipelines: The Secret to Fast and Reliable Data Operations

Discover how automated data pipelines revolutionize DataOps, from streamlining workflows to creating a competitive advantage.

In today’s data-driven world, data pipelines are the backbone of efficient and scalable DataOps. These pipelines are vital for managing both data and code, automating complex workflows, and minimizing manual data handling. But why are they so critical, and how can businesses leverage them for a competitive edge? Let’s explore the role of data pipelines in enabling DataOps and transforming raw data into actionable insights.

The Power of Data Pipelines

Data pipelines automate the flow of data, whether in batch or stream, from source to destination. They eliminate manual data handling, reduce repetitive tasks, and foster collaboration, automation, and continuous improvement (Atwal, 2020). Robust, reusable, and scalable data pipelines enhance efficiency by minimizing non-value-adding activities for data scientists and analysts. This translates to higher productivity, faster project deployment, and greater employee satisfaction. Data pipelines handle the entire data science lifecycle: from data ingestion to processing. In essence, a well-designed pipeline acts as a fully automated data refinery, transforming raw data into actionable insights that drive meaningful business impact.

From a business perspective, a data pipeline is a fully automated data refinery that turns raw data into actionable information and thus into business impact. Workflow orchestration tools support the description and execution of such end-to-end data pipelines (Matskin et al., 2021). They are applied in different areas, such as automating business workflows, scientific workflows, or more general big data orchestration (Matskin et al., 2021). In recent years, many workflow orchestration tools for business processes have been developed, such as Airflow, Prefect, Kedro, and Dagster. These tools open new opportunities for companies to automate the refinement of raw data into meaningful insights. Selecting the workflow orchestration tool that best fits the task of describing and executing data pipelines is therefore crucial for developing and productionizing data products and can lead to a competitive advantage.

Conceptual Model for Data Pipelines

Raj et al. describe a conceptual model for data pipelines (Raj et al., 2020). The model consists of nodes and connectors that together build an end-to-end data pipeline. A node performs a specified activity on the data, such as aggregating or joining datasets. Connectors link nodes with each other: the output of one node is the input of the next node downstream. The first node is the source node, and the last node downstream is the sink node. The figure below shows a meta model for nodes and connectors. This generic toolkit serves as a template for designing specific data pipelines; it is not exhaustive and can be extended as needed.

Own representation (adapted), based on Raj et al. (2020, p. 17)

For each specific use case, the nodes and connectors are chosen from the template; additional nodes and connectors can be added if necessary. The nodes perform different tasks during the data flow and are explained in the following:

  • Data generation nodes cover activities where data is created. In a manufacturing environment, frequent sources are IoT sensors, logistics data, machine/maintenance/operator logs, PLC data (e.g., sensors), ERP data (e.g., manual entries in SAP), and quality data; cameras can be a further source (Oleghe & Salonitis, 2020).

  • Data collection nodes gather these data from the points of generation. The challenge is that data arrives in high volume, high variety, and high velocity from multiple sources, and may be transmitted in batches, mini-batches, or as a continuous stream; the type of transmission affects the design of the collection nodes.

  • Data ingestion nodes ingest data into a database or another application, depending on the data type (format, size, batch versus stream). Raw data from a generation node can, for example, be ingested into a data lake, but data can also be ingested into a data pipeline or an application directly.

  • Data processing nodes perform processing tasks such as aggregating or joining datasets to solve a specific business task, and a wide variety of tools and programs exist for this. For the yield optimization project, parsing PLC text data into readable and understandable formats, creating the batch family trees, and processing quality data are the important processing steps.

  • Data storage nodes cover all tasks that store data. Different storage types are used: a data lake for raw material, a data warehouse for processed data, and NoSQL or relational databases for intermediate steps in the pipeline.

  • Data labeling nodes describe tasks in which data is labeled for supervised or reinforcement learning models.

  • Data preprocessing nodes assure high data quality, for example by handling missing values and outliers, splitting data into training, test, and validation sets, or performing other steps such as dimensionality reduction.

  • Data reception is an eighth node specified by Raj et al. that feeds data into the machine learning model or the given application (Raj et al., 2020). Its functionality is similar to the data ingestion node, so it is not considered further here.

Nodes are connected via connectors, which transmit data from one node to the next, either by carrying the data themselves or indirectly, by pointing to a database or by linking applications such as Kafka with other applications. Raj et al. propose a guideline for data transport (see figure above): transmissions can be continuous (streaming), in mini-batches, or in batches, and they further differentiate between structured, labelled, and preprocessed data transmissions. Beyond transporting data, connectors have various other functions. Since data pipelines frequently use cloud services to store data, connectors can handle authentication to external systems. Connectors can also monitor data pipelines and send alarms if an error occurs, validate data transfers, and mitigate pipeline failures, for example by specifying a 'try again' (retry) parameter or backup parameters.
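To make the node-and-connector idea more tangible, the following minimal Python sketch chains a few illustrative nodes through simple in-memory connectors. The Node and Pipeline classes, the node names, and the toy PLC record are hypothetical illustrations, not the formal model of Raj et al.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, List

@dataclass
class Node:
    """A node performs one specified activity on the data, e.g., collect, join, store."""
    name: str
    activity: Callable[[Any], Any]

@dataclass
class Pipeline:
    nodes: List[Node] = field(default_factory=list)

    def connect(self, node: Node) -> "Pipeline":
        # The connector here is just the in-memory hand-off between nodes; in a real
        # pipeline it could point to a database, a data lake, or a Kafka topic instead.
        self.nodes.append(node)
        return self

    def run(self, raw_data: Any) -> Any:
        data = raw_data
        for node in self.nodes:  # source node first, sink node last
            data = node.activity(data)
        return data

# Hypothetical end-to-end flow: collection -> processing -> storage
pipeline = (
    Pipeline()
    .connect(Node("collect", lambda _: [{"sensor": "plc_1", "value": 42}]))
    .connect(Node("process", lambda rows: [r for r in rows if r["value"] > 0]))
    .connect(Node("store", lambda rows: print(f"storing {len(rows)} rows")))
)
pipeline.run(raw_data=None)
```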

A classic ETL job is itself a data pipeline. Several data pipelines can be managed, for example, with a workflow orchestration tool like Airflow. Like other modern workflow orchestration tools, Airflow uses a directed acyclic graph (DAG) to “represent the flow and dependencies of tasks in a pipeline” (Densmore, 2021, p. 18). Directed means that a task does not run before all of its upstream tasks have been executed successfully; acyclic means that no loops are allowed in the graph. An example DAG is displayed in the figure below:

Own representation
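To make the DAG concept concrete, the sketch below defines a small Airflow DAG with a fan-in: the join task only runs after both ingestion tasks have succeeded, and no task can loop back on itself. The task names and schedule are illustrative and not taken from the figure; depending on the Airflow version, the scheduling argument is `schedule` (2.4+) or `schedule_interval` (older releases).

```python
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

def log_step(message: str) -> None:
    print(message)

# Illustrative DAG: two ingestion tasks fan in to a join task, which feeds a store task.
with DAG(
    dag_id="example_manufacturing_pipeline",  # hypothetical name
    start_date=datetime(2024, 1, 1),
    schedule="@daily",  # use schedule_interval="@daily" on Airflow < 2.4
    catchup=False,
) as dag:
    ingest_plc = PythonOperator(
        task_id="ingest_plc_logs",
        python_callable=log_step,
        op_args=["ingesting PLC logs"],
    )
    ingest_erp = PythonOperator(
        task_id="ingest_erp_data",
        python_callable=log_step,
        op_args=["ingesting ERP data"],
    )
    join_data = PythonOperator(
        task_id="join_datasets",
        python_callable=log_step,
        op_args=["joining PLC and ERP data"],
    )
    store_results = PythonOperator(
        task_id="store_results",
        python_callable=log_step,
        op_args=["writing results to the warehouse"],
    )

    # Directed: join_datasets waits for both upstream tasks; acyclic: no loops allowed.
    [ingest_plc, ingest_erp] >> join_data >> store_results
```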

ETL is an acronym: Extracting data from various sources, Transforming the raw data (e.g., cleaning, joining, or formatting it), and Loading it into its final destination. Depending on the order of the last two steps, the process is called ELT (when data is first loaded into the final destination and transformed there) or ETL (when data is first transformed and then loaded) (Densmore, 2021, pp. 21-29). A wide variety of tools can perform ETL jobs; AWS Glue with its Data Catalog, for example, is a popular choice.
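As a rough illustration of the three steps, the sketch below extracts rows from a CSV file, transforms them in plain Python, and loads them into a local SQLite table. The file, table, and column names are made up; in an ELT variant, the raw rows would be loaded first and then transformed inside the target database, typically with SQL.

```python
import csv
import sqlite3
from typing import Dict, List, Tuple

def extract(path: str) -> List[Dict[str, str]]:
    # Extract: read raw rows from a source file (hypothetical CSV).
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

def transform(rows: List[Dict[str, str]]) -> List[Tuple[str, float]]:
    # Transform: drop rows with missing values and cast measurements to floats.
    return [(r["id"], float(r["value"])) for r in rows if r.get("value")]

def load(rows: List[Tuple[str, float]], db_path: str = "warehouse.db") -> None:
    # Load: write the cleaned rows into the final destination (here a local SQLite table).
    with sqlite3.connect(db_path) as con:
        con.execute("CREATE TABLE IF NOT EXISTS measurements (id TEXT, value REAL)")
        con.executemany("INSERT INTO measurements VALUES (?, ?)", rows)

if __name__ == "__main__":
    load(transform(extract("raw_measurements.csv")))
```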

Sources:

  • Atwal, H. (2020). Practical DataOps. Apress. https://doi.org/10.1007/978-1-4842-5104-1

  • Matskin, M., Tahmasebi, S., Layegh, A., Payberah, A. H., Thomas, A., Nikolov, N., & Roman, D. (2021). A Survey of Big Data Pipeline Orchestration Tools from the Perspective of the DataCloud Project. https://datacloudproject.eu

  • Raj, A., Bosch, J., Olsson, H. H., & Wang, T. J. (2020). Modelling Data Pipelines. Proceedings - 46th Euromicro Conference on Software Engineering and Advanced Applications (SEAA 2020), 13-20. https://doi.org/10.1109/SEAA51224.2020.00014

  • Densmore, J. (2021). Data Pipelines Pocket Reference: Moving and Processing Data for Analytics. O’Reilly.