From Data Lifecycle to DataOps Architecture

Unlock the Full Potential of Your Data with Agile, Secure, and Scalable DataOps Frameworks

Data is a valuable asset for most companies in the 21st century. Like other assets, data needs to be managed over its whole lifecycle. Mismanagement of data carries many risks, such as data loss, breaches that disclose private information, non-compliance with regulations, or data inconsistencies. These risks can materialize for companies as, e.g., reputational damage or costly legal disputes. In the literature, managing data over its whole lifecycle is called data lifecycle management; it gives guidance on how to manage data from its generation or entry into the data system, through its usage, to its disposal. At the beginning of the 21st century, many different data lifecycle management approaches were developed, differing in level of detail and purpose. El Arass, Tikito, and Souissi analyzed the twelve most important data lifecycle management models: information pyramid, CRUD lifecycle, lifecycle for big data, IBM lifecycle, DataOne lifecycle, information lifecycle, CIGREF, DDI lifecycle, USGS lifecycle, PII lifecycle, enterprise data lifecycle, and Hindawi lifecycle (El Arass et al., 2017). For this work, the most relevant models are the CRUD lifecycle as a base model and the information lifecycle, which fits the needs of cloud environments.

Common Lifecycle Management Models

CRUD:

CRUD is an acronym for Create, Read, Update, and Delete, describing the basic functions of a persistent database application that allow a user to manage data over the whole data lifecycle (IONOS, 2020). The basic functions are displayed below.

Own representation based on (IONOS, 2020)

CRUD describes the basic functions needed to manage data over its lifecycle in an abstract way, without giving detailed step-by-step guidelines.
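To make the four operations concrete, the following minimal sketch maps each CRUD function to a SQL statement using Python's built-in sqlite3 module. The table name and columns are illustrative assumptions, not part of the cited model.

```python
import sqlite3

# In-memory database used only for illustration; table and columns are assumed.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")

# Create: insert a new record.
conn.execute("INSERT INTO customers (name) VALUES (?)", ("Alice",))

# Read: query existing records.
rows = conn.execute("SELECT id, name FROM customers").fetchall()

# Update: modify an existing record.
conn.execute("UPDATE customers SET name = ? WHERE id = ?", ("Alice B.", 1))

# Delete: remove the record at the end of its lifecycle.
conn.execute("DELETE FROM customers WHERE id = ?", (1,))
conn.commit()
```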

The Information Lifecycle:

The information lifecycle is a data lifecycle model developed by Lin et al. specifically for cloud computing to mitigate the risks of data leakage and privacy loss (Lin et al., 2014). The lifecycle contains seven steps: generation of data in the cloud by authorized users; transmission of encrypted data, e.g., via VPNs or digital certificate mechanisms; storage of data; access to data after validating the user's identity; reuse of data; archiving of data; and disposal of data. This approach focuses strongly on data security in the cloud environment and is more fine-grained than the CRUD model. The information lifecycle is displayed below.

Own representation based on (Lin et al., 2014)
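The seven stages can be read as an ordered progression. The sketch below models them as a simple state machine; the stage names follow Lin et al. (2014), while the forward-only transition rule is an assumption made for illustration only.

```python
from enum import IntEnum

# Stage names follow Lin et al. (2014); the transition rule below is an
# illustrative assumption, not part of the cited model.
class Stage(IntEnum):
    GENERATION = 1
    TRANSMISSION = 2
    STORAGE = 3
    ACCESS = 4
    REUSE = 5
    ARCHIVING = 6
    DISPOSAL = 7

def can_transition(current: Stage, target: Stage) -> bool:
    """Data may move to the next stage or be disposed of at any time."""
    if target is Stage.DISPOSAL:
        return current is not Stage.DISPOSAL
    return target == current + 1

# Stored data may be accessed or disposed of, but not reused without an access step.
assert can_transition(Stage.STORAGE, Stage.ACCESS)
assert can_transition(Stage.STORAGE, Stage.DISPOSAL)
assert not can_transition(Stage.STORAGE, Stage.REUSE)
```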

The lifecycle management of data is a fundamental part of the DataOps methodology to ensure data quality and data governance.

Team Data Science Process (TDSP)

The DataOps methodology facilitates the rapid development of minimum viable products with short lead times. Data analytics and data science projects frequently have an iterative character. Best practices were developed to standardize and optimize data science projects. The management of data is one part of these practices, but the focus is clearly on how to conduct data science projects. A good example of such best practices is the Team Data Science Process (TDSP), a structured framework for conducting data science projects. The main components of the TDSP are displayed below:

Own representation (modified) based on (Microsoft, 2021a, 2021b; Thakurta & McGehee, 2017)

The main part of the TDSP describes the data science lifecycle. In addition, guidelines for roles and responsibilities, a file system structure, and a toolbox of resources and infrastructure components are provided (Microsoft, 2022b). The TDSP structures data science projects in a flexible and iterative way. Understanding the business needs is crucial for designing the right data science tools. To analyze and understand the business needs, data scientists work together with subject matter experts. Based on the domain knowledge of the subject matter experts, the data scientists can outline the goals and deliverables of the data science project and define quantifiable key success measures, against which the success or failure of the project can later be evaluated. According to the goal of the project, data is acquired and analyzed: identifying available data sources, creating data pipelines, assessing data quality, cleaning and wrangling data, and performing exploratory data analysis. With the prepared data, the data science model is trained on meaningful input variables, which can be extracted from the raw data through feature engineering. The developed model is evaluated against the key success measures; if the defined thresholds are reached, the model can be deployed to production. All of these steps are iterative and interconnected. For example, after feature extraction the subject matter expert can be consulted, and the feature engineering can be adjusted based on their feedback (Microsoft, 2022a).
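As a rough illustration of this evaluate-then-deploy loop, the sketch below trains a simple model (using scikit-learn) and only marks it as a deployment candidate when an assumed key success measure is met. The dataset, metric, and 0.85 threshold are hypothetical placeholders, not values taken from the TDSP documentation.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Acquire the data (a bundled dataset stands in for a real business data source).
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Model the data; scaling plus logistic regression stands in for feature
# engineering and model training.
model = make_pipeline(StandardScaler(), LogisticRegression()).fit(X_train, y_train)

# Evaluate against a quantifiable key success measure (assumed metric and threshold).
accuracy = accuracy_score(y_test, model.predict(X_test))
THRESHOLD = 0.85  # project-specific value, assumed here

if accuracy >= THRESHOLD:
    print(f"Accuracy {accuracy:.3f} meets the threshold: candidate for deployment.")
else:
    print(f"Accuracy {accuracy:.3f} below threshold: iterate on features or model choice.")
```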

An advantage over traditional project management methods such as waterfall is that the project is conducted in iterative steps, so changes can be implemented even in late phases of the project; the approach is therefore more agile. Nevertheless, several important points are not covered by the TDSP project lifecycle, as discussed in the following section.

Challenges in Common Data Lifecycle Management Tools

The TDSP focuses mainly on the development of new data science projects. Models in legacy data science projects, however, need to be updated to avoid model decay over time. Furthermore, customers demand new features, and deploying those features to production can be a high-risk event; the same applies to data analytics projects. Thus, the TDSP is good practice for developing a data science model, but it is not suitable as a holistic approach for developing and deploying data science models or data analytics features to production. Historically, data architectures and infrastructures focused on production requirements such as latency, load balancing, and robustness. Data science and data analytics projects, however, require that new features or model updates are deployed into production to stay competitive. This requires a paradigm shift for the data architecture: “A DataOps Architecture makes the steps to change what is in the production a central idea” (Bergh et al., 2019, p. 110). Without a suitable framework for deploying changes into production, deployment remains a high-risk operation, and over time deployments become exponentially more complex as technical debt increases (Bergh et al., 2019). Technical debt is defined as the “long term costs incurred by moving quickly in software engineering” (Sculley et al., 2015). As an illustration, the components of an infrastructure required to deploy an ML model are depicted below:

Own representation based on (Sculley et al., 2015)

The ML model is just a small part of the broader landscape, and changes to the ML model can also affect other parts of the infrastructure. Deploying a new model manually therefore requires considerable effort, since the other components may need to be updated as well. Historically, the development environment has often differed from the production environment, which causes additional problems because the production configuration cannot be tested beforehand. Deploying code to production for data analytics or data science projects without a proper deployment framework results in long cycle times, production downtime, errors in production, and poor data quality. It is therefore a high risk for operations.

DataOps Architecture

In the book “The DataOps Cookbook – Methodologies and Tools That Reduce Analytics Cycle Time While Improving Quality”, Bergh et al. propose a DataOps data architecture. The core idea of this architecture is to focus on how code in data analytics and data science projects is deployed to production with low cycle time, high data quality, and low downtime risk. From a high-level perspective, the proposed architecture “decouples operations from new analytics creation and rejoins them under the automated framework of continuous delivery” (Bergh et al., 2019). This decoupling happens through the separation of work into two pipelines: the value pipeline and the innovation pipeline:

Own representation based on (Bergh et al., 2019, p.38)

The value pipeline describes the current production pipeline. Raw data is ingested into the system and cleaned, and features are extracted and fed into the production system, where either an ML/AI model is deployed or advanced analytics operations are performed. Finally, the results can be visualized and value is delivered to the end customer. The innovation pipeline focuses on the deployment of new production models. The process steps within either of the two pipelines can be executed in a serial or parallel manner, and both pipelines are completely controlled by code (infrastructure as code).

Behind the two pipelines lies a sophisticated DataOps data architecture in which different environments are used. For example, development takes place in the development environment, which can be a copy of the production environment. Separating the environments is beneficial because development does not impact production and deployment to production is simplified. The value pipeline runs in the production environment. The innovation pipeline automates the process from the idea, through the development of new features in the “dev” environment, to the successful testing and deployment of the new feature in the “prod” environment.

All processes are highly automated. The workflows in the value pipeline are automated, e.g., with a workflow orchestration tool or ETL pipelines. The deployment of a successfully developed new feature to the “prod” environment is automated (CI/CD), including automated tests before deployment. All infrastructure is described in code (infrastructure as code), which enables automated provisioning of the infrastructure for the “prod” as well as the “dev” environment. Code is stored in a version control system such as Git. Metadata is stored to gain additional insights and to detect errors as early as possible. Secrets are used to simplify authentication, and dedicated secret stores are available to ensure safe operations. Parameters are used to simplify, for example, the automated infrastructure provisioning. Data pipelines are key components of the DataOps architecture; the next chapter focuses on data pipelines.
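As an illustration of how such a pipeline can be expressed entirely in code, the following sketch chains the value-pipeline steps as plain Python functions and guards the published output with a simple automated check. The step names mirror the description above, while the functions, the data format, and the quality check are hypothetical placeholders rather than parts of Bergh et al.'s architecture.

```python
from typing import Callable, Dict, List

Record = Dict[str, float]

# Hypothetical value-pipeline steps; each stage takes and returns a batch of records.
def ingest() -> List[Record]:
    return [{"sensor_a": 1.2, "sensor_b": 3.4}, {"sensor_a": 0.9, "sensor_b": 2.8}]

def clean(batch: List[Record]) -> List[Record]:
    return [r for r in batch if all(v is not None for v in r.values())]

def extract_features(batch: List[Record]) -> List[Record]:
    return [{**r, "ratio": r["sensor_a"] / r["sensor_b"]} for r in batch]

def score(batch: List[Record]) -> List[Record]:
    # Stand-in for the production ML model or advanced analytics operation.
    return [{**r, "alert": r["ratio"] > 0.4} for r in batch]

def automated_quality_check(batch: List[Record]) -> bool:
    # Simplified stand-in for the automated tests run before results reach the customer.
    return len(batch) > 0 and all("ratio" in r for r in batch)

def run_value_pipeline(steps: List[Callable]) -> List[Record]:
    # Execute the steps serially; an orchestration tool would schedule these in practice.
    batch = steps[0]()
    for step in steps[1:]:
        batch = step(batch)
    return batch

results = run_value_pipeline([ingest, clean, extract_features, score])
if automated_quality_check(results):
    print("Publish results to the visualization layer:", results)
else:
    raise RuntimeError("Quality check failed: do not publish to production.")
```

Because the whole pipeline is ordinary code, it can be versioned in Git, tested automatically before deployment, and provisioned identically in the “dev” and “prod” environments, which is the point of the decoupling described above.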

Sources: