DataOps – Secret Of Machine Learning & Data Science Success In An Enterprise

First came Cloud-first, then DevOps-first. Now the conversation is about Data-first and, eventually, AI-first.

It is often said that in today's business environment, every organization has to be data-driven. In practice, this means that most organizations have to use data to make business decisions across the enterprise, not just at the CxO level. Data is the core of digital transformation and is pivotal to improving the user experience. We also hear that data, now considered the ‘new oil,’ is fueling the economic growth of the 21st century. In fact, many companies valued at over a billion dollars have been built on a data foundation.

However, it is often not easy for enterprises to translate these goals into practice. Converting raw data into business and operational insights, and then integrating those insights into a data monetization value chain, is a cumbersome task.

A few facts:

The dark secret behind AI and Machine Learning is that only a small number of projects ever make it into day-to-day business operations.
Gartner says 80% of analytics will not deliver business outcomes, and 80% of AI projects will “remain alchemy, run by wizards” through 2020.
Only 31% of organizations reported seeing any kind of major business value from their AI efforts – Mindtree AI Readiness Report

The reasons for this are many: data silos, lack of alignment and standardization across the organization, the inability to envision an end-to-end solution, the unavailability of a foundational platform, a huge gap between POC and production, etc.

The common areas of concern in Machine Learning & Data Science (ML & DS) initiatives:

Putting data ahead of the business objective, missing situational awareness
Data debt – Multiple lines of business (LOBs) make independent data decisions, so creating synergy across data is a big task.
Building data infrastructure is a time-consuming & repetitive task, even on Cloud.
A majority of the time is spent in data engineering pipeline activities.
Only a small fraction of time and effort goes to model learning, evaluation, prediction, etc., even though these are the important tasks.
Model deployment and building scalable APIs for production-grade deployment involve multiple parties.
Iterations of the model in production – Continuous evaluation of the deployed model, rescoring, and re-deployment are a challenge.
New users struggle to become part of ML & DS projects.

Converting data into valuable insights is always a cumbersome process because of the multiple handoff points.

Technology and organizational complexity are big challenges for ML & DS projects.

So then, what is the solution?

A few fundamental steps organizations must take before initiating enterprise-scale ML & DS projects:

Self-serve data access across the enterprise (with appropriate guardrails in place)
Self-serve data infrastructure
Standardize the data landscape

And how do we achieve this? The answer is DataOps.

Gartner’s definition of DataOps: It is a hub for collecting and distributing data, with a mandate to provide controlled access to systems of record for customer and marketing performance data, while protecting privacy, usage restrictions, and data integrity.

Practical definition: DataOps is an agile way of developing, deploying, and operating data-intensive applications. It requires technology, organizational alignment, and a process for collaboration across siloed data, systems, and people. It promotes a data factory mindset, in which the data pipeline is orchestrated, monitored, and managed in an automated fashion for everyone who handles data: mainly data engineers, data scientists, developers, and business users.

In simple words, it helps organizations derive data insights quickly, and with less human intervention.

The manifesto of DataOps is similar to that of DevOps:

Analytics as code
Disposable data environments
Reduction of the data cycle time (from ingestion to production)
Continuous data quality and performance
Data pipeline abstraction
Automated data orchestration
Collaboration among teams
Reusability
Simple and self-service
Embrace change
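
To make a few of these manifesto items more concrete (analytics as code, continuous data quality, automated orchestration), here is a minimal sketch in Python. It assumes a pipeline can be expressed as an ordered list of plain functions kept in version control. The step names (ingest, clean, aggregate), the sample records, and the quality rule are all hypothetical and are not taken from any particular DataOps product.

```python
from typing import Any, Callable, Dict, List, Tuple

Record = Dict[str, Any]
Step = Callable[[List[Record]], List[Record]]


def ingest(_: List[Record]) -> List[Record]:
    # Hypothetical source; a real step would read from a data lake or queue.
    return [
        {"customer": "a", "amount": 120.0},
        {"customer": "b", "amount": None},  # deliberately incomplete record
        {"customer": "a", "amount": 30.0},
    ]


def clean(rows: List[Record]) -> List[Record]:
    # Drop records that have no amount.
    return [r for r in rows if r["amount"] is not None]


def aggregate(rows: List[Record]) -> List[Record]:
    # Sum amounts per customer, sorted by customer name.
    totals: Dict[str, float] = {}
    for r in rows:
        totals[r["customer"]] = totals.get(r["customer"], 0.0) + r["amount"]
    return [{"customer": c, "total": t} for c, t in sorted(totals.items())]


def check_quality(rows: List[Record], required: List[str]) -> None:
    # Quality gate: fail fast if any row lacks a required field.
    for index, row in enumerate(rows):
        for field in required:
            if row.get(field) is None:
                raise ValueError(f"row {index} is missing '{field}'")


# The pipeline itself is data: each step is paired with the fields its
# output must contain. An empty list means "no check after this step".
PIPELINE: List[Tuple[Step, List[str]]] = [
    (ingest, []),
    (clean, ["customer", "amount"]),
    (aggregate, ["customer", "total"]),
]


def run_pipeline(pipeline: List[Tuple[Step, List[str]]]) -> List[Record]:
    # Run every step in order and apply its quality gate before moving on.
    rows: List[Record] = []
    for step, required in pipeline:
        rows = step(rows)
        check_quality(rows, required)
    return rows


if __name__ == "__main__":
    print(run_pipeline(PIPELINE))
```

Because the pipeline definition is an ordinary list, it can be reviewed, versioned, and reused like any other code, and the quality gate runs on every execution rather than as a separate manual step.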

DataOps’ features:

Agile data infrastructure: The ability to provision consistent and secure data environments – self-provisioning, automated administration, self-healing cluster management, and a centralized dashboard.
Automated data pipeline deployment for data engineering: A version-controlled data pipeline repository, with automated aggregation and processing of various data sets.
Machine Learning model-building: An iterative process to build a model using multiple languages and frameworks.
Model deployment workflows: Deploying models/applications to containers through the Continuous Integration/Continuous Delivery (CI/CD) pipeline.
Data monitoring & management: Helps with data pipeline monitoring, identifying hotspots in the data architecture, logging, and compliance with SLAs for data availability and accuracy.
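
The monitoring feature in the last item can be illustrated with a small SLA check. This is a sketch under assumptions: the `PipelineRun` and `Sla` records, their field names, and the thresholds are hypothetical, and the share of valid rows is used only as a simple stand-in for data accuracy. A real deployment would pull these values from its own scheduler or logging system.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List


@dataclass
class PipelineRun:
    dataset: str           # name of the data set the run produced
    finished_at: datetime  # when the last successful run completed
    rows_total: int        # rows processed
    rows_valid: int        # rows that passed validation


@dataclass
class Sla:
    max_age: timedelta      # availability: data must be at least this fresh
    min_valid_ratio: float  # accuracy proxy: minimum share of valid rows


def sla_violations(run: PipelineRun, sla: Sla, now: datetime) -> List[str]:
    """Return a human-readable message for each SLA the run breaks."""
    problems: List[str] = []

    age = now - run.finished_at
    if age > sla.max_age:
        problems.append(f"{run.dataset}: data is stale (age {age}, limit {sla.max_age})")

    # An empty run counts as fully invalid rather than dividing by zero.
    ratio = run.rows_valid / run.rows_total if run.rows_total else 0.0
    if ratio < sla.min_valid_ratio:
        problems.append(
            f"{run.dataset}: valid-row ratio {ratio:.3f} is below {sla.min_valid_ratio:.3f}"
        )
    return problems


if __name__ == "__main__":
    now = datetime(2020, 1, 1, 12, 0)
    run = PipelineRun("orders", datetime(2020, 1, 1, 6, 0), rows_total=1000, rows_valid=990)
    for message in sla_violations(run, Sla(timedelta(hours=4), 0.995), now):
        print(message)
```

Returning a list of messages instead of raising on the first failure lets a dashboard or alerting job report every breached SLA for a data set in one pass.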

Role of DataOps:

The typical components of end-to-end Machine Learning & Data Science projects have been summarized below:

It is imperative for enterprises to apply the DataOps manifesto to all data activities (data integration, data quality, data engineering, data security) to simplify end-to-end Machine Learning & Data Science.

DataOps is slowly becoming a critical discipline for any organization to survive in an evolving digital world, where dealing with real-time business intelligence is a competitive necessity. There are a few reasons for this surge:

Data is not static – With DataOps, we create data infrastructure in an automated manner, build models quickly and extract value from data with enhanced collaboration.
Technology is evolving at a fast pace – ML, DS, and analytics tools and platforms are changing rapidly, so a Center of Excellence (COE) approach is needed to keep up with new technology.
Agility is needed to succeed in a real-time world – DataOps enables agility at multiple levels: data, business, and organization.

Mindtree offers solutions to many challenges by implementing DataOps at enterprises:

Lack of alignment across organizational silos – Product IT operating model
Lack of a foundational platform – CAPE, MWatch
Lack of standardization – InnoApp for containerization
Mindtree Data Pumpkin for data platform modernization

A few examples of successful DataOps implementation by Mindtree:

Marketing analytics platform for an American rental company: The solution consisted of building a data lake, AWS Data Pipeline, on-demand EMR cluster provisioning, and automated data ingestion and processing using Spark, with R and Tableau for visualization/analytics. The solution helped build a customer 360 view and improve customer service and branding.
High-performance predictive analytics for a CPG major: A Cloud-based analytics platform leverages on-demand scalability and massively parallel processing. The solution involved Big Data, Amazon Redshift, and Spark technologies for Machine Learning. It reduced the lead time from data processing and analysis to insight generation from 21 days to less than 10 hours. The solution processes 4-6 TB of data within hours, thus improving the accuracy of the prediction.
Connected traveler experience for an international airline: The solution involved building unique customer profiles (~65 million) from hundreds of millions of customer data records from multiple touchpoints. The technology footprint involved the Cloudera distribution of Hadoop, microservices, and Lambda architecture. It is now the airline's central source of insights for driving multiple actions in promotions, customer service, service recovery, personalization, etc.

Original article: https://www.mindtree.com/insights/blog/dataops-secret-machine-learning-data-science-success-enterprise
