
Data Science Project Lifecycle

Data Science is the art of combining data, science, and technology to solve a business problem. There is a general misconception that Data Science is all about applying cool statistical/machine learning algorithms. However, Data Science involves several other critical steps before and after the use of those algorithms. In this post, we will take a brief look at the life cycle of a data science project.


Data Extraction and Processing:

Most Data Science projects start with data extraction. This could be simple or complex depending on the complexity of the data sources as well as the organization's data maturity. Data in various formats needs to be extracted, cleansed, and stored in a format that can be used for further analysis.

Data extraction and processing is done through a combination of R/Python, databases, and Big Data tools, depending on the type and size of the data.
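To make this concrete, here is a minimal sketch of extraction and cleansing in Python with pandas. The column names and data are hypothetical, and an in-memory CSV stands in for a real source such as a database or flat-file export:

```python
import io
import pandas as pd

# Hypothetical raw extract; in practice this would come from a
# database query, an API, or files on disk.
raw_csv = io.StringIO(
    "customer_id,age,monthly_spend\n"
    "1,34,120.5\n"
    "2,,88.0\n"
    "3,45,\n"
)

df = pd.read_csv(raw_csv)

# Basic cleansing: fill missing values and enforce types before storage.
df["age"] = df["age"].fillna(df["age"].median()).astype(int)
df["monthly_spend"] = df["monthly_spend"].fillna(0.0)

# The cleansed frame would then be persisted in an analysis-friendly
# format, e.g. df.to_parquet("clean_customers.parquet").
```

The same pattern (read, cleanse, persist) applies whatever the source; only the reader and writer calls change.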

Exploratory Data Analysis:

Once the data is ready, it is time to explore it and understand its patterns and pitfalls. This is usually done through visualizations and basic statistics. Once the data is understood thoroughly, appropriate treatments can be applied.

Visualizations are usually done using tools such as Tableau or using visualization packages in R Programming/Python.
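A quick sketch of what basic exploratory statistics look like in pandas, using hypothetical data from the previous step. Summary statistics and grouped views are often enough to surface outliers and segment-level patterns before any plotting:

```python
import pandas as pd

# Hypothetical cleansed data carried over from the extraction step.
df = pd.DataFrame({
    "segment": ["A", "A", "B", "B", "B"],
    "monthly_spend": [120.5, 88.0, 45.0, 60.0, 300.0],
})

# Basic statistics reveal the distribution and potential outliers.
summary = df["monthly_spend"].describe()

# Grouped views surface patterns across segments.
by_segment = df.groupby("segment")["monthly_spend"].mean()

# A histogram, e.g. df["monthly_spend"].hist(), would make the
# 300.0 outlier visually obvious.
```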

Feature Engineering:

Feature engineering involves applying appropriate transformations on data to enhance it and make it fit for applying Statistical and Machine Learning algorithms.

“The first three steps of Data Extraction and Processing, Exploratory Data Analysis and Feature Engineering typically takes about 60-70% of the time spent on a Data Science project”

This step involves a bit of custom coding in Python/R depending on the data. There are packages available in R/Python to facilitate Feature Engineering.
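As a sketch of what such custom transformations look like, here are two common ones in pandas/NumPy: a log transform to reduce skew in a long-tailed numeric feature, and one-hot encoding of a categorical feature. The column names are hypothetical:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "segment": ["A", "B", "B"],
    "monthly_spend": [120.5, 45.0, 300.0],
})

# Log transform: compresses a long-tailed numeric feature so that
# large values do not dominate model fitting.
df["log_spend"] = np.log1p(df["monthly_spend"])

# One-hot encoding: turns a categorical column into numeric
# indicator columns that algorithms can consume directly.
df = pd.get_dummies(df, columns=["segment"], prefix="seg")
```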

Model Building:

Once the data is prepared, appropriate statistical and machine learning models are applied according to the problem at hand. These models are usually predictive or prescriptive. (We will have a separate post on the different kinds of data science and analytics models soon!) Model building requires a lot of experimentation to select the appropriate one.

Python and R offer several packages out of the box that make model building fairly easy. However, fine-tuning a model requires knowledge of both the algorithms and the domain, and is time-consuming.
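As an illustration of how little code an out-of-the-box model takes, here is a minimal sketch using scikit-learn with synthetic data standing in for engineered features:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data standing in for the engineered feature matrix.
X, y = make_classification(n_samples=200, n_features=5, random_state=42)

# Fitting a baseline model is a few lines; in practice several model
# families would be tried and compared at this stage.
model = LogisticRegression()
model.fit(X, y)

train_accuracy = model.score(X, y)
```

The experimentation the post describes happens around this snippet: swapping `LogisticRegression` for other estimators and tuning their hyperparameters.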

Model Validation:

Once the model is selected, it is validated with both online and offline data to ensure the model performs as expected on all segments of data.
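For the offline part of validation, k-fold cross-validation is a common way to check that performance holds up across different slices of the data. A minimal sketch with scikit-learn, on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# Each fold is trained on 80% of the data and scored on the held-out
# 20%, so a model that only works on one segment will show high
# variance across the five scores.
scores = cross_val_score(LogisticRegression(), X, y, cv=5)
mean_score = scores.mean()
```

Online validation (e.g. shadow deployments or A/B tests) follows the same idea but scores the model against live traffic instead of held-out folds.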

Model Deployment:

Once the model is developed, it needs to be deployed in production for use. This step is undervalued and usually an afterthought. However, it can make or break a data science project. Considering deployment at an early stage of model development can save months of wasted effort.
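One simple deployment pattern is to serialize the trained model as an artifact that a serving process loads independently. A minimal sketch using joblib (which ships alongside scikit-learn); the file path is hypothetical:

```python
import tempfile
from pathlib import Path

import joblib
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Train a model as in the earlier steps (synthetic data).
X, y = make_classification(n_samples=100, n_features=4, random_state=1)
model = LogisticRegression().fit(X, y)

# Serialize the trained model so a production service can load it.
path = Path(tempfile.mkdtemp()) / "model.joblib"
joblib.dump(model, path)

# The serving side reloads the artifact and produces predictions
# without access to the training code.
served = joblib.load(path)
predictions = served.predict(X[:5])
```

Thinking about this handoff early (what format the artifact takes, what features the serving side can actually compute) is what saves the months of wasted effort mentioned above.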

Model Performance Monitoring:

Statistical/machine learning models, once developed, are stable only for a period of time and will start to underperform as the data changes. Hence, a monitoring mechanism needs to be put in place to track model performance over time. Once performance drops below a threshold, the model needs to be retrained with the most recent data.
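The monitoring mechanism can be as simple as tracking a performance metric over time and flagging the model when it crosses a floor. A minimal sketch; the scores and threshold are made up for illustration:

```python
# Hypothetical weekly accuracy of a deployed model, most recent last.
weekly_accuracy = [0.91, 0.90, 0.88, 0.84, 0.79]

# Assumed acceptable floor; in practice set from business requirements.
RETRAIN_THRESHOLD = 0.85

def needs_retraining(scores, threshold):
    """Flag the model for retraining once its latest score falls
    below the agreed performance floor."""
    return scores[-1] < threshold

flag = needs_retraining(weekly_accuracy, RETRAIN_THRESHOLD)
```

Real setups typically also monitor the input data itself (drift in feature distributions), since ground-truth labels for computing accuracy often arrive with a delay.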
