It requires more than skill to successfully complete a data science project—especially to complete one efficiently. To ensure your data science projects are executed in a manner that delivers the results you’re looking for, it’s necessary to have a plan in place. In this article, we’ll discuss some of the different approaches that can be used to conduct data science efficiently, as well as the type of work those approaches are best suited for.
CRISP-DM: The default, time-tested methodology for data science
The Cross Industry Standard Process for Data Mining (CRISP-DM) is an industry-standard methodology for data science. First developed in 1997 on an EU tender, CRISP-DM is freely available and is widely regarded as the most popular data science methodology in the world.
CRISP-DM is built on an analytics-focused process framework called KDD (Knowledge Discovery in Databases) and is designed to accommodate the business concerns that underlie enterprise data science projects. As an iterative framework that caters to data science teams, CRISP-DM is flexible enough to support most enterprise data science projects. IBM describes its customisable nature as follows:
“if your organisation aims to detect money laundering, it is likely that you will sift through large amounts of data without a specific modeling goal. Instead of modeling, your work will focus on data exploration and visualisation to uncover suspicious patterns in financial data. CRISP-DM allows you to create a data mining model that fits your particular needs.”
As the most popular process methodology in data science, CRISP-DM is an ideal choice for individuals who want to use an approach that is widely supported in the data science community.
Customising a methodology to your unique needs
CRISP-DM provides an effective “baseline” approach for analytics; however, for data scientists who do not want to spend time and effort on customising CRISP-DM to their particular needs, a task-specific methodology may be more useful.
While CRISP-DM is built to be customisable, it lacks any sort of standardised framework to guide the customisation process, making it potentially inefficient and error-prone. This is a particular concern at organisations whose capabilities are not mature enough to use CRISP-DM effectively, as it lacks a maturity framework to guide such organisations forward.
If CRISP-DM doesn’t suit your precise needs, you can check whether a derivative model exists, such as one customised for IoT projects. Teams that want process guidance on deploying and operating commercial data science products may also benefit from using ASUM-DM, IBM’s 2015 update to CRISP-DM. ASUM-DM is also suitable for use in environments that use an agile approach to project management.
What about agile methods?
Many workplaces use agile approaches to project management; however, these don’t play particularly well with the waterfall-esque process prescribed by CRISP-DM. Agile principles can help data science projects be more efficient and ensure they remain aligned with their initial goal, so there is a lot of value in incorporating them into your workflow.
Agile methods were designed for software developers, and are not perfectly compatible with data science—an experimental discipline which deals with ill-defined problems. Many agile time management principles, such as planning to achieve a defined outcome within a given time period, are simply not suitable for data science work. Agile data science methods accommodate these issues by placing extra emphasis on agile’s iterative plan-work-refine-plan work pattern.
Emphasising plan-work-refine-plan means emphasising the importance of feasibility at each stage of the project. Instead of focusing on immediately creating the best analysis possible, the agile data scientist first focuses on creating a model that simply works (i.e., plan-work), and then decides whether to invest time into improving its performance during the next planning session (i.e., refine-plan).
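The plan-work-refine-plan pattern can be sketched in a few lines of Python. This is a minimal illustration rather than a prescribed agile practice: the toy data, the function names, the error target, and the "sessions left" budget are all hypothetical assumptions made for the example.

```python
from statistics import mean


def baseline_predict(train_targets):
    """Plan-work: the simplest model that works, always predicting the training mean."""
    avg = mean(train_targets)
    return lambda _x: avg


def mean_absolute_error(model, xs, ys):
    """Average absolute gap between predictions and observed values."""
    return mean(abs(model(x) - y) for x, y in zip(xs, ys))


def should_refine(error, target_error, sessions_left):
    """Refine-plan: invest more time only if the model misses the (assumed)
    target and there is planning budget left."""
    return error > target_error and sessions_left > 0


# Hypothetical toy data, for illustration only
xs = [1, 2, 3, 4]
ys = [10.0, 12.0, 14.0, 16.0]

model = baseline_predict(ys)
error = mean_absolute_error(model, xs, ys)
decision = should_refine(error, target_error=1.0, sessions_left=2)
print(f"baseline error: {error}, refine next session: {decision}")
```

The point of the sketch is the ordering: a working baseline and a measured error come first, and the decision to improve the model is made explicitly at the next planning step rather than assumed up front.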
CRISP-DM has been critiqued as offering poor decision-management support, so the emphasis that agile methods place on regular planning sessions can be considered one of their relative strengths. ASUM-DM and AgileKDD are designed to combine CRISP-DM with some agile principles, and there are many resources available which describe how to blend the two.
OSEMN: A process for independent research
Not all work done by data scientists has a distinct business purpose, and much of it isn’t done by teams. Even corporate work that has a high-level business purpose may at times take the form of individual exploratory research into a particular data collection. In these cases, an approach that focuses on facilitating exploratory analytics is ideal.
There are many research-focused data science workflows available; however, one of the more popular options is the OSEMN model. OSEMN stands for Obtain, Scrub, Explore, Model, iNterpret, which describes the analytics workflow used in the approach. Unlike CRISP-DM and other models, OSEMN is not iterative, and moreover, it omits business-focused project phases such as business understanding, deployment, operations, and optimisation. This makes it ideal for use in exploratory research projects that prioritise data analysis.
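The five OSEMN steps map naturally onto a straight-line pipeline of functions. The following is a minimal sketch using only Python's standard library; the records, field names, and the choice of a simple least-squares line as the "model" are hypothetical assumptions for illustration, not part of the OSEMN model itself.

```python
from statistics import mean


def obtain():
    """Obtain: return raw records (hypothetical, hard-coded for illustration)."""
    return [
        {"hours": "1", "score": "52"},
        {"hours": "2", "score": "61"},
        {"hours": "", "score": "70"},  # incomplete record
        {"hours": "4", "score": "79"},
    ]


def scrub(rows):
    """Scrub: drop incomplete rows and convert strings to numbers."""
    return [
        (float(r["hours"]), float(r["score"]))
        for r in rows
        if r["hours"] and r["score"]
    ]


def explore(data):
    """Explore: compute simple summary statistics."""
    return {
        "n": len(data),
        "mean_hours": mean(h for h, _ in data),
        "mean_score": mean(s for _, s in data),
    }


def fit_model(data):
    """Model: fit a least-squares line, score = slope * hours + intercept."""
    mx = mean(h for h, _ in data)
    my = mean(s for _, s in data)
    slope = (
        sum((h - mx) * (s - my) for h, s in data)
        / sum((h - mx) ** 2 for h, _ in data)
    )
    return slope, my - slope * mx


def interpret(summary, slope, intercept):
    """iNterpret: turn the numbers into a plain-language finding."""
    return (
        f"Across {summary['n']} clean records, each extra hour is associated "
        f"with about {slope:.1f} more points (baseline {intercept:.1f})."
    )


# The steps run once, in order, with no loop back to an earlier phase
raw = obtain()
clean = scrub(raw)
summary = explore(clean)
slope, intercept = fit_model(clean)
report = interpret(summary, slope, intercept)
print(report)
```

Note that nothing in the pipeline covers business understanding or deployment, which reflects the research-focused scope described above.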
The above methodologies are process-focused, which means they work regardless of the technology they’re used with; however, because the technology we use for data science affects how we perform it, it can be useful to consider approaches that are specifically designed for use with modern tools.
Microsoft’s Team Data Science Process (TDSP), introduced in 2016, was designed for the needs of modern data science teams. To that end, TDSP incorporates specific tools that Microsoft’s data scientists created to make exploratory data analysis more efficient. These tools are free and can be easily integrated into any workflow that uses Python or R.
TDSP may be ideal for anyone who wants to use a methodology that is purpose-built for use with modern tools. As with the agile methodologies discussed in this article, TDSP is an iterative process and offers a strong decision-management framework for planning which tasks should be performed during each phase of a project’s lifecycle.
An even more tech-specific framework is available for data scientists who use the R programming language. Hadley Wickham and Garrett Grolemund’s R for Data Science is a widely used book that sets out a methodology built around the tidyverse approach to data science: a set of principles, practices, and free coding tools designed to facilitate an organised workflow.
While native to R, tidyverse principles have influenced how data science work is done in other programming languages as well. A “tidy” approach to data science can also be combined with other methodologies because tidy data science focuses primarily on creating code in a manner that is efficient to write and easy to read. It is very beginner-friendly and can be used by both new and experienced data scientists.
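As one illustration of how tidy principles carry over to other languages, the sketch below reshapes a "wide" table into tidy form, where each row is a single observation and each column is a single variable. It uses plain Python rather than the tidyverse tools themselves, and the data, the `to_tidy` helper, and its parameter names are hypothetical.

```python
# Hypothetical wide table: one column per year, which mixes a variable (year)
# into the column names
wide = [
    {"site": "A", "2019": 5, "2020": 7},
    {"site": "B", "2019": 3, "2020": 4},
]


def to_tidy(rows, id_col, var_name, value_name):
    """Reshape wide rows into one row per (id, variable) observation."""
    tidy_rows = []
    for row in rows:
        for key, value in row.items():
            if key != id_col:
                tidy_rows.append(
                    {id_col: row[id_col], var_name: key, value_name: value}
                )
    return tidy_rows


tidy = to_tidy(wide, id_col="site", var_name="year", value_name="count")
for record in tidy:
    print(record)
```

Once the data is in this shape, filtering, grouping, and plotting code tends to be shorter and easier to read, because every step can rely on the same row-per-observation layout.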
Whether you operate within a corporate or research setting, as part of a team or as an individual, it is important to approach your data science projects with a plan of action. The approaches outlined in this article will help you be more efficient, reduce errors, and prevent scope creep. In doing so, they will help you to achieve consistently strong outcomes from your projects.
The online Master of Data Science at James Cook University is designed to teach students how to take an efficient and professional approach to data science. Students at JCU learn to solve problems using industry-standard best practices, and are taught skills—such as strategic decision making—that maximise the value of the methodologies discussed in this article. Taught by leading experts with real-world experience, JCU’s Master of Data Science can provide the skills you need to take your career to the next level.
To find out more about how JCU’s Master of Data Science program can empower you to successfully solve complex data science challenges, contact our enrolment team on 1300 535 919.