
Machine learning (ML) and Data Science (DS) projects are hard to manage. Because projects are research-like in nature, it’s difficult to predict how long it will take for them to finish. They often start off with one idea and then pivot into a new direction when the proposed technique doesn’t work or the assumptions made about the data are proven wrong.
Model-building is also an inherently long process (compared to most Software Engineering and Analytics work) and it’s not uncommon for a data scientist to enter a rabbit hole and spend months on a project without a clear notion of progress. Another distinction from standard software engineering practices is that model development is usually done by a single person. This serial nature does not lend itself well to traditional collaborative SE workflows such as Kanban and Scrum.
I’ve spent quite some time researching existing workflows (primarily in Jira) to manage DS and ML projects, but without luck. Much of the information out there targets software engineering and focuses on Agile methodologies. I also talked to coworkers and friends in the field but I wasn’t able to find anything that was tailored for machine learning and data science. I noticed that some people try to adapt their workflow to standard engineering practices or, in other cases, don’t try to manage those projects at all. The latter is particularly problematic as it can lead to projects that take too long, have too ambitious a scope, and are more likely to fail.
Since I couldn’t find a good solution, I decided to build my own custom workflow for managing Machine Learning and Data Science projects. This system can be implemented in Jira and allows me to easily monitor and report on the status of projects. It also helps me limit their scope, avoiding overly complex models that take too long to build. Our scientists are provided with a structure that helps them think about how models should be built, which increases their chances of succeeding at a project. I’ve been using this system for a few years now and my group and I are pretty happy with it.
Machine learning projects have well-defined phases
Whether you’re building a complex computer vision algorithm using deep learning, a learning-to-rank model with LightGBM, or even a simple linear regression, the process of building an ML model has well-defined phases. Below is how we break the model-building process into sequential phases, from the initial research all the way to the analysis of the A/B testing results. Notice that each phase has a deliverable (or milestone) which also works as a touch point to sync up with the team or the stakeholders.
1) Research
This is the initial phase of a project. It comprises talking to stakeholders to understand the project goals and expectations, talking to business analysts to find out what data is available and where to get it, and creating some initial queries to investigate the data and develop a better intuitive understanding of the problem.
It is also in this phase that the scientist will read the literature and decide on a methodology to address the problem. This includes reading scientific papers and brainstorming ideas with their peers. Sometimes deciding on the methodology will also require learning about existing packages and building some simple prototypes using a Jupyter notebook.
Deliverable: The output of this phase is a detailed plan for the execution of the project with a breakdown of the subsequent phases (i.e., data exploration, modeling, productization, and result analysis) and an associated estimated level of effort (in number of weeks). The methodology and the data to be used must also be specified.
This plan will be shared with the stakeholders for feedback.
2) Data Exploration
This is the traditional phase of exploring the data using Pandas and a Jupyter notebook (or sometimes Tableau) in order to gain insights into the data. Typical analyses include counting the number of rows in the data, creating histograms for different feature aggregations, graphs for trends over time, and multiple distribution plots. Scientists will also build queries that will be the core of their model ETL.
Deliverable: A detailed data exploration report as a Jupyter notebook with graphs and comments providing insights into the data. This report will be shared with the rest of the group and the project stakeholders.
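As a rough illustration of the analyses listed above, here is a minimal Pandas sketch. The DataFrame and its column names (event_date, user_id, purchase_amount) are made up for the example; in a real project the data would come from the ETL queries.

```python
import pandas as pd

# Hypothetical sample standing in for the output of an ETL query.
# The column names are assumptions made for this illustration.
df = pd.DataFrame(
    {
        "event_date": pd.to_datetime(
            [
                "2020-01-01", "2020-01-01", "2020-01-02",
                "2020-01-03", "2020-01-03", "2020-01-03",
            ]
        ),
        "user_id": [1, 2, 1, 3, 2, 1],
        "purchase_amount": [10.0, 25.5, 7.25, 40.0, 12.0, 18.5],
    }
)

# Number of rows in the data.
n_rows = len(df)

# Summary statistics for a numeric feature (count, mean, quartiles, ...).
amount_summary = df["purchase_amount"].describe()

# A feature aggregation: number of events per user.
# These counts are the values a histogram would be drawn from.
events_per_user = df.groupby("user_id").size()

# Trend over time: total purchase amount per day.
daily_total = df.groupby("event_date")["purchase_amount"].sum()

print(n_rows)
print(amount_summary)
print(events_per_user)
print(daily_total)
```

In a notebook, calling events_per_user.plot.hist() or daily_total.plot() would turn these aggregates into the graphs for the report, assuming matplotlib is installed.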
3) Modeling
This is the meat of the project. Here scientists will start building their models using our internal framework. This includes building an ETL, performing feature engineering, and training models. It also includes building baseline models and providing an extensive evaluation of the final solution.
Deliverables: The outputs of this phase are:
- A model prototype
- A report in a Jupyter notebook with an extensive evaluation of the model
The final report will be shared with the group and the project stakeholders.
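To make the role of a baseline concrete, here is a minimal sketch that compares a model against the simplest possible baseline, one that always predicts the training mean. The numbers, variable names, and the choice of mean absolute error are assumptions for the example and not the evaluation our internal framework performs.

```python
from statistics import mean


def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between targets and predictions."""
    return mean(abs(t - p) for t, p in zip(y_true, y_pred))


# Hypothetical data: training targets, held-out targets, and the
# predictions of the candidate model on the held-out set.
y_train = [10.0, 12.0, 14.0, 16.0]
y_test = [11.0, 13.0, 15.0]
model_predictions = [11.5, 12.5, 15.5]

# Baseline: always predict the mean of the training targets.
baseline_value = mean(y_train)
baseline_predictions = [baseline_value] * len(y_test)

baseline_mae = mean_absolute_error(y_test, baseline_predictions)
model_mae = mean_absolute_error(y_test, model_predictions)

print(f"baseline MAE: {baseline_mae:.3f}")
print(f"model MAE:    {model_mae:.3f}")
```

If the model cannot clearly beat a baseline this simple, that is usually worth raising in the evaluation report before moving on to productization.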
4) Productization
This phase is about implementing the final version of the code. Some common tasks include adding comments to all functions and making sure the code is formatted properly according to Python standards and the standards of the group. The code is instrumented with reporting metrics such as the number of rows pulled, the number of rows in the output, the prediction error according to several metrics, and the feature importances when applicable. Finally, the code is reviewed by one data scientist and one engineer.
Sometimes the productization process will lead to a back-and-forth interaction with the platform engineers. This is particularly expected for real-time models where runtime performance is critical. It’s also possible that the memory requirements for the code are too aggressive, leading to problems down the production pipeline. The engineers might push back and require a reduction in the memory footprint for training the model.
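The instrumentation described above could look roughly like the following sketch. The function name score_with_metrics, the logger name, and the row format (a list of dicts with a "target" key) are all hypothetical; the point is only to show row counts and a couple of error metrics being recorded alongside the predictions.

```python
import logging
import math
from statistics import mean

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("model_run")  # hypothetical logger name


def score_with_metrics(rows, predict, target_key="target"):
    """Score the rows and report basic metrics about the run."""
    metrics = {"rows_pulled": len(rows)}

    # Rows without a target cannot be evaluated, so they are dropped here.
    usable = [row for row in rows if row.get(target_key) is not None]
    predictions = [predict(row) for row in usable]
    errors = [pred - row[target_key] for pred, row in zip(predictions, usable)]

    metrics["rows_output"] = len(predictions)
    if errors:
        # Two error metrics, since a single one rarely tells the whole story.
        metrics["mae"] = mean(abs(e) for e in errors)
        metrics["rmse"] = math.sqrt(mean(e * e for e in errors))

    for name, value in metrics.items():
        logger.info("%s=%s", name, value)
    return predictions, metrics


# Hypothetical input rows and a stand-in "model".
rows = [
    {"x": 1.0, "target": 2.0},
    {"x": 2.0, "target": 5.0},
    {"x": 3.0, "target": None},
]
predictions, metrics = score_with_metrics(rows, lambda row: 2 * row["x"])
```

Logging the row counts at both ends of the pipeline makes it easy to spot a run where the input query silently returned far fewer rows than usual.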
Deliverable: The output of this phase is code committed to the master branch that is ready for deployment by the platform engineering team.
5) A/B Testing
Most models will undergo an A/B testing phase. Here, scientists and stakeholders decide on the details of the test: how long it will run for, with what percentage of traffic, what is the control, how to interpret the results, etc. While the test is running, team members will mostly focus on other projects but they will need to monitor the test.
6) Results Analysis
Every scientist is responsible for a detailed analysis of their own model results. Here they’ll analyze the results metrics in many different ways to understand what’s really going on. In particular, when the test is unsuccessful we’ll need to do a deep dive into the results to figure out what went wrong.
Deliverables:
- A detailed report of the results in a Jupyter notebook.
- A hypothesis as to why things didn’t go as expected (when applicable)
The final report will be shared with the group and the stakeholders of the project.
Working with Jira
While this framework might look great in theory, the reality is that the above phases are rarely purely sequential. For example, it’s very common to jump back and forth from data exploration to modeling and then back to data exploration. Also, this process doesn’t fit into an existing Jira framework, so how can you implement this in practice?
It’s actually pretty easy. We use a Jira Kanban board and swimlanes (one per team member) with a few custom fields and changes. The guidelines below define the essence of our process:
- A new Epic ticket is created for each project and the work is split into Tasks.
- Every Task is tagged with a Phase, a custom field in Jira for selecting one out of the 6 phases listed above. (Notice that a phase can have multiple Tasks.)
- Tasks cannot be longer than 1 week (5 days). This forces team members to break down their work into smaller (but still pretty sizable) chunks, allowing for progress monitoring with minor overhead.
- There can only be one ticket in progress at any time. This ensures that we always know in what state the project is.
- Phases are not always sequential and it’s OK to move back and forth between Phases as new Tasks are created.
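The guidelines above are simple enough to check automatically. The sketch below is plain Python and does not call the Jira API; the Task fields, status names, and ticket keys are hypothetical, and it assumes the one-ticket-in-progress rule applies per team member (one swimlane each).

```python
from collections import Counter
from dataclasses import dataclass

# The six phases described earlier in this post.
PHASES = (
    "Research",
    "Data Exploration",
    "Modeling",
    "Productization",
    "A/B Testing",
    "Results Analysis",
)
MAX_TASK_DAYS = 5  # Tasks cannot be longer than 1 week (5 days).


@dataclass
class Task:
    key: str            # hypothetical ticket key, e.g. "ML-2"
    epic: str           # the project this Task belongs to
    assignee: str
    phase: str          # value of the custom Phase field
    estimate_days: float
    status: str         # assumed statuses: "To Do", "In Progress", "Done"


def check_board(tasks):
    """Return a list of human-readable guideline violations."""
    problems = []
    for task in tasks:
        if task.phase not in PHASES:
            problems.append(f"{task.key}: unknown phase {task.phase!r}")
        if task.estimate_days > MAX_TASK_DAYS:
            problems.append(f"{task.key}: estimate exceeds {MAX_TASK_DAYS} days")

    # Assumption: at most one ticket in progress per team member.
    in_progress = Counter(t.assignee for t in tasks if t.status == "In Progress")
    for assignee, count in sorted(in_progress.items()):
        if count > 1:
            problems.append(f"{assignee}: {count} tickets in progress")
    return problems


tasks = [
    Task("ML-1", "Churn model", "ana", "Research", 3, "Done"),
    Task("ML-2", "Churn model", "ana", "Modeling", 8, "In Progress"),
    Task("ML-3", "Churn model", "ana", "Data Exploration", 2, "In Progress"),
]
for problem in check_board(tasks):
    print(problem)
```

Note that ML-3 being in an earlier phase than ML-2 is not flagged, since moving back and forth between phases is allowed.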
Conclusion
Managing ML and DS projects doesn’t have to be complicated. At first, I spent about 30 minutes a day monitoring this process, but once the team got used to it, that dropped to 15 minutes a week! I know what state each project is in at any point in time, how long a project has taken, and I can quickly identify issues so I can jump in and help my team when needed. My data scientists have a clear framework for thinking about how models are built and they have become much more efficient at it.
I hope you find this as useful as I have.
Original post: https://towardsdatascience.com/how-to-manage-machine-learning-and-data-science-projects-eecacfc8a7f1