The two most widely discussed software development models in modern project management are the Waterfall methodology and the Agile methodology.
An overview of the Waterfall model
The Waterfall approach is the way to go in “consolidated” areas of engineering design. In these fields you can assume that progress flows in one direction. In layman's terms, once you make up your mind there are no second thoughts. Hence the name waterfall.
Software development purists regard the Waterfall methodology as the model of choice for highly structured projects, e.g. operating system design, real-time codecs, scientific software or software for critical environments.
However, this approach can be detrimental to AI and machine learning projects. Its adoption could lead to long development cycles and project failures.
The Agile development methodology is much better suited to machine learning projects. The table below summarises some differences between the two methodologies. It also emphasises the major reasons why Agile is probably the best development method for data science projects.
A comparison between waterfall and agile in machine learning
|Waterfall|Agile|
|---|---|
|Requirements are clear from the beginning|Within a well-defined scope, requirements evolve during the course of the project|
|Fixed plan|Plan adapts to needs and feedback|
|Inflexible to changes|Only a few general milestones, describing where the project should head|
|Deliver the product as planned|Create a minimum viable product (MVP) as fast as possible, so that users can provide feedback|
|Project is divided into successive phases, which are not revisited once completed|Project is divided into iterative two-week sprints. It is possible to go back and forth between sprints until completion|
|Tests are done at the end|Tests are performed throughout the project|
|Participation of users is not required|Users participate throughout the development phase|
|Precise cost and time estimation|Difficult to estimate the number of sprints needed to meet the requirements|
|Development of algorithms takes place after the gathering of requirements|Immediate development of algorithms|
|Simple to give updates to the management and business teams, thanks to detailed planning and accurate budget estimates|Hard to keep all parties updated, especially when they are not deeply involved|
|Does not work well for the data discovery process, which is cyclical by nature|Suitable for data science projects, which comprise multiple iterations of understanding a business problem by asking questions, acquiring data from multiple sources, data cleaning, feature engineering and modelling|
|Fail slow: only towards the end of the project does one know whether it reached its key goals|Fail fast: e.g. if model performance is 70% and the minimum acceptable performance is 90%, the project can be stopped early, since the goal is unlikely to be reached any time soon|
|Model deployment happens only at the end of the project|Model deployment occurs as soon as a good enough model is ready|
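
The fail-fast idea from the comparison (70% measured performance against a 90% target) can be made concrete with a small check run at the end of each sprint. The snippet below is only a sketch under stated assumptions: the `SprintReview` record, the `should_stop` function and the `patience` and `min_gain` thresholds are hypothetical names and values chosen for illustration, not part of any standard Agile tooling. A real team would agree on its own stopping rule.

```python
from dataclasses import dataclass


@dataclass
class SprintReview:
    sprint: int    # sprint number, starting at 1
    score: float   # model performance measured at the end of the sprint, in [0, 1]


def should_stop(reviews, target=0.90, patience=2, min_gain=0.01):
    """Return True when the target looks out of reach (hypothetical rule).

    The project is flagged for stopping if the latest score is still below
    `target` and each of the last `patience` sprints improved the score by
    less than `min_gain`.
    """
    if not reviews or reviews[-1].score >= target:
        return False  # nothing measured yet, or the goal is already met
    if len(reviews) <= patience:
        return False  # too early to judge the trend
    recent = reviews[-(patience + 1):]
    gains = [later.score - earlier.score
             for earlier, later in zip(recent, recent[1:])]
    return all(gain < min_gain for gain in gains)


# Illustrative scores only: performance plateaus at 70% while the target is 90%.
stalled = [SprintReview(1, 0.65), SprintReview(2, 0.70),
           SprintReview(3, 0.70), SprintReview(4, 0.70)]
print(should_stop(stalled))
```

With these made-up scores the check flags the project after the fourth sprint, whereas a Waterfall plan would only surface the same shortfall in the final testing phase.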