Build machine learning prediction models faster by learning from your failures
Thomas Edison once said, “I have not failed, I’ve just found 10,000 ways that won’t work.” Considering that this advice comes from one of the greatest innovators of all time, it seems wise to follow. While society has long discouraged the idea of failure, in data science failure plays an important role in the development of machine learning and artificial intelligence. The points below look at how learning from those failures can help teams build prediction models faster.
1. Collaborate and listen
Yes, it's a bit taboo, but data scientists sift through many failing models before they find a winner that transforms the face of artificial intelligence. Most data scientists produce thousands of models before they find a winner, meaning thousands fail. But how can data science teams learn without understanding each other's failures?
2. No humility with reproducibility
Aside from the fact that failure is a scary word, failures often go undiscussed because there is no practical way to understand them. While data science requires failure as part of the process, learning from it is often the hardest part in practice. It is generally not supported in the workflow, and supporting such a structure can be unreasonably time consuming and complicated. First, data scientists use a huge variety of tools and languages to produce data science. They do not follow one particular flow; there are many different methods, and data scientists often need to come up with creative solutions to reach results. This makes work extremely difficult to replicate without tracking and storing models and metadata, including every step: parameters, code version, metrics and more. Some companies are lucky enough to have tooling that supports this kind of tracking as part of their workflow.
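As a rough illustration, a minimal run log can be kept in plain Python. The `runs` directory, the `log_run` helper and the logged fields below are all assumptions for this sketch; in practice a dedicated experiment-tracking tool would usually fill this role.

```python
import json
import subprocess
import time
from pathlib import Path

LOG_DIR = Path("runs")  # hypothetical location for run records


def current_commit() -> str:
    """Best-effort lookup of the current git commit, so results can be tied to code."""
    try:
        out = subprocess.run(["git", "rev-parse", "HEAD"],
                             capture_output=True, text=True, check=True)
        return out.stdout.strip()
    except (OSError, subprocess.CalledProcessError):
        return "unknown"


def log_run(params: dict, metrics: dict, status: str) -> Path:
    """Record one experiment: parameters, metrics, code version and outcome."""
    LOG_DIR.mkdir(exist_ok=True)
    record = {
        "timestamp": time.time(),
        "params": params,        # e.g. hyperparameters, feature set
        "metrics": metrics,      # e.g. validation accuracy
        "status": status,        # keep the failures, not just the winners
        "code_version": current_commit(),
    }
    path = LOG_DIR / f"run_{int(record['timestamp'])}.json"
    path.write_text(json.dumps(record, indent=2))
    return path


# Even a model that gets discarded leaves a record the team can learn from later.
log_run({"model": "random_forest", "n_estimators": 50, "max_depth": 4},
        {"val_accuracy": 0.61}, status="failed")
```

The important design choice here is that failed runs are written down with the same care as successful ones, so the record of dead ends grows alongside the record of wins.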
3. Use resources wisely
Often, data scientists spend a lot of time and resources, such as computing power, on models that never had a chance to begin with. Unfortunately, the symptoms of a losing model are not always visible up front. A brilliant algorithm might provide outstanding results, yet demand too much computing power or take far too long to deploy. Spotting these constraints early is essential when selecting the best possible model for production.
By tracking every possible model combination, along with its hyperparameters, data scientists can quickly sift out the bad models and embrace the good.
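To make this concrete, here is a sketch using scikit-learn with illustrative hyperparameter values and an assumed per-model time budget. It records every combination it tries, including the ones it rejects, and then keeps only models that are both accurate and cheap enough to run.

```python
import time
from itertools import product

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Toy dataset standing in for real training data.
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

# Candidate hyperparameter combinations (illustrative values only).
grid = [{"n_estimators": n, "max_depth": d}
        for n, d in product([50, 200, 800], [3, 10, None])]

TIME_BUDGET_S = 5.0   # assumed per-model training budget; tune to your constraints
results = []

for params in grid:
    start = time.time()
    model = RandomForestClassifier(random_state=0, **params)
    score = cross_val_score(model, X, y, cv=3).mean()
    elapsed = time.time() - start
    # Record everything -- including combinations we reject -- so the
    # dead ends stay visible to the whole team.
    results.append({**params,
                    "cv_accuracy": round(score, 3),
                    "train_seconds": round(elapsed, 1),
                    "within_budget": elapsed <= TIME_BUDGET_S})

# Keep only models that are both accurate and cheap enough to deploy.
viable = [r for r in results if r["within_budget"]]
best = max(viable, key=lambda r: r["cv_accuracy"]) if viable else None
print(best)
```

The budget check is the part that saves resources: a configuration that blows past the training-time limit is ruled out early instead of being carried all the way to a deployment decision.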
Recording all the steps taken to reach a particular result is arguably the most important aspect of an efficient data science workflow. It is especially important to record the projects that did not lead anywhere; otherwise, time will be lost in the future exploring avenues that have already proved to be dead ends.
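Building on the hypothetical run log from the earlier sketch, a small check like the one below could stop a team from re-running a configuration that has already failed. The `already_tried` helper and the `runs` directory are assumptions for this sketch, not a standard API.

```python
import json
from pathlib import Path
from typing import Optional

LOG_DIR = Path("runs")  # the same hypothetical run-log directory as above


def already_tried(params: dict) -> Optional[dict]:
    """Return the stored record if this exact parameter set was explored before."""
    for path in LOG_DIR.glob("run_*.json"):
        record = json.loads(path.read_text())
        if record.get("params") == params:
            return record
    return None


candidate = {"model": "random_forest", "n_estimators": 50, "max_depth": 4}
previous = already_tried(candidate)
if previous is not None and previous.get("status") == "failed":
    print("Skipping a known dead end:", previous.get("metrics"))
```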