5 Things to Consider When Operationalizing Your Machine Learning
Operationalizing machine learning models requires a different process than creating those models. To be successful at this transition, you need to consider five critical areas.
As machine learning teams start, most of their work is done in a laboratory mode. This means that they work through the process in a very manual yet scientific manner. They iteratively develop valuable machine learning models by forming a hypothesis, testing the model to confirm this hypothesis, and adjusting to improve model behavior. As these projects mature and evolve, it often becomes important to take them out of experimentation mode and operationalize them.
Operationalizing machine learning requires a shift in mindset and a different set of skills for those performing the work. The inquisitive state of what-ifs and trial and error gives way to practices that are predictable and stable. The goal is to reproduce the same valuable results that were generated as part of the creation process, but in a way that is more hands-off and long-running. This changes the team’s goals from experimentation to experience management.
To effectively operationalize your machine learning model, consider these five key areas: data collection, error management, consumption, security, and model management.
During the experimentation phase, much of the data collection and cleansing is done manually. A training and testing data set is pulled from the source (which could be a data lake, a data warehouse, or an operational system) and is often hand curated. The merging, matching, deduping, and overall data wrangling are generally done one step at a time. This is mainly because the data scientists are not sure what will persist (and what won’t) in the data set. This data management process can range from work done in programming languages such as Python and R to work performed using a spreadsheet or a text editor.
With an operational model, the uncertainty of what data is valuable is removed and all the data wrangling done during the build phase now needs to be automated and productionalized. This means that the scripts used during the development phase need to be standardized into something that can be supported in a production environment. This can mean rewriting scripts into a supported language, automating the steps performed in a spreadsheet using scripting or an ETL tool, or ensuring that all the data sources used are being updated regularly and are accessible as part of the data collection process.
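As a rough illustration of what standardizing a manual cleansing step can look like, the sketch below turns a hand-done trim, normalize, and dedupe pass into one repeatable function. The field names (`email`, `name`) and the function name `clean_records` are hypothetical, chosen only for the example; a real pipeline would use whatever keys and rules the team settled on during the build phase.

```python
def clean_records(rows):
    """Normalize and deduplicate records keyed on a hypothetical 'email' field.

    Assumption: each row is a dict; rows without an email are dropped,
    and the first occurrence of each email wins.
    """
    seen = set()
    cleaned = []
    for row in rows:
        # Normalize the key the same way every run, instead of by hand.
        email = row.get("email", "").strip().lower()
        if not email or email in seen:
            continue
        seen.add(email)
        cleaned.append({"email": email, "name": row.get("name", "").strip()})
    return cleaned


if __name__ == "__main__":
    sample = [
        {"email": " A@x.com ", "name": "Ann "},
        {"email": "a@x.com", "name": "Ann"},
        {"email": "", "name": "Bob"},
    ]
    print(clean_records(sample))
```

Because the rules live in one function rather than in a sequence of manual edits, the same logic can be scheduled, versioned, and tested like any other production code.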
When data scientists are working through the process one step at a time, they manage the errors that arise. From dirty data to data access issues, if data scientists run into a problem, they interact with the people and systems that can resolve it. With these unanticipated challenges, the most effective path forward is to address them one at a time as they arise.
This is not the case once the models have been promoted to a production environment. As these models become integrated with an overall data pipeline, downstream processes come to rely on their output, and errors carry a higher risk of business disruption. As many of these potential errors as possible need to be anticipated during the pre-operation design and development stage, and automated mechanisms need to be designed and developed to address them.
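One simple form such an automated mechanism can take is a wrapper that retries a failed scoring call and, if it keeps failing, logs the problem and returns a safe default so downstream processes are not left without output. The sketch below is only one possible approach under stated assumptions: `score_with_fallback`, its parameters, and the `predict` callable are hypothetical names, and whether a default value is acceptable at all depends on the business context.

```python
import logging
import time

logger = logging.getLogger("scoring")


def score_with_fallback(predict, record, default=None, retries=2, delay=0.0):
    """Call predict(record), retrying on failure, then fall back to a default.

    Assumptions: 'predict' is any callable that may raise; 'default' is a
    value the downstream consumer can tolerate when scoring is unavailable.
    """
    total_attempts = retries + 1
    for attempt in range(1, total_attempts + 1):
        try:
            return predict(record)
        except Exception as exc:
            # Record the failure so it can be reviewed later instead of
            # being handled interactively by a data scientist.
            logger.warning("scoring attempt %d of %d failed: %s",
                           attempt, total_attempts, exc)
            if attempt < total_attempts:
                time.sleep(delay)
    return default
```

A call such as `score_with_fallback(model_predict, row, default=0.0)` (where `model_predict` stands in for the team's own scoring function) keeps the pipeline moving while the logged warnings give operators a trail to investigate.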