Bayesian Optimization in Machine Learning

Recently I discussed Bayesian optimization and its application in machine learning with some friends. I'm documenting the discussion here for future reference.

In MLOps, hyper-parameter tuning comes into play once we've figured out which ML method we need to answer a business question and what data we have, and we've started training the model. The performance of the model depends heavily on its hyper-parameters, so to get the "best" model we have to spend some time tuning them. There are three major methods for tuning hyper-parameters: 1. grid search, 2. random search, 3. Bayesian optimization. I will skip the methods that aren't widely used.

Grid search (or manual selection based on an ML engineer's experience) is the most widely used approach for early-stage, small ML teams. They have probably just started their ML project and want to get the whole ML pipeline up and running before fine-tuning the model. A full grid search can be very computationally expensive because it has to train every single combination, so most teams widen the grid spacing to run fewer iterations, just to get a quick-and-dirty model to start with.
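
As a quick illustration, here's a minimal sketch of grid search using scikit-learn's GridSearchCV; the estimator and parameter grid are placeholders I made up, not anything from our discussion:

```python
# A minimal grid-search sketch with scikit-learn; the estimator and the
# parameter grid are illustrative placeholders.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = load_iris(return_X_y=True)

# Every combination is trained and cross-validated: 3 x 3 = 9 candidates here.
param_grid = {
    "n_estimators": [50, 100, 200],
    "max_depth": [3, 5, None],
}

search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```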

Random search is one level up. Instead of examining every combination of every parameter, random search tries random combinations. Given the same time budget, random search increases the chance of locating the "best" hyper-parameters. It's widely used in teams with a fairly mature MLOps pipeline: the framework is ready, new data keeps arriving, and they focus on maintaining the model and retraining it on the incoming data.
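
The scikit-learn equivalent, RandomizedSearchCV, looks almost the same, except you give it distributions and a fixed budget of samples (again, the numbers below are just illustrative):

```python
# A minimal random-search sketch with scikit-learn; the distributions are
# illustrative placeholders.
from scipy.stats import randint, uniform
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

X, y = load_iris(return_X_y=True)

# Instead of a fixed grid, sample combinations from these distributions.
param_distributions = {
    "n_estimators": randint(50, 500),
    "max_depth": randint(2, 20),
    "max_features": uniform(0.1, 0.9),   # fraction of features per split
}

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions,
    n_iter=20,        # fixed budget of sampled combinations
    cv=5,
    random_state=0,
)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```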

Bayesian search, on the other hand, goes to another level. It actually learns where the best hyper-parameters are likely to be, at the price of extra computational cost. As came up in our discussion, Facebook uses it in its backend. Google's AutoML provides a Bayesian optimization API. AWS SageMaker also provides a "hyper-parameter tuning" API based on Bayesian search. And then there's SigOpt, probably the most famous commercial offering on the market (Two Sigma claims SigOpt reduced their tuning time from 24 days to 3 days). As far as I know, every big ML team in a high-tech company probably has someone looking into Bayesian optimization for R&D purposes.
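
On the open-source side, scikit-optimize's BayesSearchCV can be dropped in where the scikit-learn searches above were used. A hedged sketch, with a made-up search space:

```python
# A sketch with scikit-optimize's BayesSearchCV (one open-source option among
# several); the estimator and search space are illustrative placeholders.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from skopt import BayesSearchCV
from skopt.space import Integer, Real

X, y = load_iris(return_X_y=True)

search_spaces = {
    "n_estimators": Integer(50, 500),
    "max_depth": Integer(2, 20),
    "max_features": Real(0.1, 1.0),
}

# Each iteration fits a surrogate model to the results so far and picks the
# next hyper-parameters to evaluate, rather than sampling blindly.
search = BayesSearchCV(
    RandomForestClassifier(random_state=0),
    search_spaces,
    n_iter=20,
    cv=5,
    random_state=0,
)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```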

Since we’re thinking about Bayesian opt, let’s talk a bit more.
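
At its core, the trick is to fit a cheap surrogate model (usually a Gaussian process) to the trials run so far and let an acquisition function such as expected improvement decide which hyper-parameters to try next. Here's a rough, self-contained sketch of that loop, with a made-up one-dimensional objective standing in for a real training run:

```python
# A rough, illustrative sketch of the Bayesian-optimization loop: fit a
# Gaussian-process surrogate to the trials so far, then pick the next point by
# maximizing expected improvement. The objective is a made-up stand-in for
# "train the model with this learning rate and return the validation loss".
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

def objective(log_lr):
    # pretend validation loss, minimized around log10(lr) = -2
    return (log_lr + 2.0) ** 2 + 0.1 * np.random.randn()

X = np.array([[-4.0], [-2.5], [-1.0]])            # log10 of a few initial trials
y = np.array([objective(x[0]) for x in X])

gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True)

for _ in range(10):
    gp.fit(X, y)
    # score candidate learning rates by expected improvement over the best trial
    cand = np.linspace(-4.0, -1.0, 500).reshape(-1, 1)
    mu, sigma = gp.predict(cand, return_std=True)
    best = y.min()
    z = (best - mu) / np.maximum(sigma, 1e-9)
    ei = (best - mu) * norm.cdf(z) + sigma * norm.pdf(z)
    x_next = cand[np.argmax(ei)]
    X = np.vstack([X, [x_next]])
    y = np.append(y, objective(x_next[0]))

print("best log10(learning rate) found:", X[np.argmin(y), 0])
```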

If Bayesian optimization is so good, it should be everywhere by now, since it has been on the market for a while. However, in my limited experience, Bayesian opt is still far from dominating the market. Why is that? Several reasons. Actually, it's not just Bayesian opt: almost no automatic hyper-parameter tuning has been widely deployed. For any large commercial ML model, a single training run can take days or months, so it's almost impossible to run hundreds of trials to compare performance. Also, even after testing hundreds of hyper-parameter combinations, the "best" hyper-parameters probably won't give a significant improvement in the results. Considering there are many other ways to improve an ML model, e.g. more training data, a better ML algorithm, more features, etc., each of which is low-hanging fruit requiring less computational cost than Bayesian opt, Bayesian opt is a kind of luxury in my view: only a very mature ML team in a large high-tech enterprise can afford it.

Bayesian opt also naturally struggles with high-dimensional problems. As you add more and more features to your training data, and correspondingly more hyper-parameters to tune, the benefit of Bayesian opt shrinks and the hours spent running it grow. And hyper-parameter tuning is probably not valued enough for an enterprise to put that much effort into it. For example, an enterprise can market its product as having a "cutting-edge ML model", but it can hardly say "we used the best hyper-parameter tuning technology in our ML model". It's just, not very, attractive…

OK, let's go back to SigOpt. Personally, I would LOVE to connect with SigOpt and see if we can use it to help our clients. I'd also love to hear their opinions on the cloud-native Bayesian opt solutions provided by Google/Azure/AWS, as well as open-source options like GPyOpt, Ray Tune (a hyperparameter optimization framework), etc.
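
To make that last part concrete, here's a tiny, hedged sketch of what GPyOpt usage looks like; the objective and search domain are invented placeholders, not a real training job:

```python
# A tiny, hedged sketch of GPyOpt's interface; the objective and search domain
# are made-up placeholders.
import numpy as np
import GPyOpt

def objective(params):
    # GPyOpt passes a 2-D array of candidate points and expects one loss per row
    lr, n_estimators = params[0]
    pretend_loss = (np.log10(lr) + 2.0) ** 2 + 0.001 * n_estimators
    return np.array([[pretend_loss]])

domain = [
    {"name": "lr", "type": "continuous", "domain": (1e-4, 1e-1)},
    {"name": "n_estimators", "type": "discrete", "domain": tuple(range(50, 501, 50))},
]

opt = GPyOpt.methods.BayesianOptimization(f=objective, domain=domain)
opt.run_optimization(max_iter=20)
print(opt.x_opt, opt.fx_opt)
```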