In a new environment, a decision-making system has to choose among multiple available options to optimize for a goal.
This dilemma exists in real life as well, e.g. a new office, a new neighbourhood, new friends, a new date, a new product, etc.
In the absence of complete knowledge of the new environment, this process is quite difficult. As grown-up adults, we draw on our past experiences and go with our gut. But a computerized system born just now can't leverage experience.
Just like a newborn, the solution would be to Explore the surroundings: touch everything, put every little thing into our mouth. Gradually we gain experience about things we shouldn't be touching (hot mugs) and things we should be putting into our mouth more often (sweet candies). Then comes the phase called Exploitation, where we leverage what we learned about the most rewarding experiences.
As we learn and grow, we need to maintain a fine balance between:
- How much we should be Exploring new things
  - If we explore too much, we waste opportunities on random things we could have avoided.
- How much we should be Exploiting gained experience
  - If we exploit too much, we miss out on unknown rewarding experiences in the environment.
How to balance Explore vs Exploit?
There are a couple of very simple strategies, which can be framed properly in a multi-armed bandit framework.
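To make the trade-off concrete, here is a minimal sketch of a multi-armed bandit in Python. It assumes each arm pays a reward of 0 or 1 with a fixed hidden probability, and it uses an epsilon-greedy rule as one illustrative strategy (an assumption for this example, not necessarily one of the strategies meant above). The names `run_bandit`, `true_means`, and `epsilon` are hypothetical and chosen only for illustration.

```python
import random

def run_bandit(true_means, epsilon, steps=5000, seed=0):
    """Play a 0/1-reward bandit with an epsilon-greedy rule (illustrative sketch).

    true_means: hidden probability of reward for each arm (assumed setup).
    epsilon:    fraction of steps spent exploring a random arm.
    Returns (total_reward, pulls_per_arm).
    """
    rng = random.Random(seed)
    n_arms = len(true_means)
    counts = [0] * n_arms        # how many times each arm was pulled
    estimates = [0.0] * n_arms   # running average reward seen per arm
    total = 0
    for _ in range(steps):
        if rng.random() < epsilon:
            # Explore: try an arm at random, ignoring what we know.
            arm = rng.randrange(n_arms)
        else:
            # Exploit: pick the arm with the best estimate so far
            # (ties go to the lowest-numbered arm).
            arm = max(range(n_arms), key=lambda a: estimates[a])
        reward = 1 if rng.random() < true_means[arm] else 0
        counts[arm] += 1
        # Incremental update of the running average for this arm.
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
        total += reward
    return total, counts

if __name__ == "__main__":
    arms = [0.0, 1.0]  # arm 0 never pays, arm 1 always pays
    for eps in (0.0, 0.1, 1.0):
        total, counts = run_bandit(arms, epsilon=eps)
        print(f"epsilon={eps}: total reward={total}, pulls per arm={counts}")
```

With `epsilon=0.0` the agent only exploits, so in this setup it never leaves the first arm and never discovers the better one. With `epsilon=1.0` it only explores and keeps spending roughly half of its pulls on the arm that never pays. A small value in between lets it find the rewarding arm and then mostly stick with it.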