Machine learning and artificial intelligence are hot technologies right now. While most of the available tools and expertise center on supervised and unsupervised techniques, the real AI paradigm, as nature presents it to us, lies in the ability to learn while interacting with the environment.
Traditional supervised learning is unrealistic: no real-world entity finds itself in a situation where it is presented with a set of positive and negative examples. Rather, living organisms learn by interacting and experimenting. By doing so they are not only learning a model to discriminate between categories but also learning the right policy to obtain a desired outcome.
Supervised learning is suitable for generalizing to new situations but not for learning from interactions with an environment. All possible interactions would have to be in the database, labelled with the correct action. The agent can generalize by inferring an action for an unseen environment state, but it has no means to correct itself if it was wrong.
Reinforcement learning is learning how to map situations to actions so as to maximize a numerical reward signal. The system is not told which actions to take; instead it must discover which actions yield the most reward by trying them.
This class of policy learning is called reinforcement learning, and it belongs to the broader family of semi-supervised learning methodologies, since the agent receives feedback without explicit labels.
In this tutorial we will explore the fundamental concepts behind these techniques and implement them using Python.
Reward, policy, actions: what do all these terms mean? These are some of the fundamental concepts in reinforcement learning, and I will explain them in the coming sections.
Agents and Environments
An agent can be viewed as an object that is perceiving its environment through sensors and acting upon that environment through actuators. This simple idea is illustrated in the following figure.
An animal agent has eyes, ears, and other organs for sensors and mouth, legs, wings, and so on for actuators. A software agent receives data as sensory inputs and acts on the environment by displaying on the screen, writing files, and sending network packets.
We use the term percept to refer to the agent's perceptual inputs at any given instant. An agent's choice of action at any given time depends on what it perceives or has perceived up to now, but not on anything it has not perceived. The agent's behavior is described by a function that maps any given percept sequence to an action.
This mapping of the agent's response to every possible percept sequence is called the agent function. The function can be thought of as a very large table (infinite, in fact). It could be constructed by trying out all possible percept sequences and recording the agent's response to each. In practice, the table is implemented by the agent program.
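To make the table idea concrete, here is a minimal sketch of a table-driven agent. The percepts, actions, and table entries are invented for illustration; a real table would be far too large to enumerate, which is why an agent program replaces it in practice.

```python
# A (tiny, partial) agent-function table: each key is the full sequence of
# percepts seen so far, and the value is the action the agent takes.
agent_table = {
    ("dirty",): "suck",
    ("clean",): "move_right",
    ("clean", "dirty"): "suck",
}

def table_driven_agent(percepts, table, default="do_nothing"):
    """Look up the whole percept sequence in the table."""
    return table.get(tuple(percepts), default)

print(table_driven_agent(["clean", "dirty"], agent_table))  # suck
```

The table grows exponentially with the length of the percept sequence, which is exactly why the agent program computes the action instead of storing it.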
The Elements of Reinforcement Learning
Beyond the agent and the environment, there are four main elements of a reinforcement learning system: a policy, a reward, a value function, and, optionally, a model of the environment.
A policy defines the way the agent behaves at a given time. Roughly speaking, a policy is a mapping from states of the environment to the actions the agent takes in those states. The policy can be a simple function or a lookup table in the simplest cases, or it may involve complex computations. The policy is the core of what the agent learns.
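A policy in its two simplest forms can be sketched as follows. The state and action names here are made up for illustration; the deterministic version is a plain lookup table, while the stochastic version assigns a probability to each action in each state.

```python
import random

# Deterministic policy: one action per state.
deterministic_policy = {"s0": "left", "s1": "suck"}

# Stochastic policy: a probability distribution over actions per state.
stochastic_policy = {
    "s0": {"left": 0.8, "right": 0.2},
    "s1": {"suck": 1.0},
}

def sample_action(state, policy, rng=random):
    """Draw an action according to the policy's probabilities for `state`."""
    actions, probs = zip(*policy[state].items())
    return rng.choices(actions, weights=probs, k=1)[0]

print(deterministic_policy["s1"])            # suck
print(sample_action("s0", stochastic_policy))  # usually "left"
```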
A reward defines the goal of a reinforcement learning problem. On each time step, the agent's action results in a reward. The agent's final objective is to maximize the total reward it receives. The reward thus distinguishes good action results from bad ones for the agent. In a natural system, we might think of rewards as experiences of pleasure and pain.
The reward is the primary means of shaping the policy: if an action selected by the policy results in a low reward, the policy can be changed to select some other action in the same situation.
Whereas the reward signal indicates what is good in an immediate sense (each action immediately results in a reward), a value function defines what is good in the long run.
The value of a state is the total amount of reward the agent can expect to accumulate in the future, starting from that state. Values indicate the long-term desirability of states, taking into account the states that are likely to follow and the rewards they yield. Even if a state yields a low immediate reward, it can still have a high value if it is regularly followed by states that yield higher rewards.
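A common way to aggregate future rewards into a single value is the discounted return, where each future reward is weighted by a discount factor gamma per step. The discount factor is not introduced in the text above; it is an assumption of this sketch, used here to show how low immediate rewards can still add up to a high value.

```python
def discounted_return(rewards, gamma=0.9):
    """Sum of future rewards, each discounted by gamma per time step."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

# A state with zero immediate reward can still be valuable if it is
# followed by a large reward: 0 + 0.9*0 + 0.9**2 * 10 = 8.1
print(discounted_return([0, 0, 10]))
```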
The interplay between rewards and values is often confusing for beginners, as one is an aggregation of the other. Rewards are primary and immediate; values, on the other hand, are predictions of rewards, and thus secondary. Without rewards there are no values, and the only purpose of estimating values is to achieve more reward. Nevertheless, it is values we consider when making and evaluating decisions: action choices are ultimately made based on value judgments.
The agent seeks actions that bring about states of highest value, not highest reward, because those states lead to actions that earn the greatest amount of reward over the long run.
How, then, do we determine values and rewards?
Rewards are given directly by the environment, but values must be estimated, and re-estimated, from the sequences of observations the agent makes over its interactions. This makes methods for efficiently estimating values arguably the most important component of reinforcement learning algorithms.
Another important element of some reinforcement learning systems is a model of the environment. This is something that mimics the behavior of the environment and allows inferences to be made about how the environment will react. Such a model helps the agent predict the next reward if an action is taken, and hence base the current action selection on the anticipated reaction of the environment.
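A one-step model can be sketched as a table from (state, action) pairs to predicted (next state, reward) pairs. The states, actions, and transitions below are invented purely for illustration.

```python
# Hypothetical one-step model of the environment: for each (state, action)
# pair it predicts the next state and the reward received.
model = {
    ("s0", "right"): ("s1", 0.0),
    ("s1", "suck"): ("s1_clean", 1.0),
}

def predict(state, action):
    """Ask the model what the environment would do, without acting in it."""
    return model[(state, action)]

# The agent can look ahead before committing to an action:
next_state, reward = predict("s1", "suck")
print(next_state, reward)  # s1_clean 1.0
```

Methods that use such a model for planning are called model-based; methods that learn purely by trial and error, without one, are called model-free.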
Exploitation vs. exploration
A reinforcement learning agent gradually learns the best (or a near-best) policy essentially by trial and error: through random interactions with the environment, it incorporates the responses to those interactions in order to improve its overall performance. The agent's actions serve both as a means to explore (learn better strategies) and as a way to exploit (greedily use the best available strategy). Since exploration is costly in terms of resources, time, and opportunity, a crucial question in reinforcement learning is how to address the dichotomy between exploration of uncharted territory and exploitation of existing proven strategies. Specifically, the agent has to balance greedily exploiting what it has learned so far, choosing the actions that currently yield the highest reward, against continuously exploring the environment to acquire more information and potentially achieve a higher value in the long term.
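One classic way to strike this balance is the epsilon-greedy rule: with a small probability epsilon the agent explores a random action, and otherwise it exploits the action with the highest estimated value. The action names and value estimates below are invented for the sketch.

```python
import random

def epsilon_greedy(action_values, epsilon=0.1, rng=random):
    """With probability epsilon pick a random action (explore),
    otherwise pick the action with the highest estimated value (exploit)."""
    if rng.random() < epsilon:
        return rng.choice(list(action_values))
    return max(action_values, key=action_values.get)

values = {"left": 0.2, "right": 0.7, "suck": 0.5}
# With epsilon=0 the agent always exploits:
print(epsilon_greedy(values, epsilon=0.0))  # right
```

Annealing epsilon from a high value towards zero is a common refinement: the agent explores heavily early on and exploits more as its estimates improve.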
To illustrate these ideas, let’s use a simple example—the vacuum-cleaner world shown in the following Figure.
This is a simple, made-up world, so we can describe everything that happens in it and consider several variations. This particular world has 9 locations: squares labelled by coordinates (i, j), where i = 1, 2, 3 and j = 1, 2, 3. The vacuum agent perceives which square it is in and whether the square contains dirt. It can choose to move left, right, up, or down, suck up the dirt, or do nothing. One very simple agent function is the following: if the current square is dirty, then suck; otherwise, move to the next square.
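This simple agent function can be sketched directly. The traversal order (a snake pattern: odd rows swept left to right, even rows right to left) is an assumption of the sketch; the text only says "move to the next square".

```python
def simple_vacuum_agent(location, is_dirty):
    """Percept: current square (i, j), with i, j in {1, 2, 3}, plus a dirt flag.
    Returns one of: suck, left, right, down, do_nothing."""
    if is_dirty:
        return "suck"
    i, j = location
    if i % 2 == 1 and j < 3:   # odd rows sweep rightwards
        return "right"
    if i % 2 == 0 and j > 1:   # even rows sweep leftwards
        return "left"
    if i < 3:
        return "down"          # end of a row: drop to the next one
    return "do_nothing"        # end of the snake: every square visited

print(simple_vacuum_agent((1, 1), True))   # suck
print(simple_vacuum_agent((1, 3), False))  # down
```

Note this is a reflex agent with a hand-coded policy; the point of reinforcement learning is to have the agent discover such a policy on its own.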
It is important to define when a reward is given to the agent and whether it is positive or negative. A naive approach would be to give a positive reward only once the agent has cleaned all the squares. However, as the agent explores randomly, the chance of ever receiving that reward is small. To guide the agent towards the desired goal, a better strategy is to give a small positive reward whenever it cleans a square, a small negative reward whenever it attempts to clean an already clean square, and a big positive reward when all squares are cleaned.
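The strategy above can be written as a small reward function. The specific magnitudes (+1, -0.5, +10) are arbitrary choices for illustration; only their signs and relative sizes matter for guiding the agent.

```python
def reward(square_was_dirty, all_clean_after):
    """Shaped reward for the vacuum world described above."""
    if all_clean_after:
        return 10.0   # big bonus for finishing the whole job
    if square_was_dirty:
        return 1.0    # small reward for cleaning a dirty square
    return -0.5       # small penalty for sucking an already clean square

print(reward(square_was_dirty=True, all_clean_after=False))  # 1.0
```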
Here is how the vacuum-cleaner problem would be approached using value functions. First, we set up a table of numbers, one for each possible state of this small world. Each number is our latest estimate of the probability of finishing the cleaning from that state. We treat this estimate as the state's value, and the whole table is the learned value function. State A has a higher value than state B, or is considered "better" than state B, if the current estimate of the probability of finishing the cleaning from A is higher than it is from B. Any state in which all the squares are clean has a probability of 1, because the cleaning is already finished.
To select its next move, the agent examines the states that would result from each of its possible moves (one for each of the 4 directions, plus the two actions of sucking or doing nothing) and looks up their current values in the table. Most of the time the agent moves greedily, selecting the move that leads to the state with the greatest value, that is, with the highest estimated probability of finishing the cleaning. Occasionally, however, the agent chooses randomly from among the other moves instead. These are called exploratory moves, because they cause the agent to experience states it might otherwise never see. The value of this exploration becomes apparent if we add to the value function a reward related to how fast the space is cleaned; this would allow the agent to discover a better traversal strategy.
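The move-selection rule just described can be sketched as follows. Here `successors` maps each candidate move to the state it would lead to; the state names and value estimates are invented for the sketch, and the exploration probability is an assumed parameter.

```python
import random

# Learned value table: estimated probability of finishing the cleaning
# from each state (states and numbers are made up for illustration).
state_values = {"A": 0.3, "B": 0.9, "C": 0.5}

def select_move(successors, values, explore_prob=0.1, rng=random):
    """Mostly greedy move selection with occasional exploratory moves."""
    if rng.random() < explore_prob:
        return rng.choice(list(successors))               # exploratory move
    return max(successors, key=lambda m: values[successors[m]])  # greedy move

moves = {"right": "B", "down": "C", "suck": "A"}
print(select_move(moves, state_values, explore_prob=0.0))  # right
```

After each greedy move, the value of the previous state would be nudged towards the value of the state actually reached; that update step, which turns this sketch into a learning algorithm, is the subject of later sections.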