*Tapping into the power of Q-Learning: from theory to practical applications.*

Reinforcement learning deals with a unique problem setup where an arbitrary agent tries to learn the optimal way of interacting with an environment. In return for its actions, it receives delayed labels, also known as rewards; the agent’s ultimate goal is to find an optimal policy that maximizes cumulative numerical return.

Innovative companies have already implemented RL-based technologies to configure, in an optimal way, multi-tier web networks, build robust recommendation algorithms, engineer sophisticated intrusion detection schemes for IoT networks, and so on.

In this article, we will delve a bit deeper into sequential decision-making, discuss, in general terms, the promise and shortcomings of Reinforcement Learning, and explain the principles of one of the most popular RL strategies – Q-learning.

## All about reinforcement learning

Mathematically, Reinforcement Learning problems are formulated as Markov Decision Processes (MDPs); they’re defined by:

- set of states
`(S)`

, set of actions `(A)`

;
- distribution of rewards
`(R)`

– a function that maps a state-action pair to a reward;
- transition probability distribution
`(P)`

over the next state (given a state-action pair);
- discount factor
`(γ)`

that defines how important are the rewards we get now vs later in the episode.

In an MDP, everything begins with an environment’s sampling of initial values based on the learning algorithm state from initial state distribution. Then, a so-called reinforcement learning training loop begins:

1) agent plays an action `(a1)`

in a state `(s1)`

,

2) After receiving the state-action pair, the environment sends back a positive reward. `R1(a1,s1)`

3) agent transitions into the agent class next state `(s2)`

.

This continues iteratively until we reach the terminal state and the episode ends.

All agent’s actions are dictated by a policy `π`

(a function that maps the previous state s at time step t to the current position a at time step t). The objective is to find an optimal policy `π*`

that will tell the agent, given the current state s, which action brings the maximum cumulative discounted reward.

So, how is `π*`

calculated?

First, let’s establish how the goodness (optimality) of states and actions is found in RL.

Each policy produces a particular trajectory of states, with expected rewards, actions, and rewards.

We use a **Value function (V)** to measure how good a **specific state** is, in terms of maximum expected future reward and cumulative reward, for an agent following a certain policy. The optimal value function `(V*)`

, therefore, gives us the maximum achievable value (return) for each state in a shared state space (set of all possible states).

A **Q-value function (Q)** shows us how good a** specific action** is for an agent following a policy, given a more particular state, movement, and state alone. The optimal Q-value function `(Q*)`

gives us the maximum return achievable by any approach from a given state-action pair.

One of the critical properties of `Q*`

is that it must satisfy Bellman Optimality Equation, according to which the optimal Q-value for a given state-action pair equals the maximum reward the agent can get from an action in the current state + the maximum discounted reward value it can obtain from any possible state-action pair that follows. The equation looks like this:

`Q* (s, a) = E [Rt+1 + γ max a′ Q*(s′,a′)]`

Since Bellman Equation helps calculate `Q*`

at each time step, it gives us a way to determine the optimal policy; we know the maximum q value of the following state-action pair, so the optimal strategy is to play actions that maximize the expected value of the q table.

`R + γ Q*(s',a')`

`R`

is the immediate reward
`Q*(s', a')`

– is the optimal Q-Value for the state we’ll end up in
`γ`

– is the discounting factor

The optimal policy `π*`

, as we can infer from this, is to take the best action – as defined by the maximum value, by `Q*`

– at each time step.

Respectively, reinforcement Learning provides a robust framework for decision-making under uncertainty. The process involves:

*An agent interacting with an environment.*
*Learning from rewards.*
*Updating its policy to maximize future rewards.*

The ultimate goal is to determine an optimal policy that guides the agent to take actions leading to the highest cumulative discounted reward, showcasing the power and potential of reinforcement learning in various real-world applications.

## Q-Learning

Now, let’s discuss Q-learning, which is the process of iteratively updating Q-Values for each state-action pair using the Bellman Equation until the Q-function eventually converges to `Q*`

In the simplest form of Q-learning, the Q-function is implemented as a table of states and actions (Q-values for each s, a pair are stored there), and we use the Value Iteration algorithm to update the values as the agent accumulates knowledge directly.

Since all the values in the table are typically initialized to zero at the beginning, the first random action the agent takes is random. Upon playing it, however, the agent observes the associated negative reward and state of transition and reward functions, and based on this data, we can compute the Q-value and update the table.

We also need to balance somehow the agent’s exploration and exploitation efforts with expected future reward – it can’t just opt for an optimal route for an immediate reward every time, and, at some probability, it must take random actions along the wrong path that might bring a greater return in the long run.

One way of doing this is by applying the epsilon (ε) greedy strategy.

This is how it’s executed: Initially, `ε`

is set up to be 1, and so we’re confident the agent will start off by exploring the surrounding environment, i.e., acting randomly.

At the start of the next episode, however, εwill decay by some rate (which we must determine beforehand). As the agent learns more, it will become progressively more greedy – `ε`

will keep dropping in the learning rate, so the agent will lean more and more towards exploiting its existing knowledge rather than supervised learning more than investigating the environment.

## Deep Q-Learning

This iterative approach to update the q-values and directly the Q-values in the table might be sufficient for tasks with a relatively small state space. Still, the method’s performance will decrease drastically if we tackle more complex environments. The sheer amount of computational resources and time we’ll need to traverse the new states and modify Q-values will render the task computationally inefficient and infeasible.

### So, what can we do?

Instead of computing Q-values directly through value iterations, we can use a function approximator to estimate the optimal Q-function update. Currently, the method of choice to do this is through applying neural networks.

Integrating the artificial intelligence and neural nets into the Q-learning process is called Deep Q-Learning, which we call a network that uses NNs to approximate Q-functions – a Deep Q-Network (or DQN).

### Here’s how Deep Q-Learning works:

A neural network receives states from an environment as an input and then produces estimated Q-values for each action an agent can choose in those states.

We already know that `Q*`

must satisfy the Bellman Equation `Q*(s, a) = E [R t + 1 + γmax a′q∗(s′, a′)`

, so we’ll take the target values from the right-hand part of it. Then, we’ll compare the model’s outputs to the target values and calculate the loss.

Next, using a backdrop algorithm and stochastic gradient descent, the NN will update its values to minimize the error.

When dealing with a small state space, we’re better off using standard Q-learning rather than neural nets, as conventional value iteration will likely converge to optimal values faster.

## Some applications of the Q-Learning algorithm

In Reinforcement Learning (RL), Q-learning has found practical applications in various fields. This section explores how Q-learning algorithms have been effectively utilized in online web systems auto-configuration, news recommendations, and network traffic signal control.

### Online Web Systems Auto-configuration

An RL-based approach can be implemented for the automatic configuration of multi-tier web systems; the model can learn to adapt performance parameter settings efficiently and dynamically to both workload changes and modifications of virtual machines. According to the experiments reported here, such systems can determine the optimal or near-optimal configuration after about 25 iterations.

Standard Q-learning can be applied in this case. The issue can be formulated as a finite MDP: the network’s configuration will represent the state space, actions (“increase,” “decrease,” “keep”) for each parameter will be the action space, and the full reward function will be defined as the difference in target response time and measured response time.

### News recommendations

An innovative approach based on deep reinforcement and machine learning framework has been applied to tackle the shortcomings of standard online recommendation systems, such as boring the user by suggesting the same or very similar materials, not addressing the dynamic news environment, failing to incorporate essential forms of user feedback (return patterns) when generating recommendations, etc.

A DQN can model future rewards explicitly, as opposed to traditional models that rely solely on current rewards (Click-Through-Rates). A sophisticated built-in exploration strategy can also be utilized, allowing the system to venture into new territories and find atypical items that might excite the user.

Here’s the gist of how the system works. The researchers built four categories for features to put into a deep neural net. User and context features were, in this case, state features, while user news and news features were the action features. A DQN could calculate Q-values from these inputs and then send them to the engine that generated recommendations. A user’s clicking on a news item was a part of the reward system.

### Network traffic signal control

An RL-based framework was used to determine an optimal network control policy that reduced average delays and probabilities of intersectional cross-blocking and congestion. In a simulated traffic environment, an RL agent tasked with controlling traffic signals was put at the central intersection. The state space was represented by a feature vector (of 8 dimensions) whose elements each represented a relative traffic flow; 8 non-conflicting phase combinations (for each isolated intersection) were chosen as an action set, and the expected reward was the reduction in delay, compared to previous steps.

Though only tested in a simulated environment, the DQN-based approach proved more effective than traditional ones. The applications of Q-learning in these diverse areas demonstrate its versatility and effectiveness in solving complex problems. By leveraging the power of RL and Q-learning, businesses can optimize their operations, improve user experiences, and tackle challenges in dynamic environments.

## Conclusion

RL is concerned with a foundational issue of how an agent can learn to make optimal sequences of decisions under uncertainty. Various techniques help the agent determine the degree of goodness (by this, we mean some notion of optimality) of its policies, and currently, one of the most popular of them is a model-free strategy called Q-learning.

When Q-Leaning is used in tandem with Deep Learning (artificial neural nets serve as function approximators of deep reinforcement learning), we call this Deep Q-Learning. Though nascent, the method has already helped businesses across industries achieve meaningful breakthroughs in various tasks.

Wondering whether implementing Reinforcement Learning to real-world applications can benefit your company too? Contact our experts for a free consultation right now!