On-policy vs Off-policy Monte Carlo Control Methods for Supply Chain Optimization: A Use Case of Inventory Replenishment

19 min readOct 22, 2023

Introduction

Replenishing inventory in a dynamic retail environment presents a complex challenge with far-reaching implications for profitability and customer satisfaction. Leveraging the power of Reinforcement Learning (RL), offers a promising avenue to address this issue. In this endeavor, we draw inspiration and insights from the rich corpus of literature, particularly from Enes Bilgin’s “Mastering Reinforcement Learning with Python,” which provides invaluable code references, which we will use to implement and solve the problem at hand. Additionally, the seminal work of Richard S. Sutton and Andrew G. Barto in “Reinforcement Learning: An Introduction” serves as an indispensable foundation, guiding our understanding of the core principles underpinning RL algorithms.

This study dives into the application of RL to the inventory replenishment problem, focusing on two distinct methodologies: On-Policy and Off-Policy First-Visit Monte Carlo methods.

I- What are Monte Carlo methods ?

Monte Carlo estimation involves making estimations through repeated random sampling, particularly in situations where using alternative methods like analytical formulas or deterministic algorithms is challenging or not feasible. These methods find application across various fields including physics, engineering, statistics, finance, and artificial intelligence.

Examples of Monte Carlo methods include simulating complex systems with multiple variables (e.g., gases, fluids, solids), estimating the value of integrals or functions with high-dimensional inputs, generating samples from probability distributions or stochastic processes, optimizing functions or systems with uncertain parameters or constraints, and evaluating the risk or uncertainty associated with a decision or scenario.

In the realm of Reinforcement Learning, Monte Carlo (MC) denotes a set of techniques for estimating state and action values by using sample trajectories from complete episodes.

Recall that a trajectory refers to the sequence of states and actions that an agent experiences as it interacts with an environment during a single episode. It represents the possible path or sequence of states that the agent follows from the initial state to the terminal state, along with the actions taken at each state. Trajectories are essential for estimating the value of states or state-action pairs using Monte Carlo methods, as they provide the necessary information about the agent’s experiences in the environment.

This approach proves convenient because, for most practical RL or control problems, the dynamics of the environment — namely, state transitions and reward probability distributions — are often either too intricate to handle or not known in advance.

II- Framework

First let us recall some key concepts:

In a Decision problem we typically want to learn what are the optimal decisions to make to maxmise a certain profit or minimize a certain cost, this is what we call the objectif function.

In the context of dynamic programming and Reinforcement Learning, we introduce the concept of a MDP, a Markov Decision Process , in which we model our environment the following way:

1 — We define a set of states for which our agent/environment can exist in, for example in an inventory problem the state can be the current inventory.

2 — We define a set of actions we can make at each state that will transition us to a next state, for example how much goods should we add to our current stock.

3 — We define a reward for the said action, this will help the agent learn what is a good action and a bad action.

4 — The sum of all rewards across an “episode” is what we call the Return, Howerver since we can’t have an MDP with no terminal state, because it could go to infinity, we introduce a 0≤discount rate ≤ 1, and we multiply the reward of each action at time t withe the discount rate to the power of t , the sum gives us the discounted return G_t.

5 — For a given state s we can define the state value function v(s) as the expected value of G_t given S_t = s .

6 — Similarity for an action a , we can define an action value function q(s, a) as the the expected value of G_t given S_t = s AND A_t = a

5- The set of mapping from state to action , will define a policy , and the optimal policy is the one that maximises the total gain. Formally we can define a policy as a distribution over action space given state space, like this : π(a|s)= P(A_t = a | S_t = s). In other words it tells us which action we are most likely to take given we’re in known state.

III — Estimating the state-value function

Recall that :

Where v_π(s) is the value of state s under policy π, γ is the discount factor and R is the Reward.

MC Prediction suggests observing many sample trajectories, where a trajectory is a sequence of tuples (state, action, reward) , starting in s to esitmate E_π. This is like tossing a dice so many times to try and estimate the probability distribution.

Now we have:

The problem with this is that we need to identify every single trajctory τ starting at state s , the probability of observing each τ under the policy π and the corresponding discounted return G_τ. Thing is there could be an infinite number of possible trajectories originating from s.

Luckily we know a good estimator for the Expected value, thanks to monte carlo methods, it’s the mean

One final notice , consider this trajectory τ: S1 -> S3 -> S2 -> S3 -> S4 , τ visits the state S3 twice, we can use the return from just the first visit, this is called first-visit MC method , or from each of the visits , this is called every-visit MC method.

IV — Estimating the action value function

Recall that:

This means that similarly to the state value function , we can use the discounted return G_τ to estimate q_π(s_t, a_t), ∀t ∈ (0,1,…etc), and then we can find the best action for a state s by comparing the action values,

The issue here is we do not necessarily have all possible action value function estimates for all possivle actions taken in a given state.

A common solution for this is to only consider policies π that select all actions in a given state with a positive none null probability, in other words π(a|s) > 0 , ∀ a ∈ A(s) , ∀ s ∈ S , where A(s) denotes the set of all possible actions for a given state s and S denotes the set of all possible states.

V — On-policy vs Off-policy

On-policy and off-policy methods represent two distinct approaches in reinforcement learning algorithms. They fundamentally diverge in their approach to updating the action-value function, denoted as Q(s, a), based on experience tuples (s, a, r, s’), where (s,a,r,s′) consists of the current state s, the action taken a, the reward received r, and the next state s′.

Here, we encounter two crucial concepts: 1) Exploration, vital for discovering optimal policies during training or simulation, and 2) the desire for direct access to the best actions during inference. The former corresponds to what’s termed a behavior policy, often associated with on-policy methods, while the latter aligns with a target policy, characteristic of off-policy methods. Essentially, any policy apart from (1) falls into the category of off-policy.

Put simply, on-policy methods employ the same policy for both action selection and learning, whereas off-policy methods utilize a distinct policy for learning, one that may differ from the policy used to generate the data.

On-policy methods typically offer a simpler and more intuitive framework. However, they demand exploration during training to guarantee that all state-action pairs receive sufficient attention. In contrast, off-policy methods exhibit greater flexibility, allowing them to glean insights from a broader range of data sources. This includes data generated by external agents or random processes, making them particularly valuable in tackling cold start problems.

Off-policy methods often exhibit higher variance and slower convergence compared to on-policy methods. This arises from the need to rectify the mismatch between the behavior policy and the target policy. This correction is accomplished through a technique known as importance sampling (we’ll explain this term later on , when we delve deeper into off policy Monte carlo methods), which involves adjusting the probability of the data based on the ratio of the target policy to the behavior policy. Off-policy methods, however, compensate for this with their increased power and versatility. They have the capacity to glean insights into multiple policies from the same dataset or even repurpose old data for the acquisition of a new policy. As an illustrative example, consider Q-learning, an off-policy method renowned for its ability to learn the optimal policy independently of the behavior policy. It’s worth noting that exploratory policies, while invaluable for uncovering new strategies, tend not to be optimized for efficiency since they incorporate random actions. This inherent sub-optimality is carried over into the value estimates of on-policy methods, as they assess state and action values based on the behavior policy, which naturally includes elements of exploration.

VI — Off-Policy MC methods and importance sampling

Now consider we have a set of trajectories collected under some abstract behaviour policy b, we would like to be able to use the experience learned under b to estimate the state and action values for a new target policy π, without having to do exploration again.

To help us with that we will use a technique called Importance sampling, to explain how it works let’s consider a simple example, imagine we have two coins C_1 and C_2, when you toss a coin , you get a reward h_1 if you get face , let’s denote it getting a 1 and another reward h_2 if you get tails, let’s denote it getting a 2.

Let H(x) denote the reward you receive when you get x as a result of a toss, where x ∈ {1,2} , Our goal is to estimate the expected reward E(H) given a coin C,

Where P_C (x) denote the probability of observing x under coin C.

Let’s say you toss coin C_1 N times and record the reward h(x_i), i∈ {1,…,N}, then a estimate for the expected reward is simply :

Now given this estimaor μ^C1 , we can deduce an estimator for C2 μ^C2 without running any more experiments to gather new observations from C2, all we need to know is P_C1(x) and P_C2(x).

This is importance sampling , in μ^C1 we can assume each h_i(x_i) is multiplied by weight w_i = 1, IS works by scaling these weights according to P_C2(x):

The probabilities fraction inside the sum is called the importance sampling ratio ρ.

We can normalize this estimator with respect to the new weights and get a weighted importance sampling :

Now back to Off-Policy MC, a behaviour policy b would be the coin C_1, whilst a target policy π would be the coin C_2, and observing a particular trajectory is similar to observing a face or a tail for a coin toss.

Now starting in a given state S_t , the probability of obsrving a trajectory a_t,S_t+1, a_t+1,S_t+2,…,S_T under a policy b is given by:

Similarly , for the target policy we get π:

This implies that the importance sampling ratio is :

We can estimate the expectation under the target policy as :

VII — Problem and modelling

To tackle the inventory replenishment problem with reinforcement learning, we employ a mathematical model represented by the `Store` class. This class is an environment defined using the OpenAI Gym framework. Within this environment, we have specified parameters such as `v_demand` for possible demand values and their respective probabilities `p_demand`. The store has a finite capacity determined by the maximum demand value. The week is segmented into seven days, each denoted by ‘Mon’, ‘Tue’, ‘Wed’, ‘Thu’, ‘Fri’, ‘Sat’, and ‘Sun’. The unit cost of restocking is set at 4, and the net revenue per item sold is 7. The available actions for replenishment are discretized into 0, 100, 200, 300, and 400 units. The state space encompasses all possible combinations of days and inventory levels.

The `get_next_state_reward` method computes the subsequent state and associated rewards after taking an action. It factors in the current day and inventory, as well as the chosen replenishment quantity. The resulting dictionary contains information on the following day, the updated inventory, the cost of restocking, the number of sales, the generated revenue, the remaining inventory, and the overall reward.

The `get_transition_prob` method computes the transition probabilities for various demand scenarios given a state and an action. It iterates through potential demand values and calculates the subsequent state, reward, and their associated probabilities.

The environment supports the standard Gym methods: `reset` initializes the environment to the start of a new week with zero inventory, and `step` simulates a single time step by considering a selected action, generating a random demand, and returning the updated state, reward, termination status, and additional information about the demand and sales. Finally, the `is_terminal` method identifies whether the current state is a terminal state, which occurs when the day is ‘Sun’.

By leveraging this mathematical model within a reinforcement learning framework, we aim to develop an effective strategy for inventory replenishment that maximizes profits over time. The defined state and action spaces, along with the transition dynamics and reward structure, enable the application of Monte Carlo methods to derive optimal policies for inventory management.

class Store(gym.Env):
    def __init__(self):
        self.v_demand = [100, 200, 300, 400]
        self.p_demand = [0.3, 0.4, 0.2, 0.1]
        self.capacity = self.v_demand[-1]
        self.days = ['Mon', 'Tue', 'Wed',
                     'Thu', 'Fri', 'Sat', 'Sun']
        self.unit_cost = 4
        self.net_revenue = 7
        self.action_space = [0, 100, 200, 300, 400]
        self.state_space = [(d, i) for d in self.days for i in [0, 100, 200, 300, 400]]

    def get_next_state_reward(self, state, action, demand):
        day, inventory = state
        result = {}
        result['next_day'] = self.days[self.days.index(day) \
                                       + 1]
        result['starting_inventory'] = min(self.capacity,
                                           inventory
                                           + action)
        result['cost'] = self.unit_cost * action
        result['sales'] = min(result['starting_inventory'],
                              demand)
        result['revenue'] = self.net_revenue * result['sales']
        result['next_inventory'] \
            = result['starting_inventory'] - result['sales']
        result['reward'] = result['revenue'] - result['cost']
        return result

    def get_transition_prob(self, state, action):
        next_s_r_prob = {}
        for ix, demand in enumerate(self.v_demand):
            result = self.get_next_state_reward(state,
                                                action,
                                                demand)
            next_s = (result['next_day'],
                      result['next_inventory'])
            reward = result['reward']
            prob = self.p_demand[ix]
            if (next_s, reward) not in next_s_r_prob:
                next_s_r_prob[next_s, reward] = prob
            else:
                next_s_r_prob[next_s, reward] += prob
        return next_s_r_prob

    def reset(self):
        self.day = "Mon"
        self.inventory = 0
        state = (self.day, self.inventory)
        return state

    def is_terminal(self, state):
        day, inventory = state
        if day == "Sun":
            return True
        else:
            return False

    def step(self, action):
        demand = np.random.choice(self.v_demand,
                                  p=self.p_demand)
        result = self.get_next_state_reward((self.day,
                                             self.inventory),
                                       action,
                                       demand)
        self.day = result['next_day']
        self.inventory = result['next_inventory']
        state = (self.day, self.inventory)
        reward = result['reward']
        done = self.is_terminal(state)
        info = {'demand': demand, 'sales': result['sales']}
        return state, reward, done, info

VIII— On-Policy Monte Carlo control approach

First we need a function to implement ε-Greedy policy, we will use this to explore all state-action possibilities, this policy chooses a random action with ε probability, where 0< ε <<1 , and chooses the action maximizing Q(s,a) with 1-ε probability:

def get_eps_greedy(actions, eps, a_best):
    prob_a = {}
    n_a = len(actions)
    for a in actions:
        if a == a_best:
            prob_a[a] = 1 - eps + eps/n_a
        else:
            prob_a[a] = eps/n_a
    return prob_a

This function generates a probability distribution over the available actions based on an epsilon-greedy policy. It gives higher probabilities to the action believed to be the best, while also allowing for exploration with a certain probability determined by ε

1. Input Parameters:
— `actions`: This is a list of all possible actions that can be taken in the environment.
— `eps`: This is the exploration parameter, representing the probability of taking a random action. It’s a value between 0 and 1.
— `a_best`: This is the action that is currently believed to be the best in the given state.

2. Initialization:
— `prob_a`: This is an empty dictionary that will be used to store the probabilities of taking each action.

3. Calculating Exploration and Exploitation Probabilities:
— The loop iterates over all possible actions (`a`) in `actions`.
— For each action, it checks if it is the same as the action believed to be the best (`a_best`).
— If `a` is the best action (`a == a_best`), it calculates the probability as follows:
— `1 — eps + eps/n_a`: This means that with probability `1 — eps`, the best action is chosen. With probability `eps/n_a`, a random action is chosen from the remaining actions (exploration).
— If `a` is not the best action, it calculates the probability as `eps/n_a`, which means that with probability `eps`, a random action is chosen (exploration).

4. Storing Probabilities:
— The calculated probability for each action is stored in the `prob_a` dictionary.

5. Returning the Probability Dictionary:
— Once all actions have been processed, the function returns the `prob_a` dictionary, which contains the probabilities for each action.

We also need a function that simulates the store with a given policy and returns the trajectory:

def get_trajectory(env, policy):
    trajectory = []
    state = env.reset()
    done = False
    sar = [state]
    while not done:
        action = choose_action(state, policy)
        state, reward, done, info = env.step(action)
        sar.append(action)
        sar.append(reward)
        trajectory.append(sar)
        sar = [state]
    return trajectory

This function simulates the environment by following a given policy, collecting the resulting state, action, and reward at each time step, and storing them in a trajectory. The trajectory represents the agent’s interactions with the environment during a single episode.

1. Input Parameters:
— `env`: This represents the environment in which the trajectory is generated. It is assumed to be an instance of the environment class.
— `policy`: This is the policy used to select actions in the environment. It’s expected to be a dictionary that maps states to action probabilities.

2. Initialization:
— `trajectory`: This is an empty list that will store the trajectory.
— `state`: It is initialized by resetting the environment (`env.reset()`), which sets the environment to its initial state.
— `done`: This is a flag indicating whether the episode is finished. It’s initially set to `False`.
— `sar`: This is a list that will hold the state-action-reward tuples.

3. Generating the Trajectory:
— The `while` loop runs until the episode is finished (`done == True`).
— Inside the loop, it does the following:
— It calls `choose_action(state, policy)` to select an action based on the given policy. The chosen action is stored in the variable `action`.
— It then performs a step in the environment by applying the chosen action. This returns the next state (`state`), the reward received (`reward`), whether the episode is finished (`done`), and additional information (`info`).
— The state, action, and reward are appended to the `sar` list.
— The `sar` list is then appended to the `trajectory` list, representing a state-action-reward tuple for that time step.
— Finally, the `sar` list is reset with the current state for the next iteration.

4. Returning the Trajectory:
— Once the episode is finished, the function returns the generated trajectory, which is a list of state-action-reward tuples.

Then we need a function that generates a random policy:

def get_random_policy(states, actions):
    policy = {}
    n_a = len(actions)
    for s in states:
        policy[s] = {a: 1/n_a for a in actions}
    return policy

Now we can build a first-visit on-policy MC control function :

def on_policy_first_visit_mc(env, n_iter, eps, gamma):
    np.random.seed(0)
    states =  env.state_space
    actions = env.action_space
    policy =  get_random_policy(states, actions)
    Q = {s: {a: 0 for a in actions} for s in states}
    Q_n = {s: {a: 0 for a in actions} for s in states}
    for i in range(n_iter):
        if i % 10000 == 0:
            print("Iteration:", i)
        trajectory = get_trajectory(env, policy)
        G = 0
        T = len(trajectory) - 1
        for t, sar in enumerate(reversed(trajectory)):
            s, a, r = sar
            G = r + gamma * G
            first_visit = True
            for j in range(T - t):
                s_j = trajectory[j][0]
                a_j = trajectory[j][1]
                if (s, a) == (s_j, a_j):
                    first_visit = False
            if first_visit:
                Q[s][a] = Q_n[s][a] * Q[s][a] + G
                Q_n[s][a] += 1
                Q[s][a] /= Q_n[s][a]
                a_best = max(Q[s].items(),
                             key=operator.itemgetter(1))[0]
                policy[s] = get_eps_greedy(actions,
                                           eps,
                                           a_best)
    return policy, Q, Q_n

The function `on_policy_first_visit_mc` implements the On-Policy First-Visit Monte Carlo method . It takes four parameters: `env` represents the environment, `n_iter` is the number of iterations for the Monte Carlo simulation, `eps` is the epsilon value for epsilon-greedy policy, and `gamma` is the discount factor for future rewards.

Within the function, it first initializes the random seed for reproducibility. It then retrieves the states and actions from the environment.

The policy is initialized using a random policy obtained from the `get_random_policy` function. Two dictionaries, `Q` and `Q_n`, are created to store action-values and their respective visit counts for each state-action pair.

The function then enters a loop that iterates `n_iter` times. In each iteration, it generates a trajectory of states, actions, and rewards by following the current policy using the `get_trajectory` function.

For each time step in the trajectory, it calculates the return `G` (the cumulative reward from that time step onwards). It then checks if the current state-action pair is the first visit to that state in the trajectory. If it is, it updates the action-value function `Q` by averaging the returns for that state-action pair. The visit count `Q_n` is also updated.

After processing the trajectory, the function updates the policy based on the action-values and epsilon-greedy exploration. This continues for the specified number of iterations.

The function finally returns three values: the updated policy, the action-value function `Q`, and the visit count `Q_n`. This provides the optimal policy and the learned action-values, which can be used for decision-making in the environment.

IX — Off-Policy Monte Carlo control approach

The beheviour policy b to be used to estimate the action values under the target policy is the ε-Greedy policy defined earlier. We also use the weighted importance sampling ratio:

def off_policy_mc(env, n_iter, eps, gamma):
    np.random.seed(0)
    states =  env.state_space
    actions = env.action_space
    Q = {s: {a: 0 for a in actions} for s in states}
    C = {s: {a: 0 for a in actions} for s in states}
    target_policy = {}
    behavior_policy = get_random_policy(states,
                                        actions)
    for i in range(n_iter):
        if i % 10000 == 0:
            print("Iteration:", i)
        trajectory = get_trajectory(env,
                                    behavior_policy)
        G = 0
        W = 1
        T = len(trajectory) - 1
        for t, sar in enumerate(reversed(trajectory)):
            s, a, r = sar
            G = r + gamma * G
            C[s][a] += W
            Q[s][a] += (W/C[s][a]) * (G - Q[s][a])
            a_best = max(Q[s].items(),
                         key=operator.itemgetter(1))[0]
            target_policy[s] = a_best
            behavior_policy[s] = get_eps_greedy(actions,
                                                eps,
                                                a_best)
            if a != target_policy[s]:
                break
            W = W / behavior_policy[s][a]
    target_policy = {s: target_policy[s] for s in states
                                   if s in target_policy}
    return target_policy, Q

The `off_policy_mc` function is designed to implement the Off-Policy Monte Carlo method to solve a reinforcement learning problem within a specified environment. It takes four parameters: `env` for the environment, `n_iter` representing the number of iterations for the Monte Carlo simulation, `eps` for epsilon-greedy exploration, and `gamma` as the discount factor for future rewards.

The function begins by initializing the random seed for reproducibility and fetching the states and actions from the environment. Two dictionaries, `Q` and `C`, are created to store action-values and their respective cumulative weights for each state-action pair. `target_policy` and `behavior_policy` dictionaries are also initialized.

Next, the function enters a loop that iterates `n_iter` times. In each iteration, it generates a trajectory of states, actions, and rewards by following the current behavior policy using the `get_trajectory` function.

For each time step in the trajectory (processed in reverse), it calculates the return `G` (the cumulative reward from that time step onwards) and updates the cumulative weight `C` for the visited state-action pair.

The action-value function `Q` is updated using the weighted importance sampling. It combines the previous value with the new return, weighted by the importance sampling ratio.

The target policy is updated based on the action with the highest value in `Q` for each state, while the behavior policy is updated to balance exploration and exploitation using epsilon-greedy exploration.

If the action taken in the trajectory is not the same as the action recommended by the target policy, the loop breaks, as further exploration in this trajectory is irrelevant.

The function returns the learned target policy and the action-values `Q`, which can be utilized for decision-making in the environment.

X- Results:

On-Policy First-Visit Monte Carlo:

The output of the On-Policy First-Visit Monte Carlo method provides action probabilities for each state. It represents the probability of selecting each possible action given the state. In this case, the method has learned a stochastic policy. For example, in the state (‘Mon’, 0), it suggests that the best course of action is to replenish 400 units with a probability of 96%, while other actions have a lower probability.

Off-Policy Monte Carlo:

The output of the Off-Policy Monte Carlo method directly provides the recommended action for each state. Instead of providing probabilities, it directly suggests the most advantageous action. In this method, the policy learned is deterministic. For example, in the state (‘Mon’, 0), the recommended action is 400.

Comparison:

1. Exploration vs Exploitation:
— On-Policy: This method maintains some level of exploration by considering a range of actions with associated probabilities. It might explore suboptimal actions to learn more about their value.
— Off-Policy: This method directly exploits the learned knowledge to recommend the action with the highest estimated value. It’s more focused on exploitation.

2. Deterministic vs Stochastic Policy:
— On-Policy: Learns a stochastic policy, meaning it recommends actions with associated probabilities. This can be useful in situations where randomness or variability in actions is desired.
— Off-Policy: Learns a deterministic policy, directly suggesting the best action for each state. This can be beneficial when a clear and precise course of action is needed.

3. Potential for Exploration:
— On-Policy: Can continue exploring different actions even as learning progresses, ensuring a balance between exploration and exploitation.
— Off-Policy: Focuses primarily on exploiting the known best actions, potentially missing out on alternative strategies.

Conclusion

Overall, the choice between the methods depends on the specific requirements of the problem. If there’s a need for exploration and a preference for probabilistic decision-making, the On-Policy method might be more suitable. On the other hand, if efficiency and direct exploitation of learned knowledge are crucial, the Off-Policy method could be preferred.

Bibliography and References

Bilgin, Enes. “Mastering Reinforcement Learning with Python.” Packt Publishing, 2020.
Sutton, Richard S., and Barto, Andrew G. “Reinforcement Learning: An Introduction.” The MIT Press, 2020