Markov decision processes
Markov decision process is a mathematical tool for modelling stochastic control problems. Another way to look at them is that they can be used for
sequential decision making. Sidenote: There is a great lecture series on YouTube by David Silver which explains MDPs and how they are used in reinforcement learning.
A key assumption is that all the states follow the Markov property which basically states that the future is independent of the past given the present.
Defining MDPs
An MDP is a 4-tuple M :
M = (S ,A ,Pa ,Ra)
- S is the state space
- A is the action space
- Pa(s* ,s’)* is the probability of going from state s to s’
- Ra(s* ,s’)* is the reward for going form state s to s’
Sometimes a discount factor is added to the problem definition to model the variance associated with time horizons. Basically γ models how much we care about short-term rewards (γ = 0) and long term rewards (γ =1)
MDPs for inventory control
Adapated from 6.246 Reinforcement Learning: Foundations and Methods — Lec3 — 1
A common question in supply chain management is when and by how much to replenish the shelves in a store. Replenishment can be thought of as an stochatic control problem.
Let’s say that you are a warehouse manager looking after just one SKU (stock keeping unit).You review the inventory at the end of each month and decide how much inventory to order.
There is a cost associated with maintaining some inventory in the warehouse h(s), with ordering inventory C(a). Selling the items earns you an income of f(q). If the demand is greater than the available inventory then the customer leaves your store to never come back i.e there are no backorders. Only constraint that we are considering now is maximum capacity of the warehouse M.
State space - s ∈ S = {0, 1, …., M} which is the number of goods we can store in our warehouse; constrained by maximum capacity of the warehouse
Action space - For a state s, a ∈ A(s) = {0, 1, …, M − s}. Action space depends on the current state as we cannot order more than the capacity of the warehouse. Basically the only action the warehouse manager can take is choosing how much to order
Reward - rt = - C(at) - h(st + at) + f([st + at - st - 1]+)
where, C(at) is the ordering cost for a number of units
h(st + at) is the holding cost of current inventory + ordered inventory
f([st + at - st - 1]+) is the income from sales
The goal is to find a policy (π) of re-ordering that achieves a lot of reward over the long run.
From Sutton and Barto:
A value function is a function of states (or state action pairs) that estimate how good it is for the agent to be in a given state
Applying this to our inventory problem, it gives us a dollar value on the amount of inventory, s, that we are holding. Too little inventory means penalty from loosing customers. Too much inventory means that we incur a high holding cost. It also means that we have tied up our working capital. (You can compare the profit from the sale of the goods to the interest on the capital used to manufacture/procure the goods).
Solving MDPs
Dynamic programming is used to solve MDPs.