# Stochastic Dynamic Programming

## Markov Decision Process

Originally introduced by Richard E. Bellman in (Bellman 1957), **stochastic dynamic programming** is a technique for modeling and solving problems of decision making under uncertainty. Closely related to stochastic programming and dynamic programming, stochastic dynamic programming represents the problem under scrutiny in the form of a Bellman equation. The aim is to compute a policy prescribing how to act optimally in the face of uncertainty.

**Stochastic dynamic programming** deals with problems in which the current period reward and/or the next period state are random, i.e. with multi-stage stochastic systems. The decision maker's goal is to maximize expected (discounted) reward over a given planning horizon.

The characteristic features of a dynamic-programming formulation are as follows:

**Stages**

The essential feature of the dynamic-programming approach is the structuring of optimization problems into multiple *stages*, which are solved sequentially one stage at a time. Although each one-stage problem is solved as an ordinary optimization problem, its solution helps to define the characteristics of the next one-stage problem in the sequence.

**State**

Associated with each stage of the optimization problem are the *states* of the process. The states reflect the information required to fully assess the consequences that the current decision has upon future actions.

**Action**

The decision chosen at any stage describes how the state at the current stage is transformed into the state at the next stage.

**Value Function**

The value function expresses the "value" of a decision problem at a certain point in time in terms of the payoff from some initial choices and the "value" of the remaining decision problem that results from those initial choices.

Given the current state, we choose the action that maximizes the long-term expected reward provided by the environment. Dynamic programming is a class of algorithms that simplifies a complex problem by breaking it into sub-problems and solving those sub-problems recursively.
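This recursive decomposition can be sketched with a toy deterministic problem (the budget-splitting task and all numbers below are invented purely for illustration):

```python
from functools import lru_cache

# Hypothetical problem: split a budget of 5 units across 3 stages;
# spending k units at a stage yields gain[k] (made-up payoffs).
gain = {0: 0, 1: 3, 2: 5, 3: 6}

@lru_cache(maxsize=None)
def V(stage, budget):
    # Value of the remaining decision problem: best immediate payoff
    # plus the value of what is left after this stage's choice.
    if stage == 3:
        return 0
    return max(gain[k] + V(stage + 1, budget - k)
               for k in gain if k <= budget)

print(V(0, 5))  # optimal total gain: 13 (spend 2, 2, 1)
```

The memoization (`lru_cache`) is what turns the naive recursion into dynamic programming: each (stage, budget) sub-problem is solved once and reused.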

**Reward**

A feedback signal from the environment reflecting how well the agent is performing with respect to the goal of the task.

A **Markov decision process** (**MDP**) is a discrete-time stochastic control process. It provides a mathematical framework for modeling decision making in situations where outcomes are partly random and partly under the control of a decision-maker. MDPs are useful for studying optimization problems solved via dynamic programming and reinforcement learning. An MDP can be viewed as a probabilistic dynamic programming problem in which the decision maker faces an infinite horizon.

At each time step, the process is in some state s, and the decision-maker may choose any action a available in state s, seeking to minimize the expected cost incurred or to maximize the expected reward earned over a given horizon. The process responds at the next time step by randomly moving into a new state s′ and giving the decision-maker a corresponding reward R_a(s, s′). For each state and action we specify a probability distribution over the next states, P(s′ | s, a).
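As a concrete sketch, a transition model of this kind can be stored as nested dictionaries (the two weather states, the single action, and all probabilities and rewards below are invented for illustration):

```python
# P[s][a] maps each next state s' to (P(s' | s, a), R_a(s, s')).
# States, actions, and numbers are hypothetical.
P = {
    "sunny": {"walk": {"sunny": (0.8, 2.0), "rainy": (0.2, -1.0)}},
    "rainy": {"walk": {"sunny": (0.3, 2.0), "rainy": (0.7, -1.0)}},
}

def expected_reward(s, a):
    # E[R_a(s, s')] = sum over s' of P(s' | s, a) * R_a(s, s')
    return sum(p * r for p, r in P[s][a].values())

print(expected_reward("sunny", "walk"))  # 0.8*2.0 + 0.2*(-1.0) = 1.4
```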

Under a policy, the decision-maker observes the current state s and executes the action the policy prescribes. The standard MDP model assumes full observability, i.e. the new state resulting from executing an action is known to the system. A policy is evaluated by the expected total reward it obtains; over an infinite horizon the rewards are typically discounted so that this sum remains finite. The goal is to find a good policy for the decision-maker.

The Bellman equation relates the value function to itself via the problem dynamics; solving an MDP therefore centers on computing expected values.
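Iterating the Bellman update until it reaches a fixed point (value iteration) is one standard way to solve a small MDP. A minimal sketch, with an invented two-state MDP and a discount factor to keep the infinite-horizon sum finite:

```python
# Tiny hypothetical MDP: states 0/1, actions "a"/"b".
# T[s][action] = list of (prob, next_state, reward); all numbers invented.
T = {
    0: {"a": [(1.0, 0, 1.0)], "b": [(0.5, 1, 0.0), (0.5, 0, 2.0)]},
    1: {"a": [(1.0, 0, 5.0)], "b": [(1.0, 1, 0.5)]},
}
gamma = 0.9  # discount factor

V = {s: 0.0 for s in T}
for _ in range(200):  # repeat the Bellman update until (near) convergence
    V = {s: max(sum(p * (r + gamma * V[s2]) for p, s2, r in outs)
                for outs in T[s].values())
         for s in T}

# Greedy policy with respect to the converged value function.
policy = {s: max(T[s], key=lambda a: sum(p * (r + gamma * V[s2])
                                         for p, s2, r in T[s][a]))
          for s in T}
print(V, policy)
```

Because the update is a contraction in the discount factor, the iterates converge geometrically to the unique fixed point of the Bellman equation.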

**Example:**

Cyclone Cyril did extensive damage to the forests which are home to many of our endemic species. We are particularly concerned about one region that can be roughly divided into nine sites, as shown in the map below:

At the moment our ecologists have assessed that each of these sites is in a *fragile* state. Each month they can focus efforts on one of the fragile sites to turn it into a *restored* state, permanently saving all the species in that site from extinction.

Based on the species data, our current plan is to prioritize restoring the sites in the following order:

1, 7, 8, 6, 4, 2, 0, 5, 3

Unfortunately, at the end of each month, there is a probability of 0.2 that a fragile site will have degraded into a lost state, which can no longer be restored, losing all the associated species. In practice, once a site is lost it becomes more likely that neighboring sites will also be lost. We think the probability of a site being lost in a given month can be quantified as 0.2 + 0.05 *n*, where *n* is the number of directly adjacent sites that have already been lost (for example, a fragile site with two lost neighbors is lost with probability 0.2 + 0.05 × 2 = 0.3). Taking this into account, what is the maximum expected number of species we can save?

The model can be solved with the following Python program:

```python
# Data:   Species[j], the set of species found in site j
#         adjacent_site[j], the sites directly adjacent to site j
# Stages: m, months elapsed
# State:  state, a 9-tuple of site statuses (-1 = lost, 0 = fragile,
#         1 = restored), plus the current per-site loss probability
# Action: idx, the fragile site to restore this month
# Value function: V(state, m, probability) is the maximum expected number
#         of species saved from this point on, where the probability of a
#         fragile site being lost in a given month is 0.2 + 0.05 n, with
#         n the number of directly adjacent sites already lost.

from itertools import combinations

prob = 0.2
Sites = range(9)
Species = {
    0: {0, 1, 7, 10, 11, 17},
    1: {7, 13, 16, 19},
    2: {2, 12, 15},
    3: {13, 17, 18},
    4: {3, 12, 17},
    5: {4, 14, 15},
    6: {6, 8, 18},
    7: {15, 18},
    8: {5, 9},
}
adjacent_site = {
    0: {1},
    1: {0, 2, 3},
    2: {1, 4},
    3: {1, 4},
    4: {2, 3, 5, 7},
    5: {4, 6},
    6: {5, 8},
    7: {4},
    8: {6},
}


# GenerateOutcomes returns a list of (probability, state) pairs, one for
# each subset of the currently fragile sites that could be lost, where
# LossProb[j] is the loss probability of site j.
def GenerateOutcomes(State, LossProb):
    ans = []
    fragile = [j for j in Sites if State[j] == 0]
    for i in range(len(fragile) + 1):
        for lost in combinations(fragile, i):
            p = 1.0
            slist = list(State)
            for j in fragile:
                if j in lost:
                    p *= LossProb[j]
                    slist[j] = -1
                else:
                    p *= 1 - LossProb[j]
            ans.append((p, tuple(slist)))
    return ans


# Number of species saved by restoring `site` that are not already safe
# in a previously restored site.
def expected_species(state, site):
    saved = set(Species[site])
    for i in Sites:
        if state[i] == 1:
            saved -= Species[i]
    return len(saved)


def restore_site(state, x):
    restored = list(state)
    if restored[x] == 0:
        restored[x] = 1
    return restored


# Loss probability 0.2 + 0.05 n, where n is the number of directly
# adjacent sites that have already been lost.
def probability_quantified(state, site):
    n = sum(1 for s in adjacent_site[site] if state[s] == -1)
    return prob + 0.05 * n


_V = {}


def V(state, m, probability):
    # Month 0: nothing can be lost before the first restoration; the code
    # fixes site 0 (the most species-rich site) as the first restoration.
    if m == 0:
        return len(Species[0]) + V((1, 0, 0, 0, 0, 0, 0, 0, 0), 1, 0.2)
    # No fragile sites left, or all nine months used: nothing left to do.
    if 0 not in state or m == 9:
        return 0
    key = (tuple(state), m, probability)
    if key not in _V:
        # Apply the current loss probability to every fragile site, then
        # pick the best site to restore in each surviving outcome.
        total = 0.0
        for p, outcome in GenerateOutcomes(state, [probability] * 9):
            fragile = [idx for idx in Sites if outcome[idx] == 0]
            if not fragile:
                continue  # every remaining site was lost in this outcome
            total += p * max(
                expected_species(outcome, idx)
                + V(restore_site(outcome, idx), m + 1,
                    probability_quantified(restore_site(outcome, idx), idx))
                for idx in fragile)
        _V[key] = total
    return _V[key]


print(V((0, 0, 0, 0, 0, 0, 0, 0, 0), 0, prob))
```