Reinforcement Learning in R: An Introduction to Dynamic Programming

October 30, 2012

(This article was originally published at statalgo, and syndicated at StatsBlogs.)

Reinforcement Learning is an approach to learning that attempts to maximize a cumulative reward based on a set of actions and states. The techniques are very popular within operations research and control theory. It does not fall under the traditional paradigms of supervised or unsupervised learning because correct inputs/outputs are never provided. The algorithm is instead presented with some notion of doing well or poorly at each step.

There are several different methods for optimizing a reinforcement learning problem; I will be focusing on Dynamic Programming. Dynamic programming (or DP) is a powerful optimization technique that consists of breaking a problem down into smaller sub-problems, where the sub-problems are not independent. This is useful both in mathematics (especially fields like economics and operations research) and computer science (where the algorithm's complexity can be substantially improved through methods such as memoization). In this context, "programming" actually means "planning", as dynamic programming consists of forming a table of solutions.

I give a brief overview of dynamic programming using R. I start by focusing on two well-known algorithm examples (fibonacci sequence and the knapsack problem), and in the next post I will move on to consider an example from economics, in particular, for a discrete time, discrete state Markov decision process (or reinforcement learning). The goal is to both describe the algorithm design and give some sense for how it can be applied to actual problems. I'm interested in demonstrating this because it will be required for reproducing results from some of the primary trade optimization methods (such as Bertsimas and Lo) as part of my series on Market Microstructure.

Computational Design: Dependent Divide and Conquer

A quick search of CRAN (including the Optimization view) will reveal that there are no dynamic programming packages available. This is because dynamic programming is a design technique rather than a specific algorithm; it is specifically tailored to each given problem.

From a design perspective, dynamic programming has a lot in common with divide-and-conquer, but is specifically designed to address problems where the subproblems are not independent. A nice summary of this distinction can be found in "Introduction to Algorithms":

Dynamic programming, like the divide-and-conquer method, solves problems by combining the solutions to subproblems...divide-and-conquer algorithms partition the problem into independent subproblems, solve the subproblems recursively, and then combine their solutions to solve the original problem. In contrast, dynamic programming is applicable when the subproblems are not independent, that is, when subproblems share subsubproblems. In this context, a divide-and-conquer algorithm does more work than necessary, repeatedly solving the common subsubproblems. A dynamic-programming algorithm solves every subsubproblem just once and then saves its answer in a table, thereby avoiding the work of recomputing the answer every time the subsubproblem is encountered.

There is a video lecture from the MIT course using this canonical algorithm textbook, including one of the authors: Charles Leiserson and Erik Demaine:

Fibonacci Series

The Fibonacci Sequence is one of the most popular introductory examples used in teaching programming (also one of the most popular pages on Rosetta Code). This entails a function where the output is simply the sum of the preceding two values:

  F_n = F_{n-1} + F_{n-2}

Which results in a sequence that looks like:

  0, 1, 1, 2, 3, 5, 8, 13, 21, 34, ...

A classic approach to solve this problem is to use recursion:

fib1 <- function(n) {
  print(paste("fib1(", n, ")", sep=""))
  if(n == 0) return(0)
  else if(n == 1) return(1)
  else {
    return(fib1(n-1) + fib1(n-2))

I have included a print statement at the top of the function so that it's easy to see how this function is called. Unfortunately, the computational complexity of this algorithm is exponential. As an example, for fib(5), we end up calling fib(2) three times.

The dynamic programming solution calls each value once and stores the values (memoization, also see Hadley Wickham's memoise package for R). This might seem overly simplistic when you consider the recursive algorithm, but in this case we need to loop and maintain state, which is a very imperative approach.

fib2 <- function(n) {
  fibs <- c(0, 1)
  if(n > 1) {
    for (i in 3:n) {
      fibs[i] = fibs[i - 1] + fibs[i - 2]

This second algorithm is linear time complexity, and its performance is noticeably better even on small values for n.

> library(rbenchmark)
> benchmark(fib1(20))
      test replications elapsed relative user.self sys.self 
  fib1(20)          100   15.15        1     13.64     0.02 
> benchmark(fib2(20))
      test replications elapsed relative user.self sys.self
  fib2(20)          100    0.05        1      0.03        0

This demonstrates both the performance improvement that is possible, and also the dependence structure that is common in dynamic programming applications: the state of each subsequent value depends on prior states.

This also demonstrates another important characteristic of dynamic programming: the trade-off between space and time.

0/1 Knapsack Problem

How can we use this for optimization? The knapsack problem is a very well known example of dynamic programming that addresses this question. This is a resource allocation problem. We're given a set of items with a weight and a value. And we have a limited capacity to carry these items, so we want to maximize the value given the constraint. This is sometimes described as a story: a thief enters a house and wants to steal as many valuable items as possible but can only carry so much weight.

Presented with this problem, we can think of several different approaches. We might start with a greedy strategy: rank all the items by value, and add them to the knapsack until we can't fit any more. This won't find the optimal solution, because there will likely be a combination of items each of which is less valuable, but whose combined value will be greater. Alternatively, we might simply iterate through every single combination of items, but while this finds the optimal solution, it also grows with exponential complexity as we have more items in the set. So we can use Dynamic Programming:

Define a set S of n items each with a value and weight v_i, w_i respectively. We want to solve for:

  maximize \sum^n_{i=0} v_i

Subject to:

  \displaystyle\sum^n_{i=0} w_i \le W

The dynamic programming solution follows these steps:

  1. Characterize the structure of an optimal solution.
  2. Recursively define the optimal solution to subproblems.
  3. Step backwards and find the global optimal value.

knapsack <- function(w, v, W) {
  A <- matrix(rep(0, (W + 1) * (length(w) + 1)), ncol=W+1)
  for (j in 1:length(w)) {
    for (Y in 1:W) {
      if (w[j] > Y)
        A[j+1, Y+1] = A[j, Y+1]
        A[j+1, Y+1] = max( A[j, Y+1], v[j] + A[j, Y- w[j]+1])


optimal.weights <- function(w, W, A) {

  amount = rep(0, length(w))
  a = A[nrow(A), ncol(A)]
  j = length(w)
  Y = W
  while(a > 0) {
    while(A[j+1,Y+1] == a) {
      j = j - 1
    j = j + 1
    amount[j] = 1
    Y = Y - w[j]
    j = j - 1;
    a = A[j+1,Y+1];


We can test this on a simple example of 7 items with different weights an values:

w = c(1, 1, 1, 1, 2, 2, 3)
v = c(1, 1, 2, 3, 1, 3, 5)
W = 7

A <- knapsack(w, v, W)
best.value <- A[nrow(A), ncol(A)]
weights <- optimal.weights(w, W, A)

We can look at the matrix A to see how dynamic program stores values in a table rather than recomputing the values repeatedly:

> A
     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
[1,]    0    0    0    0    0    0    0    0
[2,]    0    1    1    1    1    1    1    1
[3,]    0    1    2    2    2    2    2    2
[4,]    0    2    3    4    4    4    4    4
[5,]    0    3    5    6    7    7    7    7
[6,]    0    3    5    6    7    7    8    8
[7,]    0    3    5    6    8    9   10   10
[8,]    0    3    5    6    8   10   11   13

Where the best total value when satisfying the weight constraint in the example.

There are many examples of solutions to the knapsack problem on Rosetta Code. I should also point out that "Mike's Coderama" has a post covering this in Python.


Dynamic programming is a powerful technique both for finding optimal solutions to problems where steps have dependence, and at improving algorithm performance by trading off space for time. I will finish off this topic in the next post by looking in more detail at the topic of Bellman's Optimality Principle and at an example from economics.


I rely primarily on three sources for this analysis, all of which I highly recommend:

In addition, I would point out:

Please comment on the article here: statalgo

Tags: , ,