Hill Climbing Algorithms

In this post, we are going to solve CartPole using simple policy-based methods: the hill climbing algorithm and its variants. In a previous post, we used a value-based method, DQN, to solve one of the gym environments. In value-based methods, we first obtain the value function, i.e. the state value or action value (Q), and then recover the optimal policy from it. In policy-based methods, we recover the optimal policy directly, without computing any intermediate values. Recall that a policy is a mapping from state to action, i.e. what action the agent should take in a given state. Policy-based methods take the state as input and return the probability of taking each action.
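As a small, purely illustrative sketch, such a policy can be as simple as a weight matrix whose linear scores are pushed through a softmax; the function name and shapes below are assumptions for illustration, not the exact implementation used later in this post.

import numpy as np

def action_probabilities(state, w):
    """Sketch: map a continuous state to action probabilities using a
    softmax over linear scores (w is the policy's weight matrix)."""
    x = np.dot(state, w)
    return np.exp(x) / np.sum(np.exp(x))

# CartPole: 4 state variables, 2 actions (left/right); weights here are random
probs = action_probabilities(np.array([0.1, 0.0, -0.2, 0.05]), np.random.rand(4, 2))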

CartPole

The following description is taken from OpenAI Gym:

A pole is attached by an un-actuated joint to a cart, which moves along a frictionless track. The system is controlled by applying a force of +1 or -1 to the cart. The pendulum starts upright, and the goal is to prevent it from falling over. A reward of +1 is provided for every timestep that the pole remains upright. The episode ends when the pole is more than 15 degrees from vertical, or the cart moves more than 2.4 units from the center.

The environment is considered solved when the agent scores an average of more than 195.0 over 100 consecutive episodes. The state space is continuous and the action space is discrete (push left or push right).
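For reference, the environment can be inspected with a few lines of gym code (a sketch, assuming the classic gym API where reset() returns the state and step() returns a 4-tuple):

import gym

env = gym.make('CartPole-v0')
print(env.observation_space)  # Box(4,): cart position, cart velocity, pole angle, pole angular velocity
print(env.action_space)       # Discrete(2): push the cart to the left or to the right

state = env.reset()
next_state, reward, done, info = env.step(env.action_space.sample())  # take a random action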

[Figure: the solved CartPole environment]

Approach

Our policy will use a neural network to map the continuous state into action probabilities.

We are going to use the following steps to solve the environment (a code sketch of this loop follows the list):

  1. initialise the policy $\pi$ with random weights $\theta$, and set $\theta_{best} = \theta$
  2. use the policy $\pi_{\theta}$ to collect rewards $r_1, r_2, \dots, r_T$
  3. calculate the discounted return $G_{current} = \sum_{i=t}^{T} \gamma^{i-t} r_i$, where the discount factor $\gamma \in [0, 1]$
  4. compare and update $G_{best}$, $\theta_{best}$ if a better return (and the weights that produced it) is found
  5. update the policy weights using the variant's update rule
  6. repeat steps 2-5
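A sketch of this loop is shown below. The helpers run_episode and calculate_returns are the same names used in the steepest ascent snippet later in the post; their bodies here are my guesses at what they could look like (classic gym API, a policy exposing act(state) and step(return)), not the exact code from the repository.

from collections import deque
import numpy as np

def run_episode(policy, env, max_steps):
    """Collect the rewards of one episode played with the given policy."""
    rewards = []
    state = env.reset()
    for _ in range(max_steps):
        state, reward, done, _ = env.step(policy.act(state))
        rewards.append(reward)
        if done:
            break
    return rewards

def calculate_returns(gamma, rewards):
    """Discounted return G = sum_i gamma**i * r_(i+1)."""
    return sum(gamma ** i * r for i, r in enumerate(rewards))

def train(policy, env, n_episodes=1000, max_steps=1000, gamma=1.0, target=195.0):
    """Outer loop: collect an episode, compute its return, and let the policy
    compare it with its best return and update the weights (steps 2-5)."""
    scores = deque(maxlen=100)
    for episode in range(1, n_episodes + 1):
        rewards = run_episode(policy, env, max_steps)
        policy.step(calculate_returns(gamma, rewards))
        scores.append(sum(rewards))
        if len(scores) == 100 and np.mean(scores) >= target:
            return episode  # number of episodes taken to solve the environment
    return n_episodes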

The key difference in all the policies below is how we update the policy weights.
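All the step() methods below override a common base policy that holds the current weights, the best weights and the best return (they all call super().step(current_return) and read self.w, self.best_weights, self.best_return and self.noise; the variants add their own attributes such as min_noise, max_noise or neighbors). The base class itself is not shown in this post; the following is only a minimal sketch of what it could hold.

import numpy as np

class BasePolicy:
    """Sketch of the shared interface assumed by the step() methods below."""

    def __init__(self, state_size=4, action_size=2, noise=1e-2):
        self.w = 1e-4 * np.random.rand(state_size, action_size)  # current weights
        self.best_weights = self.w                                # best weights found so far
        self.best_return = -np.inf                                # best return found so far
        self.noise = noise                                        # search radius for perturbations

    def act(self, state):
        # greedy action from a softmax over linear scores
        x = np.dot(state, self.w)
        probs = np.exp(x) / np.sum(np.exp(x))
        return np.argmax(probs)

    def step(self, current_return):
        # the base implementation only does bookkeeping; each variant below
        # adds its own weight-update rule on top of it
        pass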

Random policy

We start with a policy that updates its weights randomly and use it as a baseline against which to compare the performance of the hill climbing methods. The code to update the policy weights looks like this:

def step(self, current_return):
    """
    Choose random weights and compare the current return with the best return;
    if it is better than the best return then keep the current weights.

    Parameters
    ----------
    current_return (int): Return of current rollout
    """
    super().step(current_return)
    if current_return >= self.best_return:
        self.best_return = current_return
        self.best_weights = self.w
    # update weights
    self.w = np.random.rand(*self.best_weights.shape)

We run the policy 1000 times and plot the number of episodes taken to reach the target return of 200.0 over 100 consecutive episodes.

The plot shows that the random policy solved the environment within 100 episodes 34% of the time and took 900-1000 episodes the rest of the time (66%). It may well be that the policy could not solve the environment at all in the runs that fall into the 900-1000 bin, so we could treat those runs as unsolved. Note that the red line shows the mean number of episodes and is there solely for illustration.
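The histograms discussed in this post can be reproduced with a short matplotlib snippet along these lines (a sketch; episodes_to_solve is assumed to be the list of 1000 results collected for a policy):

import numpy as np
import matplotlib.pyplot as plt

def plot_runs(episodes_to_solve, title):
    """Histogram of episodes-to-solve across runs, with the mean marked in red."""
    plt.hist(episodes_to_solve, bins=np.arange(0, 1100, 100), edgecolor='black')
    plt.axvline(np.mean(episodes_to_solve), color='red', linestyle='--', label='mean')
    plt.title(title)
    plt.xlabel('episodes taken to solve')
    plt.ylabel('number of runs')
    plt.legend()
    plt.show()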

Hill climbing

Hill climbing is a simple, gradient-free way of ascending in weight space towards an optimum. We generate a candidate policy by perturbing the current best weights within a small radius and use it to collect an episode. We then compare the return of the candidate policy with the current best return.

def step(self, current_return):
    """
    Perturb policy weights by predefined radius (noise) and
    update best return and weights if found

    Parameters
    ----------
    current_return (int): Return of current rollout
    """
    super().step(current_return)
    if current_return >= self.best_return:
        self.best_return = current_return
        self.best_weights = self.w
    # update weights
    self.w = self.best_weights + self.noise * np.random.rand(*self.best_weights.shape)

The plot shows that vanilla hill climbing performs better than random and is able to solve the environment 72% of the time. In general, the more runs that fall towards the left side of the plot, the better the policy.

While it is possible to compare the policies just by looking at the plots, I admittedly find it difficult. I suggest not comparing the plots for now and instead using them to get an idea of how the performance of each individual policy is distributed across its 1000 runs.

We will perform a direct comparison among policies towards the end of this article.

Steepest ascent

Steepest ascent uses the same approach as vanilla hill climbing, but instead of generating a single candidate policy we generate multiple neighboring policies and compare their returns. By doing this, we have a better chance of finding better weights than vanilla hill climbing because we cover more of the search radius.

def step(self, current_return):
    """
    Perturb policy weights by generating neighboring policies and
    comparing the return from each policy with the current best.
    If the return from any of the neighboring policies is greater than
    or equal to the current best then we use that policy as our current.

    Parameters
    ----------
    current_return (int): Return of current rollout
    """
    super().step(current_return)
    # Check the return from all neighbors
    candidate_returns = [current_return]
    candidate_weights = [self.w]
    for _ in range(self.neighbors):
        policy = deepcopy(self)
        policy.w = self.best_weights + self.noise * np.random.rand(*self.best_weights.shape)
        rewards = run_episode(policy, self.env, self.max_steps)
        policy_return = calculate_returns(self.gamma, rewards)
        candidate_returns.append(policy_return)
        candidate_weights.append(policy.w)
    # Find the max return from candidate returns and
    # compare it with our best return
    best_idx = np.argmax(np.array(candidate_returns))
    if candidate_returns[best_idx] >= self.best_return:
        self.best_return = candidate_returns[best_idx]
        self.best_weights = candidate_weights[best_idx]
        self.w = candidate_weights[best_idx]

Note that steepest ascent was able to solve the environment in all runs, as there is no bar in the 900-1000 bin.

Simulated annealing

Simulated annealing uses a predefined schedule in which the search radius is annealed (reduced) whenever we make progress, i.e. get a better return than the current best.

def step(self, current_return):
    """Anneal the noise per schedule"""
    super().step(current_return)
    if current_return >= self.best_return:
        self.best_return = current_return
        self.best_weights = self.w
        # schedule to anneal the noise
        self.noise = max(self.min_noise, self.noise / 2)
    # update weights
    self.w = self.best_weights + self.noise * np.random.rand(*self.best_weights.shape)

Adaptive noise scaling

Adaptive noise scaling improves on simulated annealing by contracting (reducing) the search radius when we find a better policy and expanding it when we don't.

def step(self, current_return):
    """
    Update weights based on the following rule:
    contract the search radius when the current return is greater than
    or equal to the best return; expand the search radius when the
    current return is less than the best return.

    Parameters
    ----------
    current_return (int): Return of current rollout
    """
    super().step(current_return)
    if current_return >= self.best_return:
        self.best_return = current_return
        self.best_weights = self.w
        # contract radius, if we are doing well
        self.noise = max(self.min_noise, self.noise / 2)
    else:
        # expand radius, if we are not doing well
        self.noise = min(self.max_noise, self.noise * 2)
    # update weights
    self.w = self.best_weights + self.noise * np.random.rand(*self.best_weights.shape)

Performance

We measure the performance by running each policy 1000 times and recording the number of episodes taken to solve the environment. The environment is considered solved when the policy is able to get a return of 200.0 for 100 consecutive episodes. The following table shows the statistics from all five policies.

Policy                    Mean   Std   Min   25%   50%   75%   Max
Random                     660   473     1     1  1000  1000  1000
Vanilla Hill Climbing      367   399     1     1   217   912  1000
Steepest Ascent             73    53     1     1   103   103   418
Simulated Annealing        366   380     1     1   217   912  1000
Adaptive Noise Scaling     346   422     1     1   118  1000  1000
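These summary statistics can be computed directly from the per-run results, for instance with pandas (a sketch; the results_* lists are hypothetical names for the 1000 episodes-to-solve values recorded per policy):

import pandas as pd

# hypothetical: each list holds the episodes-to-solve for 1000 runs of one policy
results = {
    'Random': results_random,
    'Vanilla Hill Climbing': results_hill_climbing,
    'Steepest Ascent': results_steepest_ascent,
    'Simulated Annealing': results_simulated_annealing,
    'Adaptive Noise Scaling': results_adaptive_noise,
}
print(pd.DataFrame(results).describe().round().T)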

All the policies were able to solve the environment in a single episode at least once. However, at the 50th and 75th percentiles it becomes clear that steepest ascent performed the best. The performance of vanilla hill climbing and simulated annealing was similar, and adaptive noise scaling was the second best policy at the 50th percentile. The random policy's luck ran out quickly and it was not able to solve the environment half of the time.

Conclusion

In this article, we explored policy-based methods to solve the CartPole-v0 environment. We saw that steepest ascent performed the best at solving the environment over 1000 runs. However, hill climbing and its variants are sensitive to weight initialisation and may get stuck with bad weights, in which case they will not be able to solve the environment.

Over the next few articles, we will explore other policy-based methods such as REINFORCE, Proximal Policy Optimisation (PPO) and Deep Deterministic Policy Gradients (DDPG).

The code used in this article is available in my GitHub repository.

