Unit 1 - Lunar Lander
Unit 1 notes. The Jupyter notebook for this page can be found here
- Introduction
- Trained model vs untrained model
- Libraries and environment used
- Defined functions
- Environment Action space and Observation space
- SB3 models to choose from
- PPO Implementation (MLP Policy)
- A2C Implementation
- DQN Implementation
- Conclusion and Results
- Publishing best model on huggingface
Introduction
In this experiment, a lander agent is trained to safely land on the surface between the two flags.
Trained model vs untrained model
- First, consider the difference between a trained and an untrained agent.
- An untrained agent takes random actions in the environment and crashes, whereas the trained agent works out its position relative to the landing site and uses its jets to land safely.
- This page describes how the agent was trained.
Trained vs Untrained agent
Libraries and environment used
- Gym is an open source Python library for developing and comparing reinforcement learning algorithms. It provides a standard API for communication between learning algorithms and environments, along with a diverse collection of reference environments.
- Stable Baselines3 (SB3) is a set of reliable implementations of reinforcement learning algorithms in PyTorch. The algorithms used in this experiment come from SB3, and various others can be chosen based on the environment.
- The LunarLander-v2 environment is used for this project. The objective of LunarLander is to safely land a spaceship between the two flags (a minimal usage sketch follows this list). The LunarLander environment is:
- Fully Observable - All necessary state information is known at every frame
- Single Agent - No competition
- Deterministic - No stochasticity
- Episodic - The task has a clear terminal state (the lander comes to rest, crashes, or flies out of bounds) rather than continuing indefinitely
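Below is a minimal sketch of creating the environment and running one random episode, written against the classic Gym step/reset API used with LunarLander-v2 (the notebook's exact code may differ):

```python
import gym

# Create the LunarLander-v2 environment and run one random episode.
env = gym.make("LunarLander-v2")
obs = env.reset()          # initial 8-dimensional observation
done = False
total_reward = 0.0
while not done:
    action = env.action_space.sample()            # one of the 4 discrete actions
    obs, reward, done, info = env.step(action)    # classic Gym API
    total_reward += reward
print(f"Episode finished, total reward: {total_reward:.1f}")
env.close()
```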
Defined functions
- init_render() - Creates the environment and renders it for 150 timesteps.
- play_trained_model() - Once a model is trained, plays it in the environment for one episode to view progress.
- record_random_episode() - Records a GIF of the agent playing in the environment with random inputs, for documentation purposes.
- record_trained_model_episode() - Records a GIF of the agent playing in the environment with a trained model, for documentation purposes.
- evaluate_model() - Evaluation helper from SB3; runs the trained policy for n episodes and returns the average reward for the agent in the environment (rough sketches of these helpers follow this list).
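The implementations below are only a rough sketch of what these helpers might look like (the real notebook code may differ); they assume the classic Gym API and SB3's evaluate_policy:

```python
import gym
from stable_baselines3.common.evaluation import evaluate_policy


def init_render(n_steps=150):
    # Create the LunarLander-v2 environment and render it for a few
    # timesteps with random actions (assumed implementation).
    env = gym.make("LunarLander-v2")
    env.reset()
    for _ in range(n_steps):
        env.render()
        _, _, done, _ = env.step(env.action_space.sample())
        if done:
            env.reset()
    return env


def play_trained_model(model, env):
    # Roll out the trained policy for one episode to view progress.
    obs = env.reset()
    done = False
    while not done:
        action, _ = model.predict(obs, deterministic=True)
        obs, reward, done, info = env.step(action)
        env.render()


def evaluate_model(model, env, n_episodes=10):
    # Thin wrapper around SB3's evaluate_policy: runs the trained policy
    # for n_episodes and returns the mean and std of the episode reward.
    return evaluate_policy(model, env, n_eval_episodes=n_episodes)
```

The two recording helpers are similar to play_trained_model(), except that each rendered frame is collected (e.g. via render(mode="rgb_array")) and written out as a GIF.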
Environment Action space and Observation space
- LunarLander-v2 Action Space: Discrete(4)
- 4 discrete actions can be taken - do nothing, fire left engine, fire main engine, fire right engine
- LunarLander-v2 Observation Space: Box([-inf -inf -inf -inf -inf -inf -inf -inf], [inf inf inf inf inf inf inf inf], (8,), float32)
- 8 observations - six continuous (X distance from target site, Y distance from target site, X velocity, Y velocity, angle of ship, angular velocity of ship) and two binary (left leg is grounded, right leg is grounded). The sketch below shows how to inspect these spaces.
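A quick sketch for inspecting both spaces directly from the environment:

```python
import gym

env = gym.make("LunarLander-v2")
print(env.action_space)                  # Discrete(4)
print(env.observation_space)             # Box(..., (8,), float32)
print(env.action_space.sample())         # a random action, e.g. 2
print(env.observation_space.sample())    # a random 8-dimensional observation
env.close()
```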
SB3 models to choose from
As explained above, SB3 is a set of implementations of reinforcement learning algorithms. As the environment has a Discrete action space and a Box observation space, I will be testing these three algorithms: PPO, A2C and DQN (as described in the project description page). MlpPolicy is a policy that implements actor-critic using an MLP (two hidden layers of 64 units).
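A minimal sketch, using SB3 defaults, to confirm the MlpPolicy architecture; printing the policy shows the two 64-unit hidden layers:

```python
import gym
from stable_baselines3 import PPO

env = gym.make("LunarLander-v2")
model = PPO("MlpPolicy", env)
# The default MlpPolicy builds an actor-critic MLP with two hidden
# layers of 64 units each for the policy and value networks.
print(model.policy)
```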
PPO Implementation (MLP Policy)
In simple terms, the Proximal Policy Optimization algorithm combines ideas from A2C (having multiple workers) and TRPO (it uses a trust region to improve the actor). The main idea is that after an update, the new policy should not be too far from the old policy. For that, PPO uses clipping to avoid too large an update. Beyond this short description, more information can be found here.
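A minimal PPO training sketch with SB3 defaults (the notebook's hyperparameters may differ):

```python
import gym
from stable_baselines3 import PPO
from stable_baselines3.common.evaluation import evaluate_policy

env = gym.make("LunarLander-v2")

# Train PPO with the default MlpPolicy for 500k timesteps.
model = PPO("MlpPolicy", env, verbose=1)
model.learn(total_timesteps=500_000)
model.save("ppo_lunarlander")

# Evaluate the trained policy over 10 episodes.
mean_reward, std_reward = evaluate_policy(model, env, n_eval_episodes=10)
print(f"PPO mean reward: {mean_reward:.2f} +/- {std_reward:.2f}")
```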
A2C Implementation
Again, in simpler terms, A2C is a synchronous, deterministic variant of Asynchronous Advantage Actor-Critic (A3C). It uses multiple workers to avoid the use of a replay buffer; where A3C applies asynchronous gradient descent to optimize the deep neural network controllers, A2C waits for all workers to finish before performing a synchronous update. More information can be found here
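A minimal A2C training sketch; the number of parallel workers (n_envs=4) is an assumption, not taken from the notebook:

```python
from stable_baselines3 import A2C
from stable_baselines3.common.env_util import make_vec_env

# A2C is usually run with several parallel workers; make_vec_env
# creates multiple copies of the environment in one vectorized env.
vec_env = make_vec_env("LunarLander-v2", n_envs=4)  # n_envs=4 is an assumed choice

model = A2C("MlpPolicy", vec_env, verbose=1)
model.learn(total_timesteps=500_000)
model.save("a2c_lunarlander")
```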
DQN Implementation
Deep Q-Network (DQN) builds on Fitted Q-Iteration and makes use of several tricks to stabilize learning with neural networks: it uses a replay buffer, a target network and gradient clipping. More information can be found here
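A minimal DQN training sketch with SB3 defaults (the notebook may tune the replay buffer size, exploration schedule, etc.):

```python
import gym
from stable_baselines3 import DQN

env = gym.make("LunarLander-v2")

# DQN with the default MlpPolicy: a Q-network trained from a replay
# buffer, with a periodically updated target network.
model = DQN("MlpPolicy", env, verbose=1)
model.learn(total_timesteps=500_000)
model.save("dqn_lunarlander")
```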
Conclusion and Results
- All agents were trained for only 500k timesteps, due to computational constraints and because this page is only meant to showcase basic experience in setting up a Gym environment and testing out standard SB3 RL algorithms.
- Training each policy for more timesteps usually increases the average reward per episode, though only up to a limit.
- At 500k timesteps, PPO performs the best with a mean reward of 231.71 +/- 19.68, followed by DQN and finally A2C.
- These results depend heavily on the number of training timesteps and on hyperparameter tuning, so I'll end with this: more experimentation is needed to get a conclusive result.