Reinforcement learning is the training of machine learning models to make a sequence of decisions based on observations. The agent learns to achieve a goal in an uncertain, potentially complex environment, using trial and error to come up with a solution to the problem. For the actions it performs, the agent receives either rewards or penalties; its goal is to maximize the total reward.
Q-Learning is a value-based approach in which the agent takes the actions with the biggest values/advantages (based on rewards). It is suitable for discrete action spaces and can be performed in a tabular style or with deep learning.
Q-Learning
Tabular approach
Generates a table from which an agent can choose the most valuable action, based on its state, via the Q-value.
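A minimal sketch of the tabular update rule, assuming a small environment with integer states and actions; the sizes, learning rate and discount factor here are illustrative choices, not from the notes above:

```python
import numpy as np

n_states, n_actions = 16, 4           # assumed small, discrete environment
Q = np.zeros((n_states, n_actions))   # the Q-table: one value per (state, action)
alpha, gamma = 0.1, 0.99              # learning rate and discount factor

def q_update(s, a, r, s_next, done):
    # Standard tabular Q-learning target: r + gamma * max_a' Q(s', a')
    target = r if done else r + gamma * Q[s_next].max()
    Q[s, a] += alpha * (target - Q[s, a])

def greedy_action(s):
    # Choose the most valuable action for state s from the table
    return int(Q[s].argmax())
```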
Deep Neural Networks → Deep Q-Learning (DQN)
States are fed to a neural network that outputs Q-values for the actions. The NN learns the Q-values by interacting with the environment. For the exploration/exploitation trade-off, an epsilon-greedy strategy is used: greedy means always choosing the most valuable action (the argmax of the Q-values), while exploration means taking random actions. Q-Learning converges to the global optimum (in the tabular case, under standard conditions), but training can exhibit high variance.
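A minimal epsilon-greedy sketch in PyTorch, assuming a network q_net that maps a state tensor to a vector of Q-values; the names and shapes are illustrative:

```python
import random
import torch

def select_action(q_net, state, epsilon, n_actions):
    # Exploration: with probability epsilon, take a uniformly random action.
    if random.random() < epsilon:
        return random.randrange(n_actions)
    # Exploitation: otherwise act greedily, i.e. argmax over predicted Q-values.
    with torch.no_grad():
        q_values = q_net(state.unsqueeze(0))   # shape: (1, n_actions)
    return int(q_values.argmax(dim=1).item())
```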
DQN Advances
Policy Gradient
The Policy Gradient method directly maps states to actions, with the goal of maximizing the expected reward.
Steps:
Cross-Entropy Loss = -log(Policy) * Ground-Truth Vector (labels), where the labels are one-hot vectors
Policy Loss = Cross-Entropy Loss * Reward
The one-hot encoded vector can be interpreted as a fake label consisting of the actions chosen during the episode. The cross-entropy between this label and the policy output is computed to obtain gradients. These gradients are multiplied by the advantage or reward values, which increases or decreases the likelihood of each action in proportion to its advantage. Finally, this gradient is backpropagated to adjust the weights and biases of the NN. Because actions are sampled from a categorical distribution, the exploration/exploitation trade-off is handled automatically; a minimal sketch of this loss follows below.
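A REINFORCE-style sketch of this policy loss in PyTorch, assuming logits from a policy network and per-step returns that have already been computed; all names here are illustrative assumptions:

```python
import torch
from torch.distributions import Categorical

def policy_gradient_loss(logits, actions, returns):
    # logits:  (T, n_actions) raw policy outputs over one episode
    # actions: (T,) the actions that were actually sampled (the "fake labels")
    # returns: (T,) discounted rewards or advantages per step
    dist = Categorical(logits=logits)
    log_probs = dist.log_prob(actions)        # log pi(a_t | s_t)
    # -log(policy) * reward: scale each action's gradient by its return
    return -(log_probs * returns).mean()

# Sampling from the categorical distribution handles exploration automatically:
# dist = Categorical(logits=policy_net(state)); action = dist.sample()
```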
Actor-Critic combines value learning with policy gradient methods. The critic tells the actor how good the chosen action was, and the actor is updated accordingly.
Critic
The value function is the critic; it is learned with regression methods.
Actor
The policy gradient is the actor, which is updated using the critic's values.
Policy Loss = -log(Policy) * Critic Value
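A minimal sketch of both updates, assuming an actor network that outputs logits and a critic that outputs state values; the names and the use of returns minus values as the advantage are illustrative assumptions:

```python
import torch
import torch.nn.functional as F
from torch.distributions import Categorical

def actor_critic_losses(logits, values, actions, returns):
    # values:  (T,) critic estimates V(s_t);  returns: (T,) observed returns
    advantage = returns - values.detach()         # how much better than expected
    log_probs = Categorical(logits=logits).log_prob(actions)
    actor_loss = -(log_probs * advantage).mean()  # -log(policy) * critic signal
    critic_loss = F.mse_loss(values, returns)     # plain regression on returns
    return actor_loss, critic_loss
```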
After many interaction steps, useful policies can be learned. The following advancements achieve better results.
A3C
Rolls out multiple actors that act in different environments and collect training samples. The policy is updated asynchronously.
A2C
Synchronous version of A3C: the policy update is performed only after all agents have finished collecting samples, and the gradient is averaged over all agents' samples. As a result, the agents always act on the same policy; see the sketch below.
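A sketch of the synchronous update, reusing the actor_critic_losses function from the sketch above; the rollout format is an illustrative assumption:

```python
import torch

def a2c_update(optimizer, rollouts):
    # rollouts: samples collected synchronously from N parallel environments,
    # each a dict with 'logits', 'values', 'actions' and 'returns' tensors.
    losses = [sum(actor_critic_losses(r['logits'], r['values'],
                                      r['actions'], r['returns']))
              for r in rollouts]
    optimizer.zero_grad()
    # One step on the mean loss: the gradient averages over all agents'
    # samples, so every agent keeps acting on the same, shared policy.
    torch.stack(losses).mean().backward()
    optimizer.step()
```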
PPO (Proximal Policy Optimization)
Actor-Critic method that clips the policy probability ratio during the update step, so that the new policy stays in a trust region and does not move too far away from the old policy (see the sketch after this list).
Makes use of Generalized Advantage Estimation (GAE), which mixes Monte Carlo returns with Temporal Difference learning.
Gives the policy an entropy bonus for better exploration.
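A minimal sketch of the clipped surrogate objective, assuming log-probabilities under the new and old policy and GAE advantages are already computed; the coefficient values are illustrative defaults:

```python
import torch

def ppo_policy_loss(new_log_probs, old_log_probs, advantages,
                    entropy, clip_eps=0.2, entropy_coef=0.01):
    # Probability ratio pi_new(a|s) / pi_old(a|s)
    ratio = torch.exp(new_log_probs - old_log_probs)
    # Clip the ratio into [1 - eps, 1 + eps]: the trust region
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    # Pessimistic bound: take the smaller of clipped and unclipped objectives
    objective = torch.min(ratio * advantages, clipped * advantages)
    # The entropy bonus rewards more random (exploratory) policies
    return -(objective.mean() + entropy_coef * entropy)
```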
SAC (Soft Actor-Critic)
A maximum-entropy reinforcement learning method: like Q-Learning, but for continuous action spaces. It adds an entropy term to the objective, so the policy keeps exploring while maximizing reward.
Behavioral Cloning is one form of imitation learning: the agent learns by being shown expert samples, i.e. it is trained supervised on expert state-action pairs (a minimal sketch follows below).
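A minimal behavioral cloning sketch, assuming a discrete-action policy network and a batch of expert state-action pairs; all names are illustrative:

```python
import torch
import torch.nn.functional as F

def behavioral_cloning_step(policy, optimizer, expert_states, expert_actions):
    # Plain supervised learning: the expert's actions serve as the labels.
    logits = policy(expert_states)                  # (B, n_actions)
    loss = F.cross_entropy(logits, expert_actions)  # imitate the expert
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```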
Curriculum learning - learning tasks in order from easy to hard.