WIKI

Reinforcement Learning

Reinforcement learning is the training of machine learning models to make a sequence of decisions based on observations. An agent learns to achieve a goal in an uncertain, potentially complex environment by trial and error. To steer the agent toward the behavior the programmer wants, it receives rewards or penalties for the actions it performs. Its goal is to maximize the total reward.

Useful things to know

Terminology

Categories

Reinforcement Learning Algorithms

Q-Learning

Q-Learning is a value-based approach: the agent estimates a value (or advantage) for each action from the rewards it receives and takes the action with the highest estimate. It is suitable for discrete action spaces and can be implemented in tabular form or with deep learning.
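The tabular variant can be sketched in a few lines. This is a minimal illustration, not a reference implementation: the chain environment (`chain_step`), the helper names, and the hyperparameters (`alpha`, `gamma`, `epsilon`) are assumptions made for this example and do not come from this wiki.

```python
import random


def q_learning_update(q, state, action, reward, next_state, done,
                      alpha=0.1, gamma=0.9):
    """Apply one tabular Q-learning update in place and return the TD error."""
    # Bootstrap from the best next action; terminal states contribute nothing.
    best_next = 0.0 if done else max(q[next_state])
    td_error = reward + gamma * best_next - q[state][action]
    q[state][action] += alpha * td_error
    return td_error


def chain_step(state, action, n_states):
    """Hypothetical toy environment: action 1 moves right, action 0 moves left."""
    if action == 1:
        next_state = state + 1
    else:
        next_state = max(state - 1, 0)
    done = next_state == n_states - 1  # the right-most state is the goal
    reward = 1.0 if done else 0.0
    return next_state, reward, done


def choose_action(q_row, epsilon, rng):
    """Epsilon-greedy selection with random tie-breaking."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_row))
    best = max(q_row)
    return rng.choice([a for a, v in enumerate(q_row) if v == best])


def train(n_states=5, episodes=500, epsilon=0.2, max_steps=200, seed=0):
    rng = random.Random(seed)
    # One row per state, one column per action (0 = left, 1 = right).
    q = [[0.0, 0.0] for _ in range(n_states)]
    for _ in range(episodes):
        state = 0
        for _ in range(max_steps):
            action = choose_action(q[state], epsilon, rng)
            next_state, reward, done = chain_step(state, action, n_states)
            q_learning_update(q, state, action, reward, next_state, done)
            state = next_state
            if done:
                break
    return q


q_table = train()
# Greedy action for every non-terminal state.
greedy_policy = [max(range(2), key=lambda a: q_table[s][a]) for s in range(4)]
```

In this toy setup the greedy policy should end up preferring the move toward the goal in every non-terminal state. A deep variant would replace the table `q` with a neural network that predicts the action values.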

Q-Learning

DQN Advances

Policy Gradient Method with Deep Learning

The Policy Gradient Method learns a policy that maps states directly to actions, with the goal of maximizing rewards.

Vanilla Policy Gradient for Discrete Actions

Steps:

  Cross Entropy Loss = -log(Policy) * Labels, where Labels = one-hot ground-truth vector
  Policy Loss = Cross Entropy Loss * Rewards

The one-hot encoded vector can be interpreted as a fake label built from the actions chosen during the episode. Cross entropy is then computed between the policy output and this label. The result is multiplied by the advantage or reward values, so that the gradient increases or decreases the probability of each action according to its advantage. Finally, this gradient is backpropagated to adjust the weights and biases of the neural network. Because actions are sampled from a categorical distribution, the exploration-exploitation trade-off is handled automatically.
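The loss above can be illustrated for a single state without a deep learning framework. This is a minimal sketch under stated assumptions: the three-action `logits` stand in for the output of a policy network, the reward value and the step size of 0.1 are arbitrary, and all function names are made up for this example.

```python
import math
import random


def log_softmax(logits):
    """Numerically stable log of the softmax probabilities."""
    m = max(logits)
    log_total = math.log(sum(math.exp(x - m) for x in logits))
    return [x - m - log_total for x in logits]


def sample_action(logits, rng):
    """Sample an action index from the categorical distribution over logits."""
    probs = [math.exp(lp) for lp in log_softmax(logits)]
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def policy_loss(logits, action, reward):
    """Cross entropy against the one-hot 'fake label', scaled by the reward."""
    one_hot = [1.0 if i == action else 0.0 for i in range(len(logits))]
    cross_entropy = -sum(t * lp for t, lp in zip(one_hot, log_softmax(logits)))
    return cross_entropy * reward


def policy_loss_grad(logits, action, reward):
    """Gradient of policy_loss w.r.t. the logits: (probs - one_hot) * reward."""
    probs = [math.exp(lp) for lp in log_softmax(logits)]
    return [(p - (1.0 if i == action else 0.0)) * reward
            for i, p in enumerate(probs)]


rng = random.Random(0)
logits = [0.0, 0.0, 0.0]   # assumed network output for one state
action = sample_action(logits, rng)
reward = 2.0               # positive reward: the action should become more likely

grad = policy_loss_grad(logits, action, reward)
new_logits = [x - 0.1 * g for x, g in zip(logits, grad)]  # one gradient-descent step

prob_before = math.exp(log_softmax(logits)[action])
prob_after = math.exp(log_softmax(new_logits)[action])
```

After the step, `prob_after` is larger than `prob_before`; with a negative reward the same update would lower the probability of the sampled action. In a real network the gradient with respect to the logits is propagated further back into the weights and biases.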

Actor-Critic Methods

Actor-Critic combines Value Learning with Policy Gradient methods. The Critic estimates how good the chosen action was, and the Actor is updated accordingly.

  Policy Loss = -log(Policy) * Critic(Values)
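A minimal sketch of this loss for a single transition is given below. It assumes the critic's signal is the one-step TD error used as an advantage estimate, which is one common choice and not necessarily the one intended here; the function names, the discount `gamma`, and the example numbers are illustrative assumptions.

```python
import math


def log_softmax(logits):
    """Numerically stable log of the softmax probabilities."""
    m = max(logits)
    log_total = math.log(sum(math.exp(x - m) for x in logits))
    return [x - m - log_total for x in logits]


def actor_critic_losses(logits, action, reward, value, next_value, done,
                        gamma=0.99):
    """Return (actor_loss, critic_loss, advantage) for one transition.

    Assumption: the critic signal is the one-step TD error
    reward + gamma * V(next_state) - V(state).
    """
    target = reward + (0.0 if done else gamma * next_value)
    advantage = target - value
    # The advantage is treated as a constant when updating the actor.
    actor_loss = -log_softmax(logits)[action] * advantage
    # The critic is regressed toward the bootstrapped target.
    critic_loss = advantage ** 2
    return actor_loss, critic_loss, advantage


# Example values (all made up): two actions with equal logits.
actor_loss, critic_loss, advantage = actor_critic_losses(
    logits=[0.0, 0.0], action=0, reward=1.0,
    value=0.5, next_value=1.0, done=False, gamma=0.5,
)
```

A positive advantage means the action turned out better than the critic expected, so minimizing `actor_loss` raises its probability; a negative advantage lowers it.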

After many interaction steps, useful policies can be learned. The following advancements achieve better results.

Soft Actor Critic

Soft Actor Critic is a maximum entropy reinforcement learning method. It is like Q-Learning, but for continuous action spaces.
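"Maximum entropy" means that the agent is rewarded not only for the return but also for keeping its policy random. The objective is commonly written as follows; the symbols are not defined elsewhere in this wiki and are assumptions of this sketch: `\pi` is the policy, `r` the reward, `\mathcal{H}` the entropy of the action distribution, and `\alpha` a temperature that weights entropy against reward.

```latex
J(\pi) = \sum_{t} \mathbb{E}_{(s_t, a_t) \sim \pi}
         \left[ r(s_t, a_t) + \alpha \, \mathcal{H}\big(\pi(\cdot \mid s_t)\big) \right],
\qquad
\mathcal{H}\big(\pi(\cdot \mid s_t)\big)
  = -\,\mathbb{E}_{a \sim \pi(\cdot \mid s_t)} \left[ \log \pi(a \mid s_t) \right]
```

With `\alpha = 0` the entropy term vanishes and the objective reduces to the usual expected total reward.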

Imitation learning

Notes