Reinforcement learning

yueyuan · 2 min read · Nov 8, 2020

k-armed bandit problem: we have an agent who repeatedly chooses among k actions and selects them based on the rewards it receives.

making decisions under uncertainty, for example: clinical trials, what movie to watch, what song to listen to, which advertisement to show, which restaurant to choose

Estimating action values: the goal is to maximize the expected reward.

  1. Estimate action values using the sample average method

Qt(a) = (sum of rewards received when a was taken prior to t) / (number of times a was taken prior to t)

2. Describe greedy action selection: At = argmax_a Qt(a), i.e., the action with the highest value estimate

Estimating action values incrementally: Qn+1 = Qn + (1/n) * (Rn − Qn) (see the code sketch below)

3. Introduce the exploration-exploitation dilemma

exploration: improve knowledge for long-term benefit

exploitation: exploit knowledge for short-term benefit

how to balance the dilemma:

  1. epsilon-greedy action selection (see the code sketch below)

At = argmax_a Qt(a) with probability 1 − epsilon (exploit: stick with the current favorite based on what we know so far)

At ~ Uniform(a1, ..., ak) with probability epsilon (explore: try the other, more uncertain actions)

epsilon = 0 (purely greedy)

If epsilon is too small, we explore too little: learning takes longer and we are less likely to find a near-optimal solution.

Why did epsilon of 0.1 perform better over 1000 steps than epsilon of 0.01?

The 0.01 agent did not explore enough. Thus it ended up selecting a suboptimal arm for longer.

If exploration is so great, why did epsilon of 0.0 (a greedy agent) perform better than epsilon of 0.4?

An epsilon of 0.4 explores so often that it takes many sub-optimal actions, causing it to do worse over the long term.

Running A/B testing is like doing reinforcement learning: you sacrifice some traffic in order to compare which model is better.

2. UCB (Upper Confidence Bound): use the uncertainty in the value estimates to guide exploration (see the code sketch below)

What is the exploration/exploitation tradeoff?

The agent wants to explore to get more accurate estimates of its values. The agent also wants to exploit to get more reward. The agent cannot, however, choose to do both simultaneously.
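
A minimal sketch of the sample-average estimate and its incremental update, Qn+1 = Qn + (1/n)(Rn − Qn); the class and variable names are my own illustration, not from the post:

```python
class ActionValueEstimator:
    """Tracks Q(a) for k actions with the incremental sample-average update."""

    def __init__(self, k):
        self.q = [0.0] * k   # current estimates Q_n(a)
        self.n = [0] * k     # number of times each action was taken

    def update(self, action, reward):
        # Q_{n+1} = Q_n + (1/n) * (R_n - Q_n)
        self.n[action] += 1
        self.q[action] += (reward - self.q[action]) / self.n[action]

est = ActionValueEstimator(k=3)
est.update(action=1, reward=1.0)
est.update(action=1, reward=0.0)
print(est.q)   # [0.0, 0.5, 0.0] -- the sample average of rewards for each action
```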
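
A sketch of epsilon-greedy action selection on a toy 3-armed bandit; the reward means and the epsilon value are made-up assumptions for illustration:

```python
import random

def epsilon_greedy(q_values, epsilon):
    """With probability epsilon pick uniformly at random (explore); otherwise pick argmax Q(a) (exploit)."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

true_means = [0.1, 0.5, 0.3]          # hidden reward means of the three arms
q = [0.0] * 3                         # value estimates
n = [0] * 3                           # action counts
for _ in range(1000):
    a = epsilon_greedy(q, epsilon=0.1)
    r = random.gauss(true_means[a], 1.0)   # noisy reward
    n[a] += 1
    q[a] += (r - q[a]) / n[a]              # incremental sample-average update
print([round(v, 2) for v in q], n)         # arm 1 should be chosen most often
```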
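
And a sketch of UCB action selection, which adds an uncertainty bonus to each estimate so that rarely tried actions still get explored; the exploration constant c is my own choice:

```python
import math

def ucb_action(q_values, counts, t, c=2.0):
    """Pick argmax over Q(a) + c * sqrt(ln(t) / N(a)); untried actions are selected first."""
    for a, count in enumerate(counts):
        if count == 0:
            return a                  # make sure every action is tried at least once
    scores = [q_values[a] + c * math.sqrt(math.log(t) / counts[a])
              for a in range(len(q_values))]
    return max(range(len(scores)), key=lambda a: scores[a])

# the bonus shrinks as an action's count N(a) grows
print(ucb_action(q_values=[0.4, 0.5, 0.3], counts=[2, 50, 10], t=62))  # 0: high uncertainty wins
```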

MDP (Markov Decision Process)

An action brings not only a reward but also a new state. That is, the reward depends not only on the action but also on the state we are in, and the action in turn changes the state.

MDPs provide a general framework for sequential decision-making: the agent’s choices influence not only the immediate reward but also future states and rewards.

The dynamics of an MDP are defined by a probability distribution p(s', r | s, a): the joint probability of the next state s' and reward r, given the current state s and action a.
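
A minimal sketch of what such a distribution p(s', r | s, a) can look like for a tiny, made-up two-state MDP; the states, actions, rewards, and probabilities here are all illustrative:

```python
import random

# p(s', r | s, a): for each (state, action), a list of (next_state, reward, probability)
dynamics = {
    ("s0", "stay"): [("s0", 0.0, 1.0)],
    ("s0", "move"): [("s1", 1.0, 0.8), ("s0", 0.0, 0.2)],
    ("s1", "stay"): [("s1", 2.0, 1.0)],
    ("s1", "move"): [("s0", 0.0, 1.0)],
}

def step(state, action):
    """Sample (next_state, reward) from p(s', r | s, a)."""
    outcomes = dynamics[(state, action)]
    weights = [p for _, _, p in outcomes]
    next_state, reward, _ = random.choices(outcomes, weights=weights, k=1)[0]
    return next_state, reward

print(step("s0", "move"))   # e.g. ('s1', 1.0), which occurs with probability 0.8
```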

Given a goal, the machine will design its own mechanism to achieve that goal.

Reward hypothesis: goals and purposes can be thought of as the maximization of the expected value of the cumulative sum of rewards received.
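
As a small illustration of “the cumulative sum of rewards”: the return over a reward sequence, with an optional discount factor gamma (the discount is my addition here, not something the post mentions):

```python
def compute_return(rewards, gamma=1.0):
    """G = R_1 + gamma*R_2 + gamma^2*R_3 + ...; gamma = 1 gives the plain cumulative sum."""
    g = 0.0
    for k, r in enumerate(rewards):
        g += (gamma ** k) * r
    return g

print(compute_return([1.0, 0.0, 2.0]))             # 3.0 (undiscounted)
print(compute_return([1.0, 0.0, 2.0], gamma=0.9))  # 1.0 + 0.9*0.0 + 0.81*2.0 ≈ 2.62
```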
