Introduction to the CFR (Counterfactual Regret Minimization) Algorithm
Based on the author's explanation on Quora
Introduction
A self-learning algorithm
- learns its strategy by repeatedly playing against itself
- initialized with a uniformly random strategy
- playing every action at every decision point with equal probability
- plays actions in proportion to their accumulated positive regret
- its average strategy converges to an optimal strategy that can do no worse than tie against any opponent
Implementation
- Sum the total regret for each action at each decision point
- regret: how much better would we have done if we had always played this one action at this decision point, instead of our previous choices?
- Positive regret means that we would have done better if we had taken that action more often
- Negative regret means that we would have done better by not taking that action at all
- the more regret an action has accumulated, the more often it should be chosen
- choose actions with probabilities proportional to their positive regret (regret matching; see the sketch after this list)
- after each game, update regret values
- Counter-intuitively, the sequence of strategies CFR plays does not necessarily converge to anything useful
- (with newer variants such as CFR+, though, the current strategy does converge in practice)
- in a two-player zero-sum game, if you compute the average strategy over those billions of strategies in the sequence, then that average strategy will converge towards a Nash equilibrium of the game
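A minimal sketch of this loop in Python, using rock-paper-scissors self-play as a toy stand-in for a single decision point. All names here (`regret_matching`, `train`, `PAYOFF`) are my own for illustration, not from any CFR library:

```python
import random

# Toy sketch: regret matching via self-play in rock-paper-scissors.
# All names are illustrative; this is not taken from any CFR library.
N = 3  # actions: 0 = rock, 1 = paper, 2 = scissors
PAYOFF = [[ 0, -1,  1],   # row player's payoff: PAYOFF[my_action][opp_action]
          [ 1,  0, -1],
          [-1,  1,  0]]

def regret_matching(regret_sum):
    """Play actions with probability proportional to positive regret."""
    positive = [max(r, 0.0) for r in regret_sum]
    total = sum(positive)
    if total > 0:
        return [p / total for p in positive]
    return [1.0 / N] * N  # no positive regret yet: uniformly random

def train(iterations):
    regret = [[0.0] * N, [0.0] * N]     # cumulative regret per player
    strat_sum = [[0.0] * N, [0.0] * N]  # cumulative strategy (for the average)
    for _ in range(iterations):
        strategies = [regret_matching(regret[p]) for p in (0, 1)]
        actions = [random.choices(range(N), weights=s)[0] for s in strategies]
        for p in (0, 1):
            me, opp = actions[p], actions[1 - p]
            for a in range(N):
                # Regret: how much better we would have done had we
                # always played action a instead of what we played.
                regret[p][a] += PAYOFF[a][opp] - PAYOFF[me][opp]
                strat_sum[p][a] += strategies[p][a]
    # The *average* strategy over all iterations is what converges.
    return [[s / sum(ss) for s in ss] for ss in strat_sum]

if __name__ == "__main__":
    print(train(200_000))  # both averages should approach [1/3, 1/3, 1/3]
```

Note that the per-iteration strategies keep oscillating; it is the running average returned by `train` that settles towards the equilibrium [1/3, 1/3, 1/3], which is exactly the point of the last bullet above.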
Nash equilibrium
- Do no worse than tie against any other strategy
- Plays perfect defence
- Just wins when the opponent makes mistakes
- since attempting to find and exploit an opponent's mistakes usually makes it possible for an even smarter opponent to exploit your new strategy
- exploitability
- the maximum expected amount that a perfect counter-strategy could win against it
- exploitability = 0 at a Nash equilibrium
- CFR makes the average strategy's exploitability converge towards zero (see the sketch below)
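As a toy illustration (the game and the helper name `exploitability` are my own, not a library API), here is how exploitability can be measured in rock-paper-scissors; since the game's value is 0, the best response's expected payoff is exactly the exploitability:

```python
# Toy illustration: exploitability in rock-paper-scissors.
# PAYOFF and exploitability() are illustrative names, not a library API.
PAYOFF = [[ 0, -1,  1],   # row player's payoff
          [ 1,  0, -1],
          [-1,  1,  0]]

def exploitability(strategy):
    """Expected payoff of a perfect counter-strategy (best response)."""
    return max(
        sum(strategy[opp] * PAYOFF[a][opp] for opp in range(3))
        for a in range(3)
    )

print(exploitability([1/3, 1/3, 1/3]))  # 0.0: the Nash equilibrium
print(exploitability([0.4, 0.3, 0.3]))  # ~0.1: slightly exploitable
```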
Result
- the best poker programs started beating the world's best human players in heads-up limit hold'em in 2008, even though they were still massively exploitable by this worst-case measure
- in January 2015, the game was essentially weakly solved
- a strategy with such a low exploitability (0.000986 big blinds per game)
- with 95% statistical confidence that they were actually winning against everyone
Algorithm Implementation
To be added.
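Until this is written up properly, here is a rough sketch of the two core update rules, loosely following the notation of the CFR paper listed in the references; $I$ denotes an information set, $a$ an action, and $R^T(I,a)$ is shorthand for the cumulative counterfactual regret after $T$ iterations. The next strategy comes from regret matching on the positive part of the regret, $R^{T,+}(I,a) = \max(R^T(I,a), 0)$:

$$\sigma^{T+1}(I,a) = \begin{cases} \dfrac{R^{T,+}(I,a)}{\sum_{a'} R^{T,+}(I,a')} & \text{if } \sum_{a'} R^{T,+}(I,a') > 0 \\ \dfrac{1}{|A(I)|} & \text{otherwise,} \end{cases}$$

and the strategy that actually converges is the reach-weighted average

$$\bar{\sigma}^T(I,a) = \frac{\sum_{t=1}^{T} \pi_i^{\sigma^t}(I)\,\sigma^t(I,a)}{\sum_{t=1}^{T} \pi_i^{\sigma^t}(I)},$$

where $\pi_i^{\sigma^t}(I)$ is the probability that player $i$'s own actions bring play to $I$ under $\sigma^t$.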
Example Code
To be added.
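In the meantime, a self-contained vanilla CFR trainer for Kuhn poker, a standard toy example; all names (`Node`, `cfr`, `train`, ...) are my own and not taken from openCFR or any other library:

```python
import random

ACTIONS = ["p", "b"]  # p = pass/check/fold, b = bet/call
node_map = {}         # information set string -> Node

class Node:
    """Accumulates regrets and strategies for one information set."""
    def __init__(self):
        self.regret_sum = [0.0, 0.0]
        self.strategy_sum = [0.0, 0.0]

    def get_strategy(self, reach_weight):
        # Regret matching: probabilities proportional to positive regret.
        positive = [max(r, 0.0) for r in self.regret_sum]
        total = sum(positive)
        strategy = [p / total if total > 0 else 0.5 for p in positive]
        for a in range(2):
            self.strategy_sum[a] += reach_weight * strategy[a]
        return strategy

    def average_strategy(self):
        total = sum(self.strategy_sum)
        return [s / total if total > 0 else 0.5 for s in self.strategy_sum]

def cfr(cards, history, p0, p1):
    """Expected utility for the player to act; p0/p1 are reach probabilities."""
    player = len(history) % 2
    opponent = 1 - player
    # Terminal states
    if len(history) >= 2:
        if history[-1] == "p":
            if history == "pp":       # check-check: showdown for the antes
                return 1 if cards[player] > cards[opponent] else -1
            return 1                  # opponent folded after a bet
        if history[-2:] == "bb":      # bet-call: showdown for 2
            return 2 if cards[player] > cards[opponent] else -2
    info_set = str(cards[player]) + history  # own card + public history
    node = node_map.setdefault(info_set, Node())
    strategy = node.get_strategy(p0 if player == 0 else p1)
    util = [0.0, 0.0]
    node_util = 0.0
    for a in range(2):
        nh = history + ACTIONS[a]
        # Negate: the recursive value is from the other player's viewpoint.
        if player == 0:
            util[a] = -cfr(cards, nh, p0 * strategy[a], p1)
        else:
            util[a] = -cfr(cards, nh, p0, p1 * strategy[a])
        node_util += strategy[a] * util[a]
    # Regret is weighted by the opponent's reach probability (counterfactual).
    opp_reach = p1 if player == 0 else p0
    for a in range(2):
        node.regret_sum[a] += opp_reach * (util[a] - node_util)
    return node_util

def train(iterations):
    cards = [1, 2, 3]
    util = 0.0
    for _ in range(iterations):
        random.shuffle(cards)  # chance node: deal one card to each player
        util += cfr(cards, "", 1.0, 1.0)
    print("avg game value for player 1:", util / iterations)
    for s in sorted(node_map):
        print(s, [round(p, 3) for p in node_map[s].average_strategy()])

if __name__ == "__main__":
    train(100_000)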
Summary
To be added.
References
- (CFR) Regret Minimization in Games with Incomplete Information
- (CFR+) Solving Large Imperfect Information Games Using CFR+
- (CFR) Explanation of CFR by the inventor himself
- poker AI news
- poker AI news 2
- (Implementation) openCFR