CFR(Counterfactual Regret Minimization) 演算法簡介

參考作者在Quora上的解釋

Introduction

self-learning algorithm

  • learns strategy by repeatedly playing against itself
  • initialized with uniformly random
    • playing every action at every decision point with equal probability
  • play the action with maximum regret
  • it will converge to optimal strategy that can do no worse than tie against any opponent

Implementation

  • Summing total regret for each action at each decision point

    • regret: how much better if just always played this one action at this decision, instead of previous choices?
      • Positive regret means that we would have done better if we had taken that action more often
      • Negative regret means that we would have done better by not taking that action at all
      • 愈多regret,代表此選項要多選
    • do actions with probabilities proportional to their positive regret
    • after each game, update regret values
  • Counter-intuitively, sequence of strategies does not necessarily converge to anything useful

    • But it now does so in practice
    • in a two-player zero-sum game, if you compute the average strategy over those billions of strategies in the sequence, then that average strategy will converge towards Nash equilibrium of the game

Nash equilibrium(納許均衡)

  • Do no worse than tie against any other strategy
  • Plays perfect defence
    • Just wins when the opponent makes mistakes
      • since attempting to find and exploit an opponent's mistakes usually makes it possible for an even smarter opponent to exploit your new strategy
  • exploitability(利用度)
    • maximum expectation that a perfect counter-strategy could win
    • exploitability = 0 when Nash equilibrium
    • CFR can make average strategy's exploitability converges towards zero

Result

  • best poker programs started beating the world's best human players in heads-up limit hold'em in 2008, even though there were still massively exploitable by this worst-case measure
  • In January 2015, we've essentially weakly solved the game
    • a strategy with such a low exploitability (0.000986 big blinds per game)
      • have 95% statistical confidence that they were actually winning against everyone

Algorithm Implementation

待補充

Example Code

待補充

Summary

待補充

參考資料

  • (CFR)Regret Minimization in Games with Incomplete Information
  • (CFR+)Solving Large Imperfect Information Games Using CFR+
  • (CFR)Explanation of CFR by inventor himself
  • poker AI news
  • poker AI news2
  • (Implementation)openCFR