Play Flappy Bird yourself, then watch a Q-Learning AI start from zero (dying on the first pipe) and slowly master the game through pure trial and error!
What the agent observes: bird height, vertical speed, distance to next pipe gap, gap position.
Just two: flap or don't flap. Simple actions, but the timing makes all the difference!
+1 per frame alive, +10 per pipe passed, -100 for collision. The agent learns to maximise total reward.
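As a rough sketch, the per-frame reward described above could be computed like this; the constants come from the text, while the `crashed` and `passed_pipe` flags are hypothetical names the game loop would supply:

```python
def frame_reward(crashed: bool, passed_pipe: bool) -> float:
    """Reward for a single frame, using the values described above."""
    if crashed:
        return -100.0          # collision with a pipe or the ground
    reward = 1.0               # +1 for surviving this frame
    if passed_pipe:
        reward += 10.0         # bonus for clearing a pipe
    return reward
```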
Maps every (state, action) pair to an estimate of the total future reward. Updated after every frame using the Bellman equation.
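In code, the tabular update looks roughly like this minimal sketch; the learning rate `ALPHA` and discount `GAMma` below are assumed hyperparameter values, not ones taken from the demo:

```python
from collections import defaultdict

# Q-table: maps (state, action) -> estimated total future reward.
Q = defaultdict(float)

ALPHA = 0.1   # learning rate (assumed value)
GAMMA = 0.95  # discount factor (assumed value)

def q_update(state, action, reward, next_state, actions=(0, 1)):
    """One Q-learning (Bellman) update after a single frame."""
    best_next = max(Q[(next_state, a)] for a in actions)
    target = reward + GAMMA * best_next
    Q[(state, action)] += ALPHA * (target - Q[(state, action)])
```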
You trained a Q-Learning agent to play Flappy Bird from scratch!
The discretised observation: bird height bucket, velocity bucket, horizontal distance to gap, gap position bucket.
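A sketch of that bucketing, with illustrative bucket sizes (the demo's actual grid may be coarser or finer):

```python
def discretise(bird_y: float, velocity: float,
               dist_to_gap: float, gap_y: float) -> tuple:
    """Collapse continuous game values into coarse buckets so the Q-table stays small."""
    return (
        int(bird_y // 40),       # bird height bucket (40 px per bucket, assumed)
        int(velocity // 2),      # vertical speed bucket
        int(dist_to_gap // 30),  # horizontal distance to the gap
        int(gap_y // 40),        # gap position bucket
    )
```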
The reward function defines what the agent optimises. A poorly designed reward leads to unexpected, often hilarious, behaviour.
Epsilon-greedy: high ε early = try random actions. Low ε later = use learned knowledge. The schedule matters!
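A typical epsilon-greedy schedule might look like the sketch below; the decay rate and bounds are illustrative, not the demo's actual values:

```python
import random

EPSILON_START, EPSILON_MIN, EPSILON_DECAY = 1.0, 0.01, 0.995  # assumed schedule

def choose_action(Q, state, epsilon, actions=(0, 1)):
    """Explore with probability epsilon, otherwise exploit the best known action."""
    if random.random() < epsilon:
        return random.choice(actions)                        # explore: random flap / no-flap
    return max(actions, key=lambda a: Q.get((state, a), 0))  # exploit: best known Q-value

def decay(epsilon):
    """Shrink epsilon after each episode so the agent explores less over time."""
    return max(EPSILON_MIN, epsilon * EPSILON_DECAY)
```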
When the Q-table stabilises and scores plateau at a high level, the agent has converged to a near-optimal policy.
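One simple way to spot that stabilisation, as a sketch: compare snapshots of the Q-table between checkpoints and call it stable when no entry moves by more than a small tolerance (the function and tolerance here are illustrative):

```python
def q_table_stable(old_q: dict, new_q: dict, tolerance: float = 1e-3) -> bool:
    """True when no Q-value changed by more than `tolerance` since the last snapshot."""
    keys = set(old_q) | set(new_q)
    return all(abs(new_q.get(k, 0.0) - old_q.get(k, 0.0)) < tolerance for k in keys)
```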
DeepMind's AlphaGo used RL + MCTS to beat the world Go champion. OpenAI Five beat world champions at Dota 2.
In games like Go, you only find out if you won at the very end: no reward for 200+ moves. Hard for RL to handle!