๐Ÿฆ Reinforcement Learning ยท Game AI

Hungry Bird

Play Flappy Bird yourself, then watch a Q-Learning AI start from zero — dying on the first pipe — and slowly master the game through pure trial and error!

🎮 You Play First
🤖 AI Learns
📈 Training Curve
🎛️ Reward Shaping
🏆 Badge

How RL Masters Games

🎮

State

What the agent observes: bird height, vertical speed, distance to next pipe gap, gap position.
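These four observations are continuous numbers, so a Q-table agent typically buckets them into a small discrete state first. A minimal sketch in Python (the function name, bucket sizes, and example values are illustrative guesses, not this page's actual code):

```python
def discretize(bird_y, velocity, pipe_dx, gap_y, bucket=40):
    """Bucket continuous observations into a small hashable state tuple.
    Bucket sizes here are illustrative, not the page's exact values."""
    return (
        int(bird_y // bucket),   # bird height bucket
        int(velocity // 2),      # vertical speed bucket
        int(pipe_dx // bucket),  # horizontal distance to the next gap
        int(gap_y // bucket),    # gap position bucket
    )

# e.g. a bird at height 210, moving at -3, 95 px from a gap centred at 160
state = discretize(210, -3, 95, 160)
```

The tuple is hashable, so it can be used directly as a Q-table key; coarser buckets mean a smaller table but a blurrier view of the game.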

⚡

Actions

Just two: flap or don't flap. Simple actions, but the timing makes all the difference!

๐Ÿ†

Rewards

+1 per frame alive, +10 per pipe passed, -100 for collision. The agent learns to maximise total reward.
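In code, that reward signal could look like the following sketch (the helper name and flag arguments are illustrative; the numbers are the ones quoted above):

```python
def frame_reward(alive, passed_pipe, crashed):
    """Reward for one frame, using the values quoted above."""
    if crashed:
        return -100.0      # collision: big penalty, episode ends
    r = 0.0
    if alive:
        r += 1.0           # small bonus for surviving this frame
    if passed_pipe:
        r += 10.0          # larger bonus for clearing a pipe
    return r
```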

📚

Q-Table

Maps every (state, action) pair to an estimate of expected future reward. Updated after every frame using the Bellman equation.
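A minimal sketch of that update, assuming a learning rate α = 0.1 and discount γ = 0.95 (both illustrative hyperparameters) and the two actions 0 = glide, 1 = flap:

```python
from collections import defaultdict

ALPHA, GAMMA = 0.1, 0.95   # learning rate and discount factor (illustrative)
Q = defaultdict(float)     # maps (state, action) -> estimated future reward

def update(state, action, r, next_state):
    # Bellman update: Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
    best_next = max(Q[(next_state, a)] for a in (0, 1))  # 0 = glide, 1 = flap
    Q[(state, action)] += ALPHA * (r + GAMMA * best_next - Q[(state, action)])

# one frame: the agent flapped in state "s0", earned +10, landed in "s1"
update("s0", 1, 10.0, "s1")
```

Because `Q` is a `defaultdict`, unseen (state, action) pairs silently start at 0, which is why the "Q-table size" stat grows as the agent visits new states.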

๐Ÿฆ
Wizzy the AI Tutor
Play Hungry Bird yourself first! ๐ŸŽฎ Click or press Space to flap. Experience firsthand how hard it is to time those flaps perfectly. Remember your best score โ€” the AI will start with zero knowledge and has to figure all of this out by itself!

Step 1 — You Play First

Click canvas or press Space to flap

🎮 Your Score

Current score: 0
Best score: 0
Games played: 0
Pipes passed: 0
💡 Your challenge:
Score as high as you can! The AI starts knowing nothing — it will fail over and over. But after hundreds of tries, it will beat your score easily. That's the power of RL!
๐Ÿฆ
Wizzy the AI Tutor
Now watch the AI learn! ๐Ÿค– It starts completely random โ€” flapping at wrong times and dying immediately. But each death teaches it something. Watch the Q-table fill up and the score gradually improve. Press Fast to skip ahead hundreds of episodes!

Step 2 — AI Training

🤖 AI Stats

Episode: 0
Current score: 0
Best score: 0
Avg (last 20): 0
Exploration ε: 1.00
Q-table size: 0

📈 Score per Episode

// Ready to train. Press Start!
๐Ÿฆ
Wizzy the AI Tutor
Look at the learning curve! Early episodes = score near zero (random flapping). Around episode 50โ€“100, the curve jumps โ€” the AI discovers the strategy of aligning with the gap. This "breakthrough moment" is one of the most exciting things in RL!

Step 3 — Full Learning Curve

Total Episodes: —
Best Score: —
Your Best Score: —
Q-Table States: —
Train the AI in Phase 2 to see the full learning curve here!
๐Ÿฆ
Wizzy the AI Tutor
Reward shaping is one of the trickiest parts of RL! Change the reward values and retrain โ€” watch how the agent's behaviour completely changes. What if you give a huge reward for every frame alive? Or only reward pipe passages? The reward function IS the goal!

Step 4 — Design the Reward Function

๐ŸŽ›๏ธ Reward Values

Alive reward: 1.0
Pipe reward: 10
Crash penalty: -100

๐ŸŽ›๏ธ Reward Training

Episode: 0
Best score: 0
Avg score: 0
Try these presets:
🟢 Alive only: alive=5, pipe=0, crash=-10
🔵 Pipe focus: alive=0, pipe=50, crash=-100
🔴 Survival: alive=2, pipe=10, crash=-200
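The presets above could be encoded as plain config dictionaries feeding a per-frame reward helper. A sketch (the names `PRESETS` and `preset_reward` are illustrative, not this page's actual code):

```python
# The three presets from the panel above, as plain config dictionaries.
PRESETS = {
    "alive_only": {"alive": 5, "pipe": 0,  "crash": -10},
    "pipe_focus": {"alive": 0, "pipe": 50, "crash": -100},
    "survival":   {"alive": 2, "pipe": 10, "crash": -200},
}

def preset_reward(cfg, alive, passed_pipe, crashed):
    """Per-frame reward under a given preset."""
    if crashed:
        return cfg["crash"]           # crash penalty ends the episode
    return cfg["alive"] * alive + cfg["pipe"] * passed_pipe
```

Swapping the dictionary is all it takes to retrain the agent toward a different goal, which is exactly the experiment this step invites.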

📈 Reward Training Progress

๐Ÿฆ
Wizzy the AI Tutor
๐ŸŽŠ You've trained a real RL agent to play a game โ€” from zero to expert through pure trial and error! This is exactly how DeepMind's AlphaGo, OpenAI Five, and Google's game-playing AIs were trained. You understand Q-learning, reward shaping, and the exploration-exploitation trade-off!
๐Ÿฆ

Game AI Badge!

You trained a Q-Learning agent to play Hungry Bird from scratch!

๐Ÿฆ WhizzStep AI Lab
This certifies that
Student Name
has trained a Q-Learning Game AI from scratch
Game AI Expert
Reward Shaper
RL Engineer
whizzstep.in

Key Concepts Mastered

State Space

๐Ÿ“ What the Agent Sees

The discretised observation: bird height bucket, velocity bucket, horizontal distance to gap, gap position bucket.

Reward Shaping

🎯 Designing the Goal

The reward function defines what the agent optimises. A poorly designed reward leads to unexpected, often hilarious, behaviour.

Exploration vs Exploitation

🎲 Try vs Use

Epsilon-greedy: high ε early = try random actions. Low ε later = use learned knowledge. The schedule matters!
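A sketch of epsilon-greedy selection with a multiplicative decay schedule (the 0.995 decay rate and 0.01 floor are illustrative choices, not this page's exact values):

```python
import random

def epsilon_greedy(Q, state, eps, actions=(0, 1)):
    """With probability eps explore (random action); otherwise exploit."""
    if random.random() < eps:
        return random.choice(actions)   # explore: try something random
    return max(actions, key=lambda a: Q.get((state, a), 0.0))  # exploit

# a common schedule: start fully random, decay each episode, keep a small floor
eps = 1.0
for episode in range(500):
    eps = max(0.01, eps * 0.995)
```

After 500 episodes this schedule leaves ε around 0.08, matching the idea above: mostly exploiting, with a little exploration left so the agent can still discover better moves.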

Convergence

📈 Getting Better

When the Q-table stabilises and scores plateau at a high level, the agent has converged to a near-optimal policy.

AlphaGo / OpenAI Five

๐Ÿ† Real Examples

DeepMind's AlphaGo used RL + MCTS to beat the world Go champion. OpenAI Five beat world champions at Dota 2.

Sparse Rewards

โณ Delayed Feedback

In games like Go, you only find out if you won at the very end — no reward for 200+ moves. Hard for RL to handle!