4T - Agent Dory
A Deep Dive into POMDP Reinforcement Learning
Key Takeaways
- POMDP Challenge: Traditional card games present unique challenges for AI due to hidden information (opponent’s hand), making them ideal testbeds for Partially Observable Markov Decision Processes
- A2C + LSTM Architecture: Combined Advantage Actor-Critic with LSTM memory proved effective for sequential decision-making with temporal dependencies
- Domain-Aware Engineering: Custom reward shaping and observation encoding significantly accelerated learning compared to generic approaches
- 70% Win Rate Achievement: The final agent reached a 70% win rate against random opponents and maintained strong performance against various strategic baselines
- End-to-End Implementation: Built complete game simulation, RL environment, and training pipeline from scratch in PyTorch
What happens when you combine Ecuador’s beloved traditional card game with cutting-edge reinforcement learning? You get a fascinating journey into the world of Partially Observable Markov Decision Processes, temporal memory, and the art of teaching machines to think strategically.
The Game
“40” (Cuarenta) is a traditional Ecuadorian card game that’s deceptively simple on the surface but rich in strategic depth underneath. Players compete to reach exactly 40 points through a combination of capturing cards and strategic play. What makes this game particularly interesting for AI research is its perfect storm of challenges:
- Hidden information: You can’t see your opponent’s hand
- Sequential dependencies: Your current move affects future opportunities
- Multiple scoring mechanisms: Points can be earned through chains, sums, and special combinations
- Strategic timing: Knowing when to go for points versus setting up future plays
These characteristics make “40” an ideal candidate for exploring Partially Observable Markov Decision Processes (POMDPs) - a class of problems where an agent must make decisions without complete information about the environment state.
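For readers who prefer the formal framing, the standard POMDP definition makes the gap explicit - the agent acts on observations, never on the true state:

$$
\text{POMDP} = (S, A, T, R, \Omega, O, \gamma)
$$

Here $S$ is the set of true game states (including the opponent's hidden hand), $A$ the available actions (the cards you can legally play), $T$ the transition dynamics, $R$ the reward function, $\Omega$ the set of observations the agent actually receives (its own hand, the table, the scores), $O$ the observation function mapping states to observations, and $\gamma$ the discount factor.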
The A2C-LSTM Approach
The Memory Problem
Traditional reinforcement learning algorithms often assume full observability - they can see everything they need to make optimal decisions. But card games break this assumption. When you can’t see your opponent’s hand, past observations become crucial for inferring the current state.
This is where Long Short-Term Memory (LSTM) networks shine. By incorporating LSTM into an Advantage Actor-Critic (A2C) architecture, our agent can maintain a “memory” of past observations and use this temporal context to make better decisions.
```python
import torch.nn as nn

class A2C_LSTM(nn.Module):
    def __init__(self, obs_dim: int, action_dim: int, hidden_dim: int = 512,
                 lstm_dim: int = 256):
        super().__init__()

        # Feature extraction network
        self.feature_net = nn.Sequential(
            nn.Linear(obs_dim, hidden_dim),
            nn.LayerNorm(hidden_dim),
            nn.ReLU(),
            nn.Dropout(0.1),
            nn.Linear(hidden_dim, hidden_dim),
            nn.LayerNorm(hidden_dim),
            nn.ReLU(),
            nn.Dropout(0.1)
        )

        # LSTM for temporal memory
        self.lstm = nn.LSTM(hidden_dim, lstm_dim, num_layers=2,
                            batch_first=True, dropout=0.1)

        # Strategic planning head
        self.strategic_head = nn.Sequential(
            nn.Linear(lstm_dim, hidden_dim // 2),
            nn.ReLU(),
            nn.Dropout(0.1)
        )
```
Dual-Head Architecture
The network architecture employs a clever dual-head design:
- Actor Head: Decides which card to play based on the current policy
- Critic Head: Evaluates how good the current state is (value estimation)
But there's a twist - both heads share a "strategic planning" component that processes the LSTM output, so the policy and the value estimate are grounded in the same long-horizon context rather than just the current trick.
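To make the wiring concrete, here is a compact sketch of how the pieces can fit together. This is not the project's exact module - the `actor_head`/`critic_head` names, the reduced layer sizes, and the 5-slot action space (one slot per card in hand) are assumptions:

```python
import torch
import torch.nn as nn

class DualHeadSketch(nn.Module):
    """Hypothetical stand-in for the full A2C_LSTM, showing only how the
    actor and critic heads share the strategic-planning features."""
    def __init__(self, obs_dim: int = 37, action_dim: int = 5,
                 hidden_dim: int = 512, lstm_dim: int = 256):
        super().__init__()
        self.feature_net = nn.Sequential(nn.Linear(obs_dim, hidden_dim), nn.ReLU())
        self.lstm = nn.LSTM(hidden_dim, lstm_dim, batch_first=True)
        self.strategic_head = nn.Sequential(nn.Linear(lstm_dim, hidden_dim // 2), nn.ReLU())
        self.actor_head = nn.Linear(hidden_dim // 2, action_dim)   # policy logits over hand slots
        self.critic_head = nn.Linear(hidden_dim // 2, 1)           # scalar state-value estimate

    def forward(self, obs_seq: torch.Tensor, hidden=None):
        # obs_seq: (batch, seq_len, obs_dim)
        features = self.feature_net(obs_seq)             # per-step features
        lstm_out, hidden = self.lstm(features, hidden)   # temporal memory across the game
        strategic = self.strategic_head(lstm_out)        # shared planning representation
        return self.actor_head(strategic), self.critic_head(strategic).squeeze(-1), hidden
```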
Seeing the Game
One of the most critical aspects of this project was designing how the AI perceives the game state. The `ObservationEncoder` class transforms the complex game state into a fixed-size tensor that the neural network can process:
```python
def encode_observation(self, obs_dict: Dict, seen_cards: List[int] = None,
                       deck_count: int = 40) -> np.ndarray:
    """Convert dictionary observation to fixed-size tensor"""

    # Encode hand and table as card count vectors
    hand_encoded = self.encode_card_counts(obs_dict.get("hand", []))
    table_encoded = self.encode_card_counts(obs_dict.get("table", []))

    # Normalize game metrics
    carton = obs_dict.get("carton", 0) / 40.0
    points = obs_dict.get("points", 0) / 40.0
    opponent_carton = obs_dict.get("opponent_carton", 0) / 40.0
    opponent_points = obs_dict.get("opponent_points", 0) / 40.0

    # Track game progression
    turn = float(obs_dict.get("turn", 0))
    last_card = obs_dict.get("last", -1) / 10.0 if obs_dict.get("last", -1) > 0 else 0.0

    # Memory component: cards seen so far
    seen_encoded = self.encode_card_counts(seen_cards)
    deck_ratio = deck_count / 40.0

    return np.concatenate([
        hand_encoded,       # 10 features - what cards I have
        table_encoded,      # 10 features - what's on the table
        [carton, points, opponent_carton, opponent_points],  # 4 features - scores
        [turn, last_card],  # 2 features - game state
        seen_encoded,       # 10 features - opponent's played cards
        [deck_ratio]        # 1 feature - game progression
    ]).astype(np.float32)
```
This 37-dimensional observation vector captures everything the AI needs to know while maintaining computational efficiency. The genius lies in the “seen cards” component - by tracking which cards the opponent has played, the agent can make educated guesses about what remains in their hand.
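The `encode_card_counts` helper isn't shown in the excerpt above. A plausible implementation - assuming number cards are encoded as ranks 1-7 and the face cards as 8-10, which is consistent with the 10-feature slots and the `/10.0` normalization - is a per-rank count vector normalized by the four suits:

```python
import numpy as np
from typing import List, Optional

def encode_card_counts(cards: Optional[List[int]]) -> np.ndarray:
    """Hypothetical sketch: count how many of each of the 10 ranks appear,
    normalized by the 4 copies of each rank in the 40-card deck."""
    counts = np.zeros(10, dtype=np.float32)
    for rank in cards or []:
        if 1 <= rank <= 10:
            counts[rank - 1] += 1.0
    return counts / 4.0  # each rank has 4 suits, so values land in [0, 1]

# Example: a hand of A, 4, 4, 7, J (with J encoded as 8 in this sketch)
print(encode_card_counts([1, 4, 4, 7, 8]))
```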
The Art of Motivation
Teaching an AI to play a card game isn’t just about showing it the rules - it’s about crafting the right incentives. The reward system in this project goes far beyond simple win/loss signals:
```python
def calculate_step_reward(self, points: int, carton: int) -> float:
    """Calculate sophisticated step rewards"""
    # Direct point rewards
    point_reward = points * 25.0

    # Carton rewards with strategic scaling
    if self.carton <= 19:
        carton_reward = carton * 1.5   # Early game: moderate carton value
    else:
        carton_reward = carton * 5.0   # Late game: high carton value

    # Bonus for clearing the table
    table_clear_bonus = 3.0 if points >= 2 else 0.0

    # Progress incentive
    progress_bonus = 0.0
    if self.score > 30:
        progress_bonus = (self.score - 30) * 0.5

    return point_reward + carton_reward + table_clear_bonus + progress_bonus
```
This multi-faceted reward system captures the strategic nuances of “40”:
- Immediate gratification: Points are heavily rewarded
- Strategic patience: Carton collection becomes more valuable in late game
- Tactical bonuses: Special rewards for table-clearing moves
- Progress incentives: Encouraging the agent to get closer to the winning condition
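The per-step shaping is paired with a terminal reward once the game ends. Based on the comments in the training loop shown later (+100 for a win, a scaled penalty for a loss), a sketch of such a helper might look like the following - the exact penalty scaling here is an assumption, not the project's implementation:

```python
def final_reward(won: bool, score: int) -> float:
    """Hypothetical terminal reward: large win bonus, loss penalty scaled by the score gap."""
    if won:
        return 100.0                  # matches the "+100" noted in the training loop
    shortfall = max(0, 40 - score)    # how far the agent fell short of the 40-point target
    return -50.0 - shortfall          # assumption: the bigger the blowout, the heavier the penalty

# Example: a narrow loss at 38 points is penalized far less than a near-shutout
print(final_reward(False, 38), final_reward(False, 5))   # -52.0 -85.0
```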
The Game Engine
The `Table` class serves as the game engine, orchestrating the complex interactions between players, cards, and scoring rules. Two key mechanisms make "40" strategically rich:
Chain Capturing
```python
def check_chain(self, card: int):
    """Handle sequential card capturing"""
    seen = set(self.current)
    end = card
    while end in seen:
        end += 1                              # Find end of chain

    to_remove = set(range(card, end))         # Remove entire chain
    count = len(to_remove)

    # Special bonus for matching the last played card
    points = 2 if card == self.last else 0

    return [points, count]
```
Sum Capturing
```python
def check_sum(self, card):
    """Find optimal subset that sums to played card"""
    if card > 7:  # Optimization: high cards can't form sums
        return [0, 0]

    # Try all possible combinations, starting with largest
    for r in range(len(self.current), 0, -1):
        for combo in combinations(range(len(self.current)), r):
            subset = [self.current[i] for i in combo]
            if sum(subset) == card:
                # Found optimal capture!
                for val in subset:
                    self.current.remove(val)
                return [0, len(subset)]

    return [0, 0]  # No valid sum found
```
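To see both rules in action, here is a small standalone re-statement of the two capture checks (toy functions, not the project's `Table` API) run on a hypothetical table:

```python
from itertools import combinations

def chain_capture(table, card, last):
    """Toy version of check_chain: capture the run starting at the played card."""
    seen = set(table)
    end = card
    while end in seen:
        end += 1
    captured = set(range(card, end))
    points = 2 if card == last else 0   # bonus for matching the previously played card
    return points, len(captured)

def sum_capture(table, card):
    """Toy version of check_sum: capture the largest subset summing to the played card."""
    if card > 7:
        return 0, 0
    for r in range(len(table), 0, -1):
        for combo in combinations(table, r):
            if sum(combo) == card:
                return 0, r
    return 0, 0

table = [2, 3, 4]
print(chain_capture(table, 2, last=1))  # (0, 3): playing a 2 sweeps the 2-3-4 run
print(sum_capture(table, 5))            # (0, 2): playing a 5 captures the 2 and the 3
```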
These mechanisms create a rich decision space where players must balance immediate gains against future opportunities.
Training Regime
The training process follows a carefully orchestrated progression:
```python
def train_v_random(self, model: str = 'models/rl_player_no_mps.pth'):
    """Train against random opponent with experience replay"""

    with tqdm(total=self.episodes) as pbar:
        for episode in range(self.episodes):
            winner = self.table.game_loop()

            # Calculate final rewards
            if self.rl_player == winner:
                final_reward = self.rl_player.get_final_reward(True)    # +100
            else:
                final_reward = self.rl_player.get_final_reward(False)   # Scaled penalty
            self.rl_player.update_last_transition()  # Fix last transition

            # Periodic training with experience replay
            if episode % 50 == 0 and len(self.rl_player.buffer) > 64:
                for _ in range(5):  # Multiple updates per training cycle
                    losses = self.rl_player.update(batch_size=64)
```
The training strategy includes several sophisticated elements:
- Experience replay: Storing sequences of game states for batch learning
- Periodic updates: Training every 50 games to maintain stability
- Multiple update cycles: 5 gradient updates per training session
- Curriculum learning: Starting against random opponents before facing stronger adversaries
Sequences and States
One of the most elegant aspects of this implementation is how it handles sequential decision-making. The `ReplayBuffer` doesn't just store individual transitions - it stores sequences:
```python
def _process_episode(self):
    """Convert episode into training sequences"""
    if len(self.episode_obs) < self.seq_len:
        return  # Not enough data for sequence learning

    # Create overlapping sequences for training
    for i in range(len(self.episode_obs) - self.seq_len + 1):
        obs_seq = torch.stack(self.episode_obs[i:i+self.seq_len])
        action_seq = torch.stack(self.episode_actions[i:i+self.seq_len])
        reward_seq = torch.stack(self.episode_rewards[i:i+self.seq_len])
        # next_obs_seq, done_seq and hidden_state are built analogously (omitted in this excerpt)

        # Store sequence with hidden state
        self.buffer.push(obs_seq, action_seq, reward_seq, next_obs_seq,
                         done_seq, hidden_state)
```
This approach allows the LSTM to learn from complete game sequences rather than isolated decisions, leading to more strategic play.
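What does `rl_player.update(batch_size=64)` roughly do with those sequences? Below is a sketch of a standard A2C-style update over a sequence batch; the tensor layout, discount factor, and loss coefficients are assumptions rather than the project's exact hyperparameters:

```python
import torch
import torch.nn.functional as F

def a2c_sequence_update(model, optimizer, obs_seq, action_seq, reward_seq, done_seq,
                        gamma=0.99, value_coef=0.5, entropy_coef=0.01):
    """One A2C update over a batch of sequences.
    obs_seq: (B, T, obs_dim); action_seq (long), reward_seq, done_seq: (B, T)."""
    logits, values, _ = model(obs_seq)                 # unroll the LSTM over the whole sequence

    # Discounted returns, computed backwards through each sequence
    returns = torch.zeros_like(reward_seq)
    running = torch.zeros(reward_seq.size(0))
    for t in reversed(range(reward_seq.size(1))):
        running = reward_seq[:, t] + gamma * running * (1.0 - done_seq[:, t])
        returns[:, t] = running

    advantages = returns - values.detach()             # how much better than the critic expected

    dist = torch.distributions.Categorical(logits=logits)
    log_probs = dist.log_prob(action_seq)

    actor_loss = -(log_probs * advantages).mean()      # raise probability of better-than-expected actions
    critic_loss = F.mse_loss(values, returns)          # fit the value function to the returns
    entropy_bonus = dist.entropy().mean()              # keep the policy exploring

    loss = actor_loss + value_coef * critic_loss - entropy_coef * entropy_bonus
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```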
Results and Implications
After 10,000 training episodes, the agent achieved a remarkable 70% win rate against random opponents and maintained strong performance against various strategic baselines. More importantly, the agent developed emergent behaviors that mirror human strategic thinking:
- Card counting: The agent learned to track played cards and infer opponent capabilities
- Timing strategies: Knowing when to go for immediate points versus setting up future captures
- Risk assessment: Balancing aggressive plays against defensive positioning
Technical Innovations and Lessons Learned
1. Domain-Aware Observation Encoding
Rather than using generic state representations, the custom encoder captures game-specific features like card distributions and scoring potential.
2. Hierarchical Reward Design
The multi-level reward system (immediate, tactical, strategic) proved crucial for stable learning.
3. Sequence-Based Training
Training on sequences rather than individual transitions allowed the agent to develop temporal reasoning capabilities.
4. Masked Action Selection
Ensuring the agent can only select valid actions (cards in hand) prevented illegal moves and accelerated learning.
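A common way to implement this kind of masking - and the pattern assumed in this sketch - is to push the logits of illegal actions to negative infinity before sampling, so invalid cards receive exactly zero probability:

```python
import torch

def select_masked_action(logits: torch.Tensor, valid_mask: torch.Tensor) -> int:
    """Sample an action only among valid hand slots.
    logits: (action_dim,) raw policy outputs; valid_mask: (action_dim,) booleans."""
    masked_logits = logits.masked_fill(~valid_mask, float('-inf'))
    probs = torch.softmax(masked_logits, dim=-1)       # illegal cards get probability 0
    return torch.multinomial(probs, num_samples=1).item()

# Example: a 5-slot hand where only slots 0, 2 and 3 still hold cards
logits = torch.tensor([0.2, 1.5, -0.3, 0.8, 2.0])
valid = torch.tensor([True, False, True, True, False])
print(select_masked_action(logits, valid))             # always returns 0, 2 or 3
```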
Looking Forward
This project demonstrates how traditional games can serve as excellent testbeds for advanced AI techniques. The combination of POMDP challenges, temporal dependencies, and strategic depth makes “40” an ideal environment for exploring:
- Multi-agent learning: Training agents to play against each other
- Transfer learning: Applying learned strategies to other card games
- Explainable AI: Understanding what strategic concepts the agent has learned
- Human-AI interaction: Creating engaging gameplay experiences
The success of this A2C-LSTM approach opens doors for applying similar techniques to other domains with partial observability and temporal dependencies - from financial trading to autonomous vehicle navigation.
Conclusion
Building an AI for Ecuador’s traditional game “40” proved to be more than just a fun coding project - it became a comprehensive exploration of modern reinforcement learning techniques. From the careful observation encoding to the sophisticated reward design, every component had to work in harmony to create an agent capable of strategic play.
The 70% win rate isn’t just a number - it represents thousands of learned decisions, strategic patterns, and emergent behaviors that mirror human-like game understanding. As AI continues to tackle increasingly complex challenges, projects like this remind us that sometimes the most profound innovations come from the simplest questions: “Can we teach a machine to play cards?”
The complete source code and trained models are available for those interested in exploring the intersection of traditional games and modern AI techniques.