4T - Agent Dory

A Deep Dive into POMDP Reinforcement Learning

Key Takeaways

  • POMDP Challenge: Traditional card games present unique challenges for AI due to hidden information (opponent’s hand), making them ideal testbeds for Partially Observable Markov Decision Processes
  • A2C + LSTM Architecture: Combined Advantage Actor-Critic with LSTM memory proved effective for sequential decision-making with temporal dependencies
  • Domain-Aware Engineering: Custom reward shaping and observation encoding significantly accelerated learning compared to generic approaches
  • 70% Win Rate Achievement: After 10,000 training episodes the final agent won roughly 70% of games against random opponents and held up against stronger strategic baselines
  • End-to-End Implementation: Built complete game simulation, RL environment, and training pipeline from scratch in PyTorch

What happens when you combine Ecuador’s beloved traditional card game with cutting-edge reinforcement learning? You get a fascinating journey into the world of Partially Observable Markov Decision Processes, temporal memory, and the art of teaching machines to think strategically.

The Game

“40” (Cuarenta) is a traditional Ecuadorian card game that’s deceptively simple on the surface but rich in strategic depth underneath. Players compete to reach exactly 40 points through a combination of capturing cards and strategic play. What makes this game particularly interesting for AI research is its perfect storm of challenges:

  • Hidden information: You can’t see your opponent’s hand
  • Sequential dependencies: Your current move affects future opportunities
  • Multiple scoring mechanisms: Points can be earned through chains, sums, and special combinations
  • Strategic timing: Knowing when to go for points versus setting up future plays

These characteristics make “40” an ideal candidate for exploring Partially Observable Markov Decision Processes (POMDPs) - a class of problems where an agent must make decisions without complete information about the environment state.
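
Formally, a POMDP is described by the tuple (S, A, T, R, Ω, O, γ): hidden states S, actions A, a transition function T, a reward function R, a set of observations Ω, an observation function O mapping hidden states to what the agent actually perceives, and a discount factor γ. In “40”, the true state includes the opponent’s hand and the remaining deck, but the agent only observes its own hand, the table, and the scores, so it has to act on beliefs about the hidden state rather than on the state itself.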

The A2C-LSTM Approach

The Memory Problem

Traditional reinforcement learning algorithms often assume full observability - they can see everything they need to make optimal decisions. But card games break this assumption. When you can’t see your opponent’s hand, past observations become crucial for inferring the current state.

This is where Long Short-Term Memory (LSTM) networks shine. By incorporating LSTM into an Advantage Actor-Critic (A2C) architecture, our agent can maintain a “memory” of past observations and use this temporal context to make better decisions.

import torch.nn as nn

class A2C_LSTM(nn.Module):
    def __init__(self, obs_dim: int, action_dim: int, hidden_dim: int = 512,
                 lstm_dim: int = 256):
        super().__init__()
        # Feature extraction network
        self.feature_net = nn.Sequential(
            nn.Linear(obs_dim, hidden_dim),
            nn.LayerNorm(hidden_dim),
            nn.ReLU(),
            nn.Dropout(0.1),
            nn.Linear(hidden_dim, hidden_dim),
            nn.LayerNorm(hidden_dim),
            nn.ReLU(),
            nn.Dropout(0.1)
        )
        # LSTM for temporal memory
        self.lstm = nn.LSTM(hidden_dim, lstm_dim, num_layers=2,
                            batch_first=True, dropout=0.1)
        # Strategic planning head shared by the actor and critic
        self.strategic_head = nn.Sequential(
            nn.Linear(lstm_dim, hidden_dim // 2),
            nn.ReLU(),
            nn.Dropout(0.1)
        )

Dual-Head Architecture

The network architecture employs a clever dual-head design:

  1. Actor Head: Decides which card to play based on the current policy
  2. Critic Head: Evaluates how good the current state is (value estimation)

But there’s a twist - both heads receive input from a “strategic planning” component that processes the LSTM output, allowing the agent to think more strategically about long-term consequences.
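
For concreteness, here is a minimal sketch of how the two heads might sit on top of the strategic planning layer. The names actor_head and critic_head and this particular forward signature are illustrative assumptions, not the project’s exact code:

# Sketch only: assumed head definitions inside A2C_LSTM.__init__
#   self.actor_head  = nn.Linear(hidden_dim // 2, action_dim)  # policy logits, one per card slot
#   self.critic_head = nn.Linear(hidden_dim // 2, 1)           # scalar state-value estimate

def forward(self, obs_seq, hidden=None):
    # obs_seq has shape (batch, seq_len, obs_dim)
    features = self.feature_net(obs_seq)            # per-step feature extraction
    lstm_out, hidden = self.lstm(features, hidden)  # temporal memory across the episode
    strategic = self.strategic_head(lstm_out)       # shared strategic planning features
    logits = self.actor_head(strategic)             # actor: preference over playable cards
    value = self.critic_head(strategic)             # critic: value of the current situation
    return logits, value, hidden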

Seeing the Game

One of the most critical aspects of this project was designing how the AI perceives the game state. The ObservationEncoder class transforms the complex game state into a fixed-size tensor that the neural network can process:

def encode_observation(self, obs_dict: Dict, seen_cards: List[int] = None,
                       deck_count: int = 40) -> np.ndarray:
    """Convert dictionary observation to fixed-size tensor"""
    # Encode hand and table as card count vectors
    hand_encoded = self.encode_card_counts(obs_dict.get("hand", []))
    table_encoded = self.encode_card_counts(obs_dict.get("table", []))
    # Normalize game metrics (40 points wins the game)
    carton = obs_dict.get("carton", 0) / 40.0
    points = obs_dict.get("points", 0) / 40.0
    opponent_carton = obs_dict.get("opponent_carton", 0) / 40.0
    opponent_points = obs_dict.get("opponent_points", 0) / 40.0
    # Track game progression
    turn = float(obs_dict.get("turn", 0))
    last_card = obs_dict.get("last", -1) / 10.0 if obs_dict.get("last", -1) > 0 else 0.0
    # Memory component: cards seen so far (guard against a missing list)
    seen_encoded = self.encode_card_counts(seen_cards or [])
    deck_ratio = deck_count / 40.0
    return np.concatenate([
        hand_encoded,                                        # 10 features - what cards I have
        table_encoded,                                       # 10 features - what's on the table
        [carton, points, opponent_carton, opponent_points],  # 4 features - scores
        [turn, last_card],                                   # 2 features - game state
        seen_encoded,                                        # 10 features - opponent's played cards
        [deck_ratio]                                         # 1 feature - game progression
    ]).astype(np.float32)

This 37-dimensional observation vector captures everything the AI needs to know while maintaining computational efficiency. The genius lies in the “seen cards” component - by tracking which cards the opponent has played, the agent can make educated guesses about what remains in their hand.
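
The encode_card_counts helper referenced above is not shown in the project snippet; under the assumption that cards are encoded as values 1 through 10 (ace through 7 plus the three face cards), it could be as simple as a count vector, shown here as a standalone function:

import numpy as np

def encode_card_counts(cards):
    """Hypothetical sketch: turn a list of card values (1-10) into a 10-slot count vector."""
    counts = np.zeros(10, dtype=np.float32)
    for value in cards:
        counts[value - 1] += 1.0  # slot 0 counts aces, slot 9 counts kings (assumed encoding)
    return counts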

The Art of Motivation

Teaching an AI to play a card game isn’t just about showing it the rules - it’s about crafting the right incentives. The reward system in this project goes far beyond simple win/loss signals:

def calculate_step_reward(self, points: int, carton: int) -> float:
    """Calculate sophisticated step rewards"""
    # Direct point rewards
    point_reward = points * 25.0
    # Carton rewards with strategic scaling
    if self.carton <= 19:
        carton_reward = carton * 1.5  # Early game: moderate carton value
    else:
        carton_reward = carton * 5.0  # Late game: high carton value
    # Bonus for clearing the table
    table_clear_bonus = 3.0 if points >= 2 else 0.0
    # Progress incentive
    progress_bonus = 0.0
    if self.score > 30:
        progress_bonus = (self.score - 30) * 0.5
    return point_reward + carton_reward + table_clear_bonus + progress_bonus

This multi-faceted reward system captures the strategic nuances of “40”:

  • Immediate gratification: Points are heavily rewarded
  • Strategic patience: Carton collection becomes more valuable in late game
  • Tactical bonuses: Special rewards for table-clearing moves
  • Progress incentives: Encouraging the agent to get closer to the winning condition
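
To put rough numbers on it (ignoring the progress bonus): a capture worth 2 points and 3 cards yields 2 × 25 + 3 × 1.5 + 3 = 57.5 reward while self.carton is still at 19 or below, but 2 × 25 + 3 × 5 + 3 = 68 once the carton count passes that threshold, so the same physical move becomes noticeably more attractive late in the hand.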

The Game Engine

The Table class serves as the game engine, orchestrating the complex interactions between players, cards, and scoring rules. Two key mechanisms make “40” strategically rich:

Chain Capturing

def check_chain(self, card: int):
    """Handle sequential card capturing"""
    seen = set(self.current)
    end = card
    while end in seen:
        end += 1  # Find end of chain
    to_remove = set(range(card, end))  # Cards captured by the chain
    count = len(to_remove)
    # Special bonus for matching the last played card
    points = 2 if card == self.last else 0
    return [points, count]

Sum Capturing

from itertools import combinations

def check_sum(self, card):
    """Find the largest subset of table cards that sums to the played card"""
    if card > 7:  # Cards above 7 (the face cards) can't capture by sum
        return [0, 0]
    # Try all possible combinations, starting with the largest subsets
    for r in range(len(self.current), 0, -1):
        for combo in combinations(range(len(self.current)), r):
            subset = [self.current[i] for i in combo]
            if sum(subset) == card:
                # Found a capture: remove the captured cards from the table
                for val in subset:
                    self.current.remove(val)
                return [0, len(subset)]
    return [0, 0]  # No valid sum found

These mechanisms create a rich decision space where players must balance immediate gains against future opportunities.
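
As a concrete illustration (assuming table cards are stored as plain numeric values): playing a 3 onto a table of [3, 4, 5, 7] walks the chain 3-4-5 and captures three cards, while playing a 6 onto [2, 4, 7] finds the subset 2 + 4 = 6 and captures two; face cards are excluded from sum captures entirely and can only capture by direct match.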

Training Regime

The training process follows a carefully orchestrated progression:

def train_v_random(self, model: str = 'models/rl_player_no_mps.pth'):
    """Train against random opponent with experience replay"""
    with tqdm(total=self.episodes) as pbar:
        for episode in range(self.episodes):
            winner = self.table.game_loop()
            # Calculate final rewards
            if self.rl_player == winner:
                final_reward = self.rl_player.get_final_reward(True)   # +100
            else:
                final_reward = self.rl_player.get_final_reward(False)  # Scaled penalty
            self.rl_player.update_last_transition()  # Fix up the last transition with the final reward
            # Periodic training with experience replay
            if episode % 50 == 0 and len(self.rl_player.buffer) > 64:
                for _ in range(5):  # Multiple updates per training cycle
                    losses = self.rl_player.update(batch_size=64)
            pbar.update(1)

The training strategy includes several sophisticated elements:

  • Experience replay: Storing sequences of game states for batch learning
  • Periodic updates: Training every 50 games to maintain stability
  • Multiple update cycles: 5 gradient updates per training session
  • Curriculum learning: Starting against random opponents before facing stronger adversaries

Sequences and States

One of the most elegant aspects of this implementation is how it handles sequential decision-making. The ReplayBuffer doesn’t just store individual transitions - it stores sequences:

def _process_episode(self):
    """Convert episode into training sequences"""
    if len(self.episode_obs) < self.seq_len:
        return  # Not enough data for sequence learning
    # Create overlapping sequences for training
    for i in range(len(self.episode_obs) - self.seq_len + 1):
        obs_seq = torch.stack(self.episode_obs[i:i + self.seq_len])
        action_seq = torch.stack(self.episode_actions[i:i + self.seq_len])
        reward_seq = torch.stack(self.episode_rewards[i:i + self.seq_len])
        # next_obs_seq, done_seq and the cached hidden_state are built the same
        # way from the other per-step episode buffers (omitted here)
        # Store sequence with hidden state
        self.buffer.push(obs_seq, action_seq, reward_seq, next_obs_seq,
                         done_seq, hidden_state)

This approach allows the LSTM to learn from complete game sequences rather than isolated decisions, leading to more strategic play.
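
The gradient update itself is not shown above. As a rough sketch of what an A2C update over a sampled batch of sequences can look like, assuming the model returns policy logits, values, and a hidden state as in the earlier forward sketch and that actions are stored as integer indices (this is not the project’s exact training code):

import torch
import torch.nn.functional as F

def a2c_sequence_update(model, optimizer, obs_seq, action_seq, reward_seq, gamma=0.99):
    """Sketch: one A2C update over a batch of sequences with shape (batch, seq_len, ...)."""
    logits, values, _ = model(obs_seq)          # unroll the LSTM over the whole sequence
    values = values.squeeze(-1)                 # (batch, seq_len)
    # Discounted returns computed backwards along each sequence
    returns = torch.zeros_like(reward_seq)
    running = torch.zeros_like(reward_seq[:, 0])
    for t in reversed(range(reward_seq.size(1))):
        running = reward_seq[:, t] + gamma * running
        returns[:, t] = running
    advantages = returns - values.detach()      # advantage = return minus the critic's estimate
    log_probs = F.log_softmax(logits, dim=-1)
    action_log_probs = log_probs.gather(-1, action_seq.unsqueeze(-1)).squeeze(-1)
    policy_loss = -(action_log_probs * advantages).mean()    # actor: push up advantageous actions
    value_loss = F.mse_loss(values, returns)                 # critic: regress toward the returns
    entropy = -(log_probs * log_probs.exp()).sum(-1).mean()  # keep the policy exploratory
    loss = policy_loss + 0.5 * value_loss - 0.01 * entropy
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()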

Results and Implications

After 10,000 training episodes, the agent achieved a remarkable 70% win rate against random opponents and maintained strong performance against various strategic baselines. More importantly, the agent developed emergent behaviors that mirror human strategic thinking:

  • Card counting: The agent learned to track played cards and infer opponent capabilities
  • Timing strategies: Knowing when to go for immediate points versus setting up future captures
  • Risk assessment: Balancing aggressive plays against defensive positioning

Technical Innovations and Lessons Learned

1. Domain-Aware Observation Encoding

Rather than using generic state representations, the custom encoder captures game-specific features like card distributions and scoring potential.

2. Hierarchical Reward Design

The multi-level reward system (immediate, tactical, strategic) proved crucial for stable learning.

3. Sequence-Based Training

Training on sequences rather than individual transitions allowed the agent to develop temporal reasoning capabilities.

4. Masked Action Selection

Ensuring the agent can only select valid actions (cards in hand) prevented illegal moves and accelerated learning.
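
A minimal sketch of what such masking can look like, assuming one logit per candidate card and a boolean mask marking which cards are actually in hand (how the mask is built depends on the environment and is not shown):

import torch

def select_masked_action(logits, valid_mask):
    """Sample an action only among legal moves (cards actually in hand)."""
    masked_logits = logits.masked_fill(~valid_mask, float('-inf'))  # illegal moves get -inf
    probs = torch.softmax(masked_logits, dim=-1)                    # ...and therefore probability 0
    return torch.multinomial(probs, num_samples=1)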

Looking Forward

This project demonstrates how traditional games can serve as excellent testbeds for advanced AI techniques. The combination of POMDP challenges, temporal dependencies, and strategic depth makes “40” an ideal environment for exploring:

  • Multi-agent learning: Training agents to play against each other
  • Transfer learning: Applying learned strategies to other card games
  • Explainable AI: Understanding what strategic concepts the agent has learned
  • Human-AI interaction: Creating engaging gameplay experiences

The success of this A2C-LSTM approach opens doors for applying similar techniques to other domains with partial observability and temporal dependencies - from financial trading to autonomous vehicle navigation.

Conclusion

Building an AI for Ecuador’s traditional game “40” proved to be more than just a fun coding project - it became a comprehensive exploration of modern reinforcement learning techniques. From the careful observation encoding to the sophisticated reward design, every component had to work in harmony to create an agent capable of strategic play.

The 70% win rate isn’t just a number - it represents thousands of learned decisions, strategic patterns, and emergent behaviors that mirror human-like game understanding. As AI continues to tackle increasingly complex challenges, projects like this remind us that sometimes the most profound innovations come from the simplest questions: “Can we teach a machine to play cards?”

The complete source code and trained models are available for those interested in exploring the intersection of traditional games and modern AI techniques.