The Background
Ecuador, a culturally rich country in Latin America, is home to Cuarenta - a traditional card game named after the Spanish word for 40. Though its origins are unclear, possibly rooted in colonial games, it remains popular, especially in Quito, where it is played during the annual Quito Festivities commemorating the city’s founding.
The story of Agent Dory began a few months ago, while sharing the game with friends in Canada. They had never heard of Cuarenta, but quickly took to it - enjoying the gameplay and devising strategies as they learned.
This experience made me rediscover the game with a different perspective. While games like poker and blackjack have been rigorously studied, regional games like Cuarenta often go unexplored.
It made me wonder - despite its long history, there may still be hidden patterns, strategies, or mechanics that have not been formalized. Perhaps certain elements of the game are so intuitive or culturally embedded that they have never been formally analyzed.
While the game is deeply interwoven with Ecuadorian identity - complete with colloquial sayings and humor - there’s untapped potential in exploring its technical aspects. It might even offer new insights for reinforcement learning agents and human intuition.
The Environment
The first step in building a proof-of-concept for Cuarenta was creating an environment - essentially a digital space where agents can live, learn, and play. Designed from scratch, this environment handles the core logic of the game: enforcing rules, managing points, and simulating mechanics.
At its core, the environment is made up of a `deck` and a `table`. The `table` is the central hub - it holds the players, keeps track of the cards in play, and manages key metrics throughout each round.
What makes the environment especially important is how it acts as both manager and interface. It processes game states and actions, serving as the translator between players and the game itself. For reinforcement learning agents, this kind of structure is essential.
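To make that role concrete, here is a minimal sketch of how such a table could expose the game to its players; the class and method names (`Table`, `step`, `observe`) are illustrative assumptions, not the project's actual API.

```python
import random

# Minimal sketch of the environment's manager/interface role.
# Class and method names are illustrative assumptions, not the project's actual API.
class Table:
    # A-7 keep their face values; J, Q, K are stored as 8, 9, 10
    # (the real 8, 9, 10 are removed from the deck).
    RANKS = list(range(1, 11))

    def __init__(self, num_players=2):
        self.deck = [r for r in self.RANKS for _ in range(4)]  # 40 playable cards
        random.shuffle(self.deck)
        self.hands = [[self.deck.pop() for _ in range(5)] for _ in range(num_players)]
        self.cards_in_play = []
        self.scores = [0, 0]  # one entry per team

    def step(self, player_idx, card_value):
        """Apply a move: enforce the rules, update points, and return the new state."""
        self.hands[player_idx].remove(card_value)
        self.cards_in_play.append(card_value)
        # ... capture logic (pairs, sequences, sums) and scoring would go here ...
        return self.observe(player_idx)

    def observe(self, player_idx):
        """Snapshot of the table from the current player's perspective."""
        return {
            "hand": list(self.hands[player_idx]),
            "cards_in_play": list(self.cards_in_play),
            "scores": list(self.scores),
            "deck_remaining": len(self.deck),
        }
```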
The Player
With the environment in place, the next step was testing it with a `player`. The first player type was a simple one: a human player. This version interacts with the environment to read the cards in hand and the scores, and lets a human choose which card to play.
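As a minimal sketch, a console-driven human player could look like the following; the class name, the `receive`/`move` interface, and the observation dictionary are illustrative assumptions that mirror the environment sketch above.

```python
# Minimal sketch of a console-driven human player; the class and its interface
# are illustrative assumptions that mirror the Table sketch above.
class HumanPlayer:
    def __init__(self):
        self._hand = []

    def receive(self, cards):
        self._hand = list(cards)

    def move(self, obs) -> int:
        """Show the current state and let the human pick a card value from the hand."""
        print(f"Table: {obs['cards_in_play']}  Scores: {obs['scores']}")
        print(f"Your hand: {self._hand}")
        card_value = int(input("Card value to play: "))
        while card_value not in self._hand:
            card_value = int(input("Not in hand, try again: "))
        self._hand.remove(card_value)
        return card_value
```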
The Encoder
Environment Observations
With the environment in place, the `player` is able to interact with it through the interface methods in `table`. However, an AI agent requires a structured understanding of the environment.
Observations are snapshots of the table that capture all the data from the perspective of the current player.
NOTE
These observations are taken after the agent makes a play.
The observations are stored in a `buffer`. The data and structure of an observation are as follows:
The `ObservationEncoder` is responsible for transforming the observations for easier manipulation and normalization.
All card-related observations are one-hot encoded by rank: an array of size 10 (A to 7, J, Q, K) with values from 0 to 1. Each card of a given rank adds 0.25 to its slot - for example, if the hand contains an A, 0.25 is added to the first element of the array.
NOTE
This encoding discards suit information. That is intentional, since only the rank matters in this game.
This is done for the current player’s hand, the cards on the table, and the seen cards.
Moreover, both teams' scores are normalized over 40 - the cartón count as well. Finally, the deck ratio is encoded along with the last card played.
NOTE
The last card is not one-hot encoded, for space efficiency. It is stored as a single float, normalized over the 10 possible ranks.
The final structure looks like:
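As a rough sketch of that structure, assembled from the description above (field order, helper names, and the exact layout are assumptions, not the project's exact code):

```python
import numpy as np

# Rough sketch of the observation encoding described above.
# Field order and helper names are assumptions, not the project's exact layout.
NUM_RANKS = 10  # A-7 as values 1-7; J, Q, K as 8, 9, 10

def encode_cards(cards):
    """Each card of a rank adds 0.25 to that rank's slot (four copies saturate at 1.0)."""
    enc = np.zeros(NUM_RANKS, dtype=np.float32)
    for card_value in cards:
        enc[card_value - 1] += 0.25  # an A (value 1) lands in index 0
    return enc

def encode_observation(hand, table_cards, seen_cards,
                       my_score, opp_score, my_carton, opp_carton,
                       deck_remaining, last_card):
    return np.concatenate([
        encode_cards(hand),                      # 10 values
        encode_cards(table_cards),               # 10 values
        encode_cards(seen_cards),                # 10 values
        [my_score / 40.0, opp_score / 40.0],     # team scores normalized over 40
        [my_carton / 40.0, opp_carton / 40.0],   # cartón counts, also over 40
        [deck_remaining / 40.0],                 # deck ratio
        [0.0 if last_card is None else last_card / NUM_RANKS],  # last card as one float
    ]).astype(np.float32)                        # 36 values in total
```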
The Agent
After the environment was tested, the first agent, `rl_player`, was built on top of the human player. Since the interface and most methods were already defined, the only major difference was decision-making - which, in this case, had to be learned.
Designing the Architecture
The Learning Algorithm
There are many ways to train an agent—some are policy-based, others value-based. Each approach has its strengths. Actor-Critic methods offer a hybrid, leveraging the best of both worlds.
This agent uses Advantage Actor-Critic (A2C), which has two key components:
- Actor – Decides which card to play based on the current policy
- Critic – Evaluates how good the current state is (i.e., how promising the situation looks)
Both heads have the following network structure:
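As a minimal sketch, each head could look something like this; the hidden sizes and depth are assumptions, not the project's actual architecture.

```python
import torch.nn as nn

# Minimal sketch of the two heads; hidden sizes and depth are assumptions.
hidden_dim = 128  # size of the shared representation feeding both heads
action_dim = 10   # one logit per rank (A-7, J, Q, K)

actor_head = nn.Sequential(
    nn.Linear(hidden_dim, 64),
    nn.ReLU(),
    nn.Linear(64, action_dim),  # unnormalized log-probabilities over card ranks
)

critic_head = nn.Sequential(
    nn.Linear(hidden_dim, 64),
    nn.ReLU(),
    nn.Linear(64, 1),           # scalar state-value estimate
)
```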
The Actor maximizes the expected returns by adjusting a policy:
$$L_{\text{actor}} = -\log \pi_\theta(a_t \mid s_t) \cdot A_t$$
The advantage is $A_t = R_t - CV_t$, where $R_t$ is the cumulative reward (return) obtained after taking an action - selecting a card - and $CV_t$ is the critic's value estimate.
NOTE
These values are cumulative and discounted, meaning that rewards closer to the present are weighted more heavily: $R_t = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \dots$
The discount factor $\gamma$ is set in `rl_player` as part of its extension of `player`.
And the Critic minimizes the MSE between the predicted value $CV_t$ and the return $R_N$:
$$L_{\text{critic}} = \tfrac{1}{2}\,\mathbb{E}\!\left[(R_N - CV_t)^2\right]$$
At the end, both losses compose the A2C loss function:
$$L_{A2C} = L_{\text{actor}} + \beta \cdot L_{\text{critic}}$$
NOTE
Empirically, $\beta = 0.5$ has given the best results.
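To make the update concrete, here is a sketch of how these losses could be computed from one episode's worth of transitions; the tensor names and the return computation are illustrative assumptions, not the project's exact code.

```python
import torch
import torch.nn.functional as F

# Sketch of the A2C loss over one episode; tensor names are illustrative.
def a2c_loss(log_probs, values, rewards, gamma=0.99, beta=0.5):
    """log_probs: log pi(a_t|s_t) of the taken actions, values: critic outputs CV_t,
    rewards: per-step rewards r_t; all three shaped (T,)."""
    # Discounted returns R_t = r_t + gamma * r_{t+1} + gamma^2 * r_{t+2} + ...
    returns = torch.zeros_like(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running

    advantages = returns - values.detach()           # A_t = R_t - CV_t
    actor_loss = -(log_probs * advantages).mean()    # L_actor
    critic_loss = 0.5 * F.mse_loss(values, returns)  # L_critic
    return actor_loss + beta * critic_loss           # L_A2C
```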
The Memory Twist
This is where things get interesting.
Some games are fully observable—everyone sees everything. But Cuarenta isn’t like that. You only see your five cards, and by the time all the cards are revealed, the decisions that matter have already been made.
In other words, Cuarenta is a partially observable game. That changes everything.
To make good decisions, an agent has to remember key information—like which cards have already been played (there are only four of each rank). This makes memory a central requirement in the agent’s design.
It might be tempting to go with a Transformer model—after all, they’re the backbone of large language models and are great at assigning attention. But Transformers tend to struggle in Partially Observable Markov Decision Processes (POMDPs), in particular with modelling regular languages [1].
Research suggests that LRU memory models can work well in these cases [1,2]. But since support for LRU is still limited in standard libraries such as PyTorch, I opted for something more practical: an LSTM.
The twist? An LSTM layer sits between the game state and both heads. This lets the agent “think” over time—learning patterns across rounds, recalling played cards, and planning for long-term consequences.
Before feeding the information to the LSTM, the observations are passed through a Feature Network and a Strategy Network.
The forward data flow is as follows:
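Below is a minimal sketch of that flow, assuming the layer names mentioned above and the 36-dimensional observation encoding sketched earlier; the sizes are illustrative, not the project's actual architecture.

```python
import torch.nn as nn

# Minimal sketch of the forward flow: observation -> feature net -> strategy net
# -> LSTM -> actor/critic heads. Sizes are assumptions (obs_dim follows the
# 36-dimensional encoding sketched earlier).
class ActorCriticLSTM(nn.Module):
    def __init__(self, obs_dim=36, hidden_dim=128, action_dim=10):
        super().__init__()
        self.action_dim = action_dim
        self.feature_net = nn.Sequential(nn.Linear(obs_dim, hidden_dim), nn.ReLU())
        self.strategy_net = nn.Sequential(nn.Linear(hidden_dim, hidden_dim), nn.ReLU())
        self.lstm = nn.LSTM(hidden_dim, hidden_dim, batch_first=True)
        self.actor_head = nn.Sequential(
            nn.Linear(hidden_dim, 64), nn.ReLU(), nn.Linear(64, action_dim))
        self.critic_head = nn.Sequential(
            nn.Linear(hidden_dim, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, obs, hidden=None):
        # obs: (batch, seq_len, obs_dim); hidden carries memory across moves
        x = self.strategy_net(self.feature_net(obs))
        x, hidden = self.lstm(x, hidden)
        last = x[:, -1]  # representation of the most recent step
        return self.actor_head(last), self.critic_head(last), hidden
```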
The Move Data Flow
The logits from the actor represent unnormalized log probabilities of selecting each card. Since each player holds at most 5 cards at a time, the invalid possibilities are masked to -inf.
IMPORTANT
When training, a categorical distribution is built from the masked logits and a card is sampled from it. Otherwise, the highest-probability card (the argmax) is chosen.
In either case, the result is the value of the card to play, which is then removed from the current hand.
```python
def move(self, obs: np.ndarray, training: bool = True) -> int:
    # Shape the observation as (batch=1, seq_len=1, obs_dim) for the LSTM network
    obs_tensor = torch.FloatTensor(obs).unsqueeze(0).unsqueeze(0)

    with torch.no_grad():
        action_logits, value, self.current_hidden = self.network(
            obs_tensor, self.current_hidden)

    action_dim = self.network.action_dim  # should be 10
    hand_set = set(self._hand)

    # Mask out ranks that are not currently in hand
    valid_mask = torch.tensor(
        [(v + 1) in hand_set for v in range(action_dim)], dtype=torch.bool)
    masked_logits = action_logits.clone()
    masked_logits[0, ~valid_mask] = float('-inf')

    if self.training:
        # Sample from the masked categorical distribution while training
        dist = torch.distributions.Categorical(logits=masked_logits)
        card_value_idx = dist.sample().item()
    else:
        # Greedy play: pick the highest-probability valid card
        card_value_idx = torch.argmax(masked_logits, dim=-1).item()

    card_value = card_value_idx + 1
    self._hand.remove(card_value)
    return card_value
```
NOTE
`remove` drops the first occurrence of that value in the hand, so the suit is never explicitly chosen (it does not matter in this game).
The additional 1 accounts for A being rank 1 while the array index starts at 0.
With all these elements in place, the data flow for every move, for every player, is the following:
The Trainer
The model is trained by playing a loop of games against different models, with a `random` player as the base model.
After every match, the final score - which takes into account all of the player's transitions - and the outcome are stored as an `episode_results` entry.
IMPORTANT
Every 50 games, the network is updated by unloading the buffer containing the set of transitions and outcomes.
The network is updated following the equations described above:
The data flow looks like:
NOTE
The trainer can be extended to load any PyTorch model and have different versions of the RL agent play against each other or against a random player.
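As a rough sketch, the loop described above could look like the following; the `play_one_game` helper and the episode dictionary keys are illustrative assumptions, the 50-game cadence follows the note above, and `a2c_loss` is the sketch from the learning-algorithm section.

```python
import torch

# Rough sketch of the training loop; play_one_game and the episode keys are
# illustrative assumptions, and a2c_loss is the sketch from the section above.
def train(agent, opponent, make_table, num_games=5000, update_every=50, lr=1e-3):
    optimizer = torch.optim.Adam(agent.network.parameters(), lr=lr)
    buffer = []  # transitions and outcomes accumulated across games

    for game in range(1, num_games + 1):
        table = make_table()
        episode_results = play_one_game(table, agent, opponent)  # hypothetical helper
        buffer.append(episode_results)

        if game % update_every == 0:
            # Unload the buffer and apply the A2C update described by the equations
            loss = sum(a2c_loss(ep["log_probs"], ep["values"], ep["rewards"])
                       for ep in buffer) / len(buffer)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            buffer.clear()
```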
The Results
Training reinforcement learning agents in a game like Cuarenta is computationally intensive. While LSTMs help the agent retain information across a round, issues like vanishing gradients still require attention. Despite this, the current model shows promising performance, achieving an average win rate of around 70% against most agents - though, curiously, the win rate is slightly lower against a random player. This suggests that the unpredictability of random actions may occasionally confuse the learned strategy.
Empirical analysis reveals that the agent picks up useful patterns over time, but it sometimes misses more complex plays—particularly those requiring short-term sacrifice for long-term gain. This points to a need for a more nuanced reward system that values long-term planning.
While the current model is just a first step, it highlights the exciting potential of applying reinforcement learning to partially observable environments. Future improvements might include:
- Experimenting with LRU architectures or Transformers to explore their effectiveness in this domain
- Designing more robust reward mechanisms
- Enhancing computational efficiency
- Introducing LLM support
Even with room for growth, the current agent already offers a compelling challenge for the average human player—strategic, engaging, and surprisingly fun.
The Fun
To support the human aspect of the game, a GUI was created using PyGame. It allows users to drag and drop cards onto the table, see both scoring systems (points and cartón), and get a clearer view of the rounds. The rules of the game are described below. I would highly encourage you to give the game a try - and, ideally, grab a deck of cards and enjoy a traditional game.

References
[1] A. Gupta, A. Jomaa, F. Locatello, and G. Martius, “Structured World Models for Planning and Exploration,” arXiv preprint arXiv:2405.17358, May 2024. [Online]. Available: https://arxiv.org/pdf/2405.17358
[2] R. Orvieto, A. Kirsch, A. Gupta, S. Lacoste-Julien, A. Krause, and S. W. Linderman, “Resurrecting the sigmoid in deep learning through dynamical isometry: theory and practice,” in Proceedings of the 40th International Conference on Machine Learning (ICML), vol. 202, 2023, pp. 27054–27086. [Online]. Available: https://proceedings.mlr.press/v202/orvieto23a/orvieto23a.pdf
[3] L. Tunstall, “Understanding deep reinforcement learning with A2C,” Hugging Face Blog, Jul. 2022. [Online]. Available: https://huggingface.co/blog/deep-rl-a2c
The Rules
Overview
Cuarenta is played between two teams, each with one or two players. The goal is simple: reach 40 points before the opposing team.
A standard 52-card French deck is used, but cards with rank 8, 9, and 10 are removed. These are instead used to keep score during the game.
Teammates sit opposite each other so that turns alternate between teams. The deck is made up of 40 playable cards, with card values determined by their rank (Aces are 1; J, Q, K have no numeric value).
Dealing and Rounds
Each player receives 5 cards per round, dealt clockwise from the player to the left of the dealer. Once all cards are played, a new 5-card hand is dealt. This continues until all 40 cards are used.
At the end of each full deck cycle, scores are tallied, cards are reshuffled, and the dealer rotates. The cycle repeats until a team reaches 40 points.
Scoring Mechanics
There are multiple ways to earn points during the game:
Caída
Occurs when a player places a card, and the next player plays the same rank immediately after. This awards 2 points to the team making the second move.
Cartón
Collected cards (via various mechanics) are stored in a pile called the cartón. Cartón points are awarded at the end of each 40-card cycle based on how many cards a team holds.
- First 19 cards: No points
- Starting from the 20th card: score begins at 6
- Each additional card increases the score by 1
- The total is rounded up to the next even number
Example: 23 cards → counting starts at 6 → ends at 9 → rounded up to 10 points
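For reference, the cartón tally above can be written as a small function; the name `carton_points` is illustrative.

```python
def carton_points(num_cards: int) -> int:
    """Cartón tally as described above; the function name is illustrative."""
    if num_cards < 20:
        return 0
    score = 6 + (num_cards - 20)  # count starts at 6 on the 20th card, +1 per extra card
    return score + (score % 2)    # round up to the next even number

# carton_points(23) -> 10
```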
Limpia
If a play results in clearing all cards from the table, it’s called a limpia and grants 2 points. If the limpia also happens through a caída, it gives 4 points total (2 for caída, 2 for limpia).
Ways to Capture Cards
Pairs
If a card on the table matches the rank of the card played, it can be collected. If this happens directly after another player places the same rank, it also counts as a caída.
Sequences
If there are consecutive cards on the table (e.g., 2, 3, 4) and a player plays a card that starts the sequence (e.g., 2), the player collects the full sequence starting from that rank. Sequence order: A, 2, 3, 4, 5, 6, 7, J, Q, K
Sums
If the sum of two or more cards on the table equals the card being played, they can all be collected.
NOTE
Example:
Cards on table: 2 and 5 → Player plays 7 → Collects all three. J, Q, K do not participate in sums (they have no numeric value).
Chained Mechanics
Sum and sequence mechanics can be combined. Example: table: 3, 2, 6, 7 → player plays 5 → 3 + 2 = 5 (valid sum) → also starts the sequence 5, 6, 7 → collects all 6 cards.
Special Rules
Ronda
If a player is dealt three cards of the same rank, their team earns 2 points. The rank is hidden unless it’s revealed by a caída, in which case the opposing team earns 10 points and the round restarts.
Mesa
If a player is dealt four cards of the same rank, their team wins the game instantly.
30-Point Rule
Once a team reaches 30 points, cartón points no longer count. They must reach 40 through caída, ronda, or limpia.
38-Point Rule
At 38 points, only caída can earn the remaining 2 points needed to win.
Keeping Score with Real Cards
The removed 8s, 9s, and 10s are used to keep score physically:
- A card face-up = 2 points
- A card face-down = 10 points
For instance, five face-up cards = 10 points → Trade for one face-down card. This system provides a visual and practical way to track points during play.