Q-Learning for a Simple Grid World
Reinforcement learning allows an agent to learn optimal behavior in an environment through trial and error. This challenge focuses on implementing the Q-learning algorithm to train an agent to navigate a simple grid world to reach a goal state. Successfully completing this challenge demonstrates a foundational understanding of reinforcement learning principles.
Problem Description
You are tasked with creating a Q-learning agent that learns to navigate a 4x4 grid world. The grid world has the following properties:
- States: Each cell in the grid represents a state (numbered 0-15).
- Actions: The agent can take four actions: Up, Down, Left, and Right.
- Rewards:
  - Reaching the goal state (state 15) yields a reward of +1.
  - All other transitions (moving to a new state) yield a reward of -0.1.
  - Hitting a wall (attempting to move off the grid) results in the agent staying in the same state and receiving a reward of -0.1.
- Goal State: State 15 is the goal state.
- Learning: The agent should learn the optimal Q-values for each state-action pair through repeated interactions with the environment.
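The transition and reward rules above can be sketched as a minimal environment class. Note that the row-major state numbering (`state = row * size + col`), a fixed start in state 0, and the `(next_state, reward, done)` return convention of `step` are illustrative assumptions, not part of the specification:

```python
class GridWorld:
    """4x4 grid world; states 0-15 numbered row-major, state 15 is the goal."""

    def __init__(self, size=4):
        self.size = size
        self.goal = size * size - 1
        self.state = 0  # assumed start state (the spec does not fix one)

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        """Apply action (0=Up, 1=Down, 2=Left, 3=Right); return (next_state, reward, done)."""
        row, col = divmod(self.state, self.size)
        if action == 0:
            row = max(row - 1, 0)               # moving off the top edge: stay put
        elif action == 1:
            row = min(row + 1, self.size - 1)
        elif action == 2:
            col = max(col - 1, 0)
        elif action == 3:
            col = min(col + 1, self.size - 1)
        self.state = row * self.size + col
        if self.state == self.goal:
            return self.state, 1.0, True        # reaching the goal yields +1
        return self.state, -0.1, False          # every other step (walls included) costs -0.1
```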
You need to implement the following:
- Environment: A class representing the grid world environment. This class should handle state transitions based on actions and provide rewards.
- Q-Learning Agent: A class that implements the Q-learning algorithm. This class should:
  - Initialize a Q-table (a dictionary or NumPy array) to store Q-values.
  - Implement a `choose_action` method that selects an action based on an exploration-exploitation strategy (e.g., epsilon-greedy).
  - Implement a `learn` method that updates the Q-table based on the Q-learning update rule.
- Training Loop: A function that trains the agent for a specified number of episodes.
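One possible shape for the agent and training loop is sketched below. It is not the only valid design: the NumPy Q-table, the `max_steps` episode cap, and the assumed environment interface of `reset()` and `step(action) -> (next_state, reward, done)` are all choices made here for illustration:

```python
import random
import numpy as np

class QLearningAgent:
    def __init__(self, n_states, n_actions=4, epsilon=0.1,
                 learning_rate=0.1, discount_factor=0.9):
        self.q_table = np.zeros((n_states, n_actions))  # zero-initialized per the constraints
        self.epsilon = epsilon
        self.alpha = learning_rate
        self.gamma = discount_factor

    def choose_action(self, state):
        """Epsilon-greedy: explore with probability epsilon, otherwise exploit."""
        if random.random() < self.epsilon:
            return random.randrange(self.q_table.shape[1])
        return int(np.argmax(self.q_table[state]))

    def learn(self, state, action, reward, next_state):
        """Q(s,a) += alpha * (reward + gamma * max_a' Q(s',a') - Q(s,a))."""
        target = reward + self.gamma * np.max(self.q_table[next_state])
        self.q_table[state, action] += self.alpha * (target - self.q_table[state, action])

def train(env, agent, num_episodes=1000, max_steps=100):
    """Run episodes until done or max_steps, updating the agent after each step."""
    for _ in range(num_episodes):
        state = env.reset()
        for _ in range(max_steps):
            action = agent.choose_action(state)
            next_state, reward, done = env.step(action)
            agent.learn(state, action, reward, next_state)
            state = next_state
            if done:
                break
    return agent.q_table
```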
Examples
Example 1:
Input:
environment_size = 4
num_episodes = 1000
epsilon = 0.1
learning_rate = 0.1
discount_factor = 0.9
Output:
A trained Q-table (dictionary or NumPy array) representing the learned Q-values for each state-action pair. The Q-values should reflect the optimal policy for navigating the grid world.
Explanation:
After training, the Q-table should contain values that indicate the expected cumulative reward for taking each action in each state. The agent should have learned to prioritize actions that lead to the goal state (15) while minimizing the negative rewards associated with movement.
Example 2:
Input:
environment_size = 4
num_episodes = 1000
epsilon = 0.01
learning_rate = 0.1
discount_factor = 0.9
Output:
A trained Q-table (dictionary or NumPy array).
Explanation:
A lower epsilon value encourages the agent to consistently choose the action with the highest current Q-value, leading to more deterministic behavior and potentially faster learning if the initial Q-values are reasonably close to optimal.
Constraints
- Grid Size: The grid world will always be a square grid (e.g., 4x4). The `environment_size` parameter determines the grid dimensions.
- Number of Episodes: The training loop should run for a specified number of episodes.
- Epsilon: The exploration rate (epsilon) should be between 0 and 1.
- Learning Rate: The learning rate (alpha) should be between 0 and 1.
- Discount Factor: The discount factor (gamma) should be between 0 and 1.
- Q-Table Initialization: Initialize the Q-table with zeros.
- Action Selection: Implement an epsilon-greedy action selection strategy.
- Q-Learning Update Rule: Use the standard Q-learning update rule:
Q(s, a) = Q(s, a) + alpha * [reward + gamma * max_a' Q(s', a') - Q(s, a)]
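As a concrete check of the rule, suppose alpha = 0.1 and gamma = 0.9 (the values from Example 1), the current estimate Q(s, a) is 0, the move earns the -0.1 step reward, and the best Q-value in the next state is assumed to be 0.5:

```python
alpha, gamma = 0.1, 0.9       # learning rate and discount factor from Example 1
q_sa = 0.0                    # current estimate Q(s, a)
reward = -0.1                 # step cost for a non-goal move
max_q_next = 0.5              # assumed best next-state value, max_a' Q(s', a')

td_target = reward + gamma * max_q_next    # -0.1 + 0.9 * 0.5 = 0.35
q_sa = q_sa + alpha * (td_target - q_sa)   # 0 + 0.1 * (0.35 - 0) ~= 0.035
```

The estimate moves only a fraction alpha of the way toward the target, which is what keeps learning stable under noisy rewards.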
Notes
- Consider using a dictionary to represent the Q-table, where keys are tuples `(state, action)` and values are Q-values. Alternatively, a NumPy array can be used, but you'll need to map state and action indices appropriately.
- The actions can be represented as integers: 0 (Up), 1 (Down), 2 (Left), 3 (Right).
- The environment should handle boundary conditions (walls) gracefully, preventing the agent from moving off the grid.
- Experiment with different values for epsilon, learning rate, and discount factor to observe their impact on learning performance.
- Focus on implementing the core Q-learning algorithm correctly. Visualization or advanced exploration strategies are not required for this challenge.
- The goal is to learn a policy that maximizes the cumulative reward over time.
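Once training finishes, a quick way to inspect the learned policy is to take the greedy (argmax) action in every state. The sketch below assumes a 16x4 NumPy Q-table and the integer action encoding from the Notes; the arrow rendering is purely a debugging aid:

```python
import numpy as np

def greedy_policy(q_table):
    """Map each state to its highest-valued action (0=Up, 1=Down, 2=Left, 3=Right)."""
    return np.argmax(q_table, axis=1)

def render_policy(policy, size=4):
    """Render the policy as a grid of arrows, with 'G' marking the goal cell."""
    arrows = np.array(["^", "v", "<", ">"])
    grid = arrows[policy].reshape(size, size)
    grid[size - 1, size - 1] = "G"
    return "\n".join(" ".join(row) for row in grid)
```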