Implement Q-Learning for a Simple Grid World

This challenge involves implementing the Q-learning algorithm from scratch to solve a basic Grid World environment. Q-learning is a fundamental reinforcement learning algorithm that learns an optimal policy by estimating the quality of taking a certain action in a given state. This is a crucial skill for anyone looking to delve into the field of reinforcement learning and artificial intelligence.

Problem Description

Your task is to implement a Q-learning agent that can navigate a predefined Grid World. The agent starts at a specific location and aims to reach a target destination. The Grid World contains obstacles that the agent cannot enter. The agent learns by trial and error, receiving rewards for reaching the goal and penalties for hitting obstacles or taking too many steps.

Key Requirements:

  1. Q-Table: Implement a Q-table (a dictionary or a NumPy array) to store the Q-values for each state-action pair. The Q-table should be initialized with zeros.
  2. State Representation: States should be represented by tuples (row, column) corresponding to the agent's position in the grid.
  3. Action Representation: Actions should be represented by integers or strings (e.g., 0: Up, 1: Down, 2: Left, 3: Right).
  4. Q-Learning Update Rule: Implement the Q-learning update rule: Q(s, a) = Q(s, a) + alpha * (reward + gamma * max(Q(s', a')) - Q(s, a)) where:
    • s: current state
    • a: action taken
    • s': next state
    • reward: immediate reward received
    • alpha: learning rate (0 < alpha <= 1)
    • gamma: discount factor (0 <= gamma < 1)
    • max(Q(s', a')): the maximum Q-value for the next state s' across all possible actions a'.
  5. Exploration vs. Exploitation: Implement an epsilon-greedy policy for action selection. With probability epsilon, the agent chooses a random action (exploration); otherwise, it chooses the action with the highest Q-value for the current state (exploitation). Epsilon should decay over time.
  6. Environment Interaction: The agent needs to be able to interact with a Grid World environment. You will need to define the grid, start/goal positions, obstacles, and reward structure.
  7. Training Loop: Implement a training loop that allows the agent to interact with the environment for a specified number of episodes (see the sketch after this list).
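
The sketch below shows one way these requirements can fit together. It is a minimal illustration, not a required structure: the environment interface (reset() returning the start state, step(action) returning (next_state, reward, done), and an actions list), the function name train_q_learning, and all hyperparameter values are assumptions made for the example.

```python
import random
from collections import defaultdict

# Assumed (hypothetical) environment interface:
#   env.reset()      -> start state as a (row, col) tuple
#   env.step(action) -> (next_state, reward, done)
#   env.actions      -> list of action ids, e.g. [0, 1, 2, 3]

def train_q_learning(env, episodes=5000, alpha=0.1, gamma=0.9,
                     epsilon=1.0, epsilon_min=0.05, epsilon_decay=0.999,
                     max_steps=100):
    # Requirement 1: Q-table initialized to zero for every (state, action) pair.
    q_table = defaultdict(float)  # keys are ((row, col), action) tuples

    def choose_action(state, eps):
        # Requirement 5: epsilon-greedy action selection.
        if random.random() < eps:
            return random.choice(env.actions)                        # explore
        return max(env.actions, key=lambda a: q_table[(state, a)])   # exploit

    # Requirement 7: training loop over a specified number of episodes.
    for _ in range(episodes):
        state = env.reset()
        for _ in range(max_steps):
            action = choose_action(state, epsilon)
            next_state, reward, done = env.step(action)

            # Requirement 4: Q-learning update rule.
            best_next = max(q_table[(next_state, a)] for a in env.actions)
            q_table[(state, action)] += alpha * (
                reward + gamma * best_next - q_table[(state, action)])

            state = next_state
            if done:
                break

        # Epsilon decays over time (exponential decay used here).
        epsilon = max(epsilon_min, epsilon * epsilon_decay)

    return q_table
```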

Expected Behavior:

After training, the agent should be able to find an optimal or near-optimal path from the start to the goal in the Grid World, avoiding obstacles. The learned Q-values should reflect the expected future rewards for each state-action pair.

Edge Cases:

  • Stuck in a loop: The exploration strategy should help the agent escape local optima or cycles.
  • Unreachable goal: The agent should ideally learn to avoid states from which the goal cannot be reached, or at least not get stuck indefinitely.
  • Grid boundaries: The agent must not attempt to move outside the grid.

Examples

Let's define a simple 3x3 Grid World for demonstration.

Grid:
[['S', '.', 'O'],
 ['.', '.', '.'],
 ['O', '.', 'G']]

'S': Start
'G': Goal
'.': Empty cell
'O': Obstacle

Actions:
0: Up (-1, 0)
1: Down (+1, 0)
2: Left (0, -1)
3: Right (0, +1)

Rewards:
+10 for reaching 'G'
-1 for hitting an 'O'
-0.1 for any other step (to encourage shorter paths)
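
One possible encoding of this example environment is sketched below. The class name GridWorldEnv and its reset()/step() interface are assumptions chosen to match the training-loop sketch above; treating an off-grid move as staying in place with the ordinary step penalty is also an assumption, since the problem only requires that the agent never end up outside the grid.

```python
class GridWorldEnv:
    """The 3x3 example grid with the reward structure described above."""

    def __init__(self):
        self.grid = [['S', '.', 'O'],
                     ['.', '.', '.'],
                     ['O', '.', 'G']]
        self.start = (0, 0)
        self.goal = (2, 2)
        self.actions = [0, 1, 2, 3]  # Up, Down, Left, Right
        self.moves = {0: (-1, 0), 1: (1, 0), 2: (0, -1), 3: (0, 1)}
        self.state = self.start

    def reset(self):
        self.state = self.start
        return self.state

    def step(self, action):
        dr, dc = self.moves[action]
        row, col = self.state[0] + dr, self.state[1] + dc

        # Off-grid move: stay in place (one possible choice) with the step penalty.
        if not (0 <= row < len(self.grid) and 0 <= col < len(self.grid[0])):
            return self.state, -0.1, False

        # Obstacle: stay in place and receive the obstacle penalty.
        if self.grid[row][col] == 'O':
            return self.state, -1.0, False

        self.state = (row, col)
        if self.state == self.goal:
            return self.state, 10.0, True   # reached 'G'
        return self.state, -0.1, False      # ordinary step
```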

Example 1: Initial State and First Few Steps (Illustrative)

Assume epsilon is high (e.g., 1.0) and that alpha and gamma have been chosen.

  • Initial State: (0, 0) (representing 'S'). Q-table is all zeros.
  • Action Selection: With high epsilon, the agent might randomly choose Action 3 (Right).
  • Environment Response: The agent moves to (0, 1). Reward is -0.1.
  • Q-Learning Update: Q((0,0), 3) is updated based on the reward and the potential future reward from state (0,1).
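
To make this first update concrete, assume alpha = 0.1 and gamma = 0.9 (illustrative values only; the problem does not fix them). Because every Q-value is still zero, the update to Q((0,0), 3) works out to roughly -0.01:

```python
alpha, gamma = 0.1, 0.9   # illustrative values
reward = -0.1             # step penalty for moving to (0, 1)
max_next = 0.0            # all Q-values for state (0, 1) are still zero

q_old = 0.0               # Q((0, 0), 3) before the update
q_new = q_old + alpha * (reward + gamma * max_next - q_old)
print(q_new)              # approximately -0.01
```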

Example 2: Reaching the Goal

After many episodes, if the agent successfully reaches the goal:

  • State: (2, 2) (representing 'G').
  • Action: Let's say the agent arrived from (2, 1) by taking Action 3 (Right).
  • Reward: +10.
  • Q-Learning Update: The Q-value for the state-action pair that led to the goal (e.g., Q((2,1), 3)) will be significantly updated, reflecting the high reward. Over subsequent episodes, this value propagates backward to earlier states through the max(Q(s', a')) term.
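
Using the same illustrative alpha = 0.1 and gamma = 0.9, and assuming Q((2,1), 3) is still zero when the goal is first reached, the update seeds a clearly positive value that later updates can bootstrap from:

```python
alpha, gamma = 0.1, 0.9
reward = 10.0             # reward for reaching 'G'
max_next = 0.0            # (2, 2) is terminal, so its Q-values stay at zero

q_old = 0.0               # Q((2, 1), 3) before the first successful episode
q_new = q_old + alpha * (reward + gamma * max_next - q_old)
print(q_new)              # 1.0; later episodes pull this value toward 10
```

Once Q((2,1), 3) is positive, it feeds into the max(Q(s', a')) term of updates for states that lead to (2, 1), which is the backward propagation described above.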

Example 3: Encountering an Obstacle

If the agent attempts to move into an obstacle:

  • State: Let's say the agent is at (0, 1) and tries Action 3 (Right), which would move it into the obstacle at (0, 2).
  • Environment Response: The agent remains at (0, 1). Reward is -1.
  • Q-Learning Update: Q((0,1), 3) will be updated with a negative value, making this action less desirable in state (0,1).

Constraints

  • Grid Dimensions: The Grid World will be between 3x3 and 10x10 in size.
  • Number of Episodes: The training process should run for a reasonable number of episodes (e.g., 1000 to 10000), which will be a parameter you can set.
  • Learning Rate (alpha): 0.1 <= alpha <= 0.5
  • Discount Factor (gamma): 0.8 <= gamma < 1.0
  • Epsilon Decay: Epsilon should decay exponentially or linearly over the training episodes (see the sketch after this list).
  • Q-Table Initialization: All Q-values should be initialized to 0.
  • Python Version: Python 3.6+
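
Two ways to satisfy the epsilon-decay constraint are sketched below; the starting value, minimum value, and decay rates are illustrative assumptions, not required values.

```python
# Exponential decay: multiply epsilon by a factor slightly below 1 each episode.
def exponential_decay(epsilon, decay_rate=0.999, epsilon_min=0.05):
    return max(epsilon_min, epsilon * decay_rate)

# Linear decay: subtract a fixed amount each episode.
def linear_decay(epsilon, step=1.0 / 5000, epsilon_min=0.05):
    return max(epsilon_min, epsilon - step)
```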

Notes

  • You can use NumPy for array manipulation if desired, but it's not strictly required for the Q-table itself (a dictionary is also viable).
  • Consider how you will handle states and actions when implementing the Q-table. Dictionaries mapping (state, action) tuples to Q-values are a common and flexible approach.
  • The success of your implementation will be judged by the agent's ability to learn a path to the goal within a reasonable number of training episodes. You might want to implement a function to test the learned policy after training (see the sketch after this list).
  • Experiment with different values for alpha, gamma, and epsilon decay rates to observe their impact on learning.
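
One way to test the learned policy, as suggested above, is to roll out the greedy (epsilon = 0) policy from the start state. The helper below assumes the q_table and GridWorldEnv interface from the earlier sketches.

```python
def run_greedy_policy(env, q_table, max_steps=50):
    """Follow the greedy policy and return the sequence of visited states."""
    state = env.reset()
    path = [state]
    for _ in range(max_steps):
        # Pick the action with the highest learned Q-value (no exploration).
        action = max(env.actions, key=lambda a: q_table[(state, a)])
        state, reward, done = env.step(action)
        path.append(state)
        if done:
            break
    return path

# Example usage (hypothetical):
# env = GridWorldEnv()
# q_table = train_q_learning(env)
# print(run_greedy_policy(env, q_table))  # e.g. [(0, 0), (0, 1), (1, 1), (2, 1), (2, 2)]
```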