Implementing a Simple Policy Gradient Agent for a Discrete Action Space

This challenge asks you to implement a fundamental policy gradient algorithm in Python. You will create an agent that learns to make decisions in a simple environment by directly optimizing a parameterized policy. This is a cornerstone of reinforcement learning, and understanding it is crucial for building more complex agents.

Problem Description

Your task is to implement a basic policy gradient algorithm, specifically REINFORCE (also known as Monte Carlo Policy Gradient), for an agent interacting with a discrete action space environment. The agent will maintain a policy represented by a neural network that outputs probabilities for each possible action. The goal is to train this policy by sampling trajectories (sequences of states, actions, and rewards) and using the accumulated discounted rewards to update the policy's parameters.
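
For reference, the standard quantities described above can be written down directly. With discount factor $\gamma$, episode length $T$, reward $r_{t+1}$ received after taking action $a_t$ in state $s_t$, and policy $\pi_\theta$:

$$G_t = \sum_{k=t}^{T-1} \gamma^{\,k-t}\, r_{k+1}, \qquad \nabla_\theta J(\theta) \approx \sum_{t=0}^{T-1} G_t\, \nabla_\theta \log \pi_\theta(a_t \mid s_t)$$

In practice the update is implemented by minimizing the surrogate loss $L(\theta) = -\sum_{t} G_t \log \pi_\theta(a_t \mid s_t)$ with a stochastic optimizer such as Adam.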

Key Requirements:

  • Policy Network: Implement a neural network (e.g., using PyTorch or TensorFlow/Keras) that takes the current state as input and outputs a probability distribution over the available actions. For simplicity, assume a fully connected network (a minimal PyTorch sketch is given after this list).
  • Action Selection: The agent should sample actions from the probability distribution output by the policy network.
  • REINFORCE Algorithm: Implement the core REINFORCE update rule:
    • Sample a trajectory of (state, action, reward) tuples.
    • Calculate the discounted return (the discounted sum of future rewards) for each time step.
    • Compute the policy gradient loss, which encourages actions that lead to higher returns and discourages those that lead to lower returns.
    • Update the policy network's weights using an optimizer (e.g., Adam).
  • Environment Interaction: The agent needs to interact with a predefined, simple environment that provides states, accepts actions, and returns rewards and a done signal.
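
A minimal sketch of the policy-network and action-selection pieces in PyTorch; the class name, layer sizes, and use of torch.distributions.Categorical are illustrative choices rather than requirements:

```python
import torch
import torch.nn as nn
from torch.distributions import Categorical


class PolicyNetwork(nn.Module):
    """Maps a state vector to a probability distribution over discrete actions."""

    def __init__(self, state_dim: int, num_actions: int, hidden_dim: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, num_actions),
            nn.Softmax(dim=-1),  # action probabilities
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return self.net(state)


def select_action(policy: PolicyNetwork, state: torch.Tensor):
    """Sample an action and keep its log-probability for the REINFORCE update."""
    probs = policy(state)
    dist = Categorical(probs)
    action = dist.sample()
    return action.item(), dist.log_prob(action)


# Quick smoke test with a made-up 2-dimensional state and 4 actions.
policy = PolicyNetwork(state_dim=2, num_actions=4)
state = torch.tensor([0.0, 0.0])
action, log_prob = select_action(policy, state)
print(action, log_prob)
```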

Expected Behavior:

Over multiple training episodes, the agent should learn to select actions that maximize its cumulative reward. For a simple environment, this means the agent's performance (measured by average episode reward) should improve.
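
For concreteness, here is one way the outer training loop could look. It uses gymnasium's CartPole-v1 purely as a stand-in discrete-action environment (any environment with a reset/step interface, such as the SimpleGridWorld described below, would plug in the same way); the network size, learning rate, and episode count are illustrative:

```python
import torch
import torch.nn as nn
from torch.distributions import Categorical
import gymnasium as gym  # assumption: gymnasium is installed; CartPole-v1 is only a stand-in env

gamma = 0.99
env = gym.make("CartPole-v1")
state_dim = env.observation_space.shape[0]
num_actions = env.action_space.n

# Compact policy; raw logits are passed to Categorical, which applies the softmax internally.
policy = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, num_actions))
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

reward_history = []
for episode in range(500):
    state, _ = env.reset()
    log_probs, rewards = [], []
    done = False
    while not done:
        dist = Categorical(logits=policy(torch.as_tensor(state, dtype=torch.float32)))
        action = dist.sample()
        log_probs.append(dist.log_prob(action))
        state, reward, terminated, truncated, _ = env.step(action.item())
        rewards.append(float(reward))
        done = terminated or truncated

    # Discounted returns, computed backwards over the episode, then normalized.
    returns, G = [], 0.0
    for r in reversed(rewards):
        G = r + gamma * G
        returns.append(G)
    returns = torch.tensor(list(reversed(returns)))
    returns = (returns - returns.mean()) / (returns.std() + 1e-8)

    # REINFORCE loss: increase log-probabilities of actions that led to high returns.
    loss = -(torch.stack(log_probs) * returns).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    reward_history.append(sum(rewards))
    if (episode + 1) % 50 == 0:
        avg = sum(reward_history[-50:]) / 50
        print(f"episode {episode + 1}: average reward over last 50 episodes = {avg:.1f}")
```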

Edge Cases:

  • Zero-Reward Episodes: Handle episodes in which no significant reward is obtained; the algorithm should still be able to learn.
  • Short/Long Episodes: The algorithm should be robust to varying episode lengths (a returns helper that tolerates both cases is sketched below).
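
A sketch of a discounted-returns helper that tolerates these cases; the normalization guard and epsilon value are illustrative choices:

```python
import torch


def discounted_returns(rewards, gamma=0.99, normalize=True):
    """Compute the discounted return G_t for every time step of one episode.

    Works for episodes of any length; an all-zero-reward episode simply yields
    all-zero returns (and therefore a zero gradient for that episode).
    """
    returns = torch.zeros(len(rewards))
    G = 0.0
    for t in reversed(range(len(rewards))):
        G = rewards[t] + gamma * G
        returns[t] = G
    if normalize and len(rewards) > 1 and returns.std() > 0:
        # Normalizing reduces gradient variance; the guards avoid dividing by
        # zero for one-step episodes or episodes with identical returns.
        returns = (returns - returns.mean()) / (returns.std() + 1e-8)
    return returns


# Short, long, and zero-reward episodes are all handled the same way.
print(discounted_returns([0.0, 0.0, 1.0]))
print(discounted_returns([-0.1] * 50))
print(discounted_returns([0.0, 0.0, 0.0]))
```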

Examples

Since this is an implementation challenge, concrete input/output examples are less about specific data transformations and more about demonstrating learning. We will define a hypothetical environment for demonstration.

Hypothetical Environment: "SimpleGridWorld"

  • State: A 2D integer tuple representing the agent's position (e.g., (row, col)).
  • Actions: 0 (Up), 1 (Down), 2 (Left), 3 (Right).
  • Reward:
    • +1 if the agent reaches a target location.
    • -0.1 for each step taken (to encourage efficiency).
    • -1 if the agent hits a boundary (optional, depending on environment design).
  • Done: True if the target is reached or if a maximum number of steps is exceeded (one possible implementation is sketched after this list).
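
One way the hypothetical SimpleGridWorld could be realized; the class name, the reset/step interface, the maximum step count, and the boundary penalty follow the description above but remain illustrative assumptions:

```python
class SimpleGridWorld:
    """Hypothetical grid environment matching the description above."""

    def __init__(self, size=3, target=(2, 2), start=(0, 0), max_steps=20):
        self.size, self.target, self.start, self.max_steps = size, target, start, max_steps
        self.reset()

    def reset(self):
        self.pos = self.start
        self.steps = 0
        return self.pos

    def step(self, action):
        # 0: Up, 1: Down, 2: Left, 3: Right
        moves = {0: (-1, 0), 1: (1, 0), 2: (0, -1), 3: (0, 1)}
        dr, dc = moves[action]
        row, col = self.pos[0] + dr, self.pos[1] + dc
        reward = -0.1                       # per-step cost to encourage efficiency
        if not (0 <= row < self.size and 0 <= col < self.size):
            reward += -1.0                  # optional boundary penalty; agent stays in place
            row, col = self.pos
        self.pos = (row, col)
        self.steps += 1
        if self.pos == self.target:
            reward += 1.0                   # target bonus
        done = self.pos == self.target or self.steps >= self.max_steps
        return self.pos, reward, done


env = SimpleGridWorld()
state = env.reset()
state, reward, done = env.step(3)   # try moving Right from (0, 0)
print(state, reward, done)
```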

Example 1: Initial Performance (Before Training)

Let's say the SimpleGridWorld is a 3x3 grid, and the target is at (2,2). The agent starts at (0,0).

Input:

  • Environment: SimpleGridWorld (3x3, target at (2,2), start at (0,0)).
  • Agent: Policy network initialized with random weights.

Output (after 1 episode):

  • Sequence of states, actions, and rewards.
  • Initial episode reward: e.g., -1.5 (if the agent takes 5 steps at -0.1 each and hits a boundary once for -1, without reaching the target).
  • Average episode reward over a small number of episodes (e.g., 10): e.g., -2.0

Explanation: The agent, with random initialization, will likely take inefficient or random actions. Its performance will be poor, and the average reward will be low and likely fluctuating.

Example 2: Performance After Training

Input:

  • Environment: SimpleGridWorld (same as above).
  • Agent: Policy network trained for 1000 episodes.

Output (after 1 episode):

  • Sequence of states, actions, and rewards.
  • Episode reward: e.g., +0.6 (if it reaches the target in the minimum of 4 steps: 4 steps * -0.1 reward/step + 1 target reward = +0.6).
  • Average episode reward over 100 episodes (after training): e.g., +0.5

Explanation: After training, the agent's policy has been updated to favor actions that lead towards the target. It will exhibit more directed movement, resulting in higher episode rewards and a consistently positive average reward.

Constraints

  • The state space will be discrete and low-dimensional (e.g., up to 10 dimensions).
  • The action space will be discrete and small (e.g., up to 5 actions).
  • The environment will provide rewards that are finite and bounded.
  • You are expected to use a standard Python deep learning library (PyTorch or TensorFlow/Keras).
  • The implementation should be reasonably efficient; training on a simple environment should not take an excessively long time (e.g., less than 10 minutes on a standard CPU for a few thousand episodes).

Notes

  • Discount Factor (gamma): You will need to choose a discount factor ($\gamma$) for calculating discounted returns. A common value is 0.99.
  • Learning Rate: Experiment with different learning rates for your optimizer.
  • Baseline: For more advanced implementations (though not strictly required for this basic challenge), consider subtracting a baseline from the returns to reduce variance. A simple baseline could be the average return seen so far.
  • Neural Network Architecture: Start with a simple network (e.g., 2-3 hidden layers with ReLU activations). The output layer should have num_actions units with a softmax activation to produce probabilities.
  • Gradient Clipping: Consider gradient clipping to prevent exploding gradients, especially in the early stages of training (the sketch after these notes combines this with a simple baseline).
  • Logging: Implement basic logging to track episode rewards and loss to monitor training progress.
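
A sketch of how the baseline and gradient-clipping notes above might look in a PyTorch update step; the running-average baseline, the decay factor beta, and the clipping threshold are illustrative, and the synthetic episode data exists only to exercise the function:

```python
import torch
import torch.nn as nn
from torch.distributions import Categorical

# Hypothetical policy and optimizer, standing in for whatever network you trained above.
policy = nn.Sequential(nn.Linear(2, 64), nn.ReLU(), nn.Linear(64, 4))
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

running_baseline = 0.0  # simple baseline: running average of observed returns


def reinforce_update(log_probs, returns, beta=0.9, max_grad_norm=1.0):
    global running_baseline
    # Subtracting a running-average baseline from the returns reduces variance.
    running_baseline = beta * running_baseline + (1 - beta) * returns.mean().item()
    advantages = returns - running_baseline

    loss = -(torch.stack(log_probs) * advantages).sum()
    optimizer.zero_grad()
    loss.backward()
    # Clip gradients to a maximum norm to guard against exploding gradients.
    torch.nn.utils.clip_grad_norm_(policy.parameters(), max_grad_norm)
    optimizer.step()
    return loss.item()


# Synthetic one-episode data purely to exercise the update step.
states = torch.randn(5, 2)
dists = Categorical(logits=policy(states))
actions = dists.sample()
log_probs = list(dists.log_prob(actions))
returns = torch.tensor([0.6, 0.7, 0.8, 0.9, 1.0])
print(reinforce_update(log_probs, returns))
```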