Implement Actor-Critic for CartPole Balancing
This challenge requires you to implement an Actor-Critic reinforcement learning algorithm in Python to solve the classic CartPole balancing problem. The goal is to train an agent that can learn to balance a pole on a cart by applying forces to the cart. This exercise will deepen your understanding of policy gradient methods and value function approximation.
Problem Description
You will implement an Actor-Critic agent that interacts with the CartPole environment from OpenAI Gym (or its maintained successor, Gymnasium). The agent needs to learn a policy (the Actor) that dictates actions based on the current state and a value function (the Critic) that estimates the expected future reward from a given state.
What needs to be achieved:
- Implement the Actor network, which takes the state as input and outputs probabilities for each action.
- Implement the Critic network, which takes the state as input and outputs the estimated value of that state (a minimal sketch of both networks follows this list).
- Combine these networks into an Actor-Critic learning agent.
- Train the agent using a suitable algorithm (e.g., Advantage Actor-Critic, A2C) to achieve satisfactory performance on the CartPole environment.
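Below is a minimal sketch of the two networks, assuming PyTorch as the framework (TensorFlow/Keras works just as well); the hidden-layer size of 128 is an illustrative choice, not a requirement.

```python
import torch
import torch.nn as nn


class Actor(nn.Module):
    """Maps a CartPole state (4 floats) to logits over the 2 actions."""

    def __init__(self, state_dim: int = 4, n_actions: int = 2, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, n_actions),  # raw logits; softmax/sampling happens later
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return self.net(state)


class Critic(nn.Module):
    """Maps a CartPole state to a single scalar state-value estimate V(s)."""

    def __init__(self, state_dim: int = 4, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return self.net(state).squeeze(-1)  # drop the trailing value dimension
```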
Key Requirements:
- Actor Network: A neural network (e.g., using TensorFlow/Keras or PyTorch) that outputs action probabilities.
- Critic Network: A neural network (e.g., using TensorFlow/Keras or PyTorch) that outputs a single value representing the state-value.
- Training Loop: A loop that iterates through episodes, collects experience (state, action, reward, next_state, done), and updates the Actor and Critic networks (a sketch of this loop follows the list).
- Loss Functions: Appropriate loss functions for both the Actor (e.g., a policy-gradient loss weighted by the advantage) and the Critic (e.g., mean squared error between the predicted value and the return or TD target).
- Environment Interaction: Utilize an existing CartPole environment (e.g., gym.make('CartPole-v1')).
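The following sketch ties these requirements together in a one-step Advantage Actor-Critic update. It assumes the hypothetical Actor/Critic classes sketched earlier and the Gymnasium API (env.step returns a 5-tuple; the classic gym package returns 4 values instead); learning rates and the episode count are placeholder values, not tuned hyperparameters.

```python
import gymnasium as gym  # assumption: Gymnasium API; classic `gym` differs slightly
import torch
import torch.nn.functional as F
from torch.distributions import Categorical

actor, critic = Actor(), Critic()          # hypothetical classes from the earlier sketch
actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-3)
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)
gamma = 0.99

env = gym.make("CartPole-v1")
for episode in range(1000):
    state, _ = env.reset()
    done = False
    while not done:
        s = torch.as_tensor(state, dtype=torch.float32)
        dist = Categorical(logits=actor(s))
        action = dist.sample()                       # explore by sampling the policy

        next_state, reward, terminated, truncated, _ = env.step(action.item())
        done = terminated or truncated

        # One-step TD target and advantage, bootstrapped from the Critic.
        s_next = torch.as_tensor(next_state, dtype=torch.float32)
        with torch.no_grad():
            not_terminal = 0.0 if terminated else 1.0
            target = reward + gamma * not_terminal * critic(s_next)
        value = critic(s)
        advantage = target - value

        critic_loss = F.mse_loss(value, target)
        actor_loss = -dist.log_prob(action) * advantage.detach()
        # An entropy bonus (see Notes) can be subtracted here to encourage exploration.

        actor_opt.zero_grad()
        actor_loss.backward()
        actor_opt.step()

        critic_opt.zero_grad()
        critic_loss.backward()
        critic_opt.step()

        state = next_state
```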
Expected Behavior: The trained agent should be able to balance the pole for a sustained period, typically exceeding a threshold score (e.g., averaging 475 or more over 100 consecutive trials, as per CartPole-v1's evaluation criteria).
Important Edge Cases:
- Exploration vs. Exploitation: Ensure the agent balances exploration (trying new actions) with exploitation (using learned good actions). This can be managed through the action-selection strategy (e.g., sampling from the policy's output distribution), an entropy bonus, or added noise; a short entropy-bonus sketch follows this list.
- Learning Rate Schedules: Consider if a dynamic learning rate schedule is beneficial.
- Hyperparameter Tuning: The performance will be highly sensitive to hyperparameters like learning rates, discount factor, entropy bonus, etc.
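One common way to keep the policy exploratory, as suggested above and in the Notes, is to subtract an entropy bonus from the Actor loss. A small sketch, assuming PyTorch; the function name and the 0.01 coefficient are illustrative, not prescribed values.

```python
import torch
from torch.distributions import Categorical


def actor_loss_with_entropy(logits: torch.Tensor,
                            action: torch.Tensor,
                            advantage: torch.Tensor,
                            entropy_coef: float = 0.01) -> torch.Tensor:
    """Policy-gradient loss minus an entropy bonus that discourages premature collapse."""
    dist = Categorical(logits=logits)
    pg_loss = -dist.log_prob(action) * advantage.detach()
    return pg_loss - entropy_coef * dist.entropy()
```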
Examples
Example 1: Initial State and Action
Input:
Observation from CartPole-v1: [0.02565188, -0.00137031, -0.01988315, 0.0294669 ] (Cart Position, Cart Velocity, Pole Angle, Pole Angular Velocity)
Output (Hypothetical, depending on untrained network):
Actor Network Output (Logits for [Left, Right]): [-0.1, 0.2]
Action Probabilities (Softmax): [0.43, 0.57]
Selected Action (Sampling from probabilities): 1 (Move Right)
Critic Network Output (State Value Estimate): 0.01
Explanation: An untrained Actor network might produce logits close to zero. After a softmax, this results in roughly equal probabilities for each action, leading to a random or only slightly biased action choice. The Critic network might also provide a low initial value estimate.
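The numbers in this example can be reproduced in a few lines; the logits are the hypothetical untrained outputs from above, so the exact probabilities in practice depend on the initial weights.

```python
import torch
from torch.distributions import Categorical

logits = torch.tensor([-0.1, 0.2])          # hypothetical untrained Actor output
probs = torch.softmax(logits, dim=-1)       # tensor([0.4256, 0.5744]) -> roughly [0.43, 0.57]
action = Categorical(probs=probs).sample()  # 0 or 1, drawn according to probs
```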
Example 2: After Training
Input:
Observation from CartPole-v1: [-0.15, 0.5, -0.1, -0.2]
Output (Hypothetical, depending on trained network):
Actor Network Output (Logits for [Left, Right]): [2.5, -1.0]
Action Probabilities (Softmax): [0.97, 0.03]
Selected Action (Sampling from probabilities): 0 (Move Left)
Critic Network Output (State Value Estimate): 50.2
Explanation: A well-trained Actor network outputs high logits for actions that lead to better outcomes. Here the pole is tilting and falling to the left, so the agent has learned that pushing the cart left (moving the pivot back under the pole) is the better action. The Critic network has also learned to assign a significantly higher value to this state, indicating a position from which the pole can still be balanced for many steps.
Constraints
- The implementation must be in Python.
- You are allowed to use standard Python libraries for numerical computation (NumPy) and deep learning frameworks (TensorFlow/Keras or PyTorch).
- You should use an existing CartPole environment (e.g., gym.make('CartPole-v1') or a compatible alternative).
- The training process should converge to a state where the agent can consistently achieve high scores (e.g., an average score of 475+ over 100 evaluation episodes).
- Your solution should be modular, separating network definitions, training logic, and environment interaction.
Notes
- Actor-Critic Architecture: A common setup involves two separate neural networks (or a single network with two heads) for the Actor and Critic.
- Loss Calculation: The Critic loss is typically based on the Temporal Difference (TD) error: delta = reward + gamma * V(s') - V(s). The Actor loss is then often proportional to -log(pi(a|s)) * delta. An entropy bonus can be added to the Actor loss to encourage exploration.
- Discount Factor (gamma): Choose an appropriate discount factor (e.g., 0.99) to balance immediate and future rewards.
- Baseline: Using the Critic's value estimate V(s) as a baseline for the Actor's loss can significantly reduce variance and improve training stability. The advantage function is often computed as A(s, a) = r + gamma * V(s') - V(s).
- Network Size: Simple feedforward neural networks with one or two hidden layers are usually sufficient for CartPole.
- Evaluation: Implement a separate evaluation phase to measure the agent's performance without exploration noise.
- Reproducibility: Seed the random number generators for both Python and your chosen deep learning framework to ensure reproducibility of results.
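A minimal sketch covering the last two notes, again assuming PyTorch and the Gymnasium API; the helper names (seed_everything, evaluate) are illustrative, not required.

```python
import random

import gymnasium as gym  # assumption: Gymnasium API
import numpy as np
import torch


def seed_everything(seed: int = 0) -> None:
    """Seed Python, NumPy, and PyTorch so training runs are repeatable."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)


def evaluate(actor: torch.nn.Module, episodes: int = 100) -> float:
    """Average return over greedy (argmax) rollouts, i.e. without exploration noise."""
    env = gym.make("CartPole-v1")
    returns = []
    for _ in range(episodes):
        state, _ = env.reset()
        done, total = False, 0.0
        while not done:
            with torch.no_grad():
                logits = actor(torch.as_tensor(state, dtype=torch.float32))
            action = int(torch.argmax(logits))
            state, reward, terminated, truncated, _ = env.step(action)
            total += reward
            done = terminated or truncated
        returns.append(total)
    return float(np.mean(returns))
```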