Variational Inference for a Gaussian Mixture Model in Rust

Variational inference is a powerful technique for approximating intractable posterior distributions, commonly used in Bayesian machine learning. This challenge asks you to implement a simplified version of variational inference for a Gaussian Mixture Model (GMM) in Rust. Successfully completing this challenge will demonstrate your understanding of variational inference principles and your ability to implement them in Rust.

Problem Description

You are tasked with implementing variational inference to approximate the posterior distribution over the mixing weight, component means, and shared variance of a two-component Gaussian Mixture Model (GMM). The observed data is a vector of real numbers, each value assumed to be generated by one of two Gaussian components; the components have different means and a shared variance, and a mixing weight determines the probability of each component.

What needs to be achieved:

  1. Define Data and Model: Represent the observed data as a Vec<f64>. The GMM consists of two Gaussian components.
  2. Define Variational Parameters: Represent the variational parameters as a struct containing the following fields (one possible definition is sketched after this list):
    • pi: Variational estimate of the mixing weight (a float between 0 and 1).
    • mu1: Variational estimate of the mean of the first Gaussian component (a float).
    • mu2: Variational estimate of the mean of the second Gaussian component (a float).
    • sigma2: Variational estimate of the variance of both Gaussian components (a float, must be positive).
  3. Implement the Expected Complete Log-Likelihood (E-CLL): This function takes the observed data and the variational parameters as input and returns the E-CLL value. The E-CLL is a lower bound on the log-likelihood of the data and is used to guide the optimization of the variational parameters. The formula for the E-CLL for a 2-component GMM with a shared variance is provided in the "Notes" section.
  4. Implement Gradient Descent: Implement a simple gradient descent algorithm to update the variational parameters based on the gradient of the E-CLL. The gradients are also provided in the "Notes" section.
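
One possible shape for the data and parameter representation (steps 1–2) is sketched below. The struct and helper names (VariationalParams, gaussian_pdf) are illustrative, not prescribed by the challenge:

```rust
/// Variational parameters for the two-component GMM (illustrative field names).
#[derive(Debug, Clone)]
pub struct VariationalParams {
    pub pi: f64,     // mixing weight, in (0, 1)
    pub mu1: f64,    // mean of the first component
    pub mu2: f64,    // mean of the second component
    pub sigma2: f64, // shared variance, must stay positive
}

/// Gaussian probability density N(x | mu, sigma2).
fn gaussian_pdf(x: f64, mu: f64, sigma2: f64) -> f64 {
    let norm = 1.0 / (2.0 * std::f64::consts::PI * sigma2).sqrt();
    norm * (-(x - mu).powi(2) / (2.0 * sigma2)).exp()
}
```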

Key Requirements:

  • The code must be written in Rust.
  • The solution should be well-structured and readable.
  • The gradient descent algorithm should converge to a reasonable solution (i.e., the E-CLL should increase with each iteration).
  • The variance (sigma2) must be kept positive throughout the optimization process. Clamp the value to a small positive number if it becomes negative.

Expected Behavior:

The code should take a vector of f64 as input, initialize the variational parameters randomly, and then iteratively update them using gradient descent until convergence (or a maximum number of iterations is reached). The final variational parameters should represent an approximation of the posterior distribution over the GMM parameters.
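
A minimal sketch of that loop is shown below. It assumes hypothetical helpers initialize, ecll, and gradients (sketched later on this page), and the learning rate and tolerance values are illustrative:

```rust
/// Fit the variational parameters by gradient ascent on the E-CLL.
/// `initialize`, `ecll`, and `gradients` are assumed helpers sketched elsewhere.
fn fit_gmm(data: &[f64], learning_rate: f64, max_iters: usize, tol: f64) -> VariationalParams {
    let mut params = initialize(data);
    let mut prev = f64::NEG_INFINITY;

    for _ in 0..max_iters {
        let (d_pi, d_mu1, d_mu2, d_sigma2) = gradients(data, &params);

        // The E-CLL is maximized, so step in the direction of the gradient.
        params.pi += learning_rate * d_pi;
        params.mu1 += learning_rate * d_mu1;
        params.mu2 += learning_rate * d_mu2;
        params.sigma2 += learning_rate * d_sigma2;

        // Keep parameters in their valid ranges (see "Edge Cases" and "Constraints").
        params.pi = params.pi.clamp(1e-6, 1.0 - 1e-6);
        params.sigma2 = params.sigma2.max(1e-6);

        // Stop when the objective no longer changes meaningfully.
        let current = ecll(data, &params);
        if (current - prev).abs() < tol {
            break;
        }
        prev = current;
    }
    params
}
```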

Edge Cases to Consider:

  • Initial Parameter Values: The initial values of the variational parameters can affect the convergence of the algorithm. Consider initializing them to reasonable values.
  • Variance becoming negative: The variance parameter can become negative during optimization. Implement a mechanism to prevent this (e.g., clamping).
  • Convergence: The gradient descent algorithm may not always converge. Implement a stopping criterion based on the change in the E-CLL.

Examples

Example 1:

Input: vec![1.0, 2.0, 3.0, 4.0, 5.0]
Output: Variational parameters (pi, mu1, mu2, sigma2) after convergence.  The exact values will vary depending on the initialization and convergence criteria.
Explanation: The algorithm will iteratively update the variational parameters to maximize the E-CLL, given the observed data.

Example 2:

Input: vec![0.0, 0.0, 0.0, 10.0, 10.0, 10.0]
Output: Variational parameters (pi, mu1, mu2, sigma2) after convergence.  The algorithm should converge to a solution where mu1 is close to 0, mu2 is close to 10, and pi reflects the proportion of data points near each mean.
Explanation: The algorithm will learn the two clusters in the data and estimate their means and mixing weight.
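
One possible way to exercise this example, assuming the hypothetical fit_gmm driver sketched earlier; exact output varies with the random initialization:

```rust
fn main() {
    let data = vec![0.0, 0.0, 0.0, 10.0, 10.0, 10.0];
    // fit_gmm, the learning rate, iteration cap, and tolerance are illustrative choices.
    let params = fit_gmm(&data, 0.01, 1000, 1e-8);
    println!(
        "pi = {:.3}, mu1 = {:.3}, mu2 = {:.3}, sigma2 = {:.3}",
        params.pi, params.mu1, params.mu2, params.sigma2
    );
}
```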

Constraints

  • Data Size: The input vector can contain up to 1000 data points.
  • Parameter Initialization: The variational parameters should be initialized randomly, but within reasonable bounds (e.g., pi between 0 and 1, mu1 and mu2 within the range of the data, sigma2 > 0); one possible initialization is sketched after this list.
  • Gradient Descent Steps: The maximum number of gradient descent iterations should be limited to 1000.
  • Learning Rate: The learning rate for gradient descent should be a small positive value (e.g., 0.01).
  • Variance Clamping: If sigma2 becomes less than 1e-6, clamp it to 1e-6.
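
One way to initialize within these bounds, assuming the rand crate (0.8-style API); the exact ranges are suggestions rather than requirements:

```rust
use rand::Rng;

/// Randomly initialize the variational parameters within the bounds listed above.
fn initialize(data: &[f64]) -> VariationalParams {
    let mut rng = rand::thread_rng();
    let min = data.iter().cloned().fold(f64::INFINITY, f64::min);
    let max = data.iter().cloned().fold(f64::NEG_INFINITY, f64::max);

    VariationalParams {
        pi: rng.gen_range(0.1..0.9),     // away from the boundaries of (0, 1)
        mu1: rng.gen_range(min..=max),   // somewhere in the data range
        mu2: rng.gen_range(min..=max),
        sigma2: rng.gen_range(0.5..2.0), // strictly positive
    }
}
```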

Notes

Expected Complete Log-Likelihood (E-CLL) for a 2-component GMM with a shared variance:

For each data point x_i, define the responsibilities of the two components as

  r_i1 = N(x_i | mu1, sigma2) / (N(x_i | mu1, sigma2) + N(x_i | mu2, sigma2)),   and   r_i2 = 1 - r_i1

where N(x | mu, sigma2) is the probability density function of a Gaussian distribution with mean mu and variance sigma2, evaluated at x.

E-CLL = Sum_i [ r_i1 * (log(pi) + log(N(x_i | mu1, sigma2))) + r_i2 * (log(1 - pi) + log(N(x_i | mu2, sigma2))) ]

Gradients of the E-CLL with respect to the variational parameters, treating the responsibilities r_i1 and r_i2 as constants:

  • dE-CLL/d(pi) = Sum_i [ r_i1 / pi - r_i2 / (1 - pi) ]
  • dE-CLL/d(mu1) = Sum_i [ r_i1 * (x_i - mu1) / sigma2 ]
  • dE-CLL/d(mu2) = Sum_i [ r_i2 * (x_i - mu2) / sigma2 ]
  • dE-CLL/d(sigma2) = Sum_i [ r_i1 * ((x_i - mu1)^2 - sigma2) + r_i2 * ((x_i - mu2)^2 - sigma2) ] / (2 * sigma2^2)

where Sum_i denotes the sum over all data points x_i.
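
These formulas translate fairly directly into Rust. The sketch below reuses the illustrative VariationalParams struct and gaussian_pdf helper from earlier and holds the responsibilities fixed when computing gradients:

```rust
/// Responsibility of the first component for data point x.
fn responsibility(x: f64, p: &VariationalParams) -> f64 {
    let n1 = gaussian_pdf(x, p.mu1, p.sigma2);
    let n2 = gaussian_pdf(x, p.mu2, p.sigma2);
    n1 / (n1 + n2)
}

/// Expected complete log-likelihood of the data under the current parameters.
fn ecll(data: &[f64], p: &VariationalParams) -> f64 {
    data.iter()
        .map(|&x| {
            let r1 = responsibility(x, p);
            let r2 = 1.0 - r1;
            r1 * (p.pi.ln() + gaussian_pdf(x, p.mu1, p.sigma2).ln())
                + r2 * ((1.0 - p.pi).ln() + gaussian_pdf(x, p.mu2, p.sigma2).ln())
        })
        .sum()
}

/// Gradients of the E-CLL with respect to (pi, mu1, mu2, sigma2),
/// holding the responsibilities fixed.
fn gradients(data: &[f64], p: &VariationalParams) -> (f64, f64, f64, f64) {
    let (mut d_pi, mut d_mu1, mut d_mu2, mut d_sigma2) = (0.0, 0.0, 0.0, 0.0);
    for &x in data {
        let r1 = responsibility(x, p);
        let r2 = 1.0 - r1;
        d_pi += r1 / p.pi - r2 / (1.0 - p.pi);
        d_mu1 += r1 * (x - p.mu1) / p.sigma2;
        d_mu2 += r2 * (x - p.mu2) / p.sigma2;
        d_sigma2 += (r1 * ((x - p.mu1).powi(2) - p.sigma2)
            + r2 * ((x - p.mu2).powi(2) - p.sigma2))
            / (2.0 * p.sigma2 * p.sigma2);
    }
    (d_pi, d_mu1, d_mu2, d_sigma2)
}
```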
