Gradient Descent for Linear Regression
Gradient descent is a fundamental optimization algorithm used widely in machine learning. Your task is to implement gradient descent from scratch to find the optimal parameters for a simple linear regression model. This will give you a hands-on understanding of how models learn from data.
Problem Description
You need to implement the gradient descent algorithm to find the best-fit line for a given set of 2D data points $(x, y)$. The goal is to minimize the Mean Squared Error (MSE) cost function.
The linear regression model is defined as: $y_{predicted} = \theta_0 + \theta_1 x$
Where:
- $y_{predicted}$ is the predicted value of $y$.
- $x$ is the input feature.
- $\theta_0$ is the intercept (bias) term.
- $\theta_1$ is the coefficient for the feature $x$.
The Mean Squared Error (MSE) cost function is defined as: $J(\theta_0, \theta_1) = \frac{1}{2m} \sum_{i=1}^{m} (y_{predicted}^{(i)} - y^{(i)})^2 = \frac{1}{2m} \sum_{i=1}^{m} (\theta_0 + \theta_1 x^{(i)} - y^{(i)})^2$
Where:
- $m$ is the number of training examples.
- $y^{(i)}$ and $x^{(i)}$ are the actual $y$ and $x$ values for the $i$-th training example.
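A small helper to evaluate this cost in plain Python can be handy for checking convergence as you iterate. This is only a sketch, and the name `compute_mse` is my own choice rather than part of the required interface:

```python
def compute_mse(theta0, theta1, X, y):
    """MSE cost J(theta0, theta1) using the 1/(2m) convention above.
    Assumes at least one data point."""
    m = len(X)
    squared_errors = [(theta0 + theta1 * x - yi) ** 2 for x, yi in zip(X, y)]
    return sum(squared_errors) / (2 * m)
```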
The gradient descent update rules for $\theta_0$ and $\theta_1$ are: $\theta_0 := \theta_0 - \alpha \frac{\partial J}{\partial \theta_0}$ and $\theta_1 := \theta_1 - \alpha \frac{\partial J}{\partial \theta_1}$
Where $\alpha$ is the learning rate. Both parameters should be updated simultaneously: compute both gradients from the current values of $\theta_0$ and $\theta_1$ before applying either update.
The partial derivatives are: $\frac{\partial J}{\partial \theta_0} = \frac{1}{m} \sum_{i=1}^{m} (\theta_0 + \theta_1 x^{(i)} - y^{(i)})$ and $\frac{\partial J}{\partial \theta_1} = \frac{1}{m} \sum_{i=1}^{m} (\theta_0 + \theta_1 x^{(i)} - y^{(i)}) x^{(i)}$
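As a sketch, one iteration of these updates in plain Python might look like the following; the data values and variable names (`alpha`, `grad0`, `grad1`) are illustrative assumptions, not part of the required interface:

```python
# One gradient descent iteration on a tiny illustrative dataset.
X = [1.0, 2.0, 3.0]
y = [2.0, 4.0, 5.0]
theta0, theta1 = 0.0, 0.0
alpha = 0.01                                                      # learning rate

m = len(X)
errors = [theta0 + theta1 * x - yi for x, yi in zip(X, y)]        # prediction - actual
grad0 = sum(errors) / m                                           # dJ/dtheta0
grad1 = sum(e * x for e, x in zip(errors, X)) / m                 # dJ/dtheta1
theta0, theta1 = theta0 - alpha * grad0, theta1 - alpha * grad1   # simultaneous update
```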
Your implementation should:
- Initialize $\theta_0$ and $\theta_1$ (e.g., to 0).
- Iterate a specified number of times (epochs).
- In each iteration, calculate the predicted values for all data points.
- Calculate the gradients for $\theta_0$ and $\theta_1$ based on the current predictions and actual values.
- Update $\theta_0$ and $\theta_1$ using the learning rate and the calculated gradients.
- Return the final values of $\theta_0$ and $\theta_1$ (a complete sketch of these steps follows this list).
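Putting these steps together, here is a minimal sketch in plain Python; the function name and signature (`gradient_descent(X, y, learning_rate, epochs)`) are my own choices, not a prescribed interface:

```python
def gradient_descent(X, y, learning_rate, epochs):
    """Batch gradient descent for the model y_predicted = theta0 + theta1 * x."""
    theta0, theta1 = 0.0, 0.0              # initialize parameters to zero
    m = len(X)
    if m == 0:                             # empty input: avoid division by zero
        return theta0, theta1

    for _ in range(epochs):
        # Predicted values for all data points under the current parameters.
        predictions = [theta0 + theta1 * x for x in X]
        errors = [pred - yi for pred, yi in zip(predictions, y)]
        # Gradients of the MSE cost with respect to theta0 and theta1.
        grad0 = sum(errors) / m
        grad1 = sum(e * x for e, x in zip(errors, X)) / m
        # Update both parameters using the learning rate.
        theta0 -= learning_rate * grad0
        theta1 -= learning_rate * grad1

    return theta0, theta1
```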
Edge Cases:
- Empty input data: The function should handle this gracefully, e.g., by returning the initial parameters (as Example 3 below expects) or raising an error.
- Zero learning rate: The parameters will not update.
- Very large learning rate: Could lead to divergence (see the quick checks after this list).
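A few quick checks of these cases against the sketch above (the expected values assume the parameters start at zero, as in that sketch):

```python
# Empty data: the guard returns the initial parameters unchanged.
print(gradient_descent([], [], learning_rate=0.01, epochs=100))               # (0.0, 0.0)

# Zero learning rate: every update subtracts learning_rate * gradient = 0,
# so the parameters never move from their initial values.
print(gradient_descent([1, 2, 3], [1, 2, 3], learning_rate=0.0, epochs=50))   # (0.0, 0.0)

# Very large learning rate: each step overshoots the minimum and the
# parameter values grow in magnitude instead of converging.
print(gradient_descent([1, 2, 3], [1, 2, 3], learning_rate=2.0, epochs=50))
```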
Examples
Example 1:
Input:
X = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
learning_rate = 0.01
epochs = 1000
Output:
(1.0200000000000008, 0.7999999999999999)
Explanation: After 1000 iterations of gradient descent with a learning rate of 0.01, the algorithm produces approximate values for $\theta_0$ and $\theta_1$ that substantially reduce the MSE relative to the zero-initialized parameters; with more iterations they would continue to approach the exact least-squares fit for this data.
Example 2:
Input:
X = [2.5, 5.0, 7.5, 10.0]
y = [5.0, 7.5, 10.0, 12.5]
learning_rate = 0.02
epochs = 1500
Output:
(2.500000000000002, 1.0)
Explanation: This dataset lies exactly on the line $y = 2.5 + 1.0x$. With a slightly larger learning rate and more epochs than Example 1, gradient descent converges to these near-perfect parameter values.
Example 3: (Edge case: empty data)
Input:
X = []
y = []
learning_rate = 0.01
epochs = 100
Output:
(0.0, 0.0)
Explanation: When the input data is empty, $m$ is 0 and the gradient expressions would divide by zero. The implementation should detect this case rather than crash; returning the initial parameters (0.0, 0.0) is a reasonable behavior.
Constraints
- The input `X` and `y` will be lists of numbers (integers or floats).
- `X` and `y` will have the same length.
- `learning_rate` will be a float, typically between 0.001 and 0.1.
- `epochs` will be an integer, typically between 100 and 5000.
- The number of data points `m` will not exceed 1000.
- The number of epochs will not exceed 10000.
Notes
- You will be implementing this without using any external libraries like NumPy or Scikit-learn for the core gradient descent logic. Standard Python lists and arithmetic operations are sufficient.
- Consider how to handle the case where `m` (the number of data points) is zero to avoid division-by-zero errors.
- The accuracy of the returned parameters will depend on the learning rate and the number of epochs. The examples provide target outputs for specific configurations.
- Think about the data types you are using for calculations to maintain precision.
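For example (a small habit rather than a requirement), initializing the parameters as `0.0` rather than `0`, and optionally converting the inputs to floats up front, keeps all of the arithmetic in floating point:

```python
X = [1, 2, 3, 4, 5]            # integer inputs are fine...
y = [2, 4, 5, 4, 5]
theta0, theta1 = 0.0, 0.0      # ...but float parameters keep every update in floating point
xs = [float(v) for v in X]     # optional: normalize the inputs to floats as well
ys = [float(v) for v in y]
```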