Gradient Descent for Linear Regression
Gradient descent is a fundamental optimization algorithm used widely in machine learning. Your task is to implement gradient descent from scratch to find the optimal parameters for a simple linear regression model. This will give you a hands-on understanding of how models learn from data.
Problem Description
You need to implement the gradient descent algorithm to find the best-fit line for a given set of 2D data points $(x, y)$. The goal is to minimize the Mean Squared Error (MSE) cost function.
The linear regression model is defined as: $y_{predicted} = \theta_0 + \theta_1 x$
Where:
- $y_{predicted}$ is the predicted value of $y$.
- $x$ is the input feature.
- $\theta_0$ is the intercept (bias) term.
- $\theta_1$ is the coefficient for the feature $x$.
The Mean Squared Error (MSE) cost function is defined as: $J(\theta_0, \theta_1) = \frac{1}{2m} \sum_{i=1}^{m} (y_{predicted}^{(i)} - y^{(i)})^2 = \frac{1}{2m} \sum_{i=1}^{m} (\theta_0 + \theta_1 x^{(i)} - y^{(i)})^2$
Where:
- $m$ is the number of training examples.
- $y^{(i)}$ and $x^{(i)}$ are the actual $y$ and $x$ values for the $i$-th training example.
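A small helper to evaluate this cost in plain Python can be handy for checking convergence as you iterate. This is only a sketch, and the name `compute_mse` is my own choice rather than part of the required interface:

```python
def compute_mse(theta0, theta1, X, y):
    """MSE cost J(theta0, theta1) using the 1/(2m) convention above.
    Assumes at least one data point."""
    m = len(X)
    squared_errors = [(theta0 + theta1 * x - yi) ** 2 for x, yi in zip(X, y)]
    return sum(squared_errors) / (2 * m)
```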
The gradient descent update rules for $\theta_0$ and $\theta_1$ are: $\theta_0 := \theta_0 - \alpha \frac{\partial J}{\partial \theta_0}$ and $\theta_1 := \theta_1 - \alpha \frac{\partial J}{\partial \theta_1}$
Where $\alpha$ is the learning rate. Both parameters should be updated simultaneously: compute both gradients from the current values of $\theta_0$ and $\theta_1$ before applying either update.
The partial derivatives are: $\frac{\partial J}{\partial \theta_0} = \frac{1}{m} \sum_{i=1}^{m} (\theta_0 + \theta_1 x^{(i)} - y^{(i)})$ and $\frac{\partial J}{\partial \theta_1} = \frac{1}{m} \sum_{i=1}^{m} (\theta_0 + \theta_1 x^{(i)} - y^{(i)}) x^{(i)}$
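As a sketch, one iteration of these updates in plain Python might look like the following; the data values and variable names (`alpha`, `grad0`, `grad1`) are illustrative assumptions, not part of the required interface:

```python
# One gradient descent iteration on a tiny illustrative dataset.
X = [1.0, 2.0, 3.0]
y = [2.0, 4.0, 5.0]
theta0, theta1 = 0.0, 0.0
alpha = 0.01                                                      # learning rate

m = len(X)
errors = [theta0 + theta1 * x - yi for x, yi in zip(X, y)]        # prediction - actual
grad0 = sum(errors) / m                                           # dJ/dtheta0
grad1 = sum(e * x for e, x in zip(errors, X)) / m                 # dJ/dtheta1
theta0, theta1 = theta0 - alpha * grad0, theta1 - alpha * grad1   # simultaneous update
```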
Your implementation should:
- Initialize $\theta_0$ and $\theta_1$ (e.g., to 0).
- Iterate a specified number of times (epochs).
- In each iteration, calculate the predicted values for all data points.
- Calculate the gradients for $\theta_0$ and $\theta_1$ based on the current predictions and actual values.
- Update $\theta_0$ and $\theta_1$ using the learning rate and the calculated gradients.
- Return the final values of $\theta_0$ and $\theta_1$ (a complete sketch of these steps follows this list).
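Putting these steps together, here is a minimal sketch in plain Python; the function name and signature (`gradient_descent(X, y, learning_rate, epochs)`) are my own choices, not a prescribed interface:

```python
def gradient_descent(X, y, learning_rate, epochs):
    """Batch gradient descent for the model y_predicted = theta0 + theta1 * x."""
    theta0, theta1 = 0.0, 0.0              # initialize parameters to zero
    m = len(X)
    if m == 0:                             # empty input: avoid division by zero
        return theta0, theta1

    for _ in range(epochs):
        # Predicted values for all data points under the current parameters.
        predictions = [theta0 + theta1 * x for x in X]
        errors = [pred - yi for pred, yi in zip(predictions, y)]
        # Gradients of the MSE cost with respect to theta0 and theta1.
        grad0 = sum(errors) / m
        grad1 = sum(e * x for e, x in zip(errors, X)) / m
        # Update both parameters using the learning rate.
        theta0 -= learning_rate * grad0
        theta1 -= learning_rate * grad1

    return theta0, theta1
```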
Edge Cases:
- Empty input data: The function should handle this gracefully, e.g., by returning the initial parameters (as Example 3 below expects) or raising an error.
- Zero learning rate: The parameters will not update.
- Very large learning rate: Could lead to divergence (see the quick checks after this list).
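A few quick checks of these cases against the sketch above (the expected values assume the parameters start at zero, as in that sketch):

```python
# Empty data: the guard returns the initial parameters unchanged.
print(gradient_descent([], [], learning_rate=0.01, epochs=100))               # (0.0, 0.0)

# Zero learning rate: every update subtracts learning_rate * gradient = 0,
# so the parameters never move from their initial values.
print(gradient_descent([1, 2, 3], [1, 2, 3], learning_rate=0.0, epochs=50))   # (0.0, 0.0)

# Very large learning rate: each step overshoots the minimum and the
# parameter values grow in magnitude instead of converging.
print(gradient_descent([1, 2, 3], [1, 2, 3], learning_rate=2.0, epochs=50))
```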
Examples
Example 1:
Input:
X = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
learning_rate = 0.01
epochs = 1000
Output:
(1.0200000000000008, 0.7999999999999999)
Explanation: After 1000 iterations of gradient descent with a learning rate of 0.01, the algorithm produces approximate values for $\theta_0$ and $\theta_1$ that substantially reduce the MSE relative to the zero-initialized parameters; with more iterations they would continue to approach the exact least-squares fit for this data.
Example 2:
Input:
X = [2.5, 5.0, 7.5, 10.0]
y = [5.0, 7.5, 10.0, 12.5]
learning_rate = 0.02
epochs = 1500
Output:
(2.500000000000002, 1.0)
Explanation: This dataset lies exactly on the line $y = 2.5 + 1.0x$. With a slightly larger learning rate and more epochs than Example 1, gradient descent converges to these near-perfect parameter values.
Example 3: (Edge case: empty data)
Input:
X = []
y = []
learning_rate = 0.01
epochs = 100
Output:
(0.0, 0.0)
Explanation: When the input data is empty, $m$ is 0 and the gradient expressions would divide by zero. The implementation should detect this case rather than crash; returning the initial parameters (0.0, 0.0) is a reasonable behavior.
Constraints
- The input `X` and `y` will be lists of numbers (integers or floats).
- `X` and `y` will have the same length.
- `learning_rate` will be a float, typically between 0.001 and 0.1.
- `epochs` will be an integer, typically between 100 and 5000.
- The number of data points `m` will not exceed 1000.
- The number of epochs will not exceed 10000.
Notes
- You will be implementing this without using any external libraries like NumPy or Scikit-learn for the core gradient descent logic. Standard Python lists and arithmetic operations are sufficient.
- Consider how to handle the case where `m` (the number of data points) is zero to avoid division-by-zero errors.
- The accuracy of the returned parameters will depend on the learning rate and the number of epochs. The examples provide target outputs for specific configurations.
- Think about the data types you are using for calculations to maintain precision.
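For example (a small habit rather than a requirement), initializing the parameters as `0.0` rather than `0`, and optionally converting the inputs to floats up front, keeps all of the arithmetic in floating point:

```python
X = [1, 2, 3, 4, 5]            # integer inputs are fine...
y = [2, 4, 5, 4, 5]
theta0, theta1 = 0.0, 0.0      # ...but float parameters keep every update in floating point
xs = [float(v) for v in X]     # optional: normalize the inputs to floats as well
ys = [float(v) for v in y]
```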