
Gradient Descent for Linear Regression

Gradient descent is a fundamental optimization algorithm used widely in machine learning. Your task is to implement gradient descent from scratch to find the optimal parameters for a simple linear regression model. This will give you a hands-on understanding of how models learn from data.

Problem Description

You need to implement the gradient descent algorithm to find the best-fit line for a given set of 2D data points $(x, y)$. The goal is to minimize the Mean Squared Error (MSE) cost function.

The linear regression model is defined as: $y_{predicted} = \theta_0 + \theta_1 * x$

Where:

  • $y_{predicted}$ is the predicted value of $y$.
  • $x$ is the input feature.
  • $\theta_0$ is the intercept (bias) term.
  • $\theta_1$ is the coefficient for the feature $x$.
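
As a minimal sketch, the prediction is a single expression in plain Python; the function name `predict` is just an illustrative choice:

```python
def predict(theta0, theta1, x):
    """Predicted y for one input x under the current parameters."""
    return theta0 + theta1 * x
```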

The Mean Squared Error (MSE) cost function is defined as: $J(\theta_0, \theta_1) = \frac{1}{2m} \sum_{i=1}^{m} (y_{predicted}^{(i)} - y^{(i)})^2 = \frac{1}{2m} \sum_{i=1}^{m} (\theta_0 + \theta_1 x^{(i)} - y^{(i)})^2$

Where:

  • $m$ is the number of training examples.
  • $y^{(i)}$ and $x^{(i)}$ are the actual $y$ and $x$ values for the $i$-th training example.
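
A plain-Python sketch of this cost function might look like the following; the name `mse_cost` and the argument order are illustrative, not part of the required interface:

```python
def mse_cost(theta0, theta1, X, y):
    """Mean Squared Error with the 1/(2m) scaling used in the formula above."""
    m = len(X)
    total = 0.0
    for x_i, y_i in zip(X, y):
        error = (theta0 + theta1 * x_i) - y_i  # prediction minus actual value
        total += error ** 2
    return total / (2 * m)
```

Computing the cost is not required to return the final parameters, but evaluating it once per epoch is a useful way to check that the error is actually decreasing.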

The gradient descent update rules for $\theta_0$ and $\theta_1$ are: $\theta_0 := \theta_0 - \alpha \frac{\partial J}{\partial \theta_0}$ $\theta_1 := \theta_1 - \alpha \frac{\partial J}{\partial \theta_1}$

Where $\alpha$ is the learning rate. Both parameters should be updated simultaneously: compute both gradients from the current values of $\theta_0$ and $\theta_1$ before applying either update.

The partial derivatives are: $\frac{\partial J}{\partial \theta_0} = \frac{1}{m} \sum_{i=1}^{m} (\theta_0 + \theta_1 x^{(i)} - y^{(i)})$ $\frac{\partial J}{\partial \theta_1} = \frac{1}{m} \sum_{i=1}^{m} (\theta_0 + \theta_1 x^{(i)} - y^{(i)}) x^{(i)}$
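
One way to express a single batch update in plain Python is sketched below; `gradient_step` is an illustrative helper name, not part of the required interface. Note that both gradients are computed before either parameter is changed:

```python
def gradient_step(theta0, theta1, X, y, alpha):
    """Apply one gradient descent update over the full dataset."""
    m = len(X)
    grad0 = 0.0
    grad1 = 0.0
    for x_i, y_i in zip(X, y):
        error = (theta0 + theta1 * x_i) - y_i
        grad0 += error          # term of dJ/d(theta_0)
        grad1 += error * x_i    # term of dJ/d(theta_1)
    grad0 /= m
    grad1 /= m
    # Simultaneous update: both new values use the same old theta0 and theta1.
    return theta0 - alpha * grad0, theta1 - alpha * grad1
```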

Your implementation should:

  1. Initialize $\theta_0$ and $\theta_1$ (e.g., to 0).
  2. Iterate a specified number of times (epochs).
  3. In each iteration, calculate the predicted values for all data points.
  4. Calculate the gradients for $\theta_0$ and $\theta_1$ based on the current predictions and actual values.
  5. Update $\theta_0$ and $\theta_1$ using the learning rate and the calculated gradients.
  6. Return the final values of $\theta_0$ and $\theta_1$.
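
Putting the steps above together, one possible plain-Python sketch of the whole procedure looks like this; the function name and signature are a reasonable choice rather than a mandated interface, and the empty-data guard matches the edge case described below:

```python
def gradient_descent(X, y, learning_rate, epochs):
    """Batch gradient descent for simple linear regression.

    Returns (theta0, theta1) after the given number of epochs.
    """
    theta0, theta1 = 0.0, 0.0            # step 1: initialize parameters
    m = len(X)
    if m == 0:                           # avoid division by zero on empty data
        return theta0, theta1

    for _ in range(epochs):              # step 2: iterate for `epochs` passes
        grad0 = 0.0
        grad1 = 0.0
        for x_i, y_i in zip(X, y):       # steps 3-4: predictions and gradients
            error = (theta0 + theta1 * x_i) - y_i
            grad0 += error
            grad1 += error * x_i
        grad0 /= m
        grad1 /= m
        theta0 -= learning_rate * grad0  # step 5: simultaneous parameter update
        theta1 -= learning_rate * grad1

    return theta0, theta1                # step 6: final parameters
```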

Edge Cases:

  • Empty input data: The function should handle this gracefully rather than dividing by zero; as Example 3 shows, returning the initial parameters is the expected behavior.
  • Zero learning rate: The parameters will not update.
  • Very large learning rate: Could lead to divergence.

Examples

Example 1:
Input: X = [1, 2, 3, 4, 5], y = [2, 4, 5, 4, 5], learning_rate = 0.01, epochs = 1000

Output: (1.0200000000000008, 0.7999999999999999)

Explanation: After 1000 iterations of gradient descent with a learning rate of 0.01, the parameters reach the approximate values shown. With this configuration the algorithm is still moving toward the least-squares fit, so the expected output reflects the parameter values after exactly 1000 updates.
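
Assuming the `gradient_descent` sketch above, Example 1 would be run like this; per the expected output, the returned values should land near (1.02, 0.80):

```python
X = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
theta0, theta1 = gradient_descent(X, y, learning_rate=0.01, epochs=1000)
print(theta0, theta1)  # expected to be close to the example output above
```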

Example 2:
Input: X = [2.5, 5.0, 7.5, 10.0], y = [5.0, 7.5, 10.0, 12.5], learning_rate = 0.02, epochs = 1500

Output: (2.500000000000002, 1.0)

Explanation: This example represents a dataset that perfectly fits the line $y = 2.5 + 1.0 * x$. Gradient descent, even with a slightly different learning rate and more epochs, successfully converges to these near-perfect parameter values.

Example 3 (edge case: empty data):
Input: X = [], y = [], learning_rate = 0.01, epochs = 100

Output: (0.0, 0.0)

Explanation: When the input data is empty, the m value will be 0. The implementation should handle this division by zero scenario. Returning the initial parameters (0.0, 0.0) is a reasonable behavior in this case.

Constraints

  • The input X and y will be lists of numbers (integers or floats).
  • X and y will have the same length.
  • learning_rate will be a float, typically between 0.001 and 0.1.
  • epochs will be an integer, typically between 100 and 5000.
  • The number of data points m will not exceed 1000.
  • The number of epochs will not exceed 10000.

Notes

  • Implement this without using any external libraries such as NumPy or Scikit-learn for the core gradient descent logic; standard Python lists and arithmetic operations are sufficient.
  • Consider how to handle the case where m (number of data points) is zero to avoid division by zero errors.
  • The accuracy of the returned parameters will depend on the learning rate and the number of epochs. The examples provide target outputs for specific configurations.
  • Think about the data types you are using for calculations to maintain precision.