{ "cells": [ { "cell_type": "code", "execution_count": 1, "metadata": { "tags": [ "remove-input" ] }, "outputs": [], "source": [ "from datascience import *\n", "%matplotlib inline\n", "path_data = 'https://raw.githubusercontent.com/ChemeketaCS/datasci-textbook/main/assets/data/'\n", "import matplotlib.pyplot as plots\n", "plots.style.use('fivethirtyeight')\n", "import numpy as np" ] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "# Least Squares Regression\n", "In an earlier section, we developed formulas for the slope and intercept of the regression line through a *football shaped* scatter diagram. It turns out that the slope and intercept of the least squares line have the same formulas as those we developed, *regardless of the shape of the scatter plot*.\n", "\n", "We saw this in the example about Little Women, but let's confirm it in an example where the scatter plot clearly isn't football shaped. For the data, we are once again indebted to the rich [data archive of Prof. Larry Winner](http://www.stat.ufl.edu/~winner/datasets.html) of the University of Florida. A [2013 study](http://digitalcommons.wku.edu/ijes/vol6/iss2/10/) in the International Journal of Exercise Science studied collegiate shot put athletes and examined the relation between strength and shot put distance. The population consists of 28 female collegiate athletes. Strength was measured by the heaviest weight (in kilograms) that the athlete lifted in the \"1RM power clean\" in the pre-season. The distance (in meters) was the athlete's personal best." 
] }, { "cell_type": "code", "execution_count": 2, "metadata": { "tags": [ "remove-input" ] }, "outputs": [], "source": [ "def standard_units(any_numbers):\n", " \"Convert any array of numbers to standard units.\"\n", " return (any_numbers - np.mean(any_numbers))/np.std(any_numbers) \n", "\n", "def correlation(t, x, y):\n", " return np.mean(standard_units(t.column(x))*standard_units(t.column(y)))\n", "\n", "def slope(table, x, y):\n", " r = correlation(table, x, y)\n", " return r * np.std(table.column(y))/np.std(table.column(x))\n", "\n", "def intercept(table, x, y):\n", " a = slope(table, x, y)\n", " return np.mean(table.column(y)) - a * np.mean(table.column(x))\n", "\n", "def fit(table, x, y):\n", " \"\"\"Return the height of the regression line at each x value.\"\"\"\n", " a = slope(table, x, y)\n", " b = intercept(table, x, y)\n", " return a * table.column(x) + b" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "shotput = Table.read_table(path_data + 'shotput.csv')" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Weight Lifted | Shot Put Distance
37.5 | 6.4
51.5 | 10.2
61.3 | 12.4
61.3 | 13
63.6 | 13.2
66.1 | 13
70 | 12.7
92.7 | 13.9
90.5 | 15.5
90.5 | 15.8
\n", "

... (18 rows omitted)

" ], "text/plain": [ "Weight Lifted | Shot Put Distance\n", "37.5 | 6.4\n", "51.5 | 10.2\n", "61.3 | 12.4\n", "61.3 | 13\n", "63.6 | 13.2\n", "66.1 | 13\n", "70 | 12.7\n", "92.7 | 13.9\n", "90.5 | 15.5\n", "90.5 | 15.8\n", "... (18 rows omitted)" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "shotput" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "shotput.scatter('Weight Lifted')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "That's not a football shaped scatter plot. In fact, it seems to have a slight non-linear component. But if we insist on using a straight line to make our predictions, there is still one best straight line among all straight lines.\n", "\n", "Our formulas for the slope and intercept of the regression line, derived for football shaped scatter plots, give the following values." ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.09834382159781997" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "slope(shotput, 'Weight Lifted', 'Shot Put Distance')" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "5.959629098373952" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "intercept(shotput, 'Weight Lifted', 'Shot Put Distance')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Does it still make sense to use these formulas even though the scatter plot isn't football shaped? We can answer this by finding the slope and intercept of the line that minimizes the mse.\n", "\n", "We will define the function `shotput_linear_mse` to take an arbirtary slope and intercept as arguments and return the corresponding mse. Then `minimize` applied to `shotput_linear_mse` will return the best slope and intercept." 
] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "def shotput_linear_mse(any_slope, any_intercept):\n", "    x = shotput.column('Weight Lifted')\n", "    y = shotput.column('Shot Put Distance')\n", "    fitted = any_slope*x + any_intercept\n", "    return np.mean((y - fitted) ** 2)" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([0.09834382, 5.95962911])" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "minimize(shotput_linear_mse)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "These values are the same as those we got by using our formulas. To summarize:\n", "\n", "**No matter what the shape of the scatter plot, there is a unique line that minimizes the mean squared error of estimation. It is called the regression line, and its slope and intercept are given by**\n", "\n", "$$\n", "\\mathbf{\\mbox{slope of the regression line}} ~=~ r \\cdot\n", "\\frac{\\mbox{SD of }y}{\\mbox{SD of }x}\n", "$$\n", "\n", "$$\n", "\\mathbf{\\mbox{intercept of the regression line}} ~=~\n", "\\mbox{average of }y ~-~ \\mbox{slope} \\cdot \\mbox{average of }x\n", "$$" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "fitted = fit(shotput, 'Weight Lifted', 'Shot Put Distance')\n", "shotput.with_column('Best Straight Line', fitted).scatter('Weight Lifted')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Nonlinear Regression\n", "The graph above reinforces our earlier observation that the scatter plot is a bit curved. So it is better to fit a curve than a straight line. The [study](http://digitalcommons.wku.edu/ijes/vol6/iss2/10/) postulated a quadratic relation between the weight lifted and the shot put distance. So let's use quadratic functions as our predictors and see if we can find the best one. \n", "\n", "We have to find the best quadratic function among all quadratic functions, instead of the best straight line among all straight lines. The method of least squares allows us to do this.\n", "\n", "The mathematics of this minimization is complicated and not easy to see just by examining the scatter plot. But numerical minimization is just as easy as it was with linear predictors! We can get the best quadratic predictor by once again using `minimize`. Let's see how this works." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Recall that a quadratic function has the form\n", "\n", "$$\n", "f(x) ~=~ ax^2 + bx + c\n", "$$\n", "\n", "for constants $a$, $b$, and $c$.\n", "\n", "To find the best quadratic function to predict distance based on weight lifted, using the criterion of least squares, we will first write a function that takes the three constants as its arguments, calculates the fitted values by using the quadratic function above, and then returns the mean squared error. \n", "\n", "The function is called `shotput_quadratic_mse`. Notice that the definition is analogous to that of `lw_mse`, except that the fitted values are based on a quadratic function instead of linear." 
] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [], "source": [ "def shotput_quadratic_mse(a, b, c):\n", "    x = shotput.column('Weight Lifted')\n", "    y = shotput.column('Shot Put Distance')\n", "    fitted = a*(x**2) + b*x + c\n", "    return np.mean((y - fitted) ** 2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can now use `minimize` just as before to find the constants that minimize the mean squared error. " ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([-1.04004838e-03, 2.82708045e-01, -1.53182115e+00])" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "best = minimize(shotput_quadratic_mse)\n", "best" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Our prediction of the shot put distance for an athlete who lifts $x$ kilograms is about\n", "\n", "$$\n", "-0.00104x^2 ~+~ 0.2827x ~-~ 1.5318\n", "$$\n", "\n", "meters. For example, if the athlete can lift 100 kilograms, the predicted distance is about 16.34 meters. On the scatter plot, that's near the center of a vertical strip around 100 kilograms." ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "16.3382" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "(-0.00104)*(100**2) + 0.2827*100 - 1.5318" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here are the predictions for all the values of `Weight Lifted`. You can see that they go through the center of the scatter plot, to a rough approximation." ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [], "source": [ "x = shotput.column(0)\n", "shotput_fit = best.item(0)*(x**2) + best.item(1)*x + best.item(2)" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "shotput.with_column('Best Quadratic Curve', shotput_fit).scatter(0)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Note:** We fit a quadratic here because it was suggested in the original study. But it is worth noting that at the rightmost end of the graph, the quadratic curve appears to be close to peaking, after which the curve will start going downwards. So we might not want to use this model for new athletes who can lift weights much higher than those in our data set. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "anaconda-cloud": {}, "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.3" } }, "nbformat": 4, "nbformat_minor": 1 }