{ "cells": [ { "cell_type": "code", "execution_count": 1, "metadata": { "tags": [ "remove-input" ] }, "outputs": [], "source": [ "from datascience import *\n", "%matplotlib inline\n", "path_data = 'https://raw.githubusercontent.com/ChemeketaCS/datasci-textbook/main/assets/data/'\n", "import matplotlib.pyplot as plots\n", "plots.style.use('fivethirtyeight')\n", "import numpy as np" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Confidence Intervals\n", "We have developed a method for estimating a parameter by using random sampling and the bootstrap. Our method produces an interval of estimates, to account for chance variability in the random sample. By providing an interval of estimates instead of just one estimate, we give ourselves some wiggle room.\n", "\n", "In the previous example, we saw that our process of estimation produced a good interval about 95% of the time, a \"good\" interval being one that contains the parameter. We say that we are *95% confident* that the process results in a good interval. Our interval of estimates is called a *95% confidence interval* for the parameter, and 95% is called the *confidence level* of the interval.\n", "\n", "The method is called the *bootstrap percentile method* because the interval is formed by picking off two percentiles of the bootstrapped estimates.\n", "\n", "The situation in the previous example was a bit unusual. Because we happened to know the value of the parameter, we were able to check whether an interval was good or a dud, and this in turn helped us to see that our process of estimation captured the parameter about 95 out of every 100 times we used it.\n", "\n", "But usually, data scientists don't know the value of the parameter. That is the reason they want to estimate it in the first place. In such situations, they provide an interval of estimates for the unknown parameter by using methods like the one we have developed. 
Because of statistical theory and demonstrations like the one we have seen, data scientists can be confident that their process of generating the interval results in a good interval a known percentage of the time." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Estimating a Population Median\n", "\n", "We will now use the bootstrap method to estimate an unknown population median. You have encountered this dataset before. It comes from a sample of newborns in a large hospital system. We will treat it as if it were a simple random sample, though the sampling was done in multiple stages. [Stat Labs](https://www.stat.berkeley.edu/~statlabs/) by Deborah Nolan and Terry Speed has details about a larger dataset from which this set is drawn.\n", "\n", "The table `births` contains the following variables for mother-baby pairs: the baby's birth weight in ounces, the number of gestational days (the number of days the mother was pregnant), the mother's age in completed years, the mother's height in inches, the mother's pregnancy weight in pounds, and whether or not the mother smoked during pregnancy." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "births = Table.read_table(path_data + 'baby.csv')" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
Birth Weight | Gestational Days | Maternal Age | Maternal Height | Maternal Pregnancy Weight | Maternal Smoker | \n", "
---|---|---|---|---|---|
120 | 284 | 27 | 62 | 100 | False | \n", "
113 | 282 | 33 | 64 | 135 | False | \n", "
128 | 279 | 28 | 64 | 115 | True | \n", "
... (1171 rows omitted)
" ], "text/plain": [ "Birth Weight | Gestational Days | Ratio BW:GD | \n", "
---|---|---|
120 | 284 | 0.422535 | \n", "
113 | 282 | 0.400709 | \n", "
128 | 279 | 0.458781 | \n", "
108 | 282 | 0.382979 | \n", "
136 | 286 | 0.475524 | \n", "
138 | 244 | 0.565574 | \n", "
132 | 245 | 0.538776 | \n", "
120 | 289 | 0.415225 | \n", "
143 | 299 | 0.478261 | \n", "
140 | 351 | 0.39886 | \n", "
... (1164 rows omitted)
" ], "text/plain": [ "Birth Weight | Gestational Days | Ratio BW:GD\n", "120 | 284 | 0.422535\n", "113 | 282 | 0.400709\n", "128 | 279 | 0.458781\n", "108 | 282 | 0.382979\n", "136 | 286 | 0.475524\n", "138 | 244 | 0.565574\n", "132 | 245 | 0.538776\n", "120 | 289 | 0.415225\n", "143 | 299 | 0.478261\n", "140 | 351 | 0.39886\n", "... (1164 rows omitted)" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ratios" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here is a histogram of the ratios." ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "Birth Weight | Gestational Days | Ratio BW:GD | \n", "
---|---|---|
116 | 148 | 0.783784 | \n", "