{ "cells": [ { "cell_type": "code", "execution_count": 24, "metadata": { "tags": [ "remove-input" ] }, "outputs": [], "source": [ "from datascience import *\n", "import numpy as np\n", "path_data = 'https://raw.githubusercontent.com/ChemeketaCS/datasci-textbook/main/assets/data/'\n", "np.set_printoptions(threshold=50)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Example: Population Trends\n", "\n", "We are now ready to work with large tables of data. The file below contains \"Annual Estimates of the Resident Population by Single Year of Age and Sex for the United States.\" Notice that `read_table` can read data directly from a URL." ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
SEX AGE CENSUS2010POP ESTIMATESBASE2010 POPESTIMATE2010 POPESTIMATE2011 POPESTIMATE2012 POPESTIMATE2013 POPESTIMATE2014 POPESTIMATE2015 POPESTIMATE2016 POPESTIMATE2017 POPESTIMATE2018 POPESTIMATE2019
0 0 3944153 3944160 3951430 3963092 3926570 3931258 3954787 3983981 3954773 3893990 3815343 3783052
0 1 3978070 3978090 3957730 3966225 3977549 3942698 3948891 3973133 4002903 3972711 3908830 3829599
0 2 4096929 4096939 4090621 3970654 3978925 3991740 3958711 3966321 3991349 4020045 3987032 3922044
0 3 4119040 4119051 4111688 4101644 3981531 3991017 4005928 3974351 3982984 4006946 4033038 3998665
0 4 4063170 4063186 4077346 4121488 4111490 3992502 4004032 4020292 3989750 3997280 4018719 4043323
0 5 4056858 4056872 4064521 4087054 4131049 4121876 4004576 4017589 4035033 4003452 4008443 4028281
0 6 4066381 4066412 4072904 4074531 4096631 4141126 4133372 4017388 4031568 4048018 4014057 4017227
0 7 4030579 4030594 4042990 4082821 4084175 4106756 4152666 4145872 4030888 4044139 4058370 4022319
0 8 4046486 4046497 4025501 4052773 4092559 4094513 4118349 4165033 4158848 4042924 4054236 4066194
0 9 4148353 4148369 4125312 4035319 4062726 4103052 4106068 4130887 4177895 4170813 4053179 4061874
\n", "

... (296 rows omitted)

" ], "text/plain": [ "SEX | AGE | CENSUS2010POP | ESTIMATESBASE2010 | POPESTIMATE2010 | POPESTIMATE2011 | POPESTIMATE2012 | POPESTIMATE2013 | POPESTIMATE2014 | POPESTIMATE2015 | POPESTIMATE2016 | POPESTIMATE2017 | POPESTIMATE2018 | POPESTIMATE2019\n", "0 | 0 | 3944153 | 3944160 | 3951430 | 3963092 | 3926570 | 3931258 | 3954787 | 3983981 | 3954773 | 3893990 | 3815343 | 3783052\n", "0 | 1 | 3978070 | 3978090 | 3957730 | 3966225 | 3977549 | 3942698 | 3948891 | 3973133 | 4002903 | 3972711 | 3908830 | 3829599\n", "0 | 2 | 4096929 | 4096939 | 4090621 | 3970654 | 3978925 | 3991740 | 3958711 | 3966321 | 3991349 | 4020045 | 3987032 | 3922044\n", "0 | 3 | 4119040 | 4119051 | 4111688 | 4101644 | 3981531 | 3991017 | 4005928 | 3974351 | 3982984 | 4006946 | 4033038 | 3998665\n", "0 | 4 | 4063170 | 4063186 | 4077346 | 4121488 | 4111490 | 3992502 | 4004032 | 4020292 | 3989750 | 3997280 | 4018719 | 4043323\n", "0 | 5 | 4056858 | 4056872 | 4064521 | 4087054 | 4131049 | 4121876 | 4004576 | 4017589 | 4035033 | 4003452 | 4008443 | 4028281\n", "0 | 6 | 4066381 | 4066412 | 4072904 | 4074531 | 4096631 | 4141126 | 4133372 | 4017388 | 4031568 | 4048018 | 4014057 | 4017227\n", "0 | 7 | 4030579 | 4030594 | 4042990 | 4082821 | 4084175 | 4106756 | 4152666 | 4145872 | 4030888 | 4044139 | 4058370 | 4022319\n", "0 | 8 | 4046486 | 4046497 | 4025501 | 4052773 | 4092559 | 4094513 | 4118349 | 4165033 | 4158848 | 4042924 | 4054236 | 4066194\n", "0 | 9 | 4148353 | 4148369 | 4125312 | 4035319 | 4062726 | 4103052 | 4106068 | 4130887 | 4177895 | 4170813 | 4053179 | 4061874\n", "... (296 rows omitted)" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# As of August 2021, this census file is online here: \n", "data = 'http://www2.census.gov/programs-surveys/popest/technical-documentation/file-layouts/2010-2019/nc-est2019-agesex-res.csv'\n", "\n", "# A local copy can be accessed here in case census.gov moves the file:\n", "# data = path_data + 'nc-est2019-agesex-res.csv'\n", "\n", "full_census_table = Table.read_table(data)\n", "full_census_table" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Only the first 10 rows of the table are displayed. Later we will see how to display the entire table; however, this is typically not useful with large tables.\n", "\n", "A [description of the table](https://www2.census.gov/programs-surveys/popest/technical-documentation/file-layouts/2010-2019/nc-est2019-agesex-res.pdf) appears online. \n", "\n", "The `SEX` column contains numeric codes: `0` stands for the total, `1` for male, and `2` for female. The assumptions underlying this binary categorization are questionable. We will discuss them in the next section.\n", "\n", "In this section we will focus on age and population size.\n", "\n", "The `AGE` column contains ages in completed years, but the special value `999` is a sum of the total population. The \"age\" `100` too has a special status. In that category, the Census Bureau includes all people aged 100 or more.\n", "\n", "The other columns contain estimates of the US population in each category of sex and age in the years 2010 through 2019. The Census is decennial: it takes place every 10 years. The most recent Census was held in 2020 and the one before that in 2010. The Census Bureau also estimates population changes each year. As explained in the Bureau's description of its [methodology](https://www2.census.gov/programs-surveys/popest/technical-documentation/methodology/2010-2020/methods-statement-v2020-final.pdf), it \"adds [the estimated changes] to the last decennial census to produce updated population estimates every year.\"\n", "\n", "Typically, a publicly available table will contain more information than necessary for a particular investigation or analysis. To get the large table into a more usable form, we have to do some *data cleaning*.\n", "\n", "Suppose we are only interested in the population changes from 2014 to 2019. Let's `select` the relevant columns." ] }, { "cell_type": "code", "execution_count": 26, "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
SEX AGE POPESTIMATE2014 POPESTIMATE2019
0 0 3954787 3783052
0 1 3948891 3829599
0 2 3958711 3922044
0 3 4005928 3998665
0 4 4004032 4043323
0 5 4004576 4028281
0 6 4133372 4017227
0 7 4152666 4022319
0 8 4118349 4066194
0 9 4106068 4061874
\n", "

... (296 rows omitted)

" ], "text/plain": [ "SEX | AGE | POPESTIMATE2014 | POPESTIMATE2019\n", "0 | 0 | 3954787 | 3783052\n", "0 | 1 | 3948891 | 3829599\n", "0 | 2 | 3958711 | 3922044\n", "0 | 3 | 4005928 | 3998665\n", "0 | 4 | 4004032 | 4043323\n", "0 | 5 | 4004576 | 4028281\n", "0 | 6 | 4133372 | 4017227\n", "0 | 7 | 4152666 | 4022319\n", "0 | 8 | 4118349 | 4066194\n", "0 | 9 | 4106068 | 4061874\n", "... (296 rows omitted)" ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "partial_census_table = full_census_table.select('SEX', 'AGE', 'POPESTIMATE2014', 'POPESTIMATE2019')\n", "partial_census_table" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can simplify the labels of the selected columns." ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
SEX AGE 2014 2019
0 0 3954787 3783052
0 1 3948891 3829599
0 2 3958711 3922044
0 3 4005928 3998665
0 4 4004032 4043323
0 5 4004576 4028281
0 6 4133372 4017227
0 7 4152666 4022319
0 8 4118349 4066194
0 9 4106068 4061874
\n", "

... (296 rows omitted)

" ], "text/plain": [ "SEX | AGE | 2014 | 2019\n", "0 | 0 | 3954787 | 3783052\n", "0 | 1 | 3948891 | 3829599\n", "0 | 2 | 3958711 | 3922044\n", "0 | 3 | 4005928 | 3998665\n", "0 | 4 | 4004032 | 4043323\n", "0 | 5 | 4004576 | 4028281\n", "0 | 6 | 4133372 | 4017227\n", "0 | 7 | 4152666 | 4022319\n", "0 | 8 | 4118349 | 4066194\n", "0 | 9 | 4106068 | 4061874\n", "... (296 rows omitted)" ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "us_pop = partial_census_table.relabeled('POPESTIMATE2014', '2014').relabeled('POPESTIMATE2019', '2019')\n", "us_pop" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Ages 97-100\n", "As a warm-up, let's examine the total population, labeled by `SEX` code 0. Since all these rows will have the same value 0 in the `SEX` column, we will drop that column." ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
AGE 2014 2019
0 3954787 3783052
1 3948891 3829599
2 3958711 3922044
3 4005928 3998665
4 4004032 4043323
5 4004576 4028281
6 4133372 4017227
7 4152666 4022319
8 4118349 4066194
9 4106068 4061874
\n", "

... (92 rows omitted)

" ], "text/plain": [ "AGE | 2014 | 2019\n", "0 | 3954787 | 3783052\n", "1 | 3948891 | 3829599\n", "2 | 3958711 | 3922044\n", "3 | 4005928 | 3998665\n", "4 | 4004032 | 4043323\n", "5 | 4004576 | 4028281\n", "6 | 4133372 | 4017227\n", "7 | 4152666 | 4022319\n", "8 | 4118349 | 4066194\n", "9 | 4106068 | 4061874\n", "... (92 rows omitted)" ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" } ], "source": [ "us_pop_by_age = us_pop.where('SEX', are.equal_to(0)).drop('SEX')\n", "us_pop_by_age" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now let's look at the population in the highest ages." ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
AGE 2014 2019
97 82948 116969
98 59546 86150
99 41277 57124
100 70685 100322
" ], "text/plain": [ "AGE | 2014 | 2019\n", "97 | 82948 | 116969\n", "98 | 59546 | 86150\n", "99 | 41277 | 57124\n", "100 | 70685 | 100322" ] }, "execution_count": 29, "metadata": {}, "output_type": "execute_result" } ], "source": [ "us_pop_by_age.where('AGE', are.between(97, 101))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Not surprisingly, the numbers of people are smaller at higher ages. For example, there are fewer 99-year-olds than 98-year-olds. \n", "\n", "But the numbers for `AGE` 100 are quite a bit larger than those for age 99. That is because the row with `AGE` 100 doesn't just represent 100-year-olds. It also includes those who are older than 100. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Percent Change\n", "\n", "Each column of the table `us_pop_by_age` is an array of the same length, and so columns can be combined using arithmetic. The array below contains the change in population between 2014 and 2019. There is one entry corresponding to each row of `us_pop_by_age`." ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([-171735, -119292, -36667, ..., 15847, 29637, 9938515])" ] }, "execution_count": 30, "metadata": {}, "output_type": "execute_result" } ], "source": [ "change = us_pop_by_age.column('2019') - us_pop_by_age.column('2014')\n", "change" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can augment `us_pop_by_age` with a column that contains these changes, both in absolute terms and as percents relative to the value in 2014." ] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
AGE 2014 2019 Change Percent Change
0 3954787 3783052 -171735 -4.34%
1 3948891 3829599 -119292 -3.02%
2 3958711 3922044 -36667 -0.93%
3 4005928 3998665 -7263 -0.18%
4 4004032 4043323 39291 0.98%
5 4004576 4028281 23705 0.59%
6 4133372 4017227 -116145 -2.81%
7 4152666 4022319 -130347 -3.14%
8 4118349 4066194 -52155 -1.27%
9 4106068 4061874 -44194 -1.08%
\n", "

... (92 rows omitted)

" ], "text/plain": [ "AGE | 2014 | 2019 | Change | Percent Change\n", "0 | 3954787 | 3783052 | -171735 | -4.34%\n", "1 | 3948891 | 3829599 | -119292 | -3.02%\n", "2 | 3958711 | 3922044 | -36667 | -0.93%\n", "3 | 4005928 | 3998665 | -7263 | -0.18%\n", "4 | 4004032 | 4043323 | 39291 | 0.98%\n", "5 | 4004576 | 4028281 | 23705 | 0.59%\n", "6 | 4133372 | 4017227 | -116145 | -2.81%\n", "7 | 4152666 | 4022319 | -130347 | -3.14%\n", "8 | 4118349 | 4066194 | -52155 | -1.27%\n", "9 | 4106068 | 4061874 | -44194 | -1.08%\n", "... (92 rows omitted)" ] }, "execution_count": 31, "metadata": {}, "output_type": "execute_result" } ], "source": [ "us_pop_change = us_pop_by_age.with_columns(\n", " 'Change', change,\n", " 'Percent Change', change/us_pop_by_age.column('2014')\n", ")\n", "us_pop_change.set_format('Percent Change', PercentFormatter)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Almost all the entries displayed in the `Percent Change` column are negative, demonstrating a drop in population at the youngest ages. However, the overall population grew by about 9.9 million people, a percent change of just over 3%." ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
AGE 2014 2019 Change Percent Change
999 318301008 328239523 9938515 3.12%
" ], "text/plain": [ "AGE | 2014 | 2019 | Change | Percent Change\n", "999 | 318301008 | 328239523 | 9938515 | 3.12%" ] }, "execution_count": 32, "metadata": {}, "output_type": "execute_result" } ], "source": [ "us_pop_change.where('AGE', are.equal_to(999))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let us compare this to the change at each age. For ease of interpretation, we will sort the table in decreasing order of the absolute change in population, contained in the column `Change`." ] }, { "cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
AGE 2014 2019 Change Percent Change
72 2191642 3191048 999406 45.60%
68 2567511 3345475 777964 30.30%
69 2530460 3252423 721963 28.53%
70 2461426 3136704 675278 27.43%
71 2516392 3083083 566691 22.52%
76 1692960 2222392 529432 31.27%
62 3677408 4156645 479237 13.03%
28 4345247 4818725 473478 10.90%
64 3481789 3950578 468789 13.46%
38 3848856 4305576 456720 11.87%
\n", "

... (91 rows omitted)

" ], "text/plain": [ "AGE | 2014 | 2019 | Change | Percent Change\n", "72 | 2191642 | 3191048 | 999406 | 45.60%\n", "68 | 2567511 | 3345475 | 777964 | 30.30%\n", "69 | 2530460 | 3252423 | 721963 | 28.53%\n", "70 | 2461426 | 3136704 | 675278 | 27.43%\n", "71 | 2516392 | 3083083 | 566691 | 22.52%\n", "76 | 1692960 | 2222392 | 529432 | 31.27%\n", "62 | 3677408 | 4156645 | 479237 | 13.03%\n", "28 | 4345247 | 4818725 | 473478 | 10.90%\n", "64 | 3481789 | 3950578 | 468789 | 13.46%\n", "38 | 3848856 | 4305576 | 456720 | 11.87%\n", "... (91 rows omitted)" ] }, "execution_count": 33, "metadata": {}, "output_type": "execute_result" } ], "source": [ "us_pop_change.where(\n", " 'AGE', are.below(999)\n", ").sort('Change', descending=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Take a look at the top few rows. While the percent change is about 3% for the overall population, it jumps to well over 20% for the people in their late sixties and early seventies. This stunning change contributes to what is known as the greying of America.\n", "\n", "What could explain this large increase? We can explore this question by examining the years in which the relevant groups were born.\n", "\n", "- Those who were in the age group 69 to 72 in 2014 were born in the years 1942 to 1945. The attack on Pearl Harbor was in late 1941, and by 1942 U.S. forces were heavily engaged in a massive war that ended in 1945. \n", "\n", "- Those who were 69 to 72 years old in 2019 were born in the years 1947 to 1950, at the height of the post-WWII baby boom in the United States. \n", "\n", "The post-war jump in births is a major reason for the large changes that we have observed." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "anaconda-cloud": {}, "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.3" } }, "nbformat": 4, "nbformat_minor": 2 }