Decision trees | 决策树 jué cè shù

Today, we focused the class on decision trees. A decision tree is a way to organize data. You can look at it this way: you ask a series of questions, make a decision at each one, and organize the data based on these decisions.

For example, suppose our data consists of the colors and shapes of 3 pieces of fruit. We have 1 yellow apple, 1 red apple and 1 yellow banana. We have two features: shape and color.

By organizing our data, we can identify types of fruit. We go through our data on shape and color one by one. If we first organize our data by color, we will incorrectly group the yellow apple with the yellow banana. But if we first organize our data by shape, that will right away separate the apples from the banana. So we organize this data by shape, and then by color (if we want to make a distinction between the yellow apple and the red apple).
The best way to organize the data will very likely be different for another dataset of fruits. But you get the point: we organize data to best group things. At each step of the way, our data gets more organized. In the language of entropy, the “energy distribution” has moved to a lower entropy.
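To make "more organized" concrete, here is a minimal sketch that scores each question on the three fruits above. It uses Shannon entropy, the measure that classification trees commonly use, as a stand-in for the "energy distribution" idea. The shape labels "round" and "long" and the function names are made up for illustration.

```python
from collections import Counter
from math import log2

# The three fruits from the example: (shape, color, fruit type).
# "round" and "long" are made-up labels for the shape feature.
fruits = [
    ("round", "yellow", "apple"),
    ("round", "red", "apple"),
    ("long", "yellow", "banana"),
]

def entropy(labels):
    """Shannon entropy in bits of a list of labels; 0 means perfectly sorted."""
    total = len(labels)
    return sum(-(n / total) * log2(n / total) for n in Counter(labels).values())

def entropy_after_split(rows, feature_index):
    """Weighted average entropy of the fruit types after grouping by one feature."""
    groups = {}
    for row in rows:
        groups.setdefault(row[feature_index], []).append(row[2])
    return sum(len(g) / len(rows) * entropy(g) for g in groups.values())

before = entropy([row[2] for row in fruits])
by_shape = entropy_after_split(fruits, 0)
by_color = entropy_after_split(fruits, 1)

print(f"before any question: {before:.3f} bits")
print(f"after asking shape:  {by_shape:.3f} bits")
print(f"after asking color:  {by_color:.3f} bits")
```

Asking about shape first leaves no mixed groups, while asking about color first still leaves the yellow apple and the yellow banana together, which matches the reasoning above.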

That was a classification tree model.

When we build lots of decision trees on different random parts of a larger dataset, we have the so-called “random forest” 随机森林 model, originated by Leo Breiman.

We showed in class how to code a decision tree from scratch. Here is a shorter version using the Python sklearn library.

 

Entropy | 熵 shāng

Nothing is lost, nothing is created, everything is transformed.
― Antoine Lavoisier (26 August 1743 – 8 May 1794)

Unlike before, we started the class today with a quote. This is because it is really difficult to talk about entropy. We made many analogies (water flows from high to low; a broken mirror never, or almost never, becomes whole again) to bring our attention to how things work in daily life, which we have taken for granted.

One theory says that the universe started with the Big Bang, a state of very low entropy. There are many more states of high entropy than of low entropy (imagine 10…000 to 1). So a system would have to cycle through lots of high-entropy states before its entropy is low again. We have barely touched the topic. Our true goal is to talk about the so-called decision tree model, which we will cover tomorrow.

To help you remember the word “entropy” and its meaning (as if we knew!): “en” sounds like the start of “energy”, and “tropy” comes from the Greek word “tropē”, which means “transformation”.

Entropy is a measure of the number of possible ways energy can be distributed in a system.
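A small counting sketch can show why there are so many more high-entropy states than low-entropy ones. As a stand-in for energy, it uses 10 coins (an assumption made only for illustration): "all tails" can happen in just one way, while "half heads" can happen in many ways. The natural log of the number of ways plays the role of entropy here, with the physical constant left out.

```python
from math import comb, log

coins = 10  # a made-up toy system: each coin is either heads or tails

# Number of different arrangements that give each possible count of heads
ways_by_heads = {heads: comb(coins, heads) for heads in range(coins + 1)}

for heads, ways in ways_by_heads.items():
    # log(ways) grows with the number of arrangements, like entropy does
    print(f"{heads:2d} heads: {ways:4d} ways, ln(ways) = {log(ways):.2f}")
```

The orderly states (0 or 10 heads) each have a single arrangement, and the mixed states in the middle have hundreds.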

By the way, Lavoisier 拉瓦锡 was a great chemist.

Cosmic distance ladder | 宇宙距离

The class has no homework today.  We watched the video lecture by Terence Tao (see link below).   The name of the video is “Cosmic Distance Ladder”.  Quite a mystifying name.

The stories Terence Tao told in the lecture were about philosophers and astronomers from ancient times, such as Aristotle and others, and about those who were closer to us in history. What all of them have in common is that they used good observations and ingenious reasoning to indirectly measure the distance from the Earth to the Moon, to the Sun, and to the galaxies. They did so with little or no technology (the earliest did not even know the number Pi) and with amazing accuracy, as verified by what we know today.

You should definitely watch the video a few times. Think about this: compared with human observation and reasoning, what computers can do is still just technology and tools. Computers can’t do the indirect reasoning that connects the dots between disparate pieces of information. It makes zero sense to believe computers (including phones) are smarter than you are.

So, use your great mind. Let your mind observe and reason, and make computers help you along the way.

 

 

 

zero, one and two | 零,一,二

It is not easy for a young child to comprehend multiplication by 1, because multiplication is often taught in school as a robotic multiplication table. She or he can very quickly answer multiplications by 2 or 3. But questions like “what is the product of 1, 2, 3, 4” (i.e. 4 factorial) can get a wide range of answers because the number “1” confuses the young mind.

Psychologists say that an infant learns the number 2 before the number 1. And we can see why: with 2, there is something to compare against, like two fingers. If there is only one finger, there is no variation, and that is confusing.

When we teach multiplication, we should not forget to show that math is an integral part of the real world around us. Multiplication was invented to simplify repeated addition. Multiplying by 1 means just the thing itself. Multiplying by 2 means adding two of this thing together. Multiplying by 3 means adding three of the thing together. The thing can be a bag of candies or the square footage of a home.

Finally, we should show children how to use computers (not calculators) to do computations. A question like “give me the sum from 1 to 199” can be solved within seconds with a math trick, but the same trick won’t work for a slightly different question, “give me the product from 1 to 199”. If you know how to make the computer do the job, you can still answer it within seconds.
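Here is one way the computer could do both jobs, as a minimal sketch using only the Python standard library. The "math trick" for the sum is assumed to be the pairing formula n(n + 1)/2; the variable names are chosen for illustration.

```python
import math

n = 199

# The sum: the pairing trick n * (n + 1) / 2, checked against plain adding
sum_by_trick = n * (n + 1) // 2
sum_by_adding = sum(range(1, n + 1))

# The product: no such shortcut, but the computer multiplies for us.
# math.prod needs Python 3.8 or later; math.factorial(n) gives the same number.
product = math.prod(range(1, n + 1))

print(sum_by_trick, sum_by_adding)
print(product)
print("digits in the product:", len(str(product)))
```

Python integers can grow as large as needed, so the product is exact even though it is far too long to write by hand.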

 

argmax, argmin, argsort and quick sort | 快速排序

This Saturday class we went over indexing and ordering a group of items by their sorted indices. For those who are more advanced, please go over the section on quick sort.

For example,

>>> import numpy as np
>>> backpack = np.array(['snack', 'book', 'pen', 'eraser', 'apple'])
# Position of the word that comes last alphabetically
>>> np.argmax(backpack)

[out]: 0
# Position of the word that comes first alphabetically
>>> np.argmin(backpack)

[out]: 4

# Positions of the words in alphabetical order
>>> np.argsort(backpack)

[out]: array([4, 1, 3, 2, 0], dtype=int64)

Now let us sort them:
>>> backpack[np.argsort(backpack)]

[out]: array(['apple', 'book', 'eraser', 'pen', 'snack'], dtype='<U6')

Then we tried sorting numbers:

>>> numbers = np.array([2,3,5,7,1,4,6,15,5,2,7,9,10,15,9,17,12])
>>> numbers[np.argsort(numbers)]

[out]: array([ 1, 2, 2, 3, 4, 5, 5, 6, 7, 7, 9, 9, 10, 12, 15, 15, 17])


Finally we dig deeper: how do you really sort things fast systematically? Using quick sort!
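For those reading the quick sort section, here is a minimal sketch of the idea in plain Python: pick a pivot, split the items into smaller, equal and larger groups, and sort the smaller and larger groups the same way. This is a simple teaching version that builds new lists, not necessarily the version shown in class, and the function name is chosen for illustration.

```python
def quick_sort(items):
    """Return a new sorted list using the quick sort idea."""
    if len(items) <= 1:
        return list(items)  # nothing left to sort
    pivot = items[len(items) // 2]  # use the middle item as the pivot
    smaller = [x for x in items if x < pivot]
    equal = [x for x in items if x == pivot]
    larger = [x for x in items if x > pivot]
    # Sort each side the same way, then glue the pieces together
    return quick_sort(smaller) + equal + quick_sort(larger)

numbers = [2, 3, 5, 7, 1, 4, 6, 15, 5, 2, 7, 9, 10, 15, 9, 17, 12]
words = ['snack', 'book', 'pen', 'eraser', 'apple']

print(quick_sort(numbers))
print(quick_sort(words))
```

Each split cuts the problem into smaller pieces, which is why quick sort is usually much faster than comparing every item with every other item.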

 

Logarithm | 对数

As we explored in previous classes, division is subtracting again and again, and multiplication is adding again and again. Exponentiation is multiplying again and again. They are all inventions to simplify repeated computation.

So is the logarithm: taking a log is dividing again and again. Logarithms were introduced in 1614 by John Napier, a Scottish mathematician, physicist, and astronomer, as a means to simplify calculations.

🙂 Today’s  Python numpy class summary:

log10 tells you how many times you must divide by 10 to get back to 1. log10(100) gives 2 because dividing 100 by 10 twice returns 1.
>>> np.log10(100)
Dividing one trillion by 10 twelve times returns 1, so its log10 is 12.
>>> np.log10(1000000000000)
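The "divide again and again" idea can be sketched directly as a loop. The function name `count_divisions` is made up for illustration. For exact powers of the base it matches the logarithm, and for other whole numbers it gives the logarithm rounded down.

```python
import math

def count_divisions(number, base=10):
    """Count how many times a whole number can be divided by base before it drops below base."""
    steps = 0
    while number >= base:
        number //= base  # whole-number division keeps the arithmetic exact
        steps += 1
    return steps

print(count_divisions(100), math.log10(100))
print(count_divisions(1000000000000))
print(count_divisions(8, base=2))
```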

>>> np.linspace(0.0, 3.0, num=4)
Out: array([0., 1., 2., 3.])

>>> np.logspace(0.0, 3.0, num=4)
Out: array([   1.,   10.,  100., 1000.])

>>> np.linspace(0.0, 12.0, num=13)
Out: array([ 0.,  1.,  2.,  3.,  4.,  5.,  6.,  7.,  8.,  9., 10., 11., 12.])
>>> np.logspace(0.0, 12.0, num=13)
Out: array([1.e+00, 1.e+01, 1.e+02, 1.e+03, 1.e+04, 1.e+05, 1.e+06, 1.e+07, 1.e+08, 1.e+09, 1.e+10, 1.e+11, 1.e+12])

Bonus: Did you know that engineers and scientists used a tool called the “slide rule” (计算尺) to do logarithmic computations until the 1970s, when electronic computers and calculators came into use? You should go and check whether any of your grandparents have one.

Count non-zeros using numpy.count_nonzero | 数非零数

Today our class practiced making the computer count the non-zero numbers in an array using the numpy library from Python. This can be useful if you have a ton of numbers.

import numpy as np; import pandas as pd

some_array = np.array([[0,1,7,0,0],[3,0,0,2,19]])

array([[ 0,  1,  7,  0,  0],
[ 3,  0,  0,  2, 19]])

np.count_nonzero(some_array)

5

# Count the non-zeros in each column (moving down the rows)
np.count_nonzero(some_array, axis=0)

array([1, 1, 1, 1, 1], dtype=int64)

# Count the non-zeros in each row (moving across the columns)
np.count_nonzero(some_array, axis=1)

array([2, 3], dtype=int64)

We talked about this example:

d = {'Basket1': [3, 0], 'Basket2': [3, 4]}
df = pd.DataFrame(data=d, index=['Apple','Chips'])

# Count the non-zeros in each column, i.e. in each basket
pd.Series(np.count_nonzero(df, axis=0), index=df.columns.tolist())

This was the result we got.

Basket1    1
Basket2    2
dtype: int64

That was a very tiny dataset. If we have a dataset with a million rows and columns, we should definitely let the computer do the counting this way!
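To get a feel for a bigger case, here is a minimal sketch with one million made-up numbers (a 1000 by 1000 array of random integers, generated only for illustration). It uses the same `np.count_nonzero` calls as above.

```python
import numpy as np

rng = np.random.default_rng(seed=0)
# One million made-up whole numbers from 0 to 4, so roughly one in five is zero
big_array = rng.integers(low=0, high=5, size=(1000, 1000))

total_nonzero = np.count_nonzero(big_array)        # one number for the whole array
per_column = np.count_nonzero(big_array, axis=0)   # one count per column
per_row = np.count_nonzero(big_array, axis=1)      # one count per row

print(total_nonzero)
print(per_column.shape, per_row.shape)
```

The per-column counts and the per-row counts both add up to the overall count, which is a quick way to check the result.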

A great collection of Python notebooks | Python 笔记本集

Here is a really great collection of Python notebooks with lots and lots of links.  We start with some appetizers:

But there are so many and so much more!  You can find them from this page:

Mathematics

    • Linear algebra with Cython. A tutorial that styles the notebook differently to show that you can produce high-quality typography online with the Notebook. By Carl Vogel.


Math olympiad medal count analysis | 奥数奖牌分析

The International Mathematical Olympiad (IMO) is an annual six-problem mathematical olympiad for pre-college students younger than 20. The first IMO was held in Romania in 1959. As we will see, eastern Europeans were top performers in the IMO in the earlier years. You can find the summary data analysis in our Jupyter Notebook on GitHub.

The IMO has been held annually since 1959, except in 1980 (what happened in 1980?). More than 100 countries, representing over 90% of the world’s population, send teams of up to six students (under 20 years old) to compete.

Problems cover extremely difficult algebra, pre-calculus, and branches of mathematics not conventionally covered at school and often not at university level either, such as
– projective and complex geometry
– functional equations
– combinatorics
– number theory (where extensive knowledge of theorems is required).

No calculus is required. Supporters of not requiring calculus claim that this allows “more universality and creates an incentive to find elegant, deceptively simple-looking problems which nevertheless require a certain level of ingenuity”.

   Rank        Country  Appearance  Gold  Silver  Bronze  Honorable_Mentions
0     1          China          32   147      33       6                   0
1     2  United States          43   119     111      29                   1
2     3         Russia          26    92      52      12                   0
3     4        Hungary          57    81     160      95                  10
4     5   Soviet Union          29    77      67      45                   0
5     6        Romania          58    75     141     100                   4
6     7    South Korea          30    70      67      27                   7
7     8        Vietnam          41    59     109      70                   1
8     9       Bulgaria          58    53     111     107                  10
9    10        Germany          40    49      98      75                  11

Debt | 债务

Exactly two years ago, we wrote about the US national debt. It was close to $20 trillion at that time. Now it is $22 trillion.


These are very large numbers.

But large is only a relative term, depending on the unit we are using, and relative to what.

According to the Institute of International Finance, global debt, as of 3Q2018, is close to $244 trillion.
About one third of that debt was added in the last ten years or so. If one third of today's total is new, then the older two thirds grew by about one half over the last ten years.
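The step from "one third was added" to "grew by a half" can be checked with a few lines of arithmetic. This sketch takes the $244 trillion figure and treats "about one third" as exactly one third, which is a simplifying assumption.

```python
from fractions import Fraction

debt_now = Fraction(244)             # trillions of dollars, the figure quoted above
added = debt_now * Fraction(1, 3)    # assumption: exactly one third was added
debt_before = debt_now - added       # what the debt was about ten years earlier
growth = added / debt_before         # growth relative to the earlier amount

print(float(debt_before))
print(growth)
```

The growth comes out as one half no matter what the starting total is, because adding one third of the new total is the same as adding one half of the old total.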

You can see it from the Global Debt Monitor January 2019 Report.

This probably does not mean much to you or me, unless we have some comparisons.

Visuals can help us see the numbers, but they stop short of helping us understand the numbers, since money in dollars is just money in dollars unless we compare it with something.

How about we compare it with GDP (gross domestic product)? GDP is the dollar value of all the goods and services people produce over a period of time.

So the debt to GDP ratio is like the amount of money you owe at the end of the year relative to the amount of money you have made over the year. When the ratio is over 1, it means what we owe is more than what we have made in a year.
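As a minimal sketch of the ratio, here is the household version of the same idea. The dollar amounts and the function name are hypothetical, chosen only to show a ratio above 1.

```python
def debt_to_income_ratio(debt, income):
    """Amount owed divided by the amount made over the year."""
    if income <= 0:
        raise ValueError("income must be positive")
    return debt / income

# Hypothetical household: owes $60,000 at year end and made $50,000 during the year
owed = 60_000
earned = 50_000
ratio = debt_to_income_ratio(owed, earned)

print(ratio)
print("owes more than a year of income" if ratio > 1 else "owes less than a year of income")
```

For a country, the same division is done with the national debt on top and GDP on the bottom.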

Now hopefully we can understand the ratio a little bit.

For a great narrative of the history of the US debt to GDP ratio, see “The Long Story of U.S. Debt, From 1790 to 2011, in 1 Little Chart” from The Atlantic by Matt Phillips.

The article was written on Nov 13, 2012. But history does not go away.

You can connect the dots to the following chart, which you can find from the Federal Reserve Bank of St. Louis. It seems that our debt to GDP ratio is getting close to its historical highest level.

That previous high was right after World War II.

So what is in the US debt?

The total US debt now is about $22 trillion.

The U.S. debt to China is $1.138 trillion as of October 2018. That’s 29 percent of the $3.9 trillion in Treasury bills, notes, and bonds held by foreign countries.

The rest of the $22 trillion national debt is held by either the American people or the U.S. government itself. China holds the greatest amount of U.S. debt of any foreign country.

Domestically, the total US household debt as of 4Q2018 is $13.54 trillion (New York Fed). For a fantastic and fascinating visual account of the numbers, see the report by the New York Fed.

You can find the numbers and reports easily from the different Federal Reserve banks and from government offices such as the Congressional Budget Office and the US Treasury.

These numbers, ratios and time series by component are a lot more interesting and tell a whole lot more than everyday noisy news.