Decision trees | 决策树 jué cè shù

Today's class focused on decision trees. A decision tree is a way to organize data. You can look at it this way: you ask a series of questions, make a decision at each one, and sort the data according to those decisions.

For example, suppose our data consists of the colors and shapes of 3 pieces of fruit: 1 yellow apple, 1 red apple and 1 yellow banana. We have two features: shape and color.

By organizing our data, we can identify the type of each fruit. We go through the features, shape and color, one at a time. If we first organize our data by color, we will incorrectly group the yellow apple with the yellow banana. But if we first organize our data by shape, that separates the apples from the banana right away. So we organize this data by shape first, and then by color (if we want to distinguish the yellow apple from the red apple).

The best way to organize the data will very likely be different for another dataset of fruits. But you get the point: we organize data so that like things end up grouped together. With each step, our data gets more organized: each group is less mixed than before, so the "energy distribution" has lower entropy.
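To make the entropy idea concrete, here is a minimal sketch that measures how much each first question (color or shape) tidies up the three fruits. The helper names `entropy` and `information_gain` are made up for this illustration, and Shannon entropy in bits is assumed as the measure of "mixed-ness"; it is not the code from class.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def information_gain(rows, labels, feature_index):
    """Entropy before the split minus the weighted entropy after it."""
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[feature_index], []).append(label)
    after = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - after

# The three fruits from the example: (color, shape)
rows = [("yellow", "round"), ("red", "round"), ("yellow", "oblong")]
labels = ["apple", "apple", "banana"]

gain_color = information_gain(rows, labels, 0)  # ask about color first
gain_shape = information_gain(rows, labels, 1)  # ask about shape first
print(f"gain by color: {gain_color:.3f}, gain by shape: {gain_shape:.3f}")
```

Asking about shape first removes all of the entropy in this tiny dataset, while asking about color first leaves the yellow apple and the yellow banana mixed together, which matches the reasoning above.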

That was a classification tree model.

When we train lots of decision trees, each on a different random part of a larger dataset, and combine their answers, we have the so-called “random forest” 随机森林 model, originated by Leo Breiman.
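The two ingredients, random parts of the data and many trees voting together, can be sketched in a few lines. This is a toy illustration only: each "tree" here is a one-question stump, and the names `fit_stump`, `fit_forest` and `predict_forest` are hypothetical helpers, not Breiman's algorithm in full and not a library API.

```python
import random
from collections import Counter

def majority(labels):
    """Most common label in a list."""
    return Counter(labels).most_common(1)[0][0]

def fit_stump(rows, labels, feature_index):
    """A one-question tree: map each value of one feature to its majority label."""
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[feature_index], []).append(label)
    table = {value: majority(group) for value, group in groups.items()}
    return feature_index, table, majority(labels)

def predict_stump(stump, row):
    feature_index, table, default = stump
    # Fall back to the overall majority if this value was never seen
    return table.get(row[feature_index], default)

def fit_forest(rows, labels, n_trees=25, seed=0):
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        # Random part of the data: sample rows with replacement
        idx = [rng.randrange(len(rows)) for _ in rows]
        # Random feature for this tree to ask about
        feature_index = rng.randrange(len(rows[0]))
        forest.append(fit_stump([rows[i] for i in idx],
                                [labels[i] for i in idx],
                                feature_index))
    return forest

def predict_forest(forest, row):
    """Every tree votes; the most common answer wins."""
    return majority([predict_stump(stump, row) for stump in forest])

rows = [("yellow", "round"), ("red", "round"), ("yellow", "oblong")]
labels = ["apple", "apple", "banana"]
forest = fit_forest(rows, labels)
prediction = predict_forest(forest, ("yellow", "oblong"))
```

With only three fruits the vote is not very meaningful; the point is the structure: many small trees, each seeing a different random sample, combined by majority vote.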

We showed in class how to code a decision tree from scratch. Here is a shorter version using the Python scikit-learn (sklearn) library.

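# making up data: each row is [color, shape]
# Yellow = 1, Red = 2
# round = 1, oblong = 0
>>> import numpy as np
>>> from sklearn import tree
>>> training_data = np.array([
...     [1, 1],   # yellow, round  -> Apple
...     [2, 1],   # red, round     -> Apple
...     [1, 0],   # yellow, oblong -> Banana
... ])
>>> data = np.array(['Apple', 'Apple', 'Banana'])   # labels, one per row
>>> data_names = ["color", "shape"]
>>> fruit_names = ['Apple', 'Banana']
>>> clf = tree.DecisionTreeClassifier()
>>> clf = clf.fit(training_data, data)   # learn the tree from the data
>>> tree.plot_tree(clf)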

# visualize tree
>>> import graphviz 
>>> dot_data = tree.export_graphviz(clf, out_file=None) 
>>> graph = graphviz.Source(dot_data) 
>>> graph.render("fruit") 
>>> dot_data = tree.export_graphviz(clf, out_file=None, 
...                      feature_names=data_names,  
...                      class_names=fruit_names,  
...                      filled=True, rounded=True,  
...                      special_characters=True)  
>>> graph = graphviz.Source(dot_data)  
>>> graph.render()
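For comparison, here is a rough idea of what a from-scratch tree can look like on the same made-up data. This is a minimal sketch and not the code from class: it assumes Gini impurity as the measure of how mixed a group is, it only asks "is this feature equal to this value?" questions, and `gini`, `best_split`, `build_tree` and `predict` are names invented for the illustration.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 0 when every label in the group is the same."""
    total = len(labels)
    return 1.0 - sum((n / total) ** 2 for n in Counter(labels).values())

def best_split(rows, labels):
    """Return the (feature, value) question that lowers impurity the most, or None."""
    best, best_score = None, gini(labels)
    for feature in range(len(rows[0])):
        for value in sorted({row[feature] for row in rows}):
            yes = [l for r, l in zip(rows, labels) if r[feature] == value]
            no = [l for r, l in zip(rows, labels) if r[feature] != value]
            if not yes or not no:
                continue  # this question does not separate anything
            score = (len(yes) * gini(yes) + len(no) * gini(no)) / len(labels)
            if score < best_score:
                best, best_score = (feature, value), score
    return best

def build_tree(rows, labels):
    """Recursively split until no question improves the grouping."""
    split = best_split(rows, labels)
    if split is None:
        return Counter(labels).most_common(1)[0][0]  # leaf: majority label
    feature, value = split
    yes = [(r, l) for r, l in zip(rows, labels) if r[feature] == value]
    no = [(r, l) for r, l in zip(rows, labels) if r[feature] != value]
    return {
        "feature": feature,
        "value": value,
        "yes": build_tree([r for r, _ in yes], [l for _, l in yes]),
        "no": build_tree([r for r, _ in no], [l for _, l in no]),
    }

def predict(node, row):
    """Follow the questions down to a leaf."""
    while isinstance(node, dict):
        node = node["yes"] if row[node["feature"]] == node["value"] else node["no"]
    return node

# Same encoding as above: [color, shape], Yellow = 1, Red = 2, round = 1, oblong = 0
rows = [[1, 1], [2, 1], [1, 0]]
labels = ["Apple", "Apple", "Banana"]
fruit_tree = build_tree(rows, labels)
print(fruit_tree)
print(predict(fruit_tree, [2, 0]))  # a red, oblong fruit
```

On this data the sketch asks about shape (feature 1) first, which is the same choice we reasoned out by hand.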

