Problem: This is our first programming assignment. This assignment has two parts:
• Part I: I ask you to write a program that builds a decision tree, using the Gini impurity measure to guide tree generation. The data set is the Poker Hand Data Set archived at the UCI Machine Learning Repository:
Data Set Characteristics: Multivariate; Number of Instances: 1025010; Area: Game
Attribute Characteristics: Categorical, Integer; Number of Attributes: 11; Date Donated: 2007-01-01
Associated Tasks: Classification; Missing Values: No; Number of Web Hits: 212827
You shall use the training data set to build your decision tree and then use the testing data set to evaluate it. You need to report the classification accuracy in a bar chart and compare it with that of the distance-based classification model built in Part II.
• Part II: For this part, I ask you to use the same training data set as in Part I to build a distance-based classification model. Here, you need to find a good distance metric and a parameter k that serves as the threshold bounding the nearest neighbors of any given data item (or point). Then, you need to apply your model to the testing data set to evaluate it. You shall record the classification accuracy and compare it in a bar chart with that of the decision tree model built in Part I.
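As a reminder of what Part I asks the tree builder to measure, the Gini impurity of a node with class proportions p_c is G = 1 - sum_c p_c^2, and a candidate split is usually scored by the instance-weighted average of its children's impurities. The following is a minimal sketch of those two computations only, not a full tree builder. The function names giniImpurity and splitGini, and the convention that an empty node has impurity 0, are assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Gini impurity of a node given per-class instance counts:
// G = 1 - sum_c p_c^2, where p_c = counts[c] / total.
double giniImpurity(const std::vector<std::size_t>& counts) {
    std::size_t total = 0;
    for (std::size_t c : counts) total += c;
    if (total == 0) return 0.0;  // assumed convention: an empty node counts as pure

    double sumSquares = 0.0;
    for (std::size_t c : counts) {
        const double p = static_cast<double>(c) / static_cast<double>(total);
        sumSquares += p * p;
    }
    return 1.0 - sumSquares;
}

// Weighted Gini impurity of a candidate split. Each inner vector holds the
// per-class counts of one child; each child's impurity is weighted by the
// fraction of instances that child receives.
double splitGini(const std::vector<std::vector<std::size_t>>& children) {
    std::size_t total = 0;
    for (const auto& child : children)
        for (std::size_t c : child) total += c;
    assert(total > 0 && "a split must cover at least one instance");

    double weighted = 0.0;
    for (const auto& child : children) {
        std::size_t n = 0;
        for (std::size_t c : child) n += c;
        weighted += (static_cast<double>(n) / static_cast<double>(total)) *
                    giniImpurity(child);
    }
    return weighted;
}
```

When growing the tree, one common choice is to evaluate splitGini for every candidate attribute at a node and split on the attribute with the lowest value. A split that separates the classes perfectly scores 0.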
Programming Language:
C++ or Java, but C++ is preferred.
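For Part II, one possible shape of a distance-based classifier in C++ is a k-nearest-neighbor majority vote. The sketch below rests on several assumptions that the assignment leaves open: it reads k as the number of neighbors consulted, it uses Hamming distance (the count of attributes that differ) because the attributes are categorical, and it breaks vote ties in favor of the smallest class label. The names Sample, hammingDistance, classifyKnn and accuracy are hypothetical. Finding a better metric and a good value of k remains part of the exercise.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <map>
#include <utility>
#include <vector>

// One data item: its attribute values and its class label (hypothetical layout).
struct Sample {
    std::vector<int> features;
    int label;
};

// Hamming distance: the number of attribute positions where the items differ.
int hammingDistance(const std::vector<int>& a, const std::vector<int>& b) {
    assert(a.size() == b.size());
    int d = 0;
    for (std::size_t i = 0; i < a.size(); ++i)
        if (a[i] != b[i]) ++d;
    return d;
}

// Classify `query` by majority vote among its k nearest training samples.
// Vote ties go to the smallest label (an arbitrary, assumed rule).
int classifyKnn(const std::vector<Sample>& train,
                const std::vector<int>& query, std::size_t k) {
    assert(!train.empty() && k > 0);
    k = std::min(k, train.size());

    std::vector<std::pair<int, int>> distLabel;  // (distance, label)
    distLabel.reserve(train.size());
    for (const Sample& s : train)
        distLabel.emplace_back(hammingDistance(s.features, query), s.label);

    // Only the k smallest distances need to be in sorted order.
    std::partial_sort(distLabel.begin(),
                      distLabel.begin() + static_cast<std::ptrdiff_t>(k),
                      distLabel.end());

    std::map<int, int> votes;  // label -> number of votes
    for (std::size_t i = 0; i < k; ++i) ++votes[distLabel[i].second];

    int best = votes.begin()->first;
    int bestCount = votes.begin()->second;
    for (const auto& kv : votes) {
        if (kv.second > bestCount) {
            best = kv.first;
            bestCount = kv.second;
        }
    }
    return best;
}

// Fraction of test samples whose predicted label matches the true label.
double accuracy(const std::vector<Sample>& train,
                const std::vector<Sample>& test, std::size_t k) {
    assert(!test.empty());
    std::size_t correct = 0;
    for (const Sample& s : test)
        if (classifyKnn(train, s.features, k) == s.label) ++correct;
    return static_cast<double>(correct) / static_cast<double>(test.size());
}
```

This brute-force version compares each query against every training sample, so it is simple but slow on a large testing set. The value returned by accuracy is the figure that would go into the bar chart next to the decision tree's accuracy.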