If you followed through the Basic Decision Tree exercise, this should be useful for you. This is like a continuation but we add so much more. We are working with a bigger and badder datasets. We will be also using techniques we learned from model evaluation and work with ROC, accuracy and other metrics.
Answers to the exercises are available here.
If you obtained a different (correct) answer than those listed on the solutions page, please feel free to post your answer as a comment on that page.
read in the adult.csv file with header=FALSE. Store this in df. Use
str() command to see the dataframe. Download the Data from here
You are given the meta_data that goes with the CSV. You can download this here Use that to add the column names for your dataframe. Notice the df is ordered going from V1,V2,V3 _ _ and so on. As a side note, it is always best practice to use that to match and see if all the columns are read in correctly.
Use the table command and print out the distribution of the class feature.
Change the class column to binary.
cor() command to see the corelation of all the numeric and integer columns including the class column. Remember that numbers close to 1 means high corelation and number close to 0 means low. This will give you a rough idea for feature selection
Split the dataset into Train and Test sample. You may use sample.split() and use the ratio as 0.7 and set the seed to be 1000. Make sure to install and load caTools package.
Check the number of rows of Train
Check the number of rows of Test
We are ready to use decision tree in our dataset. Load the package “rpart” and “rpart.plot” If it is not installed, then use the install.packages() commmand.
Use rpart to build the decision tree on the Train set. Include all features.Store this model in dec
Use the prp() function to plot the decision tree. If you get any error use this code before the
par(mar = rep(2, 4))