Using what you learned in the previous exercises on sampling and selecting (here), we will now go through an entire data analysis process. You will lean on that knowledge to solve the problems. Don’t worry: it might look intimidating, but follow the sequence and you will see that modeling a decision tree is the best decision you made today. We will take you through all stages of the data pipeline: data loading, feature selection, sampling, plotting, modelling, and evaluating a decision tree.
Answers to the exercises are available here.
If you obtained a different (correct) answer than those listed on the solutions page, please feel free to post your answer as a comment on that page.
Use the read.csv() command to load the lenses.csv data and store it in lens. Use the str() command to inspect lens. Download the dataset from here.
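As a rough sketch of the loading step, the snippet below writes a tiny stand-in file and reads it back. The file contents and the name lens_demo are made up for illustration, and passing header = FALSE is an assumption based on the file having no header row; the real lenses.csv may need different arguments.

```r
# Hypothetical stand-in for lenses.csv: three rows, no header line.
tmp <- tempfile(fileext = ".csv")
writeLines(c("1,1,1,1,1", "2,1,1,1,2", "3,1,1,2,1"), tmp)

# header = FALSE stops read.csv() from treating the first row as column names;
# R then assigns the default names V1, V2, ...
lens_demo <- read.csv(tmp, header = FALSE)

# str() prints the dimensions, column names, types, and first few values.
str(lens_demo)
```

With no header, the default names V1 to V5 tell you nothing about the variables, which is why renaming comes next.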
Notice that there are no column names. The column names are as follows:
index, age, spec_pres, astigmatic, tpr. Use a single line of code to change the column names to these names.
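One possible way to rename every column in a single line is to assign a character vector to names(). The data frame lens_demo below is a made-up placeholder with default column names, not the real data.

```r
# Toy data frame with default-style names (hypothetical values).
lens_demo <- data.frame(V1 = 1:2, V2 = c(1, 2), V3 = c(1, 2),
                        V4 = c(1, 2), V5 = c(1, 2))

# Assigning to names() replaces all column names at once;
# the vector must have one entry per column.
names(lens_demo) <- c("index", "age", "spec_pres", "astigmatic", "tpr")

names(lens_demo)
```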
Given the metadata:
age: (1) young, (2) pre-presbyopic, (3) presbyopic
spec_pres: (1) myope, (2) hypermetrope
astigmatic: (1) no, (2) yes
tpr: (1) reduced, (2) normal
class: (1) patient needs hard contact lens, (2) patient needs soft contact lens, (3) patient does not need contact lens
Type the code
lens$age[lens$age == "1"]="young"
Use the same format to replace every numeric code in the age and spec_pres variables with its name.
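The recoding pattern can be sketched on a toy data frame as follows. lens_demo and its values are invented for illustration; the lookup-vector variant used for spec_pres is an alternative technique, not the format the exercise asks for.

```r
# Toy data using the numeric codes from the metadata (hypothetical rows).
lens_demo <- data.frame(age = c(1, 2, 3, 1), spec_pres = c(1, 2, 2, 1))

# Same format as the exercise: one logical-index assignment per code.
# After the first assignment the column is character, so "2" and "3" still match.
lens_demo$age[lens_demo$age == "1"] <- "young"
lens_demo$age[lens_demo$age == "2"] <- "pre-presbyopic"
lens_demo$age[lens_demo$age == "3"] <- "presbyopic"

# Alternative: index a vector of labels by the numeric code (1 -> first label).
spec_labels <- c("myope", "hypermetrope")
lens_demo$spec_pres <- spec_labels[lens_demo$spec_pres]

lens_demo
```

The lookup vector only works while the column still holds numeric codes, so it has to run before any other recoding of that column.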
Use the str() command to see the changes. Also notice that the astigmatic column is a factor that stores the numbers as character labels. To get all the columns into the same format, let's convert it to character. Use as.character() to convert this column's data type to character.
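To illustrate the conversion, here is a minimal sketch on a standalone factor; astig and astig_chr are hypothetical names, not columns of the real data frame.

```r
# A factor whose labels are digits stored as text (hypothetical values).
astig <- factor(c("1", "2", "2", "1"))

# as.character() returns the labels themselves, not the internal level codes.
astig_chr <- as.character(astig)

class(astig)      # "factor"
class(astig_chr)  # "character"
```

On a data frame, the same call is applied to the column and the result is assigned back to that column.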
Now replace the values in the astigmatic column with the right names.
Replace the 1 in the tpr column with “reduced”, using the same format as before.
Now type str(lens) to see the data frame. Notice that the tpr column's data type changed from integer to character. Any time you assign a non-numeric value into a numeric column, R converts the whole column to character.
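The type change can be seen on a small standalone vector. The name tpr_demo and its values are made up for this sketch.

```r
# An integer vector holding the tpr codes (hypothetical values).
tpr_demo <- c(1L, 2L, 1L)
class(tpr_demo)  # "integer"

# Assigning a string into an integer vector coerces every element to character.
tpr_demo[tpr_demo == 1] <- "reduced"
class(tpr_demo)  # "character"

tpr_demo  # "reduced" "2" "reduced"
```

The untouched 2 is now the string "2", which is why later comparisons and replacements on this column work with quoted values.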
Go ahead and replace the 2 in the tpr column with “normal”.
Use the table() command to see the counts of each value.
Notice that there is a g in the counts. That is possibly a typo. Since only one row has it, we can go ahead and remove that row. Hint: you can select all rows that do not have that typo and store the result back in the lens data frame.
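A minimal sketch of the row filter, on invented data. Placing the stray "g" in the tpr column is an assumption made only for this toy example; check the table() output to see which column it actually appears in.

```r
# Toy data with one stray value (hypothetical rows and column choice).
lens_demo <- data.frame(index = 1:4,
                        tpr = c("reduced", "normal", "g", "normal"),
                        stringsAsFactors = FALSE)

# table() counts how often each distinct value occurs.
table(lens_demo$tpr)

# Keep only the rows whose value is not the typo, then store the result back.
lens_demo <- lens_demo[lens_demo$tpr != "g", ]

table(lens_demo$tpr)
```

The comma after the condition matters: it tells R to filter rows and keep all columns.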
Great work. The index column is not necessary for our modeling purposes, so let's remove it.
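One common way to drop a column is to assign NULL to it, sketched here on a made-up data frame named lens_demo.

```r
# Toy data frame with an index column to discard (hypothetical values).
lens_demo <- data.frame(index = 1:3,
                        age = c("young", "young", "presbyopic"),
                        stringsAsFactors = FALSE)

# Assigning NULL to a column removes it from the data frame.
lens_demo$index <- NULL

names(lens_demo)  # "age"
```

Selecting the columns to keep with negative indexing, such as lens_demo[, -1], is another option, but it depends on the column position rather than its name.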