In the last set of exercise of data.table ,we saw some interesting features of data.table .In this set we will cover some of the advanced features like set operation ,join in data.table.You should ideally complete the first part before attempting this one .
Answers to the exercises are available here.
If you obtained a different (correct) answer than those listed on the solutions page, please feel free to post your answer as a comment on that page.
Create a data.table from diamonds dataset ,create key using setkey over cut and color .Now select first entry of the groups Ideal and Premium
With the same dataset,select the first and last entry of the groups Ideal and Premium
Earlier we have seen how we can create/update columns by reference using
:= .However there is a lower over head ,faster alternative in data.table . This is achieved by SET and Loop in data.table ,however this is meant for simple operations and will not work in grouped operation .Now take the diamonds data.table and make columns x,y,z value squared .For example if the value is currently 10 ,the resulting value would be 100 .You are awesome if you find out all alternative answer and check the time using system.time .
In the same dataset ,capitalize first letter of column names .
Reordering column sometimes is necessary , however if your data frame is of several GBs it might be a overhead to create new data frame with new order .Data.Table provides features to overcome this . Now reorder your diamonds data.table’s column by sorting alphabetically
If you are not convinced with the powerful intuitive features of data.table till now , I am pretty sure you will by the end of THIS . Suppose I want to have a metric on diamonds where I want to find for each group of cut
maximum of x * mean of depth and name it my_int_feature and also I want another metric which is
my_int_feature* maximum of y again for each group of cut . This is achievable by chaining but also with a single operation without chaining which is the expected answer .
Suppose we want to merge iris and airquality,akin to the functionlaity of
rbind . We want to do it fast and want to keep track of the rows with their original dataset , and keep all the columns of both the data set in the merged data set as well ,How do we achieve that
The Next 3 exercises are on rolling Join like features of data.table ,which is useful in time series like data . Create a data.table with the following
x <- data.frame(rep(letters[1:2],6),c(1,2,3,4,6,7),sample(100,6))
names(x) <- c("id","day","value")
test_dt <- setDT(x)
Now this mimics a sales data of 7 days for a and b . Notice that day 5 is not present for both a and b .This is not desirable in many situations , A common practise is to use the previous days data .How do we get previous days data for the id a ,You should ideally set keys and do it using join features
May be you dont want the previous day’s data ,you may want to copy the nearest value for day 5 ,How do we achieve that
Now there may be a case when you don’t want to copy any value if the date is beyond last observation .Use your answer for question 8 to find the value for day 5 and 9 for b ,Now since 9 falls beyond last observation of 7 you might want to avoid copying it .How do you explicitly tell your data.table to stop when it sees last observation and don’t copy previous value .This may not seem useful since you know that here 9 falls beyond 7 ,but imagine you have a series of data points and you don’t really want to copy data to observations after your last observation.This might come handy in such cases .