An introduction to R
Contents
1. [Figure 8.7: Survival curve; x-axis: months. 10% will develop AIDS before 45 months and 20% before 76 months.]
[Chapter 8, Statistics; section 8.6, Survival analysis]
It is interesting to know whether the age of a person has any impact on the incubation time. A Cox proportional hazards model is used to investigate that.
    IDU.analysis1 <- coxph(Surv(IncubationTime, AidsStatus) ~ Age, data = IDUdata)
The result of coxph is an object of class coxph. It has its own printing method:
    IDU.analysis1
    Call:
    coxph(formula = Surv(IncubationTime, AidsStatus) ~ Age, data = IDUdata)
          coef exp(coef) se(coef)    z    p
    Age 0.0209      1.02   0.0175 1.20 0.23
    Likelihood ratio test=1.39 on 1 df, p=0.238  n= 418
The summary function for coxph objects returns the following information:
    summary(IDU.analysis1)
    Call:
    coxph(formula = Surv(IncubationTime, AidsStatus) ~ Age, data = IDUdata)
      n= 418
          coef exp(coef) se(coef)    z    p
    Age 0.0209      1.02   0.0175 1.20 0.23
        exp(coef) exp(-coef) lower .95 upper .95
    Age      1.02       0.98     0.987      1.06
    Rsquare= 0.003   (max possible= 0.851 )
    Likelihood ratio test= 1.39  on 1 df,   p=0.238
    Wald test            = 1.43  on 1 df,   p=0.231
    Score (logrank) test = 1.44  on 1 df,   p=0.231
Use the generic function resid to extract model residuals. In a survival analysis there are several types of residuals, for example martingale residuals and deviance residuals. The residuals can be used to assess the linearity
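The residual types named in excerpt 1 can be tried out with a short sketch. IDUdata is not included here, so this assumes the lung data set that ships with the survival package as a stand-in; the columns time, status and age belong to that data set, not to the original example.

```r
library(survival)

# Stand-in data (assumption): 'lung' ships with the survival package.
fit <- coxph(Surv(time, status) ~ age, data = lung)

# resid() is generic; for coxph objects 'type' selects the residual kind.
mart <- resid(fit, type = "martingale")
dev  <- resid(fit, type = "deviance")

# Smoothing martingale residuals against a covariate is one common way
# to look for departures from a linear covariate effect.
smooth_fit <- lowess(lung$age, mart)
```

Plotting smooth_fit on top of a scatter of lung$age against mart would show whether the trend is roughly flat.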
2.  re.eval(evalstr);
    String impstr = "indata = read.csv(infile)";
    re.eval(impstr);
[Chapter 9, Miscellaneous; section 9.7, Creating fancy output]
[Figure 9.3: A small Java GUI that can call R functions. The screenshot shows a "Linear regression" window with the buttons Import Data, Create pairs plot and FitModel, a pairs plot, and the model fit results of an lm call.]
This will cause R to call the read.csv function and create an R object infile. Then, if the user clicks on the Create pairs plot button, the user can select the variables that will be plotted in a pairs plot.
3.              Estimate   Std. Error  t value  Pr(>|t|)
    (Intercept) 4236.9773  7409.1846   0.57     0.5697
    Mileage      161.5201   146.5253   1.10     0.2750
    Weight         2.7349     1.6323   1.68     0.0994
    HP            36.0914    18.5871   1.94     0.0572
    Table 9.1: Regression output
9.7.2 A simple HTML report
A small demonstration of Sweave. Every month you need to publish a report that includes some summary statistics and a graph on your local intranet site. Create a file, say datareport. Treat this file as a normal HTML file.
    <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
    <html>
    <head>
    <meta content="text/html; charset=ISO-8859-1" http-equiv="content-type">
    <title>data report</title>
    </head>
    <body>
    <h2>Monthly summary of input data</h2>
    <br>
    Data summary&nbsp;of&nbsp;sales data from this month
    <<echo=FALSE>>=
    out <- var(cars[, c("Price", "Weight", "Mileage")])
    out <- xtable(out, caption = "Correlation of this months price data")
    print(out, type = "HTML")
    @
    A graph of the data
    <<fig=TRUE, echo=FALSE>>=
    pairs(cars[, c("Price", "Weight", "Mileage")])
    @
    </body>
    </html>
The chunks of R code start with <<some options>>= and end with an @. There are a few options you can set:
• echo=FALSE: the R statements in the chunk are not put in the output. Useful when some R statements need to run, for exa
4.  0.3 == 0.1 + 0.1 + 0.1
    [1] FALSE
2.1.2 Integer
Integers are whole numbers. They can be used to represent counting variables, for example the number of children in a household.
    nchild <- as.integer(3)
    is.integer(nchild)
    [1] TRUE
Note that 3.0 is not an integer, nor is 3 by default an integer!
    nchild <- 3.0
    is.integer(nchild)
    [1] FALSE
    nchild <- 3
    is.integer(nchild)
    [1] FALSE
So a 3 of type integer in R is something different than a 3.0 of type double. However, you can mix objects of type double and integer in one calculation without any problems.
    x <- as.integer(7)
    y <- 2.0
    z <- x/y
In contrast to some other programming languages, the answer is of type double and is 3.5. The maximum integer in R is 2^31 - 1.
[Chapter 2, Data objects; section 2.1, Data types]
    as.integer(2^31 - 1)
    [1] 2147483647
    as.integer(2^31)
    [1] NA
    Warning message:
    NAs introduced by coercion
2.1.3 Complex
Objects of type complex are used to represent complex numbers. In statistical data analysis you will not need them often. Use the function as.complex or complex to create objects of type complex.
    test1 <- as.complex(-25+5i)
    sqrt(test1)
    [1] 0.4975427+5.024694i
    test2 <- complex(5, real=2, im=6)
    test2
    [1] 2+6i 2+6i 2+6i 2+6i 2+6i
    typeof(test2)
    [1] "complex"
Note that by default calculations are done on real numbers, so sqrt(-1) results in NaN (with a warning). Use sqrt(as.complex(-1))
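The type rules described in excerpt 4 can be checked directly. The following is a minimal sketch; the variable names big, overflow and root are introduced here only for illustration.

```r
x <- as.integer(7)
y <- 2.0
z <- x / y                    # integer divided by double gives a double

int_literal <- is.integer(3L) # the L suffix creates an integer literal
plain_three <- is.integer(3)  # a plain 3 is stored as a double

big <- .Machine$integer.max   # largest representable integer, 2^31 - 1
overflow <- suppressWarnings(big + 1L)  # integer overflow yields NA

root <- sqrt(as.complex(-1))  # complex arithmetic avoids the NaN
```

Using the L suffix is often more convenient than as.integer when typing integer constants by hand.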
5.  [1]  8  1  3  5  7  9  6 10  2  4
Try to figure out what the result of x[order(x)] is! The function rev reverses the order of vector elements. So rev(sort(x)) is a sorted vector in descending order.
    x <- rnorm(10)
    round(rev(sort(x)), 2)
    [1]  1.18  1.00  0.87  0.57  0.37 -0.42 -0.49 -0.72 -0.91 -1.26
The function unique returns a vector which only contains the unique values of the input vector. The function duplicated returns for every element a TRUE or FALSE depending on whether or not that element has previously appeared in the vector.
    x <- c(2,6,4,5,5,8,8,1,3,0)
    unique(x)
    [1] 2 6 4 5 8 1 3 0
    duplicated(x)
    [1] FALSE FALSE FALSE FALSE  TRUE FALSE  TRUE FALSE FALSE FALSE
[Chapter 4, Data manipulation; section 4.2, Matrix subscripts]
Our last example of a vector manipulation function is the function diff. It returns a vector which contains the differences between the consecutive input elements.
    x <- c(1,3,5,8,15)
    diff(x)
    [1] 2 2 3 7
So the resulting vector of the function diff is always at least one element shorter than the input vector. An additional lag argument can be used to specify the lag of differences to be calculated.
    x <- c(1,3,5,8,15)
    diff(x, lag=2)
    [1]  4  5 10
So in this case, with lag=2, the resulting vector is two elements shorter.
4.2 Matrix subscripts
As with vectors, parts of matrices can be selected by the subscript mechanism. The general scheme for a matrix x is given by:
    x[subscript]
Where subscript h
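As a quick check of the vector functions in excerpt 5, the sketch below answers the "try to figure out" question: indexing a vector by its own order() gives the sorted vector. The variable names are chosen here for illustration.

```r
x <- c(2, 6, 4, 5, 5, 8, 8, 1, 3, 0)

idx <- order(x)              # positions that would sort x
sorted_via_order <- x[idx]   # same result as sort(x)
descending <- rev(sort(x))   # sorted from large to small

uniq <- unique(x)            # first occurrence of each value
dup  <- duplicated(x)        # TRUE where a value was seen before

steps <- c(1, 3, 5, 8, 15)
d1 <- diff(steps)            # differences of neighbours
d2 <- diff(steps, lag = 2)   # differences two positions apart
```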
6.  [5] confint.lm       cooks.distance.lm deviance.lm   dfbeta.lm
    [9] dfbetas.lm       drop1.lm          dummy.coef.lm effects.lm
    [13] extractAIC.lm   family.lm         formula.lm    hatvalues.lm
    [17] influence.lm    kappa.lm          labels.lm     logLik.lm
    [21] model.frame.lm  model.matrix.lm   plot.lm       predict.lm
    [25] print.lm        proj.lm           residuals.lm  rstandard.lm
    [29] rstudent.lm     simulate.lm       summary.lm    variable.names.lm
    [33] vcov.lm
    Non-visible functions are asterisked
The output of the function methods is a vector with the specific methods. So for the class lm we see that plot.lm is a specific method, so we could use plot(lm.object). Another specific method is extractAIC.lm. The AIC quantity for a linear regression model can be calculated as follows: fit a linear regression model with the function lm. This results in an object of class lm. Then apply the generic function extractAIC, which will call the specific extractAIC.lm function.
[Chapter 9, Miscellaneous; section 9.1, Object oriented]
    cars.lm <- lm(Price ~ Mileage, data = cars)
    extractAIC(cars.lm)
    [1]   2.0000 967.2867
The AIC quantity can also be calculated for other models, such as the Cox proportional hazards model. For the model fitted in section 8.6 with the function coxph, we extract the AIC.
    IDU.analysis1 <- coxph(Surv(IncubationTime, AidsStatus) ~ Age, data = IDUdata)
    extractAIC(IDU.analysis1)
    [1]   1.0000 796.3663
The function methods can also be used to see which classes have
7.  alpha 0.94588 0.28034 3.374 0.00160
    Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    Residual standard error: 0.08786 on 42 degrees of freedom
    Number of iterations to convergence: 4
    Achieved convergence tolerance: 8.43e-07
                    k          Vm      alpha
    k     0.002234067 0.001549951 0.00304783
    Vm    0.001549951 0.004598040 0.01778873
    alpha 0.003047830 0.017788730 0.07859150
When data with a smaller x value are not available, the Hill model with three parameters is not identifiable. Maybe a parameter should be fixed at a certain value instead of trying to estimate it.
8.7.2 Singular value decomposition
A useful trick to find out whether certain parameters, or combinations of parameters, are estimable (identifiable), i.e. whether they influence model predictions enough or whether they will be obscured by measurement noise, is a singular value decomposition of the so-called sensitivity matrix. Let us assume we have a non-linear model $y^{pred} = f(x, \theta)$ for data points $i = 1, \ldots, n$. Then the sensitivity matrix $S(\theta)$ for the parameter vector $\theta$ is defined by
    $S_{ij}(\theta) = \partial y_i^{pred} / \partial \theta_j$
So S can be calculated by [formula not recoverable]
[Chapter 8, Statistics; section 8.7, Non-linear regression]
This method will rank the importance, with respect to the influence on y, of linear combinations of the parameters. Thereto a singular value decomposition of S is performed,
    $S = U D V^T$
where U and V are unitary and the $d_i$ are called the
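The sensitivity matrix from excerpt 7 can be sketched with the base R function deriv, which returns the gradient with respect to the named parameters. The functional form of the Hill model used here, Vm * x^alpha / (k^alpha + x^alpha), is an assumption (the excerpts do not show the definition of HillModel); the parameter values are the starting values quoted in the excerpts.

```r
# Assumed form of the Hill model (hypothetical; not shown in the source).
hill_deriv <- deriv(~ Vm * x^alpha / (k^alpha + x^alpha),
                    c("k", "Vm", "alpha"))

# Evaluate the model and its gradient on a grid of x values.
pred <- eval(hill_deriv,
             envir = list(x = seq(0.01, 5, length.out = 50),
                          k = 0.3, Vm = 1.108, alpha = 0.8))

# One row per data point, one column per parameter.
S <- attr(pred, "gradient")

# Singular values in decreasing order; a very small last value
# suggests a poorly determined combination of parameters.
d <- svd(S)$d
ratio <- d[length(d)] / d[1]
```

Comparing ratio for different x grids gives a rough feeling for how the design affects identifiability.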
8.  $u
    (matrix u skipped)
    $d
    [1] 6.89792666 0.52385899 0.03482234
    $v
              [,1]      [,2]      [,3]
    [1,] 0.8862039 0.4128734 0.2101863
    [2,] 0.3014789 0.1694330 0.9382979
    [3,] 0.3517857 0.8948900 0.2746248
The largest singular value is 6.898 and the smallest has a value of 0.0348. This ratio becomes better if we include data points with smaller x values.
    sensitivity <- eval(ModelDeriv, envir = list(x = seq(from=0.01, to=5, l=50), k=0.3, Vm=1.108, alpha=0.8))
    sensitivity <- attributes(sensitivity)$gradient
    svd(sensitivity)$d
    [1] 6.6156912 1.3827939 0.4279231
9 Miscellaneous Stuff
9.1 Object Oriented Programming
9.1.1 Introduction
The programming language in R is object-oriented. In R this means:
• All objects in R are members of a certain class.
• There are generic methods that will pass an object to its specific method.
• The user can create new classes, and new generic and specific methods.
There are many classes in R, such as data.frame, lm and htest. The function data.class can be used to request the class of a specific object.
    mydf <- data.frame(x=c(1,2,3,4,5), y=c(4,3,2,1,1))
    data.class(mydf)
    [1] "data.frame"
    myfit <- lm(y~x, data=mydf)
    data.class(myfit)
    [1] "lm"
There are two object-oriented systems in R: old-style classes (also called S version 3 or S3 classes) and new-style classes (also called S version 4 or S4 classes). We first discuss old-style classes
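The three points in the list of excerpt 8 can be illustrated with a small old-style (S3) example. The class name circle and the generic area are hypothetical names introduced only for this sketch.

```r
# A new generic: dispatches on the class of its first argument.
area <- function(shape, ...) UseMethod("area")

# A specific method for objects of (hypothetical) class "circle".
area.circle <- function(shape, ...) pi * shape$r^2

# Fallback for classes without a specific method.
area.default <- function(shape, ...) {
  stop("no area method for class ", data.class(shape))
}

# Creating an S3 object only requires setting the class attribute.
circ <- structure(list(r = 2), class = "circle")

a <- area(circ)           # dispatches to area.circle
cls <- data.class(circ)   # "circle"
```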
9.  SETCADR(R_fcall, x_input);
    fn_out = eval(R_fcall, rho);
    UNPROTECT(3);
    return REAL(fn_out)[0];
    }
The same constructions are used as in the previous example. Evaluating the R function results in a variable of type SEXP; this is then converted to a double and returned by func. When the dll is compiled, we can link it to R and run the function.
[Chapter 6, Efficient; section 6.4, Some compiled code]
    mydll = "C:/Test/Release/Integrate.dll"
    dyn.load(mydll)
    myf <- function(x){
      x*sin(x)
    }
    .Call("Integrate", myf, as.double(0), as.double(2), new.env())
    [1] 1.741591
Of course, you could have used the R function integrate as a comparison:
    integrate(myf, 0, 2)
    1.741591 with absolute error < 1.9e-14
It gives the same result.
7 Graphics
7.1 Introduction
One of the strengths of R above SAS or SPSS is its graphical system; there are numerous functions. You can create standard graphs, use the R syntax to modify existing graphs, or create completely new graphs. A good overview of the different aspects of creating graphs in R can be found in [6]. In this chapter we will first discuss the graphical functions that can be found in the base R system and the lattice package. There are more R packages that contain graphical functions; one very nice package is ggplot2, http://had.co.nz/ggplot2. We will give some examples of ggplot2 in the last section of this chapter. The graphical functions in the b
10. myf(pp)
    }
Executing the command testf(-9) will result in an error; execute traceback to see the function calls before the error.
    Error in if (x > 0) { : missing value where TRUE/FALSE needed
    In addition: Warning message:
    NaNs produced in: log(x)
    traceback()
    2: myf(pp)
    1: testf(-9)
Sometimes it may not be obvious where a warning is produced; in that case you may set the option
    options(warn = 2)
Instead of continuing the execution, R will now halt the execution if it encounters a warning.
5.4.2 The warning and stop functions
You, as the writer of a function, can also produce errors and warnings. In addition to putting ordinary print statements like print("Some message") in your function, you can use the function warning. For example:
[Chapter 5, Writing functions; section 5.4, Debugging your R]
    variation <- function(x){
      if(min(x) <= 0){
        warning("variation only useful for positive data")
      }
      sd(x)/mean(x)
    }
    variation(rnorm(100))
    [1] 19.4427
    Warning message:
    variation only useful for positive data in: variation(rnorm(100))
If you want to raise an error you can use the function stop. In the above example, when we replace warning by stop, R would halt the execution.
    variation(rnorm(100))
    Error in variation(rnorm(100)) : variation only useful for positive data
R will treat your warnings and errors as normal R warnings and errors. That means, for example, the function traceback can be used to se
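Because warnings and errors raised with warning and stop behave like ordinary R conditions, they can also be caught with tryCatch. The sketch below reuses the variation function from excerpt 10; the extra is.numeric check and the variable names warn_msg, err_msg and cv are additions made here for illustration.

```r
variation <- function(x) {
  # stop() aborts the function with an error (added check, assumption).
  if (!is.numeric(x)) stop("x must be numeric")
  # warning() reports a problem but lets the function continue.
  if (min(x) <= 0) warning("variation only useful for positive data")
  sd(x) / mean(x)
}

# Catch the warning and keep its message instead of the result.
warn_msg <- tryCatch(variation(c(-1, 2, 3)),
                     warning = function(w) conditionMessage(w))

# Catch the error raised by stop().
err_msg <- tryCatch(variation("a"),
                    error = function(e) conditionMessage(e))

# Positive data: no condition is raised.
cv <- variation(c(1, 2, 3))
```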
11. [Figure 8.12: Hill curves for two sets of parameters; y-axis: f(x).]
    Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    Residual standard error: 0.08745 on 42 degrees of freedom
    Number of iterations to convergence: 14
    Achieved convergence tolerance: 6.249e-06
                  k        Vm    alpha
    k     1.9993975 0.4390971  7.79519
    Vm    0.4390971 0.1013896  1.74292
    alpha 7.7951900 1.7429199 30.59291
Even though the fitting routine nls started with the same parameter values as those that were used in simulating the data, the nls function does not get really close, and the standard error of the alpha parameter is quite large. Even more disturbing, when we simulate new data with the same parameters, the nls function will come up with very different results. When observations with a smaller x value are available, the problem is less ill-conditioned.
    # simulate data with smaller x values
    x <- runif(45, 0.01, 5)
    datap <- HillModel(x, alpha1, Vm1, k1) + rnorm(45, 0, 0.09)
    simdata <- data.frame(x, datap)
    # Fit the model
    out <- nls(
      datap ~ HillModel(x, alpha, Vm, k),
      data = simdata,
      start = list(k = 0.3, Vm = 1.108, alpha = 0.8)
    )
    # Print output
    summary(out)
    vcov(out)
          Estimate Std. Error t value Pr(>|t|)
    k      0.24442    0.04727   5.171  6.1e-06
    Vm     1.04371    0.06781  15.392  < 2e-16
12. [1] 37
So element 37 of the car.names vector is a name that contains the string 'Volvo', which is confirmed by a quick check.
    car.names[37]
    [1] "Volvo 240 4"
To find the car names with second letter 'a', we must use a more complicated regular expression.
    tmp <- grep("^.a", car.names)
    car.names[tmp]
    [1] "Eagle Summit 4"   "Mazda Protege 4"
    [3] "Mazda 626 4"      "Eagle Premier V6"
    [5] "Mazda 929 V6"     "Mazda MPV V6"
For those who are familiar with wildcards (aka globbing), there is a handy function glob2rx that transforms a wildcard to a regular expression.
    rg <- glob2rx("*.tmp")
    rg
    [1] "^.*\\.tmp$"
To find patterns in texts you can also use the regexpr function. This function also makes use of regular expressions; however, it returns more information than grep.
[Chapter 4, Data manipulation; section 4.5, Character manipulation]
    Volvo.match <- regexpr("Volvo", car.names)
    Volvo.match
    [numeric vector output with a "match.length" attribute; the printed values are not recoverable]
The result of regexpr is a numeric vector with a match.length attribute. A minus one means no match was found; a positive number means a match was found. In our example we see that element 37 of Volvo.match equa
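The difference between grep and regexpr described in excerpt 12 is easy to see on a tiny vector. The four car names below are a hypothetical stand-in for the car.names vector of the original example.

```r
# Hypothetical stand-in for the car.names vector.
car.names <- c("Eagle Summit 4", "Ford Escort 4", "Volvo 240 4", "Mazda 626 4")

# grep returns the positions of the elements that match.
volvo_pos <- grep("Volvo", car.names)

# "^.a": start of string, any single character, then the letter a.
second_a <- car.names[grep("^.a", car.names)]

# regexpr returns, for every element, where the match starts (-1 if none)
# and stores the length of each match in the "match.length" attribute.
m <- regexpr("Volvo", car.names)
starts  <- as.integer(m)
lengths <- attr(m, "match.length")

# glob2rx translates a wildcard pattern into a regular expression.
rg <- glob2rx("*.tmp")
```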
13. [Chapter 7, Graphics; section 7.4, Trellis graphics]
[Figure 7.15: Trellis plot with modified panel function. Panels: Small, Sporty, Van, Compact, Large, Medium; x-axis: Mileage.]
7.4.4 Conditioning plots
The function coplot can be a nice alternative to the trellis function xyplot. It can also create multi-panel layouts, where each panel represents a part of the data. The function has many arguments that can be set. A few examples are given below.
    # no need to specify intervals, only the number of intervals
    coplot(lat ~ long | depth, number = 4, data = quakes, col = "blue")
    # two conditioning variables
    coplot(lat ~ long | depth * mag, number = c(4, 5), data = quakes)
    # conditioning on a factor and numeric variable
    coplot(Price ~ Mileage | Type * Weight, number = 3, data = cars)
The function coplot can also use a customized panel function; the points function is used as default panel function. The following example uses the function panel.smooth as panel function.
    coplot(Price ~ Mileage | Weight, number = 4, panel = panel.smooth)
[Chapter 7, Graphics; section 7.5, The ggplot2 package]
14. X <- model.matrix(out.model)
    svd(X)
    $d
    [1] 94.26983374  3.06760623  1.29749565  0.02145514
    ... matrix U and V not displayed
The variance inflation factors $VIF_i$, $i = 1, \ldots, p$, are based on regressing one of the regression variables $x_i$ on the remaining regression variables $x_j$, $j \neq i$, for $i = 1, \ldots, p$. For each of these regressions the R-squared statistic $R_i^2$, $i = 1, \ldots, p$, can be calculated. Then the VIF is defined as
    $VIF_i = \frac{1}{1 - R_i^2}$
It can be shown that $VIF_i$ can be interpreted as how much the variance of the estimated regression coefficient $\beta_i$ is inflated by the existence of correlation among the regression variables in the model. A $VIF_i$ of 1 means that there is no correlation among the i-th regression variable and the remaining regression variables, and hence the variance of $\beta_i$ is not inflated at all. The general rule of thumb is that VIFs exceeding 4 warrant further investigation, while VIFs exceeding 10 are signs of serious multicollinearity requiring correction. The function vif in the DAAG package calculates the VIFs for a fitted linear regression model.
[Chapter 8, Statistics; section 8.3, Linear regression models]
    library(DAAG)
    vif(out.model)
         x1      x2      x3
     4150.9 13130.0 17797.0
8.3.4 Factor (categorical) variables as regression variables
The lm function (and other modeling functions such as coxph and glm as well) accepts factor variables (categorical variables) as regression variables. It is not possible to estimate
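The VIF definition in excerpt 14 can be applied literally with base R, without the DAAG package. The sketch below simulates collinear data in the spirit of the original example; the helper name vif_manual and the seed are introduced here for illustration, so the numbers will differ from the ones quoted above.

```r
set.seed(1)
x1 <- runif(100, 1, 2)
x2 <- runif(100, 1, 2)
x3 <- 2 * x1 + 4 * x2 + rnorm(100, 0, 0.01)  # nearly a linear combination
preds <- data.frame(x1, x2, x3)

# Hypothetical helper: regress each variable on all the others and
# apply VIF_i = 1 / (1 - R_i^2).
vif_manual <- function(df) {
  sapply(names(df), function(v) {
    others <- setdiff(names(df), v)
    fit <- lm(reformulate(others, response = v), data = df)
    1 / (1 - summary(fit)$r.squared)
  })
}

vifs <- vif_manual(preds)

# Singular values of the design matrix (intercept column included).
sv <- svd(cbind(1, as.matrix(preds)))$d
```

Both diagnostics point in the same direction here: all VIFs are far above 10 and the smallest singular value is tiny compared to the largest.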
15. dim(A)[1]
    mat.means <- t(A) %*% rep(1, n)/n
6.2 The apply and outer functions
6.2.1 the apply function
This function is used to perform calculations on parts of arrays. Specifically, calculations on rows and columns of matrices, or on columns of a data frame. To calculate the means of all columns in a matrix, use the following syntax:
[Chapter 6, Efficient; section 6.2, The apply and outer]
    M <- matrix(rnorm(10000), ncol=100)
    apply(M, 2, mean)
The first argument of apply is the matrix, the second argument is either a 1 or a 2. If one chooses 1 then the mean of each row will be calculated, if one chooses 2 then the mean will be calculated for each column. The third argument is the name of a function that will be applied to the rows or columns. The function apply can also be used with a function that you have written yourself. Extra arguments to your function must now be passed through the apply function. The following construction calculates the number of entries that are larger than a threshold d for each column in a matrix.
    tresh <- function(x, d){
      sum(x > d)
    }
    M <- matrix(rnorm(10000), ncol=100)
    apply(M, 2, tresh, 0.6)
     [1] 24 28 37 30 18 30 26 25 36 37 27 27 24 28 26 28 28 28 26
    [20] 30 23 22 33 21 31 25 23 27 33 31 26 30 27 28 29 28 32 23
    [39] 24 27 28 26 28 30 25 20 30 24 29 21 25 21 35 25 33 26 23
    [58] 33 23 27 23 27 31 22 33 29 25 2
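The meaning of the second argument of apply is easy to mix up, so a small matrix helps: 1 works over rows and 2 over columns. This sketch uses a 2 by 3 matrix and the tresh function from excerpt 15.

```r
M <- matrix(1:6, nrow = 2)   # rows: (1, 3, 5) and (2, 4, 6)

row_means <- apply(M, 1, mean)   # one value per row
col_means <- apply(M, 2, mean)   # one value per column

# Extra arguments after the function name are passed on to it.
tresh <- function(x, d) sum(x > d)
counts <- apply(M, 2, tresh, 2)  # entries larger than 2, per column
```

For plain means the dedicated functions rowMeans and colMeans give the same result and are usually faster.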
16. x <- seq(-3, 3, l=100)
    titletext <- deparse(substitute(sin(x)))
    y <- sin(x)
    plot(x, y, type="l")
    title(titletext)
This seems cumbersome, since we could simply have used
[Chapter 9, Miscellaneous; section 9.2, R language objects]
    title("sin(x)")
However, the substitute-deparse combination will come to full advantage in functions, for example:
    printexpr <- function(expr){
      tmp <- deparse(substitute(expr))
      cat("The expression ")
      cat(tmp)
      cat(" was typed")
      invisible()
    }
    printexpr(sin(x))
    The expression sin(x) was typed
The function sys.call can be used inside a function and stores the complete call to the function that contains the sys.call function.
    plotit <- function(x, y){
      plot(x, y)
      title(deparse(sys.call()))
    }
    plotit(rnorm(100), rnorm(100))
9.2.2 Expressions as Lists
Expressions and calls can be seen as recursive lists, so they can be manipulated in the same way you manipulate ordinary lists. To create calls use the function quote; to create expressions use the function expression. The output of these functions can be evaluated using the eval function.
    my.expr <- expression(3*sin(rnorm(10)))
    my.expr
    eval(my.expr)
    [1] 0.09410139 1.41480964 0.71378911 2.33318300 2.05434944 2.91265390
    [7] 2.98911418 0.25474676 2.02085040 0.88611195
Let's look at the expression my.expr; we transform it to a list using the function as.list.
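The idea that calls are list-like can be made concrete with quote, as.list and eval. In the sketch below the function name show_typed and the variable names are introduced for illustration only.

```r
# deparse(substitute()) turns the unevaluated argument into text.
show_typed <- function(expr) deparse(substitute(expr))
typed <- show_typed(sin(x) + 1)   # x does not need to exist

# quote() creates a call without evaluating it.
cl <- quote(sum(1, 2, 3))
parts <- as.list(cl)              # function name followed by arguments
before <- eval(cl)                # 1 + 2 + 3

# Elements of a call can be replaced like list elements.
cl[[1]] <- as.name("prod")        # now the call is prod(1, 2, 3)
after <- eval(cl)                 # 1 * 2 * 3
```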
17. [Figure 7.3: Different uses of the function plot.]
    # set a 2 by 2 layout
    par(mfrow = c(2, 2))
    plot(xf, col = "blue")
    plot(xf, rnorm(500), col = "red")
    plot(sin, -3, 3)
    plot(myts)
7.2.2 Distribution plots
R has a variety of plot functions to display the distribution of a data vector. Suppose the vector x is a numeric data vector, for example
    x <- rnorm(1000)
Then the following function calls can be used to analyze the distribution of x graphically:
• hist(x): creates a histogram.
• qqnorm(x): creates a quantile-quantile plot with normal quantiles on the x-axis.
• qqplot(x, y): creates a qq-plot of x against y.
• boxplot(x): creates a box-and-whisker plot of x.
[Chapter 7, Graphics; section 7.2, More plot functions]
The above functions can take more arguments for fine-tuning the graph; see the corresponding help files. The code below creates an example of each graph in the above list.
    x <- rnorm(100)
    y <- rt(100, df = 3)
    par(mfrow = c(2, 2))
    hist(x, col = 2)
    qqnorm(x)
    qqplot(x, y)
    boxplot(x, col = "green")
[Figure 7.4: Example distribution plot in R. Panel titles include "Histogram of x" and "Normal Q-Q Plot".]
If you have a factor variable x
18. put command;
    call system(command);
    run;
The SAS output delivery system (ODS) is a convenient system to create reports in HTML, PDF or other formats. The ODS takes output from SAS procedures and graphs; together with specific user settings it creates a certain report. The graphs don't have to be SAS graphs, they could be any graph. Let's use the same dataset testdata as in the previous example. First run the SAS code that calls the R code that creates the graph.
    %let myf = "C:\Temp\plotR2.R";
    data _null_;
      command = 'Rcmd BATCH ' || &myf;
      put command;
      call system(command);
    run;
When the graphs in R are created and are stored on disc, start the specifications of the SAS ODS.
[Chapter 9, Miscellaneous; section 9.4, Defaults and]
    ods html file = "sasreport.html";
    title "SAS output and R graphics";
    title2 "a small example";
    /* Some SAS procedure that writes results in the report */
    proc means data = Testdata;
    run;
    /* export the SAS data and call R to create the plot */
    proc export data = testdata outfile = "C:\temp\sasdata.csv" REPLACE;
    run;
    %let myf = "C:\Temp\plotR2.R";
    data _null_;
      command = 'Rcmd BATCH ' || &myf;
      put command;
      call system(command);
    run;
    /* insert additional html that inserts the graph that R created */
    ODS html text = "<b> My Graph created in R </b>";
    ODS html text = "<img src='c:\temp\Rgraph2.jpg' BORDER = '0'>";
    ODS html close;
9.4 Defaul
19. s GNU General Public License in source code form. It compiles and runs on a wide variety of UNIX platforms and similar systems (including FreeBSD and Linux), Windows and MacOS.
[Chapter 1, Introduction; section 1.2, The R environment]
1.2 The R environment
R is an integrated suite of software facilities for data manipulation, calculation and graphical display. It includes
• an effective data handling and storage facility,
• a suite of operators for calculations on arrays, in particular matrices,
• a large, coherent, integrated collection of intermediate tools for data analysis,
• graphical facilities for data analysis and display either on-screen or on hardcopy, and
• a well-developed, simple and effective programming language which includes conditionals, loops, user-defined recursive functions and input and output facilities.
The term "environment" is intended to characterize it as a fully planned and coherent system, rather than an incremental accretion of very specific and inflexible tools, as is frequently the case with other data analysis software. R, like S, is designed around a true computer language, and it allows users to add additional functionality by defining new functions. Much of the system is itself written in the R dialect of S, which makes it easy for users to follow the algorithmic choices made. For computationally-intensive tasks, C, C++ and Fortran code can be linked and called at run time. Advanced users can write C cod
20. xq <- quantile(x, c(0.05, 0.1))
    xq
           5%       10%
    -1.496649 -1.205602
The function returns a vector with the quantiles as named elements.
stem and leaf plots
A stem-and-leaf plot of x is generated by:
    stem(x)
    N = 100   Median = 0.014053
    Quartiles = -0.676618, 0.749655
    Decimal point is at the colon
      -3 : 5
      -2 : 721
      -1 : 654422222111000000
      -0 : 988877666555544444433333222111100
       0 : 111233345566667777788888889
       1 : 000012224444788
       2 : 01
       3 : 13
[Chapter 8, Statistics; section 8.1, Basic statistical functions]
distribution tests
To test if a data vector is drawn from a certain distribution, the function ks.test can be used.
    x <- runif(100)
    out = ks.test(x, "pnorm")
    out
            One-sample Kolmogorov-Smirnov test
    data:  x
    D = 0.5003, p-value < 2.2e-16
    alternative hypothesis: two-sided
The output object out is an object of class 'htest'. It is a list with five components.
    names(out)
    [1] "statistic"   "p.value"     "alternative" "method"      "data.name"
    out$statistic
            D
    0.5003282
The function can also be used to test if two data vectors are drawn from the same distribution.
    x1 = rnorm(100)
    x2 = rnorm(100)
    ks.test(x1, x2)
            Two-sample Kolmogorov-Smirnov test
    data:  x1 and x2
    D = 0.1, p-value = 0.6994
    alternative hypothesis: two-sided
Alternative functions that can be used are chisq.test, shapiro.test and wilcox.test. Note that the functions in table 8.1 usually require a vector with data
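The components of the htest object returned by ks.test can be used programmatically, which is handy when many tests are run in a loop. A minimal sketch with a fixed seed follows; the variable names are chosen here for illustration, and the exact numbers depend on the random draws.

```r
set.seed(42)

# Uniform data tested against the standard normal distribution.
x <- runif(100)
out <- ks.test(x, "pnorm")

d_stat  <- unname(out$statistic)  # the test statistic D
p_value <- out$p.value            # the p-value as a plain number

# Two-sample version: are both samples from the same distribution?
two <- ks.test(rnorm(100), rnorm(100))
```

Since all uniform values lie between 0 and 1, where the standard normal distribution function is already above 0.5, D is at least 0.5 in the first test.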
21. you can use the functions pie or barplot in combination with the table function to get a graphical display of the distribution of the levels of x. Let's look at the cars data; it has the factor columns Country and Type.
    pie(table(cars$Country))
    barplot(table(cars$Type))
The first argument of barplot can also be a matrix; in that case either stacked or grouped bar plots are created. This will depend on the logical argument beside.
    barplot(table(cars$Country, cars$Type), beside = T, legend.text = T)
[Figure 7.5: Example barplot where the first argument is a matrix. Legend: France, Germany, Japan, Japan/USA, Korea, Mexico, Sweden, USA; x-axis: Compact, Large, Medium, Small, Sporty, Van.]
7.2.3 Two or more variables
When you have two or more variables (in a data frame), you can use the following functions to display their relationship.
• pairs(mydf): when mydf is a data frame, each column in mydf is plotted against each other, the same as plot(mydf).
• symbols: creates a scatterplot where the symbols can vary in size.
• dotchart: creates a dot plot that can be grouped by levels of a factor.
• contour, image, filled.contour: create contour and image plots.
• persp: creates surface plots.
In addition, multi panel
22. 0.0132 -1.02 3.07e-01
    Log(scale) -0.2721 0.1001 -2.72 6.57e-03
    Scale= 0.762
    Weibull distribution
    Loglik(model)= -510.6   Loglik(intercept only)= -511.1
    Chisq= 1.02 on 1 degrees of freedom, p= 0.31
    Number of Newton-Raphson Iterations: 7
    n= 418
Predictions of the survival time can be made with the predict method.
    newAges = data.frame(Age = 30:35)
    newAges$prediction = predict(IDU.param, newdata = newAges)
    newAges
      Age prediction
    1  30   216.6348
    2  31   213.7250
    3  32   210.8543
    4  33   208.0222
    5  34   205.2282
    6  35   202.4716
8.7 Non-linear regression
A good book on nonlinear regression is [14]. The function nls in R can fit nonlinear regression models of the form
    $y = f(x, \theta) + \epsilon$
for some nonlinear function f, and where x is a vector of regression variables. The error term $\epsilon$ is often assumed to be normally distributed with mean zero. The p unknown parameters are in the vector $\theta = (\theta_1, \ldots, \theta_p)$ and need to be estimated from data points $(y_i, x_i)$, $i = 1, \ldots, n$. Unlike the formula specification in linear models, the operators in a formula object for nonlinear models have the normal mathematical meaning. For example, to specify the following nonlinear model:
    $y = \frac{\beta_1 x_1}{\beta_2 + x_2} + \epsilon$
use the formula object
    y ~ b1*x1/(b2 + x2)
The right-hand side of the formula for nonlinear models can also be a function of the data and parameters. For example:
    mymodel <- function(b
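To see the nonlinear formula notation of excerpt 22 at work, the sketch below simulates data from a model of the form y = b1*x1/(b2 + x2) and recovers the parameters with nls. The true values (3 and 2), the noise level, the seed and the starting values are all assumptions made for this illustration.

```r
set.seed(123)

# Simulated data (assumed true parameters: b1 = 3, b2 = 2).
x1 <- runif(200, 1, 5)
x2 <- runif(200, 1, 5)
y  <- 3 * x1 / (2 + x2) + rnorm(200, 0, 0.05)
simdata <- data.frame(y, x1, x2)

# In a nonlinear formula, *, / and + have their arithmetic meaning.
fit <- nls(y ~ b1 * x1 / (b2 + x2),
           data  = simdata,
           start = list(b1 = 2, b2 = 1))

est <- coef(fit)   # named vector with the estimates of b1 and b2
```

nls needs starting values for every parameter; poor starting values are a common reason for convergence failures.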
23. x2 <- runif(100, 1, 2)
    x3 <- 2*x1 + 4*x2 + rnorm(100, 0, 0.01)
    y <- 6*x1 + 5*x2 + 3*x2 + rnorm(100, 0, 0.4)
    testdata <- data.frame(y, x1, x2, x3)
    out.model <- lm(y ~ x1 + x2 + x3, data = testdata)
    summary(out.model)
    Call:
    lm(formula = y ~ x1 + x2 + x3, data = testdata)
    Residuals:
         Min       1Q   Median       3Q      Max
    -1.09211 -0.26002  0.05173  0.29653 0.362532
    Coefficients:
                Estimate Std. Error t value Pr(>|t|)
    (Intercept)   0.1932     0.3146   0.614    0.541
    x1            0.7615     8.3111   0.092    0.927
    x2            2.4102    16.5890   0.145    0.885
    x3            2.6310     4.1428   0.635    0.527
    Residual standard error: 0.4079 on 96 degrees of freedom
    Multiple R-Squared: 0.9815, Adjusted R-squared: 0.9809
    F-statistic: 1698 on 3 and 96 DF, p-value: < 2.2e-16
Looking at the output, a strange thing is the huge standard error for x2. This may indicate that there is something wrong.
SVD and VIF
Two tools to detect multicollinearity are the singular value decomposition (SVD) of the X matrix and the calculation of variance inflation factors (VIF). The singular value decomposition of X finds matrices U, D and V such that
    $X = U D V^T$
When X does not have full rank, one or more of the singular values $d_i$ are zero. In practice this will not happen often. The more likely case is that the smallest singular value is small compared to the largest singular value. The SVD of the X matrix in the above example can be calculated with the function svd.
24. [Chapter 9, Miscellaneous; section 9.3, Calling R from SAS]
    jpeg(filename = "C:/temp/Rgraph.jpg")
    x <- rnorm(100)
    y <- rnorm(100)
    par(mfrow = c(2, 1))
    plot(x, y)
    hist(y)
    dev.off()
Then in a SAS session, use the call system function to call an external program, in this case Rcmd BATCH.
    %let myf = "C:\Temp\plotR.R";
    data _null_;
      command = 'Rcmd BATCH ' || &myf;
      put command;
      call system(command);
    run;
The same example, but now using the X function in SAS:
    %let myf = "C:\Temp\plotR.R";
    %let command = Rcmd BATCH &myf;
    X &command;
9.3.2 Using SAS data sets and SAS ODS
The previous example used internal R data to create the plot. To create R graphs using SAS data sets you can:
• export SAS data to a text file and import that in R.
• import the SAS data set in R directly.
The following example creates a small data set in SAS and exports it using proc export.
    data testdata;
      input x y;
      datalines;
    1 3
    2 6
    3 8
    ;
    run;
    proc export data = testdata outfile = "C:\temp\sasdata.csv" REPLACE;
    run;
Then in the R file plotR2.R we import the data and use the data to create a simple graph.
    sasdata <- read.csv("C:/temp/sasdata.csv")
    jpeg("C:/temp/Rgraph2.jpg")
    plot(sasdata$x, sasdata$y)
    dev.off()
Then in SAS we call Rcmd BATCH to run the above R file non-interactively.
    %let myf = "C:\Temp\plotR2.R";
    data _null_;
      command = 'Rcmd BATCH ' || &myf;
25. In the above example z1 + z2 is returned; note that the individual objects z1 and z2 will be lost. You can only return one object. If you want to return more than one object, you have to return a list where the components of the list are the objects to be returned. For example:
    myf <- function(x, y){
      z1 <- sin(x)
      z2 <- cos(y)
      list(z1, z2)
    }
[Chapter 5, Writing functions; section 5.2, Arguments and variables]
To exit a function before it reaches the last line, use the return function. Any code after the return statement inside a function will be ignored. For example:
    myf <- function(x, y){
      z1 <- sin(x)
      z2 <- cos(y)
      if(z1 < 0){
        return(list(z1, z2))
      } else {
        return(z1 + z2)
      }
    }
5.2.5 The Scoping rules
The scoping rules of a programming language are the rules that determine how the programming language finds a value for a variable. This is especially important for free variables inside a function and for functions defined inside a function. Let's look at the following example function.
    myf <- function(x){
      y = 6
      z = x + y + a1
      a2 = 9
      insidef = function(p){
        tmp = p*a2
        sin(tmp)
      }
      2*insidef(z)
    }
In the above function:
• x, p are formal arguments.
• y, tmp are local variables.
• a2 is a local variable in the function myf.
• a2 is a free variable in the function insidef.
R uses a so-called lexical scoping rule to find the value of free variables, see [3]. With lexical scoping, free variables are first resolved
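The scoping example of excerpt 25 can be run as a small experiment. The values assigned to a1 and to the outer a2 are assumptions added here so that the function can be called; the point is that insidef uses the a2 from the function in which it was defined, not the outer one.

```r
a1 <- 1     # free variable of myf: looked up where myf was defined
a2 <- 100   # NOT used by insidef, which finds a2 inside myf first

myf <- function(x) {
  y <- 6
  z <- x + y + a1        # a1 is not local, so R looks outside myf
  a2 <- 9                # local to myf
  insidef <- function(p) {
    tmp <- p * a2        # a2 is free here: resolved in myf, giving 9
    sin(tmp)
  }
  2 * insidef(z)
}

res <- myf(1)            # z = 1 + 6 + 1 = 8, so res = 2 * sin(8 * 9)
```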
26. a short description on writing your own package.

When you download R, a number (around 30) of packages are downloaded as well. To use a function in an R package, that package has to be attached to the system. When you start R, not all of the downloaded packages are attached; only seven packages are attached by default. You can use the function search to see a list of packages that are currently attached to the system; this list is also called the search path.

    > search()
    [1] ".GlobalEnv"        "package:stats"     "package:graphics"
    [4] "package:grDevices" "package:datasets"  "package:utils"
    [7] "package:methods"   "Autoloads"         "package:base"

The first element of the output of search is ".GlobalEnv", which is the current workspace of the user. To attach another package to the system you can use the menu or the library function.

Via the menu: select the Packages menu and select Load package; a list of available packages on your system will be displayed. Select one and click OK; the package is now attached to your current R session.

Via the library function:

    > library(MASS)
    > shoes
    $A
     [1] 13.2  8.2 10.9 14.3 10.7  6.6  9.5 10.8  8.8 13.3
    $B
     [1] 14.0  8.8 11.2 14.2 11.8  6.4  9.8 11.3  9.3 13.6

The function library can also be used to list all the available libraries on your system with a short description. Run the function without any arguments. >
27. allow you to manipulate R objects from C code and evaluate R expressions from within your C code. It is more efficient than the standard .C interface, but because it allows you to work directly with R objects without the usual R protection mechanisms, you must be careful when programming with it to avoid memory faults and corrupted data. The .Call interface provides you with several capabilities that the standard .C interface lacks, including the following:
- the ability to create variable-length output variables, as opposed to the pre-allocated objects the .C interface expects to write to;
- a simpler mechanism for evaluating R expressions within C;
- the ability to establish direct correspondence between C pointers and R objects.

6.4 Some Compiled Code examples

6.4.1 The arsim example

To apply a first-order recursive filter to a data vector one could write the following R function.

    arsimR <- function(x, phi) {
        n <- length(x)
        if (n > 1) {
            for (i in 2:n) {
                x[i] <- phi * x[i - 1] + x[i]
            }
        }
        x
    }
    tmp <- Sys.time()
    out1 <- arsimR(rnorm(10000), phi = 0.75)
    Sys.time() - tmp
    Time difference of 0.25 secs

We cannot avoid explicit looping in this case, so the R function could be slow for large vectors. We implement the function in C and link it to R. In C we can program the arsim function and compile it to a dll as follows. First, create a text file arsim.c and insert
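Before moving to C, the recursion in arsimR can be cross-checked against base R. The sketch below assumes that stats::filter with method = "recursive" is an acceptable stand-in for the same first-order recursion; this comparison is not part of the original text, and the seed and vector length are arbitrary.

```r
# Same recursion as in the excerpt: x[i] <- phi * x[i - 1] + x[i]
arsimR <- function(x, phi) {
  n <- length(x)
  if (n > 1) {
    for (i in 2:n) {
      x[i] <- phi * x[i - 1] + x[i]
    }
  }
  x
}

set.seed(1)                        # arbitrary seed, only for reproducibility
x <- rnorm(1000)
phi <- 0.75

out_loop <- arsimR(x, phi)

# A recursive filter computes y[i] = x[i] + phi * y[i - 1], starting from zero
out_filter <- as.numeric(stats::filter(x, filter = phi, method = "recursive"))

same <- isTRUE(all.equal(out_loop, out_filter))
print(same)
```

If the two results agree, the compiled routine in stats is an alternative to writing C by hand for this particular filter.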
28. an implementation of a specific method.

    methods(generic.function = extractAIC)
    [1] extractAIC.aov*         extractAIC.coxph*       extractAIC.coxph.penal*
    [4] extractAIC.glm*         extractAIC.lm*          extractAIC.negbin*
    [7] extractAIC.survreg*
    Non-visible functions are asterisked

Creating new classes

R allows the user to define new classes and new specific and generic methods in addition to the existing ones. The function class can be used to assign a certain class to an object. For example:

    mymatrix <- matrix(rnorm(50^2), ncol = 50)
    class(mymatrix) <- "bigMatrix"

The object mymatrix is now a matrix of class bigMatrix (whatever that may mean). The class bigMatrix does not have a lot of meaning yet, since it does not have any specific methods. We will write a number of specific methods for objects of class bigMatrix in the following section.

Using the function class directly is not recommended. One could, for instance, run the following statements without any complaints or warnings.

    m2 <- matrix(rnorm(16), ncol = 4)
    class(m2) <- "lm"

However, m2 is not a real lm object. If it is printed, R will give something strange. When an lm object is printed, the specific function print.lm is called. This function expects a proper lm object with certain components. Our object m2 does not have these components.

    m2
    Call:
    NULL
    Warning messages:
    1: $ operator is deprecated for atomic vect
29. as input To calculate for example the median value of a column in a data frame Either access the column directly or use the function with 137 CHAPTER 8 STATISTICS 8 1 BASIC STATISTICAL FUNCTIONS median cars Price 4 12975 5 with cars mean Price Some functions accept a matrix as input For example the mean of a matrix x mean x will calculate the mean of all elements in the matrix x The function var applied on a matrix x will calculate the covariances between the columns of the matrix x x lt matrix rnorm 99 ncol 3 var x 1 s2 31 1 1 4029791 0 1047594 0 1188696 2 0 1047594 1 0752726 0 0587097 3 0 1188696 0 0587097 0 8468122 The function summary is convenient for calculating basic statistics of columns of a data frame summary cars Price Country Reliability Mileage Min 5866 USA 26 Min 1 000 Min 18 00 1st Qu 9932 Japan 19 ist Qu 2 000 1st Qu 21 00 Median 12216 Japan USA 7 Median 3 000 Median 23 00 Mean 12616 Korea 3 Mean 3 388 Mean 24 58 3rd Qu 14933 Germany 2 3rd Qu 5 000 3rd Qu 27 00 Max 24760 France 1 Max 5 000 Max 737 00 Other 2 NA s 11 000 Type Weight Disp HP Compact 15 Min 1845 Min 73 0 Min 63 0 Large 3 ist Qu 2571 tet Qu 2113 8 Tst Qu T015 Medium 13 Median 2885 Median 144 5 Median 111 5 Small 13 Mean 72901 Mean 152 1 Mean 1122 3 Sporty 9 3rd Qu 3231 3rd Qu 180 0 3rd Qu 142 8 Van 7 Max 73
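The statement above that var applied to a matrix gives the covariances between its columns can be verified with a short sketch. The seed is arbitrary and the object names are illustrative.

```r
set.seed(7)                          # arbitrary seed
x <- matrix(rnorm(99), ncol = 3)     # 33 rows, 3 columns, as in the excerpt

v <- var(x)                          # 3 x 3 covariance matrix of the columns

# The diagonal holds the variance of each column
col_vars <- apply(x, 2, var)

# mean() on a matrix averages over all elements, not per column
overall_mean <- mean(x)

print(dim(v))
print(isTRUE(all.equal(diag(v), col_vars)))
```

For per-column means, colMeans(x) is the usual counterpart to mean(x).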
30. average \bar{x}_w = \sum_i x_i w_i / \sum_i w_i in R of the numbers in a vector x with corresponding weights in the vector w, use:

    ave.w <- sum(x * w) / sum(w)

The multiplication and division operators act on the corresponding vector elements.

Replacing numbers

Suppose we want to replace all elements of a vector which are larger than one by the value 1. You could use the following construction (as in C or Fortran), timing the calculation using Sys.time.

    tmp <- Sys.time()
    x <- rnorm(15000)
    for (i in 1:length(x)) {
        if (x[i] > 1) {
            x[i] <- 1
        }
    }
    Sys.time() - tmp
    Time difference of 0.2110000 secs

However, the following construction is much more efficient.

    tmp <- Sys.time()
    x <- rnorm(15000)
    x[x > 1] <- 1
    Sys.time() - tmp
    Time difference of 0.0400002 secs

The second construction works on the complete vector x at once instead of going through each separate element. Note that it is more reliable to time an R expression using the function system.time or proc.time. See their help files.

The ifelse function

Suppose we want to replace the positive elements in a vector by 1 and the negative elements by -1. When a normal if-else construction is used, each element must be handled individually.

    tmp <- Sys.time()
    x <- rnorm(15000)
    for (i in 1:length(x)) {
        if (x[i] > 0) {
            x[i] <- 1
        } else {
            x[i] <- -1
        }
    }
    Sys.time() - tmp
    Time difference of 0.3009999 secs

In thi
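The constructions in the excerpt above can be compared side by side. This is a minimal sketch: the seed and the weights are made up for illustration, and pmin is shown as one further vectorized alternative that the original text does not mention.

```r
set.seed(42)                         # arbitrary seed
x <- rnorm(15000)

# Element-by-element loop, as in the first construction
x_loop <- x
for (i in seq_along(x_loop)) {
  if (x_loop[i] > 1) x_loop[i] <- 1
}

# Vectorized replacement, as in the second construction
x_vec <- x
x_vec[x_vec > 1] <- 1

# Alternative (not in the original text): parallel minimum against 1
x_pmin <- pmin(x, 1)

# ifelse maps positive elements to 1 and all others to -1 in one call
s <- ifelse(x > 0, 1, -1)

# Weighted average with illustrative random weights
w <- runif(length(x))
ave_w <- sum(x * w) / sum(w)

# system.time is the more reliable way to time an expression
timing <- system.time(x[x > 1])
```

All three capped vectors hold the same values; only the amount of interpreted looping differs.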
31. coordinates The function mtext is used to place text in one of the four margins of the plot mtext Text in the margin side 4 In R you can place ordinary text on plots but also special symbols Greek characters and mathematical formulas on the graph You must use an R expression inside the title legend mtext or text function This expression is interpreted as a mathematical expression similar to the rules in La Tex 120 CHAPTER 7 GRAPHICS 7 3 MODIFYING A GRAPH text 1 5 1 5 expression pastel frac 1 sigma sqrt 2 pi plain e frac x mu 2 2 sigma 2 Jy cex 1 2 See for more information the help of the plotmath function My title c 2 2 text in the margin c 2 2 My subtitle Figure 7 10 The graph that results from the previous low level plot functions 7 3 3 Controlling the axes When you create a graph the axes and the labels of the axes are drawn automatically with default settings To change those settings you specify the graphical parameters 121 CHAPTER 7 GRAPHICS 7 3 MODIFYING A GRAPH that control the axis or use the axis function One approach would be to first create the plot without the axis with the axes F argument and then draw the axis using the low level axis function x lt rnorm 100 y lt rnorm 100 do not draw the axes automatically plot x y axes F draw them manually axis side 1 axis side 2 The side argument repr
32. create a multi-panel layout, use the conditioning operator | in the formula. To see the relationship between Price and Weight for each Type of car, use the following construction.

    xyplot(Price ~ Weight | Type, data = cars)

Use the * operator to specify more than one conditioning variable. The following example demonstrates two conditioning variables. First create some example data: one numeric variable and two grouping variables with three levels.

    x <- rnorm(1000)
    y <- sample(letters[1:3], size = 1000, rep = T)
    z <- sample(letters[11:13], size = 1000, rep = T)
    exdata <- data.frame(x, y, z)

Next, a histogram plot is created for the variable x, conditioned on the variables y and z.

    histogram(~ x | y * z, data = exdata)

For each combination of the levels of y and z a histogram plot is created. The order can be changed.

    histogram(~ x | z * y, data = exdata)

[Figure 7.12: Trellis plot, Price versus Weight for different types. Panels: Compact, Large, Medium, Small, Sporty, Van; x axis: Weight.]

The above examples were based on conditioning variables of type factor. In this case R will c
33. function will not overwrite objects outside the function. An object created within a function will be lost when the function has finished. If the last line of the function definition is an assignment, the result of that assignment is returned by the function (invisibly, so it has to be printed explicitly). In the next example an object x is defined with value zero. Inside the function functionx, x is defined with value 3. Executing the function functionx will not affect the value of the global variable x.

    x <- 0
    functionx <- function() {
        x <- 3
    }
    print(functionx())
    [1] 3
    x
    [1] 0

If you want to change the global variable x with the return value of the function functionx, you can assign the function result to x.

    # overwriting the object x with the result of functionx
    x <- functionx()

The arguments of a function can be objects of any type, even functions. Consider the next example.

    test <- function(n, fun) {
        u <- runif(n)
        fun(u)
    }
    test(10, sin)
    [1] 0.28078332 0.30438298 0.55219120 0.37357375 ...

The second argument of the function test needs to be a function, which will be called inside the function.

5.2.4 Returning an object

Often the purpose of a function is to do some calculations on input arguments and return the result. By default the last expression of the function will be returned.

    myf <- function(x, y) {
        z1 <- sin(x)
        z2 <- cos(y)
        z1 + z2
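Since a function returns a single object, several results are usually bundled in a list. The sketch below uses a hypothetical helper, trig_pair, and names the list components so they can be retrieved with the $ operator; the names are illustrative, not from the original text.

```r
# Hypothetical helper (trig_pair); component names are illustrative.
trig_pair <- function(x, y) {
  z1 <- sin(x)
  z2 <- cos(y)
  # One list carries all three results back to the caller
  list(sine = z1, cosine = z2, total = z1 + z2)
}

res <- trig_pair(0, 0)
print(res$sine)      # sin(0)
print(res$cosine)    # cos(0)
print(res$total)     # their sum

# A function can itself be passed as an argument
apply_fun <- function(n, fun) fun(seq_len(n))
squares_sum <- apply_fun(3, function(v) sum(v^2))   # 1 + 4 + 9
```

Naming the components keeps calling code readable and avoids relying on their position in the list.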
34. ifelse function x lt rnorm 1000 cont lt rnorm 1000 0 10 p lt runif 1000 z lt ifelse p lt 0 95 x cont The function sample randomly samples from a given vector By default it samples without replacement and by default the sample size is equal to the length of the input vector Consequently the following statement will produce a random permutation of the elements 1 to 50 139 CHAPTER 8 STATISTICS 8 1 BASIC STATISTICAL FUNCTIONS x lt y y 1 21 41 Code Distribution Parameters Defaults beta beta shapel shape2 binom binomial size prob cauchy Cauchy location scale 0 1 chisq chi squared df ncp 1 exp exponential rate f F df1 df2 gamma gamma shape rate scale 1 1 rate geom geometric prob hyper hyper geometric m n k E lnorm lognormal meanlog sdlog 0 1 logis logistic location scale 0 1 nbinom negative binomial size prob mu norm normal Gaussian mean sd 0 1 pois Poisson Lambda 1 t Student s t df ncp 0 unif uniform min max 0 1 weibull Weibull shape scale 1 wilcoxon Wilcoxon m n Table 8 2 Probability distributions in R 1 50 sample x 1 45 42 27 49 14 38 18 5 44 41 2 22 4 35 36 26 19 32 47 48 20 10 37 39 16 11 50 33 25 12 24 34 30 6 43 13 15 40 31 17 21 728 3232946 8 9 To randomly sample three elements from x use sample x 3 1 13 4 1 To sample three elements from x with
35. in 10 Sept Use the substring function to extract the integers Note that the result of the substring function is of type character To convert that to numeric use the as numeric function as numeric substring x w wtattr w match length 1 1 10 9 2 4 4 5 3 Replacing characters The functions sub and gsub are used to replace a certain pattern in a character object with another pattern mychar lt c My_test My_Test_3 _qwerty_pop_ sub pattern _ replacement x mychar 1 My test My Test_3 qwerty_pop_ gsub pattern _ replacement x mychar 1 My test My Test 3 qwerty pop Note that by default the pattern argument is a regular expression When you want to replace a certain string it may be handy to use the fixed argument as well mychar lt c mytest abctestabc test po test gsub pattern test replacement x mychar fixed TRUE 1 my abcabc po 67 CHAPTER 4 DATA MANIPULATION 4 6 CREATING FACTORS FROM 4 5 4 Splitting characters A character string can be split using the function strsplit The two main arguments are x and split The function returns the split results in a list each list componenent is the split result of an element of x strsplit x c Some text another string split NULL 111 11 ES o m te LL dla Ag Myt sida y 121 EL R n Mat ee sl a Uan did a te uE wm wM Aide a h g The argum
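The difference between sub and gsub, and the effect of the fixed argument, can be seen in a small sketch. The replacement string (a single space) and the second test string are assumptions made for illustration.

```r
mychar <- c("My_test", "My_Test_3", "_qwerty_pop_")

# sub replaces only the first match in each element
first_only <- sub(pattern = "_", replacement = " ", x = mychar)

# gsub replaces every match
all_matches <- gsub(pattern = "_", replacement = " ", x = mychar)

# By default the pattern is a regular expression: "." matches any character
as_regex <- gsub(".", "-", "a.b.c")
# With fixed = TRUE the pattern is taken literally
as_fixed <- gsub(".", "-", "a.b.c", fixed = TRUE)

# strsplit returns a list with one component per element of x
pieces <- strsplit(x = c("Some text", "another string"), split = " ")

print(first_only)
print(all_matches)
```

Using fixed = TRUE is a simple way to avoid surprises when the string to be replaced contains characters that are special in regular expressions.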
36. in the environment in which the function was created. The following calls to the function myf show this rule.

    ## R tries to find a1 in the environment where myf was created,
    ## but there is no object a1
    myf(8)
    Error in myf(8) : object "a1" not found

    ## define the objects a1 and a2,
    ## but what value did a2 in the function insidef get?
    a1 <- 10
    a2 <- 1000
    myf(8)
    [1] 1.392117
    ## It took a2 in myf, so a2 has the value 9

5.2.6 Lazy evaluation

When writing functions in R, a function argument can be defined as an expression, like:

    myf <- function(x, nc = length(x)) {
        ## rest of the function
    }

When arguments are defined in such a way, you must be aware of the lazy evaluation mechanism in R. This means that arguments of a function are not evaluated until needed. Consider the following examples.

    myf <- function(x, nc = length(x)) {
        x <- c(x, x)
        print(nc)
    }
    xin <- 1:10
    myf(xin)
    [1] 20

The argument nc is evaluated after x has doubled in length; it is not ten, the initial length of x when it entered the function.

    logplot <- function(y, ylab = deparse(substitute(y))) {
        y <- log(y)
        plot(y, ylab = ylab)
    }

The plot will create a nasty label on the y axis. This is the result of lazy evaluation: ylab is evaluated after y has changed. One solution is to force an evaluation of ylab first. logp
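The lazy evaluation behaviour described above can be reproduced, and avoided, with base R's force function. The function names lazy_len and eager_len are hypothetical; the sketch only illustrates when the default argument nc gets evaluated.

```r
# Hypothetical names (lazy_len, eager_len); a sketch of lazy evaluation.
lazy_len <- function(x, nc = length(x)) {
  x <- c(x, x)       # x doubles before nc is ever looked at
  nc                 # default evaluated only now, so it sees the longer x
}

eager_len <- function(x, nc = length(x)) {
  force(nc)          # evaluate the default while x still has its input length
  x <- c(x, x)
  nc
}

xin <- 1:10
print(lazy_len(xin))
print(eager_len(xin))
```

Calling force at the top of the function body is a common way to pin down defaults that depend on other arguments.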
37. is regarded as a vector of length one.

    y <- c(x, 0.55, x, x)
    y
     [1] 10.00  5.00  3.00  6.00  0.55 10.00  5.00  3.00  6.00
    [10] 10.00  5.00  3.00  6.00

Typing the name of an object in the commands window results in printing the object. The numbers between square brackets indicate the position of the following element in the vector. Use the function round to round the numbers in a vector.

    round(y, 3)   # round to 3 decimals

Mathematical calculations on vectors

Calculations on numerical vectors are usually performed on each element. For example, x * x results in a vector which contains the squared elements of x.

    x
    [1] 10  5  3  6
    z <- x * x
    z
    [1] 100  25   9  36

The symbols for elementary arithmetic operations are +, -, *, /. Use the ^ symbol to raise to a power. Most of the standard mathematical functions are available in R. These functions also work on each element of a vector. For example, the logarithm of x:

    log(x)
    [1] 2.302585 1.609438 1.098612 1.791759

    Function name         Operation
    abs                   absolute value
    asin acos atan        inverse trigonometric functions
    asinh acosh atanh     inverse hyperbolic functions
    exp log               exponent and natural logarithm
    floor ceiling trunc   creates integers from floating point numbers
    gamma lgamma          gamma and log gamma function
    log10                 logarithm with base 10
    round                 rounding
    sin cos tan           trigonometric functions
    sinh cosh tanh        hyperbolic functions
    sqrt                  square root

Table 2
38. it to a factor sex lt factor sex sex 0111213 Levels 1 2 23 CHAPTER 2 DATA OBJECTS 2 1 DATA TYPES The object sex looks like but is not an integer variable The 1 represents level 1 here So arithmetic operations on the sex variable are not possible sex 7 1 NA NA NA NA NA Warning message not meaningful for factors in Ops factor sex 7 It is better to rename the levels so level 1 becomes male and level 2 becomes female levels sex lt c male female sex 1 male male female male female You can transform factor variables to double or integer variables using the as double or as integer function sex numeric lt as double sex sex numeric faq 2 toa The 1 is assigned to the female level only because alphabetically female comes first If the order of the levels is of importance you will need to use ordered factors Use the function ordered and specify the order with the levels argument For example Income lt c High Low Average Low Average High Low Income lt ordered Income levels c Low Average High Income 1 High Low Average Low Average High Low Levels Low lt Average lt High The last line indicates the ordering of the levels within the factor variable When you transform an ordered factor variable the order is used to assign numbers to the levels Income numeric lt as double Income Income numeric 1413120231 The
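How the level order drives the numeric codes of a factor can be checked with the Income vector from the excerpt above. The comparison on the last line is an extra illustration not taken from the original text.

```r
Income <- c("High", "Low", "Average", "Low", "Average", "High", "Low")

# A plain factor sorts its levels alphabetically
plain <- factor(Income)
plain_codes <- as.integer(plain)

# An ordered factor uses the order given in the levels argument
ord <- ordered(Income, levels = c("Low", "Average", "High"))
ord_codes <- as.integer(ord)

# Ordered factors support comparisons against a level
below_high <- ord < "High"

print(levels(plain))
print(plain_codes)
print(ord_codes)
```

The same data therefore yield different numeric codes depending on whether the level order was left to the alphabet or stated explicitly.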
39. library Packages in library C PROGRA 1 R R 25 1 0 library base The R Base Package boot Bootstrap R S Plus Functions Canty class Functions for Classification cluster Cluster Analysis Extended Rousseeuw et al codetools Code Analysis Tools for R datasets The R Datasets Package DBI R Database Interface foreign Read Data Stored by Minitab 5 SAS SPSS Stata Systat dBase graphics The R Graphics Package If you have a connection to the internet then a package on CRAN can be installed very easily To install a new package go to the Packages menu and select Install package s Then select a CRAN mirror near you a long list with all the packages will appear where you can select one or more packages Click OK to install the selected packages Note that the packages are only installed on your machine and not loaded attached to your current R session As an alternative to the function search use sessionInfo to see system packages and user attached packages 14 CHAPTER 1 INTRODUCTION 1 8 CONFLICTING OBJECTS gt sessionInfo R version 2 5 0 2007 04 23 1386 pc mingw32 locale LC_COLLATE English_United States 1252 LC_CTYPE English_United States 1252 LC_MONETARY English_United States 1252 LC_NUMERIC C LC_TIME English_United States 1252 attached base packages 1 stats graphics grDevices datasets tcltk 7 utils methods base other attached packages MASS svSocket sv
40. list. The names of the list components and the contents of list components can be specified as arguments of the list function by using the = character.

    x1 <- 1:5
    x2 <- c(T, T, F, F, T)
    y <- list(numbers = x1, wrong = x2)
    y
    $numbers
    [1] 1 2 3 4 5
    $wrong
    [1]  TRUE  TRUE FALSE FALSE  TRUE

So the left-hand side of the = operator in the list function is the name of the component and the right-hand side is an R object. The order of the arguments in the list function determines the order in the list that is created. In the above example the logical object wrong is the second component of y.

    y[[2]]
    [1]  TRUE  TRUE FALSE FALSE  TRUE

The function names can be used to extract the names of the list components. It is also used to change the names of list components.

    names(y)
    [1] "numbers" "wrong"
    names(y) <- c("lots", "valid")
    names(y)
    [1] "lots"  "valid"

To add extra components to a list proceed as follows.

    y[[3]] <- 1:50
    y$test <- "hello"
    y
    $lots
    [1] 1 2 3 4 5
    $valid
    [1]  TRUE  TRUE FALSE FALSE  TRUE
    [[3]]
     [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
    [26] 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50
    $test
    [1] "hello"

Note the difference between single square brackets and double square brackets.

    y[1]
    $lots
    [1] 1 2 3 4 5
    y[[1]]
    [1] 1 2 3 4 5

When single square brackets are used, the component is returned as list
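The distinction between single and double square brackets can be made explicit by inspecting what each returns. A minimal sketch, using the same list as in the excerpt; the component name "extra" is illustrative.

```r
y <- list(numbers = 1:5, wrong = c(TRUE, TRUE, FALSE, FALSE, TRUE))

sub_list  <- y[1]      # single brackets: a list of length one
component <- y[[1]]    # double brackets: the component itself

print(is.list(sub_list))     # still a list
print(is.list(component))    # an integer vector

# Components can be added by name with double brackets or with $
y[["extra"]] <- "hello"
print(names(y))
```

Arithmetic such as sum(y[[1]]) works on the component, whereas sum(y[1]) fails because its argument is still a list.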
41. mtcars EL mpg cyl disp hp drat W Wea qsec ug tam gear t1 carb Creating data frames You can create data frames in several ways by importing a data file as in Chapter 3 for example or by using the function data frame This function can be used to create new data frames or convert other objects into data frames A few examples of the data frame function my logical lt sample c T F size 20 replace T my numeric lt rnorm 20 my df lt data frame my logical my numeric my df my logical my numeric 1 TRUE 0 63892503 2 TRUE 1 14575124 3 TRUE 1 27484164 19 TRUE 0 01115154 20 TRUE 1 07818944 36 CHAPTER 2 DATA OBJECTS 2 2 DATA STRUCTURES test lt matrix rnorm 21 7 3 create a matrix with random elements test lt data frame test X1 X2 X3 0 36428978 0 63182432 0 6977597 0 24943864 1 05139082 0 9063837 0 95472560 0 46806163 1 0057703 48152529 2 03857066 0 7163017 0 71593428 2 18493234 2 7043682 1 20729385 0 50772018 1 1240321 0 07551876 0 06711515 0 1897599 NO oP WN EF o names test 1 xq xo x3 R automatically creates column names X1 X2 and X3 You can use the names function to change these column names names test lt c Price Length Income row names test lt c Paul Ian Richard David Rob Andrea John test Price Length Income Paul 0 36428978 0 63182432 0 6977597 Tan 0 24943864 1 05139082 0
42. [Figure 7.16: Trellis plot adding a least squares line in each panel. Panels include Compact, Large, Medium; x axis: Mileage.]

    data = cars, col = "dark green", pch = 2)

7.5 The ggplot2 package

The ggplot2 package, see [7], is a collection of plotting routines based on the grammar of graphics, see [8]. The functions in ggplot2 can take away some of the annoying extra code that makes plotting a hassle, like drawing legends. At the same time, ggplot2 provides a powerful model of graphics that makes it easy to produce complex multi-layered graphics. This section only gives a brief introduction; for a thorough description of the possibilities see http://had.co.nz/ggplot2.

7.5.1 The qplot function

The function qplot (quick plot) in ggplot2 can be used to create complex plots with little coding effort. The following code displays some examples.

[Figure residue: conditioning plot with panels "Given : depth" and "Given : mag".]
43. of a regression variable or to identify influential points. See [9] and [13] for a detailed discussion on how to use the residuals from a Cox model. As an example, we use the martingale residuals to look at the functional form of the Age regression variable. Do this by:
- fitting a model without the Age variable (in our case the model reduces to a model with only the intercept);
- extracting the martingale residuals from that model;
- plotting the martingale residuals against the Age variable, see Figure 8.8.

    IDU.analysis0 <- coxph(Surv(IncubationTime, AidsStatus) ~ 1, data = IDUdata)
    mgaleres <- resid(IDU.analysis0, type = "martingale")
    plot(IDUdata$Age, mgaleres, xlab = "Age", ylab = "Residuals")

[Figure 8.8: Scatter plot of the martingale residuals. x axis: Age; y axis: Residuals.]

An estimation of the survival time can be made for subjects of certain ages; use the function survfit as in the following code. The output shows that the median predicted survival time for a subject of age ten is infinite. As Figure 8.9 shows, the solid line, corresponding to a subject of age ten, never reaches 0.5.
44. of the function lm is stored in the object cars.lm, which is an object of class lm. To print the object, simply enter its name in the R console window.

    cars.lm
    Call:
    lm(formula = Weight ~ Mileage, data = fuel.frame)
    Coefficients:
    (Intercept)    Mileage
        5057.83  -87.74223
    Degrees of freedom: 60 total; 58 residual
    Residual standard error: 265.1798

Objects of class lm, and almost every other object resulting from statistical modeling functions, have their own printing method in R. What you see when you type in cars.lm is not the complete content of the cars.lm object. Use the function print.default to see the complete object.

    print.default(cars.lm)
    $coefficients
    (Intercept)     Mileage
     5057.82990   -87.74223
    $residuals
      Eagle Summit 4    Ford Escort 4
          397.663796       182.663796
      Ford Festiva 4    Honda Civic 4
           33.632728         9.921563
     Mazda Protege 4 Mercury Tracer 4
          189.921563      -491.531836
    ...

As you can see, the object cars.lm is in fact a list with named components (coefficients, residuals, etc.). Use the function names to retrieve all the component names of the cars.lm object.

    names(cars.lm)
     [1] "coefficients"  "residuals"     "effects"
     [4] "rank"          "fitted.values" "assign"
     [7] "qr"            "df.residual"   "xlevels"
    [10] "call"          "terms"         "model"

So cars.lm contains much more information than you would see by just printing it. The next table gives an overview of some generic functions wh
45. replacement use sample x 3 rep T 1 136 9 To randomly select five cars from the data frame cars proceed as follows 140 CHAPTER 8 STATISTICS 8 2 REGRESSION MODELS x lt sample 1 dim cars 1 5 cars x Weight Disp Mileage Fuel Type Toyota Camry 4 2920 122 27 3 703704 Compact Acura Legend V6 3265 163 20 5 000000 Medium Ford Festiva 4 1845 81 37 2 702703 Small Honda Civic 4 2260 91 32 3 125000 Small Dodge Grand Caravan V6 3735 202 18 5 555556 Van There are a couple of algorithms implemented in R to generate random numbers look at the help of the function set seed set seed to see an overview The algorithms need initial values to generate random numbers the so called seed of a random number generator These initial numbers are stored in the S vector Random seed Every time random numbers are generated the vector Random seed is modified which means that the next random numbers differ from the previous ones If you need to re produce your numbers you need to manually set the seed with the set seed function set seed 12 rnorm 5 1 1 258 0 710 1 807 2 229 1 429 rnorm 5 different random numbers set seed 12 rnorm 5 the same numbers as the first call 1 1 258 0 710 1 807 2 229 1 429 8 2 Regression models 8 2 1 Formula objects R has many routines to fit and analyse statistical models In general these models are used by calling a modeling function like 1m tree glm nls or coxp
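The effect of set.seed on reproducibility, as described in the excerpt above, can be checked directly. The seed value 12 follows the excerpt; the variable names are illustrative.

```r
set.seed(12)
first  <- rnorm(5)
second <- rnorm(5)        # the generator state has advanced: different numbers

set.seed(12)              # reset the state
again <- rnorm(5)         # reproduces the first draw

# sample() without further arguments returns a random permutation
x <- 1:50
set.seed(12)
perm <- sample(x)

# Sampling with replacement allows repeated values
with_repl <- sample(x, 3, replace = TRUE)

print(identical(first, again))
```

Setting the seed at the top of a script is a simple way to make simulation results repeatable between runs on the same R version.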
46. scatter plot with a loess smoothing line and then a scatter plot with a regression line using lm.

    qplot(x, y, data = testdata, geom = c("point", "smooth"), span = 0.2)
    qplot(x, y, data = testdata, geom = c("point", "smooth"), method = "lm")

7.5.2 Facetting

Facetting in ggplot2 is the equivalent of trellis plots and allows you to display certain subsets of your data in different facets.

    x <- rnorm(1000)
    G1 <- sample(c("A", "B", "C"), size = 1000, rep = T)
    G2 <- sample(c("X", "Y", "Z"), size = 1000, rep = T)
    testdata <- data.frame(x, G1, G2)
    qplot(x, data = testdata, facets = G1 ~ G2, geom = "histogram")

7.5.3 Plots with several layers

Although the function qplot is enough for creating many plots, the use of the ggplot function in combination with geometric object functions is a much more powerful way to create plots that consist of several layers.

    c <- ggplot(data = testdata, aes(y = y, x = x))
    c + stat_smooth(colour = 3, size = 6) + geom_point()

8 Statistics

The base installation of R contains many functions for calculating statistical summaries, data analysis and statistical modeling. Even more functions are available in all the R packages on CRAN. In this section we will discuss only some of these functions. For a more comprehensive overview of the statistical possibilities see, for example, [9] and [10].

8.1 Basic statistical functions

8.1.1 Statisti
47. shows some more parameters; these are usually set as an argument of the plotting routine. For example, plot(x, y, col = 2).
- lwd: the line width of lines in a plot, a positive number; default lwd = 1.
- lty: the line type of lines in a plot; this can be a number or a character. For example, lty = "dashed".
- col: the color of the plot; this can be a number or a character. For example, col = "red".
- font: an integer specifying which font to use for text on plots.
- pch: an integer or character that specifies the plotting symbols in scatterplots.
- xlab, ylab: character strings that specify the labels of the x and y axis. Usually given directly with the high-level plotting functions like plot or hist.
- cex: character expansion. A numerical value that gives the amount to scale the plotting symbols and texts. The default is 1.

Some of the graphical parameters may be set as vectors, so that each point, text or symbol could have its own graphical parameter. This is another way to display an additional dimension. Let's look at a plot with different symbols: for the cars data set we can plot the Price and Mileage variables in a scatterplot and have different symbols for the different Types of cars.

    Ncars <- dim(cars)[1]
    plot(cars$Price, cars$Mileage, pch = as.integer(cars$Type))
    legend(20000, 37, legend = levels(cars$Type), cex = 1.25, pch = 1:6)

The col
48. symbol NA. There is also the symbol NaN (Not a Number), which can be detected with the function is.nan.

    x <- as.double(c("1", "2", "qaz"))
    is.na(x)
    [1] FALSE FALSE  TRUE
    z <- sqrt(c(1, -1))
    Warning message:
    NaNs produced in: sqrt(c(1, -1))
    is.nan(z)
    [1] FALSE  TRUE

Infinite values are represented by Inf or -Inf. You can check if a value is infinite with the function is.infinite. Use is.finite to check if a value is finite.

    x <- c(1, 3, 4)
    y <- c(1, 0, 4)
    x / y
    [1]   1 Inf   1
    z <- log(c(4, 0, 8))
    is.infinite(z)
    [1] FALSE  TRUE FALSE

In R, NULL represents the null object. NULL is used mainly to represent lists with zero length, and is often returned by expressions and functions whose value is undefined.

2.2 Data structures

Before you can perform statistical analysis in R, your data has to be structured in some coherent way. To store your data R has the following structures:
- vector
- matrix
- array
- data frame
- time series
- list

2.2.1 Vectors

The simplest structure in R is the vector. A vector is an object that consists of a number of elements of the same type: all doubles or all logical. A vector with the name x, consisting of four elements of type double (10, 5, 3, 6), can be constructed using the function c.

    x <- c(10, 5, 3, 6)
    x
    [1] 10  5  3  6

The function c merges an arbitrary number of vectors to one vector. A single number
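The checks for NA, NaN and infinite values described above behave slightly differently from one another, which a short sketch makes visible. Warnings are suppressed here only to keep the output clean; that choice is not part of the original text.

```r
# Conversion of a non-numeric string gives NA (with a warning)
x <- suppressWarnings(as.double(c("1", "2", "qaz")))
na_flags <- is.na(x)

# The square root of a negative number gives NaN (with a warning)
z <- suppressWarnings(sqrt(c(1, -1)))
nan_flags <- is.nan(z)

# NaN also counts as missing for is.na, but NA is not NaN
nan_is_na <- is.na(z)
na_is_nan <- is.nan(x)

# Division by zero and log(0) give infinite values
r <- c(1, 3, 4) / c(1, 0, 4)
inf_flags <- is.infinite(r)
neg_inf <- log(0)

print(na_flags)
print(nan_flags)
print(inf_flags)
```

Note that is.finite returns FALSE for NA, NaN, Inf and -Inf alike, so it is the strictest of these checks.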
49. the existing R functions already. In fact, the source code of R is available, so you can see many examples. There are a couple of ways to link C or Fortran code to R. On Windows platforms the use of dynamic link libraries (dll's) is probably the easiest solution. For a detailed description see, for example, the R manual "Writing R Extensions" or Chapter 6 and Appendix A of [4].

Reasons to use compiled code

Compiled C or Fortran code is faster than interpreted R code. Loops and especially recursive functions run a lot faster and a lot more efficiently when they are programmed in C or Fortran. It is also possible that you already have some tested code at hand that performs a certain routine. Translating the entire C code to R can be cumbersome, so it may pay off to organize the C code in such a way that it can be used within R.

6.3.1 The .C and .Fortran interfaces

The .C and .Fortran interfaces are basic interfaces to C and Fortran. To call a C function that is loaded in R, use the function .C, giving it the name of the function (as a character string) and one argument for each argument of the C function. Note that if you pass a vector x to the C code, you also need to explicitly pass its length. In C it is not possible (as with length(x) in R) to find out the length from the vector x alone.

    .C("arsim", x = as.double(x), n = as.integer(length(x)))

We'll define the C routine ar
50. to an object, the debugging process will continue with this new value. If the debug process is finished, remove the debug flag.

    undebug(myf)

5.4.4 The browser function

It may happen that an error occurs at the end of a lengthy function. To avoid stepping through the function line by line manually, the function browser can be used. Inside your function, insert the browser statement at the location where you want to enter the debugging environment.

    myf <- function(x) {
        ... some code ...
        browser()
        ... some code ...
    }

Run the function myf as usual. When R reaches the browser statement, normal execution is halted and the debug environment is started.

6 Efficient calculations

6.1 Vectorized computations

The efficiency of calculations depends on how you perform them. Vectorized calculations, for example, avoid going through individual vector or matrix elements and avoid for loops. Though very efficient, vectorized calculations cannot always be used. On the other hand, users with a Pascal or C programming background often forget to apply vectorized calculations where they could be used. We therefore give a few examples to demonstrate their use.

A weighted average

Take advantage of the fact that most calculations and mathematical operations already act on each element of a matrix or vector. For example, log(x) and sin(x) calculate the log and sin of all elements of the vector x. For example, to calculate a weighted
51. very infrequently observed. Tree-based models can be used in the regression context to create bins or perform the coarse classification. In this context a response variable and a regression variable, for which we want to create bins, are available. Suppose we have the following data:

    age <- runif(500, 17, 75)
    p = exp(-0.1*age) + 0.5
    r <- runif(500)
    y <- ifelse(p > r, "bad", "good")
    testdata <- data.frame(age, y)

So the probability of observing good increases with age. For the creation of a score card we don't want to use the absolute value of age; we want to bin the age variable and use those bins. How do we choose these bins?

• Simple approach: just manually split the age variable into intervals, for example intervals with the same number of points or intervals with the same length.
• Use a tree-based approach: fit a tree with only the age variable as the regression variable and the good/bad variable as the response. To make the analysis more robust we want a minimum number of observations in a bin, for example 30.

    out <- rpart(y ~ age, data = testdata, control = rpart.control(minbucket = 30))
    plot(out)
    text(out)

The tree in Figure 8.6 shows the result of the binning. In this case the tree algorithm identifies four age intervals (bins): age < 26.6, 26.6 < age < 31.51, 31.51 < age < 41.88 and age > 41.88.
52. whereas double square brackets return the component itself.

Transforming objects to a list

Many objects can be transformed to a list with the function as.list, for example vectors, matrices and data frames.

    > as.list(1:6)
    [[1]]
    [1] 1
    [[2]]
    [1] 2
    [[3]]
    [1] 3
    ...

The str function

A handy function is the str function; it displays the internal structure of an R object. The function can be used to see a short summary of an object.

    > x1 <- rnorm(1000)
    > x2 <- matrix(rnorm(80000), ncol = 80)
    > myl1 <- list(x1, x2, my.df)
    > str(x1)
     num [1:1000] 2.326 1.889 1.740 1.008 0.916 ...
    > str(x2)
     num [1:1000, 1:80] 0.0368 0.2626 0.8323 0.3204 0.2559 ...
    > str(my.df)
    'data.frame': 20 obs. of 2 variables:
     $ my.logical: logi TRUE TRUE TRUE TRUE TRUE FALSE ...
     $ my.numeric: num 0.0079 0.0480 0.4988 0.2047 0.6340 ...
    > str(myl1)
    List of 3
     $ : num [1:1000] 2.326 1.889 1.740 1.008 0.916 ...
     $ : num [1:1000, 1:80] 0.0368 0.2626 0.8323 0.3204 0.2559 ...
     $ :'data.frame': 20 obs. of 2 variables:
      ..$ my.logical: logi [1:20] TRUE TRUE TRUE TRUE TRUE FALSE ...
      ..$ my.numeric: num [1:20] 0.0079 0.0480 0.4988 0.2047 0.6340 ...

3 Importing data

One of the first things you want to do in a statistical data analysis system is to import data. R provides a few methods to import data; we will discuss them in this chapter.

3.1 Text files

In R you can import text files with t
53.
    windows(width = 9)
    hist(rnorm(100))
    windows(width = 10)
    qqnorm(rnorm(100))

Now three devices of different size are open. A list of all open devices can be obtained by using the function dev.list.

    > dev.list()
    windows windows windows
          2       3       4

When more than one device is open, there is one active device and one or more inactive devices. To find out which device is active, the function dev.cur can be used.

    > dev.cur()
    windows
          4

Low-level plot commands are placed on the active device. In the above example, the command title("qqplot") will result in a title on the qqnorm graph. Another device can be made active by using the function dev.set.

    > dev.set(which = 2)
    > title("Scatterplot")

A device can be closed using the function dev.off. The active device is then closed. For example, to export an R graph to a jpeg file so that it can be used on a website, use the jpeg device:

    > jpeg("C:\\Test.jpg")
    > plot(rnorm(100))
    > dev.off()

7.3 Modifying a graph

7.3.1 Graphical parameters

To change the layout of a plot, or to change a certain aspect of a plot such as the line type or symbol type, you will need to change certain graphical parameters. We have seen some in the previous section. The graphical functions in R accept graphical parameters as extra arguments. These graphical parameters are usually three or four letter abbreviations (like col, cex or mai). The following use of the plot funct
54.
    Slot "x":
     [1] 0.41644379 0.89240433 0.88980142 0.77224325 0.80395122 0.83608564
     [7] 0.04149246 0.24511134 0.74946802 0.26268302
    Slot "y":
     [1] 0.6828478 0.2134961 0.8681543 0.9748187 0.0253564 0.9479711 0.3381227
     [8] 0.3446705 0.4415452 0.0979566
    Slot "species":
     [1] ...

When you instantiate new objects from a certain class, you can perform a validity check. For example, our fungi class should have input vectors of the same length. We can build in a validity check as follows.

    ## function to check validity; it should return TRUE or FALSE
    validFungi <- function(object) {
        len <- length(object@x)
        if (len != length(object@y) || length(object@species) != len) {
            cat("Mismatch in length of slots")
            return(FALSE)
        }
        else return(TRUE)
    }
    setClass("fungi",
        representation(x = "numeric", y = "numeric", species = "character"),
        validity = validFungi
    )
    setValidity("fungi", validFungi)

    > field2 <- new("fungi", x = runif(110), y = runif(10), species = sample(letters[1:5], rep = T, 110))
    Error in validObject(.Object) : invalid class "fungi" object: FALSE
    Mismatch in length of slots

The function validFungi, like any validity checking function, must have exactly one argument, called object.

Creating new generic and specific methods

An example of a generic function is show, which shows an object. If we want to show our objects from class fungi in a diffe
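The validity check described above can be tried out with the following self-contained sketch. It is a variation on the text, not a copy: here the validity function returns a character string describing the problem instead of FALSE, which is the convention the methods package documents for validity methods. The slot names follow the fungi class from the text; the object names ok and bad are only illustrative.

```r
library(methods)

# Validity function: returns TRUE, or a message describing what is wrong
validFungi <- function(object) {
  len <- length(object@x)
  if (length(object@y) != len || length(object@species) != len) {
    return("Mismatch in length of slots")
  }
  TRUE
}

setClass("fungi",
  representation(x = "numeric", y = "numeric", species = "character"),
  validity = validFungi
)

# A valid object: all three slots have length 10
ok <- new("fungi", x = runif(10), y = runif(10),
          species = sample(letters[1:5], 10, replace = TRUE))

# An invalid object: new() stops with an error, which we capture as text
bad <- tryCatch(
  new("fungi", x = runif(110), y = runif(10),
      species = sample(letters[1:5], 110, replace = TRUE)),
  error = function(e) conditionMessage(e)
)
```

After running this, bad holds the error message, which includes the string returned by validFungi.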
55.
    0 265 1 3 22
    7 1 202 22 282 0 0 19
    8 1 195 29 195 0 0 100
    9 1 183 22 166 0 0 123
    10 1 223 33 224 1 2 10

An estimation of the survival curve for the incubation time can be calculated with the function survfit.

    > kmfit <- survfit(Surv(IncubationTime, AidsStatus) ~ 1, data = IDUdata)
    > kmfit
    Call: survfit(formula = Surv(IncubationTime, AidsStatus) ~ 1, data = IDUdata)
          n  events  median 0.95LCL 0.95UCL
        418      76     135     118     Inf

The median survival time is estimated to be 135 months. So if a person is infected with HIV, he has a 50% probability that he will not develop AIDS within 135 months. A numerical and graphical output of the complete survival curve can be created from the kmfit object. Use the functions summary and plot.

    > summary(kmfit)
    Call: survfit(formula = Surv(IncubationTime, AidsStatus) ~ 1, data = IDUdata)
     time n.risk n.event survival std.err lower 95% CI upper 95% CI
        0    414       3    0.993 0.00417        0.985        1.000
        1    405       1    0.990 0.00483        0.981        1.000
        2    402       1    0.988 0.00541        0.977        0.998
        5    388       1    0.985 0.00596        0.974        0.997
        7    385       2    0.980 0.00694        0.967        0.994
       10    378       3    0.972 0.00821        0.956        0.989

    > plot(kmfit)
    > title("Survival curve for the AIDS incubation time (months)")
    > abline(h = c(0.9, 0.8), lty = 2)
    > abline(v = c(45, 76), lty = 2)

[Figure: Survival curve for the AIDS incubation time (months)]
56. 1: Some mathematical functions that can be applied on vectors.

The recycling rule

It is not necessary to have vectors of the same length in an expression. If two vectors in an expression are not of the same length, then the shorter one will be repeated until it has the same length as the longer one. A simple example is a vector and a number (which is a vector of length one).

    > sqrt(x) + 2
    [1] 5.162278 4.236068 3.732051 4.449490

In the above example the 2 is repeated 4 times until it has the same length as x, and then the addition of the two vectors is carried out. In the next example x has to be repeated 1.5 times in order to have the same length as y. This means the first two elements of x are appended to x, and then x*y is calculated.

    > x <- c(1, 2, 3, 4)
    > y <- c(1, 2, 3, 4, 5, 6)
    > z <- x*y
    Warning message:
    longer object length is not a multiple of shorter object length in: x * y
    > z
    [1]  1  4  9 16  5 12

Generating vectors

Regular sequences of numbers can be very handy for all sorts of reasons. Such sequences can be generated in different ways. The easiest way is to use the colon operator (:).

    > index <- 1:20
    > index
     [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20

A descending sequence is obtained by 20:1. The function seq, together with its arguments from, to, by or length, is used to generate more general sequences. Specify the beginning and end of the sequence and either specify t
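The recycling rule and the sequence generators described above can be checked with a short sketch. The variable names s1, s2 and down are illustrative only; suppressWarnings is used so that the recycling warning does not interrupt a script.

```r
# A number is a vector of length one, so the 2 is recycled four times
x <- c(10, 5, 3, 6)
shifted <- sqrt(x) + 2

# x (length 4) is recycled to 1 2 3 4 1 2 to match y (length 6);
# R normally warns here because 6 is not a multiple of 4
x <- c(1, 2, 3, 4)
y <- c(1, 2, 3, 4, 5, 6)
z <- suppressWarnings(x * y)

# Regular sequences: the colon operator and seq()
index <- 1:20
down  <- 20:1
s1 <- seq(from = 0, to = 1, by = 0.25)        # step size given
s2 <- seq(from = 0, to = 1, length.out = 5)   # number of points given
```

Here s1 and s2 describe the same sequence, once by step size and once by length.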
57. 1.4 Logical

An object of data type logical can have the value TRUE or FALSE and is used to indicate if a condition is true or false. Such objects are usually the result of logical expressions.

    > x <- 9
    > y <- x > 10
    > y
    [1] FALSE

The result of the function is.double is an object of type logical (TRUE or FALSE).

    > is.double(9.876)
    [1] TRUE

Logical expressions are often built from logical operators:

    <    smaller than
    <=   smaller than or equal to
    >    larger than
    >=   larger than or equal to
    ==   is equal to
    !=   is unequal to

The logical operators and, or and not are given by &, | and !, respectively.

    > x <- c(9, 166)
    > y <- (3 < x) & (x <= 10)
    > y
    [1]  TRUE FALSE

Calculations can also be carried out on logical objects, in which case FALSE is replaced by a zero and a one replaces TRUE. For example, the sum function can be used to count the number of TRUEs in a vector or array.

    > x <- 1:15
    > ## number of elements in x larger than 9
    > sum(x > 9)
    [1] 6

2.1.5 Character

A character object is represented by a collection of characters between double quotes ("). For example: "x", "test character" and "iuiu8ygy iuhu". One way to create character objects is as follows.

    > x <- c("a", "b", "c")
    > x
    [1] "a" "b" "c"
    > mychar1 <- "This is a test"
    > mychar2 <- "This is another test"
    > charvector <- c("a", "b", "c", "test")

The double quotes indicate that we ar
58. 1 b2 x1 x2 b1x x1 b1 x2

    y ~ mymodel(b1, b2, x1, x2)

The nls function tries to estimate parameters for a nonlinear model that minimize the sum of squared residuals. So the following statement

    nls(y ~ mymodel(b1, b2, x1, x2))

minimizes $\sum_i \left( y[i] - \mathrm{mymodel}(b1, b2, x1[i], x2[i]) \right)^2$ with respect to b1 and b2. In nonlinear models the left hand side of the model formula may be empty, in which case R will minimize the sum of the squared right hand side terms. The above specification, for example, is equivalent to

    nls(~ y - mymodel(b1, b2, x1, x2))

To demonstrate the nls function we first generate some data from a known nonlinear model and add some noise to it.

    x <- runif(100, 0, 30)
    y <- 3*x/(8 + x)
    y <- y + rnorm(100, 0, 0.15)
    our.exp <- data.frame(x = x, y = y)

The next plot is a scatterplot of our simulated data.

[Figure 8.10: Scatter plot of our simulated data for nls; horizontal axis: our.exp$x]

The model that we used to simulate our example data is the so-called Michaelis-Menten model, which is given by the following form:

$y = \frac{\beta_1 x}{\beta_2 + x} + \varepsilon$

where $\varepsilon$ is normally distributed, $\beta_1$ has value 3 and $\beta_2$ has value 8. To fit the model and display the fit re
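A fit of the Michaelis-Menten model to simulated data of this kind could look like the sketch below. It is an illustration under assumptions: the seed, the starting values b1 = 2 and b2 = 5, and the object names fit and est are choices made here, not taken from the text.

```r
set.seed(1)   # fixed seed so the simulated data are reproducible (our choice)

# Simulate data from y = 3*x/(8 + x) plus normal noise, as in the text
x <- runif(100, 0, 30)
y <- 3 * x / (8 + x) + rnorm(100, 0, 0.15)
our.exp <- data.frame(x = x, y = y)

# Fit the model; nls needs starting values for the parameters.
# The starting values below are rough guesses, not from the original.
fit <- nls(y ~ b1 * x / (b2 + x), data = our.exp,
           start = list(b1 = 2, b2 = 5))

est <- coef(fit)   # estimates of b1 and b2, expected to be near 3 and 8
```

Calling summary(fit) prints the estimates with their standard errors; the estimates should lie close to the values 3 and 8 used in the simulation.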
59. 12 0.27
    2   Gose 1970 0.53 0.74
    3   Rolf 1971 0.53 0.28
    4 Heleen 1974 0.81 0.29

    > test2 <- read.delim("test2.txt", sep = " ")
        name year    A   FA
    1   Dick 1963 0.42 0.12
    2   Gose 1970 0.26 0.57
    3   Rolf 1971 0.87 0.37
    4 Heleen 1974 0.86 0.15

    > test.merge <- merge(test1, test2)
    > test.merge
        name year   BA   HR    A   FA
    1   Dick 1963 0.12 0.27 0.42 0.12
    2   Gose 1970 0.53 0.74 0.26 0.57
    3 Heleen 1974 0.81 0.29 0.86 0.15
    4   Rolf 1971 0.53 0.28 0.87 0.37

By default the merge function leaves out rows that were not matched. Consider the following data sets:

    > quotes = data.frame(date = 1:100, quote = runif(100))
    > testfr = data.frame(date = c(5, 7, 9, 110), position = c(45, 89, 14, 90))

To extend the data frame testfr with the right quote data from the data frame quotes, and to keep the last row of testfr, for which there is no quote, use the following code:

    > testfr = merge(quotes, testfr, all.y = TRUE)
    > testfr
      date     quote position
    1    5 0.6488612       45
    2    7 0.4995684       89
    3    9 0.5242953       14
    4  110        NA       90

For more complex examples see the help file of the function merge: ?merge.

4.3.5 Aggregating data frames

The function aggregate is used to aggregate data frames. It splits the data frame into groups and applies a function on each group. The first argument is the data frame, the second argument is a list of grouping variables, the third argument is a function that returns a scalar. A small example gr
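The difference between the default merge and a merge with all.y = TRUE can be seen in the following sketch. It mirrors the quotes and testfr example above, but uses deterministic quote values (an assumption made here so the result is reproducible); the names inner and keep_y are illustrative.

```r
# Deterministic stand-in for the random quotes used in the text
quotes <- data.frame(date = 1:100, quote = (1:100) / 100)
testfr <- data.frame(date = c(5, 7, 9, 110), position = c(45, 89, 14, 90))

# Default: only rows whose date occurs in both data frames are kept
inner <- merge(quotes, testfr)

# all.y = TRUE: every row of the second data frame is kept;
# columns that have no match are filled with NA
keep_y <- merge(quotes, testfr, all.y = TRUE)
```

The default result has three rows, while keep_y has four, with NA in the quote column for date 110.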
60.
    ## print extra digits
    options(digits = 10)
    ## papersize setting
    options(papersize = "a4")
    ## to prefer HTML help
    options(htmlhelp = TRUE)
    ## adding libraries that should be attached
    library(MASS)
    library(lattice)

9.5 Creating an R package

9.5.1 A 'private' package

Once you start working with R you will soon start creating your own functions. If these functions are used regularly (more than one or two times), it may be useful to collect them in a 'private' package, i.e. the functions are only used by you and some colleagues, and you want to have them in a separate library so that they don't pollute your workspace. In this case it may be enough to define the functions and save them to an R workspace image, a .RData file. This file is a binary file and can be attached to the R search path.

    myf1 <- function(x) { x^2 }
    myf2 <- function(x) { sin(x^2) + x }
    save(myf1, myf2, file = "C:\\MyRstuff\\AuxFunc.RData")

When a colleague needs these functions, give him the binary file and let him attach it to his R session.

    attach("C:\\MyRstuff\\AuxFunc.RData")

9.5.2 A 'real' R package

When you intend to make your package available to more people, or even to the R community by putting it on CRAN, you may want to create a 'real' R package, i.e. add documentat
61. 30
    [1] 21 22 23 24 25 26 27 28 29 30

4.3 Manipulating Data frames

4.3.1 Extracting data from data frames

A data frame can be considered as a generalized matrix; consequently all subscripting methods that work on matrices also work on data frames. However, data frames offer a few extra possibilities. Let's import the data in the file cars.csv, a comma separated text file, so that different aspects of data frame manipulation can be demonstrated.

    > cars <- read.csv("cars.csv", row.names = 1)

The argument row.names is specified in the read.csv function because the first column of the data file contains row names that we will use in our data frame. To see the column names of the cars data frame use the function names:

    > names(cars)
    [1] "Price"       "Country"     "Reliability" "Mileage"
    [5] "Type"        "Weight"      "Disp."       "HP"

To select a specific column from a data frame use the $ symbol, or double square brackets and quotes:

    > prices <- cars$Price
    > prices <- cars[["Price"]]

The object prices is a vector. If you want the result to be a data frame, use single square brackets:

    > prices2 <- cars["Price"]

When using single brackets it is possible to select more than one column from a data frame. The result is again a data frame:

    > test <- cars[c("Price", "Type")]

To select a specific row by name from the data frame cars, use the following R code:

    > cars["Nissan Van 4", ]
    Pric
62.
8.1.1 Statistical summaries and tests
8.1.2 Probability distributions and random numbers
8.2 Regression models
8.2.1 Formula objects
8.3 Linear regression models
8.3.1 Formula objects
8.3.2 Modeling functions
8.3.3 Multicollinearity
8.3.4 Factor (categorical) variables as regression variables
8.4 Logistic regression
8.4.1 The modeling function glm
8.4.2 Performance measures
8.4.3 Predictive ability of a logistic regression
8.5 Tree models
8.5.1 An example of a tree model
8.5.2 Coarse classification and binning
8.6 Survival analysis
8.6.1 The Cox proportional hazards model
8.6.2 Parametric models for survival analysis
8.7 Non linear regression
8.7.1 Ill-conditioned models
8.7.2 Singular value decomposition
9 Miscellaneous Stuff
9.1 Object Oriented Programming
9.1.1 ...
9.1.2 Old style classes
9.1.3 New Style classes
63.
    Coefficients:
                Estimate Std. Error z value Pr(>|z|)
    (Intercept)   0.9616     0.2617   3.675 0.000238 ***
    X1            1.6361     0.3065   5.338  9.4e-08 ***
    X2            3.3955     0.3317  10.236  < 2e-16 ***
    X3            4.0446     0.3513  11.515  < 2e-16 ***
    Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    (Dispersion parameter for binomial family taken to be 1)
        Null deviance: 1158.48 on 999 degrees of freedom
    Residual deviance:  863.47 on 996 degrees of freedom
    AIC: 871.47
    Number of Fisher Scoring iterations: 5

The object test.glm is a glm object. As with lm objects in the previous section, the glm object contains more information. Enter print.default(test.glm) to see the entire object. The functions listed in table 8.3 can also be used on glm objects.

8.4.2 Performance measures

To assess the quality of a logistic regression model, several performance measures can be calculated. In the ROCR package there are functions to calculate the Receiver Operating Characteristic (ROC) curve, the area under the ROC curve and lift charts from any glm model that is fitted. A logistic regression model can be used to predict 'good' or 'bad'. Since the model only calculates probabilities of good, a threshold t in (0, 1) is chosen: if the probability is above t then good is predicted, otherwise bad is predicted. Dependent on this threshold t we will have the following numbers TP, FP, FN and TN, as displayed in the follo
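How the four numbers TP, FP, FN and TN follow from a threshold can be sketched in base R. The probabilities and outcomes below are made-up values, and treating 'good' as the positive class is an assumption of this sketch; in practice the probabilities would come from predict() on a fitted glm object.

```r
# Hypothetical predicted probabilities of "good" and the observed outcomes
prob_good <- c(0.9, 0.8, 0.4, 0.3, 0.7, 0.2)
actual    <- c("good", "good", "good", "bad", "bad", "bad")

threshold <- 0.5
predicted <- ifelse(prob_good > threshold, "good", "bad")

# Counting with logical vectors: TRUE counts as 1, FALSE as 0
TP <- sum(predicted == "good" & actual == "good")  # true positives
FP <- sum(predicted == "good" & actual == "bad")   # false positives
FN <- sum(predicted == "bad"  & actual == "good")  # false negatives
TN <- sum(predicted == "bad"  & actual == "bad")   # true negatives

# The same information as a 2 x 2 table
confusion <- table(predicted, actual)
```

Changing threshold and recomputing the four counts traces out the points of the ROC curve.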
64. 6 20 31 28 29 31 23 30 20 34 29 32 34 30 29 26 26 31 26 18 26 34 29 20 27 28 20 35 23 31 25 29 20

6.2.2 The lapply and sapply functions

These functions are suitable for performing calculations on the components of a list, specifically calculations on the columns of a data frame. If, for instance, you want to find out which columns of the data frame cars are of type numeric, then proceed as follows:

    > lapply(cars, is.numeric)
    $Price
    [1] TRUE
    $Country
    [1] FALSE
    $Reliability
    [1] FALSE
    $Mileage
    [1] TRUE
    ...

The function sapply can be used as well:

    > sapply(car.test.frame, is.numeric)
    Price Country Reliability Mileage Type Weight Disp. HP
        T       F           F       T    F      T     T  T

The function sapply can be considered as the 'simplified' version of lapply. The function lapply returns a list and sapply a vector (if possible). In both cases the first argument is a list (or data frame), the second argument is the name of a function. Extra arguments that normally are passed to the function should be given as arguments of lapply or sapply.

    mysummary <- function(x) {
        if (is.numeric(x))
            return(mean(x))
        else
            return(NA)
    }

    > sapply(car.test.frame, mysummary)
       Price Country Reliability  Mileage Type   Weight  Disp.     HP
    12615.67      NA          NA 24.58333   NA 2900.833 152.05 122.35

Some attention should be paid to the situation where the output of the function to be called in sapply is not constant. For ins
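The behaviour of lapply and sapply described above can be reproduced without the cars data. The small data frame cars_small below is a made-up stand-in (its values are not from the original data set), and trimmed illustrates how an extra argument is passed on to the applied function.

```r
# Hypothetical three-row stand-in for the cars data frame
cars_small <- data.frame(
  Price   = c(8895, 7402, 6319),
  Country = c("USA", "USA", "Korea"),
  Mileage = c(33, 33, 37)
)

# lapply returns a list with one element per column
is_num_list <- lapply(cars_small, is.numeric)

# sapply simplifies the same result to a named logical vector
is_num_vec <- sapply(cars_small, is.numeric)

# A function whose result depends on the column type
mysummary <- function(x) {
  if (is.numeric(x)) mean(x) else NA
}
col_means <- sapply(cars_small, mysummary)

# Extra arguments (here trim) are handed on to the function being applied
trimmed <- sapply(cars_small[c("Price", "Mileage")], mean, trim = 0.1)
```

The names of the result vectors are the column names of the data frame.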
65. 825 0.2999556 1.131336 0.2536510 0.3878151 0.8964895 0.2022080 1.379076 1.7892237 0.9087716

The plot method displays the matrix as an image plot.

    plot.bigMatrix <- function(x) {
        image(1:ncol(x), 1:nrow(x), x)
        title(paste("plot of matrix", deparse(substitute(x)), sep = " "))
    }
    plot(m2)

9.1.3 New Style classes

There is no formal description of an object of an S3 class. It could be a matrix where the user accidentally assigned the lm class to it. The new style classes in R allow the user to define a new class more formally than old style classes. The new style classes differ from the old style classes:

• All objects of a new style class must have the same structure. This is not true for many old style classes.
• The new style classes have a more formal and tighter specification.
• The new class mechanism has greater uniformity than the old, and the new engine has many more tools to make programming easier.
• Methods can be designed for data types (for example vectors) that were not classes in the old engine.
• Class inheritance is more rigorous than in the old style classes.

[Figure 9.1: Result of the specific plot method for class bigMatrix; title: plot of matrix m2, axes: 1:ncol(x) and 1:nrow(x)]

Creating a new style class definition

New style classes are made up of slots. These are similar to but distinct from components o
66. [Figure 7.17: A coplot with two conditioning variables]

    library(ggplot2)
    x <- runif(1000)
    y <- 2*x^2 + rnorm(1000, 0, 0.3)
    z <- 3*log(x) + rnorm(1000, 0, 0.3)
    testdata <- data.frame(x, y, z)
    ## a simple scatter plot
    qplot(x, y, data = testdata)
    ## transformations
    qplot(x, log(y), data = testdata)
    ## changing the size of the symbols
    qplot(x, y, data = testdata, size = z)

The function qplot can also create other types of plots. This can be done by using the argument geom, which stands for geometric object. Such an object not only describes the type of plot, but also a corresponding statistical calculation, for example a smoothing line calculated according to some smoothing algorithm. The default value for geom is "point", a standard scatter plot. To draw a line graph between the points use:

    qplot(x, y, data = testdata, geom = c("line"))

[Figure 7.18: A coplot with a smoothing line; conditioning variable: Weight, horizontal axis: Mileage]

The geom argument can be a vector of names; this will result in one plot with multiple graphs on top of each other. The following code first plots a
67. 855 Max. :305.0 Max. :225.0

8.1.2 Probability distributions and random numbers

Most of the probability distributions are implemented in R, and each of the distributions has four flavors: the cumulative probability distribution function, the probability density function, the quantile function and the random sample generator. The names of these functions consist of the code for the distribution preceded by a letter indicating the desired flavor.

• p: cumulative probability distribution function
• d: probability density function
• q: quantile function
• r: random sample

For example, the corresponding commands for the normal distribution are:

    pnorm(x, m, s)
    dnorm(x, m, s)
    qnorm(p, m, s)
    rnorm(n, m, s)

In these expressions m and s are optional arguments representing the mean and standard deviation (not the variance!); p is the probability and n the number of random draws to be generated. The next table gives an overview of the available distributions in R, with the corresponding parameters. Don't forget to precede the code with p, d, q or r (for example pbeta or qgamma). The column Defaults specifies the default values of the parameters. If there are no default values, you must specify them in the function call. For example, rnorm(100) will run, but rbeta(100) will not. The following code generates 1000 random standard normal numbers with 5% contamination, using the
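The four flavors can be tried out on the normal distribution as follows. This is a small sketch; the seed and the mean and standard deviation passed to rnorm are arbitrary choices made here, and the variable names are illustrative.

```r
# p: cumulative probability P(X <= 1.96) for a standard normal
p <- pnorm(1.96)

# d: density of the standard normal at 0, equal to 1 / sqrt(2 * pi)
d <- dnorm(0)

# q: quantile function, the inverse of pnorm
q <- qnorm(0.975)

# r: random draws; the mean and the standard deviation (not the variance)
set.seed(42)
r <- rnorm(1000, mean = 5, sd = 2)

# rbeta has no default shape parameters, so this call is an error
beta_fails <- inherits(try(rbeta(100), silent = TRUE), "try-error")
```

Because qnorm inverts pnorm, qnorm(pnorm(x)) returns x again up to numerical precision.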
68. 9063837
    Richard 0.95472560 0.46806163 1.0057703
    David   0.48152529 2.03857066 0.7163017
    Rob     0.71593428 2.18493234 2.7043682
    Andrea  1.20729385 0.50772018 1.1240321
    John    0.07523445 0.32454334 1.3432442

2.2.5 Time series objects

In R a time series object (an object of class ts) is created with the function ts. It combines two components:

• The data: a vector or matrix of numeric values. In case of a matrix, each column is a separate time series.
• The dates of the data: the dates are equispaced points in time.

    ## starting from jan 87, 100 monthly intervals
    myts1 <- ts(data = rnorm(100), start = c(1987), freq = 12)
    ## two time series starting from apr 1987, 50 monthly intervals
    myts2 <- ts(data = matrix(rnorm(100), ncol = 2), start = c(1987, 4), freq = 12)

    > myts2
             Series 1   Series 2
    Apr 1987 1.66394678 1.3009008
    May 1987 0.48923748 0.8199132
    Jun 1987 0.21643666 0.1581245
    Jul 1987 2.21148119 0.4926389
    Aug 1987 0.26117051 1.1255435
    ...

The function tsp returns the start and end time, and also the frequency, without printing the complete data of the time series.

    > tsp(myts2)
    [1] 1987.250 1991.333   12.000

2.2.6 Lists

A list is like a vector. However, an element of a list can be an object of any type and structure. Consequently, a list can contain another list, and therefore it can be used to construct arbitrary data structures. Lists are often used for output of statistical routi
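The time attributes of the second series above can be verified with a short sketch. The seed is our own addition, and the full argument name frequency is used instead of the abbreviation freq.

```r
set.seed(1)   # arbitrary seed; the random values do not matter here

# Two monthly series of 50 observations each, starting in April 1987
myts2 <- ts(matrix(rnorm(100), ncol = 2), start = c(1987, 4), frequency = 12)

# Start time, end time and frequency, in units of years
time_info <- tsp(myts2)

# The same information as (year, month) pairs
first <- start(myts2)
last  <- end(myts2)
```

April 1987 corresponds to 1987 + 3/12 = 1987.25, and 49 months later is May 1991, which is 1991.333 in decimal years.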
69. 9.2 R Language objects

9.2.1 Calls and Expressions

In R you can use the language to create new language constructions or adjust existing ones, for example performing symbolic manipulation in R (such as calculating derivatives symbolically) or writing functions that accept expressions as input. In R, language objects are either objects of type name, expression or call. To deal with R language objects you should prevent the system from beginning the usual evaluation of expressions and calls. Suppose we have the following statements:

    x <- seq(-3, 3, l = 100)
    y <- sin(x)

The evaluated version of sin(x) is stored in the object y, so y contains 100 numbers. That this result originated from sin(x) is not visible from y anymore. To keep an unevaluated version, the normal evaluation needs to be interrupted. This can be done by the function substitute.

    > y <- substitute(sin(x))
    > y
    sin(x)

The object y is a so-called call object. The object y can still be evaluated using the function eval.

    > eval(y)
    [1] -0.14112001 -0.20082374 -0.25979004 -0.31780241 -0.37464782
    [6] -0.48400786 -0.53612093 -0.58626538 -0.63425707 -0.67991980
    [11] -0.76359681 -0.80130384 -0.83606850 -0.86776314 -0.89627139
    ...

In order to print the object y, for instance in a graph, y must be deparsed first. This can be done using the function deparse, which converts y to a character object x
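The steps above (keeping a call unevaluated, evaluating it later and turning it into text) can be sketched as follows. The helper capture is a hypothetical name introduced here; it wraps substitute in a function, which is the usual way to capture an argument expression. D is the symbolic differentiation function from the stats package.

```r
# Hypothetical helper: returns its argument as an unevaluated call
capture <- function(expr) substitute(expr)

x <- seq(-3, 3, length.out = 100)
y <- capture(sin(x))       # y is the call sin(x), not 100 numbers

is_call <- is.call(y)      # TRUE: y is a language object
vals    <- eval(y)         # evaluating the call gives the numeric values
label   <- deparse(y)      # the call as a character string

# Symbolic derivative of the stored call with respect to x
dy <- D(y, "x")
dlabel <- deparse(dy)
```

The string in label can be used, for example, as the title or axis label of a plot.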
70. An introduction to R
Longhow Lam
Under Construction, Jan 2010: some sections are unfinished!
longhowlam at gmail dot com

Contents
1 Introduction
...
1.2 The R environment
1.3 Obtaining and installing R
1.4 Your first R session
1.5 The available help
1.5.1 The on-line help
1.5.2 The R mailing lists and the R Journal
1.6 The R workspace, managing objects
1.7 R Packages
1.8 Conflicting objects
1.9 Editors for R scripts
1.9.1 The editor in RGui
1.9.2 Other editors
2 Data Objects
2.1 Data types
...
2.1.5 Character
2.1.6 Factor
2.1.7 Dates and Times
2.1.8 Missing data and Infinite values
2.2 Data structures
2.2.1 Vectors
2.2.2 Matrices
...
71.
    > x <- matrix(1:36, ncol = 6)
    > y <- x > 19
    > y
          [,1]  [,2]  [,3]  [,4] [,5] [,6]
    [1,] FALSE FALSE FALSE FALSE TRUE TRUE
    [2,] FALSE FALSE FALSE  TRUE TRUE TRUE
    [3,] FALSE FALSE FALSE  TRUE TRUE TRUE
    [4,] FALSE FALSE FALSE  TRUE TRUE TRUE
    [5,] FALSE FALSE FALSE  TRUE TRUE TRUE
    [6,] FALSE FALSE FALSE  TRUE TRUE TRUE
    > x[y]
     [1] 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36

Note that the result of subscripting with a logical matrix is a vector. This mechanism can be used to replace elements of a matrix. For example:

    > x <- matrix(rnorm(100), ncol = 10)
    > x[x > 0] <- 0

3. A matrix r with two columns. A row of r consists of two numbers; each row of r selects a matrix element of x. The result is a vector with the selected elements from x.

    > x <- matrix(1:36, ncol = 6)
    > x
         [,1] [,2] [,3] [,4] [,5] [,6]
    [1,]    1    7   13   19   25   31
    [2,]    2    8   14   20   26   32
    [3,]    3    9   15   21   27   33
    [4,]    4   10   16   22   28   34
    [5,]    5   11   17   23   29   35
    [6,]    6   12   18   24   30   36
    > r <- cbind(c(1, 2, 5), c(3, 4, 4))
    > r
         [,1] [,2]
    [1,]    1    3
    [2,]    2    4
    [3,]    5    4
    > x[r]
    [1] 13 20 23

4. A single number or one vector of numbers. In this case the matrix is treated like a vector where all the columns are stacked.

    > x <- matrix(1:36, ncol = 6)
    > x[3]; x[9]; x[36]
    [1] 3
    [1] 9
    [1] 36
    21
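The three kinds of matrix subscripts discussed above can be compared side by side. This is a short sketch on the same 6 by 6 matrix; the result names are illustrative.

```r
x <- matrix(1:36, ncol = 6)   # filled column by column

# Logical matrix as subscript: the result is a plain vector
big <- x[x > 19]

# Two-column matrix as subscript: each row is a (row, column) pair
r <- cbind(c(1, 2, 5), c(3, 4, 4))
picked <- x[r]                # elements x[1,3], x[2,4] and x[5,4]

# Single number: the matrix is treated as one long stacked vector
ninth <- x[9]

# A logical subscript can also be used for replacement
x2 <- x
x2[x2 > 19] <- 0
```

Since the matrix is stored column by column, x[9] is the third element of the second column.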
72. CT macro. Enough memory is allocated with allocVector(VECSXP, n); in our example we will return a vector of the same length as the input vector. The R_fcall object is created with the function lang2, which creates an executable pair list, and together with a call to SETCADR we can then call the function eval, which will evaluate our R function fn. A call to PROTECT must always be accompanied by a call to UNPROTECT; in this example we had two calls to PROTECT, so we call UNPROTECT(2) before we exit the C code.

A numerical integration example

In R the function integrate can calculate the integral $\int_a^b f(x)\,dx$ for a one-dimensional function f, using a numerical integration algorithm.

    > integrate(sin, 0, pi)
    2 with absolute error < 2.2e-14

As an illustration we create our own version using existing C code. Our version will also take a function name and the values a and b as input arguments. The following steps are done.

Adding the interface to R function

The C code from Numerical Recipes implements the Romberg adaptive method; it consists of four functions:

• The function qromb implements the Romberg method.
• The functions polint and trapzd are auxiliary functions used by qromb.
• The function func is the function that is going to be integrated.

In addition to these four C functions we add a C function Integrate that will act as the interface to R. The dll that w
73. ELECT * from Table1 where Col4 = 'A'

    > sqlQuery(conn, myq)
      ID Col1 Col2       Col3 Col4
    1  1 John  123 1973-09-12    A

You can have multiple connections to multiple databases; that can be useful if you need to collect and merge data from several sources. The function odbcDataSources lists all the available data sources. Don't forget to close a connection with odbcClose(conn).

3.4 The Foreign package

4 Data Manipulation

The programming language in R provides many different functions and mechanisms to manipulate and extract data. Let's look at some of those for the different data structures.

4.1 Vector subscripts

A part of a vector x can be selected by a general subscripting mechanism:

    x[subscript]

The simplest example is to select one particular element of a vector, for example the first one or the last one:

    > x <- c(6, 7, 2, 4)
    > x[1]
    [1] 6
    > x[length(x)]
    [1] 4

Moreover, the subscript can have one of the following forms.

A vector of positive natural numbers

The elements of the resulting vector are determined by the numbers in the subscript. To extract the first three numbers:

    > x
    [1] 10  5  3  6
    > x[1:3]
    [1] 10  5  3

To get a vector with the fourth, first and again fourth element of x:

    > x[c(4, 1, 4)]
    [1]  6 10  6

One or more elements of a vector can be changed by the subscripting mechanism. To change the third element of a vector proceed as follows:

    > x[3] <- 4

To c
74. EXP eval(SEXP expr, SEXP rho);

which is the equivalent of the interpreted R code eval(expr, envir = rho). See section 5.9 of the R manual 'Writing R Extensions'. The internal R pointer type SEXP is used to pass functions, expressions, environments and other language elements from R to C. It is defined in the file Rinternals.h.

A small example

We will give a small example first that does almost nothing, but shows some important concepts. The example takes an R function and evaluates this R function in C with input argument xinput. First the necessary C code:

    #include <R.h>
    #include <Rinternals.h>

    SEXP EvalRExpr(SEXP fn, SEXP xinput, SEXP rho)
    {
        SEXP ans, R_fcall;
        int n = length(xinput);
        PROTECT(R_fcall = lang2(fn, R_NilValue));
        PROTECT(ans = allocVector(VECSXP, n));
        SETCADR(R_fcall, xinput);
        ans = eval(R_fcall, rho);
        Rprintf("Length of xinput %d \n", n);
        UNPROTECT(2);
        return ans;
    }

When this is built into a dll that exports the function EvalRExpr, we can load the dll in R and use .Call to run the function:

    z <- c(121, 144, 225)
    myf <- function(x) { 2 * sqrt(x) }
    .Call("EvalRExpr", myf, as.double(z), new.env())
    Length of xinput 3
    [1] 22 24 30

First, in the C code, the R objects ans and R_fcall of type SEXP are defined. To protect the ans object from the R garbage collection mechanism it is created with the PROTE
75. Example graphs of multi-dimensional data sets

7.2.4 Graphical Devices

Before a graph can be made, a so-called graphical device has to be opened. In most cases this will be a window on screen, but it may also be an eps or pdf file. Type ?Devices for an overview of all available devices. The devices in R are:

• windows: The graphics driver for Windows (on screen, to printer and to Windows metafile).
• postscript: Writes PostScript graphics commands to a file.
• pdf: Writes PDF graphics commands to a file.
• pictex: Writes LaTeX/PicTeX graphics commands to a file.
• png: PNG bitmap device.
• jpeg: JPEG bitmap device.
• bmp: BMP bitmap device.
• xfig: Device for XFIG graphics file format.
• bitmap: bitmap pseudo-device via GhostScript (if available).

When a plot command is given without opening a graphical device first, a default device is opened. Use the command options("device") to see what the default device is; on Windows it is usually the windows device. We could, however, also open a device ourselves first. The advantages of this are:

• We can open the device without using the default values.
• When running several high-level plot commands without explicitly opening a device, only the last command will result in a visible graph, since high-level plot commands overwrite existing plots. This can be prevented by opening separate devices for separate plots.

    windows(width = 8)
    plot(rnorm(100))
76. [Figure 7.11: Graphs resulting from previous code examples of customizing axes.]

A call to a trellis display function differs from a call to a normal plot routine. It resembles a call to one of the statistical modeling functions, such as lm or glm. The call has the following form:

    TrellisFunction(formula, data = data.frame, other graphical parameters)

Depending on the specific trellis display function, the formula may not have a response variable. To create a scatterplot of the Price variable against the Weight variable, and a histogram of the Weight variable in the car.test.frame data frame, proceed as follows:

    cars <- read.csv("cars.csv", row.names = 1)
    library(lattice)
    xyplot(Price ~ Weight, data = cars)
    histogram(~ Weight, data = cars)

    Trellis function   Description
    barchart           Bar charts plot
    bwplot             Box and whisker plot
    densityplot        Kernel density plots, smoothed density estimate
    dotplot            Plot of labeled data
    histogram          Histogram plot
    qq                 Quantile-quantile plot
    xyplot             Scatterplot
    wireframe          3D surface plot
    levelplot          Contour plot
    stripplot          1-dimensional scatterplot
    cloud              3D scatterplot
    splom              Scatterplot matrices

Table 7.1: Trellis display functions

7.4.2 Multi panel graphs

To
77. 1.8 Conflicting objects

It is not recommended, but R allows the user to give an object a name that already exists. If you are not sure whether a name already exists, just enter the name in the R console and see if R can find it. R will look for the object in all the libraries (packages) that are currently attached to the R system. R will not warn you when you use an existing name.

    > mean = 10
    > mean
    [1] 10

The object mean already exists in the base package, but is now masked by your object mean. To get a list of all masked objects use the function conflicts.

    > conflicts()
    [1] "body<-" "mean"

You can safely remove the object mean with the function rm without risking deletion of the mean function. Calling rm removes only objects in your working environment by default.

1.9 Editors for R scripts

1.9.1 The editor in RGui

The console window in R is only useful when you want to enter one or two statements. It is not useful when you want to edit or write larger blocks of R code. In the RGui window you can open a new script: go to the File menu and select New Script. An empty R editor will appear where you can enter R code. This code can be saved; it will be a normal text file, normally with a .R extension. Existing text files with R code can be opened in the RGui window. To run code
78. 4.3.2 Adding columns to a data frame

The function cbind can be used to add additional columns to a data frame. For example, the vector maxvel with the maximum velocities of the cars can be added to the cars data frame as follows:

    new.cars <- cbind(cars, Max.Vel = maxvel)

The left hand side of the = specifies the column name in the new.cars data frame and the right hand side is the vector you want to add. Or alternatively, use the following syntax:

    cars$max.vel <- maxvel

The function cbind can also be used on two or more data frames. For example:

    cbind(dataframe1, dataframe2)

4.3.3 Combining data frames

Use the function rbind to combine (or stack) two or more data frames. Consider the following two data frames rand.df1 and rand.df2. (Minus signs in the printed output were lost in extraction.)

    rand.df1 <- data.frame(norm = rnorm(5), binom = rbinom(5, 10, 0.1), unif = runif(5))
    rand.df1
           norm binom      unif
    1 1.1477095     2 0.6230449
    2 0.6689266     0 0.9921276
    3 0.3738174     2 0.7115776
    4 2.2641381     2 0.9318150
    5 1.7682772     0 0.6455379

    rand.df2 <- data.frame(chisq = rchisq(5, 3), binom = rbinom(5, 10, 0.1), unif = runif(5))
    rand.df2
          chisq binom      unif
    1  1.955729     1 0.4543552
    2 12.661964     1 0.8731595
    3  7.433911     1 0.9460346
    4  3.642188     0 0.6632598
    5  6.134571     1 0.7688208

These two data frames have two columns in common: binom and unif. When w
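The two operations described above can be sketched as follows. This is a minimal illustration, not the original example: the small cars.small data frame and the maxvel values are made up, and using intersect to select the shared columns before calling rbind is our own assumption about one reasonable way to stack data frames whose columns only partly overlap.

```r
set.seed(1)

# Hypothetical stand-in for the cars data frame used in the text
cars.small <- data.frame(Price = c(8895, 7402, 6319),
                         Weight = c(2560, 2345, 1845))
maxvel <- c(180, 165, 150)   # made-up maximum velocities

# Add a column; the name on the left of '=' becomes the column name
new.cars <- cbind(cars.small, Max.Vel = maxvel)

# Two data frames that share the columns binom and unif
rand.df1 <- data.frame(norm = rnorm(5), binom = rbinom(5, 10, 0.1), unif = runif(5))
rand.df2 <- data.frame(chisq = rchisq(5, 3), binom = rbinom(5, 10, 0.1), unif = runif(5))

# rbind needs matching column names, so keep only the common columns
common <- intersect(names(rand.df1), names(rand.df2))
rand.comb <- rbind(rand.df1[, common], rand.df2[, common])
```

The result rand.comb has ten rows and only the two shared columns.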
79. [Figure 8.1: A histogram and a qq-plot of the model residuals to check normality of the residuals. Panels: "Histogram of cars$residuals" and "Normal Q-Q Plot".]

    add1(cars.lm, Weight ~ Mileage + Disp)
    Single term additions
    Model:
    Weight ~ Mileage
           Df Sum of Sq     RSS AIC
    <none>              4078578 672
    Disp    1   1297541 2781037 651

drop1: This function is used to see what the result is, in terms of sums of squares and residual sums of squares, of dropping a term (variable) from the model.

    drop1(cars.lm, ~ Mileage)
    Single term deletions
    Model:
    Weight ~ Mileage
            Df Sum of Sq      RSS AIC
    <none>                4078578 672
    Mileage  1  10428530 14507108 746

[Figure: diagnostic plots of the model, panels "Residuals vs Fitted" and "Cook's distance".]
80. S

    as.list(my.expr)
    [[1]]
    3 * sin(rnorm(10))

This is a list with one component. Let us zoom in on this component and print it as a list:

    as.list(my.expr[[1]])
    [[1]]
    `*`
    [[2]]
    [1] 3
    [[3]]
    sin(rnorm(10))

The first element of this list is an object of class name. Its second element is of class numeric and its third element is of class call. If we zoom in on the third element of the above list we get:

    as.list(my.expr[[1]][[3]])
    [[1]]
    sin
    [[2]]
    rnorm(10)

Here the first component is of class name and the second component is an object of class call. Working with expressions in this way can be of use when one has a function testf which calls another function that depends on calculations that occur inside testf, as in the following example:

    testf <- function() {
        n <- rbinom(1, 10, 0.5)
        expr <- expression(rnorm(10))
        expr[[1]][[2]] <- n
        x <- eval(expr)
        x
    }
    testf()

Indeed, this could have been achieved in a much simpler manner, such as in the code below. But it's the idea that counts here.

    testf <- function() {
        n <- rbinom(1, 10, 0.5)
        x <- rnorm(n)
        x
    }

9.2.3 Functions as lists

It may be surprising that we can also transform function objects to list objects. Let's look at a simple function:

    myf <- function(x, y) {
        temp1 = x + y
        temp2 = x * y
        tmp1 / temp2
    }

We use
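As a small, self-contained sketch of the ideas in this excerpt, the code below takes an expression apart, replaces one of its arguments before evaluating it, and converts a function to a list. The function myf2 and the value 4 are our own illustrative choices, not taken from the text.

```r
# An expression object holds unevaluated calls
my.expr <- expression(3 * sin(rnorm(10)))

# The single call inside it, split into function name and arguments
parts <- as.list(my.expr[[1]])
part.classes <- sapply(parts, class)   # "name" "numeric" "call"

# Replace the argument of rnorm(10) before evaluating the call
expr <- expression(rnorm(10))
expr[[1]][[2]] <- 4                    # the call is now rnorm(4)
x <- eval(expr[[1]])

# A function can be converted to a list as well:
# its formal arguments come first, its body is the last element
myf2 <- function(x, y) x + y           # hypothetical helper
f.parts <- as.list(myf2)
```

Here x has length 4, and f.parts has three elements: the two formal arguments and the body.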
81. Data frames
2.2.5 Time-series objects
The str function
3 Importing data
3.1 Text files
3.1.1 The scan function
3.4 The Foreign package
4 Data Manipulation
4.1 Vector subscripts
4.2 Matrix subscripts
4.3 Manipulating Data frames
4.3.1 Extracting data from data frames
4.3.2 Adding columns to a data frame
4.3.3 Combining data frames
4.3.4 Merging data frames
4.3.5 Aggregating data frames
4.3.6 Stacking columns of data frames
4.3.7 Reshaping data
4.5 Character manipulation
4.5.1 The functions nchar, substring and paste
4.5.2 Finding patterns in character objects
4.5.3 Replacing characters
4.5.4 Splittin
82.

    newAges = data.frame(Age = c(10, 30, 60))
    pred <- survfit(IDU.analysis1, newdata = newAges, se = T)
    pred
        n events median 0.95LCL 0.95UCL
    1 418     76    Inf     135     Inf
    2 418     76    135     118     Inf
    3 418     76     97      60     Inf
    plot(pred, lty = 1:3)

[Figure 8.9: Survival curves for three subjects, with age 10, 30 and 60. Horizontal axis: months.]

8.6.2 Parametric models for survival analysis

In a parametric modeling approach the distribution of the time T to an event is modeled with a parametric model, for example a log-normal or Weibull distribution. The general form resembles the ordinary linear regression model and is given by

$$f(T) = a_0 + a_1 X_1 + \cdots + a_p X_p + \sigma W$$

for some distribution W. A possible choice for f is the log function; this corresponds to the accelerated failure time model. The code below fits an accelerated failure time model for the IDU data with a Weibull distribution.

    ## some times are negative, for convenience we set them to 1
    IDUdata$IncubationTime[IDUdata$IncubationTime < 0] = 1
    IDU.param <- survreg(Surv(IncubationTime, AidsStatus) ~ Age, data = IDUdata, dist = "weibull")
    summary(IDU.param)
    Call:
    survreg(formula = Surv(IncubationTime, AidsStatus) ~ Age, data = IDUdata, dist = "weibull")
                 Value Std. Error     z        p
    (Intercept) 5.7839     0.3934 14.70 6.38e-49
    Age         0.0135
83. [Figure 8.6: Binning the age variable, two intervals in this case.]

8.6 Survival analysis

In R, survival analysis (also called churn analysis in marketing) of duration data (also called time-to-event data) can be done with either non-parametric approaches (Kaplan-Meier or Cox proportional hazards) or with parametric approaches (accelerated failure time models).

Typical for duration data is right censoring. In a study the analyst often cannot wait until all machines fail. At the time of study some machines may still work, and the only information that we have is that the machine has survived a certain time span. This is called right censoring.

The R package survival contains both non-parametric (the coxph function) and parametric modeling (the function survreg) functions. Another implementation of the accelerated failure time approach can be found in the package Design (the psm function).

The modeling functions for survival analysis require the usage of an extra packaging function in the left hand side of a formula object. This function is used to indicate whether a certain event was censored or not. For example, suppose we have a data frame with the columns time and status. The packaging function Surv connects time and status for right censored data, where status = 1 means an event and status = 0 means censored.

    Surv(time, status) ~ Age

The
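To make the role of the packaging function Surv concrete, here is a minimal sketch on simulated data. It assumes the survival package is installed (it normally ships with R); the data frame toy and all of its values are invented for illustration and have nothing to do with the IDU data.

```r
library(survival)

set.seed(42)
n <- 50
# Simulated duration data: status = 1 is an event, status = 0 is censored
toy <- data.frame(time   = rexp(n, rate = 0.1),
                  status = rbinom(n, 1, 0.7),
                  Age    = round(runif(n, 20, 60)))

# Surv packages time and censoring status into one response object
s <- Surv(toy$time, toy$status)

# The same packaging is used on the left hand side of a model formula
fit <- coxph(Surv(time, status) ~ Age, data = toy)
```

Printing s shows censored observations marked with a trailing plus sign, and printing fit gives the usual coxph coefficient table.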
84. a

    x <- rnorm(10)
    y <- rnorm(10)
    z <- rnorm(10)
    ## set plotting color to red
    par(col = "red")
    plot(x, y)
    ## draw extra blue points
    points(x, z, col = "blue")
    ## draw red points again
    points(y, z)

The Plot and figure regions, the margins

A graph consists of three regions: a plot region surrounded by a figure region, which is in turn surrounded by four outer margins (the top, left, bottom and right margins). See figure 7.7. Usually the high level plot functions create points and lines in the plot region.

[Figure 7.7: The different regions of a plot: the plot region, the figure region and outer margins 1 to 4.]

The outer margins can be set with the oma parameter; the four default values are set to zero. The margins surrounding the plot region can be set with the mar parameter. Experiment with the mar and oma parameters to see the effects.

    ## Default values
    par(c("mar", "oma"))
    $mar
    [1] 5.1 4.1 4.1 2.1
    $oma
    [1] 0 0 0 0
    ## set to different values
    par(oma = c(1, 1, 1, 1))
    par(mar = c(2.5, 2.1, 2.1, 1))
    plot(rnorm(100))

Multiple plots on one page

Use the parameter mfrow or mfcol to create multiple graphs on one layout. Both parameters are set as follows:

    par(mfrow = c(r, k))
    par(mfcol = c(r, k))

where r is the number of rows and k the number of columns. The graphical parameter mfrow fills the layout by row
85. a Analysis. Springer, 2009.
Leland Wilkinson. The Grammar of Graphics. Springer, 2005.
W. N. Venables and B. D. Ripley. Modern Applied Statistics with S. Springer, September 2003.
J. Maindonald and J. Braun. Data Analysis and Graphics Using R: An Example-based Approach. Cambridge University Press, 2007.
T. Hastie, R. Tibshirani, J. H. Friedman. The Elements of Statistical Learning. Springer, 2001.
M. Prins and P. Veugelers. The European seroconverter study and the tricontinental seroconverter study: comparison of progression and non-progression in injecting drug users with documented dates of HIV-1 seroconversion. AIDS, vol. 11, p. 621, 1997.
T. M. Therneau and P. M. Grambsch. Modeling Survival Data: Extending the Cox Model. Springer, 2000.
Douglas M. Bates, Donald G. Watts. Nonlinear Regression Analysis and Its Applications. Wiley-Interscience, 2007.
Friedrich Leisch. Sweave user manual. 2006.
86. a parameter for each level of the factor variable, as the model could be over-parameterized and the $X^T X$ matrix in formula (8.1) would be singular. One can avoid overparameterization by using a so-called contrast matrix to impose restrictions on the parameters. By default R uses the so-called treatment contrast matrix for unordered factor variables, and a polynomial contrast matrix is used for ordered factor variables. There are other contrasts.

The estimated parameter values in a treatment contrast are easy to interpret. One factor level is left out; the parameter values of the other levels then represent the difference between that level and the level that is left out.

Consider the following example, where we create an artificial data frame with one numeric response column and one factor column with four levels A, B, C and D as the regression variable. The corresponding y values of each factor level have a certain mean, as calculated in the code below.

    y1 <- rnorm(100, 5)
    y2 <- rnorm(100, 10)
    y3 <- rnorm(100, 30)
    y4 <- rnorm(100, 50)
    y <- c(y1, y2, y3, y4)
    x <- as.factor(c(rep("A", 100), rep("B", 100), rep("C", 100), rep("D", 100)))
    testdata <- data.frame(x, y)
    lm(y ~ x, data = testdata)
    Call:
    lm(formula = y ~ x, data = testdata)
    Coefficients:
    (Intercept)     xB      xC      xD
          4.930   Sett  25.107  45.087

Level A has been left out. You can see that paramete
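The interpretation of treatment contrasts can be checked numerically. The sketch below repeats the simulation with a fixed seed (our own choice) and compares the fitted coefficients with the group means; it assumes the default contrast settings of R have not been changed.

```r
set.seed(123)

# Four groups with means 5, 10, 30 and 50
y <- c(rnorm(100, 5), rnorm(100, 10), rnorm(100, 30), rnorm(100, 50))
x <- factor(rep(c("A", "B", "C", "D"), each = 100))
testdata <- data.frame(x, y)

fit <- lm(y ~ x, data = testdata)

# Sample mean of y within each factor level
group.means <- tapply(testdata$y, testdata$x, mean)

# Under treatment contrasts: intercept = mean of the left-out level A,
# every other coefficient = mean of that level minus the mean of A
from.means <- c(group.means["A"], group.means[-1] - group.means["A"])
comparison <- cbind(coef = coef(fit), from.means = from.means)
```

The two columns of comparison agree up to numerical precision.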
87. a web page or document.

• Sweave is a tool that can process a document with chunks of R code, see [15]. It parses the document, evaluates the chunks of R code and puts the resulting output (text and graphs) back in the document in such a way that the resulting document is in its native format. The formats that are implemented are LaTeX, HTML and ODF (Open Document Format).

9.7.1 A simple LaTeX table

In a monthly report that is created in LaTeX, the output of a linear regression in R is needed.

    ## load the xtable package
    library(xtable)
    ## specify the file that will contain the regression output in LaTeX format
    mydir = "C:/Documents and Settings/Longhow/Mijn Documenten/R/RCourse/"
    outfile <- paste(mydir, "carslm.tex", sep = "")
    ## Fit a linear regression
    lm.out <- lm(Price ~ Mileage + Weight + HP, data = cars)
    ## transform the regression output object to an xtable object
    ## add a label so that the table can be referenced in LaTeX
    lm.out.latex <- xtable(lm.out, caption = "Regression output", label = "tab001", type = "latex")
    ## sink the xtable object to the latex file
    sink(outfile)
    print(lm.out.latex)
    ## redirect output to normal screen
    sink()

Once the latex file has been created, it can be imported in the LaTeX report with the \input command in LaTeX. See Table 9.1.

    Estimate   Std. Error   t value   Pr(>|t|)
88. ackages rpart, party and randomForest. The latter two have methods to build multiple trees that improve predictive accuracy compared to a single tree. We consider the modeling function rpart in the rpart package, which constructs a single tree. This package also contains the example data frame car.test.frame; we use it to construct a tree that predicts the type of a car given the mileage and price of that car.

    library(rpart)
    fit <- rpart(Type ~ Mileage + Price, data = car.test.frame)
    ## basic overview of the rules
    fit
    n= 60
    node), split, n, loss, yval, (yprob)
          * denotes terminal node
    1) root 60 45 Compact (0.25 0.05 0.22 0.22 0.15 0.12)
      2) Price>=9152.5 49 34 Compact (0.31 0.061 0.27 0.041 0.18 0.14)
        4) Mileage>=20.5 37 22 Compact (0.41 0.027 0.32 0.054 0.19 0)
          8) Mileage< 23.5 19 7 Medium (0.32 0.053 0.63 0 0 0) *
          9) Mileage>=23.5 18 9 Compact (0.5 0 0 0.11 0.39 0) *
        5) Mileage< 20.5 12 5 Van (0 0.17 0.083 0 0.17 0.58) *
      3) Price< 9152.5 11 0 Small (0 0 0 1 0 0) *

    ## detailed listing of an rpart object (displaying only a part)
    summary(fit)
    Call:
    rpart(formula = Type ~ Mileage + Price, data = car.test.frame)
      n= 60
             CP nsplit rel error    xerror       xstd
    1 0.2444444      0 1.0000000 1.1555556 0.05851383
    2 0.1555556      1 0.7555556 0.9555556 0.07756585
    3 0.1333333      2 0.6000000 0.8444444 0.08294973
    4 0.0100000      3 0.4666667 0.6000000 0.08563488
    Node number 1: 60 obse
89. an cause the estimation procedure to fail, or estimated model parameters may have very large confidence intervals. Consider the following model, the so-called Hill equation:

$$f(x) = \frac{V_m x^{\alpha}}{k^{\alpha} + x^{\alpha}}$$

Given the data points in Figure 8.12 we see that two sets of parameters fit the data equally well. The solid and dashed lines correspond to $\alpha$ = 0.8 and 3.1, $V_m$ = 1.108 and 1, $k$ = 0.3 and 1, respectively. Either more data at lower x values are needed, or a different model must be used. The following R code simulates some data from the model and fits the model with the simulated data.

[Figure 8.11: Simulated data and nls predictions. Axes: our.exp$x and our.exp$y.]

    ## Create the model function
    HillModel <- function(x, alpha, Vm, k) {
        Vm * x^alpha / (k^alpha + x^alpha)
    }
    ## Simulate data and put it in a data frame
    k1 = 0.3; Vm1 = 1.108; alpha1 = 0.8
    x <- runif(45, 1.6, 5)
    datap <- HillModel(x, alpha1, Vm1, k1) + rnorm(45, 0, 0.09)
    simdata <- data.frame(x, datap)
    ## Fit the model
    out <- nls(datap ~ HillModel(x, alpha, Vm, k), data = simdata,
               start = list(k = 0.3, Vm = 1.108, alpha = 0.8))
    ## Print output
    summary(out)
    Formula: datap ~ HillModel(x, alpha, Vm, k)
    Parameters:
          Estimate Std. Error t value Pr(>|t|)
    k       0.1229     1.4140   0.087  0.93116
    Vm      1.0108     0.3184   3.174  0.00281
    alpha   0.9399     5.5311   0.170  0.86588
90. and a hyphen without a space).
• The equal character =.

These objects can then be used in other calculations. To print the object just enter the name of the object. There are some restrictions when giving an object a name:

• Object names cannot contain strange symbols (for example !, + or #).
• A dot (.) and an underscore (_) are allowed, also a name starting with a dot.
• Object names can contain a number but cannot start with a number.
• R is case sensitive: X and x are two different objects, as well as temp and temP.

    > x = sin(9)/75
    > y = log(x) + x^2
    > x
    [1] 0.005494913
    > y
    [1] -5.203902
    > m <- matrix(c(1, 2, 4, 1), ncol = 2)
    > m
         [,1] [,2]
    [1,]    1    4
    [2,]    2    1
    > solve(m)
               [,1]       [,2]
    [1,] -0.1428571  0.5714286
    [2,]  0.2857143 -0.1428571

To list the objects that you have in your current R session use the function ls or the function objects.

    > ls()
    [1] "m" "x" "y"

So to run the function ls we need to enter the name followed by an opening ( and a closing ). Entering only ls will just print the object; you will see the underlying R code of the function ls. Most functions in R accept certain arguments. For example, one of the arguments of the function ls is pattern. To list all objects starting with the letter x:

    > x2 = 9
    > y2 = 10
    > ls(pattern = "x")
    [1] "x"  "x2"

If you assign a value to an object that already exists, then the contents of the ob
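The following sketch repeats the session above in a separate environment so that it does not depend on what is already in your workspace. The use of new.env and evalq is our own device for the illustration. Note that pattern is a regular expression: "x" matches any name containing an x, whereas "^x" matches only names that start with x.

```r
# A scratch environment keeps the example independent of the workspace
env <- new.env()
evalq({
  x  <- sin(9) / 75
  x2 <- 9
  y2 <- 10
  m  <- matrix(c(1, 2, 4, 1), ncol = 2)
}, env)

# All objects, and only those whose name starts with x
all.names <- ls(env)
x.names   <- ls(env, pattern = "^x")

# solve(m) is the inverse of m, so the product is the identity matrix
identity.check <- evalq(m %*% solve(m), env)
```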
91. and mfcol fills the layout by column. When the mfrow parameter is set, an empty graph window will appear, and with each high level plot command a part of the graph layout is filled. We have seen an example in the previous section, see figure 7.6.

A more flexible alternative to set the layout of a plotting window is to use the function layout. An example: three plots are created on one page; the first plot covers the upper half of the window, and the second and third plot share the lower half of the window.

    ## first argument is a matrix with integers
    ## specifying the next n graphs
    nf = layout(rbind(c(1, 1), c(2, 3)))
    ## If you are not sure how layout has divided the window,
    ## use layout.show to display the window splits
    layout.show(nf)
    plot(rnorm(100), type = "l")
    hist(rnorm(100))
    qqnorm(runif(100))

[Figure 7.8: The plotting area of this graph is divided with the layout function. Panels: a line plot of rnorm(100), "Histogram of rnorm(100)" and "Normal Q-Q Plot".]

The matrix argument in the layout function can contain 0's (zeros), leaving a certain sub plot empty. For example:

    nf = layout(rbind(c(1, 1), c(0, 2)))

Other settings

The following list
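A runnable version of the layout example is sketched below. To keep it independent of the platform we open a null pdf device with pdf(NULL), which draws nothing to disk; that choice, and the second set of plots, are our own additions. The value returned by layout is the number of figures in the layout.

```r
pdf(NULL)   # null device: plots are drawn but no file is written

# Plot 1 spans the top row, plots 2 and 3 share the bottom row
nf <- layout(rbind(c(1, 1), c(2, 3)))
plot(rnorm(100), type = "l")
hist(rnorm(100))
qqnorm(runif(100))

# A 0 in the matrix leaves that cell of the layout empty
nf2 <- layout(rbind(c(1, 1), c(0, 2)))
plot(1:10)
hist(rnorm(50))

invisible(dev.off())
```

On an interactive device, layout.show(nf) can be called right after layout to display how the window has been split.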
92. and then new style classes. Note that many of the existing routines still make use of the old style classes. When creating new classes, it is recommended to use new style classes.

9.1.2 Old style classes

Generic and specific methods

In R there are a number of generic methods that can be applied to objects. For example, any object can be printed by using the generic print method:

    print(mydf)
    print(myfit)

or simply

    mydf
    myfit

A data frame will be printed in a different way than an object of class lm. It may not be surprising that the generic print function does not do the actual printing, but rather looks at the class of an object and then calls the specific print method of this class. The function print therefore does not show much of the code that does the actual printing.

    print
    function (x, ...)
    UseMethod("print")
    <environment: namespace:base>

A generic function has this form: a one-liner with a call to the function UseMethod. For example, if the class of the object myfit is lm, then print(myfit) will call the function print.lm. If the class of the object is someclass, then R will look for the function print.someclass. If that function does not exist, then the function print.default will be called. The function methods returns all specific methods for a certain class:

    > methods(class = "lm")
    [1] add1.lm*       alias.lm*      anova.lm       case.names.lm*
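The dispatch mechanism can be illustrated with a generic of our own. In this sketch the generic describe and its methods are hypothetical names invented for the example; only UseMethod and structure are existing R functions.

```r
# A generic function is a one-liner that calls UseMethod
describe <- function(x, ...) UseMethod("describe")

# Fallback used when no class-specific method exists
describe.default <- function(x, ...) "default method"

# Specific method for objects of class "someclass"
describe.someclass <- function(x, ...) "someclass method"

obj <- structure(list(value = 1), class = "someclass")

res.specific <- describe(obj)    # dispatches to describe.someclass
res.default  <- describe(1:3)    # no describe.integer, so describe.default
```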
93. as one of the following four forms:

1. A pair (rows, cols), where rows is a vector representing the row numbers and cols is a vector representing column numbers. Rows and/or cols can be empty or negative. The following examples will illustrate the different possibilities.

    x <- matrix(1:36, ncol = 6)
    ## the element in row 2 and column 6 of x
    x[2, 6]
    [1] 32
    ## the third row of x
    x[3, ]
    [1]  3  9 15 21 27 33
    ## the element in row 3 and column 1 and
    ## the element in row 3 and column 5
    x[3, c(1, 5)]
    [1]  3 27
    ## show x, except for the first column
    x[, -1]
         [,1] [,2] [,3] [,4] [,5]
    [1,]    7   13   19   25   31
    [2,]    8   14   20   26   32
    [3,]    9   15   21   27   33
    [4,]   10   16   22   28   34
    [5,]   11   17   23   29   35
    [6,]   12   18   24   30   36

A negative pair results in a so-called minor matrix, where a column and row is omitted.

    x[-3, -4]
         [,1] [,2] [,3] [,4] [,5]
    [1,]    1    7   13   25   31
    [2,]    2    8   14   26   32
    [3,]    4   10   16   28   34
    [4,]    5   11   17   29   35
    [5,]    6   12   18   30   36

The matrix x remains the same, unless you assign the result back to x.

    x <- x[-3, -4]

As with vectors, matrix elements or parts of matrices can be changed by using the matrix subscript mechanism and the assignment operator together. To change one element:

    x[4, 5] <- 5

To change a complete column:

    x <- matrix(rnorm(100), ncol = 10)
    x[, 1] <- 1:10

2. A logical matrix with the same dimension as x
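The first two subscript forms can be tried out with the short sketch below. The object names minor and large are our own; the matrix is the same 6 by 6 matrix used above.

```r
x <- matrix(1:36, ncol = 6)

elem  <- x[2, 6]          # single element: row 2, column 6
row3  <- x[3, ]           # complete third row
pair  <- x[3, c(1, 5)]    # row 3, columns 1 and 5
minor <- x[-3, -4]        # drop row 3 and column 4

# Form 2: a logical matrix of the same dimension selects elements
is.large <- x > 30        # TRUE where the element exceeds 30
large <- x[is.large]

# Subscripts combined with assignment change part of a copy
x2 <- x
x2[, 1] <- 0L             # overwrite the first column
```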
94. ase R system can be divided into two groups:

High level plot functions. These functions produce complete graphics and will erase existing plots if not specified otherwise.

Low level plot functions. These functions are used to add graphical objects like lines, points and texts to existing plots.

The most elementary plot function is plot. In its simplest form it creates a scatterplot of two input vectors.

    x <- rnorm(50)
    y <- rnorm(50)
    plot(x, y)

To add titles to the existing plot use the low-level function title.

    title("Figure 1")

[Figure 7.1: A scatterplot with a title.]

Use the option type = "l" (l as in line) in the plot function to connect consecutive points. This option is useful to plot mathematical functions. For example:

    u <- seq(0, 4*pi, by = 0.05)
    v <- sin(u)
    plot(u, v, type = "l", xlab = "x axis", ylab = "sin")
    title("figure 2")

In case of drawing functions or expressions, the function curve can be handy; it takes some work away. The above code can be replaced by the following call to produce the same graph.

    curve(sin(x), 0, 4*pi)

7.2 More plot functions

In this section we just mention some useful plot functions; we refer to the help files of the correspondi
95. asses = c("numeric", "numeric", "factor", "character"))

There is an advantage in using colClasses, especially when the data set is large. If you don't use colClasses, then during a data import R will store the data as character vectors before deciding what to do with them.

Character strings in a text file may be quoted and may contain the separator symbol. To import such text files use the quote argument. Suppose we have the following comma separated text file that we want to read:

    Col1,Col2,Col3
    12,45,"Davis, Joe"
    23,78,"White, Jimmy"

Use the read.csv function as follows to import the above text:

    read.csv("myfile", quote = "\"")
      Col1 Col2         Col3
    1   12   45   Davis, Joe
    2   23   78 White, Jimmy

3.1.1 The scan function

The read.table function uses the more low-level function scan. This function may also be called directly by the user, and can sometimes be handy when read.table cannot do the job. It reads the data into a vector or list; the user can then manipulate this vector or list. For example, if we use scan to read the text file above we get:

    scan("myfile", what = "character", sep = ",", strip.white = TRUE)
    [1] "Col1"         "Col2"         "Col3"         "12"           "45"
    [6] "Davis, Joe"   "23"           "78"           "White, Jimmy"
    Read 9 items

3.2 Excel files

To read and write Excel files you can use the package xlsReadWrite. This package provides the functions read.xls and write.xls. If the data is in the first sheet and s
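The effect of colClasses and quote can be tried without creating a file by passing the text directly. In this sketch the text argument of read.csv and scan stands in for the file name used above, and the double quote as quoting character is an assumption made for the illustration.

```r
# The file contents as one string; fields containing a comma are quoted
csv.text <- 'Col1,Col2,Col3
12,45,"Davis, Joe"
23,78,"White, Jimmy"'

# colClasses fixes the column types up front; quote names the quote character
dat <- read.csv(text = csv.text, quote = "\"",
                colClasses = c("numeric", "numeric", "character"))

# scan returns the raw fields as one character vector
items <- scan(text = csv.text, what = "character", sep = ",",
              quote = "\"", strip.white = TRUE, quiet = TRUE)
```

Because the names are quoted, the comma inside each name is not treated as a separator, so scan returns nine items.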
96. ast four functions can also be used on more than one vector, in which case the sum, product, minimum or maximum is taken over all elements of all vectors. For two numeric vectors x and y:

    sum(x, y)
    prod(x, y)
    max(x, y)
    min(x, y)
    ## chop off last part of a vector
    x <- 10:100
    length(x) <- 20

Note that sum(x, y) is equal to sum(c(x, y)). The function cumsum(x) generates a vector with the same length as the input vector. The i-th element of the resulting vector is equal to the sum of the first i elements of the input vector. Example:

    cumsum(rep(2, 10))
    [1]  2  4  6  8 10 12 14 16 18 20

To sort a vector in increasing order use the function sort. You can also use this function to sort in decreasing order by using the argument decreasing = TRUE.

    x <- c(2, 6, 4, 5, 5, 8, 8, 1, 3, 0)
    length(x)
    [1] 10
    sort(x)
    [1] 0 1 2 3 4 5 5 6 8 8
    sort(x, decr = TRUE)
    [1] 8 8 6 5 5 4 3 2 1 0

With the function order you can produce a permutation vector which indicates how to sort the input vector in ascending order. If you have two vectors x and y, you can sort x and permute y in such a way that the elements have the same order as the sorted vector x.

    x <- rnorm(10)    ## create 10 random numbers
    y <- 1:10         ## create the numbers 1, 2, 3, ..., 10
    z <- order(x)     ## create a permutation vector
    sort(x)           ## sort x
    [1] -1.069 -0.603 -0.554  0.872  0.942  0.972  1.083  1.924  2.194  2.456
    y[z]              ## change the order of elements of y
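The relationship between sort and order can be verified with a few lines. In this sketch the vectors a and b and the seed are our own choices; x is the vector from the sort example above.

```r
x <- c(2, 6, 4, 5, 5, 8, 8, 1, 3, 0)

increasing <- sort(x)
decreasing <- sort(x, decreasing = TRUE)
running    <- cumsum(rep(2, 10))

set.seed(7)
a <- rnorm(10)      # values to sort
b <- 1:10           # labels that must follow the values
z <- order(a)       # permutation that sorts a

a.sorted <- a[z]    # same result as sort(a)
b.sorted <- b[z]    # labels rearranged in the same way

# sum over several vectors equals the sum of the combined vector
total <- sum(x, b)
```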
97. ata Source Name (DSN) using the administrative tools in Windows. Once that is done, R can import data from the corresponding database.

• Go to the Control Panel, select Administrative Tools and select Data Sources (ODBC).
• In the tab User DSN click the Add button, select the MS Access driver and click Finish.
• Now choose a name for the data source, say MyAccessData, and select the MS Access database file.

Now the DSN has been set up and we can import the data from the database into R. First make a connection object using the function odbcConnect.

    library(RODBC)
    conn <- odbcConnect("MyAccessData")
    conn
    RODBC Connection 1
    Details:
      case=nochange
      DSN=MyAccessData
      DBQ=C:\DOCUMENTS AND SETTINGS\LONGHOW LAM\My Documents\LonghowStuff\Courses\R\MyAccess.mdb
      DriverId=25
      FIL=MS Access
      MaxBufferSize=2048
      PageTimeout=5

If you have established a connection successfully, the connection object will display a summary of the connection. To display table information use sqlTables(conn), which will display all tables, including system tables. To import a specific table use the function sqlFetch.

    sqlFetch(conn, "Table1")
      ID   Col1 Col2       Col3 Col4
    1  1   John  123 1973-09-12    A
    2  2 Martin  456 1999-12-31    B
    3  3  Clair  345 1978-05-22    B

Use the function sqlQuery to submit an SQL query to the database and retrieve the result.

    myq <- "S
98. aved in a file .RData. This is a binary file located in the working directory of R, which is by default the installation directory of R. During your R session you can also explicitly save the workspace image. Go to the File menu and then select Save Workspace, or use the save.image function.

    ## save to the current working directory
    save.image()
    ## just checking what the current working directory is
    getwd()
    ## save to a specific file and location
    save.image("C:\\Program Files\\R\\R-2.5.0\\bin\\.RData")

If you have saved a workspace image and you start R the next time, it will restore the workspace. So all your previously saved objects are available again. You can also explicitly load a saved workspace file, which could be the workspace image of someone else. Go to the File menu and select Load workspace.

1.7 R Packages

One of the strengths of R is that the system can easily be extended. The system allows you to write new functions and package those functions in a so-called R package (or R library). The R package may also contain other R objects, for example data sets or documentation. There is a lively R user community, and many R packages have been written and made available on CRAN for other users. Just a few examples: there are packages for portfolio optimization, drawing maps, exporting objects to html, time series analysis and spatial statistics, and the list goes on and on. In section 9.5.1 we'll give
99. bstring returns a substring of a character object. For example:

    x <- c("Gose", "Longhow", "David")
    substring(x, first = 2, last = 4)
    [1] "ose" "ong" "avi"

The function paste will paste two or more character objects. For example, to create a character vector with "number.1", "number.2", ..., "number.10" proceed as follows:

    paste("number", 1:10, sep = ".")
    [1] "number.1"  "number.2"  "number.3"  "number.4"
    [5] "number.5"  "number.6"  "number.7"  "number.8"
    [9] "number.9"  "number.10"

The argument sep is used to specify the separating symbol between the two character objects.

    paste("number", 1:10, sep = "-")
    [1] "number-1"  "number-2"  "number-3"  "number-4"
    [5] "number-5"  "number-6"  "number-7"  "number-8"
    [9] "number-9"  "number-10"

Use sep = "" for no space between the character objects.

4.5.2 Finding patterns in character objects

The functions regexpr and grep can be used to find specific character strings in character objects. The functions use so-called regular expressions, a handy format to specify search patterns. See the help for regexpr to find out more about regular expressions.

Let's extract the row names from our data frame cars.

    car.names <- row.names(cars)

We want to know if a string in car.names starts with "Volvo", and if there is one, the position it has in car.names. Use the function grep as follows:

    grep("Volvo", car.names)
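The character functions above can be combined in a small self-contained sketch. The vector car.names below is a made-up stand-in for the row names of the cars data frame, and anchoring the pattern with ^ is our own addition to match only names that start with Volvo.

```r
x <- c("Gose", "Longhow", "David")
sub.x <- substring(x, first = 2, last = 4)

# paste recycles the shorter argument; sep sets the separator
labels <- paste("number", 1:10, sep = ".")

# Hypothetical car names for illustration
car.names <- c("Acura Integra 4", "Volvo 240 4", "Ford Volvo-like", "Volvo 740 4")

# grep returns the positions of the elements that match the pattern;
# "^Volvo" matches only strings that start with Volvo
volvo.idx <- grep("^Volvo", car.names)

# regexpr returns, per element, where the match starts (-1 if no match)
volvo.pos <- as.integer(regexpr("Volvo", car.names))
```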
100. cal summaries and tests

A number of functions return statistical summaries and tests. The following table contains a list of only some of the statistical functions in R. The names of the functions usually speak for themselves.

    Function             Purpose
    acf(x, plot = F)     auto or partial correlation coefficients
    chisq.test(x)        chi-squared goodness of fit test
    cor(x, y)            correlation coefficient
    ks.test(x)           Kolmogorov-Smirnov goodness of fit test
    mad(x)               median absolute deviation
    mean(x)              mean
    mean(x, trim = a)    trimmed mean
    median(x)            median
    quantile(x, probs)   sample quantile at given probabilities
    range(x)             the range, i.e. the vector c(min(x), max(x))
    stem(x)              stem-and-leaf plot
    t.test(x)            one or two sample Student's t-test
    var(x)               variance of x or covariance matrix of x
    var(x, y)            covariance
    var.test(x, y)       test on variance equality of x and y

Table 8.1: Some functions that calculate statistical summaries

The remainder of this sub-section will give some examples of the above functions.

quantiles

The quantile function needs two vectors as input. The first one contains the observations, the second one contains the probabilities corresponding to the quantiles. The function returns the empirical quantiles of the first data vector. To calculate the 5 and 10 percent quantile of a sample from a N(0,1) distribution, proceed as follows:

    x <- rnorm(100)
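A minimal sketch of the quantile calculation, together with two other summaries from Table 8.1, is given below. The seed and the trimming fraction 0.1 are our own choices for the illustration.

```r
set.seed(1)
x <- rnorm(100)     # sample from a N(0,1) distribution

# Empirical 5 and 10 percent quantiles of the sample
q <- quantile(x, probs = c(0.05, 0.10))

# Two more summaries from the table
trimmed <- mean(x, trim = 0.1)   # mean after dropping 10% at each end
spread  <- mad(x)                # median absolute deviation
```

The result q is a named vector; its names are the requested percentages.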
101. conditioning plots 123 130 conflicting objects 15 209 conflicts 15 control flow 77 Cox proportional hazard model 164 csv files 42 cumulative sum 49 curve 104 cut 68 data frames 35 databases 45 debug 82 debugging 80 delimited files 42 deriv 178 difftime 26 dim 32 double 19 duplicated 50 eclipse 16 eval 189 Excel files 44 expressions 189 Facetting 133 factor 22 factor variables 152 FALSE 21 figure region 114 font 117 for 79 formula objects 141 Fortran 92 free variables 75 glm 155 Graphical Devices 110 Index Index grep 65 gsub 67 head 55 help 11 help 19 HTML 204 if 77 ill conditioned models 174 import data 42 integer 20 is infinite 27 is na 27 is nan 27 join 59 jpeg 112 Kendall s tau a 158 language objects 189 lapply 87 Latex 204 layout 115 layout show 115 lazy evaluation 76 legends 120 length 49 level 23 levels 23 lexical scope 75 line type 117 line width 117 Linear regression 142 lines 119 lists 38 local variables 73 logical 21 logistic regression 154 loops 77 low level plot functions 119 lrm 159 margins 114 masked objects 15 210 mathematical expressions in graphs 120 Mathematical operators 29 matrix 31 merge 59 model diagnostics 146 mtext 120 multicollinearity 149 multiple plots per page 115 NA 27 NaN 27 nchar 64 Non linear
102. ctions that can be called:

• printing and error handling
• numerical and mathematical
• memory allocation

As an example we slightly modify the above arsim.c file:

    #include <R.h>

    void arsim(double *x, long *n, double *phi)
    {
        long i;
        Rprintf("Before the loop \n");
        if (*n > 100)
            MESSAGE "vector is larger than 100" WARN;
        for (i = 1; i < *n; i++)
            x[i] = *phi * x[i-1] + x[i];
        Rprintf("After the loop \n");
    }

Note that if you have loaded the dll with dyn.load, you must not forget to unload it with the function dyn.unload if you want to build a newer version. R has locked the dll and the compiler is not able to build a new version. After a successful build we can run arsimC again, which now gives some extra output:

    out2 <- arsimC(rnorm(500), phi = 0.75)
    Before the loop
    After the loop
    Warning message:
    vector is larger than 100

6.4.3 Evaluating R expressions in C

A handy thing to do in C is evaluating R expressions or R functions. This enables you, for example, to write a numerical optimization routine in C and pass an R function to that routine. Within the C routine, calls to the R function can then be made; this is just the way, for example, the R function optim works. To evaluate R expressions or R functions in C it is better to use the .Call or .External interfaces. The eval function is the function in C that can be used to evaluate R expressions. S
103. curate the test. The area under the curve (AUROC) is a measure of how accurately the model can predict "good". A value of 1 is a perfect predictor, while a value of 0.5 is a very bad predictor. The code below shows how to create an ROC curve and how to calculate the AUROC:

    library(ROCR)
    pred <- prediction(test.glm$fitted, testdata$y)
    perf <- performance(pred, "tpr", "fpr")
    plot(perf, colorize = T, lwd = 3)
    abline(a = 0, b = 1)
    performance(pred, measure = "auc")@y.values
    [[1]]
    [1] 0.8353878

Figure 8.4: The ROC curve to assess the quality of a logistic regression model (true positive rate against false positive rate; the diagonal represents a worthless model)

8.4.3 Predictive ability of a logistic regression

Another way to look at the quality of a logistic regression model is to look at the predictive ability of the model. When we predict P(Y = good) with the model, a pair of observations with different observed responses (one "good", the other "bad") is said to be concordant if the observation with the "bad" has a lower predicted probability than the observation with the "good". If the observation with the "bad" has a higher predicted probability than the observation with the "good", then the pair is discordant. If the pair is neither concordant nor discordant, it is a tie. Four measures of association for assessing the predictive abil
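The definition of concordant, discordant and tied pairs can be sketched directly in base R. The probabilities and responses below are made up for illustration, and the summary at the end (share of concordant pairs, with ties counted as one half) is one common choice, not necessarily one of the measures the text goes on to list.

```r
# Hypothetical predicted probabilities of "good" and observed responses.
p <- c(0.9, 0.8, 0.4, 0.6, 0.3, 0.8)
y <- c("good", "good", "good", "bad", "bad", "bad")

p.good <- p[y == "good"]
p.bad  <- p[y == "bad"]

# Compare every "good" observation with every "bad" observation:
# d[i, j] = p.good[i] - p.bad[j]
d <- outer(p.good, p.bad, "-")

n.concordant <- sum(d > 0)   # "bad" has the lower predicted probability
n.discordant <- sum(d < 0)   # "bad" has the higher predicted probability
n.tied       <- sum(d == 0)  # equal predicted probabilities

# Share of concordant pairs, counting each tie as one half.
c.index <- (n.concordant + 0.5 * n.tied) / length(d)
```

With three "good" and three "bad" observations there are nine pairs in total; here six are concordant, two discordant and one is a tie.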
104. d 91 6 3 Using Compiled code rs asad ide Gee wae OAM a A nd er Be ae 92 6 3 1 The C and Fortran interfaces lona io Bg ence es 93 6 3 2 The Call and External interfaces 94 6 4 Some Compiled Code examples oaoa a a 94 6 4 1 The arsim example Las e a A eek ee 94 64 2 Using inclide RA a DA A Boe de a 96 6 4 3 Evaluating R expressions in C aoaaa o 98 Graphics 103 del Tntrod ction DL A e ar a ote ae dd koa a de 103 G2 More plot functions af pine gna a e pr eaa Ya Bin Ya de Gy 104 TL The plot funcion 2 29 0 ae SR A A a eek at 105 alan A A se ay Gad dds Gane bP ee es 106 7 2 3 Twoor more variables o 109 12d Graphical Devices us Weeds tito Dd di GA Se tke nae a da de S 111 7 3 Modifying a graph mai a A o pd do Dd 112 7 3 1 Graphical parameters dias a Bk Se So 112 7 3 2 Some handy low level functions 119 7 3 3 Controlling the axes lp a ea ed Os eB de Poe 121 TA relis Graphics co 4 a AN oP Ea oa a ES E e 123 TAT troqueladas e hee 123 T42 Multi panel graphs any rep ie elds oe eS oie AR ee Ske 125 7 4 3 Trellis panel functions 22 4 4 2 4 24 A ew es oe 128 TAA Conditioning plots ss dvs oie RE aa de yd 130 7 5 The ggplot2 package e pb cc GS O O eee eee is 131 Job Wie apo acosa ES woe erase ene aoe aos 131 AR e a AA ead De Gnd Ae et ad He BR 133 7 5 3 Plots with several layers ut pira 134 Statistics 135 8 1 Basic statistical functions pala Ao EE a 1
105. dered testdata x levels c C B D X The order in the levels is specified by the levels argument You can check the order by printing the levels of the variable 153 CHAPTER 8 STATISTICS 8 4 LOGISTIC REGRESSION levels testdata x Et nou p p DA Im y x contrast list x contr treatment data testdata Call lm formula y x data testdata contrasts list x contr treatment Coefficients Intercept X2 x3 x4 30 04 19 93 19 98 25 11 In the above example we used the function ordered to define the level C as the lowest level Consequently level C is left out in the regression and the remaining parameters are interpreted as difference between that level and the level C The reorder function is used to order factor variables based on some other data Sup pose we want to order the levels of x in such a way that the lowest level has the smallest variance in y then we use reorder as follows testdata x lt reorder testdata x testdata y var levels testdata x 1 p nyu nou p Level D has the smallest variance in y and will be left out in a regression where we use a treatment contrast for the regression variable x 8 4 Logistic regression Logistic regression can be used to model data where the response variable is binary a factor variable with two levels For example a variable Y with two categories yes no or good bad Many companies have build score cards based on
106. e into another character representation. You need to provide a conversion specification that starts with a % followed by a single letter.

    # first creating four characters
    x <- c("1jan1960", "2jan1960", "31mar1960", "30jul1960")
    z <- strptime(x, "%d%b%Y")
    zt <- as.POSIXct(z)
    zt
    [1] "1960-01-01 W. Europe Standard Time"
    [2] "1960-01-02 W. Europe Standard Time"
    [3] "1960-03-31 W. Europe Daylight Time"
    [4] "1960-07-30 W. Europe Daylight Time"

    # pasting 4 character dates and 4 character times together
    dates <- c("02/27/92", "02/27/92", "01/14/92", "02/28/92")
    times <- c("23:03:20", "22:29:56", "01:03:30", "18:21:03")
    x <- paste(dates, times)
    z <- strptime(x, "%m/%d/%y %H:%M:%S")
    zt <- as.POSIXct(z)
    zt
    [1] "1992-02-27 23:03:20 W. Europe Standard Time"
    [2] "1992-02-27 22:29:56 W. Europe Standard Time"
    [3] "1992-01-14 01:03:30 W. Europe Standard Time"
    [4] "1992-02-28 18:21:03 W. Europe Standard Time"

An object of type POSIXct can be used in certain calculations: a number can be added to a POSIXct object. This number will be interpreted as the number of seconds to add to the POSIXct object.

    zt + 13
    [1] "1992-02-27 23:03:33 W. Europe Standard Time"
    [2] "1992-02-27 22:30:09 W. Europe Standard Time"
    [3] "1992-01-14 01:03:43 W. Europe Standard Time"
    [4] "1992-02-28 18:21:16 W. Europe Standard Time"

You can subtract two POSIXct objects, the result
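Both kinds of arithmetic can be sketched as follows. The time zone is fixed to UTC here (an assumption, chosen so the result does not depend on the local settings of the machine).

```r
# Two of the time stamps used above, in a fixed time zone (assumption).
zt <- as.POSIXct(c("1992-02-27 23:03:20", "1992-02-28 18:21:03"), tz = "UTC")

# Adding a number shifts the time stamps by that many seconds.
zt.later <- zt + 13

# Subtracting two POSIXct objects gives an object of class difftime;
# R picks a suitable unit (here hours) for printing.
elapsed <- zt[2] - zt[1]

# difftime lets you choose the unit explicitly.
elapsed.secs <- as.numeric(difftime(zt[2], zt[1], units = "secs"))
```

The two time stamps are 19 hours, 17 minutes and 43 seconds apart, which is 69463 seconds.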
107. e Country Reliability Mileage Type Weight Disp HP
    Nissan Van 4 14799   Japan          NA      19  Van   3690  146 106

The result is a data frame with one row. To select more rows, use a vector of names:

    cars[c("Nissan Van 4", "Dodge Grand Caravan V6"), ]

If the given row name does not exist, R will return a row with NA's:

    cars["Lada", ]
       Price Country Reliability Mileage Type Weight Disp HP
    NA    NA    <NA>          NA      NA <NA>     NA   NA NA

Rows from a data frame can also be selected using row numbers. Select cases 10 through 14 from the cars data frame:

    cars[10:14, ]
                        Price   Country Reliability Mileage   Type Weight Disp  HP
    Subaru Justy 3       5866     Japan          NA      34  Small   1900   73  73
    Toyota Corolla 4     8748 Japan/USA           5      29  Small   2390   97 102
    Toyota Tercel 4      6488     Japan           5      35  Small   2075   89  78
    Volkswagen Jetta 4   9995   Germany           3      26  Small   2330  109 100
    Chevrolet Camaro V8 11545       USA           1      20 Sporty   3320  305 170

The first few rows or the last few rows can be extracted by using the functions head or tail:

    head(cars, 3)
                   Price Country Reliability Mileage  Type Weight Disp  HP
    Eagle Summit 4  8895     USA           4      33 Small   2560   97 113
    Ford Escort 4   7402     USA           2      33 Small   2345  114  90
    Ford Festiva 4  6319   Korea           4      37 Small   1845   81  63

    tail(cars, 2)
                    Price Country Reliability Mileage Type Weight Disp  HP
    Nissan Axxess 4 13949   Japan          NA      20  Van   3185  146 138
    Nissan Van 4    14799   Japan          NA      19  Van   3690  146 106

To subset specific cases from a data frame you ca
108. e dealing with an object of type character.

2.1.6 Factor

The factor data type is used to represent categorical data (i.e. data of which the value range is a collection of codes). For example:

• variable "sex" with values male and female
• variable "blood type" with values A, AB and O

An individual code of the value range is also called a level of the factor variable. So the variable sex is a factor variable with two levels, male and female.

Sometimes people confuse factor type with character type. Characters are often used for labels in graphs, column names or row names. Factors must be used when you want to represent a discrete variable in a data frame and want to analyze it.

Factor objects can be created from character objects or from numeric objects, using the function factor. For example, to create a vector of length five of type factor, do the following:

    sex <- c("male", "male", "female", "male", "female")

The object sex is a character object. You need to transform it to factor:

    sex <- factor(sex)
    sex
    [1] male   male   female male   female

Use the function levels to see the different levels a factor variable has:

    levels(sex)
    [1] "female" "male"

Note that the result of the levels function is of type character. Another way to generate the sex variable is as follows:

    sex <- c(1, 1, 2, 1, 2)

The object sex is a numeric variable; you need to transform
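Both routes to a factor can be sketched side by side. The labels argument used for the numeric route is standard base R, but pairing code 1 with "male" and code 2 with "female" is an assumption made for this illustration.

```r
# Route 1: from a character vector. Levels are sorted alphabetically.
sex.chr <- c("male", "male", "female", "male", "female")
sex1 <- factor(sex.chr)

# Route 2: from numeric codes. The labels argument attaches a
# descriptive label to each code (assumption: 1 = male, 2 = female).
sex.num <- c(1, 1, 2, 1, 2)
sex2 <- factor(sex.num, levels = c(1, 2), labels = c("male", "female"))

# Internally a factor stores integer codes that index into its levels.
codes1 <- as.integer(sex1)
```

Note that the two factors hold the same values but order their levels differently: alphabetically in the first case, in the order given by levels and labels in the second.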
109. e latter. We will make use of the NetBeans IDE. This is a powerful yet easy to use environment for creating java GUI applications; it is freely available from www.netbeans.org. So to replicate the example in this section, download the following software:

• Java Development Kit (JDK)
• The NetBeans IDE
• The R package rJava

Figure 9.2: Some Lissajous plots

The next figure shows a small application that allows the user to import a text file, create explorative plots and fit a regression model. The NetBeans project and java code files are available from the website of this document. The code is not that difficult. Most of the work is done in the JRI package, which contains an REngine object that you can embed in your java code.

A brief description of the java gui. A global REngine object re is defined and created in the java code:

    Rengine re = new Rengine(args, false, new TextConsole());

Throughout the java program the object re can be used to evaluate R expressions. For example, if the "Import Data" button is clicked, an import file dialog appears that will return a filename; then the following java code is called:

    String evalstr = "infile <- " + filename
110. e only need to combine the common columns of these data frames you can use the subscripting mechanism and the function rbind rand comb lt rbind rand dfil cl unit binon 1 rand df2 c unif binom rand comb unif binom 6230449 9921276 ILLES 9318150 6455379 4543552 8131595 9460346 6632598 7688208 oono BARUN 0 060 Oo oOo OC Oo Oo E O oF EF EY ON NO NM b o The functions rbind expects that the two data frames have the same columns The function rbind fill in the reshape package can stack two or more data frames with any columns It will fill a missing column with NA library reshape rbind fill rand df1 rand df2 rand df1 norm binom unif chisq 1 3 0309036 1 0 39182298 NA 2 1 5897306 O 0 04189106 NA 3 1 3976871 2 0 09756326 NA 4 0 4867048 O 0 70522637 NA 5 1 7282814 O 0 42753294 NA 6 NA O 0 98808959 5 6099156 7 NA 1 0 56966460 2 5105316 8 NA 1 0 53950251 1 0920222 9 NA O 0 01064824 0 2301267 10 NA 1 0 87821054 3 8488757 58 CHAPTER 4 DATA MANIPULATION 4 3 MANIPULATING DATA FRAMES 4 3 4 Merging data frames Two data frames can be merged into one data frame using the function merge The join operation in database terminology If the original data frames contain identical columns these columns only appear once in the merged data frame Consider the following two data frames testi lt read delim test1 txt sep testl name year BA HR al Dick 1963 0
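A minimal sketch of merge on two small data frames. The names and values are invented for illustration and are not the contents of the test1.txt file mentioned above.

```r
# Hypothetical data frames sharing the column "name".
test1 <- data.frame(name = c("Dick", "Gose", "Rolf"),
                    year = c(1963, 1970, 1971))
test2 <- data.frame(name = c("Dick", "Gose", "Heleen"),
                    score = c(4.5, 3.1, 5.0))

# Default behaviour: only rows whose key occurs in both data frames
# are kept (an inner join). The key column appears once in the result.
inner <- merge(test1, test2, by = "name")

# all = TRUE keeps every row from both sides (a full outer join);
# columns that have no match are filled with NA.
full <- merge(test1, test2, by = "name", all = TRUE)
```

Use all.x = TRUE or all.y = TRUE instead of all = TRUE to keep the unmatched rows of only one of the two data frames.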
111. e the call stack when an error occurred.

5.4.3 Stepping through a function

With traceback you will know in which function the error occurred; it will not tell you where in the function the error occurred. To find the error in the function you can use the function debug, which will tell R to execute the function in debug mode. If you want to step through everything, you will need to set the debug flag for the main function and the functions that the main function calls:

    debug(testf)
    debug(myf)

Now execute the function testf. R will display the body of the function and a browser environment is started.

    testf(9)
    debugging in: testf(9)
    debug: myf(pp)
    Browse[1]>

In the browser environment there are a couple of special commands you can give:

• n, executes the current line and prints the next one.
• c, executes the rest of the function without stopping.
• Q, quits the debugging completely, so halting the execution and leaving the browser environment.
• where, shows you where you are in the function call stack.

In addition to these special commands, the browser environment acts like an interactive R session; that means you could enter commands like:

• ls(), show all objects in the local environment (the current function).
• print(object) or just object, prints the value of the object.
• 675/98876, just some calculations.
• object <- 89, assigning a new value
112. e to manipulate R objects directly.

Many users think of R as a statistics system. We prefer to think of it as an environment within which statistical techniques are implemented. R can be extended (easily) via packages. There are about eight packages supplied with the R distribution, and many more are available through the CRAN family of Internet sites, covering a very wide range of modern statistics.

R has its own LaTeX-like documentation format, which is used to supply comprehensive documentation, both on-line in a number of formats and in hardcopy.

1.3 Obtaining and installing R

R can be downloaded from the Comprehensive R Archive Network (CRAN). You can download the complete source code of R, but more likely, as a beginning R user, you want to download the precompiled binary distribution of R. Go to the R web site http://www.r-project.org, select a CRAN mirror site and download the base distribution file; under Windows: R-2.7.0-win32.exe. At the time of writing the latest version is 2.7.0. We will mention user contributed packages in the next section.

The base file has a size of around 29MB, which you can execute to install R. The installation wizard will guide you through the installation process. It may be useful to install the R reference manual as well; by default it is not installed. You can select it in the installation wizard.

1.4 Your first R session

Start the R syst
113. e will create exports this function

    SEXP Integrate(SEXP fn, SEXP a, SEXP b, SEXP rho)
    {
        SEXP ans;
        double mys;
        mys = qromb(REAL(a)[0], REAL(b)[0], fn, rho);
        PROTECT(ans = allocVector(REALSXP, 1));
        REAL(ans)[0] = mys;
        UNPROTECT(1);
        return ans;
    }

The lower bound a and the upper bound b are of type SEXP and are passed from R to the C code; they are converted to double and passed to the qromb function. This function returns the result in the double variable mys, which we transform to a variable of type SEXP so that it can be passed to R.

The only modification to the existing C code qromb is the addition of two input parameters, fn and rho, which will be needed when we want to evaluate the R function that is given by fn. In fact, the function qromb calls polint and trapzd, which will call the function fn, so these functions also need to be given fn and rho.

Modifying the function func

Normally, when you want to use the function qromb in a stand-alone C program, the function to integrate is programmed in the function func. Now this function needs to be adjusted in such a way that it evaluates the function fn that you have given from R.

    double func(const double x, SEXP fn, SEXP rho)
    {
        SEXP R_fcall, fn_out, x_input;
        PROTECT(R_fcall = lang2(fn, R_NilValue));
        PROTECT(x_input = allocVector(REALSXP, 1));
        PROTECT(fn_out = allocVector(VECSXP, 1));
        REAL(x_input)[0] = x
114. ecifies the length of tick marks as a fraction of the smaller of the width or height of the plotting region. In the extreme case tck = 1, grid lines are drawn. To draw a logarithmic x or y axis use log = "x" or log = "y"; if both axes need to be logarithmic, use log = "xy".

    ## adding an extra axis with grid lines, this is on top of the existing axis
    axis(side = 1, at = c(5, 10, 15, 20), labels = rep("", 4), tck = 1, lty = 2)

    ## Example of logarithmic axes
    x <- runif(100, 1, 100000)
    y <- runif(100, 1, 100000)
    plot(x, y, log = "xy", col = "grey")

7.4 Trellis Graphics

7.4.1 Introduction

Trellis graphics add a new dimension to traditional plotting routines. They are extremely useful to visualize multi-dimensional data. You create a trellis graphic by using one of the trellis display functions; these are in the package lattice. Visualization of multi-dimensional data can be achieved through a multi-panel layout, where each panel displays a particular subset of the data. The following table (Table 7.1) displays some of the trellis display functions in R (in the lattice package).
115. em, the main window (RGui) with a sub window (R Console) will appear as in figure 1.1.

    R Console

    Copyright (C) 2006 The R Foundation for Statistical Computing
    ISBN 3-900051-07-0

    R is free software and comes with ABSOLUTELY NO WARRANTY.
    You are welcome to redistribute it under certain conditions.
    Type 'license()' or 'licence()' for distribution details.

    Natural language support but running in an English locale

    R is a collaborative project with many contributors.
    Type 'contributors()' for more information and
    'citation()' on how to cite R or R packages in publications.

    Type 'demo()' for some demos, 'help()' for on-line help, or
    'help.start()' for an HTML browser interface to help.
    Type 'q()' to quit R.

    Loading required package: tcltk
    Loading Tcl/Tk interface ... done
    Loading required package: svMisc
    Loading required package: R2HTML
    [Previously saved workspace restored]

    >

Figure 1.1: The R system on Windows

In the Console window the cursor is waiting for you to type in some R commands. For example, use R as a simple calculator:

    > print("Hello world")
    [1] "Hello world"
    > 1 + sin(9)
    [1] 1.412118
    > 234/87754
    [1] 0.002666545
    > (1 + 0.05)^8
    [1] 1.477455
    > 23.76 * log(8)/(23 + atan(9))
    [1] 2.019920

Results of calculations can be stored in objects using the assignment operators:

• An arrow (<-) formed by a smaller than character
116. ent x is a vector of characters and split is a character vector containing regular expressions that are used for the split If it is NULL as in the above example the character strings are split into single characters If it is not null R will look at the elements in x if the split string can be matched the characters left of the match will be in the output and the characters right of the match will be in the output strsplit x c Some text another string Amsterdam is a nice city split 7 1 1 Some text LCT 1 another string 131 1 Amsterdam is a nice city 4 6 Creating factors from continuous data The function cut can be used to create factor variables from continuous variables The first argument x is the continuous vector and the second argument breaks is a vector of breakpoints specifying intervals For each element in x the function cut returns the interval as specified by breaks that contains the element 68 CHAPTER 4 DATA MANIPULATION 4 6 CREATING FACTORS FROM is 1215 breaks lt c 0 5 10 15 20 cut x breaks 1 0 5 0 5 0 5 0 5 0 5 5 101 5 10 5 10 5 101 10 5 107 10 161 10 15 10 15 10 15 10 15 Levels 0 5 5 10 10 15 15 20 The function cut returns a vector of tye factor each element of this vector shows the interval to which the corresponding element of the original vector corresponds If only one number is specified f
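The behaviour of cut can be sketched with the values 1 to 15 and the break points used above. By default the intervals are open on the left and closed on the right, so the value 5 falls in (0,5] and not in (5,10].

```r
x <- 1:15
breaks <- c(0, 5, 10, 15, 20)

# Each element of x is replaced by the interval that contains it.
f <- cut(x, breaks)

# Count how many observations fall in each interval.
counts <- table(f)

# When a single number is given instead of a vector of break points,
# cut divides the range of x into that many intervals of equal width.
f3 <- cut(x, 3)
```

The interval (15,20] stays in the levels of f even though no observation falls in it, so it shows up in the table with a count of zero.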
117. esents the side of the plot for the axis: 1 for bottom, 2 for left, 3 for top and 4 for right. Use the pos argument to specify the x or y position of the axis.

    x <- rnorm(100)
    y <- rnorm(100)
    plot(x, y, axes = F)
    axis(side = 1, pos = 0)
    axis(side = 2, pos = 0)

The location of the tick marks and the labels at the tick marks can be specified with the arguments at and labels respectively.

    ## Placing tick marks at specified locations
    x <- rnorm(100)
    y <- rnorm(100)
    plot(x, y, axes = F)
    xtickplaces <- seq(-2, 2, length = 8)
    ytickplaces <- seq(-2, 2, length = 6)
    axis(side = 1, at = xtickplaces)
    axis(side = 2, at = ytickplaces)

    ## Placing labels at the tick marks
    x <- 1:20
    y <- rnorm(20)
    plot(x, y, axes = F)
    xtickplaces <- 1:20
    ytickplaces <- seq(-2, 2, length = 6)
    xlabels <- paste("day", 1:20, sep = " ")
    axis(side = 1, at = xtickplaces, labels = xlabels)
    axis(side = 2, at = ytickplaces)

Notice that R does not plot all the axis labels. R has a way of detecting overlap, which then prevents plotting all the labels. If you want to see all the labels, you can adjust the character size; use the cex.axis parameter.

    x <- 1:20
    y <- rnorm(20)
    plot(x, y, axes = F)
    xtickplaces <- 1:20
    ytickplaces <- seq(-2, 2, length = 6)
    xlabels <- paste("day", 1:20, sep = " ")
    axis(side = 1, at = xtickplaces, labels = xlabels, cex.axis = 0.5)
    axis(side = 2, at = ytickplaces)

Another useful parameter that you can use is tck. It sp
118. eters. Although you can do this calculation directly on the components of the list, as in the code below:

    fit1.sum <- summary(fit1)
    fit1.sum$cov.unscaled * fit1.sum$sigma^2
                beta1      beta2
    beta1 0.004516374 0.03408523
    beta2 0.034085234 0.28903582

In this case it is more convenient to use the function vcov, which is a generic function that also accepts a model object other than that generated by nls.

    vcov(fit1)
                beta1      beta2
    beta1 0.004516374 0.03408523
    beta2 0.034085234 0.28903582

Use the function predict to calculate model predictions and standard errors of these predictions. Suppose we want to calculate predictions on the values of x from 0 to 10. Then we proceed as follows:

    x <- seq(0, 30, length = 100)
    pred.data <- data.frame(x = x)
    x.pred <- predict(fit1, newdata = pred.data)

The output object x.pred is a vector which contains the predictions. You can insert the predictions in the pred.data data frame and plot the predictions together with the simulated data as follows:

    pred.data$ypred <- x.pred
    plot(our.exp$x, our.exp$y)
    lines(pred.data$x, pred.data$ypred)

8.7.1 Ill-conditioned models

Similar to the problem of multicollinearity in linear regression (as described in section 8.3.3), nonlinear models can be ill-conditioned too. However, with nonlinear models it may not only be a data issue; the nonlinear model itself may be ill-conditioned. Such a model c
119. ether argument values are required or optional. With optional arguments, the specification of the arguments in the function header is:

    argname = defaultvalue

In the following function, for example, the argument x is required and R will give an error if you don't provide it. The argument k is optional, having the default value 2:

    power <- function(x, k = 2) {
        x^k
    }
    power(5)
    [1] 25
    power()
    Error in power() : argument "x" is missing, with no default

However, we can specify a different value for k:

    power(5, 3)
    [1] 125

5.2.2 The "..." argument

The three dots argument can be used to pass arguments from one function to another. For example, graphical parameters that are passed to plotting functions, or numerical parameters that are passed to numerical routines. Suppose you write a small function to plot the sin function from zero to xup:

    plotsin <- function(xup = 2*pi, ...) {
        x <- seq(0, xup, length = 100)
        plot(x, sin(x), type = "l", ...)
    }
    plotsin(col = "red")

The function plotsin now accepts any argument that can be passed to the plot function (like col, xlab, etc.) without needing to specify those arguments in the header of plotsin.

5.2.3 Local variables

Assignments of variables inside a function are local, unless you explicitly use a global assignment (the <<- construction or the assign function). This means a normal assignment within a
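The three ideas in this passage (default values, forwarding through the three dots, and local versus global assignment) can be sketched without any plotting. The functions scaled.power, bump.local and bump.global are hypothetical helpers written for this illustration.

```r
power <- function(x, k = 2) {
    x^k
}

# Extra arguments are forwarded unchanged to power through "...".
scaled.power <- function(x, scale = 1, ...) {
    scale * power(x, ...)
}

r1 <- scaled.power(5)                     # k keeps its default value 2
r2 <- scaled.power(5, scale = 2, k = 3)   # k = 3 travels through "..."

counter <- 0

# A normal assignment creates a local copy; the outer counter is untouched.
bump.local <- function() {
    counter <- counter + 1
    counter
}

# The <<- operator modifies the variable in the enclosing environment.
bump.global <- function() {
    counter <<- counter + 1
    counter
}

local.result <- bump.local()
counter.after.local <- counter
global.result <- bump.global()
counter.after.global <- counter
```

After bump.local the outer counter is still 0; only bump.global changes it.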
120. f a list, in that the number of slots and their names and classes are specified when a class is created; objects are extracted from slots by the @ operator. Exact matching of slot names is used, unlike the $ operator for lists.

A new-style class definition is created with the function setClass. Its first argument is the name of the class, and its representation argument specifies the slots. For example, a class fungi to represent the spatial location of fungi in a field might look like:

    setClass("fungi",
        representation(
            x = "numeric",
            y = "numeric",
            species = "character"
        )
    )

Once a class definition has been created, it can be examined by the function getClass.

    getClass("fungi")
    Slots:

    Name:          x         y   species
    Class:   numeric   numeric character

To list all the classes in R or in your workspace, use the function getClasses.

    # class definitions in your workspace
    getClasses(where = 1)
    [1] "fungi"

The class fungi can also be created by combining other classes. For example:

    setClass("xyloc",
        representation(x = "numeric", y = "numeric")
    )
    setClass("fungi",
        representation("xyloc", species = "character")
    )

A class can be removed with the function removeClass. To create (or instantiate) an object from the fungi class, use the function new.

    field1 <- new("fungi",
        x = runif(10),
        y = runif(10),
        species = sample(letters[1:5], rep = T, 10)
    )
    field1
    An object of class "fungi"
    Slot
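A compact sketch of defining, combining and instantiating these classes. It uses the contains argument of setClass, which is another way of expressing the inheritance from xyloc shown above; the slot values are fixed numbers (an assumption) instead of random draws so that the result is reproducible.

```r
library(methods)

# A base class holding a location.
setClass("xyloc", representation(x = "numeric", y = "numeric"))

# fungi extends xyloc with one extra slot.
setClass("fungi", contains = "xyloc",
         representation(species = "character"))

# Create an object; slot names must be spelled out exactly.
field1 <- new("fungi",
              x = c(0.1, 0.5),
              y = c(0.9, 0.2),
              species = c("a", "b"))

# Slots are extracted with the @ operator.
sp <- field1@species

# An object of class fungi is also an xyloc.
inherits.xyloc <- is(field1, "xyloc")
```

Because the slot classes are fixed in the definition, a call such as new("fungi", x = "a") fails with an error instead of silently storing a character value in a numeric slot.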
121. g characters
4.6 Creating factors from continuous data

5 Writing functions
5.1 Introduction
5.2 Arguments and variables
5.2.1 Required and optional arguments
5.2.2 The "..." argument
5.2.3 Local variables
5.2.4 Returning an object
5.2.5 The Scoping rules
5.2.6 Lazy evaluation
5.3 Control flow
5.3.1 Tests with if and switch
5.3.2 Looping with for, while and repeat
5.4 Debugging your R functions
5.4.1 The traceback function
5.4.2 The warning and stop functions
5.4.3 Stepping through a function
5.4.4 The browser function

6 Efficient calculations
6.1 Vectorized computations
6.2 The apply and outer functions
6.2.1 the apply function
6.2.2 the lapply and sapply functions
6.2.3 The tapply function
6.2.4 The by function
6.2.5 The outer function
122. graphs Trellis graphs described in section 7 4 can also be used to visualize multi dimensional data The code below demonstrate some of the above functions define some data x lt y lt seq 4 pi 4 pi len 27 r lt sqrt outer x 2 y72 z lt cos r72 exp r 6 set a 2 by 2 layout par mfrow c 2 2 image z axes FALSE main Math can be beautiful xlab expression cos r 2 e r 6 dotchart t VADeaths xlim c 0 100 cex 0 6 main Death Rates in Virginia plots thermometers where a proportion of the thermometer is filled based on Ozone value symbols airquality Temp airquality Wind thermometers cbind OLOT 0 3 airquality 0zone max airquality 0zone na rm TRUE 109 CHAPTER 7 GRAPHICS 7 2 MORE PLOT FUNCTIONS inches 0 15 myf lt function x y lt sin x cos y x lt y lt seq 0 2 pi len 25 z lt outer x y myf persp x y z theta 45 phi 45 shade 0 2 Math can be beautiful cos r e 10 15 20 airquality Wind 5 0 airquality Temp Death Rates in Virginia 50 rpan male rban Male uraj Female ural Male 55 rpan male rban Male ural ale ural Male 60 tpan male rban Male ural ale ural Male 65 rban Female rban Male ural Female ural Male 70 7 tpan male rban Male ura ale ural Male SS Wo WN NS SONS OS Oss NS X Figure 7 6
123. h with a so-called formula object and additional arguments. Formula objects play a very important role in statistical modeling in R; they are used to specify the model to be fitted. The exact meaning of a formula object depends on the modeling function. We will look at some examples in the following sections. The general form is given by:

    response ~ expression

Sometimes the term response can be omitted; expression is a collection of variables combined by operators. Some examples of formula objects:

    myform1 <- y ~ x1 + x2
    myform2 <- log(y) ~ sqrt(x1) + x2:x3
    myform1
    y ~ x1 + x2
    myform2
    log(y) ~ sqrt(x1) + x2:x3
    data.class(myform2)
    [1] "formula"

A description of formulating models using formulas is given in the various chapters of [9]. The next sections will give some examples of different statistical models in R.

8.3 Linear regression models

8.3.1 Formula objects

R can fit linear regression models of the form

$y = \beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p + \epsilon$

where $\beta = (\beta_0, \ldots, \beta_p)$ are the intercept and $p$ regression coefficients, and $x_1, \ldots, x_p$ the $p$ regression variables. The error term $\epsilon$ has mean zero and is often modeled as a normal distribution with some variance. For two regression variables you can use the function lm with the following formula:

    y ~ x1 + x2

By default R includes the intercept of the linear regression model. To omit the intercept use the formula
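How a formula drives lm can be sketched on a tiny made-up data set. The response is constructed without noise as y = 1 + 2 x1 + 3 x2 (an assumption for illustration), so the fitted coefficients can be checked by eye. The "- 1" term used in the second fit is the standard R way to drop the intercept.

```r
# Hypothetical data: y is an exact linear function of x1 and x2.
dat <- data.frame(x1 = c(1, 2, 3, 4, 5, 6),
                  x2 = c(2, 1, 4, 3, 6, 5))
dat$y <- 1 + 2 * dat$x1 + 3 * dat$x2

# A formula is an ordinary R object that can be stored and inspected.
myform <- y ~ x1 + x2
form.class <- class(myform)
form.vars <- all.vars(myform)

# The intercept is included by default.
fit <- lm(myform, data = dat)
beta <- coef(fit)

# Adding "- 1" to the right hand side removes the intercept.
fit0 <- lm(y ~ x1 + x2 - 1, data = dat)
beta0.names <- names(coef(fit0))
```

Here beta recovers the intercept 1 and the slopes 2 and 3, while the second fit has coefficients for x1 and x2 only.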
124. hange the first three elements x i 3 lt 4 The last two constructions are examples of a so called replacement in which the left hand side of the assignment operator is more than a simple identifier Note also that the recycling rule applies so the following code works with a warning from R nee e A logical vector The result is a vector with only those elements of x of which the logical vector has an element TRUE x lt c 10 4 6 7 8 y lt x gt 9 y 1 TRUE FALSE FALSE FALSE FALSE x y 1 10 or directly x x gt 9 1 10 To change the elements of x which are larger than 9 to the value 9 do the following x x gt 9 lt 9 Note that the logical vector does not have to be of the same length as the vector you want to extract elements from 48 CHAPTER 4 DATA MANIPULATION 4 1 VECTOR SUBSCRIPTS A vector of negative natural numbers All elements of x are selected except those that are in the subscript Esa 2 00 x 1 2 gives x 3 x 4 11 36 Note the subscript vector may address non existing elements of the original vector The result will be NA Not Available For example x lt c 1 2 3 4 5 x 7 1 NA x 1 6 1 1234 5 NA Some useful functions There are several useful R functions for working with vectors length x sum x prod x max x min x These functions are used to calculate the length the sum the product the minimum and the maximum of a vector respectively The l
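The subscripting rules described here (logical vectors, negative numbers, non-existing elements) and the summary functions can be sketched on one small vector:

```r
x <- c(10, 4, 6, 7, 8)

# Logical subscript: keep the elements for which the condition is TRUE.
big <- x[x > 9]

# Replacement: set every element larger than 9 to the value 9.
x.capped <- x
x.capped[x.capped > 9] <- 9

# Negative subscripts: select everything except elements 1 and 2.
x.rest <- x[-(1:2)]

# Addressing a non-existing element gives NA (Not Available).
x.missing <- x[7]

# Some summary functions for vectors.
x.length <- length(x)
x.sum <- sum(x)
x.range <- c(min(x), max(x))
```

Note the parentheses in x[-(1:2)]: without them, -1:2 is the sequence -1, 0, 1, 2, which mixes negative and positive subscripts and gives an error.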
125. he function read.table. This function has many arguments: arguments to specify the header, the column separator, the number of lines to skip, the data types of the columns, etc. The functions read.csv and read.delim are functions to read "comma separated values" files and tab delimited files. These functions call read.table with specific arguments.

Suppose we have a text file data.txt that contains the following text:

    Author John Davis
    Date 18-05-2007
    Some comments..
    Col1, Col2, Col3, Col4
    23, 45, A, John
    34, 41, B, Jimmy
    12, 99, B, Patrick

The data, without the first few lines of text, can be imported to an R data frame using the following R syntax:

    myfile <- "C:\\Temp\\R\\Data.txt"
    mydf <- read.table(myfile, skip = 3, sep = ",", header = TRUE)
    mydf
      Col1 Col2 Col3    Col4
    1   23   45    A    John
    2   34   41    B   Jimmy
    3   12   99    B Patrick

By default R converts character data in text files to factor type. In the above example the third and fourth columns are of type factor. To leave character data as character data type in R, use the stringsAsFactors argument:

    mydf <- read.table(myfile, skip = 3, sep = ",", header = TRUE, stringsAsFactors = FALSE)

To specify that certain columns are character and other columns are not, you must use the colClasses argument and provide the type for each column:

    mydf <- read.table(myfile, skip = 3, sep = ",", header = TRUE, stringsAsFactors = FALSE, colCl
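The import can be tried without creating a file by passing the lines through the text argument of read.table. The comma separator and the exact wording of the three header lines are assumptions made for this sketch.

```r
# The file contents as a character vector, one element per line.
# Assumption: columns are separated by commas.
file.lines <- c("Author John Davis",
                "Date 18 05 2007",
                "Some comments",
                "Col1,Col2,Col3,Col4",
                "23,45,A,John",
                "34,41,B,Jimmy",
                "12,99,B,Patrick")

# Skip the three lines of free text, then read a header line and data.
mydf <- read.table(text = file.lines, skip = 3, sep = ",",
                   header = TRUE, stringsAsFactors = FALSE)

# colClasses fixes the type of every column explicitly.
mydf2 <- read.table(text = file.lines, skip = 3, sep = ",",
                    header = TRUE,
                    colClasses = c("numeric", "numeric",
                                   "factor", "character"))
```

With colClasses the third column becomes a factor and the fourth stays character, regardless of the stringsAsFactors setting.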
126. he java program will run

    String filename;
    filename = "C:/Temp/Test.jpg";
    String evalstr = "plotfile <- \"" + filename + "\"";
    re.eval(evalstr);
    re.eval("jpeg(plotfile, width=550, height=370)");
    re.eval("pairs(indata[,colsel])");
    re.eval("dev.off()");

So the REngine object re is used to evaluate the pairs function and store the result in a jpeg file. This jpeg file is picked up by the Java GUI in a JLabel object, so that it is visible. Then, when the user clicks on Fit Model, a dialog appears where the user selects the response and the regression variables. The R engine is called to fit the linear regression model. The output is displayed in the results window.

9.7 Creating fancy output and reports

The R system contains several functions and methods that help the user create fancy output and reports. First a short overview of these methods, then some examples.

• Instead of the normal output to screen, the function sink redirects the output of R to a connection or external file.
• The package xtable contains functions to transform an R object to an xtable object, which can then be printed to HTML or LaTeX.
• The package R2HTML contains functions to create HTML code from R objects.
• The functions jpeg and pdf (see section 7.2.4) export R graphics to external files in jpeg and pdf format. These files can then be included in
127. he length of the sequence or the increment.

    u <- seq(from = -3, to = 3, by = 0.5)
    u
     [1] -3.0 -2.5 -2.0 -1.5 -1.0 -0.5  0.0  0.5  1.0  1.5  2.0  2.5  3.0

The following commands have the same result:

    u <- seq(-3, 3, length = 13)
    u <- (-6:6)/2

The function seq can also be used to generate vectors with POSIXct elements (a sequence of dates). The following examples speak for themselves.

    seq(as.POSIXct("2003-04-23"), by = "month", length = 12)
    [1] "2003-04-23 W. Europe Daylight Time" "2003-05-23 W. Europe Daylight Time"
    [3] "2003-06-23 W. Europe Daylight Time" "2003-07-23 W. Europe Daylight Time"
    ...
    seq(ISOdate(1910, 1, 1), ISOdate(1999, 1, 1), "years")
    [1] "1910-01-01 12:00:00 GMT" "1911-01-01 12:00:00 GMT"
    [3] "1912-01-01 12:00:00 GMT" "1913-01-01 12:00:00 GMT"
    ...

The function rep repeats a given vector. The first argument is the vector and the second argument can be a number that indicates how often the vector needs to be repeated.

    rep(1:4, 4)
     [1] 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4

The second argument can also be a vector of the same length as the vector used for the first argument. In this case each element in the second vector indicates how often the corresponding element in the first vector is repeated.

    rep(1:4, c(2, 2, 2, 2))
    [1] 1 1 2 2 3 3 4 4
    rep(1:4, 1:4)
     [1] 1 2 2 3 3 3 4 4 4 4

For information about other options of the function rep, type help(rep). To generate vectors with random elements you can use the function
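The equivalence of the different seq calls, and the two ways of using rep, can be checked directly. This is a small sketch; the names u1, u2, u3, r1, r2 and r3 are chosen here for illustration only.

```r
# Sketch: three ways to build the same sequence, and three uses of rep.
u1 <- seq(from = -3, to = 3, by = 0.5)   # fixed increment
u2 <- seq(-3, 3, length.out = 13)        # fixed number of elements
u3 <- (-6:6) / 2                         # integer sequence, rescaled

r1 <- rep(1:4, times = 4)                # repeat the whole vector four times
r2 <- rep(1:4, times = 1:4)              # repeat element i exactly i times
r3 <- rep(1:4, each = 2)                 # same result as rep(1:4, c(2, 2, 2, 2))

print(u1)
print(r2)
```

The comparison of u1, u2 and u3 is best done with all.equal rather than ==, since the values are doubles.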
128. he most convenient data structure for data analysis in R. In fact, most statistical modeling routines in R require a data frame as input. One of the built-in data frames in R is mtcars.

    mtcars    # only a small part of mtcars
                       mpg cyl  disp  hp drat    wt  qsec vs am gear carb
    Mazda RX4         21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
    Mazda RX4 Wag     21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
    Datsun 710        22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
    Hornet 4 Drive    21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
    Hornet Sportabout 18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2

The data frame contains information on different cars. Usually each row corresponds with a case and each column represents a variable. In this example the carb column is of data type double and represents the number of carburetors. See the help file for more information on this data frame:

    ?mtcars

Data frame attributes. A data frame can have the attributes names and row.names. The attribute names contains the column names of the data frame and the attribute row.names contains the row names of the data frame. The attributes of a data frame can be retrieved separately from the data frame with the functions names and row.names. The result is a character vector containing the names.

    rownames(mtcars)[1:5]    # only the first five row names
    [1] "Mazda RX4"         "Mazda RX4 Wag"     "Datsun 710"
    [4] "Hornet 4 Drive"    "Hornet Sportabout"
    names
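Since mtcars ships with R, the attributes mentioned here can be inspected without any setup. A short sketch; the name col_types is introduced only for this illustration.

```r
# Sketch: inspect the names, row names and column types of mtcars.
col_names <- names(mtcars)               # column names
first_rows <- rownames(mtcars)[1:3]      # first three row names
carb_type <- typeof(mtcars$carb)         # storage type of the carb column
col_types <- sapply(mtcars, typeof)      # storage type of every column

print(col_names)
print(first_rows)
print(carb_type)
print(dim(mtcars))
```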
129. hs = c(4, 1))

    # create the scatterplot with different colors
    plot(cars$Price, cars$Mileage, pch = 16, cex = 1.5, col = order(cars$Weight))

    # do some calculations for the color legend:
    # determine minimum and maximum weight values
    zlim <- range(cars$Weight, finite = TRUE)

    # let's use 20 color values in the color legend
    levels <- pretty(zlim, 20)

    # start the second plot, that is the color legend
    plot.new()
    plot.window(xlim = c(0, 1), ylim = range(levels), xaxs = "i", yaxs = "i")

    # use the function rect to draw multiple colored rectangles
    rect(0, levels[-length(levels)], 1, levels[-1], col = terrain.colors(length(levels) - 1))

    # draw an axis on the right-hand side of the legend
    axis(4)

[Figure 7.9: Examples of different symbols and colors in plots]

To set the color palette back to default, use palette("default").

7.3.2 Some handy low level functions

Once you have created a plot you may want to add something to it. This ca
131. ich can be used to extract information or to create diagnostic plots from the cars.lm object.

    generic function    meaning
    summary(object)     returns a summary of the fitted model
    coef(object)        extracts the estimated model parameters
    resid(object)       extracts the model residuals of the fitted model
    fitted(object)      returns the fitted values of the model
    deviance(object)    returns the residual sum of squares
    anova(object)       returns an anova table
    predict(object)     returns predictions
    plot(object)        creates diagnostic plots

Table 8.3: List of functions that accept an lm object

These functions are generic. They will also work on objects returned by other statistical modeling functions. The summary function is useful to get some extra information on the fitted model, such as t-values, standard errors and correlations between parameters.

    summary(cars.lm)

    Call:
    lm(formula = Weight ~ Mileage, data = cars)

    Residuals:
         Min       1Q   Median       3Q      Max
    -569.274 -159.073    8.793  191.494  570.241

    Coefficients:
                Estimate Std. Error t value Pr(>|t|)
    (Intercept) 5057.830    180.402   28.04   <2e-16 ***
    Mileage      -87.742      7.205  -12.18   <2e-16 ***
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    Residual standard error: 265.2 on 58 degrees of freedom
    Multiple R-Squared: 0.7189, Adjusted R-squared: 0.714
    F-statistic: 148.3 on 1 and 58 DF, p-value: < 2.2e-16

Model diagnostics. The cars.lm object can be used for further a
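The generic functions in Table 8.3 can be tried on any fitted lm object. The sketch below uses the built-in mtcars data rather than the cars data frame of the text, so the model wt ~ mpg and the object name fit are assumptions made for the sake of a runnable example.

```r
# Sketch: apply the generic functions of Table 8.3 to a small lm fit.
# mtcars stands in for the cars data frame used in the text.
fit <- lm(wt ~ mpg, data = mtcars)

est <- coef(fit)                  # estimated intercept and slope
res <- resid(fit)                 # model residuals
fv <- fitted(fit)                 # fitted values
rss <- deviance(fit)              # residual sum of squares
new_wt <- predict(fit, newdata = data.frame(mpg = c(20, 30)))

print(summary(fit))
print(anova(fit))
```

For a linear model, deviance(fit) equals sum(resid(fit)^2), which is a quick way to confirm what the generic returns.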
132. iled to html and Windows help files. Each function should have a help file; it is the help that will be displayed when a user calls the help function:

    help(LissajousPlot2)

Build the package. Now that the necessary steps are completed, the package can be built. Open a DOS box, go to the directory that contains the Lissajous directory and run the command:

    > Rcmd build --binary Lissajous

When the build is successful you should see the zip file Lissajous_1.0.zip.

Install and use the package. In the RGui window, go to the menu Packages and select Install package(s) from local zip files. Then select the Lissajous_1.0.zip file; R will install the package. To use the package it should be attached to your current R session.

    library(Lissajous)
    help(Lissajous)
    par(mfrow = c(2, 2))
    LissajousPlot(300, 2, 5)
    LissajousPlot(300, 14, 4)
    LissajousPlot2(300, 10, 2, 7, 5)
    LissajousPlot2(300, 10, 100, 25, 6)

9.6 Calling R from Java

The package rJava can be used to call Java code within R, which is comparable with the technology described in section 6.3. The other way around is also possible: calling R from Java code. This is implemented in the JRI package, which will also be installed if you install the rJava package. This could become useful when you want to extend your Java programs with the numerical power of R, or build Java GUIs around R. This section demonstrates th
133. in an R editor, select the code and use <Ctrl>-R to run the selected code. You can see that the code is parsed in the console window; any results will be displayed there.

1.9.2 Other editors

The built-in R editor is not the fanciest editor you can think of; it does not have much functionality. Since writing R code is just creating text files, you can do that with any text editor you like. If you have R code in a text file, you can use the source function to run the code in R. The function reads and executes all the statements in a text file.

    # In the console window
    source("C:/Temp/MyRfile.R")

There are free text editors that can send the R code inside the text editor to an R session. Some free editors that are worth mentioning are Eclipse (www.eclipse.org), Tinn-R (http://www.sciviews.org/Tinn-R) and JGR (speak 'Jaguar', http://jgr.markushelbig.org).

Eclipse. Eclipse is more than a text editor; it is an environment to create, test, manage and maintain large pieces of code. Built-in functionality includes:

• Managing different text files in a project.
• Version control: recall previously saved versions of your text file.
• Search in multiple files.

The Eclipse environment allows users to develop so-called perspectives (or plug-ins). Such a plug-in customizes the Eclipse environment for a certain programming language. Stephan Wahlbrink has written an Eclipse p
134. ion

    plot(x, y, xlim = c(-3, 3))

will set the minimum and maximum values of the x-axis. It is also possible to use the function par to set graphical parameters. Some graphical parameters can only be set with this function. A call to the function par has the following form:

    par(gp1 = value1, gp2 = value2)

In the above code the graphical parameter gp1 is set to value1, graphical parameter gp2 is set to value2, and so on. Note that some graphical parameters are read-only and cannot be changed. Run the function par with no arguments to get a complete listing of the graphical parameters and their current values.

    par()
    $xlog
    [1] FALSE
    $ylog
    [1] FALSE
    $adj
    [1] 0.5
    $ann
    [1] TRUE
    $ask
    [1] FALSE
    $bg
    [1] "transparent"
    ...etc.

We will discuss some useful graphical parameters. See the help file of par for a more detailed description and a list of all the graphical parameters. Once you set a graphical parameter with the par function, that graphical parameter will keep its value until you:

• Set the graphical parameter to another value with the par function.
• Close the graph. R will use the default settings when you create a new plot.

When you specify a graphical parameter as an extra argument to a graphical function, the current value of the graphical parameter will not be changed. Some example code:

    # define some dat
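The difference between setting a parameter through par and passing it to a plotting function can be observed by querying par before and after. This is a minimal sketch; it draws on a null pdf device (pdf(NULL)) so that it runs without opening a window, which is a choice made here and not part of the original text.

```r
# Sketch: par settings persist, per-call graphical arguments do not.
pdf(NULL)                        # null device, nothing is written to disk

col_before <- par("col")
plot(1:10, col = "red")          # col applies to this call only
col_after <- par("col")          # unchanged

old <- par(mar = c(2, 2, 1, 1))  # par() returns the previous values
mar_set <- par("mar")            # the new margins are now in effect
par(old)                         # restore the saved values
mar_restored <- par("mar")

invisible(dev.off())
```

Saving the return value of par and passing it back later is a common way to undo temporary changes.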
135. ion, create meaningful help files and make it easy to install and use the package. To achieve this there are certain steps that you need to undertake. This section only gives you a kick in the right direction. For a complete description, look at the R manual 'Writing R Extensions'. Before you can create an R package you should install the following tools (the tools can be found on http://www.murdoch-sutherland.com/Rtools):

• Perl, a scripting language.
• Rtools.exe, a collection of command line tools and compilers.
• Microsoft HTML Help Workshop, to create help files.
• MiKTeX, a LaTeX and pdftex package.

When the tools are installed, your PATH variable should be edited so that the commands can be found. You need to be careful in specifying the order of the directories. For example, if you also have the MAKE utility of Borland, then make sure that your system finds the R MAKE first when building R packages. Depending on the installation directories, your path may look like:

    PATH=C:\Rtools\bin;C:\perl\bin;C:\Rtools\MinGW\bin;C:\Program Files\HTML Help Workshop;C:\Program Files\R\R-2.5.0\bin;C:\texmf\miktex\bin;<others>

A good starting point to create an R package is the function package.skeleton. We create a package Lissajous with two functions that plot Lissajous figures to demonstrate the necessary steps.

Define the R functions. First create a script file that defines the functions that you want to package. I
136. ion of Tinn more features were added, and it has become a very nice environment to edit and maintain code. Tinn-R is the special R version of Tinn. It allows color highlighting of the R language and sending R statements to an R Console window.

[Figure 1.3: The Tinn-R and the R Console environment]

JGR. JGR (Java GUI for R) is a universal and unified graphical user interface for R. It includes, among others, an integrated editor, help system, type-on spreadsheet and an object browser.

2 Data Objects

In this section we wi
137. is a so-called difftime object.

    t2 <- as.POSIXct("2004-01-23 14:33")
    t1 <- as.POSIXct("2003-04-23")
    d <- t2 - t1
    d
    Time difference of 275.6479 days

A difftime object can also be created using the function as.difftime, and you can add a difftime object to a POSIXct object. Due to a bug in R this can only safely be done with the function +.POSIXt.

    "+.POSIXt"(zt, d)
    [1] "1992-11-29 14:36:20 W. Europe Standard Time"
    [2] "1992-11-29 14:02:56 W. Europe Standard Time"
    [3] "1992-10-15 17:36:30 W. Europe Daylight Time"
    [4] "1992-11-30 09:54:03 W. Europe Standard Time"

To extract the weekday, month or quarter from a POSIXct object, use the handy R functions weekdays, months and quarters. Another handy function is Sys.time, which returns the current date and time.

    weekdays(zt)
    [1] "Thursday" "Thursday" "Tuesday"  "Friday"

There are some R packages that can handle dates and time objects, for example the packages zoo, chron, tseries, its and Rmetrics. Rmetrics in particular has a set of powerful functions to maintain and manipulate dates and times. See [2].

2.1.8 Missing data and Infinite values

We have already seen the symbol NA. In R it is used to represent missing data (Not Available). It is not really a separate data type; it could be a missing double or a missing integer. To check if data is missing, use the function is.na or use a direct comparison with the
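The date arithmetic above can be reproduced in a way that does not depend on the local time zone. The sketch below fixes the time zone to UTC, which is an assumption made here; the text's own result (275.6479 days) was computed in a Western European time zone where a daylight saving shift adds one hour to the interval.

```r
# Sketch: difftime arithmetic with an explicit time zone (UTC is an assumption).
t1 <- as.POSIXct("2003-04-23", tz = "UTC")
t2 <- as.POSIXct("2004-01-23 14:33", tz = "UTC")

d <- t2 - t1                                  # a difftime object
days <- as.numeric(d, units = "days")         # 275 days plus 14h33m

one_week <- as.difftime(7, units = "days")    # build a difftime directly
t3 <- t1 + one_week                           # add it to a POSIXct object

wday <- as.POSIXlt(t1)$wday                   # 0 = Sunday, ..., 6 = Saturday

print(d)
print(t3)
print(wday)
```

The weekday is taken from the POSIXlt representation instead of weekdays(), because the names returned by weekdays() depend on the locale.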
138. ity of a model are available. These measures are based on the number of concordant pairs $n_c$, the number of discordant pairs $n_d$, the total number of pairs $t$ and the number of observations $N$.

1. The measure called c, also an estimate of the area under the ROC curve: $c = (n_c + 0.5\,(t - n_c - n_d))/t$
2. Somers' D: $D = (n_c - n_d)/t$
3. Kendall's tau-a, defined as $\tau_a = (n_c - n_d)/(0.5\,N(N-1))$
4. Goodman-Kruskal Gamma, defined as $\gamma = (n_c - n_d)/(n_c + n_d)$

Ideally we would like $n_c$ to be very high and $n_d$ very low, so the larger these measures, the better the predictive ability of the model. The function lrm in the Design package (since superseded by the rms package) can calculate the above measures.

    library(Design)
    lrm(y ~ X1 + X2 + X3, data = testdata)

    Logistic Regression Model

    lrm(formula = y ~ X1 + X2 + X3, data = testdata)

    Frequencies of Responses
     bad good
     198  802

      Obs Max Deriv Model L.R. d.f. P     C   Dxy Gamma Tau-a    R2 Brier
     1000     2e-08      233.9    3 0 0.824 0.649  0.65 0.206 0.331  0.12

                 Coef   S.E.   Wald Z      P
    Intercept -1.9018 0.3089    -6.16 0.0000
    X1         0.8474 0.3147     2.69 0.0071
    X2         3.1342 0.3403     9.21 0.0000
    X3         3.9718 0.3784    10.50 0.0000

8.5 Tree models

Tree-based models are not only used for predictive modeling. They can also be used for screening variables, assessing the adequacy of linear models and summarizing large multivariate data sets. When the response variable is a factor variable with two or more levels, then the
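To make the four definitions concrete, they can be computed by brute force for a small example. The sketch below is an illustration only: the function name concordance_measures is hypothetical, and it assumes that t counts all pairs formed by one observation with outcome 1 and one with outcome 0.

```r
# Sketch: concordance measures from predicted probabilities (hypothetical helper).
# y: 0/1 outcomes, p: predicted probabilities of outcome 1.
concordance_measures <- function(y, p) {
  pos <- p[y == 1]
  neg <- p[y == 0]
  d <- outer(pos, neg, "-")      # all pairs with different outcomes
  nc <- sum(d > 0)               # concordant: the outcome-1 case scores higher
  nd <- sum(d < 0)               # discordant: the outcome-1 case scores lower
  t <- length(d)                 # total number of such pairs (ties included)
  N <- length(y)
  c(c     = (nc + 0.5 * (t - nc - nd)) / t,
    D     = (nc - nd) / t,
    tau_a = (nc - nd) / (0.5 * N * (N - 1)),
    gamma = (nc - nd) / (nc + nd))
}

m <- concordance_measures(y = c(0, 0, 1, 1), p = c(0.1, 0.4, 0.35, 0.8))
print(m)
```

In this toy example three of the four pairs are concordant and one is discordant. With these definitions D = 2(c - 0.5) always holds, which agrees with the C and Dxy columns of the lrm output (0.824 and 0.649).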
139. ject will be overwritten with the new value without a warning. Use the function rm to remove one or more objects from your session.

    > rm(x, x2)

To conclude your first session, we create two small vectors with data and a scatterplot.

    z2 <- c(1, 2, 3, 4, 5, 6)
    z3 <- c(6, 8, 3, 5, 7, 1)
    plot(z2, z3)
    title("My first scatterplot")

After this very short R session, which barely scratched the surface, we hope you continue using the R system. The following chapters of this document will explain in detail the different data types, data structures, functions, plots and data analysis in R.

1.5 The available help

1.5.1 The on-line help

There is extensive on-line help in the R system. The best starting point is to run the function help.start(). This will launch a local page inside your browser with links to the R manuals, R FAQ, a search engine and other links. In the R Console the function help can be used to see the help file of a specific function.

    help(mean)

Use the function help.search to list help files that contain a certain string.

    > help.search("robust")
    Help files with alias or concept or title matching 'robust' using fuzzy matching:

    hubers(MASS)         Huber Proposal 2 Robust Estimator of Location and/or Scale
    rlm(MASS)            Robust Fitting of Linear Models
    summary.rlm(MASS)    Summary Method for Robust Linear Models
    line(stats)          Robust Line Fitting
    runmed(stats)        Run
140. ll discuss the different aspects of data types and structures in R. Functions and operators such as c will be used in this section as an illustration and will be discussed in the next section. If you are confronted with an unknown function, you can ask for help by typing in the command:

    help(function.name)

A help text will appear and describe the purpose of the function and how to use it.

2.1 Data types

2.1.1 Double

If you do calculations on numbers, you can use the data type double to represent the numbers. Doubles are numbers like 3.1415, 8.0 and 8.1. Doubles are used to represent continuous variables like the weight or length of a person.

    x <- 8.14
    y <- 8.0
    z <- 81.0 + 12.9

Use the function is.double to check if an object is of type double. Alternatively, use the function typeof to ask R the type of the object x.

    typeof(x)
    [1] "double"
    is.double(8.9)
    [1] TRUE
    test <- 1223.456
    is.double(test)
    [1] TRUE

Keep in mind that doubles are just approximations to real numbers. Mathematically there are infinitely many numbers; the computer can of course only represent a finite number of them. Not only can numbers like π or √2 not be represented exactly, less exotic numbers like 0.1, for example, can also not be represented exactly. One of the consequences of this is that when you compare two doubles with each other you should take some care. Consider the following surprising result
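A well-known instance of this effect is the sum 0.1 + 0.2, which is not exactly equal to 0.3 in double arithmetic. The sketch below is an illustration chosen here and may differ from the example that follows in the original text; it also shows the tolerance-based comparison all.equal.

```r
# Sketch: exact versus tolerance-based comparison of doubles.
a <- 0.1 + 0.2
b <- 0.3

exact <- (a == b)                     # FALSE: the two doubles differ in the last bits
close <- isTRUE(all.equal(a, b))      # TRUE: equal up to a small tolerance

print(exact)
print(close)
print(format(a, digits = 17))         # shows the stored approximation
```

Wrapping all.equal in isTRUE matters, because all.equal returns a character description instead of FALSE when the values differ.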
141. logistic regression, where they try to separate 'good' customers from 'bad' customers. The logistic regression model calculates the probability of an outcome, $P(Y = \text{good})$ and $P(Y = \text{bad}) = 1 - P(Y = \text{good})$, given some regression variables $X_1, \ldots, X_p$, as follows:

$$P(Y = \text{good}) = \frac{\exp(\alpha_0 + \alpha_1 X_1 + \cdots + \alpha_p X_p)}{1 + \exp(\alpha_0 + \alpha_1 X_1 + \cdots + \alpha_p X_p)} \qquad (8.2)$$

The regression coefficients $\alpha_0, \ldots, \alpha_p$ are estimated with a data set.

8.4.1 The modeling function glm

The function glm can be used to fit a logistic regression model. Let's generate a data set and demonstrate the different aspects of building a logistic regression model.

    nrecords = 1000
    ncols = 3
    x <- matrix(runif(nrecords * ncols), ncol = ncols)
    y <- -2 + x[,1] + 3 * x[,2] + 4 * x[,3]
    y = exp(y) / (1 + exp(y))
    ppp <- runif(nrecords)
    y <- ifelse(y > ppp, "good", "bad")
    testdata <- data.frame(y, x)

To get an idea which variables in your data set have an influence on your binary response variable, plot the observed fraction (yes/no) against the potential regression variables.

• Divide the variable X into, say, ten buckets (equal intervals).
• For each bucket, calculate the observed fraction good/bad.
• Plot bucket number against observed fraction.

Some R code that plots observed fractions for a specific regression variable:

    obs.prob <- function(x)
    ## observed fraction of go
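The simulation and the model fit can be combined into one short, reproducible run. In the sketch below the seed, the linear predictor coefficients (-2, 1, 3, 4) and the object names eta, p and fit are assumptions made for illustration.

```r
# Sketch: simulate good/bad outcomes and fit a logistic regression with glm.
# Seed and coefficients (-2, 1, 3, 4) are assumptions for this illustration.
set.seed(1)
nrecords <- 1000
x <- matrix(runif(nrecords * 3), ncol = 3)

eta <- -2 + 1 * x[, 1] + 3 * x[, 2] + 4 * x[, 3]   # linear predictor
p <- exp(eta) / (1 + exp(eta))                     # formula (8.2)
y <- factor(ifelse(runif(nrecords) < p, "good", "bad"),
            levels = c("bad", "good"))

testdata <- data.frame(y, x)                       # columns y, X1, X2, X3
fit <- glm(y ~ X1 + X2 + X3, data = testdata, family = binomial)

print(coef(fit))
print(table(testdata$y))
```

With the factor levels ordered as bad, good, glm models the probability of good, so the estimated coefficients should land near the values used in the simulation.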
142. lot <- function(y, ylab = deparse(substitute(y))) {
        ylab
        y <- log(y)
        plot(y, ylab = ylab)
    }

5.3 Control flow

The following shows a list of constructions to perform testing and looping. These constructions can also be used outside a function to control the flow of execution.

5.3.1 Tests with if and switch

The general form of the if construction is:

    if (test) {
        ...true statements...
    } else {
        ...false statements...
    }

where test is a logical expression like x < 0 or x < 0 & x > -8. R evaluates the logical expression; if it results in TRUE, then it executes the true statements. If the logical expression results in FALSE, then it executes the false statements. Note that it is not necessary to have the else block.

Simple example. Adding two vectors of different length in R will cause R to recycle the shorter vector. The following function adds the two vectors by chopping off the longer vector so that it has the same length as the shorter.

    myplus <- function(x, y) {
        n1 <- length(x)
        n2 <- length(y)
        if (n1 > n2) {
            z <- x[1:n2] + y
        } else {
            z <- x + y[1:n1]
        }
        z
    }
    myplus(1:10, 1:3)
    [1] 2 4 6

The switch function has the following general form:

    switch(object,
        "value1" = {expr1},
        "value2" = {expr2},
        "value3" = {expr3},
        {other expressions}
    )

If object has value value1 then expr1 is executed, if it has value2 then expr2 is execu
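Both constructions can be exercised in a few lines. In the sketch below, myplus follows the function in the text, while the function centre and its arguments are hypothetical names introduced only to illustrate switch.

```r
# Sketch: if/else inside a function, and switch on a character value.
myplus <- function(x, y) {
  n1 <- length(x)
  n2 <- length(y)
  if (n1 > n2) {
    z <- x[1:n2] + y     # chop x to the length of y
  } else {
    z <- x + y[1:n1]     # chop y to the length of x
  }
  z
}

# Hypothetical helper: pick a location measure by name.
centre <- function(x, type) {
  switch(type,
         "mean"   = mean(x),
         "median" = median(x),
         stop("unknown type"))   # default branch when nothing matches
}

s <- myplus(1:10, 1:3)
m1 <- centre(c(1, 2, 10), "median")
m2 <- centre(c(1, 2, 10), "mean")
print(s)
print(c(m1, m2))
```

The unnamed last argument of switch acts as the default; here it raises an error instead of silently returning NULL.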
143. ls one, which means that 'Volvo' is part of the character string in element 37 of car.names. Again a quick check:

    car.names[37]
    [1] "Volvo 240 4"

In the above result you could immediately see that element 37 of car.names is a match. If character vectors become too long to see the match quickly, use the following trick:

    index <- 1:length(car.names)
    index[Volvo.match > 0]
    [1] 37

The result of the function regexpr contains the attribute match.length, which gives the length of the matched text. In the above example 'Volvo' consists of 5 characters. This attribute can be used together with the function substring to extract the found pattern from the character object. Consider the following example, which uses a regular expression, the match.length attribute and the function substring to extract the numeric part and character part of a character vector.

    x <- c("10 Sept", "Oct 9th", "Jan 2", "4th of July")
    w <- regexpr("[0-9]+", x)

The regular expression "[0-9]+" matches an integer.

    w
    [1] 1 5 5 1
    attr(,"match.length")
    [1] 2 1 1 1

The 1 means there is a match on position 1 of "10 Sept".
The 5 means there is a match on position 5 of "Oct 9th".
The 5 means there is a match on position 5 of "Jan 2".
The 1 means there is a match on position 1 of "4th of July".

In the attribute match.length the 2 indicates the length of the match
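The extraction step that this passage leads up to can be sketched as follows: the start positions from regexpr and the match.length attribute together give the end positions needed by substring. The names len and nums are introduced here for illustration.

```r
# Sketch: extract the numeric part of each string with regexpr and substring.
x <- c("10 Sept", "Oct 9th", "Jan 2", "4th of July")

w <- regexpr("[0-9]+", x)                # start position of the first integer
len <- attr(w, "match.length")           # length of each match
nums <- substring(x, w, w + len - 1)     # cut the matched text out of x

print(as.integer(w))
print(len)
print(nums)
```

The result nums is still a character vector; as.numeric(nums) converts it to numbers if needed.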
144. <- c("A", "A", "B", "B")
    x <- c(1, 2, 3, 4)
    y <- c(4, 3, 2, 1)
    myf <- data.frame(gr, x, y)
    aggregate(myf, list(myf$gr), mean)
      Group.1 gr   x   y
    1       A NA 1.5 3.5
    2       B NA 3.5 1.5

R will apply the function on each column of the data frame. This means also on the grouping column gr. This column is of type factor (or character in recent versions of R); numerical calculations cannot be performed on it, hence the NAs. You can leave out the grouping columns when calling the aggregate function.

    aggregate(myf[, c("x", "y")], list(myf$gr), mean)
      Group.1   x   y
    1       A 1.5 3.5
    2       B 3.5 1.5

4.3.6 Stacking columns of data frames

The function stack can be used to stack columns of a data frame into one column and one grouping column. Consider the following example:

    group1 <- rnorm(3)
    group2 <- rnorm(3)
    group3 <- rnorm(3)
    df <- data.frame(group1, group2, group3)
    stack(df)

The result is a data frame with nine rows and two columns: values, which holds the stacked numbers, and ind, which holds the name of the column (group1, group2 or group3) each value came from. So by default all the columns of a data frame are stacked. Use the select argument to stack only certain columns.

    stack(df, select = c(group1, group3))

4.3.7 Reshaping data

The function reshape can be used to transform a data frame in wide format into a data frame in long format. In a wide format data frame the differe
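Both functions can be checked on the small data frames from this excerpt. In the sketch below the seed and the result names agg, long_all and long_sel are assumptions added to make the run reproducible.

```r
# Sketch: aggregate by group, then stack columns into long form.
myf <- data.frame(gr = c("A", "A", "B", "B"),
                  x = c(1, 2, 3, 4),
                  y = c(4, 3, 2, 1))

# Leave the grouping column out of the data that is averaged.
agg <- aggregate(myf[, c("x", "y")], by = list(gr = myf$gr), FUN = mean)

set.seed(42)                                   # arbitrary seed
df <- data.frame(group1 = rnorm(3), group2 = rnorm(3), group3 = rnorm(3))
long_all <- stack(df)                          # all three columns
long_sel <- stack(df, select = c(group1, group3))

print(agg)
print(long_all)
```

Naming the element of the by list (gr = ...) gives the grouping column a readable name instead of the default Group.1.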
145. lug-in for R called StatET. See www.walware.de/goto/statet and see [1]. This plug-in adds extra R-specific functionality:

• Start an R console or terminal within Eclipse.
• Color coding of key words.
• Run R code in Eclipse by sending it to the R console.
• Insert predefined blocks of R code (templates).
• Supports writing R documentation files (*.Rd files).

[Figure 1.2: R integrated in the Eclipse development environment]

Tinn-R. Tinn stands for 'Tinn is not Notepad'. It is a text editor that was originally developed to replace the boring Notepad. With each new vers
146. me difference of 0.00999999 secs

As we can see, the C code is much faster than the R code; the following graph also shows that.

[Figure 6.2: Calculation times of arsimR (solid line) and arsimC (dashed line) for increasing vectors; x-axis: vector length, y-axis: elapsed time in seconds]

6.4.2 Using #include <R.h>

There are many useful functions in R that can be called from your C code. You need to insert #include <R.h> in your code and tell your compiler where to find this file. Normally this file is located in the directory C:\Program Files\R\R-2.5.0\include. In addition to that, you need to link your code with functionality that is in R.dll. The way to do this depends on your compiler. In Microsoft Visual C++, proceed as follows:

• Open a DOS box and go to the bin directory of the R installation.
• cd C:\Program Files\R\R-2.5.0\bin (modify as required)
• pexports R.dll > R.exp
• lib /def:R.exp /out:Rdll.lib

Here lib is the library command that comes with Visual C++. You can download the free Visual C++ Express Edition from the Microsoft site. The pexports tool is part of the MinGW-utils package. Now the file Rdll.lib is created, and when you create your dll the compiler needs to link this lib file as well. See [5] and the R manual 'Writing R Extensions' for more information. The type of R fun
147. mple importing or manipulating data, but need not be visible in the final report.

• results=hide will hide any output. However, it will generate the R statements in the final document when the echo option is not set to FALSE.
• fig=TRUE creates a figure in the report when the R code contains plotting commands.

Save the file when you are ready and use Sweave in R to process this file.

    library(R2HTML)
    mydir = "C:/Documents and Settings/Longhow/Mijn Documenten/R/RCourse/"
    myfile <- paste(mydir, "data_report", sep = "")
    Sweave(myfile, driver = RweaveHTML)

    Writing to file data_report.html
    Processing code chunks ...
     1 : term Robj
     2 : term Robj png

    file data_report.html is completed

The result is an ordinary HTML file that can be opened by a web browser.

Bibliography

[1] Longhow Lam. A guide to Eclipse and the R plug-in StatET. www.splusbook.com, 2007.
[2] Diethelm Würtz. S4 'timeDate' and 'timeSeries' classes for R. Journal of Statistical Software.
[3] Robert Gentleman and Ross Ihaka. Lexical scope and statistical computing. Journal of Computational and Graphical Statistics, vol. 9, p. 491, 2000.
[4] W. N. Venables and B. D. Ripley. S Programming. Springer, 2000.
[5] D. Samperi. Rcpp: R/C++ interface classes, using C++ libraries from R. 2006.
[6] P. Murrell. R Graphics. Chapman & Hall, 2005.
[7] Hadley Wickham. ggplot2: Elegant Graphics for Dat
148. mymean <- function(x, y) {
        tapply(x, y, mean)
    }
    lapply(cars, mymean, cars$Country)

    $Price
       France   Germany     Japan Japan/USA     Korea    Mexico    Sweden       USA
    15930.000 14447.500 13938.053 10067.571  7857.333  8672.000 18450.000 12543.269

    $Country
       France   Germany     Japan Japan/USA     Korea    Mexico    Sweden       USA
           NA        NA        NA        NA        NA        NA        NA        NA

    $Reliability
       France   Germany     Japan Japan/USA     Korea    Mexico    Sweden       USA
           NA        NA        NA  4.857143        NA  4.000000  3.000000        NA

6.2.4 The by function

The by function applies a function on parts of a data frame. Let's look at the cars data again. Suppose we want to fit the linear regression model Price ~ Weight for each type of car. First we write a small function that fits the model Price ~ Weight for a data frame.

    myregr <- function(data) {
        lm(Price ~ Weight, data = data)
    }

This function is then passed to the by function.

    outreg <- by(cars, cars$Type, FUN = myregr)
    outreg

    cars$Type: Compact
    Call:
    lm(formula = Price ~ Weight, data = data)
    Coefficients:
    (Intercept)       Weight
       2254.765        3.757
    ------------------------------------------------
    cars$Type: Large
    Call:
    lm(formula = Price ~ Weight, data = data)
    Coefficients:
    (Intercept)       Weight
     17881.2839      -0.5183

The output object outreg of the by function contains all the separate regressions; it is a so-called 'by' object. Individual regression objects can be accessed by treating the 'by' object as a list.

    outreg[[1]]
    Call:
    lm(formula = Price ~ Weight, data = data da
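The same pattern works on any data frame with a grouping column. The sketch below uses the built-in mtcars data and the model mpg ~ wt per number of cylinders; the data set, the model and the name slopes are substitutions made here, since the cars data frame of the text is not part of base R.

```r
# Sketch: fit one regression per group with by(), then treat the result as a list.
# mtcars and the model mpg ~ wt stand in for the cars example of the text.
myregr <- function(data) {
  lm(mpg ~ wt, data = data)
}

outreg <- by(mtcars, mtcars$cyl, FUN = myregr)   # one lm object per cylinder count

first <- outreg[[1]]                             # the fit for the first group
slopes <- sapply(outreg, function(m) coef(m)[["wt"]])

print(first)
print(slopes)
```

Looping over the by object with sapply is a compact way to collect one number per group, such as the slope.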
149. n R will overwrite an existing directory. Note that previously edited DESCRIPTION and Rd help files are overwritten! When the function has finished, the directory Lissajous and some subdirectories are created.

Edit and create some files. The DESCRIPTION file is a basic description of the package. R has created a skeleton that the user can edit. We use the following file:

    Package: Lissajous
    Type: Package
    Title: Create Lissajous figures
    Version: 1.0
    Date: 2007-05-09
    Author: Longhow Lam
    Maintainer: Longhow Lam <longhow.lam@businessdecision.com>
    Description: Create Lissajous figures
    License: no restrictions

This information appears, for example, when you display the general help of a package.

    help(package = "Lissajous")

The INDEX file is not created; it is an optional file that lists the interesting objects of the package. We use the file:

    LissajousPlot     Plot a Lissajous figure
    LissajousPlot2    Plot another Lissajous figure

Create help and documentation. The function package.skeleton has also created initial R help files for each function, the *.Rd files in the man subdirectory. R help files need to be written in R documentation format, a markup language that closely resembles LaTeX. The initial files should be edited to provide meaningful help. Fortunately, the initial Rd files created by R provide a good starting point. Open the files and modify them. When the package is built, these documentation files are comp
150. n also use a logical vector. When you provide a logical vector in a data frame subscript, only the cases which correspond with a TRUE are selected. Suppose you want to get all cars from the cars data frame that have a weight of over 3500. First create a logical vector tmp:

    tmp <- cars$Weight > 3500

Use this vector to subset:

    cars[tmp, ]
                                Price Country Reliability Mileage   Type Weight Disp  HP
    Ford Thunderbird V6         14980     USA           1      23 Medium   3610  232 140
    Chevrolet Caprice V8        14525     USA           1      18  Large   3855  305 170
    Ford LTD Crown Victoria V8  17257     USA           3      20  Large   3850  302 150
    Dodge Grand Caravan V6      15395     USA           3      18    Van   3735  202 150
    Ford Aerostar V6            12267     USA           3      18    Van   3665  182 145
    Mazda MPV V6                14944   Japan           5      19    Van   3735  181 150
    Nissan Van 4                14799   Japan          NA      19    Van   3690  146 106

A handy alternative is the function subset. It returns the subset as a data frame. The first argument is the data frame. The second argument is a logical expression. In this expression you use the variable names without preceding them with the name of the data frame, as in the above example.

    subset(cars, Weight > 3500 & Price < 15000)
                          Price Country Reliability Mileage   Type Weight Disp  HP
    Ford Thunderbird V6   14980     USA           1      23 Medium   3610  232 140
    Chevrolet Caprice V8  14525     USA           1      18  Large   3855  305 170
    Ford Aerostar V6      12267     USA           3      18    Van   3665  182 145
    Mazda MPV V6          14944   Japan           5      19    Van   3735  181 150
    Nissan Van 4          14799   Japan          NA      19    Van   3690  146 106
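The two approaches can be compared on the built-in mtcars data, which stands in here for the cars data frame of the text. The thresholds (wt above 5, mpg below 11) and the names heavy and heavy_thirsty are chosen for this illustration.

```r
# Sketch: logical-vector subscript versus subset(), using mtcars as a stand-in.
tmp <- mtcars$wt > 5                       # logical vector, one entry per row
heavy <- mtcars[tmp, ]                     # rows where tmp is TRUE

# subset() lets you use column names directly in the condition.
heavy_thirsty <- subset(mtcars, wt > 5 & mpg < 11)

print(heavy[, c("mpg", "wt")])
print(heavy_thirsty[, c("mpg", "wt")])
```

Note the comma in mtcars[tmp, ]: the logical vector selects rows, and the empty second index keeps all columns.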
151. n be done with low level plot functions.

Adding lines

The functions lines and abline are used to add lines to an existing plot. The function lines connects points given by the input vectors. The function abline draws straight lines with a certain slope and intercept.

    plot(c(-2, 2), c(-2, 2))
    lines(c(0, 2), c(0, 2), col = "red")
    abline(a = 1, b = 2, lty = 2)
    abline(v = 1, lty = 3, col = "blue", lwd = 3)

The functions arrows and segments are used to draw arrows and line segments.

    ## three arrows starting from the same point
    ## but all pointing to a different direction
    arrows(c(0, 0, 0), c(1, 1, 1), c(0, 0.5, 1), c(1.2, 1.5, 1.7), length = 0.1)

Adding points and symbols

The function points is used to add extra points and symbols to an existing graph. The following code adds some extra points to the previous graph.

    points(rnorm(4), rnorm(4), pch = 3, col = "blue")
    points(rnorm(4), rnorm(4), pch = 4, cex = 3, lwd = 2)
    points(rnorm(4), rnorm(4), pch = "K", col = "green")

Adding titles and text

The functions title, legend, mtext and text can be used to add text to an existing plot.

    title(main = "My title", sub = "My subtitle")
    text(0, 0, "some text")
    text(1, 1, "Business & Decision", srt = 45)

The first two arguments of text can be vectors specifying x,y coordinates; the third argument must then also be a vector. This character vector must have the same length and contains the texts that will be printed at the
152. n our case we have the following function definitions:

    LissajousPlot <- function(nsteps, a, b) {
      t <- seq(0, 2 * pi, l = nsteps)
      x <- sin(a * t)
      y <- cos(b * t)
      plot(x, y, type = "l")
    }

    LissajousPlot2 <- function(nsteps, tend, a, b, c) {
      t <- seq(0, tend, l = nsteps)
      y <- c * sin(a * t) * (1 + sin(b * t))
      x <- c * cos(a * t) * (1 + sin(b * t))
      plot(x, y, type = "l")
    }

Test the functions; make sure the functions produce the results you expect.

Run the function package.skeleton

The function package.skeleton creates the necessary files and subdirectories that are needed to build the R package. It allows the user to specify which objects will be placed in the package. Specify a name and location for the package:

    package.loc = "C:\\RPackages"
    package.skeleton("Lissajous", path = package.loc, force = T)
    Creating directories ...
    Creating DESCRIPTION ...
    Creating Read-and-delete-me ...
    Saving functions and data ...
    Making help files ...
    Done.
    Further steps are described in C:\RPackages/Lissajous/Read-and-delete-me

The above call will put all objects in the current workspace in the package. Use the list argument to specify only the objects that you want to put in the package:

    package.skeleton("Lissajous", path = package.loc, list = c("LissajousPlot", "LissajousPlot2"), force = T)

If force = T, the
153. nalysis, for example model diagnostics:

• Are residuals normally distributed?
• Are the relations between response and regression variables linear?
• Are there outliers?

Use the Kolmogorov-Smirnov test to check if the model residuals are normally distributed. Proceed as follows:

    cars.residuals <- resid(cars.lm)
    ks.test(cars.residuals, "pnorm", mean = mean(cars.residuals), sd = sd(cars.residuals))

            One-sample Kolmogorov-Smirnov test
    data:  cars.residuals
    D = 0.0564, p-value = 0.9854
    alternative hypothesis: two-sided

Or draw a histogram or qqplot to get a feeling for the distribution of the residuals:

    par(mfrow = c(1, 2))
    hist(cars.residuals)
    qqnorm(cars.residuals)

A plot of the residuals against the fitted values can show whether the linear relation between the response and the regression variables is sufficient. A Cook's distance plot can detect outlying values in your data set. R can construct both plots from the cars.lm object:

    par(mfrow = c(1, 2))
    plot(cars.lm, which = 1)
    plot(cars.lm, which = 4)

Updating a linear model

Some useful functions to update or change linear models are given by:

add1: This function is used to see what, in terms of sums of squares and residual sums of squares, the result is of adding extra terms (variables) to the model. The cars data set also has a Disp variable representing the engine displacement.
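The diagnostic steps above can be tried on any fitted linear model. The sketch below uses the built-in mtcars data set as a stand-in, because the cars.lm object of the text is not available here; the model mpg ~ wt is an arbitrary choice for illustration. Instead of drawing the Cook's distance plot, it extracts the distances numerically with cooks.distance.

```r
# Stand-in model: fuel consumption explained by weight (built-in mtcars data)
fit <- lm(mpg ~ wt, data = mtcars)

# Residuals of the fitted model
res <- resid(fit)

# Kolmogorov-Smirnov test against a normal distribution with the
# mean and standard deviation estimated from the residuals
ks <- ks.test(res, "pnorm", mean = mean(res), sd = sd(res))
ks$p.value

# Cook's distance per observation; large values flag influential cases
cd <- cooks.distance(fit)
names(cd)[which.max(cd)]
```

Because the mean and standard deviation are estimated from the same residuals, the p-value of this test is only approximate; treat it as a rough check next to the graphical diagnostics.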
154. nes in R The output object is often a collection of parameter estimates residuals predicted values etc For example consider the output of the function 1sfit In its most simple form the function fits a least square regression x lt 1 5 y lt x rnorm 5 0 0 25 2 lt lsfit x y Zz coef Intercept X 013915120 9235291 residuals 1 0 006962623 0 017924751 0 036747141 0 155119026 0 093484512 intercept 1 T 38 CHAPTER 2 DATA OBJECTS 2 2 DATA STRUCTURES In this example the output value of 1sfit x y is assigned to object z This is a list whose first component is a vector with the intercept and the slope The second component is a vector with the model residuals and the third component is a logical vector of length one indicating whether or not an intercept is used The three components have the names coef residuals and intercept The components of a list can be extracted in several ways e component number z 1 means the first component of z use double square brackets e component name z name indicates the component of z with name name To identify the component name the first few characters will do for example you can use z r instead of z residuals test lt z r test 1 0 0069626 0 0179247 0 0367471 0 1551190 0 0934845 z r 4 fourth element of the residuals 1 0 155119026 Creating lists A list can also be constructed by using the function
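The ways of extracting list components described above can be sketched as follows. The seed and the hand-built list info are assumptions added for illustration; note that current versions of lsfit name the first component coefficients, which $coef reaches through partial matching.

```r
set.seed(1)                       # assumption: fixed seed for reproducibility
x <- 1:5
y <- x + rnorm(5, 0, 0.25)
z <- lsfit(x, y)                  # the result is a list

first_by_number <- z[[1]]         # first component, by position
resid_by_name   <- z$residuals    # component by its full name
resid_partial   <- z$r            # the first few characters suffice
fourth_residual <- z$r[4]         # fourth element of the residuals

# A list built by hand with the function list (hypothetical content)
info <- list(id = 7, tags = c("a", "b"), ok = TRUE)
names(info)
```

Partial matching is convenient at the console, but spelling out the full component name is safer in functions, since a later component with a similar name would make the abbreviation ambiguous.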
155. ng functions for detailed information 104 CHAPTER 7 GRAPHICS 7 2 MORE PLOT FUNCTIONS sin 0 5 0 0 0 5 1 0 figure 2 x axis Figure 7 2 Line plot with title can be created with type 1 or the curve function 7 2 1 The plot function The plot function is very versatile function As we will see in in section 9 1 about object oriented programming the plot function is a so called a generic function Depending on the class of the input object the function will call a specific plot method Some examples e plot xf creates a bar plot if xf is a vector of data type factor e plot xf y creates box and whisker plots of the numeric data in y for each level of xf e plot x df all columns of the data frame x df are plotted against each other e plot myts creates a time series plot if myts is a ts time series object e plot xdate yval if xdate is a Date object R will plot yval with a suitable X axis e plot xpos y creates a scatterplot where xpos is a POSIXct object and y is a numeric vector e plot f low up creates a graph of the function f between low and up The code below shows some examples of the different uses of the function plot 105 CHAPTER 7 GRAPHICS 7 2 MORE PLOT FUNCTIONS oO 7 o am aaa pas i e S Le i i aE TJ i i 6 i i ae AE 13 pea fae Ae e a T T A B Cc A B Cc a y Aves Y 7 ES
156. ning Medians Robust Scatter Plot Smoothing

Type 'help(FOO, package = PKG)' to inspect entry 'FOO(PKG) TITLE'.

The R manuals are also available on-line in pdf format. In the RGui window go to the help menu and select 'manuals in pdf'.

1.5.2 The R mailing lists and the R Journal

There are several mailing lists on R; see the R website. The main mailing list is R-help. Web interfaces are available where you can browse through the postings or search for a specific key word. If you have a connection to the internet, then the function RSiteSearch in R can be used to search for a string in the archives of all the R mailing lists.

    RSiteSearch("MySQL")

Another very useful webpage on the internet is www.Rseek.org, a sort of R search engine. Also take a look at the R Journal at http://journal.r-project.org.

1.6 The R workspace, managing objects

Objects that you create during an R session are held in memory; the collection of objects that you currently have is called the workspace. This workspace is not saved on disk unless you tell R to do so. This means that your objects are lost when you close R and do not save the objects, or worse, when R or your system crashes on you during a session.

When you close the RGui or the R console window, the system will ask if you want to save the workspace image. If you select to save the workspace image, then all the objects in your current R session are s
157. nt measurements of one subject are in multiple columns, whereas a long format data frame has the different measurements of one subject in multiple rows.

    df.wide <- data.frame(Subject = c(1, 2), m1 = c(4, 5), m2 = c(5.6, 7.8), m3 = c(3.6, 6.7))
    df.wide
      Subject m1  m2  m3
    1       1  4 5.6 3.6
    2       2  5 7.8 6.7

    df.long <- reshape(df.wide, varying = list(c("m1", "m2", "m3")), idvar = "Subject", direction = "long", v.names = "Measurement")
    df.long
        Subject time Measurement
    1.1       1    1         4.0
    2.1       2    1         5.0
    1.2       1    2         5.6
    2.2       2    2         7.8
    1.3       1    3         3.6
    2.3       2    3         6.7

4.4 Attributes

Vectors, matrices and other objects in general may have attributes. These are other objects attached to the main object. Use the function attributes to get a list of all the attributes of an object.

    x <- rnorm(10)
    attributes(x)
    NULL

In the above example the vector x has no attributes. You can use either the function attr or the function structure to attach an attribute to an object.

    attr(x, "description") <- "The unit is month"
    x
     [1] 1.3453003 1.4395975 1.0163646 0.6566600 0.4412399
     [6] 1.2427861 1.4967771 0.6230324 0.5538395 1.0781191
    attr(,"description")
    [1] "The unit is month"

The first argument of the function attr is the object; the second argument is the name of the attribute. The expression on the right hand side of the assignment operator will be the attribute value. Use the st
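Both topics of this passage, reshaping and attributes, can be exercised in a few lines. This is a minimal sketch: the data frame repeats the small example from the text, while the vector v and the attribute values are arbitrary illustrations.

```r
# Wide format: one row per subject, one column per measurement
df.wide <- data.frame(Subject = c(1, 2), m1 = c(4, 5),
                      m2 = c(5.6, 7.8), m3 = c(3.6, 6.7))

# Long format: one row per subject and measurement occasion
df.long <- reshape(df.wide, varying = list(c("m1", "m2", "m3")),
                   v.names = "Measurement", idvar = "Subject",
                   direction = "long")

# Attributes: attach one with attr(), or several at once with structure()
v <- c(1.5, 2.5, 3.5)
attr(v, "description") <- "The unit is month"
v2 <- structure(v, atr1 = 8, atr2 = "test")

names(attributes(v2))   # all attribute names
attr(v2, "atr1")        # a single attribute
```

The result of structure is a new object; v itself keeps only the description attribute.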
158. ntaining the function definition of meank meank lt function x k 4 xt lt quantile x c k 1 k mean x x gt xt 1 x lt xt 2 The following statement will create the function meank in R note the use of double slashes source C SFunctions meank txt Now you can run the function meank test 1 0 00175423 If you want to put a comment inside a function use the symbol Anything between the symbol and the end of the line will be ignored test lt function x This line will be ignored It is useful to insert code explanations for others and yourself sqrt 2 x 71 CHAPTER 5 WRITING FUNCTIONS 5 2 ARGUMENTS AND VARIABLES Writing large functions in R can be difficult for novice users You may wonder where and how to begin how to check input parameters or how to use loop structures Fortunately the code of many functions can be viewed directly For example just type the name of a function without brackets in the console window and you will get the code Don t be intimidated by the lengthy code Learn from it by trying to read line by line and looking at the help of the functions that you don t know yet Some functions call internal functions or pre compiled code which can be recognized by calls like C Internal or Call 5 2 Arguments and variables 5 2 1 Required and optional arguments When calling functions in R the syntax of the function definition determines wh
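To illustrate how a function definition fixes which arguments are required and which are optional, here is a sketch of a trimmed-mean function in the spirit of meank. The default value k = 0.1 and the input check are assumptions added for illustration; they are not taken from the definition in the text.

```r
# x is required; k is optional because it has a default value (assumed here)
meank <- function(x, k = 0.1) {
  # Check the input before doing any work
  stopifnot(is.numeric(x), k >= 0, k < 0.5)
  # Lower and upper cut-off points
  xt <- quantile(x, c(k, 1 - k))
  # Mean of the values strictly between the cut-offs
  mean(x[x > xt[1] & x < xt[2]])
}

v <- c(1:9, 1000)        # one extreme value
trimmed <- meank(v)      # k takes its default value
plain   <- mean(v)       # ordinary mean, pulled up by the extreme value
meank(v, k = 0.2)        # k supplied by name
```

Calling meank() without x stops with an error about the missing argument, while k can always be left out.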
159. od the second level in this case out lt table x out 2 length x plotfr lt function y x n 10 tmp lt cut x n p lt tapply y tmp obs prob plot p lines p title paste deparse substitute y and deparse substitute x 155 CHAPTER 8 STATISTICS 8 4 LOGISTIC REGRESSION par mfrow c 2 2 plotfr testdata y testdata X1 plotfr testdata y testdata X2 plotfr testdata y testdata X3 testdata y and testdata X1 testdata y and testdata X2 3 E R n a o a0 e 9 S a T T T T T T T T T T 2 4 6 8 10 2 4 6 8 10 Index Index testdata y and testdata X3 2 o a 34 o Index Figure 8 3 Explorative plots giving a first impression of the relation between the binary y variable and x variables The plots in figure 8 3 show strong relations For variable X3 there is a negative relation just as we have simulated The interpretation of the formula object that is needed as input for glm is the same as in 1m So for example the operator is also used here for specifying interaction between variables The following code fits a logistic regression model and stores the output in the object test glm test glm glm y X1 X2 X3 family binomial data testdata summary test glm Call glm formula y X1 X2 X3 family binomial data testdata Deviance Residuals Min 1Q 2 5457 0 5829 Median 3Q 0 3867 0 6721 Max 1 9312 1
161. on indefinitely.

Simple example:

    mycalc <- function() {
      tmp <- 0
      n <- 0
      while (tmp < 100) {
        tmp <- tmp + rbinom(1, 10, 0.5)
        n <- n + 1
      }
      cat("It took ")
      cat(n)
      cat(" iterations to finish \n")
    }

repeat

    repeat {
      some expressions
    }

In the repeat loop some expressions are repeated infinitely, so repeat loops have to contain a break statement to escape them.

5.4 Debugging your R functions

5.4.1 The traceback function

The R language provides the user with some tools to track down unexpected behavior during the execution of user-written functions. For example:

• A function may throw warnings at you. Although warnings do not stop the execution of a function and could be ignored, you should check out why a warning is produced.
• A function stops because of an error. Now you must really fix the function if you want it to continue to run until the end.
• Your function runs without warnings and errors; however, the number it returns does not make any sense.

The first thing you can do when an error occurs is to call the function traceback. It will list the functions that were called before the error occurred. Consider the following two functions:

    myf <- function(z) {
      x <- log(z)
      if (x > 0) {
        print("PPP")
      } else {
        print("QQQ")
      }
    }

    testf <- function(pp
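The need for a break statement inside repeat can be seen in a small sketch. The function name count_draws and its target argument are hypothetical; the loop body mirrors the while example in the text.

```r
# Count how many binomial draws are needed before their sum reaches target
count_draws <- function(target = 100) {
  total <- 0
  n <- 0
  repeat {
    total <- total + rbinom(1, 10, 0.5)   # one draw between 0 and 10
    n <- n + 1
    if (total >= target) break            # the only way out of the loop
  }
  n
}

draws <- count_draws()
cat("It took", draws, "iterations to finish\n")
```

Each draw is at most 10, so at least 10 iterations are needed to reach 100. Without the break the loop would never end, which is why every repeat loop needs an exit condition that is certain to become true.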
162. or palette The graphical parameter col can be a vector This can be used to create a scatterplot of Price and Mileage where each point has a different color depending on the Weight value of the car To do this we first need to change the color palette in R The color palette specifies which colors corresponds with the numbers 1 2 3 in the specification col number The current palette can be printed with the function palette palette 1 black red green3 blue cyan 6 magenta yellow gray 117 CHAPTER 7 GRAPHICS 7 3 MODIFYING A GRAPH This means plot rnorm 100 col 2 will create a scatterplot with red points The function can also be used to change the palette Together with a few auxiliary functions heat colors terrain colors gray it is easy to create a palette of colors say from dark to light red palette heat colors Ncars palette 1 red FF2400 FF4900 FF6D00 FF9200 FFB600 7 FFDBOO yellow FFFF40 FFFFBF So in the color palette col 1 represents red col 2 a slightly lighter red and so on Then in the plot function we specify col order cars Weight the largest value has order number Nears The following code uses several plot functions to create a colored scatterplot and a color legend split the screen in two the larger left part will contain the scatter plot the right side contains a color legend layout matrix c 1 2 nc 2 widt
163. or the argument breaks, that number is used to divide x into equal length intervals.

    cut(x, breaks = 5)
     [1] (0.986,3.79] (0.986,3.79] (0.986,3.79] (3.79,6.6]   (3.79,6.6]
     [6] (3.79,6.6]   (6.6,9.4]    (6.6,9.4]    (6.6,9.4]    (9.4,12.2]
    [11] (9.4,12.2]   (9.4,12.2]   (12.2,15]    (12.2,15]    (12.2,15]
    Levels: (0.986,3.79] (3.79,6.6] (6.6,9.4] (9.4,12.2] (12.2,15]

The names of the different levels are created by R automatically; they have the form (a,b]. You can change this by specifying an extra labels argument.

    x <- rnorm(15)
    cut(x, breaks = 3, labels = c("low", "medium", "high"))
     [1] high   medium medium medium medium high   low    high   low    low
    [11] high   low    low    medium high
    Levels: low medium high

5 Writing functions

5.1 Introduction

Most tasks are performed by calling a function in R. In fact, everything we have done so far is calling an existing function, which then performed a certain task resulting in some kind of output. A function can be regarded as a collection of statements and is an object in R of class function. One of the strengths of R is the ability to extend R by writing new functions. The general form of a function is given by:

    functionname <- function(arg1, arg2, ...) {
      Body of function: a collection of valid statements
    }

In the above display, arg1 and arg2 in the function header are input arguments of the function. Note that a function doesn't need to have any input arguments. The body of the function consists of valid R statemen
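The two topics of this passage can be combined in a short sketch: a small user-written function that wraps cut. The function name three_groups and the input vector are made up for illustration.

```r
# A function is an ordinary object; this one bins a numeric vector
# into three equal length intervals with readable labels
three_groups <- function(v) {
  cut(v, breaks = 3, labels = c("low", "medium", "high"))
}

x <- 1:15
g <- three_groups(x)

class(three_groups)   # functions have class "function"
levels(g)             # the labels replace the default (a,b] names
table(g)              # number of values per interval
```

With 15 equally spaced values and three equal length intervals, each group receives five values.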
164. order of the levels is also used in linear models. If one or more of the regression variables are factor variables, the order of the levels is important for the interpretation of the parameter estimates; see section 8.3.4.

2.1.7 Dates and Times

To represent a calendar date in R, use the function as.Date to create an object of class Date.

    temp <- c("12-09-1973", "29-08-1974")
    z <- as.Date(temp, "%d-%m-%Y")
    z
    [1] "1973-09-12" "1974-08-29"
    data.class(z)
    [1] "Date"
    format(z, "%d-%m-%Y")
    [1] "12-09-1973" "29-08-1974"

You can add a number to a date object; the number is interpreted as the number of days to add to the date.

    z + 19
    [1] "1973-10-01" "1974-09-17"

You can subtract one date from another; the result is an object of class difftime.

    dz = z[2] - z[1]
    dz
    Time difference of 351 days
    data.class(dz)
    [1] "difftime"

In R the classes POSIXct and POSIXlt can be used to represent calendar dates and times. You can create POSIXct objects with the function as.POSIXct. The function accepts characters as input, and it can be used to specify not only a date but also a time within a date.

    t1 <- as.POSIXct("2003-01-23")
    t2 <- as.POSIXct("2003-04-23 15:34")
    t1
    [1] "2003-01-23 W. Europe Standard Time"
    t2
    [1] "2003-04-23 15:34:00 W. Europe Daylight Time"

A handy function is strptime; it is used to convert a certain character representation of a date and tim
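The date arithmetic above, together with strptime, can be checked with a short sketch. The input string for strptime and the choice of the UTC time zone are assumptions made so that the result does not depend on the local settings of the machine.

```r
# Dates from day-month-year strings
z <- as.Date(c("12-09-1973", "29-08-1974"), format = "%d-%m-%Y")

later <- z + 19          # adding a number adds days
dz <- z[2] - z[1]        # subtracting dates gives a difftime object

# strptime converts a character representation into a date-time object;
# the format string describes how the input is laid out
p <- strptime("23/04/2003 15:34", format = "%d/%m/%Y %H:%M", tz = "UTC")

format(later)
as.numeric(dz)           # 351
format(p, "%Y-%m-%d %H:%M")
```

The result of strptime is of class POSIXlt; wrap it in as.POSIXct when a compact representation is needed, for example as a column in a data frame.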
165. ors returning NULL in x call 2 operator is deprecated for atomic vectors returning NULL in object coefficients No coefficients So it is recommended to use a so called constructor function To create an object of certain class use only the constructor function for that class The constructor function can then be designed in such a way that it only returns a proper object of that class If you want an lm object use the function 1m which can act as a constructor function for the class Im For our bigMatrix class we create the following constructor function bigMatrix lt function m if data class m matrix class m bigMatrix return m else 4 warning not a matrix return m mi lt bigMatrix ppp m2 lt bigMatrix matrix rnorm 50 2 ncol 50 Defining new generic and specific methods Two specific methods can be created for our bigMatrix class print bigMatrix and plot bigMatrix Printing a big matrix results in many numbers on screen The spe cific print method for bigMatrix only prints the dimension and the first few rows and columns 183 CHAPTER 9 MISCELLANEOUS 9 1 OBJECT ORIENTED print bigMatrix lt function x nr 3 nc 5 cat Big matrix n cat dimension cat dim x cat n print x 1isnr T ncl m2 Big matrix dimension 50 50 1 2 3 1 2 3 4 5 0 7012566 0 7327267 0 706452 0 2355600 1 2577592 1 6390
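The constructor and the specific print method for the bigMatrix class can be put together as in the sketch below. It follows the idea of the text, but details such as is.matrix for the check, the guards on nr and nc, and the invisible return value are choices made here for illustration.

```r
# Constructor: only a matrix gets the class attribute
bigMatrix <- function(m) {
  if (is.matrix(m)) {
    class(m) <- "bigMatrix"
  } else {
    warning("not a matrix")
  }
  m
}

# Specific print method: show the dimension and only a corner of the data
print.bigMatrix <- function(x, nr = 3, nc = 5, ...) {
  cat("Big matrix\n")
  cat("dimension", dim(x), "\n")
  nr <- min(nr, nrow(x))
  nc <- min(nc, ncol(x))
  print(unclass(x)[1:nr, 1:nc, drop = FALSE])
  invisible(x)
}

m2 <- bigMatrix(matrix(rnorm(2500), ncol = 50))
print(m2)   # dispatches to print.bigMatrix
```

Because the generic print function finds the method through its name, simply typing m2 at the console gives the same abbreviated output.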
166. packaging function allows the user to specify a different type of censoring For example left censored data is specified as follows 164 CHAPTER 8 STATISTICS 8 6 SURVIVAL ANALYSIS Surv time status type left Age Sex The right hand side of the formula has the same interpretation as in linear regression models 8 6 1 The Cox proportional hazards model To demonstrate the function coxph from the package survival that fits a Cox propor tional hazards model we use data that is analyzed in the paper of M Prins and P J Veugelers 12 The data results from a multi center cohort study among injecting drug users one of the things the researches wanted to know was how long before a HIV infected person would develop AIDS the incubation time The data and a complete description can be downloaded from www splusbook com In R we import the data into the IDUdata data frame and print only the first 10 persons of the data frame and show only a few columns We also need to calculate the incubation time as the difference between the AIDS time and the entry time The time unit used for the incubation time is months IDUdata lt read csv IDUdata txt IDUdata IncubationTime IDUdata AIDStime IDUdata Entrytime gt IDUdata 1 10 c 3 5 6 7 8 9 13 Sex PosDate Age Entrytime AidsStatus Event IncubationTime 1 1 194 24 251 0 0 36 2 il 166 25 215 1 3 60 3 2 225 26 225 1 2 28 4 1 289 29 238 0 0 63 5 1 297 33 199 0 0 102 6 1 192 2
167. r community writing new R packages that are made available to others. If you have any questions or comments on this document, please do not hesitate to contact me.

The best explanation of R is given on the R web site, http://www.r-project.org. The remainder of this section and the following section are taken from the R web site.

R is a language and environment for statistical computing and graphics. It is a GNU project which is similar to the S language and environment, which was developed at Bell Laboratories (formerly AT&T, now Lucent Technologies) by John Chambers and colleagues. R can be considered as a different implementation of S. There are some important differences, but much code written for S runs unaltered under R.

R provides a wide variety of statistical (linear and non-linear modeling, classical statistical tests, time series analysis, classification, clustering, ...) and graphical techniques, and is highly extensible. The S language is often the vehicle of choice for research in statistical methodology, and R provides an Open Source route to participation in that activity.

One of R's strengths is the ease with which well-designed publication-quality plots can be produced, including mathematical symbols and formulae where needed. Great care has been taken over the defaults for the minor design choices in graphics, but the user retains full control.

R is available as Free Software under the terms of the Free Software Foundation
168. r value xC is about 25, representing the difference in mean between level A and level C.

When using a treatment contrast, the lowest level is left out of the regression. By default this is the level with the name that comes first in alphabetical order. The parameter estimates for the remaining levels represent the difference between that level and the lowest level. Consider the above example code again, but rename level A to level X and fit the linear model again.

    x <- as.factor(c(rep("X", 100), rep("B", 100), rep("C", 100), rep("D", 100)))
    testdata <- data.frame(x, y)
    lm(y ~ x, data = testdata)

    Call:
    lm(formula = y ~ x, data = testdata)

    Coefficients:
    (Intercept)           xC           xD           xX
         10.107       19.930       39.910       -5.177

Now level B is the lowest level and is left out. So the parameter estimate for xC represents the difference in mean between level B and C, which is about 20.

If you are using a treatment contrast, the lowest level will be left out. When you don't want to leave out that particular level, you can use the so-called SAS contrast. This is the treatment contrast, but leaving out the last factor level.

    lm(y ~ x, data = testdata, contrasts = list(x = "contr.SAS"))

    Call:
    lm(formula = y ~ x, data = testdata, contrasts = list(x = "contr.SAS"))

    Coefficients:
    (Intercept)           x1           x2           x3
          4.930        5.177       25.107       45.087

Or alternatively, you can reorder the factor. Suppose you want to leave out level C in the above example; proceed as follows:

    testdata$x <- or
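The effect of the reference level can be reproduced with simulated data. In this sketch the group means 5, 10, 30 and 50, the seed, and the use of relevel to change the reference level are assumptions made for illustration; relevel is one possible way to reorder a factor and is not necessarily the approach the text goes on to use.

```r
set.seed(42)   # assumption: fixed seed for reproducibility
x <- factor(rep(c("A", "B", "C", "D"), each = 100))
y <- rnorm(400, mean = rep(c(5, 10, 30, 50), each = 100))
testdata <- data.frame(x, y)

# Default treatment contrast: the first level (A) is the reference
fit_default <- lm(y ~ x, data = testdata)
coef(fit_default)

# SAS contrast: the last level (D) is the reference,
# so the intercept estimates the mean of level D
fit_sas <- lm(y ~ x, data = testdata, contrasts = list(x = "contr.SAS"))
coef(fit_sas)

# Making C the reference level by reordering the factor
testdata$x2 <- relevel(testdata$x, ref = "C")
fit_ref_c <- lm(y ~ x2, data = testdata)
coef(fit_ref_c)
```

All three fits describe the same model; only the parameterisation changes, so the fitted values are identical.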
169. reate a separate panel for each level of a factor variable or for each level combination of multiple factor variables To create Trellis graphics based on numeric conditioning variables you can use the func tions equal count or shingle to create conditioning intervals of numeric variables These intervals can then be used in a Trellis display function Lets look at our cars example data frame that contains information on 60 different cars Suppose we want to create a histogram of the variable Mileage conditioned on the variable Weight We then proceed as follows weight int lt equal count cars Weight number 4 overlap 0 126 CHAPTER 7 GRAPHICS 7 4 TRELLIS GRAPHICS Percent of Total 30 20 30 20 30 20 Figure 7 13 A trellis plot with two conditioning variables This creates the conditioning intervals The variable Weight is divided into four equal intervals without overlap weight int Data 1 2560 2345 1845 2260 2440 2285 2275 2350 2295 1900 2390 2075 2330 3320 2885 16 3310 2695 2170 2710 2775 2840 2485 2670 2640 2655 3065 2750 2920 2780 2745 31 3110 2920 2645 2575 2935 2920 2985 3265 2880 2975 3450 3145 3190 3610 2885 46 3480 3200 2765 3220 3480 3325 3855 3850 3195 3735 3665 3735 3415 3185 3690 Intervals min max count 1 1842 5 2562 5 15 2 2572 5 2887 5 16 3 2882 5 3222 5 16 4 3262 5 3857 5 15 127 CHAPTER 7 GRAPHICS 7 4 TRELLIS GRAPHICS Ove
171. rent way then we can write a new function myfungishow that shows our fungi object differently The function setMethods sets the new function myfungishow as the specific show method for the fungi class We have the following R code myfungishow lt function object tmp rbind x format round object x 2 y format round object y 2 187 CHAPTER 9 MISCELLANEOUS 9 1 OBJECT ORIENTED species object species dimnames tmp 21 rep length object x print tmp quote F setMethod show fungi myfungishow field1 x 0 97 0 55 0 44 0 03 0 92 0 46 0 49 0 92 0 30 0 19 y 0 44 0 15 0 35 0 79 0 73 0 42 0 04 0 65 0 68 0 18 species c e a Cc d e e e b Note that the setMethod function copies the function myfungishow into the class infor mation In fact after a call to setMethod the function myfungishow can be removed This totally different from the old style classes where the specific method was searched for by a naming convention print fungi To see the specific show method for the fungi class use the function getMethods getMethods show w 1 An object of class MethodsList Slot methods fungi Method Definition function object tmp rbind x format round object x 3 y format round object y 2 species object species dimnames tmp 2 rep length object x print tmp quote F Signatures object target fungi defined fungi 188 CH
172. rice against Mileage conditioned on the Type variable, and suppose that in addition we want a separate symbol for the highest price. We create our own panel function:

    panel.maxsymbol <- function(x, y) {
      biggest <- y == max(y)
      panel.points(x[!biggest], y[!biggest])
      panel.points(x[biggest], y[biggest], pch = "M")
    }

The above function first finds out what the maximum y value is; it then plots the points without the maximum y value, and then plots the maximum y value using a different symbol. Note that we use the function panel.points instead of the normal low level points function. The normal low level functions cannot be used inside a function that is going to be used as a panel function, because lattice panel functions need to use grid graphics. So use the panel versions: panel.points, panel.text, panel.abline, panel.lines and panel.segments.

Once a panel function is defined, you should pass it to the trellis display function:

    xyplot(Price ~ Mileage | Type, data = cars, panel = panel.maxsymbol)

The following example fits a least squares line through the points of each panel. Additional graphical parameters can also be passed on. The next example enables the user to specify the type of line using the graphical parameter lty.

    panel.lsline <- function(x, y, ...) {
      coef <- lsfit(x, y)$coef
      panel.points(x, y)
      panel.abline(coef[1], coef[2], ...)
    }

    xyplot(Price ~ Mileage | Type, data = cars, panel = panel.lsline, lty = 2)
173. rlap between adjacent intervals:
    [1] 0 2 0

To draw the histograms, use the following R code:

    histogram(~ Mileage | weight.int, data = cars)

Figure 7.14: Histogram of mileage for different weight classes

7.4.3 Trellis panel functions

Trellis graphs are constructed per panel. A general trellis display function calls a panel function that does the actual work. The name of the default panel function that is called by the general trellis display function is panel.name, where name is the name of the general trellis display function. So if we call the trellis display function xyplot, then this will in turn call the function panel.xyplot. If you look at the code of the general trellis display function xyplot, you won't see a lot. The corresponding panel function panel.xyplot does all the work.

    xyplot
    function (x, data, ...)
    UseMethod("xyplot")
    <environment: namespace:lattice>

    panel.xyplot
    function (...)
    ... a lot of R code ...

A powerful feature of trellis graphs is that you can write your own panel function and pass this function on to the general trellis display function. This is done using the argument panel of the trellis display function. Suppose we want to plot P
174. ructure function to attach more than one attribute to an object 62 CHAPTER 4 DATA MANIPULATION 4 5 CHARACTER MANIPULATION x lt structure x atri 8 atr2 test x 1 1 3453003 1 4395975 1 0163646 0 6566600 0 4412399 6 1 2427861 1 4967771 0 6230324 0 5538395 1 0781191 attr description 1 The unit is month atte latri 1 8 attr atr2 1 test When an object is printed the attributes if any are printed as well To extract an at tribute from an object use the functions attributes or attr The function attributes returns a list of all the attributes from which you can extract a specific component attributes x description 1 The unit is month atri 1 8 atr2 1 test In order to get the description attribute of x use attributes x description 1 The unit is month Or type in the following construction attr x description 1 The unit is month 4 5 Character manipulation There are several functions in R to manipulate or get information from character ob jects 63 CHAPTER 4 DATA MANIPULATION 4 5 CHARACTER MANIPULATION 4 5 1 The functions nchar substring and paste xs coa bl tc mychari lt This is a test mychar2 lt This is another test charvector lt at pr ql eest The function nchar returns the length of a character object for example nchar mychar1 1 15 nchar charvector 1 1114 The function su
175. rvations complexity param 0 2444444 predicted class Compact expected loss 0 75 class counts 15 3 13 13 9 7 probabilities 0 250 0 050 0 217 0 217 0 150 0 117 left son 2 49 obs right son 3 11 obs Primary splits Price lt 9152 5 to the right improve 10 259180 0 missing Mileage lt 27 5 to the left improve 7 259083 0 missing Surrogate splits Mileage lt 27 5 to the left agree 0 933 adj 0 636 0 split A graphical representation can be obtained with the following code plot fit text fit Price gt 9152 T Mileagg gt 20 5 Small Mileage lt 23 5 Aedium Compact Figure 8 5 Plot of the tree Type is predicted based on Mileage and Price 8 5 2 Coarse classification and binning When building regression models binning or coarse classification is sometimes used 162 CHAPTER 8 STATISTICS 8 5 TREE MODELS Binning is a procedure that creates a nominal factor variable from a continuous nu meric variable I e each value of a numeric variable gets mapped to a certain interval or category There are a couple of reasons why we want to do this First nonlinear effects can be captured in a very simple way second the binned variable is less sensitive to outliers Coarse classification is a procedure to group all the possible outcomes of a nominal fac tor variable into a smaller set of outcomes The main reason to do this is because there may be too many outcomes and so some outcomes are
176. s case the function ifelse is more efficient.

    tmp <- Sys.time()
    x <- rnorm(15000)
    x <- ifelse(x > 1, 1, -1)
    Sys.time() - tmp
    Time difference of 0.02999997 secs

The function ifelse has three arguments. The first is a test (a logical expression), the second is the value given to those elements of x which pass the test, and the third argument is the value given to those elements which fail the test.

The cumsum function

To calculate cumulative sums of vector elements use the function cumsum. For example:

    x <- 1:10
    y <- cumsum(x)
    y
     [1]  1  3  6 10 15 21 28 36 45 55

The function cumsum also works on matrices, in which case the cumulative sums are calculated per column. Use cumprod for cumulative products, cummin for cumulative minimums and cummax for cumulative maximums.

Matrix multiplication

In R a matrix multiplication is performed by the operator %*%. This can sometimes be used to avoid explicit looping. An m by n matrix A can be multiplied by an n by k matrix B in the following manner:

    C <- A %*% B

So element C[i,j] of the matrix C is given by the formula

    $C_{i,j} = \sum_{k} A_{i,k} B_{k,j}$

If we choose the elements of the matrices A and B cleverly, explicit for loops can be avoided. For example, column averages of a matrix. Suppose we want to calculate the average of each column of a matrix. Proceed as follows:

    A <- matrix(rnorm(1000), ncol = 10)
    n <-
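The snippet breaks off before the column-average computation is complete. One way it could be finished, as a sketch and not necessarily the author's code, is to multiply A from the left by a row vector whose entries are all 1/n; the names ones and col_avg are assumptions.

```r
set.seed(1)                          # only to make the sketch reproducible
A <- matrix(rnorm(1000), ncol = 10)
n <- nrow(A)

# A 1 x n row vector of weights 1/n; multiplying it with A sums each
# column and divides by n, so no explicit loop is needed
ones <- matrix(1 / n, nrow = 1, ncol = n)
col_avg <- ones %*% A                # 1 x 10 matrix of column averages

# Cross-check against the built-in function
all.equal(as.vector(col_avg), colMeans(A))
```

In practice colMeans is the more direct choice; the matrix product is shown only because it illustrates the loop-avoidance idea of this section.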
177. s rnorm or runif. There are more of these functions.

    x <- rnorm(10)        # 10 random standard normal numbers
    y <- runif(10, 4, 7)  # 10 random numbers between 4 and 7

2.2.2 Matrices

Generating matrices

A matrix can be regarded as a generalization of a vector. As with vectors, all the elements of a matrix must be of the same data type. A matrix can be generated in several ways. For example:

• Use the function dim:

    x <- 1:8
    dim(x) <- c(2, 4)
    x
         [,1] [,2] [,3] [,4]
    [1,]    1    3    5    7
    [2,]    2    4    6    8

• Use the function matrix:

    x <- matrix(1:8, 2, 4, byrow = FALSE)
    x
         [,1] [,2] [,3] [,4]
    [1,]    1    3    5    7
    [2,]    2    4    6    8

  By default the matrix is filled by column. To fill the matrix by row, specify byrow = TRUE as argument in the matrix function.

• Use the function cbind to create a matrix by binding two or more vectors as column vectors. The function rbind is used to create a matrix by binding two or more vectors as row vectors.

    cbind(c(1, 2, 3), c(4, 5, 6))
         [,1] [,2]
    [1,]    1    4
    [2,]    2    5
    [3,]    3    6
    rbind(c(1, 2, 3), c(4, 5, 6))
         [,1] [,2] [,3]
    [1,]    1    2    3
    [2,]    4    5    6

Calculations on matrices

A matrix can be regarded as a number of equal length vectors pasted together. All the mathematical functions that apply to vectors also apply to matrices and are applied on each matrix element.

    x * x^2   # All operations are applied on each matrix element
178. sim in the examples section. To return results to R, modify one or more input arguments of the C function. The value of the .C function is a list with each component matching one argument to the C function. If you name these arguments, as we did in the preceding example, the return list has named components. Your R function can use the returned list for further computations or to construct its own return value, which generally omits those arguments which are not altered by the C code. Thus, if we wanted to just use the returned value of x, we could call .C as follows:

    .C("arsim", x = as.double(x), n = as.integer(length(x)))$x

All arguments of the C routine called via .C() must be pointers. All such routines should be void functions; if the routine does return a value, it could cause R to crash. R has many classes that are not immediately representable in C. To simplify the interface between R and C, the types of data that R can pass to C code are restricted to the following classes:

• single
• integer
• double
• complex
• logical
• character
• raw
• list

The following table shows the correspondence between R data types and C types:

    R data type    C data type
    logical        long*
    integer        long*
    double         double*
    complex        Rcomplex*
    character      char**
    raw            char*

6.3.2 The .Call and .External interfaces

The .Call and .External interfaces are powerful interfaces that
179. singular values of $S(\theta)$. This can also be interpreted as

    $\Delta y \approx U \, \mathrm{diag}(d_1, \ldots, d_p) \, V^{T} \Delta\theta$

So the i-th singular value $d_i$ shows the effect of changes of the parameters in the direction given in the i-th row of $V^{T}$. If a singular value drops below a certain critical value, or is relatively small compared to the largest singular value, then the model shows signs of ill-conditioning. This is certainly obvious if a singular value is nearly zero: a small change in the parameter will have no effect on the measurement space. Note that for a linear regression model $y = X\beta$ the sensitivity matrix S is just the matrix X, and that small singular values correspond to the multicollinearity problem, see section 8.3.3.

For the Hill model, the code below uses the function deriv for the calculation of the sensitivity matrix and the function svd for its singular value decomposition.

    # calc symbolic derivatives with respect to the parameters
    ModelDeriv <- deriv(
      expression(Vm * x^alpha / (k^alpha + x^alpha)),
      namevec = c("Vm", "alpha", "k")
    )

    # evaluate the derivative at a certain x and parameter values
    sensitivity <- eval(
      ModelDeriv,
      envir = list(
        x = seq(from = 1.6, to = 5, l = 50),
        k = 0.3, Vm = 1.108, alpha = 0.8
      )
    )

    # the gradient matrix is given as an attribute; extract it and
    # calculate the singular value decomposition
    sensitivity <- attributes(sensitivity)$gradient
    svd(sensitivity)
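To make the "relatively small compared to the largest singular value" check concrete, here is a short sketch that reuses the Hill model settings from this entry. The cut-off tol is a hypothetical value chosen only for illustration; the manual does not state a specific threshold.

```r
# Symbolic gradient of the Hill model with respect to its parameters
ModelDeriv <- deriv(
  expression(Vm * x^alpha / (k^alpha + x^alpha)),
  namevec = c("Vm", "alpha", "k")
)

# Evaluate at the x grid and parameter values used in the text
vals <- eval(
  ModelDeriv,
  envir = list(x = seq(1.6, 5, length.out = 50), k = 0.3, Vm = 1.108, alpha = 0.8)
)
S <- attr(vals, "gradient")   # 50 x 3 sensitivity matrix

d   <- svd(S)$d               # singular values, largest first
rel <- d / d[1]               # size relative to the largest one

tol <- 1e-3                   # hypothetical cut-off, an assumption
ill_conditioned <- any(rel < tol)
rel
```

A relative singular value close to zero points to a parameter direction that the chosen x values can hardly identify.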
180. sults proceed as follows:

    fit1 <- nls(
      y ~ beta1 * x / (beta2 + x),
      start = list(beta1 = 2.5, beta2 = 7),
      data = our.exp
    )
    summary(fit1)

    Formula: y ~ beta1 * x/(beta2 + x)

    Parameters:
          Estimate Std. Error t value Pr(>|t|)
    beta1   2.9088     0.0672   43.28   <2e-16 ***
    beta2   7.3143     0.5376   13.61   <2e-16 ***
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    Residual standard error: 0.1414 on 98 degrees of freedom

So the first argument of nls is a formula object. Unlike the formula specification in lm or coxph, the operators here have the normal mathematical meaning. The second argument is required and specifies the initial values of the parameters. They are used to initiate the optimization algorithm that estimates the parameter values. The third argument is the data frame with the data.

Note that the nls function can sometimes fail to find parameter estimates. One of the reasons could be poor initial values for the parameters. For example:

    fit1 <- nls(
      y ~ a * x / (b + x),
      start = list(a = 25000, b = 600),
      data = our.exp
    )
    Error in nls(y ~ a * x/(b + x), start = list(a = 25000, b = 600), data = our.exp) :
      singular gradient

The output of nls is a list of class nls. Use the generic summary function to get an overview of the fit. The output of summary is also a list; it can be stored and used to calculate the variance matrix of the estimated param
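The following sketch shows how the stored summary could be used to obtain the variance matrix of the parameter estimates. The data are simulated here with assumed true values 3 and 7 and an arbitrary seed; they stand in for the manual's our.exp data frame, which is not available in this snippet.

```r
set.seed(42)                                   # arbitrary seed, an assumption
x <- runif(100, 1, 50)
y <- 3 * x / (7 + x) + rnorm(100, sd = 0.1)    # assumed true parameters 3 and 7
sim <- data.frame(x = x, y = y)

fit <- nls(
  y ~ beta1 * x / (beta2 + x),
  start = list(beta1 = 2.5, beta2 = 7),
  data = sim
)

s <- summary(fit)
# Variance matrix: residual variance times the unscaled covariance matrix
V <- s$sigma^2 * s$cov.unscaled

# The generic vcov returns the same matrix
all.equal(V, vcov(fit))
```

The square roots of the diagonal of V are the standard errors reported by summary.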
181.
         [,1] [,2] [,3] [,4]
    [1,]    1   27  125  343
    [2,]    8   64  216  512

    max(x)   # returns the maximum of all matrix elements in x
    [1] 8

You can multiply a matrix with a vector. The outcome may be surprising:

    x <- matrix(1:16, ncol = 4)
    y <- 7:10
    x * y
         [,1] [,2] [,3] [,4]
    [1,]    7   35   63   91
    [2,]   16   48   80  112
    [3,]   27   63   99  135
    [4,]   40   80  120  160

    x <- matrix(1:28, ncol = 4)
    y <- 7:10
    x * y
         [,1] [,2] [,3] [,4]
    [1,]    7   80  135  176
    [2,]   16   63  160  207
    [3,]   27   80  119  240
    [4,]   40   99  144  175
    [5,]   35  120  171  208
    [6,]   48   91  200  243
    [7,]   63  112  147  280

As an exercise, try to find out what R did.

To perform a matrix multiplication in the mathematical sense, use the operator %*%. The dimensions of the two matrices must agree. In the following example the dimensions are wrong:

    x <- matrix(1:8, ncol = 2)
    x %*% x
    Error in x %*% x : non-conformable arguments

A matrix multiplied with its transpose t(x) always works.

    x %*% t(x)
         [,1] [,2] [,3] [,4]
    [1,]   26   32   38   44
    [2,]   32   40   48   56
    [3,]   38   48   58   68
    [4,]   44   56   68   80

R has a number of matrix specific operations, for example:

    Function name   Operation
    chol(x)         Choleski decomposition
    col(x)          matrix with column numbers of the elements
    diag(x)         create a diagonal matrix from a vector
    ncol(x)         returns the number of columns of a matrix
    nrow(x)         returns the number of rows of a matrix
    qr(x)           QR matrix decomposition
    row(x)          matrix with row numbers of
182. ta

    Coefficients:
    (Intercept)       Weight
       2254.765        3.757

6.2.5 The outer function

The function outer performs an outer product given two arrays (vectors). This can be especially useful for evaluating a function on a grid without explicit looping. The function has at least three input arguments: two vectors x and y, and the name of a function that needs two or more arguments for input. For every combination of the vector elements of x and y this function is evaluated. Some examples are given by the code below.

    x <- 1:3
    y <- 1:3
    z <- outer(x, y, FUN = "-")
    z
         [,1] [,2] [,3]
    [1,]    0   -1   -2
    [2,]    1    0   -1
    [3,]    2    1    0

    x <- c("A", "B", "C", "D")
    y <- 1:9
    z <- outer(x, y, paste, sep = "")
    z
         [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9]
    [1,] "A1" "A2" "A3" "A4" "A5" "A6" "A7" "A8" "A9"
    [2,] "B1" "B2" "B3" "B4" "B5" "B6" "B7" "B8" "B9"
    [3,] "C1" "C2" "C3" "C4" "C5" "C6" "C7" "C8" "C9"
    [4,] "D1" "D2" "D3" "D4" "D5" "D6" "D7" "D8" "D9"

    x <- seq(-4, 4, l = 50)
    y <- x
    myf <- function(x, y) {
      sin(x) + cos(y)
    }
    z <- outer(x, y, FUN = myf)
    persp(x, y, z, theta = 45, phi = 45, shade = 0.45)

Figure 6.1: A surface plot created with the function persp.

6.3 Using compiled code

Sometimes the use of explicit for loops cannot be avoided. When these loops form a real bottleneck in computation time, you should consider implementing these loops in C or Fortran and linking them to R. This feature is used a lot within
183. instance, if the length of the output vector depends on a certain calculation:

    myf <- function(x) {
      n <- as.integer(sum(x))
      out <- 1:n
      out
    }
    testdf <- as.data.frame(matrix(runif(25), ncol = 5))
    sapply(testdf, myf)
    $X.1
    [1] 1 2

    $X.2
    [1] 1 0

    $X.3
    [1] 1 2 3

    $X.4
    [1] 1 2

    $X.5
    [1] 1

The result will then be an object with a list structure.

6.2.3 The tapply function

This function is used to run another function on the cells of a so-called ragged array. A ragged array is a pair of two vectors of the same size. One of them contains data and the other contains grouping information. The following data vector x and grouping vector y form an example of a ragged array.

    x <- rnorm(50)
    y <- as.factor(sample(c("A", "B", "C", "D"), size = 50, replace = TRUE))

A cell of a ragged array consists of those data points from the data vector that have the same label in the grouping vector. The function tapply calculates a function on each cell of a ragged array.

    tapply(x, y, mean, trim = 0.3)
            A         B         C         D
    0.4492093 0.1506878 0.4427229 0.1265299

Combining lapply and tapply

To calculate the mean per group in every column of a data frame, one can use sapply/lapply in combination with tapply. Suppose we want to calculate the mean per group of every column in the data frame cars, then we can use the following code:
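The manual's own code for this step is cut off in the snippet. A possible sketch of the idea follows, using a small made-up data frame in place of cars; the column names, values and the helper mymean are assumptions for illustration.

```r
# Hypothetical stand-in for the cars data frame
df <- data.frame(
  Price   = c(10, 20, 30, 40),
  Mileage = c(30, 20, 25, 15),
  Type    = factor(c("Small", "Small", "Van", "Van"))
)

# tapply computes the mean per group for one column
mymean <- function(column, groups) tapply(column, groups, mean)

# sapply runs it over every numeric column; the extra argument is passed on.
# The result is simplified to a matrix: one row per group, one column per variable
group_means <- sapply(df[c("Price", "Mileage")], mymean, groups = df$Type)
group_means
```

Using lapply instead of sapply would return the same per-group means as a list with one component per column.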
184. starts at row 1, where the first row represents the column headers, then the call to read.xls is simple.

    library(xlsReadWrite)
    myfile <- "C:\\RFiles\\ExcelData.xls"
    mydf <- read.xls(myfile)
    mydf
      Col1 Col2  Col3   Col4
    1   12    A 26919   john
    2   23    A 33077 martin
    3    5    B 31788   adam
    4   56    C 30176  clair

The function read.xls uses the R default to determine if strings (characters) in the Excel data should be converted to factors. There are two ways to import strings as character in R:

    # all string data is converted to character type
    mydf <- read.xls(myfile, stringsAsFactors = FALSE)

    # specify the type of each column
    mydf <- read.xls(
      myfile,
      colClasses = c("numeric", "factor", "isodatetime", "character")
    )

Use the arguments sheet and from to import data from different worksheets and starting rows.

3.3 Databases

There are several R packages that support the import and export of data from databases.

• Package RODBC provides an interface to databases that are ODBC compliant. These include MS SQLServer, MS Access and Oracle.
• Package RMySQL provides an interface to the MySQL database.
• Package RJDBC provides an interface to databases that are JDBC compliant.
• Package RSQLite not only interfaces R with SQLite, it also embeds the SQLite engine in R.

We give a small example to import a table in R from an MS Access database using ODBC. An important step is to set up D
185. List of figures (continued):

• A scatterplot with a title
• Line plot with title; can be created with type = "l" or the curve function
• Different uses of the function plot
• Example distribution plot in R
• Example barplot where the first argument is a matrix
• Example graphs of multi-dimensional data sets
• The different regions of a plot
• The plotting area of this graph is divided with the layout function
• Examples of different symbols and colors in plots
• The graph that results from the previous low-level plot functions
• Graphs resulting from previous code examples of customizing axes
• Trellis plot Price versus Weight for different types
• A trellis plot with two conditioning variables
• Histogram of mileage for different weight classes
• Trellis plot with modified panel function
• Trellis plot adding a least squares line in each panel
• A coplot with two conditioning variables
• A coplot with a smoothing line
• A histogram and a qq-plot of the model residuals to check normality
• Diagnostic plots to check for linearity and for outliers
• Explorative plots giving a first impression of the rela
186. ted and so on. If object has no match, then other expressions is executed. Note that the block other expressions does not have to be present; the switch will return NULL in case object does not match any value. An expression expr1 in the above construction can consist of multiple statements. Each statement should be separated with a ; or placed on a separate line, and the statements should be surrounded by curly brackets.

Simple example: choosing between two calculation methods.

    mycalc <- function(x, method = "ml") {
      switch(method,
        "ml"  = { my.mlmethod(x) },
        "rml" = { my.rmlmethod(x) }
      )
    }

5.3.2 Looping with for, while and repeat

The for, while and repeat constructions are designed to perform loops in R. They have the following forms.

    for (i in for_object) {
      some expressions
    }

In the for loop, some expressions are evaluated for each element i in for_object.

Simple example: a recursive filter.

    arsim <- function(x, phi) {
      for (i in 2:length(x)) {
        x[i] <- x[i] + phi * x[i - 1]
      }
      x
    }
    arsim(1:10, 0.75)
     [1]  1.000000  2.750000  5.062500  7.796875 10.847656
     [6] 14.135742 17.601807 21.201355 24.901016 28.675762

Note that the for_object could be a vector, a matrix, a data frame or a list.

    while (condition) {
      some expressions
    }

In the while loop, some expressions are repeatedly executed until the logical condition is FALSE. Make sure that the condition is FALSE at some stage, otherwise the loop will go
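A small sketch of a while loop whose condition is guaranteed to become FALSE. The function name halve_until and its arguments are made up for this illustration, and it assumes a finite start value and a positive tolerance.

```r
# Halve a value until it drops below a tolerance and count the steps
halve_until <- function(start, tol) {
  steps <- 0
  value <- start
  while (value >= tol) {    # FALSE once value < tol, so the loop terminates
    value <- value / 2
    steps <- steps + 1
  }
  steps
}

halve_until(1, 0.1)   # 1 -> 0.5 -> 0.25 -> 0.125 -> 0.0625, so 4 steps
```

With tol set to 0 or a negative number the condition would stay TRUE for a long time or forever, which is exactly the situation the text warns about.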
187. Figure 8.2: Diagnostic plots to check for linearity and for outliers.

update

This function is used to update a model. Contrary to add1 and drop1, this function returns an object of class lm. The following call updates the cars.lm object. The ~ . + Disp construction adds the Disp variable to whatever model is used in generating the cars.lm object.

    cars.lm2 <- update(cars.lm, ~ . + Disp)
    cars.lm2

    Call:
    lm(formula = Weight ~ Mileage + Disp, data = cars)

    Coefficients:
    (Intercept)      Mileage         Disp
       3748.444        6.916        3.799

8.3.3 Multicollinearity

The linear regression model can be formulated in matrix notation as follows:

    $y = X\beta + \epsilon$

where X has N rows (the number of observations) and p + 1 columns (the number of regression coefficients plus an intercept). Then, for a normally distributed error term $\epsilon$, it can be shown that the least squares estimates $\hat{\beta}$ for the parameter $\beta$ are given by

    $\hat{\beta} = (X^{T} X)^{-1} X^{T} y$    (8.1)

When the matrix X does not have full rank, so its rank is less than p + 1, the matrix $X^{T} X$ in equation 8.1 is singular and an inverse does not exist. This is the case of perfect multicollinearity, which does not happen often in practice. The problem of nearly perfect multicollinearity occurs when $X^{T} X$ is nearly singular. This occurs when two or more regression variables are strongly correlated. Consider the following simulated data:

    x1 <- runif(100, 1
188. term classification tree is used. The tree model produces rules like:

• IF Price < 200 AND Weight < 300 THEN Type is Small
• IF Price > 200 AND Weight > 300 AND Mileage > 23 THEN Type is Van

When the response variable is numeric, the tree is called a regression tree. The model produces rules like:

• IF Price < 200 AND Weight < 300 THEN Mileage = 34.6
• IF Price > 200 AND Weight > 456 AND Type is Van THEN Mileage is 23.8

These rules are constructed from the data by recursively dividing the data into disjoint groups by splitting certain variables. A detailed description of an algorithm is given in [10] and [11]. The basic ingredients of such an algorithm are:

• A measure for the quality of a split.
• A split selection rule: how and which variables do we split?
• A stopping criterion: we need to stop splitting at some stage, before we end up with individual data points.

Compared to linear and logistic regression models, trees have the following advantages:

• Easier to interpret, especially when there is a mix of numeric and factor variables.
• Can model response variables that are factors and have more than two levels.
• More adept at capturing nonadditive behavior.

8.5.1 An example of a tree model

A nice feature of these rules is that they are easily visualized in a tree graph. There are some R packages that deal with trees. For example the p
189. the elements

    solve(A, b)     solve the system Ax = b
    solve(x)        calculate the inverse
    svd(x)          singular value decomposition
    var(x)          covariance matrix of the columns

Table 2.2: Some functions that can be applied on matrices.

A detailed description of these functions can be found in the corresponding help files, which can be accessed by typing, for example, ?diag in the R Console.

2.2.3 Arrays

Arrays are generalizations of vectors and matrices. A vector is a one-dimensional array and a matrix is a two-dimensional array. As with vectors and matrices, all the elements of an array must be of the same data type. An example of an array is the three-dimensional array iris3, which is a built-in data object in R. A three-dimensional array can be regarded as a block of numbers.

    dim(iris3)   # dimensions of iris3
    [1] 50  4  3

All basic arithmetic operations which apply to matrices are also applicable to arrays and are performed on each element.

    test <- iris3 + 2 * iris3

The function array is used to create an array object.

    newarray <- array(c(1:8, 11:18, 111:118), dim = c(2, 4, 3))
    newarray
    , , 1

         [,1] [,2] [,3] [,4]
    [1,]    1    3    5    7
    [2,]    2    4    6    8

    , , 2

         [,1] [,2] [,3] [,4]
    [1,]   11   13   15   17
    [2,]   12   14   16   18

    , , 3

         [,1] [,2] [,3] [,4]
    [1,]  111  113  115  117
    [2,]  112  114  116  118

2.2.4 Data frames

Data frames can also be regarded as an extension to matrices. Data frames can have columns of different data types and are t
190. the following code:

    void arsim(double *x, long *n, double *phi)
    {
        long i;
        for (i = 1; i < *n; i++)
            x[i] = *phi * x[i - 1] + x[i];
    }

Then create a module definition file arsim.def and insert the following text:

    LIBRARY arsim
    EXPORTS
    arsim

This module definition file tells which functions are to be exported by the dll. Now compile the two files to a dll. There are many free and commercial compilers that can create a dll:

• The GNU compiler collection (free), http://www.mingw.org
• lcc (free), http://www.cs.virginia.edu/~lcc-win32
• Borland (the compiler is free, the IDE is commercial)
• Microsoft Visual Studio (commercial)

Let's use lcc to create the dll. Open a DOS box and type in the following:

    lcc arsim.c
    lcclink -dll -nounderscores arsim.obj arsim.def

The compiler created the file arsim.dll, which can now be linked to R. In R, type the following code:

    mydll <- "C:\\DLLLocation\\arsim.dll"
    dyn.load(mydll)
    is.loaded("arsim")
    [1] TRUE

The dll is now linked to R and we can use the .C interface function to call the arsim C function. For convenience we write a wrapper function arsimC that calls the C function.

    arsimC <- function(x, phi) {
      # only return the first component of the list,
      # because the C function only modifies x
      .C("arsim", as.numeric(x), length(x), as.numeric(phi))[[1]]
    }

    tmp <- Sys.time()
    arsimC(rnorm(10000), phi = 0.75)
    Sys.time() - tmp
    Ti
191. the function as.list to print the function as a list.

    as.list(myf)
    $x


    $y


    [[3]]
    {
        temp1 = x + y
        temp2 = x * y
        tmp1/temp2
    }

The result is a list with three components. When we transform the third component to a list, we get:

    as.list(as.list(myf)[[3]])
    [[1]]
    `{`

    [[2]]
    temp1 = x + y

    [[3]]
    temp2 = x * y

    [[4]]
    tmp1/temp2

We can even go further and print the second component of the last list as a list:

    as.list(as.list(as.list(myf)[[3]])[[2]])
    [[1]]
    `=`

    [[2]]
    temp1

    [[3]]
    x + y

9.3 Calling R from SAS

The SAS system provides many routines for data manipulation and data analysis. It may be hard to convince a long-time SAS user to use R for data manipulation or statistics. However, the graphics in R are superior compared to what SAS/GRAPH can offer. Some graphs are unavailable in SAS/GRAPH or very time consuming to program. We will give small examples on how to use R graphs in a SAS session.

9.3.1 The call system and X functions

In SAS there are two ways to call external programs: the call system and the X functions. The following example calls Rcmd BATCH; this will start R as a non-interactive BATCH session. It runs a specified R file with some plotting functions. Create an R file with some plot statements, the file plotR.R. The code in the file will instruct R to export the graph
192. tion between the binary y variable and x variables

List of figures (continued):

8.4 The ROC curve to assess the quality of a logistic regression model
8.5 Plot of the tree: Type is predicted based on Mileage and Price
8.6 Binning the age variable, two intervals in this case
8.7 Survival curve: 10% will develop AIDS before 45 months and 20% before 76 months
8.8 Scatter plot of the martingale residuals
8.9 Three subjects with age 10, 30 and 60
8.10 Scatter plot of our simulated data for nls
8.11 Simulated data and nls predictions
8.12 Hill curves for two sets of parameters
9.1 Result of the specific plot method for class bigMatrix
9.2 Some Lissajous plots
9.3 A small java gui that can call R functions

1 Introduction

1.1 What is R?

While the commercial implementation of S, S-PLUS, is struggling to keep its existing users, the open source version of S, R, has received a lot of attention in the last five years. This is not only because the R system is a free tool: the system has proven to be a very effective tool in data manipulation, data analysis, graphing and developing new functionality. The user community has grown enormously in the last years, and it is an active use
193. ts. For example, the commands, functions and expressions you type in the R console window. Normally the last statement of the function body will be the return value of the function. This can be a vector, a matrix or any other data structure. The following short function meank calculates the mean of a vector x after removing the k percent smallest and the k percent largest elements of the vector.

    meank <- function(x, k) {
      xt <- quantile(x, c(k, 1 - k))
      mean(x[x > xt[1] & x < xt[2]])
    }

Once the function has been created, it can be run.

    test <- rnorm(100)
    meank(test, k = 0.1)
    [1] 0.00175423

The function meank calls two standard functions, quantile and mean. Once meank is created, it can be called from any other function. If you write a short function (a one-liner or two-liner), you can type the function directly in the console window. If you write longer functions, it is more convenient to use a script file. Type the function definition in a script file and run the script file. Note that when you run a script file with a function definition, you only define the function (you create a new object). To actually run it, you need to call the function with the necessary arguments.

You can use your favorite text editor to create or edit functions. Use the function source to evaluate expressions from a file. Suppose meank.txt is a text file saved on your hard disk co
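A self-contained sketch of the define-then-call workflow with source. A temporary file stands in for meank.txt, whose actual location is cut off in the snippet, and the input 1:100 with k = 0.1 is an arbitrary illustration.

```r
# Write the function definition to a script file (temporary file as a stand-in)
script <- tempfile(fileext = ".txt")
writeLines(c(
  "meank <- function(x, k) {",
  "  xt <- quantile(x, c(k, 1 - k))",
  "  mean(x[x > xt[1] & x < xt[2]])",
  "}"
), script)

# source() evaluates the file: this only defines meank, it does not run it
source(script)

# Calling the function with its arguments is what actually runs it
result <- meank(1:100, 0.1)
result   # the 10% and 90% quantiles are 10.9 and 90.1, so this is mean(11:90) = 50.5
```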
194. ts and preferences in R, Starting R.

9.4.1 Defaults and preferences

The function options is used to get and set a wide range of options in R. These options influence the way results are computed and displayed. The function options() lists all options, the function getOption lists one specific option, and options(optionname = value) sets a certain option.

    options()
    $add.smooth
    [1] TRUE

    $check.bounds
    [1] FALSE

    $chmhelp
    [1] TRUE

    $continue
    [1] "+ "

    $contrasts
            unordered           ordered
    "contr.treatment"      "contr.poly"

    $defaultPackages
    [1] "datasets"  "utils"     "grDevices" "graphics"  "stats"     "methods"

    $device
    [1] "windows"

    $digits
    [1] 7

One example is the number of digits that is printed; by default the number is seven. This can be increased.

    sqrt(2)
    [1] 1.414214
    options(digits = 15)
    sqrt(2)
    [1] 1.41421356237310

See the help file of the function options for a complete list of all options.

9.4.2 Starting R

The online help describes precisely which initialization steps are carried out during the start-up of R. Enter ?Startup to see the help file. If you want to start R and set certain options or attach (load) certain packages automatically, then this can be achieved by editing the file Rprofile.site. This file is located in the etc subdirectory of the R installation directory, so something like C:\Program Files\R\R-2.5.0\etc. The following file is just an example file.
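The manual's own example file is not part of this snippet. As a hedged illustration only, an Rprofile.site could contain ordinary R code such as the following; the option values and the choice of package are assumptions.

```r
# Illustrative Rprofile.site content; values are examples, not recommendations

# Set some session defaults
options(digits = 10)                  # print more significant digits
options(show.signif.stars = FALSE)    # no significance stars in model summaries

# .First is called at the end of the start-up sequence
.First <- function() {
  library(splines)                    # attach a package automatically
  cat("Session started at", format(Sys.time()), "\n")
}
```

Because the file is plain R code, it can be tried out in a running session with source before it is placed in the etc directory.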
195. ve model, use:

    y ~ (x1 + x2 + x3)^2 - x2:x3

which is equivalent to

    y ~ x1 + x2 + x3 + x1:x2 + x1:x3

The function I is used to suppress the specific meaning of the operators in a linear regression model. For example, if you want to include a transformed x2 variable in your model, say multiplied by 2, the following formula will not work:

    y ~ x1 + 2*x2

The * operator already has a specific meaning, so you should use the following construction:

    y ~ x1 + I(2*x2)

You should also use the I function when you want to include a centered regression variable in your model. The following formula will work, however it does not return the expected result:

    y ~ x1 + (x2 - constant)

Use the following formula instead:

    y ~ x1 + I(x2 - constant)

8.3.2 Modeling functions

Linear regression models are widely used to model linear relationships between different variables. There are many different functions in R to fit and analyze linear regression models. The main function for linear regression is lm, and its main arguments are:

    lm(formula, data, weights, subset, na.action)

As an example we will use our cars data set to fit the following linear regression model:

    $\mathrm{Weight} = \beta_0 + \beta_1 \times \mathrm{Mileage}$

In R this model is formulated and fitted as follows:

    cars.lm <- lm(Weight ~ Mileage, data = cars)

The result
196. wing so-called confusion matrix.

                              Observed
                              Good               Bad
    Model        Good         TP                 FP
    predicted    Bad          FN                 TN
                              number of goods    number of bads

Table 8.4: Confusion matrix.

Let $n_g$ be the number of observed goods and $n_b$ the number of observed bads, then we have:

1. TP (true positive) is the number of observations for which the model predicted good and that were observed good. True positive rate: $TPR = TP / n_g$.
2. TN (true negative) is the number of observations for which the model predicted bad and that were observed bad. True negative rate: $TNR = TN / n_b$.
3. FP (false positive) is the number of observations for which the model predicted good but that were observed bad. False positive rate: $FPR = 1 - TNR$.
4. FN (false negative) is the number of observations for which the model predicted bad but that were actually observed good. False negative rate: $FNR = 1 - TPR$.

The ROC curve is a parametric curve: for all thresholds $t \in [0, 1]$ the points (FPR, TPR) are calculated. Then these points are plotted. A ROC curve demonstrates several things:

1. It shows the trade-off between sensitivity and specificity: any increase in sensitivity will be accompanied by a decrease in specificity.
2. The closer the curve follows the left-hand border and then the top border of the ROC space, the more accurate the test.
3. The closer the curve comes to the 45-degree diagonal of the ROC space, the less ac
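The rates defined in this entry can be computed directly from observed classes and model scores. The sketch below uses made-up data and a hypothetical helper roc_point; it assumes that an observation is predicted good when its score exceeds the threshold t.

```r
# Hypothetical observed classes and model scores (probability of "good")
observed <- c("good", "good", "good", "bad", "bad", "good", "bad", "bad")
score    <- c(0.9, 0.8, 0.4, 0.6, 0.3, 0.7, 0.2, 0.1)

# One point of the ROC curve for a given threshold t
roc_point <- function(t) {
  predicted_good <- score > t
  TP  <- sum(predicted_good & observed == "good")
  FP  <- sum(predicted_good & observed == "bad")
  n_g <- sum(observed == "good")
  n_b <- sum(observed == "bad")
  c(FPR = FP / n_b, TPR = TP / n_g)
}

roc_point(0.5)   # 3 of 4 goods and 1 of 4 bads score above 0.5

# Points for a grid of thresholds: one row per threshold
thresholds <- seq(0, 1, by = 0.1)
roc <- t(sapply(thresholds, roc_point))
```

Plotting roc[, "FPR"] against roc[, "TPR"] gives the curve; FPR computed as FP / n_b is the same quantity as 1 - TNR in the definitions above.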
197.
    y ~ -1 + x1 + x2

Be aware of the special meaning of the operators *, -, ^ and / in linear model formulae. They are not used for the normal multiplication, subtraction, power and division.

The : operator is used to model interaction terms in linear models. The next formula includes an interaction term between the variable x1 and the variable x2:

    y ~ x1 + x2 + x1:x2

which corresponds to the linear regression model

    $y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_{12} x_1 x_2$

There is a short hand notation for the above formula, which is given by

    y ~ x1*x2

In general, x1*x2*...*xp is a short hand notation for the model that includes all single terms, order 2 interactions, order 3 interactions, ..., order p interactions. To see all the terms that are generated, use the terms function.

    myform <- y ~ x1*x2*x3*x4
    terms(myform)
    # ignoring some other output generated by terms
    attr(,"term.labels")
     [1] "x1"          "x2"          "x3"          "x4"
     [5] "x1:x2"       "x1:x3"       "x2:x3"       "x1:x4"
     [9] "x2:x4"       "x3:x4"       "x1:x2:x3"    "x1:x2:x4"
    [13] "x1:x3:x4"    "x2:x3:x4"    "x1:x2:x3:x4"

The ^ operator is used to generate interaction terms up to a certain order.

    y ~ (x1 + x2 + x3)^2

The above formula is equivalent to

    y ~ x1 + x2 + x3 + x1:x2 + x2:x3 + x1:x3

The - operator is used to leave out terms in a formula. We have already seen that -1 removes the intercept in a regression formula. For example, to leave out a specific interaction term in the abo
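The effect of the - operator can be checked with the terms function, without fitting a model. In this sketch the interaction x2:x3 is picked as the term to drop purely for illustration, and the object names are assumptions.

```r
# All main effects and order 2 interactions
f_full    <- y ~ (x1 + x2 + x3)^2
# The same formula with one interaction removed
f_reduced <- y ~ (x1 + x2 + x3)^2 - x2:x3

labels_full    <- attr(terms(f_full), "term.labels")
labels_reduced <- attr(terms(f_reduced), "term.labels")

labels_full
labels_reduced
setdiff(labels_full, labels_reduced)   # the term that was left out
```

Inspecting term.labels this way is a quick check that a formula expands to the model that was intended.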