R Notes: Dummy Variables
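Reposted from the statistics study notes of the personal WeChat public account 【Memo_Cleon】: R Notes: Dummy Variables.

Dummy variables, also called indicator variables, are an important concept in regression. Introducing them makes a regression model more complex, but the description of the problem becomes more concise and closer to reality.

For a binary variable, the only values it actually takes in the model are "0" and "1", so the result is the same whether it enters the model as a continuous variable or as a dummy variable; the only difference is whether 0 or 1 serves as the reference level. For an unordered multi-category variable, the size of the assigned codes says nothing about order or degree among the categories, so the variable has to be set up as dummy variables. A categorical variable with n levels becomes n-1 dummy variables, and the results look as if the variable had been split into n-1 binary variables. For an ordered multi-category variable, whether to enter it as dummy variables or as a continuous variable depends on the situation; fit both models and compare them before deciding.

Dummy variables follow the "all in or all out" principle: within one model, the dummy variables that belong to the same multi-category variable are either all included or all excluded.

When running a regression in R, most functions treat character and factor variables as dummy variables directly, which is convenient because no separate dummy-coding step is needed. Note also that these functions usually take the lowest level of a factor as the reference level, whereas SPSS takes the highest level by default. This means we should make full use of the levels attribute when defining a factor; see the post <<Factors>> for details.

Even though regression in R rarely requires us to create dummy variables for a multi-category variable by hand, it is worth taking a moment to see how it is done. Many functions can create dummy variables, such as dummy.c {misty}, model.matrix {stats}, dummy {dummies} and class.ind {nnet}. We demonstrate dummy.c and model.matrix.

Data: <<Probit Regression for Data with a Binary Dependent Variable>>

dummy.c {misty}: creates k - 1 dummy coded 0/1 variables for a vector with k distinct values.

dummy.c(x, ref = NULL, names = "d", as.na = NULL, check = TRUE)

x: a numeric vector with integer values, character vector or factor.
ref: a numeric value or character string indicating the reference group. By default, the last category is selected as reference group.
names: a character string or character vector indicating the names of the dummy variables. By default, variables are named "d" with the category compared to the reference category (e.g., "d1" and "d2"). Variable names can be specified using a character string (e.g., names = "dummy_" leads to dummy_1 and dummy_2) or a character vector matching the number of dummy coded variables (e.g. names = c("x.3_1", "x.3_2")) which is the number of unique categories minus one.
as.na: a numeric vector indicating user-defined missing values, i.e. these values are converted to NA before conducting the analysis.
check: logical: if TRUE, argument specification is checked.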
library(readxl)
dumv <- read_excel("D:/Temp/bsrdata.xlsx")  # read the example data
library(misty)
# Dummy-code race, using level 1 as the reference group
dummy.c(dumv$race, ref = 1, names = "race_d")
The variable race in dumv has now been dummy coded. The first level is the reference level, and the two dummy variables produced are named race_d2 and race_d3.

model.matrix {stats}: creates a design (or model) matrix, e.g., by expanding factors to a set of dummy variables (depending on the contrasts) and expanding interactions similarly. In other words, it can convert the factor variables in an object into 0/1 dummy variables.

model.matrix(object, ...)
model.matrix(object, data = environment(object), contrasts.arg = NULL, xlev = NULL, ...)

object: an object of an appropriate class. For the default method, a model formula or a terms object.
data: a data frame created with model.frame. If another sort of object, model.frame is called first.
contrasts.arg: a list, whose entries are values (numeric matrices, functions or character strings naming functions) to be used as replacement values for the contrasts replacement function and whose names are the names of columns of data containing factors.
xlev: to be used as argument of model.frame if data is such that model.frame is called.
...: further arguments passed to or from other methods.
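Before turning to the real data, a minimal sketch on made-up data may help show how factor levels decide the reference level in model.matrix. The data frame toy, its values and the column race_last are hypothetical and are not taken from the bsrdata file used in this note; relevel() is a base R function.

```r
# Toy data (made-up values): race is an unordered variable coded 1, 2, 3
toy <- data.frame(
  bwt  = c(2500, 3100, 2800, 3300, 2950, 2700),
  race = factor(c(1, 2, 3, 1, 2, 3))
)

# Default behaviour: the first level ("1") is the reference level,
# so the design matrix gets dummy columns for levels 2 and 3 only
mm_first <- model.matrix(bwt ~ race, data = toy)
colnames(mm_first)

# Move level "3" to the front so that it becomes the reference level,
# which mimics the SPSS default of using the highest level
toy$race_last <- relevel(toy$race, ref = "3")
mm_last <- model.matrix(bwt ~ race_last, data = toy)
colnames(mm_last)
```

Rows that belong to the reference level have 0 in every dummy column, so changing the level order changes which group the coefficients are compared against, not the fit itself.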
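library(readxl)
dumv <- read_excel("D:/Temp/bsrdata.xlsx")
# Convert the categorical variables to factors
dumv$race <- factor(dumv$race)
dumv$smoke <- factor(dumv$smoke)
dumv$ht <- factor(dumv$ht)
dumv$ui <- factor(dumv$ui)
# Build the design matrix; every factor is expanded into 0/1 dummy variables
dumlized <- model.matrix(bwt ~ age + lwt + race + smoke + ptl + ht + ui + ftv, data = dumv)
dumlized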
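All factor variables in the object have now been converted into dummy variables.

Dummy coding usually takes the values 0 and 1, and SPSS commonly handles factors with dummy coding. Besides dummy coding, another frequently used scheme is effect coding, which takes the values 1, 0 and -1. Because the coding scheme changes what each parameter means, the parameter estimates differ as well. JMP commonly uses effect coding for factors.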
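The difference between the two coding schemes can be sketched in R with the contrasts.arg argument of model.matrix. This is a minimal illustration on made-up data, assuming that the built-in contr.sum contrast is an acceptable stand-in for effect coding; the data frame toy and its values are hypothetical.

```r
# Toy data (made-up values): race is an unordered variable coded 1, 2, 3
toy <- data.frame(
  bwt  = c(2500, 3100, 2800, 3300, 2950, 2700),
  race = factor(c(1, 2, 3, 1, 2, 3))
)

# Dummy (treatment) coding: R's default for unordered factors, values 0/1
dummy_mm <- model.matrix(bwt ~ race, data = toy)

# Effect (sum-to-zero) coding: values 1, 0 and -1
effect_mm <- model.matrix(bwt ~ race, data = toy,
                          contrasts.arg = list(race = "contr.sum"))

# One row per distinct coding pattern
unique(dummy_mm)
unique(effect_mm)

# The two codings give different coefficients but the same fitted values
fit_dummy  <- lm(bwt ~ race, data = toy)
fit_effect <- lm(bwt ~ race, data = toy, contrasts = list(race = "contr.sum"))
coef(fit_dummy)
coef(fit_effect)
all.equal(fitted(fit_dummy), fitted(fit_effect))
```

With contr.sum, rows for the last level carry -1 in both columns. The intercept then estimates the unweighted mean of the group means, while under dummy coding it estimates the mean of the reference group.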