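For beginners, the factor type in R can be hard to understand. Translating the word literally into Chinese (as "因子") only makes it harder to grasp, so I prefer to leave it untranslated, simply call it a factor, and build up an understanding through a few examples: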
- data <- c(1, 2, 2, 3, 1, 2, 3, 3, 1, 2, 3, 3, 1)
- data
Output:
- [1] 1 2 2 3 1 2 3 3 1 2 3 3 1
Then run:
- fdata <- factor(data)
- fdata
Output:
- [1] 1 2 2 3 1 2 3 3 1 2 3 3 1
- Levels: 1 2 3
Next, check the class of each object:
- class(fdata)
- [1] "factor"
- class(data)
- [1] "numeric"
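As you can see, factor() converts the original numeric vector into a factor. A factor has the notion of levels: the set of distinct values in the factor, with no duplicates. Put differently, the levels are the factor's values after deduplication, sorting, and conversion to character strings, because levels are always of type character: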
- levels(fdata)
- [1] "1" "2" "3"
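The relationship between a vector and its default levels can be checked directly. The following is a minimal sketch that rebuilds the levels by hand; the name `manual_levels` is introduced here only for illustration.

```r
# Same vector as in the example above.
data <- c(1, 2, 2, 3, 1, 2, 3, 3, 1, 2, 3, 3, 1)
fdata <- factor(data)

# Build the levels by hand: distinct values, sorted, then converted to character.
manual_levels <- as.character(sort(unique(data)))

levels(fdata)                             # "1" "2" "3"
is.character(levels(fdata))               # TRUE
identical(levels(fdata), manual_levels)   # TRUE
nlevels(fdata)                            # 3
```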
When creating a factor, we can name its levels through the labels argument. Continuing the program above:
- rdata <- factor(data, labels = c("I", "II", "III"))
- rdata
Output:
- [1] I   II  II  III I   II  III III I   II  III III I
- Levels: I II III
The levels can also be changed after the factor has been created, using the levels function:
- levels(rdata) <- c("e", "ee", "eee")
- rdata
Output:
- [1] e   ee  ee  eee e   ee  eee eee e   ee  eee eee e
- Levels: e ee eee
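At this point the reason factors have levels becomes clear: a factor is a compact way to store data. Each distinct value is stored only once, in the levels, and the data themselves are stored as integer codes that point into the levels. This is why read.table() used to convert character columns to factors by default. Since R 4.0.0 the default is stringsAsFactors = FALSE, so pass stringsAsFactors = TRUE (or set colClasses) if you want factors.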
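The integer codes behind a factor can be inspected directly. The sketch below also calls read.table() on a tiny inline table; the table contents and the names `txt`, `df_fac`, and `df_chr` are made up for illustration, and stringsAsFactors is passed explicitly so the result does not depend on the R version.

```r
data <- c(1, 2, 2, 3, 1, 2, 3, 3, 1, 2, 3, 3, 1)
rdata <- factor(data, labels = c("I", "II", "III"))

typeof(rdata)        # "integer": the data are stored as integer codes
as.integer(rdata)    # 1 2 2 3 1 2 3 3 1 2 3 3 1
levels(rdata)        # "I" "II" "III": each distinct label is stored once

# Indexing the levels by the codes rebuilds the original labels.
identical(levels(rdata)[as.integer(rdata)], as.character(rdata))   # TRUE

# Hypothetical inline data, read with and without conversion to factor.
txt <- "name score\na 1\nb 2\na 3"
df_fac <- read.table(text = txt, header = TRUE, stringsAsFactors = TRUE)
df_chr <- read.table(text = txt, header = TRUE, stringsAsFactors = FALSE)
class(df_fac$name)   # "factor"
class(df_chr$name)   # "character"
```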
Factors can also define an order for the data:
- mons <- c("March", "April", "January", "November", "January", "September", "October", "September", "November", "August", "January", "November", "November", "February", "May", "August", "July", "December", "August", "August", "September", "November", "February", "April")
- mons <- factor(mons)
- table(mons)
Output:
- mons
-     April    August  December  February   January      July     March       May  November
-         2         4         1         2         3         1         1         1         5
-   October September
-         1         3
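The levels above are sorted alphabetically, but months clearly have a natural order, so we can specify that order for the factor:
- mons <- factor(mons, levels = c("January", "February", "March", "April", "May", "June", "July", "August", "September", "October", "November", "December"), ordered = TRUE)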
Now run:
- table(mons)
- mons
-   January  February     March     April       May      June
-         3         2         1         2         1         0
-      July    August September   October  November  December
-         1         4         3         1         5         1
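An ordered factor does more than change how table() prints: its values can be compared and sorted in level order. The following is a minimal sketch using only the first four months from the data above, and R's built-in `month.name` constant in place of the hand-written list of month names.

```r
# A short subset of the months, ordered by the calendar via month.name.
mons <- factor(c("March", "April", "January", "November"),
               levels = month.name, ordered = TRUE)

is.ordered(mons)            # TRUE
mons[1] < mons[2]           # TRUE: March comes before April
as.character(sort(mons))    # "January" "March" "April" "November"
as.character(max(mons))     # "November"

# Levels that never occur still appear in the table, with a count of 0.
table(mons)[["June"]]       # 0
```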
Take care when converting between numeric values and factors:
- fert <- c(10, 20, 20, 50, 10, 20, 10, 50, 20)
- mean(fert)
- [1] 23.33333
After converting to a factor:
- mean(factor(fert))
- [1] NA
- Warning message:
- In mean.default(factor(fert)) : argument is not numeric or logical: returning NA
So can we simply convert it back with as.numeric()?
- mean(as.numeric(factor(fert)))
- [1] 1.888889
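That result is wrong! as.numeric() returns the factor's internal integer codes (1, 2, 3), not the original values (10, 20, 50).
The conversion back has to go through the levels: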
- ff <- factor(fert)
- mean(as.numeric(levels(ff)[ff]))
- [1] 23.33333
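Another common way to recover the values is to go through as.character() first. The sketch below compares the two routes on the same data; `via_levels` and `via_chars` are names chosen here for illustration.

```r
fert <- c(10, 20, 20, 50, 10, 20, 10, 50, 20)
ff <- factor(fert)

as.integer(ff)    # 1 2 2 3 1 2 1 3 2: the internal codes, not the values

# Route 1 (as in the text): index the levels by the factor, then convert.
via_levels <- as.numeric(levels(ff)[ff])

# Route 2: convert each element to its label, then to a number.
via_chars <- as.numeric(as.character(ff))

identical(via_levels, via_chars)   # TRUE
identical(via_chars, fert)         # TRUE
mean(via_chars)                    # same value as mean(fert)
```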