R语言利用edgeR package进行基因差异表达分析举例-数据分析

R语言利用edgeR package进行基因差异表达分析举例

实验数据：

同一组织，分为两组，control vs treat，每组7例sample。数据第一列为基因名，后14列为对应的count。

##bioconductor和edgeR包的安装

source(“http://bioconductor.org/biocLite.R”)

biocLite(“edgeR”)

library(“limma”)

library(“edgeR”)

##读取数据，方法随意

rawdata<-read.delim(“2.txt”,header=T)

head(rawdata) #检查读入是否正确

y<-DGEList(counts=rawdata[,2:15],genes=rawdata[,1])

##过滤与标准化

left<-rowSums(cpm(y)>1)>=4 #过滤标准为至少one count per million (cpm)

y<-y[left,]

y<-DGEList(counts=y$counts,genes=y$genes)

y<-calcNormFactors(y)#默认为TMM标准化

##检查样本的outlier and relationship

y<-plotMDS(y)

##设计design matrix

group<-factor(c(‘H’,’H’,’H’,’H’,’H’,’H’,’H’,’M’,’M’,’M’,’M’,’M’,’M’,’M’))

design <- model.matrix(~group)

y<-DGEList(counts=rawdata[,2:15],genes=rawdata[,1])

##推测dispersion（离散度）

y<-estimateGLMCommonDisp(y,design,verbose=TRUE)

y<-estimateGLMTrendedDisp(y, design)

y<-estimateGLMTagwiseDisp(y, design)

##差异表达基因，to perform quasi-likelihood F-tests:

fit <- glmQLFit(y,design)

qlf <- glmQLFTest(fit,coef=2)

topTags(qlf)#前10个差异表达基因

##or 差异表达基因，to perform likelihood ratio tests:

fit<-glmFit(y, design)

lrt<-glmLRT(fit)

topTags(lrt)#前10个差异表达基因

##火山图

summary(de<-decideTestsDGE(qlf))##qlf或可改为lrt

detags<-rownames(y)[as.logical(de)]

plotSmear(qlf, de.tags=detags)

abline(h=c(-4,4),col=’blue’) #蓝线为2倍差异表达基因，差异表达的数据在qlf中