代谢组学 PCA PLS-DA OPLS-DA 在R语言的实现
主成分分析(Principal Component Analysis,PCA)是一种无监督降维方法,能够有效对高维数据进行处理。但PCA对相关性较小的变量不敏感,而PLS-DA(Partial Least Squares-Discriminant Analysis,偏最小二乘判别分析)能够有效解决这个问题。而OPLS-DA(正交偏最小二乘判别分析)结合了正交信号和PLS-DA来筛选差异变量。
图片来自:Orthogonal Partial Least Squares (OPLS) in R | R-bloggers
安装和加载包
-
# install ropls
-
if (F) {
-
if (!requireNamespace("BiocManager", quietly = TRUE))
-
install.packages("BiocManager")
-
-
BiocManager::install("ropls")
-
}
-
-
# load packages
-
library(ropls)
-
library(ggplot2)
-
library(ggsci)
-
library(Cairo)
-
library(tidyverse)
-
library(extrafont)
-
loadfonts()
示例数据
复制成功
r
-
# load data
-
data(sacurine)
-
names(sacurine)
-
-
# view data information
-
attach(sacurine)
-
strF(dataMatrix)
-
strF(sampleMetadata)
-
strF(variableMetadata)
这个数据集是不同年龄、性别和BMI的183个人的尿液中109种代谢物的浓度差异$^{[1]}$。下面的分析主要以性别为变量来研究不同性别人群尿液种代谢物的差异。下面的分析主要包含PCA、PLS-DA和OPLS-DA。
PCA分析
复制成功
r
-
# PCA analysis
-
pca = opls(dataMatrix)
-
genderFc = sampleMetadata[, "gender"]
-
-
pdf(file = 'figures/PCA.pdf', width = 5, height = 5)
-
plot(pca, typeVc = "x-score",
-
parAsColFcVn = genderFc, parEllipsesL = TRUE)
-
dev.off()
可以看到的是如果用PCA的话,不同性别的人群是混在一起的。
PLS-DA
复制成功
r
-
# PLSDA analysis
-
plsda = opls(dataMatrix,genderFc)
-
-
# sample scores plot
-
sample.score = plsda@scoreMN %>%
-
as.data.frame() %>%
-
mutate(gender = sacurine[["sampleMetadata"]][["gender"]])
-
-
p1 = ggplot(sample.score, aes(p1, p2, color = gender))
-
geom_hline(yintercept = 0, linetype = 'dashed', size = 0.5)
-
geom_vline(xintercept = 0, linetype = 'dashed', size = 0.5)
-
geom_point()
-
geom_point(aes(-10,-10), color = 'white')
-
labs(x = 'P1(10.0%)',y = 'P2(9%)')
-
stat_ellipse(level = 0.95, linetype = 'solid',
-
size = 1, show.legend = FALSE)
-
scale_color_manual(values = c('#008000','#FFA74F'))
-
theme_bw()
-
theme(legend.position = c(0.9,0.8),
-
legend.text = element_text(color = 'black',size = 12, family = 'Arial', face = 'plain'),
-
panel.background = element_blank(),
-
panel.grid = element_blank(),
-
axis.text = element_text(color = 'black',size = 15, family = 'Arial', face = 'plain'),
-
axis.title = element_text(color = 'black',size = 15, family = 'Arial', face = 'plain'),
-
axis.ticks = element_line(color = 'black'))
-
ggsave(p1, filename = 'figures/pls.pdf',
-
width = 5, height = 5, device = cairo_pdf)
和PCA相比,PLS-DA的效果相对较好。PLS-DA分析的目的是找到差异变量(本例中的109种代谢物的某几种)。因此,需要找到VIP值大于1的变量(代谢物):
复制成功
r
-
# VIP scores plot
-
vip.score = as.data.frame(plsda@vipVn)
-
colnames(vip.score) = 'vip'
-
vip.score$metabolites = rownames(vip.score)
-
vip.score = vip.score[order(-vip.score$vip),]
-
vip.score$metabolites = factor(vip.score$metabolites,
-
levels = vip.score$metabolites)
-
-
loading.score = plsda@loadingMN %>% as.data.frame()
-
loading.score$metabolites = rownames(loading.score)
-
-
all.score = merge(vip.score, loading.score, by = 'metabolites')
-
-
all.score$cat = paste('A',1:nrow(all.score), sep = '')
-
-
p2 = ggplot(all.score[all.score$vip >= 1,], aes(cat, vip))
-
geom_segment(aes(x = cat, xend = cat,
-
y = 0, yend = vip))
-
geom_point(shape = 21, size = 5, color = '#008000' ,fill = '#008000')
-
geom_point(aes(1,2.5), color = 'white')
-
geom_hline(yintercept = 1, linetype = 'dashed')
-
scale_y_continuous(expand = c(0,0))
-
labs(x = '', y = 'VIP value')
-
theme_bw()
-
theme(legend.position = 'none',
-
legend.text = element_text(color = 'black',size = 12, family = 'Arial', face = 'plain'),
-
panel.background = element_blank(),
-
panel.grid = element_blank(),
-
axis.text = element_text(color = 'black',size = 15, family = 'Arial', face = 'plain'),
-
axis.text.x = element_text(angle = 90),
-
axis.title = element_text(color = 'black',size = 15, family = 'Arial', face = 'plain'),
-
axis.ticks = element_line(color = 'black'),
-
axis.ticks.x = element_blank())
-
ggsave(p2, filename = 'figures/pls_VIP.pdf',
-
width = 8, height = 5, device = cairo_pdf)
下面这些物质就是差异代谢物(VIP值大于等于1):
r
-
[1] (gamma)Glu-Leu/Ile
-
[2] 2-acetamido-4-methylphenyl acetate
-
[3] 2-Methylhippuric acid
-
[4] 3-Methylcrotonylglycine
-
[5] 3,4-Dihydroxybenzeneacetic acid
-
[6] 3,5-dihydroxybenzoic acid/3,4-dihydroxybenzoic acid
-
[7] 4-Acetamidobutanoic acid isomer 3
-
[8] 6-(carboxymethoxy)-hexanoic acid
-
[9] Acetaminophen glucuronide
-
[10] Acetylphenylalanine
-
[11] alpha-N-Phenylacetyl-glutamine
-
[12] Asp-Leu/Ile isomer 1
-
[13] Citric acid
-
[14] Dehydroepiandrosterone 3-glucuronide
-
[15] Dehydroepiandrosterone sulfate
-
[16] Gluconic acid and/or isomers
-
[17] Glucuronic acid and/or isomers
-
[18] Glyceric acid
-
[19] Hippuric acid
-
[20] Hydroxybenzyl alcohol isomer
-
[21] Malic acid
-
[22] Methyl (hydroxymethyl)pyrrolidine-carboxylate/Methyl (hydroxy)piperidine-carboxylate
-
[23] Monoethyl phthalate
-
[24] N-Acetyl-aspartic acid
-
[25] Oxoglutaric acid
-
[26] p-Anisic acid
-
[27] p-Hydroxyhippuric acid
-
[28] Pantothenic acid
-
[29] Pentose
-
[30] Phe-Tyr-Asp (and isomers)
-
[31] Pyruvic acid
-
[32] Testosterone glucuronide
-
[33] Threonic acid/Erythronic acid
-
[34] Valerylglycine isomer 1
-
[35] Valerylglycine isomer 2
将它们的VIP值得进行可视化(某些代谢物名称太长,进行转换表示):
OPLS-DA分析
复制成功
r
-
# OPLS-DA analysis
-
oplsda = opls(dataMatrix, genderFc, predI = 1, orthoI = NA)
-
-
# sample scores plot
-
sample.score = oplsda@scoreMN %>%
-
as.data.frame() %>%
-
mutate(gender = sacurine[["sampleMetadata"]][["gender"]],
-
o1 = oplsda@orthoScoreMN[,1])
-
-
p3 = ggplot(sample.score, aes(p1, o1, color = gender))
-
geom_hline(yintercept = 0, linetype = 'dashed', size = 0.5)
-
geom_vline(xintercept = 0, linetype = 'dashed', size = 0.5)
-
geom_point()
-
#geom_point(aes(-10,-10), color = 'white')
-
labs(x = 'P1(5.0%)',y = 'to1')
-
stat_ellipse(level = 0.95, linetype = 'solid',
-
size = 1, show.legend = FALSE)
-
scale_color_manual(values = c('#008000','#FFA74F'))
-
theme_bw()
-
theme(legend.position = c(0.1,0.85),
-
legend.title = element_blank(),
-
legend.text = element_text(color = 'black',size = 12, family = 'Arial', face = 'plain'),
-
panel.background = element_blank(),
-
panel.grid = element_blank(),
-
axis.text = element_text(color = 'black',size = 15, family = 'Arial', face = 'plain'),
-
axis.title = element_text(color = 'black',size = 15, family = 'Arial', face = 'plain'),
-
axis.ticks = element_line(color = 'black'))
-
ggsave(p3, filename = 'figures/opls.pdf',
-
width = 5, height = 5, device = cairo_pdf)
可以看到的是OPLS-DA的效果比PLS-DA更好一些。
同样进行差异代谢物筛选:
复制成功
r
-
# VIP scores plot
-
vip.score = as.data.frame(oplsda@vipVn)
-
colnames(vip.score) = 'vip'
-
vip.score$metabolites = rownames(vip.score)
-
vip.score = vip.score[order(-vip.score$vip),]
-
vip.score$metabolites = factor(vip.score$metabolites,
-
levels = vip.score$metabolites)
-
-
loading.score = oplsda@loadingMN %>% as.data.frame()
-
loading.score$metabolites = rownames(loading.score)
-
-
all.score = merge(vip.score, loading.score, by = 'metabolites')
-
-
all.score$cat = paste('A',1:nrow(all.score), sep = '')
-
-
p4 = ggplot(all.score[all.score$vip >= 1,], aes(cat, vip))
-
geom_segment(aes(x = cat, xend = cat,
-
y = 0, yend = vip))
-
geom_point(shape = 21, size = 5, color = '#008000' ,fill = '#008000')
-
geom_point(aes(1,2.5), color = 'white')
-
geom_hline(yintercept = 1, linetype = 'dashed')
-
scale_y_continuous(expand = c(0,0))
-
labs(x = '', y = 'VIP value')
-
theme_bw()
-
theme(legend.position = 'none',
-
legend.text = element_text(color = 'black',size = 12, family = 'Arial', face = 'plain'),
-
panel.background = element_blank(),
-
panel.grid = element_blank(),
-
axis.text = element_text(color = 'black',size = 15, family = 'Arial', face = 'plain'),
-
axis.text.x = element_text(angle = 90),
-
axis.title = element_text(color = 'black',size = 15, family = 'Arial', face = 'plain'),
-
axis.ticks = element_line(color = 'black'),
-
axis.ticks.x = element_blank())
-
p4
-
ggsave(p4, filename = 'figures/opls_VIP.pdf',
-
width = 8, height = 5, device = cairo_pdf)
下面的的是差异代谢物:
r
-
[1] (gamma)Glu-Leu/Ile
-
[2] 2-acetamido-4-methylphenyl acetate
-
[3] 2-Methylhippuric acid
-
[4] 3-Methylcrotonylglycine
-
[5] 3,4-Dihydroxybenzeneacetic acid
-
[6] 3,5-dihydroxybenzoic acid/3,4-dihydroxybenzoic acid
-
[7] 4-Acetamidobutanoic acid isomer 3
-
[8] 6-(carboxymethoxy)-hexanoic acid
-
[9] Acetaminophen glucuronide
-
[10] Acetylphenylalanine
-
[11] alpha-N-Phenylacetyl-glutamine
-
[12] Asp-Leu/Ile isomer 1
-
[13] Citric acid
-
[14] Dehydroepiandrosterone 3-glucuronide
-
[15] Dehydroepiandrosterone sulfate
-
[16] Gluconic acid and/or isomers
-
[17] Glucuronic acid and/or isomers
-
[18] Glyceric acid
-
[19] Hippuric acid
-
[20] Hydroxybenzyl alcohol isomer
-
[21] Malic acid
-
[22] Methyl (hydroxymethyl)pyrrolidine-carboxylate/Methyl (hydroxy)piperidine-carboxylate
-
[23] Monoethyl phthalate
-
[24] N-Acetyl-aspartic acid
-
[25] Oxoglutaric acid
-
[26] p-Anisic acid
-
[27] p-Hydroxyhippuric acid
-
[28] Pantothenic acid
-
[29] Pentose
-
[30] Phe-Tyr-Asp (and isomers)
-
[31] Pyruvic acid
-
[32] Testosterone glucuronide
-
[33] Threonic acid/Erythronic acid
-
[34] Valerylglycine isomer 1
-
[35] Valerylglycine isomer 2
模型训练与预测
这些降维的方法都属于机器学习算法,那就可以对模型进行训练,并用这个模型去预测未知的数据。
模型训练很简单:
复制成功
r
-
# model training
-
oplsda.2 = opls(dataMatrix, genderFc, predI = 1, orthoI = NA,subset = "odd")
r
-
OPLS-DA
-
92 samples x 109 variables and 1 response
-
standard scaling of predictors and response(s)
-
R2X(cum) R2Y(cum) Q2(cum)
-
Total 0.26 0.825 0.608
-
RMSEE RMSEP pre ort
-
Total 0.213 0.341 1 2
-
Warning message:
-
'permI' set to 0 because train/test partition is selected.
可以看到使用了92个样本进行训练。
先看看模型在训练集上的准确率:
复制成功
r
-
# 模型在训练集上的准确率
-
trainVi = getSubsetVi(oplsda.2)
-
tab = table(genderFc[trainVi], fitted(oplsda.2))
-
print(paste('模型准确率:',round(sum(diag(tab))/sum(tab)*100, 2),'%', sep = ''))
r
-
M F
-
M 50 0
-
F 0 42
-
[1] "模型准确率:100%"
看模型在训练集上的准确率是没有什么意义的,要是在训练集的表现都不好,那模型一定不好。那看看模型在未知数据上的预测准确率吧:
复制成功
r
-
# model on test data
-
tab2 = table(genderFc[-trainVi],predict(oplsda.2, dataMatrix[-trainVi, ]))
-
print(paste('模型准确率:',round(sum(diag(tab2))/sum(tab2)*100, 2),'%', sep = ''))
r
-
M F
-
M 43 7
-
F 7 34
-
[1] "模型准确率:84.62%"
这个准确率已经很棒了。
差异代谢物其他筛选方法
除了用(O)PLS-DA中的VIP值对代谢物进行筛选,还可以用别的方法进行筛选,如log2FC
等。通常是绘制火山图。
复制成功
r
-
# volcano plot
-
df = dataMatrix %>% as.data.frame()
-
df$gender = sacurine[["sampleMetadata"]][["gender"]]
-
df = df[order(df$gender),]
-
df = df[,-110]
-
-
M.mean = apply(df[1:100,],2,FUN = mean)
-
F.mean = apply(df[101:183,],2,FUN = mean)
-
-
FC = M.mean / F.mean
-
log2FC = log(FC,2)
-
-
pvalue = apply(df, 2, function(x)
-
{t.test(x[1:100],x[101:183])$p.value})
-
-
p.adj = p.adjust(pvalue, method = 'BH')
-
p.adj.log = -log10(p.adj)
-
-
colcano.df = data.frame(log2FC,p.adj, p.adj.log)
-
colcano.df$cat = ifelse(colcano.df$log2FC >= 1 & colcano.df$p.adj < 0.05,'Up',
-
ifelse(colcano.df$log2FC <= -1 & colcano.df$p.adj < 0.05,'Down','NS'))
-
-
p5 = ggplot(colcano.df, aes(log2FC, p.adj.log))
-
geom_point()
-
labs(y = '-log10(p-value.adj)')
-
theme_bw()
-
theme(legend.position = 'none',
-
legend.text = element_text(color = 'black',size = 12, family = 'Arial', face = 'plain'),
-
panel.background = element_blank(),
-
panel.grid = element_blank(),
-
axis.text = element_text(color = 'black',size = 15, family = 'Arial', face = 'plain'),
-
axis.text.x = element_text(angle = 90),
-
axis.title = element_text(color = 'black',size = 15, family = 'Arial', face = 'plain'),
-
axis.ticks = element_line(color = 'black'),
-
axis.ticks.x = element_blank())
-
p5
-
ggsave(p5, filename = '20201214PLSDA分析/figures/volcano.pdf',
-
width = 5, height = 5, device = cairo_pdf)
可能是这个例子中的代谢物太少了,导致算完以后都没有差异代谢物了。
参考文献
[1] Thévenot E A, Roux A, Xu Y, et al. Analysis of the human adult urinary metabolome variations with age, body mass index, and gender by implementing a comprehensive workflow for univariate and OPLS statistical analyses[J]. Journal of proteome research, 2015, 14(8): 3322-3335.
这篇好文章是转载于:学新通技术网
- 版权申明: 本站部分内容来自互联网,仅供学习及演示用,请勿用于商业和其他非法用途。如果侵犯了您的权益请与我们联系,请提供相关证据及您的身份证明,我们将在收到邮件后48小时内删除。
- 本站站名: 学新通技术网
- 本文地址: /boutique/detail/tanhihcfgi
-
photoshop保存的图片太大微信发不了怎么办
PHP中文网 06-15 -
photoshop扩展功能面板显示灰色怎么办
PHP中文网 06-14 -
word里面弄一个表格后上面的标题会跑到下面怎么办
PHP中文网 06-20 -
《学习通》视频自动暂停处理方法
HelloWorld317 07-05 -
TikTok加速器哪个好免费的TK加速器推荐
TK小达人 10-01 -
Android 11 保存文件到外部存储,并分享文件
Luke 10-12 -
excel下划线不显示怎么办
PHP中文网 06-23 -
微信公众号没有声音提示怎么办
PHP中文网 03-31 -
excel图片置于文字下方的方法
PHP中文网 06-27 -
微信运动停用后别人还能看到步数吗
PHP中文网 07-22