Replace NA with the mean of a group/subset?

Question

I have a data frame with the lengths and widths of various arthropods from the guts of salamanders. Because some guts contained thousands of a given prey item, I only measured a subset of each prey type. I now want to replace each unmeasured individual with the mean length and width for that prey type. I want to keep the data frame and just add imputed columns (length2, width2), mainly because each row also has columns recording the date and location where the salamander was collected. I could fill in the NAs with a random selection of the measured individuals, but for the sake of argument let's assume I just want to replace each NA with the mean.

For example, imagine I have a data frame that looks something like this:

id    taxa        length  width
101   collembola  2.1     0.9
102   mite        0.9     0.7
103   mite        1.1     0.8
104   collembola  NA      NA
105   collembola  1.5     0.5
106   mite        NA      NA

In reality I have more columns, about 25 different taxa, and roughly 30,000 prey items in total. The plyr package seems like it might be ideal for this, but I can't figure out how to do it. I'm not very R or programming savvy, but I'm trying to learn.

Not that I know what I'm doing, but in case it helps, here is a small dataset to play with.

exampleDF <- data.frame(id = 1:100,
  taxa = c(rep("collembola", 50), rep("mite", 25), rep("ant", 25)),
  # NA (not the string "NA") keeps length and width numeric
  length = c(rnorm(40, 1, 0.5), rep(NA, 10), rnorm(20, 0.8, 0.1), rep(NA, 5),
             rnorm(20, 2.5, 0.5), rep(NA, 5)),
  width = c(rnorm(40, 0.5, 0.25), rep(NA, 10), rnorm(20, 0.3, 0.01), rep(NA, 5),
            rnorm(20, 1, 0.1), rep(NA, 5)))

Here are a few things I've tried (that haven't worked):

# mean imputation to recode NA in length and width with means
# (could do random imputation but unnecessary here)
mean.imp <- function(x) {
  missing <- is.na(x)
  n.missing <- sum(missing)
  x.obs <- x[!missing]  # observed (non-missing) values
  imputed <- x
  imputed[missing] <- mean(x.obs)
  return(imputed)
}

mean.imp(exampleDF[exampleDF$taxa == "collembola", "length"])

n.taxa <- length(unique(exampleDF$taxa))
for(i in 1:n.taxa) {
  mean.imp(exampleDF[exampleDF$taxa == unique(exampleDF$taxa)[i], "length"])
} # no way to get back into dataframe in proper places, try plyr? 

Another attempt:

imp.mean <- function(x) {
  a <- mean(x, na.rm = TRUE)
  return (ifelse (is.na(x) == TRUE , a, x)) 
 } # tried but not sure how to use this in ddply

Diet2 <- ddply(exampleDF, .(taxa), transform, length2 = function(x) {
  a <- mean(exampleDF$length, na.rm = TRUE)
  return (ifelse (is.na(exampleDF$length) == TRUE , a, exampleDF$length)) 
  })

Any suggestions?

Accepted answer

#1

Not my own technique; I saw it on the boards a while back:

dat <- read.table(text = "id    taxa        length  width
101   collembola  2.1     0.9
102   mite        0.9     0.7
103   mite        1.1     0.8
104   collembola  NA      NA
105   collembola  1.5     0.5
106   mite        NA      NA", header=TRUE)


library(plyr)
impute.mean <- function(x) replace(x, is.na(x), mean(x, na.rm = TRUE))
dat2 <- ddply(dat, ~ taxa, transform, length = impute.mean(length),
     width = impute.mean(width))

dat2[order(dat2$id), ] #plyr orders by group so we have to reorder
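
The question asks for new columns (length2, width2) rather than overwriting the originals, and for the rows to stay where they are. A minimal base R sketch of that variant, assuming the same toy data and the impute.mean helper from the answer above, could use ave(), which applies a function within each group and returns the results in the original row order:

```r
# Same toy data as in the answer above
dat <- read.table(text = "id    taxa        length  width
101   collembola  2.1     0.9
102   mite        0.9     0.7
103   mite        1.1     0.8
104   collembola  NA      NA
105   collembola  1.5     0.5
106   mite        NA      NA", header = TRUE)

# Replace NAs in a vector with the mean of its observed values
impute.mean <- function(x) replace(x, is.na(x), mean(x, na.rm = TRUE))

# ave() runs FUN separately for each taxa group and keeps the row order,
# so no reordering step is needed afterwards
dat$length2 <- ave(dat$length, dat$taxa, FUN = impute.mean)
dat$width2  <- ave(dat$width,  dat$taxa, FUN = impute.mean)

dat
```

The original length and width columns keep their NAs, so measured and imputed values stay distinguishable. If a taxon has no measured individuals at all, mean(x, na.rm = TRUE) returns NaN for that group, so those rows would need separate handling.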

Edit: a non-plyr approach with a for loop:

for (i in which(sapply(dat, is.numeric))) {
    for (j in which(is.na(dat[, i]))) {
        dat[j, i] <- mean(dat[dat[, "taxa"] == dat[j, "taxa"], i],  na.rm = TRUE)
    }
}

Edit: many moons later, here are a data.table and a dplyr approach:

data.table

library(data.table)
setDT(dat)

dat[, length := impute.mean(length), by = taxa][,
    width := impute.mean(width), by = taxa]
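
With more measurement columns than the two shown here, repeating one := per column gets tedious. As a sketch (the cols and new_cols vectors are names introduced for illustration, not part of the original answer), .SDcols can select the columns to impute and lapply() can apply impute.mean to each one within taxa, writing the results to new columns:

```r
library(data.table)

# Same toy data as in the answer above
dat <- read.table(text = "id    taxa        length  width
101   collembola  2.1     0.9
102   mite        0.9     0.7
103   mite        1.1     0.8
104   collembola  NA      NA
105   collembola  1.5     0.5
106   mite        NA      NA", header = TRUE)
setDT(dat)

# Replace NAs in a vector with the mean of its observed values
impute.mean <- function(x) replace(x, is.na(x), mean(x, na.rm = TRUE))

cols     <- c("length", "width")   # columns to impute (illustrative name)
new_cols <- paste0(cols, "2")      # "length2", "width2"

# .SD holds only the columns listed in .SDcols for the current taxa group;
# := adds the new columns by reference and leaves the row order untouched
dat[, (new_cols) := lapply(.SD, impute.mean), by = taxa, .SDcols = cols]

dat
```

Extending this to further numeric columns would only require adding their names to cols.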

dplyr

library(dplyr)

dat %>%
    group_by(taxa) %>%
    mutate(
        length = impute.mean(length),
        width = impute.mean(width)  
    )
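
As written, the pipeline above only prints its result, and that result is still a grouped tibble. A hedged variant, assuming dplyr 1.0.0 or later for across(), stores the result, writes the imputed values to length2 and width2 as the question asked, and drops the grouping afterwards:

```r
library(dplyr)

# Same toy data as in the answer above
dat <- read.table(text = "id    taxa        length  width
101   collembola  2.1     0.9
102   mite        0.9     0.7
103   mite        1.1     0.8
104   collembola  NA      NA
105   collembola  1.5     0.5
106   mite        NA      NA", header = TRUE)

# Replace NAs in a vector with the mean of its observed values
impute.mean <- function(x) replace(x, is.na(x), mean(x, na.rm = TRUE))

dat2 <- dat %>%
    group_by(taxa) %>%
    # .names = "{.col}2" creates length2 / width2 and keeps the originals
    mutate(across(c(length, width), impute.mean, .names = "{.col}2")) %>%
    ungroup()   # drop the grouping so later steps are not computed per taxa

dat2
```

mutate() does not reorder rows, so dat2 stays in the original id order.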
