Conditional sum in dplyr (i.e. "referring to a just-updated value in another row", as in Excel)

Posted by debugcn in Dev

7秒

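This question looks simple and ought to have a simple solution, but I can't find one :/

I have a long series of records indexed by time. The time intervals are not fixed. There is a categorical variable, and I am interested in computing the streak for each category (for how many consecutive days we have had "A" rather than, say, "B"; later "A" may come back and start another streak). In Excel this only takes an IF function that refers to the previous row. In R, I can do it with a for loop, which I provide in the toy example below. What I mainly want to know is how to do it in dplyr.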

library(tidyverse)
library(lubridate)

set.seed(33)

# I create a date column - 20 dates starting from "2020-01-31", then uneven intervals, from 1 to 5 weeks
date <- rep(ymd(20200131), 20)

# (btw, I believe this should also be possible without a for loop, but I cannot come up with a solution
# either):
for (i in 2:length(date)){
  date[i] <- date[i-1]+7*sample(1:5, 1)
}
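
On the side question in the comment above: each date is the previous date plus a random gap, so the whole column is a cumulative sum of gaps. The following is a minimal sketch, assuming base R's as.Date() in place of lubridate's ymd() so that it runs without extra packages; the names start and gaps_in_days are made up for illustration.

```r
set.seed(33)

# Draw all 19 gaps (1 to 5 weeks, in days) at once instead of one per iteration
start <- as.Date("2020-01-31")
gaps_in_days <- 7 * sample(1:5, 19, replace = TRUE)

# The leading 0 keeps the first element at the start date;
# cumsum() accumulates the gaps exactly as the loop does row by row
date <- start + c(0, cumsum(gaps_in_days))

length(date)
```

Drawing the gaps in one call is the only change; whether it reproduces the exact same dates as the loop for a given seed is not checked here.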

# A categorical column
user <- c(rep("A",3), "B", rep("C",4), rep("B",5), rep("A", 6), "B")

df <- data.frame(date, user)

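# Running streak length in days; the first row starts at 0
df$desired_result <- 0

for (i in 2:nrow(df)){
  # A change of user resets the streak; otherwise add the days since the previous row
  if (df[i, "user"] != df[i-1, "user"]) df[i, "desired_result"] <- 0
  else df[i, "desired_result"] <- as.integer(df[i, "date"] - df[i-1, "date"]) + df[i-1, "desired_result"] 
}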

         date user desired_result
1  2020-01-31    A              0
2  2020-03-06    A             35
3  2020-04-03    A             63
4  2020-04-10    B              0
5  2020-04-17    C              0
6  2020-05-08    C             21
7  2020-05-29    C             42
8  2020-06-26    C             70
9  2020-07-03    B              0
10 2020-07-10    B              7
11 2020-07-24    B             21
12 2020-08-28    B             56
13 2020-09-18    B             77
14 2020-10-02    A              0
15 2020-10-09    A              7
16 2020-10-23    A             21
17 2020-11-06    A             35
18 2020-11-13    A             42
19 2020-11-20    A             49
20 2020-12-04    B              0

The question now is: how do I do this in dplyr?

# This is wrong: "object 'result' not found":

df %>%  
  as_tibble() %>% 
  mutate(result = if_else(user == lag(user),
                          as.integer(date - lag(date)) + lag(result),
                          0))

# This is wrong: if the condition is fulfilled, it adds as.integer(date - lag(date)) to 0, not to the result in the row above. 
# It doesn't proceed like a loop does, from the top of the column to the bottom, and doesn't "update" values in the column 
# as it proceeds.

df %>%  
  as_tibble() %>% 
  mutate(result = 0) %>% 
  mutate(result = if_else(user == lag(user),
                          as.integer(date - lag(date)) + lag(result),
                          0))

# A tibble: 20 x 4
   date       user  desired_result result
   <date>     <fct>          <dbl>  <dbl>
 1 2020-01-31 A                  0     NA
 2 2020-02-14 A                 14     14
 3 2020-03-13 A                 42     28
 4 2020-03-20 B                  0      0
 5 2020-04-03 C                  0      0
 6 2020-05-01 C                 28     28
 7 2020-05-08 C                 35      7
 8 2020-06-12 C                 70     35
 9 2020-07-17 B                  0      0
10 2020-08-21 B                 35     35
11 2020-09-04 B                 49     14
12 2020-09-18 B                 63     14
13 2020-10-16 B                 91     28
14 2020-10-23 A                  0      0
15 2020-11-13 A                 21     21
16 2020-11-27 A                 35     14
17 2020-12-25 A                 63     28
18 2021-01-08 A                 77     14
19 2021-02-12 A                112     35
20 2021-03-05 B                  0      0

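I tried group_by(), but it doesn't apply, because a category may come back and start a new streak; cumsum() hasn't helped me so far either. I strongly feel there must be a basic solution :)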
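
The mutate() attempts fail because lag(result) is evaluated on the whole column at once, before any row has been updated. A recurrence of the form "this row depends on the result just computed for the previous row" needs an accumulating fold instead. Below is a minimal sketch using base R's Reduce() with accumulate = TRUE, on six made-up rows; the helper vectors gap and same are illustrative names, not part of the original code.

```r
user <- c("A", "A", "A", "B", "C", "C")
date <- as.Date("2020-01-31") + c(0, 14, 42, 49, 63, 91)

# Days since the previous row (0 for the first row)
gap <- c(0, as.integer(diff(date)))
# TRUE where the row continues the previous row's user
same <- c(FALSE, user[-1] == user[-length(user)])

# Fold over rows 2..n, carrying the previous row's result forward as `prev`
result <- Reduce(
  function(prev, i) if (same[i]) prev + gap[i] else 0,
  seq_along(user)[-1],
  accumulate = TRUE,
  0
)

result
# [1]  0 14 42  0  0 28
```

This reproduces what the for loop does row by row, but it is still sequential under the hood.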

akrun

We can do a grouped operation with rleid on 'user', then take the difference between 'date' and the lag of 'date', and get the cumulative sum (cumsum)

library(dplyr)
library(data.table)
df %>%
   group_by(grp = rleid(user)) %>% 
   mutate(desired_result2 = cumsum(as.integer(date - lag(date, 
           default = first(date))))) %>%
   ungroup %>%
   select(-grp)

-output

# A tibble: 20 x 4
#   date       user  desired_result desired_result2
#   <date>     <chr>          <dbl>           <int>
# 1 2020-01-31 A                  0               0
# 2 2020-02-14 A                 14              14
# 3 2020-03-13 A                 42              42
# 4 2020-03-20 B                  0               0
# 5 2020-04-03 C                  0               0
# 6 2020-05-01 C                 28              28
# 7 2020-05-08 C                 35              35
# 8 2020-06-12 C                 70              70
# 9 2020-07-17 B                  0               0
#10 2020-08-21 B                 35              35
#11 2020-09-04 B                 49              49
#12 2020-09-18 B                 63              63
#13 2020-10-16 B                 91              91
#14 2020-10-23 A                  0               0
#15 2020-11-13 A                 21              21
#16 2020-11-27 A                 35              35
#17 2020-12-25 A                 63              63
#18 2021-01-08 A                 77              77
#19 2021-02-12 A                112             112
#20 2021-03-05 B                  0               0

Note: here desired_result is the output of the OP's for loop, and desired_result2 is the non-loop output
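
To see why grouping by a run id works where group_by(user) does not, it helps to look at the ids themselves. Here is a small sketch that builds them in base R with rle() for the toy user vector from the question; for a single vector this should give the same ids as data.table's rleid(), but that equivalence is stated here as an assumption rather than checked.

```r
user <- c(rep("A", 3), "B", rep("C", 4), rep("B", 5), rep("A", 6), "B")

# rle() finds the runs of identical consecutive values;
# repeating each run's index by its length gives one id per row
runs <- rle(user)
run_id <- rep(seq_along(runs$values), runs$lengths)

run_id
# [1] 1 1 1 2 3 3 3 3 4 4 4 4 4 5 5 5 5 5 5 6
```

The second streak of "A" gets id 5 rather than sharing id 1 with the first one, so each streak is accumulated separately.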

Alternatively, this can be done with rle from base R

df$desired_result2 <- with(df, ave(as.numeric(date), with(rle(user), 
   rep(seq_along(values), lengths)), FUN = function(x) 
      cumsum(c(0, diff(x)))))
df$desired_result2
# [1]   0  14  42   0   0  28  35  70   0  35  49
#[12]  63  91   0  21  35  63  77 112   0
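
As a quick sanity check of the run-based idea, the sketch below wraps it in a helper and compares it with a row-by-row loop on eight made-up rows. The function streak_days is hypothetical, not from the answer above, and it uses an equivalent formulation: within a run, the cumulative sum of gaps equals the date minus the run's first date.

```r
# Hypothetical helper: days elapsed since the start of the current run of `user`
streak_days <- function(date, user) {
  runs <- rle(as.character(user))  # as.character() so factors work too
  run_id <- rep(seq_along(runs$values), runs$lengths)
  # Subtract each run's first date from every date in that run
  as.numeric(date) - ave(as.numeric(date), run_id, FUN = function(x) x[1])
}

date <- as.Date("2020-01-31") + c(0, 14, 42, 49, 63, 91, 98, 133)
user <- c("A", "A", "A", "B", "C", "C", "C", "C")

# Reference: the row-by-row recurrence from the question
loop_result <- numeric(length(date))
for (i in 2:length(date)) {
  if (user[i] == user[i - 1]) {
    loop_result[i] <- loop_result[i - 1] + as.integer(date[i] - date[i - 1])
  }
}

all.equal(streak_days(date, user), loop_result)
# [1] TRUE
```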
