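This looks like a simple problem that should have a simple solution, but I can't find one :/
I have a long series of records indexed by time. The time intervals are not regular. There is a categorical variable, and I want to compute the streak for each category (how many days in a row we have had "A" rather than, say, "B"; "A" may later come back and start a new streak). In Excel this only takes an IF function that refers to the previous row. In R, I can do it with a for loop, which I show in the toy example below. Mostly I want to know how to do it in dplyr.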
library(tidyverse)
library(lubridate)
set.seed(33)
# I create a date column - 20 dates starting from "2020-01-31", then uneven intervals, from 1 to 5 weeks
date <- rep(ymd(20200131), 20)
# (Incidentally, I believe this should also be possible without a for loop,
# but I cannot work out how.)
for (i in 2:length(date)) {
  date[i] <- date[i - 1] + 7 * sample(1:5, 1)
}
# A categorical column
user <- c(rep("A",3), "B", rep("C",4), rep("B",5), rep("A", 6), "B")
df <- data.frame(date, user)
df$desired_result <- 0
for (i in 2:nrow(df)) {
  if (df[i, "user"] != df[i - 1, "user"]) {
    df[i, "desired_result"] <- 0
  } else {
    df[i, "desired_result"] <- as.integer(df[i, "date"] - df[i - 1, "date"]) + df[i - 1, "desired_result"]
  }
}
date user desired_result
1 2020-01-31 A 0
2 2020-03-06 A 35
3 2020-04-03 A 63
4 2020-04-10 B 0
5 2020-04-17 C 0
6 2020-05-08 C 21
7 2020-05-29 C 42
8 2020-06-26 C 70
9 2020-07-03 B 0
10 2020-07-10 B 7
11 2020-07-24 B 21
12 2020-08-28 B 56
13 2020-09-18 B 77
14 2020-10-02 A 0
15 2020-10-09 A 7
16 2020-10-23 A 21
17 2020-11-06 A 35
18 2020-11-13 A 42
19 2020-11-20 A 49
20 2020-12-04 B 0
Now the question: how can I do this in dplyr?
# This fails with "object 'result' not found":
df %>%
  as_tibble() %>%
  mutate(result = if_else(user == lag(user),
                          as.integer(date - lag(date)) + lag(result),
                          0))
# This is wrong too: when the condition holds, it adds as.integer(date - lag(date)) to 0,
# not to the result in the row above. Unlike a loop, mutate() does not work down the column
# from top to bottom, updating values as it goes.
df %>%
  as_tibble() %>%
  mutate(result = 0) %>%
  mutate(result = if_else(user == lag(user),
                          as.integer(date - lag(date)) + lag(result),
                          0))
# A tibble: 20 x 4
date user desired_result result
<date> <fct> <dbl> <dbl>
1 2020-01-31 A 0 NA
2 2020-02-14 A 14 14
3 2020-03-13 A 42 28
4 2020-03-20 B 0 0
5 2020-04-03 C 0 0
6 2020-05-01 C 28 28
7 2020-05-08 C 35 7
8 2020-06-12 C 70 35
9 2020-07-17 B 0 0
10 2020-08-21 B 35 35
11 2020-09-04 B 49 14
12 2020-09-18 B 63 14
13 2020-10-16 B 91 28
14 2020-10-23 A 0 0
15 2020-11-13 A 21 21
16 2020-11-27 A 35 14
17 2020-12-25 A 63 28
18 2021-01-08 A 77 14
19 2021-02-12 A 112 35
20 2021-03-05 B 0 0
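I tried group_by(), but it does not apply here, because a category can come back and start a new streak. cumsum() has not helped me so far either. I strongly feel there must be a basic solution :)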
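We can group by the run-length id of 'user' using rleid() (from data.table), then take the difference between 'date' and the lag of 'date' within each group, and get the cumulative sum of those differences (cumsum).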
library(dplyr)
library(data.table)
df %>%
group_by(grp = rleid(user)) %>%
mutate(desired_result2 = cumsum(as.integer(date - lag(date,
default = first(date))))) %>%
ungroup %>%
select(-grp)
-output
# A tibble: 20 x 4
# date user desired_result desired_result2
# <date> <chr> <dbl> <int>
# 1 2020-01-31 A 0 0
# 2 2020-02-14 A 14 14
# 3 2020-03-13 A 42 42
# 4 2020-03-20 B 0 0
# 5 2020-04-03 C 0 0
# 6 2020-05-01 C 28 28
# 7 2020-05-08 C 35 35
# 8 2020-06-12 C 70 70
# 9 2020-07-17 B 0 0
#10 2020-08-21 B 35 35
#11 2020-09-04 B 49 49
#12 2020-09-18 B 63 63
#13 2020-10-16 B 91 91
#14 2020-10-23 A 0 0
#15 2020-11-13 A 21 21
#16 2020-11-27 A 35 35
#17 2020-12-25 A 63 63
#18 2021-01-08 A 77 77
#19 2021-02-12 A 112 112
#20 2021-03-05 B 0 0
Note: here desired_result is the output from the OP's for loop, and desired_result2 is the output of the loop-free approach.
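If you would rather not load data.table just for rleid(), the run identifier can probably be built with dplyr alone. The sketch below is a minimal illustration on made-up toy data (the names toy, run_id and streak_days are hypothetical, not from the question). It flags each row where user changes, turns the flags into a run id with cumsum(), and measures days from the first date of each run, which should be equivalent to accumulating the gaps within the run.

```r
library(dplyr)

# Toy data mirroring the question's layout; the dates are made up for illustration
toy <- tibble(
  date = as.Date("2020-01-31") + c(0, 14, 42, 49, 63, 91, 98),
  user = c("A", "A", "A", "B", "C", "C", "A")
)

streaks <- toy %>%
  # run_id goes up by 1 every time user differs from the previous row
  mutate(run_id = cumsum(user != lag(user, default = first(user))) + 1L) %>%
  group_by(run_id) %>%
  # days elapsed since the first row of the current run
  mutate(streak_days = as.integer(date - first(date))) %>%
  ungroup()

streaks
```

Because the grouping is on run_id rather than on user, the final "A" row starts a fresh streak at 0 instead of continuing the first "A" run.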
Alternatively, this can be done with rle in base R:
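# Build a run id from rle(), then accumulate the date gaps within each run.
# as.character() guards against `user` being a factor, which rle() rejects.
df$desired_result2 <- with(df, ave(as.numeric(date),
    with(rle(as.character(user)), rep(seq_along(values), lengths)),
    FUN = function(x) cumsum(c(0, diff(x)))))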
df$desired_result2
#[1] 0 14 42 0 0 28 35 70 0 35 49
#[12] 63 91 0 21 35 63 77 112 0
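On the side question in the setup code (building the date column without a for loop), the same cumsum() idea seems to apply: draw all the gaps at once and accumulate them from the start date. This is only a sketch in base R; the names n, start and gaps are introduced here for illustration, and the dates it produces are not guaranteed to match those from the loop even with the same seed.

```r
set.seed(33)

n <- 20
start <- as.Date("2020-01-31")

# Draw the n - 1 gaps (1 to 5 weeks, expressed in days) in a single call
gaps <- 7 * sample(1:5, n - 1, replace = TRUE)

# The first date has gap 0; every later date is the start plus the running total of gaps
date <- start + cumsum(c(0, gaps))
```

The first element is always the start date, and every consecutive difference is a whole number of weeks between 1 and 5, as in the loop version.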