How to optimize a for loop in R for large data

Posted by marnix in Dev

marnix

I am working with a large data.table (2.5 million rows) of interbank loans. Here is the dput output for the first 20 rows:

> dput(head(clean,20))
structure(list(time = c(4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 3L, 4L, 
4L, 4L, 1L, 2L, 3L, 4L, 3L, 4L, 4L, 4L), bal = structure(c(2L, 
4L, 4L, 4L, 4L, 4L, 3L, 3L, 9L, 4L, 2L, 3L, 3L, 3L, 3L, 2L, 4L, 
5L, 2L, 15L), .Label = c("32001", "32002", "32003", "32004", 
"32005", "32006", "32007", "32008", "32009", "32010", "32201", 
"32202", "32203", "32204", "32205", "32206", "32207", "32208", 
"32209", "32210"), class = "factor"), lender = c(2003L, 2547L, 
2547L, 574L, 574L, 574L, 2984L, 3015L, 812L, 3278L, 3124L, 3124L, 
41L, 354L, 3156L, 3156L, 735L, 735L, 1421L, 3319L), borrower = c(2285L, 
2285L, 2285L, 2285L, 2285L, 2285L, 2285L, 2285L, 269L, 2839L, 
2839L, 2839L, 2839L, 2897L, 2399L, 2399L, 1816L, 1816L, 2476L, 
3033L), obm = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0), obd = c(1, 0.3, 0.2, 0.35, 0.7, 0.5, 0.4, 1.2, 
4, 0.16, 4, 4, 0.5, 0.1, 1.4, 1.4, 4, 1, 3.25, 0.4), obk = c(1, 
0, 0, 0, 0, 0, 0, 0.5, 0, 0, 0, 4, 0.5, 0.1, 0, 0, 0, 0, 3.25, 
0), oem = c(0, 0.3, 0.2, 0.35, 0.7, 0.5, 0.4, 0.7, 4, 0.16, 4, 
0, 0, 0, 1.4, 1.4, 4, 1, 0, 0.4), r = c(35, 63, 63, 63, 63, 63, 
60, 60, 3, 55, 25, 12, 34, 0, 5, 4, 60, 60, 60, 35), type = structure(c(1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 2L), .Label = c("loan", "deposit"), class = "factor"), 
    term = structure(c(2L, 4L, 4L, 4L, 4L, 4L, 3L, 3L, 9L, 4L, 
    2L, 3L, 3L, 3L, 3L, 2L, 4L, 5L, 2L, 5L), .Label = c("overdraft", 
    "<1d", "2-7d", "8-30d", "31-90d", "91-180d", "0.5-1y", "1-3y", 
    ">3y", "demand"), class = "factor"), reported = structure(c(10561, 
    10561, 10561, 10561, 10561, 10561, 10561, 10561, 10531, 10561, 
    10561, 10561, 10470, 10500, 10531, 10561, 10531, 10561, 10561, 
    10561), class = "Date"), issued = structure(c(10542, 10543.5, 
    10550, 10556.5, 10553.5, 10555.5, 10558, 10558, 10515, 10557.5, 
    10560, 10555, 10465, 10488, 10527, 10560, 10515.5, 10545.5, 
    10541, 10544), class = "Date"), issued_radius = c(0, 10.5, 
    10, 3.5, 6.5, 4.5, 2, 2, 15, 2.5, 0, 2, 2, 2, 2, 0, 10.5, 
    14.5, 0, 13), due = structure(c(10543, 10563, 10570, 10583, 
    10577, 10581, 10563, 10563, 11966, 10585, 10561, 10560, 10470, 
    10493, 10532, 10561, 10535, 10611, 10542, 10589), class = "Date"), 
    month = c(4, 4, 4, 4, 4, 4, 4, 4, 3, 4, 4, 4, 1, 2, 3, 4, 
    3, 4, 4, 4), week = c(14, 14, 15, 16, 16, 16, 17, 17, 10, 
    16, 17, 16, 3, 7, 12, 17, 10, 15, 14, 15)), .Names = c("time", 
"bal", "lender", "borrower", "obm", "obd", "obk", "oem", "r", 
"type", "term", "reported", "issued", "issued_radius", "due", 
"month", "week"), class = c("data.table", "data.frame"), row.names = c(NA, 
-20L), .internal.selfref = <pointer: 0x2960818>)

The three columns of interest in clean are issued, issued_radius and week, but I have included all columns because they affect the performance of the loop.

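Each row represents a loan whose Date of issuance I want to estimate at a weekly resolution. The issue date lies in the interval [issued - issued_radius, issued + issued_radius]. This interval may span a single day or several weeks (at most a month, i.e. at most 5 weeks). The code generates this interval and counts the week numbers of its days, offset from a start date. Each of these weeks is given a weight consistent with its overlap with the interval. For example, a loan in clean that, according to its interval, could have been issued in week 17 or week 18 is expanded into two loans in patch, and the borrowed volumes (columns oem, obd, etc.) are scaled by this weight.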
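As a minimal sketch of the weighting just described, the helper below computes the weekly weights for a single loan. The name week_weights is hypothetical and not part of the original code; it counts weeks as floor(days since the offset date / 7), which is intended to match the elapsed_weeks function in the post.

```r
# Hypothetical helper (not from the original post): weekly weights for one
# loan whose issue date lies in [issued - radius, issued + radius].
week_weights <- function(issued, radius, start_date) {
  # Every candidate issue day in the interval.
  days <- seq(issued - radius, issued + radius, by = "day")
  # Whole weeks elapsed since the offset date, one value per day.
  weeks <- floor(as.numeric(days - start_date) / 7)
  counts <- table(weeks)
  # Weight of a week = share of the interval's days that fall into it.
  data.frame(week = as.numeric(names(counts)),
             weight = as.numeric(counts) / length(days))
}

# The loan used as an example later in the post: issued 1998-11-13 (plus half
# a day) with a radius of 10.5 days, i.e. the interval 1998-11-03 to 1998-11-24.
w <- week_weights(as.Date("1998-11-13") + 0.5, 10.5, as.Date("1998-08-01"))
print(w)
```

Because the weights of one loan sum to 1, scaling a volume column by them splits the volume across the candidate weeks without changing its total.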

library(data.table)

START_DATE = as.Date("1998-8-1")

elapsed_weeks <- function(t, start_date) {
  as.numeric( floor( difftime( t, start_date, units="weeks" ) ) )
}

#load("clean.Rda")

# One-day intervals can be added to our result immediately
patch = clean[issued_radius==0]
clean = clean[issued_radius!=0]

N = nrow(clean)
write_index = nrow(patch)+1

# Allocate space in patch.
dummy = data.table(time = rep(0, N*5))
patch = rbindlist(list(patch, dummy), use.names = TRUE, fill= TRUE)

for (k in 1:N) {
  entry = clean[k]

  # Recover Date interval [i, j].
  i = entry$issued - entry$issued_radius
  j = entry$issued + entry$issued_radius

  # Generate sequence of days in the interval and
  # map each day to a weeknumber, counting the frequencies.
  x = seq.Date(i, j, by="day")
  T = table(elapsed_weeks(x, START_DATE))

  for (name in names(T)) { # can this be vectorized?
    week_number = as.numeric(name)
    week_weight = as.numeric(T[name]) / length(x)

    new_entry = entry

    new_entry$week = week_number
    new_entry$obm = entry$obm * week_weight
    new_entry$obd = entry$obd * week_weight
    new_entry$obk = entry$obk * week_weight
    new_entry$oem = entry$oem * week_weight

    patch[write_index] = new_entry

    write_index = write_index + 1
  }
}

# Delete unused allocated rows.
patch = patch[!is.na(type)]

print(nrow(patch)/nrow(clean)) # < 5

Edit 2: adding another example.

> clean[2]
   time   bal lender borrower obm obd obk oem  r type  term   reported     issued issued_radius        due
1:    4 32004   2547     2285   0 0.3   0 0.3 63 loan 8-30d 1998-12-01 1998-11-13          10.5 1998-12-03
   month week
1:     4   14

This loan could have been issued on any day in [1998-11-03, 1998-11-24]. Each day in this interval maps to a week number counted from START_DATE:

> x
 [1] "1998-11-03" "1998-11-04" "1998-11-05" "1998-11-06" "1998-11-07" "1998-11-08" "1998-11-09" "1998-11-10"
 [9] "1998-11-11" "1998-11-12" "1998-11-13" "1998-11-14" "1998-11-15" "1998-11-16" "1998-11-17" "1998-11-18"
[17] "1998-11-19" "1998-11-20" "1998-11-21" "1998-11-22" "1998-11-23" "1998-11-24"
> elapsed_weeks(x, START_DATE)
 [1] 13 13 13 13 14 14 14 14 14 14 14 15 15 15 15 15 15 15 16 16 16 16

Now we build a frequency table to derive the weight of each possible week of issuance.

> table(elapsed_weeks(x, START_DATE))

13 14 15 16 
 4  7  7  4

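So this loan is expanded into loans whose week column takes the values {13, 14, 15, 16}. The volumes of these loans are scaled in proportion to the frequency weights of the possible week offsets.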

> table(elapsed_weeks(x, START_DATE)) / length(x)

       13        14        15        16 
0.1818182 0.3181818 0.3181818 0.1818182

So our final patch looks like this:

> patch
   time   bal lender borrower obm        obd obk        oem  r type  term   reported     issued
1:    4 32004   2547     2285   0 0.05454545   0 0.05454545 63 loan 8-30d 1998-12-01 1998-11-13
2:    4 32004   2547     2285   0 0.09545455   0 0.09545455 63 loan 8-30d 1998-12-01 1998-11-13
3:    4 32004   2547     2285   0 0.09545455   0 0.09545455 63 loan 8-30d 1998-12-01 1998-11-13
4:    4 32004   2547     2285   0 0.05454545   0 0.05454545 63 loan 8-30d 1998-12-01 1998-11-13
   issued_radius        due month week
1:          10.5 1998-12-03     4   13
2:          10.5 1998-12-03     4   14
3:          10.5 1998-12-03     4   15
4:          10.5 1998-12-03     4   16

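I have already applied some optimizations suggested by @David (How to speed up rbind?), but the result is still slow. After ten hours of overnight computation, I had processed 4% of the clean data table.

So my question is: how can I scale this loop to a large data.table?

Thank you all for your time.

Edit: R version 3.3.1 (2016-06-21).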

Roland

If I understood your explanation correctly, you should use an overlap join in data.table.
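To illustrate what an overlap join returns, here is a toy sketch with made-up data. The tables loans and bins are hypothetical and the week bins are counted from 1998-08-01 as in the question. The day count uses pmin/pmax, which is one possible way to measure the overlap and not necessarily the formula used in this answer.

```r
library(data.table)

# Hypothetical toy data: two loans, each with a window of possible issue days.
loans <- data.table(id    = 1:2,
                    start = as.Date(c("1998-11-03", "1998-11-10")),
                    end   = as.Date(c("1998-11-12", "1998-11-10")))

# Two consecutive 7-day bins (weeks 13 and 14 counted from 1998-08-01).
bins <- data.table(week  = 13:14,
                   start = as.Date(c("1998-10-31", "1998-11-07")),
                   end   = as.Date(c("1998-11-06", "1998-11-13")))
setkey(bins, start, end)  # foverlaps() requires the second table to be keyed

# One output row per (loan, bin) pair whose intervals overlap.
res <- foverlaps(loans, bins, by.x = c("start", "end"))

# Days shared by the loan window (i.start, i.end) and the bin (start, end),
# counting both end points.
res[, days := as.numeric(pmin(end, i.end) - pmax(start, i.start)) + 1]
res[, weight := days / sum(days), by = id]
print(res[, .(id, week, days, weight)])
```

Loan 1 overlaps both bins and loan 2 only the second, so the join yields three rows. The columns prefixed with i. come from the first argument of foverlaps().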

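library(data.table)

#DT is assumed here to be the data.table from the question (called clean there)
#define start and end dates, 
#fractional days could be an issue here, but I have not checked that further
DT[, c("start", "end") := .(issued - issued_radius, issued + issued_radius)]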
#create an ID
DT[, id := .I]

#create a data.table with start of week and end of week for whole year
weeks <- data.table(date = seq(as.Date("1998-01-01"), as.Date("1998-12-31"), by = "1 day"))
weeks[, week := week(date)]
weeks <- weeks[, .(start = min(date), end = max(date)), by = week]
setkey(weeks, start, end)

#now an overlaps join
DT1 <- foverlaps(DT, weeks)
#calculate number of days in each week, 
#special handling of last and first week of year might be necessary here
DT1[, overlap := 7 - (i.start > start) * (i.start - start) -  (i.end < end) * (end - i.end)]
#calculate weights
DT1[, weight := as.numeric(overlap) / sum(as.numeric(overlap)), by = id]
#apply weights
DT1[, c("obm_w",  "obd_w",  "obk_w",  "oem_w") := lapply(.SD, function(x) x * DT1[["weight"]]), 
    .SDcols = c("obm",  "obd",  "obk",  "oem")]

Please check carefully whether this meets your requirements and adjust as needed.
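For comparison, the expansion can also be sketched without a join, by generating every candidate day of every loan in one vectorized step and counting days per week. This is a different technique from the overlap join above, offered only as a sketch: the function expand_by_week and the demo table are hypothetical, weeks are counted from START_DATE as in the question, and the number of days per interval is assumed to be floor(2 * issued_radius) + 1.

```r
library(data.table)

START_DATE <- as.Date("1998-08-01")

# Hypothetical helper: one output row per loan and candidate week, with the
# volume columns scaled by the share of interval days falling in that week.
expand_by_week <- function(dt, start_date,
                           vol_cols = c("obm", "obd", "obk", "oem")) {
  dt <- copy(dt)                        # leave the caller's table untouched
  dt[, id := .I]
  first_day <- as.numeric(dt$issued - dt$issued_radius)
  n_days <- as.integer(floor(2 * dt$issued_radius)) + 1L  # assumed day count
  # All candidate days of all loans, built without an R-level loop.
  days <- data.table(id  = rep(dt$id, n_days),
                     day = rep(first_day, n_days) + sequence(n_days) - 1)
  days[, week := floor((day - as.numeric(start_date)) / 7)]
  w <- days[, .(n = .N), by = .(id, week)]
  w[, weight := n / sum(n), by = id]
  if ("week" %in% names(dt)) dt[, week := NULL]  # replaced by estimated week
  out <- merge(dt, w[, .(id, week, weight)], by = "id")
  out[, (vol_cols) := lapply(.SD, `*`, weight), .SDcols = vol_cols]
  out[, c("id", "weight") := NULL]
  out[]
}

# Made-up two-row table: the example loan from the question plus a loan
# with a one-day interval (issued_radius == 0).
demo <- data.table(issued = as.Date(c("1998-11-13", "1998-11-12")) + c(0.5, 0),
                   issued_radius = c(10.5, 0),
                   obm = c(0, 0), obd = c(0.3, 1),
                   obk = c(0, 1), oem = c(0.3, 0),
                   week = c(14, 14))
res <- expand_by_week(demo, START_DATE)
print(res)
```

The intermediate days table has one row per loan and candidate day, so for millions of loans it may be necessary to process the data in chunks to keep memory use manageable.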


Edited on 2021-03-4
