R package arrow

Lipatz Jean-Luc Tue, 11 Oct 2022 05:41:14 -0700

Hello,
I apologize if I misunderstood the contribution guidelines page, but I would like 
to report what seems to me to be a bug, and I cannot submit it as an issue 
the regular way.


I am currently trying to get an idea of the limits implied by the use of Parquet 
files, as we at INSEE are moving from SAS to R and need to find an alternative 
storage format to SAS datasets.
I ran something that looks like a crash test, merging large tables (100 million 
rows, 7 columns). It appears that using Parquet files is fine with duckdb 
connected to them, but using the arrow package alone doesn't seem to be a good 
idea: the process crashes without any error message.
My questions are:

1) Is it possible to have better behaviour? Just terminating the 
command and printing an error message would be fine.

2) Is my problem related to some optimization that occurs within duckdb 
but not within the arrow package, even though nothing is executed before the 
‘collect’ function is called?

I include below the program I used to generate the files, followed by the test 
code that never finishes.

Best regards,

Jean-Luc LIPATZ
INSEE - Direction générale - Direction du Système d'information
Responsable de la coordination de la mise en œuvre d'alternatives à SAS et du 
développement de R

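DIR = "V:/PALETTES/tmp/ONE"  # path prefix: files are named ONE1.parquet, ONE2.parquet, ...

## Generation
library(rio)
ROWS = 100000000  # rows in the initial table
DEL = 2000000     # rows deleted at each generation
ADD = 4000000     # rows added at each generation
N = 2             # number of files to generate

# Initial table: 7 columns (id, generation number, five random values)
df <- data.frame(
    id = 1:ROWS,
    gen = rep(1, ROWS),
    v1 = runif(ROWS),
    v2 = runif(ROWS),
    v3 = runif(ROWS),
    v4 = runif(ROWS),
    v5 = runif(ROWS))
n <- 1  # running offset, so that new ids never collide with earlier ones
rio::export(df, glue::glue("{DIR}1.parquet"))
for (i in 2:N) {
  n <- n + nrow(df)
  # Drop DEL rows at random, then append ADD new rows with fresh ids
  df <- df[-sample(seq_len(nrow(df)), DEL), ]
  df <- rbind(df,
    data.frame(
      id = (n + 1):(n + ADD),
      gen = rep(i, ADD),
      v1 = runif(ADD),
      v2 = runif(ADD),
      v3 = runif(ADD),
      v4 = runif(ADD),
      v5 = runif(ADD)))
  rio::export(df, glue::glue("{DIR}{i}.parquet"))
}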

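## Test
library(dplyr)  # provides inner_join(), summarise(), collect() and %>%
one1 <- arrow::open_dataset("V:/PALETTES/tmp/ONE1.parquet")
one2 <- arrow::open_dataset("V:/PALETTES/tmp/ONE2.parquet")
# Nothing is evaluated until collect() is called
inner_join(one1, one2, by = "id") %>%
  summarise(n = n(), delta1 = mean(v2.y - v1.x)) %>%
  collect()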
