Hello,
I apologize if I misunderstood the contribution guidelines page, but I would like
to report what seems to me to be a bug, and I cannot submit it as an issue
the regular way.
I am currently trying to get some idea of the limits implied by the use of
parquet files, as we at INSEE are moving from SAS to R and need to find an
alternative storage format to SAS datasets.
I ran something that looks like a crash test, merging large tables (100 million
rows, 7 columns). It appears that using parquet files works fine with duckdb
connected to them, but using the arrow package alone does not seem to be a good
idea: the process crashes without any error message.
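For reference, one way to run the same join through duckdb directly is the
following sketch (it assumes the DBI and duckdb packages and reads the parquet
files produced by the generation script below; the query mirrors the dplyr
pipeline at the end of this message):

```r
library(DBI)
library(duckdb)

# Open an in-memory duckdb instance and let it scan the parquet files itself.
con <- dbConnect(duckdb::duckdb())
res <- dbGetQuery(con, "
  SELECT COUNT(*)         AS n,
         AVG(b.v2 - a.v1) AS delta1
  FROM 'V:/PALETTES/tmp/ONE1.parquet' AS a
  INNER JOIN 'V:/PALETTES/tmp/ONE2.parquet' AS b USING (id)
")
dbDisconnect(con, shutdown = TRUE)
```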
My questions are:
1) Is it possible to have better behaviour? Just terminating the
command and printing an error message would be fine.
2) Is my problem related to some optimization that occurs within duckdb
but not within the arrow package, even though nothing is executed before
the `collect` call?
I include below the program I used to generate the files, as well as the code
that never finishes.
Best regards,
Jean-Luc LIPATZ
INSEE - Direction générale - Direction du Système d'information
Coordinator for the rollout of alternatives to SAS and for the
development of R
DIR <- "V:/PALETTES/tmp/ONE"

## Generation
library(rio)

ROWS <- 100000000   # rows in the initial table
DEL  <- 2000000     # rows removed at each generation
ADD  <- 4000000     # rows added at each generation
N    <- 2           # number of files to produce

df <- data.frame(
  id  = 1:ROWS,
  gen = rep(1, ROWS),
  v1  = runif(ROWS),
  v2  = runif(ROWS),
  v3  = runif(ROWS),
  v4  = runif(ROWS),
  v5  = runif(ROWS))
n <- 1
rio::export(df, glue::glue("{DIR}1.parquet"))
for (i in 2:N) {
  n  <- n + nrow(df)
  # Drop DEL random rows, then append ADD fresh rows tagged with generation i.
  df <- df[-sample(1:nrow(df), DEL), ]
  df <- rbind(df,
              data.frame(
                id  = (n + 1):(n + ADD),
                gen = rep(i, ADD),
                v1  = runif(ADD),
                v2  = runif(ADD),
                v3  = runif(ADD),
                v4  = runif(ADD),
                v5  = runif(ADD)))
  rio::export(df, glue::glue("{DIR}{i}.parquet"))
}
## Test
library(dplyr)   # for inner_join(), %>% and summarise()

one1 <- arrow::open_dataset("V:/PALETTES/tmp/ONE1.parquet")
one2 <- arrow::open_dataset("V:/PALETTES/tmp/ONE2.parquet")

# Nothing is evaluated before collect(); this is the call that never returns.
inner_join(one1, one2, by = "id") %>%
  summarise(n = n(), delta1 = mean(v2.y - v1.x)) %>%
  collect()
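A variant that may help isolate the difference (a sketch, not something I have
run at this scale): arrow can hand the same lazy pipeline to duckdb for
execution via `arrow::to_duckdb()`, which keeps the dplyr syntax but uses
duckdb's engine:

```r
library(dplyr)

# Same join and aggregation, but executed by duckdb through arrow's bridge.
inner_join(arrow::to_duckdb(one1), arrow::to_duckdb(one2), by = "id") %>%
  summarise(n = n(), delta1 = mean(v2.y - v1.x)) %>%
  collect()
```

If this version completes while the pure-arrow one crashes, that would point at
the arrow join engine rather than at the parquet files themselves.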