Hello, I apologize if I misunderstood the contribution guidelines page, but I would like to report what seems to me to be a bug, and I cannot submit it as an issue through the regular channel.
I am currently trying to get an idea of the limits implied by the use of Parquet files, as we at INSEE are moving from SAS to R and need an alternative storage format to SAS datasets. I ran what amounts to a crash test: merging large tables (100 million rows, 7 columns). It appears that querying the Parquet files works fine with duckdb connected to them, but using the arrow package alone does not seem to be a good idea: the process crashes without any error message.

My questions are:
1) Is it possible to have a better behaviour? Just terminating the command and printing an error message would be fine.
2) Is my problem related to some optimization that occurs within duckdb but not within the arrow package, even though nothing is executed before the collect() call?

I include below the program I used to generate the files, and also the code that never ends.

Best regards,

Jean-Luc LIPATZ
INSEE - Direction générale - Direction du Système d'information
Head of coordination for implementing alternatives to SAS and for developing R

DIR = "V:/PALETTES/tmp/ONE"

## Generation
library(rio)

ROWS = 100000000   # rows in the initial table
DEL  = 2000000     # rows removed at each generation
ADD  = 4000000     # rows added at each generation
N    = 2           # number of Parquet files to generate

df <- data.frame(
  id  = 1:ROWS,
  gen = rep(1, ROWS),
  v1  = runif(ROWS),
  v2  = runif(ROWS),
  v3  = runif(ROWS),
  v4  = runif(ROWS),
  v5  = runif(ROWS))
n <- 1
rio::export(df, glue::glue("{DIR}1.parquet"))
for (i in 2:N) {
  n  <- n + nrow(df)
  df <- df[-sample(1:nrow(df), DEL), ]   # drop DEL random rows
  df <- rbind(df,
              data.frame(
                id  = (n + 1):(n + ADD),
                gen = rep(i, ADD),
                v1  = runif(ADD),
                v2  = runif(ADD),
                v3  = runif(ADD),
                v4  = runif(ADD),
                v5  = runif(ADD)))
  rio::export(df, glue::glue("{DIR}{i}.parquet"))
}

## Test (never ends; the process crashes without a message)
library(dplyr)

one1 <- arrow::open_dataset("V:/PALETTES/tmp/ONE1.parquet")
one2 <- arrow::open_dataset("V:/PALETTES/tmp/ONE2.parquet")
inner_join(one1, one2, by = "id") %>%
  summarise(n = n(), delta1 = mean(v2.y - v1.x)) %>%
  collect()
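For comparison, the duckdb-connected variant I referred to can be sketched as follows. This is only a sketch of my setup: arrow::to_duckdb() is one way to hand the arrow datasets to duckdb without materialising them, and it requires the duckdb and dbplyr packages to be installed.

```r
## Same query as in the test above, but routed through duckdb via
## arrow::to_duckdb() (sketch; assumes the duckdb and dbplyr packages)
library(dplyr)

one1 <- arrow::open_dataset("V:/PALETTES/tmp/ONE1.parquet") %>% arrow::to_duckdb()
one2 <- arrow::open_dataset("V:/PALETTES/tmp/ONE2.parquet") %>% arrow::to_duckdb()

inner_join(one1, one2, by = "id") %>%
  summarise(n = n(), delta1 = mean(v2.y - v1.x)) %>%
  collect()
```

With this version the same join on 100 million rows completes, whereas the plain arrow version crashes.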