According to my internet research, readxl looks like the fastest package for importing xlsx files.
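For anyone who wants to check that claim on their own data, a minimal sketch of such a comparison (the file name "test.xlsx" is a hypothetical placeholder; readxl and openxlsx are just two common readers, not an exhaustive list):

# Hypothetical sketch: time two xlsx readers on one local file.
# "test.xlsx" is a placeholder; both calls read the first sheet.
arquivo <- "test.xlsx"

microbenchmark::microbenchmark(
  readxl   = readxl::read_excel(arquivo, sheet = 1),
  openxlsx = openxlsx::read.xlsx(arquivo, sheet = 1),
  times = 10
)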
The profvis package indicated that the bottleneck is indeed in importing the files. My processor has six cores, but when I use four of them the machine becomes completely unresponsive; with three workers it is still usable. So I ran one more benchmark comparing a for loop, map_dfr and future_map_dfr (with a multisession plan and three workers). With 10 runs each, the results were:

             expr      min       lq     mean   median       uq      max neval
     import_for() 140.9940 147.9722 160.7229 155.6459 172.4661 199.1059    10
 import_map_dfr() 161.6707 339.6769 480.5760 567.8389 643.8895 666.0726    10
   import_furrr() 112.1374 116.4301 127.5976 129.0067 137.9179 140.8632    10

For me this settles it: the furrr package is the best solution in this case. But what would explain such a large difference from map_dfr? (A sketch of the benchmark setup appears after the quoted thread below.)

On Tue, Oct 4, 2022 at 4:58 PM, Jeff Newmiller <jdnew...@dcn.davis.ca.us> wrote:

> It looks like you are reading directly from URLs? How do you know the
> delay is not network I/O delay?
>
> Parallel computation is not a panacea. It allows tasks _that are
> CPU-bound_ to get through the CPU-intensive work faster. You need to be
> certain that your tasks actually can benefit from parallelism before
> using it... there is a significant overhead and added complexity to using
> parallel processing that will lead to SLOWER processing if misused.
>
> On October 4, 2022 11:29:54 AM PDT, Igor L <igorlal...@gmail.com> wrote:
> > Hello all,
> >
> > I'm developing an R package that basically downloads, imports, cleans
> > and merges nine files in xlsx format, updated monthly by a public
> > institution.
> >
> > The problem is that importing files in xlsx format is time-consuming.
> >
> > My initial idea was to parallelize the execution of the read_xlsx
> > function according to the number of cores in the user's processor, but
> > apparently it didn't make much difference: the execution time only went
> > from 185.89 to 184.12 seconds:
> >
> > # not parallelized code
> > y <- purrr::map_dfr(paste0(dir.temp, '/', lista.arquivos.locais),
> >                     readxl::read_excel, sheet = 1, skip = 4,
> >                     col_types = rep('text', 30))
> >
> > # parallelized code
> > future::plan(future::multicore, workers = 4)
> > y <- furrr::future_map_dfr(paste0(dir.temp, '/', lista.arquivos.locais),
> >                            readxl::read_excel, sheet = 1, skip = 4,
> >                            col_types = rep('text', 30))
> >
> > Any suggestions to reduce the import processing time?
> >
> > Thanks in advance!
>
> --
> Sent from my phone. Please excuse my brevity.
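For reference, a sketch of how the three benchmarked functions might look. This is a reconstruction, not the exact code: dir.temp and lista.arquivos.locais come from the quoted message, while the helper ler() and the variable arquivos are names introduced here for illustration.

# Reconstruction of the benchmark setup -- not the exact code.
# 'dir.temp' and 'lista.arquivos.locais' are from the quoted message;
# 'arquivos' and 'ler' are assumed helper names.
arquivos <- paste0(dir.temp, '/', lista.arquivos.locais)
ler <- function(arq) {
  readxl::read_excel(arq, sheet = 1, skip = 4, col_types = rep('text', 30))
}

# for loop: read each file and grow the result with rbind()
import_for <- function() {
  y <- NULL
  for (arq in arquivos) y <- rbind(y, ler(arq))
  y
}

# purrr: sequential map over the files, row-binding the results
import_map_dfr <- function() purrr::map_dfr(arquivos, ler)

# furrr: the same mapping, distributed over the workers set by plan()
import_furrr <- function() furrr::future_map_dfr(arquivos, ler)

# Set the plan once, outside the timed functions, so that worker
# startup is not counted in every furrr run
future::plan(future::multisession, workers = 3)

microbenchmark::microbenchmark(
  import_for(), import_map_dfr(), import_furrr(),
  times = 10
)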