[Rd] utils::install.packages with quiet=TRUE fails for source packages on Windows

2018-01-25 Thread Andreas Kersting

Hi,

Installing a source package on Windows using utils::install.packages() 
with quiet=TRUE fails, while it works with the default quiet = FALSE. 
The problem seems to be caused by the fact that when quiet = TRUE, 
stdout and stderr are set to FALSE when calling "R CMD INSTALL" with 
base::system2() here: 
https://github.com/wch/r-source/blob/tags/R-3-4-3/src/library/utils/R/packages2.R#L660-L661.
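For reference, a minimal sketch of the call that install.packages() ends up 
making (the tarball name and library path are just the ones from the session 
below; the quoting in the real code differs slightly):

cmd <- file.path(R.home("bin"), "R")
args <- c("CMD", "INSTALL", "-l", shQuote(tempdir()),
          "partDF_1.0.0.9001.tar.gz")

# quiet = TRUE discards the child's output -- this is the variant that fails:
status_quiet <- system2(cmd, args, stdout = FALSE, stderr = FALSE)

# quiet = FALSE echoes the output to the console -- this variant works:
status_loud <- system2(cmd, args, stdout = "", stderr = "")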



trace(base::system2, quote(print(ls.str())))

Tracing function "system2" in package "base"
[1] "system2"


utils::install.packages("partDF_1.0.0.9001.tar.gz", repos = NULL, lib 

= tempdir(), quiet = TRUE)
Tracing system2(cmd0, args, env = env, stdout = output, stderr = output) 
on entry
args :  chr [1:5] "CMD" "INSTALL" "-l" 
"\"C:\\Users\\askers\\AppData\\Local\\Temp\\RtmpoRb97l\"" ...

command :  chr "C:/PROGRA~1/R/R-34~1.3/bin/x64/R"
env :  chr(0)
input :  NULL
invisible :  logi TRUE
minimized :  logi FALSE
stderr :  logi FALSE
stdin :  chr ""
stdout :  logi FALSE
wait :  logi TRUE
Warning messages:
1: running command '"C:/PROGRA~1/R/R-34~1.3/bin/x64/R" CMD INSTALL -l 
"C:\Users\askers\AppData\Local\Temp\RtmpoRb97l" 
"partDF_1.0.0.9001.tar.gz"' had status 1

2: In utils::install.packages("partDF_1.0.0.9001.tar.gz", repos = NULL,  :
  installation of package 'partDF_1.0.0.9001.tar.gz' had non-zero exit 
status



utils::install.packages("partDF_1.0.0.9001.tar.gz", repos = NULL, lib 

= tempdir(), quiet = FALSE)
Tracing system2(cmd0, args, env = env, stdout = output, stderr = output) 
on entry
args :  chr [1:5] "CMD" "INSTALL" "-l" 
"\"C:\\Users\\askers\\AppData\\Local\\Temp\\RtmpoRb97l\"" ...

command :  chr "C:/PROGRA~1/R/R-34~1.3/bin/x64/R"
env :  chr(0)
input :  NULL
invisible :  logi TRUE
minimized :  logi FALSE
stderr :  chr ""
stdin :  chr ""
stdout :  chr ""
wait :  logi TRUE
* installing *source* package 'partDF' ...
** libs
c:/Rtools/mingw_64/bin/gcc  -I"C:/PROGRA~1/R/R-34~1.3/include" -DNDEBUG 
   -O2 -Wall  -std=gnu99 -mtune=generic -c partDF.c -o partDF.o
c:/Rtools/mingw_64/bin/gcc -shared -s -static-libgcc -o partDF.dll 
tmp.def partDF.o -LC:/PROGRA~1/R/R-34~1.3/bin/x64 -lR

installing to C:/Users/askers/AppData/Local/Temp/RtmpoRb97l/partDF/libs/x64
** R
** inst
** byte-compile and prepare package for lazy loading
** help
*** installing help indices
  converting help for package 'partDF'
finding HTML links ... done
  anti_glob      html
  partDF         html
  read_partDF    html
  write_partDF   html
** building package indices
** testing if installed package can be loaded
* DONE (partDF)
In R CMD INSTALL



sessionInfo()

R version 3.4.3 (2017-11-30)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)

Matrix products: default

locale:
[1] LC_COLLATE=German_Germany.1252  LC_CTYPE=German_Germany.1252 
LC_MONETARY=German_Germany.1252

[4] LC_NUMERIC=C    LC_TIME=German_Germany.1252

attached base packages:
[1] stats graphics  grDevices utils datasets  methods   base

loaded via a namespace (and not attached):
[1] compiler_3.4.3 tools_3.4.3    yaml_2.1.16


This problem is also there when installing source packages from CRAN:


utils::install.packages("mvtnorm", lib = tempdir(), quiet = TRUE)


  There is a binary version available but the source version is later:
        binary source needs_compilation
mvtnorm  1.0-6  1.0-7              TRUE

Do you want to install from sources the package which needs compilation?
y/n: y
installing the source package 'mvtnorm'

Tracing system2(cmd0, args, env = env, stdout = outfile, stderr = 
outfile) on entry
args :  Named chr [1:5] "CMD" "INSTALL" "-l" 
"\"C:\\Users\\askers\\AppData\\Local\\Temp\\RtmpoRb97l\"" ...

command :  chr "C:/PROGRA~1/R/R-34~1.3/bin/x64/R"
env :  chr(0)
input :  NULL
invisible :  logi TRUE
minimized :  logi FALSE
stderr :  logi FALSE
stdin :  chr ""
stdout :  logi FALSE
wait :  logi TRUE
Warning messages:
1: running command '"C:/PROGRA~1/R/R-34~1.3/bin/x64/R" CMD INSTALL -l 
"C:\Users\askers\AppData\Local\Temp\RtmpoRb97l" 
C:\Users\askers\AppData\Local\Temp\RtmpoRb97l/downloaded_packages/mvtnorm_1.0-7.tar.gz' 
had status 1

2: In utils::install.packages("mvtnorm", lib = tempdir(), quiet = TRUE) :
  installation of package 'mvtnorm' had non-zero exit status



I do not encounter this problem on my Linux machine.

Andreas



Re: [Rd] utils::install.packages with quiet=TRUE fails for source packages on Windows

2018-01-26 Thread Andreas Kersting
Just noticed that this problem only occurs from within RStudio 
(v1.1.414). Any ideas why?



Re: [Rd] utils::install.packages with quiet=TRUE fails for source packages on Windows

2018-01-26 Thread Andreas Kersting
I have filed a bug report here: 
https://github.com/rstudio/rstudio/issues/2070


-------- Original Message --------
From: peter dalgaard [mailto:pda...@gmail.com]
Sent: Friday, Jan 26, 2018 10:15 AM GMT
To: Andreas Kersting
Cc: r-devel@r-project.org
Subject: [Rd] utils::install.packages with quiet=TRUE fails for source 
packages on Windows



The obvious guess would be that Rstudio is attempting something like 
redirecting output and getting itself confused. However, it is pretty clearly 
Their Problem, no? Rstudio has their own support infrastructure.

-pd



On 26 Jan 2018, at 09:17 , Andreas Kersting  wrote:

Just noticed that this problem only occurs from within RStudio (v1.1.414). Any 
ideas why?


[Rd] most robust way to call R API functions from a secondary thread

2019-05-19 Thread Andreas Kersting
Hi,

As the subject suggests, I am looking for the most robust way to call an 
(arbitrary) function from the R API from another but the main POSIX thread in a 
package's code.

I know that, "[c]alling any of the R API from threaded code is ‘for experts 
only’ and strongly discouraged. Many functions in the R API modify internal R 
data structures and might corrupt these data structures if called 
simultaneously from multiple threads. Most R API functions can signal errors, 
which must only happen on the R main thread." 
(https://cran.r-project.org/doc/manuals/r-release/R-exts.html#OpenMP-support)

Let me start with my understanding of the related issues and possible solutions:

1) R API functions are generally not thread-safe and hence one must ensure, 
e.g. by using mutexes, that no two threads use the R API simultaneously

2) R uses longjmps on error and interrupts as well as for condition handling 
and it is undefined behaviour to do a longjmp from one thread to another; 
interrupts can be suspended before creating the threads by setting 
R_interrupts_suspended = TRUE; by wrapping the calls to functions from the R 
API with R_ToplevelExec(), longjmps across thread boundaries can be avoided; 
the only reason for R_ToplevelExec() itself to fail with an R-style error 
(longjmp) is a pointer protection stack overflow

3) R_CheckStack() might be executed (indirectly), which will (probably) signal 
a stack overflow because it only works correctly when called from the main 
thread (see 
https://cran.r-project.org/doc/manuals/r-release/R-exts.html#Threading-issues); 
in particular, any function that does allocations, e.g. via allocVector3() 
might end up calling it via GC -> finalizer -> ... -> eval; the only way around 
this problem which I could find is to adjust R_CStackLimit, which is outside of 
the official API; it can be set to -1 to disable the check or be changed to a 
value appropriate for the current thread

4) R sets signal handlers for several signals and some of them make use of the 
R API; hence, issues 1) - 3) apply; signal masks can be used to block delivery 
of signals to secondary threads in general and to the main thread while other 
threads are using the R API


I basically have the following questions:

a) Is my understanding of the issues accurate?
b) Are there more things to consider when calling the R API from secondary 
threads?
c) Are the solutions proposed appropriate? Are there scenarios in which they 
will fail to solve the issue? Or might they even cause new problems?
d) Are there alternative/better solutions?

Any feedback on this is highly appreciated.

Below you can find a template which combines the proposed solutions (and skips 
all non-illustrative checks of return values). Additionally, 
R_CheckUserInterrupt() is used in combination with R_UnwindProtect() to 
regularly check for interrupts from the main thread, while still being able to 
cleanly cancel the threads before fun_running_in_main_thread() is left via a 
longjmp. This is e.g. required if the secondary threads use memory which was 
allocated in fun_running_in_main_thread() using e.g. R_alloc().

Best regards,
Andreas Kersting



#include <R.h>
#include <Rinternals.h>
#include <pthread.h>
#include <stdint.h>

extern uintptr_t R_CStackLimit;
extern int R_PPStackTop;
extern int R_PPStackSize;

#include <R_ext/libextern.h>  /* for the LibExtern macro */
LibExtern Rboolean R_interrupts_suspended;
LibExtern int R_interrupts_pending;
extern void Rf_onintr(void);

// mutex for exclusive access to the R API:
static pthread_mutex_t r_api_mutex = PTHREAD_MUTEX_INITIALIZER;

// a wrapper around R_CheckUserInterrupt() which can be passed to
// R_UnwindProtect():
SEXP check_interrupt(void *data) {
  R_CheckUserInterrupt();
  return R_NilValue;
}

// a wrapper around Rf_onintr() which can be passed to R_UnwindProtect():
SEXP my_onintr(void *data) {
  Rf_onintr();
  return R_NilValue;
}

// function called by R_UnwindProtect() to cleanup on interrupt
void cleanfun(void *data, Rboolean jump) {
  if (jump) {
// terminate threads cleanly ...
  }
}

void fun_calling_R_API(void *data) {
  // call some R API function, e.g. mkCharCE() ...
}

void *threaded_fun(void *td) {

  // ...

  pthread_mutex_lock(&r_api_mutex);

  // avoid false stack overflow error:
  uintptr_t R_CStackLimit_old = R_CStackLimit;
  R_CStackLimit = -1;


  // R_ToplevelExec() below will call PROTECT 4x:
  if (R_PPStackTop > R_PPStackSize - 4) {
// ppstack would overflow in R_ToplevelExec() -> handle this ...
  }

  // avoid longjmp to different thread:
  Rboolean ok = R_ToplevelExec(fun_calling_R_API, (void *) &some_data);

  // re-enable stack size checking:
  R_CStackLimit = R_CStackLimit_old;
  pthread_mutex_unlock(&r_api_mutex);

  if (!ok) {
// handle error ...
  }

  // ...
}

SEXP fun_running_in_main_thread() {

  // ...

  /* create continuation token for R_UnwindProtect():
   *
   * do this explicitly here before the threads are created because this might
   * fail in allocation

Re: [Rd] most robust way to call R API functions from a secondary thread

2019-05-21 Thread Andreas Kersting
Hi Simon,

Your response captures exactly why I am doing it that way rather than going 
with what Stepan proposed. Also good to hear that you consider my analysis to 
be pretty complete. Thanks for the feedback!

Regards,
Andreas

2019-05-20 15:54 GMT+02:00 Simon Urbanek:
> Stepan,
>
> Andreas gave a lot more thought to what you question in your reply. His 
> question was how you can avoid what you were proposing and have proper 
> threading under safe conditions. Having dealt with this before, I think 
> Andreas' write-up is pretty much the most complete analysis I have seen. I'd 
> wait for Luke to chime in as the ultimate authority if he gets to it.
>
> The "classic" approach which you mention is to collect and allocate 
> everything, then execute parallel code and then return. What Andreas is 
> proposing is obviously much more efficient: you only synchronize on R API 
> calls, which are likely a small fraction of the entire time, while you keep all 
> threads alive. His question was how to do that safely. (BTW: I really like 
> the touch of counting frames that toplevel exec can use ;) - it may make 
> sense to deal with that edge-case in R if we can ...).
>
> Cheers,
> Simon
>
>
>
>
>> On May 20, 2019, at 5:45 AM, Stepan  wrote:
>>
>> Hi Andreas,
>>
>> note that with the introduction of ALTREP, as far as I understand, calls as 
>> "simple" as DATAPTR can execute arbitrary code (R or native). Even without 
>> ALTREP, if you execute user-provided R code via Rf_eval and such on some 
>> custom thread, you may end up executing native code of some package, which 
>> may assume it is executed only from the R main thread.
>>
>> Could you (1) decompose your problem in a way that in some initial phase you 
>> pull all the necessary data from R, then start the parallel computation, and 
>> then again in the R main thread "submit" the results back to the R world?
>>
>> If you wanted something really robust, you can (2) "send" the requests for R 
>> API usage to the R main thread and pause the worker thread until it receives 
>> the results back. This looks similar to what the "later" package does. Maybe 
>> you can even use that package for your purposes?
>>
>> Do you want to parallelize your code to achieve better performance? Even 
>> with your proposed solution, you need synchronization and chances are that 
>> excessive synchronization will severely affect the expected performance 
>> benefits of parallelization. If you do not need to synchronize that much, 
>> then the question is if you can do with (1) or (2).
>>
>> Best regards,
>> Stepan
>>

Re: [Rd] [External] most robust way to call R API functions from a secondary thread

2019-05-21 Thread Andreas Kersting
Hi Luke,

Thanks also for your feedback! I will then follow the proposed route for the 
problem at hand and I will report back if I encounter any issues. 

I am also going to look into the issues of stack checking and R_ToplevelExec.

Regards,
Andreas

2019-05-20 19:29 GMT+02:00 Tierney, Luke:
> Your analysis looks pretty complete to me and your solutions seem 
> plausible.  That said, I don't know that I would have the level of
> confidence yet that we haven't missed an important point that I would
> want before going down this route.
> 
> Losing stack checking is risky; it might be eventually possible to
> provide some support for this to be handled via a thread-local
> variable. Ensuring that R_ToplevelExec can't jump before entering the
> body function would be a good idea; if you want to propose a patch we
> can have a look.
> 
> Best,
> 
> luke
> 

[Rd] make running on.exit expr uninterruptible

2019-05-22 Thread Andreas Kersting
Hi,

Is there currently any way to guarantee that on.exit does not fail to execute 
the recorded expression because of a user interrupt arriving during function 
exit? Consider:

f <- function() {
  suspendInterrupts({
on.exit(suspendInterrupts(cntr_on.exit <<- cntr_on.exit + 1L))
cntr_f <<- cntr_f + 1L
  })
  TRUE
}

It is possible to interrupt this function such that cntr_f is incremented while 
cntr_on.exit is not (you might need to adjust timeout_upper to trigger the 
error on your machine):

timeout_upper <- 0.1
repeat {
  cntr_f <- 0L
  cntr_on.exit <- 0L
  
  # timeout code borrowed from R.utils::withTimeout but with setTimeLimit()
# (correctly) placed inside tryCatch (otherwise timeout can occur before it can
  # be caught) and with time limit reset before going into the error handler
  res_list <- lapply(seq(0, timeout_upper, length.out = 1000), 
function(timeout) {
on.exit({
  setTimeLimit(cpu = Inf, elapsed = Inf, transient = FALSE)
})
tryCatch({
  setTimeLimit(cpu = timeout, elapsed = timeout, transient = TRUE)
  res <- f()
  
  # avoid timeout while running error handler
  setTimeLimit(cpu = Inf, elapsed = Inf, transient = FALSE)
  
  res
}, error = function(ex) {
  msg <- ex$message
  pattern <- gettext("reached elapsed time limit", "reached CPU time limit",
 domain = "R")
  pattern <- paste(pattern, collapse = "|")
  if (regexpr(pattern, msg) != -1L) {
FALSE
  }
  else {
stop(ex)
  }
})
  })
  print(sum(unlist(res_list)))  # number of times f completed
  stopifnot(cntr_on.exit == cntr_f)
}

Example output:

[1] 1000
[1] 1000
[1] 1000
[1] 1000
[1] 999
[1] 1000
[1] 1000
[1] 999
[1] 998
[1] 1000
[1] 998
[1] 1000
[1] 1000
[1] 1000
[1] 1000
[1] 999
Error: cntr_on.exit == cntr_f is not TRUE

I was bitten by this because an on.exit expression, which releases a file lock, 
was interrupted (before it actually executed) such that subsequent calls block 
indefinitely.
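
For illustration, the pattern looked roughly like this (lock_file() and 
unlock_file() are hypothetical stand-ins for the real locking helpers):

with_write_lock <- function(expr) {
  lock <- lock_file("write.lock")                # hypothetical helper
  # suspendInterrupts() protects the unlock expression itself, but an
  # interrupt can still arrive before the on.exit expression starts to run:
  on.exit(suspendInterrupts(unlock_file(lock)))  # hypothetical helper
  expr
}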

Regards,
Andreas


[Rd] error in parallel:::sendMaster

2019-11-27 Thread Andreas Kersting
Hi,

I am facing a very weird problem with parallel::mclapply. I have a script which 
does some data wrangling on an input dataset in parallel and then writes the 
results to disk. I have been using this script daily for more than one year 
always on an EC2 instance launched from the same AMI (no updates installed 
after launch) and processed thousands of different input data sets 
successfully. I now have an input dataset for which I face the following bug:

The basic outline of the problematic section of the script:

# parts is a data.table with 88 rows
mc_ret <- parallel::mclapply(sample.int(nrow(parts)), function(i) {
  # do some data wrangling and write the result to a file
  # ...

  print(paste0("part ", i, " written successfully."))
  return(TRUE)
}, mc.preschedule = FALSE, mc.cores = 2L)

str(mc_ret)
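
mclapply() only warns when a worker fails, so the script validates the return 
value itself; a minimal sketch of such a check (assuming TRUE is the only 
legitimate element value, as above):

ok <- vapply(mc_ret, isTRUE, logical(1L))
if (!all(ok))
  stop("lost results for parts: ", paste(which(!ok), collapse = ", "))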


Expected output: "part i written successfully." is printed 88 times, once for 
each value of i. mc_ret is a list of length 88, each element being TRUE. Its 
structure is printed once. All outputs are created successfully.

Actual output (see end of the message): "part i written successfully." is 
printed 88 times, once for each value of i. mc_ret is a list of length 88, each 
element being TRUE. Its structure is printed. All outputs are created 
successfully. So far so good.

But then "part i written successfully." it is printed another X times, for 
values of i for which it was already printed. This output is intermingled with 
X-1 times the following error message:

Error in sendMaster(try(eval(expr, env), silent = TRUE)) :
  write error, closing pipe to the master
Calls: lapply ...  ->  -> mcparallel -> sendMaster

and Y times the message "Execution halted". mc_ret is printed again, now being 
a list of length 85, with the first element being TRUE and all other elements 
being NULL. X and Y vary from run to run.


Now to the main problem: I tried very hard to create a reproducible example, 
but I failed. What I observed:
- The output is (and has always been) written to a path which is on an NFS share. 
If I instead write to a path on a local disk it will work.
- The script is invoked using Rscript. If I instead source it from an 
interactive R session it works. There are at least two more people who have 
observed this: 
https://stackoverflow.com/questions/51986674/mclapply-sendmaster-error-only-with-rscript
- Before the call to mclapply the code acquires an exclusive file lock on a 
dedicated lock file, not written to but also on the NFS share. If I remove the 
code acquiring the lock, the whole script will also work if called using 
Rscript.
- The problem also occurs for mc.preschedule = TRUE.
- There is no error if I set mc.cores to 1.
- And stressing again: the code works without any changes from Rscript for 
thousands of other data sets.


Rscript -e "sessionInfo()":
R version 3.5.2 (2018-12-20)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 18.04.2 LTS

Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/openblas/libblas.so.3
LAPACK: /usr/lib/x86_64-linux-gnu/libopenblasp-r0.2.20.so

locale:
 [1] LC_CTYPE=C.UTF-8   LC_NUMERIC=C   LC_TIME=C.UTF-8
 [4] LC_COLLATE=C.UTF-8 LC_MONETARY=C.UTF-8LC_MESSAGES=C.UTF-8
 [7] LC_PAPER=C.UTF-8   LC_NAME=C  LC_ADDRESS=C
[10] LC_TELEPHONE=C LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats graphics  grDevices utils datasets  methods   base

loaded via a namespace (and not attached):
[1] compiler_3.5.2


I know this is a fairly old R version. I have not been able to reproduce the 
bug with a more recent version, but since it is so difficult to trigger, this 
does not mean much, I guess. I have looked through the changes made to the code 
of mclapply since that version and could not find something directly related. I 
am not even sure if it is a problem in the parallel package or some other 
(memory) bug. What strikes me is that others have observed a very similar error 
when using Rscript but not when using an interactive R session, just like I do.

I am not expecting a fix based on the information I provide, but maybe someone 
has some thoughts on this!?

Regards,
Andreas




Actual output:

[1] "part 51 written successfully."
[1] "part 30 written successfully."
[1] "part 32 written successfully."
[1] "part 48 written successfully."
[1] "part 63 written successfully."
[1] "part 5 written successfully."
[1][1] "part 14 written successfully." "part 18 written successfully."

[1] "part 38 written successfully."
[1] "part 11 written successfully."
[1] "part 68 written successfully."
[1] "part 45 written successfully."
[1] "part 88 written successfully."
[1] "part 36 written successfully."
[1] "part 44 written successfully."
[1] "part 55 written successfully."
[1] "part 26 written successfully."
[1] "part 37 written successfully."
[1] "part 22 written successfully."
[1] "part 13 written successfully."
[1] "part 67 written successfully."
[1] "part 10 written successfu

Re: [Rd] error in parallel:::sendMaster

2019-11-27 Thread Andreas Kersting
Hi again,

One important correction of my first message: I misinterpreted the output. 
Actually in that R session 2 input files were processed one after the other in 
a loop. The first (with 88 parts) went fine. The second (with 85 parts) 
produced the sendMaster errors and failed. If (in a new session via Rscript) I 
only process the second input file it will work. The other observations on R vs 
Rscript, NFS share etc. still hold.

Sorry for this! Regards,
Andreas


Re: [Rd] error in parallel:::sendMaster

2019-11-27 Thread Andreas Kersting
Breakpoint 1 at 0x7f5016f3e0a0: file fork.c, line 681.
(gdb) c
Continuing.

Breakpoint 1, mc_send_master (what=0x563382a71910) at fork.c:681
warning: Source file is more recent than executable.
681 {
(gdb) info args
what = 0x563382a71910
(gdb) n
682 if (is_master)
(gdb) n
684 if (master_fd == -1) 
(gdb) n
686 if (TYPEOF(what) != RAWSXP) 
(gdb) n
688 R_xlen_t len = XLENGTH(what);
(gdb) n
689 unsigned char *b = RAW(what);
(gdb) n
693 if (writerep(master_fd, &len, sizeof(len)) != sizeof(len)) {
(gdb) info locals
len = 526
b = 
n = 
(gdb) s
writerep (fildes=7, buf=buf@entry=0x7fff4027ad60, nbyte=nbyte@entry=8)
at fork.c:653
653 {
(gdb) info args
fildes = 7
buf = 0x7fff4027ad60
nbyte = 8
(gdb) n
654 size_t wbyte = 0;
(gdb) n
653 {
(gdb) n
657 ssize_t w = write(fildes, ptr + wbyte, nbyte - wbyte);
(gdb) n
658 if (w == -1) {
(gdb) info locals
w = 
wbyte = 0
ptr = 0x7fff4027ad60 "\016\002"
(gdb) n
657 ssize_t w = write(fildes, ptr + wbyte, nbyte - wbyte);
(gdb) n
658 if (w == -1) {
(gdb) n
659 if (errno == EINTR)
(gdb) n
674 }
(gdb) p __errno_location()
$1 = (int *) 0x7f50322cb540
(gdb) x/x $1
0x7f50322cb540: 0x0009
(gdb) python import errno
(gdb) python print(errno.errorcode[9])
EBADF
(gdb) n
mc_send_master (what=) at fork.c:702
702 close(master_fd);
(gdb) n
704 error(_("write error, closing pipe to the master"));
(gdb) n
703 master_fd = -1;
(gdb) n
704 error(_("write error, closing pipe to the master"));
(gdb) n
685 error(_("there is no pipe to the master process"));


Does this help in any way? 

Is there something else I can/should look at?

Regards,
Andreas


2019-11-27 15:04 GMT+01:00 Tomas Kalibera:
> Hi Andreas,
> the error is reported when some child process cannot send results to the 
> master process, which originates from an error returned by write() - when 
> write() returns -1 or 0. The logic around the writing has not changed since R 
> 3.5.2. It should not be related to the printing in the child, only to 
> returning the value. The problem may be originating from the execution 
> environment, virtualization, and/or possibly from a lack of robustness in R. 
> To resolve this we need to find out which error was returned and why. Either 
> you can try to create a reproducible example (something I could use to 
> trigger an error on my system and then debug) or to debug on your system 
> (build R from source, ensure the bug is still triggered, then instrument to 
> print the exact error from the OS and where it was detected, etc). In 
> principle you could also try without code instrumentation just using strace. 
> Just from looking at the code in R around the writing I am not seeing any bug 
> there. If you choose to debug o
 n your system I can help with the instrumentation.
> 
> Best
> Tomas
> 

Re: [Rd] error in parallel:::sendMaster

2019-11-28 Thread Andreas Kersting
> The question is 
> why the pipe has been closed prematurely; it could be accidentally by R 
> (a race condition in the cleanup code in fork.c) or possibly by some 
> other code running in the same process (maybe the R program itself or 
> some other code it runs). Maybe we can take this off the list and come 
> back when we know the cause or have it fixed.
> 
> It would help a lot if you could try with R built from source, with 
> optimizations disabled to get more accurate debug symbols (e.g. env 
> CFLAGS="-Wall -O0 -gdwarf-2 -g3" CXXFLAGS="-Wall -O0 -gdwarf-2 -g3" 
> ./configure), and with MC_DEBUG defined in fork.c - line 26. Ideally in 
> R-devel, so that we are sure the problem still exists. The debug 
> messages should give a hint whether it was R (fork.c) that closed the 
> pipe and why. Maybe you could also add a debug message to 
> close_fds_child_ci() to see if it was closed there. Maybe you could find 
> this out even in your current debugging setup via breakpoints and 
> backtraces, but I think it may be easier to build from source with these 
> debugging messages.
> 
> Also if you could send me a complete example I could run that causes 
> this on your system, that would be nice (even if it didn't cause the 
> problem on my system).
> 
> Thanks
> Tomas
> 
> On 11/28/19 6:35 AM, Andreas Kersting wrote:
>> Hi Tomas,
>>
>> Thanks for your prompt reply and your offer to help. I might need to get 
>> back to this since I am not too experienced in debugging these kinds of 
>> issues. Anyway, I gave it a try and I think I have found the immediate 
>> cause:
>>
>> I installed the debug symbols (r-base-core-dbg), placed 
>> https://github.com/wch/r-source/blob/tags/R-3-5-2/src/library/parallel/src/fork.c 
>> in cwd and changed the wrapper code to:
>>
>> mc_ret <- parallel::mclapply(seq_len(nrow(parts)), function(i) {
>>   # we fail for the input resulting in parts having 85 rows
>>   if (nrow(parts) == 85L && !file.exists(as.character(Sys.getpid()))) {
>>     file.create(as.character(Sys.getpid()))
>>     print(Sys.getpid())
>>     Sys.sleep(30)
>>   }
>>
>>   # ...
>>
>>   return(TRUE)
>> }, mc.preschedule = TRUE, mc.cores = 2L)
>>
>> This way I ended up with only two child processes to which I each 
>> attached a debugger. In total I ran about 10 debugging sessions and it 
>> was always the second child process failing. The errno after write 
>> returned -1 was 9 (EBADF).
>>
>> From what I can see, the reason for this is that the second child tries 
>> to write to fd 7, but already during the very beginning of the first 
>> invocation of the anonymous function to parallelize, i.e. during 
>> Sys.sleep(30), there is no such file descriptor. From this observation I 
>> would conclude that it is NOT the code run from that function, i.e. 
>> # ..., causing the issue. Let me point out again, that this is NOT the 
>> very first invocation of mclapply in this R session. There is at least 
>> one previous call to it, which works fine.
>>
>>
>> File descriptors directly after attaching gdb to both child processes 
>> during Sys.sleep(30):
>>
>> ### master
>> root@ip-10-0-48-30:~/latest_test# ls -l /proc/22119/fd
>> total 0
>> lrwx------ 1 root root 64 Nov 28 04:49 0 -> /dev/pts/0
>> lrwx------ 1 root root 64 Nov 28 04:49 1 -> /dev/pts/0
>> lrwx------ 1 root root 64 Nov 28 04:49 2 -> /dev/pts/0
>> lr-x------ 1 root root 64 Nov 28 04:49 3 -> /path/to/script.R
>> lrwx------ 1 root root 64 Nov 28 04:49 4 -> /path/on/nfs/write.lock
>> lr-x------ 1 root root 64 Nov 28 04:49 5 -> 'pipe:[266120]'
>> l-wx------ 1 root root 64 Nov 28 04:49 8 -> 'pipe:[266121]'
>>
>>
>> ### first child (writes to fd 6)
>> (gdb) shell ls -l /proc/22134/fd
>> total 0
>> lrwx------ 1 root root 64 Nov 28 04:42 0 -> /dev/pts/0
>> lrwx------ 1 root root 64 Nov 28 04:42 1 -> /dev/pts/0
>> lrwx------ 1 root root 64 Nov 28 04:42 2 -> /dev/pts/0
>> lr-x------ 1 root root 64 Nov 28 04:42 3 -> /path/to/script.R
>> lrwx------ 1 root root 64 Nov 28 04:42 4 -> /path/on/nfs/write.lock
>> l-wx------ 1 root root 64 Nov 28 04:42 6 -> 'pipe:[266120]'
>> l-wx------ 1 root root 64 Nov 28 04:42 8 -> 'pipe:[266121]'
>>
>> ### second child (tries writing to fd 7)
>> (gdb) shell ls -l /proc/22135/fd
>> total 0
>> lr-x------ 1 root root 64 Nov 28 04:42 0 -> 'pipe:[266123]'
>> lrwx------ 1 root root 64

Re: [Rd] error in parallel:::sendMaster

2019-12-04 Thread Andreas Kersting
Hi all,

With the help of Tomas, I was able to track the issue down: Prior to R v3.6.0 
the parallel package passes an uninitialized variable as the file descriptor 
argument to the close system call. 

In my particular R session this uninitialized variable (reproducibly) was 
holding the value 7, which corresponded to the file descriptor of the write end 
of the pipe the second child would use to send its results to the master. 
Hence, the child unintentionally closed this pipe directly after fork in 
close_fds_child_ci(), resulting in sendMaster() later failing with EBADF.

It was fixed with this commit: 
https://github.com/wch/r-source/commit/e08cffac1c5b9015a1625938d568b648eb1d8aee
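
As a rough guard until affected machines are upgraded, a session can check 
whether it still contains the uninitialized close:

getRversion() < "3.6.0"  # TRUE means the fix above is not included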

Regards,
Andreas

2019-11-28 13:54 GMT+01:00 Andreas Kersting:
> Hi Tomas,
> 
> I rebuilt R (v3.5.2 for now, R-devel to follow) from the Debian package with 
> MC_DEBUG defined and hopefully also with "-Wall -O0 -gdwarf-2 -g3", though I 
> still have to verify this.
> 
> Below is the output. I think it is a total of two mclapply invocations in 
> this R session, the failing one starting around the lines "[1] 15381" and 
> "[1] 15382". The "Error in partDF::write_partDF ..." is because the 
> script/package checks the return value of mclapply and detects that it is not 
> a list of length 85 with only the elements "TRUE".
> 
> Regarding sending you the complete example: I first have to figure out if 
> this is possible at all, because it would involve data of a client.
> 
> Regards,
> Andreas
> 
> parent[15366] created pipes: comm (6->5), sir (8->7)
> parent registers new child 15379
> child process 15379 started
> parent[15366] created pipes: comm (7->6), sir (10->9)
> parent registers new child 15380
> child process 15380 started
> select_children: added child 15380 (6)
> select_children: added child 15379 (5)
> select_children: maxfd=6, wlen=2, wcount=2, timeout=-1.00
> child 15380: send_master (550 bytes)
>   sr = 1
>  - read select 1 children: 15380 
> child 15380: 'mcexit' called
> child 15380 is waiting for permission to exit
> read_child_ci(15380) - read length returned 8
> read_child_ci(15380) - read 550 at 0 returned 550
> select_children: added child 15380 (6)
> select_children: added child 15379 (5)
> select_children: maxfd=6, wlen=2, wcount=2, timeout=-1.00
>   sr = 1
>  - read select 1 children: 15380 
> read_child_ci(15380) - read length returned 8
> detached child 15380 (signal 10)
> child process 15380 got SIGUSR1; child_exit_status=-1
> child 15380: exiting
> select_children: added child 15379 (5)
> select_children: maxfd=5, wlen=1, wcount=1, timeout=-1.00
> child 15380 terminated with exit status 0
> child 15379: send_master (550 bytes)
>   sr = 1
>  - read select 1 children: 15379 
> read_child_ci(15379) - read length returned 8
> read_child_ci(15379) - read 550 at 0 returned 550
> select_children: added child 15379 (5)
> select_children: maxfd=5, wlen=1, wcount=1, timeout=-1.00
> child 15379: 'mcexit' called
> child 15379 is waiting for permission to exit
>   sr = 1
>  - read select 1 children: 15379 
> read_child_ci(15379) - read length returned 8
> detached child 15379 (signal 10)
> child process 15379 got SIGUSR1; child_exit_status=-1
> child 15379: exiting
> removing waited-for child 15380 from the list
> killed detached child 15379 (signal 15)
> removing waited-for child 2147483647 from the list
> child 15379 terminated with exit status 0
> removing waited-for child 15379 from the list
> parent[15366] created pipes: comm (6->5), sir (8->7)
> parent registers new child 15381
> child process 15381 started
> parent[15366] created pipes: comm (7->6), sir (10->9)
> [1] 15381
> parent registers new child 15382
> child process 15382 started
> select_children: added child 15382 (6)
> select_children: added child 15381 (5)
> select_children: maxfd=6, wlen=2, wcount=2, timeout=-1.00
>   sr = 1
>  - read select 1 children: 15382 
> read_child_ci(15382) - read length returned 0
> detached child 15382 (signal 10)
> child process 15382 got SIGUSR1; child_exit_status=-1
> select_children: added child 15381 (5)
> select_children: maxfd=5, wlen=1, wcount=1, timeout=-1.00
> [1] 15382
> child 15382: send_master (526 bytes)
> Error in sendMaster(try(lapply(X = S, FUN = FUN, ...), silent = TRUE)) : 
>   write error, closing pipe to the master
> Calls: lapply ...  ->  -> lapply -> FUN -> sendMaster
> child 15382: 'mcexit' called
> child 15382: exiting
> child 15382 terminated with exit status 1
> child 15381: send_master (538 bytes)
>   sr = 1
>  - read select 1 children: 15381 
> read_child_ci(1538


Re: [Rd] Error in close.connection(p) : ignoring SIGPIPE signal

2019-12-05 Thread Andreas Kersting
Hi Benjamin,

you cannot pipe to echo, since it does not read from stdin. 

echo just echoes its first argument, i.e. echo /dev/stdin > /dev/null will echo 
the string "/dev/stdin" to /dev/stdout, which is redirected to /dev/null.

Try 

p <- pipe("cat > /dev/null", open = "w")

instead.
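
With cat consuming stdin, your loop should then run without hitting SIGPIPE (a 
minimal sketch - your original code with only the pipe command swapped):

   cnt <- 0L
   while (TRUE) {
       cnt <- cnt + 1L
       p <- pipe("cat > /dev/null", open = "w")  # cat reads stdin until EOF
       writeLines("foobar", p)
       tryCatch(close(p), error = function(e) { print(cnt); stop(e)})
   }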

Regards,
Andreas

2019-12-06 02:46 GMT+01:00 Benjamin Tyner:
> Not sure if this is a bug, so posting here first. If I run:
>    cnt <- 0L
>    while (TRUE) {
>        cnt <- cnt + 1L
>        p <- pipe("echo /dev/stdin > /dev/null", open = "w")
>        writeLines("foobar", p)
>        tryCatch(close(p), error = function(e) { print(cnt); stop(e)})
>    }
> 
> then once cnt gets to around 650, it fails with:
> 
>    [1] 654
>    Error in close.connection(p) : ignoring SIGPIPE signal
> 
> Should I not be using pipe() in this way? Here is my sessionInfo()
> 
>    R version 3.6.0 (2019-04-26)
>    Platform: x86_64-pc-linux-gnu (64-bit)
>    Running under: Ubuntu 18.04.3 LTS
> 
>    Matrix products: default
>    BLAS:   /home/btyner/R360/lib64/R/lib/libRblas.so
>    LAPACK: /home/btyner/R360/lib64/R/lib/libRlapack.so
> 
>    locale:
>     [1] LC_CTYPE=en_US.UTF-8   LC_NUMERIC=C
>     [3] LC_TIME=en_US.UTF-8    LC_COLLATE=en_US.UTF-8
>     [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
>     [7] LC_PAPER=en_US.UTF-8   LC_NAME=C
>     [9] LC_ADDRESS=C   LC_TELEPHONE=C
>    [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
> 
>    attached base packages:
>    [1] stats graphics  grDevices utils datasets  methods base
> 
>    loaded via a namespace (and not attached):
>    [1] compiler_3.6.0
> 
> Regards,
> Ben
> 
> __
> R-devel@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel
> 
__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] Missing objects using dump.frames for post-mortem debugging of crashed batch jobs. Bug or gap in documentation?

2016-11-28 Thread Andreas Kersting
 cat(gettext("Available environments had calls:\n"))
cat(paste0(1L:n, ": ", calls), sep = "\n")
cat(gettext("\nEnter an environment number, or 0 to exit  "))
repeat {
ind <- .Call(C_menu, as.character(calls))
if(ind <= n) break
}
if(ind == 0L) return(invisible())
# debugger.look(ind)
cat(gettext("Browsing in the environment with call:\n   "),
calls[ind], "\n", sep = "")
evalq(browser(), envir = dump[[ind]])
}
}

So instead of copying all objects of the chosen frame to some new 
environment, i.e. the frame of debugger.look(), we directly inspect the 
dumped one with evalq(browser(), envir = dump[[ind]]). This way we do 
not alter the enclosing environment of the frame.


If the global environment was included in the dump, we change the 
enclosing environment of the dumped .GlobalEnv to search()[2]. For all 
other dumped frames which have the global environment as their enclosing 
one, we change their enclosing environment to the dumped .GlobalEnv.
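
A minimal sketch of this re-parenting, assuming the dumped global environment 
is stored in the dump under the (hypothetical) name ".GlobalEnv" - how it is 
actually named depends on how dump.frames() stores it:

ge <- dump[[".GlobalEnv"]]  # hypothetical element name, see above
if (!is.null(ge)) {
    parent.env(ge) <- as.environment(search()[2])
    for (i in seq_along(dump)) {
        if (identical(parent.env(dump[[i]]), globalenv()))
            parent.env(dump[[i]]) <- ge
    }
}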


By doing so we should get an environment tree which is closer to the one 
when dump.frames() was called, with an obvious (potential) difference 
being the search path.


Andreas


The semantic difference is that the global variable "g" is visible
within the function "f" in the first version, but not in the second
version.

If I dump to a file and load and debug it, then the search path through the
frames is not the same during run time vs. debug time.

An implementation with the same semantics could be achieved
by applying this workaround currently:

  dump.frames()
  save.image(file = "last.dump.rda")

Does it possibly make sense to unify the semantics?

THX!


On Mon, 2016-11-14 at 11:34 +0100, Martin Maechler wrote:
> >>>>> nospam at altfeld-im de 
> >>>>> on Sun, 13 Nov 2016 13:11:38 +0100 writes:
>
> > Dear R friends, to allow post-mortem debugging In my
> > Rscript based batch jobs I use
>
> >tryCatch( , error = function(e) {
> > dump.frames(to.file = TRUE) })
>
> > to write the called frames into a dump file.
>
> > This is similar to the method recommended in the "Writing
> > R extensions" manual in section 4.2 Debugging R code (page
> > 96):
>
> > https://cran.r-project.org/doc/manuals/R-exts.pdf
>
> >> options(error = quote({dump.frames(to.file=TRUE); q()}))
>
>
>
> > When I load the dump later in a new R session to examine
> > the error I use
>
> > load(file = "last.dump.rda")
> > debugger(last.dump)
>
> > My problem is that the global objects in the workspace are
> > NOT contained in the dump since "dump.frames" does not
> > save the workspace.
>
> > This makes debugging difficult.
>
>
>
> > For more details see the stackoverflow question + answer
> > in:
> > 
https://stackoverflow.com/questions/40421552/r-how-make-dump-frames-include-all-variables-for-later-post-mortem-debugging/40431711#40431711
>
>
>
> > I think the reason of the problem is:
> > 
>
> > If you use dump.files(to.file = FALSE) in an interactive
> > session debugging works as expected because it creates a
> > global variable called "last.dump" and the workspace is
> > still loaded.
>
> > In the batch job scenario however the workspace is NOT
> > saved in the dump and therefore lost if you debug the dump
> > in a new session.
>
>
> > Options to solve the issue:
> > --
>
> > 1. Improve the documentation of the R help for
> > "dump.frames" and the R_exts manual to propose another
> > code snippet for batch job scenarios:
>
> >   dump.frames()
> >   save.image(file = "last.dump.rda")
>
> > 2. Change the semantics of "dump.frames(to.file = TRUE)"
> > to include the workspace in the dump.  This would change
> > the semantics implied by the function name but makes the
> > semantics consistent for both "to.file" param values.
>
> There is a third option, already in place for three months now:
> Andreas Kersting did propose it (nicely, as a wish),
>https://bugs.r-project.org/bugzilla/show_bug.cgi?id=17116
> and I had added it to the development version of R back then :
>
> 
> r71102 | maechler | 2016-08-16 17:36:10 +0200 (Tue, 16 Aug 2016) | 1 line
>
> dump.frames(*, include.GlobalEnv)
> 
>
> So, if you (or others) want to use this before next spring,
> you should install a version of R-devel
> and you use that, with
>
>   tryCatch( ,
>error = function(e)
>   dump.frames(to.file = TRUE, include.GlobalEnv = TRUE))
>
> Using R-devel is nice and helpful for the R community, as you
> will help finding bugs/problems in the new features (and
> possibly changed features) we've introduced there.
>
>
> Best regards,
> Martin


__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


[Rd] print for lists evaluates "AsIs"-elements of type "language"

2017-06-13 Thread Andreas Kersting

Consider the following code snippets:

> list(quote(1 + 1), I(quote(1 + 1)))
[[1]]
1 + 1

[[2]]
[1] 2  # should also be 1 + 1!?

> str(list(quote(1 + 1), I(quote(1 + 1))))
List of 2
 $ : language 1 + 1
 $ : language + 1 1  # why is this line different from the one above?
  ..- attr(*, "class")= chr "AsIs"

> quote(1 + 1)
1 + 1

> I(quote(1 + 1))
1 + 1  # OK

> str(quote(1 + 1))
 language 1 + 1

> str(I(quote(1 + 1)))
 language + 1 1  # again different
 - attr(*, "class")= chr "AsIs"


This inconsistency is particularly striking when printing the individual 
elements of a list works but printing the whole list does not:


> l <- list(quote(length(a)), I(quote(length(a))))

> l[[1]]
length(a)

> l[[2]]
length(a)

> l
[[1]]
length(a)

[[2]]
Error in print(length(a)) : object 'a' not found


Should we consider this a bug? If so, I can add it to Bugzilla.
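
A possible workaround in the meantime is to strip the "AsIs" class before 
printing, so that print() dispatches on the plain language objects (a minimal 
sketch, using the list l from above):

> print(lapply(l, unclass))
[[1]]
length(a)

[[2]]
length(a)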

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


[Rd] [WISH / PATCH] possibility to split string literals across multiple lines

2017-06-14 Thread Andreas Kersting

Hi,

I would really like to have a way to split long string literals across 
multiple lines in R.


Currently, if a string literal spans multiple lines, there is no way to 
inhibit the introduction of newline characters:


> "aaa
+ bbb"
[1] "aaa\nbbb"


If a line ends with a backslash, it is just ignored:

> "aaa\
+ bbb"
[1] "aaa\nbbb"


We could use this fact to implement string splitting in a fairly 
backward-compatible way, since currently such trailing backslashes 
should hardly be used as they do not have any effect. The attached patch 
makes the parser ignore a newline character directly following a backslash:


> "aaa\
+ bbb"
[1] "aaabbb"


I personally would also prefer if leading blanks (spaces and tabs) in 
the second line are ignored to allow for proper indentation:


>   "aaa \
+bbb"
[1] "aaa bbb"

>   "aaa\
+\ bbb"
[1] "aaa bbb"

This is also implemented by this patch.


An alternative approach could be to have something like

("aaa "
"bbb")

or

("aaa ",
"bbb")

be interpreted as "aaa bbb".

I don't know the ins and outs of the parser of R (hence: please very 
carefully review the attached patch), but I guess this would be more 
work to implement!?



What do you think? Is there anybody else who is missing this feature in 
the first place?


Regards,
Andreas
Index: src/main/gram.c
===
--- src/main/gram.c	(Revision 72789)
+++ src/main/gram.c	(Arbeitskopie)
@@ -4646,10 +4646,17 @@
 int wcnt = 0;
 ucs_t wcs[10001];
 Rboolean oct_or_hex = FALSE, use_wcs = FALSE, currtext_truncated = FALSE;
+Rboolean backslash_was_newline = FALSE, ignore_next_blank = FALSE;
 
 CTEXT_PUSH(c);
 while ((c = xxgetc()) != R_EOF && c != quote) {
 	CTEXT_PUSH(c);
+	if (ignore_next_blank) {
+	if (c == ' ' || c == '\t')
+	continue;
+	else
+	ignore_next_blank = FALSE;
+	} 
 	if (c == '\n') {
 	xxungetc(c); CTEXT_POP();
 	/* Fix suggested by Mark Bravington to allow multiline strings
@@ -4657,6 +4664,7 @@
 	 * return ERROR;
 	 */
 	c = '\\';
+	backslash_was_newline = TRUE;
 	}
 	if (c == '\\') {
 	c = xxgetc(); CTEXT_PUSH(c);
@@ -4815,8 +4823,14 @@
 		case '\'':
 		case '`':
 		case ' ':
+		break;
 		case '\n':
-		break;
+		if (backslash_was_newline) {
+		backslash_was_newline = FALSE;
+		break;
+		}
+		ignore_next_blank = TRUE;
+		continue;
 		default:
 		*ct = '\0';
 		errorcall(R_NilValue, _("'\\%c' is an unrecognized escape in character string starting \"%s\""), c, currtext);
Index: src/main/gram.y
===
--- src/main/gram.y	(Revision 72789)
+++ src/main/gram.y	(Arbeitskopie)
@@ -2308,10 +2308,17 @@
 int wcnt = 0;
 ucs_t wcs[10001];
 Rboolean oct_or_hex = FALSE, use_wcs = FALSE, currtext_truncated = FALSE;
+Rboolean backslash_was_newline = FALSE, ignore_next_blank = FALSE;
 
 CTEXT_PUSH(c);
 while ((c = xxgetc()) != R_EOF && c != quote) {
 	CTEXT_PUSH(c);
+	if (ignore_next_blank) {
+	if (c == ' ' || c == '\t')
+	continue;
+	else
+	ignore_next_blank = FALSE;
+	} 
 	if (c == '\n') {
 	xxungetc(c); CTEXT_POP();
 	/* Fix suggested by Mark Bravington to allow multiline strings
@@ -2319,6 +2326,7 @@
 	 * return ERROR;
 	 */
 	c = '\\';
+	backslash_was_newline = TRUE;
 	}
 	if (c == '\\') {
 	c = xxgetc(); CTEXT_PUSH(c);
@@ -2477,8 +2485,14 @@
 		case '\'':
 		case '`':
 		case ' ':
+		break;
 		case '\n':
-		break;
+		if (backslash_was_newline) {
+		backslash_was_newline = FALSE;
+		break;
+		}
+		ignore_next_blank = TRUE;
+		continue;
 		default:
 		*ct = '\0';
 		errorcall(R_NilValue, _("'\\%c' is an unrecognized escape in character string starting \"%s\""), c, currtext);
__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel

Re: [Rd] [WISH / PATCH] possibility to split string literals across multiple lines

2017-06-14 Thread Andreas Kersting
On Wed, 14 Jun 2017 06:12:09 -0500, Duncan Murdoch  
wrote:

> On 14/06/2017 5:58 AM, Andreas Kersting wrote:
> > Hi,
> >
> > I would really like to have a way to split long string literals across
> > multiple lines in R.
> 
> I don't understand why you require the string to be a literal.  Why not 
> construct the long string in an expression like
> 
>   paste0("aaa",
>  "bbb")
> 
> ?  Surely the execution time of the paste0 call is negligible.
> 
> Duncan Murdoch

Actually "execution time" is precisely one of the reasons why I would like to 
see this feature as - depending on the context (e.g. in a tight loop) - the 
execution time of paste0 (or probably also glue, thanks Gabor) is not 
necessarily insignificant. 
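
For a rough sense of the overhead (a minimal sketch; exact timings will of 
course vary by machine):

system.time(for (i in 1:1e6) y <- paste0("aaa", "bbb"))  # one call per iteration
system.time(for (i in 1:1e6) y <- "aaabbb")              # plain string literal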

The other reason is style: I think it is cleaner if we can construct such a 
long string literal without the need for a function call.

Andreas

> >
> > Currently, if a string literal spans multiple lines, there is no way to
> > inhibit the introduction of newline characters:
> >
> >  > "aaa
> > + bbb"
> > [1] "aaa\nbbb"
> >
> >
> > If a line ends with a backslash, it is just ignored:
> >
> >  > "aaa\
> > + bbb"
> > [1] "aaa\nbbb"
> >
> >
> > We could use this fact to implement string splitting in a fairly
> > backward-compatible way, since currently such trailing backslashes
> > should hardly be used as they do not have any effect. The attached patch
> > makes the parser ignore a newline character directly following a backslash:
> >
> >  > "aaa\
> > + bbb"
> > [1] "aaabbb"
> >
> >
> > I personally would also prefer if leading blanks (spaces and tabs) in
> > the second line are ignored to allow for proper indentation:
> >
> >  >   "aaa \
> > +bbb"
> > [1] "aaa bbb"
> >
> >  >   "aaa\
> > +\ bbb"
> > [1] "aaa bbb"
> >
> > This is also implemented by this patch.
> >
> >
> > An alternative approach could be to have something like
> >
> > ("aaa "
> > "bbb")
> >
> > or
> >
> > ("aaa ",
> > "bbb")
> >
> > be interpreted as "aaa bbb".
> >
> > I don't know the ins and outs of the parser of R (hence: please very
> > carefully review the attached patch), but I guess this would be more
> > work to implement!?
> >
> >
> > What do you think? Is there anybody else who is missing this feature in
> > the first place?
> >
> > Regards,
> > Andreas
> >
> >
> >
> > __
> > R-devel@r-project.org mailing list
> > https://stat.ethz.ch/mailman/listinfo/r-devel
> >

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] [WISH / PATCH] possibility to split string literals across multiple lines

2017-06-14 Thread Andreas Kersting


 Original Message 
From: Duncan Murdoch [mailto:murdoch.dun...@gmail.com]
Sent: Wednesday, Jun 14, 2017 1:36 PM GMT
To: Andreas Kersting
Cc: r-devel
Subject: [Rd] [WISH / PATCH] possibility to split string literals across 
multiple lines



On 14/06/2017 6:45 AM, Andreas Kersting wrote:

On Wed, 14 Jun 2017 06:12:09 -0500, Duncan Murdoch
 wrote:


On 14/06/2017 5:58 AM, Andreas Kersting wrote:

Hi,

I would really like to have a way to split long string literals across
multiple lines in R.


I don't understand why you require the string to be a literal.  Why not
construct the long string in an expression like

  paste0("aaa",
 "bbb")

?  Surely the execution time of the paste0 call is negligible.

Duncan Murdoch


Actually "execution time" is precisely one of the reasons why I would
like to see this feature as - depending on the context (e.g. in a
tight loop) - the execution time of paste0 (or probably also glue,
thanks Gabor) is not necessarily insignificant.


You also need to consider implementation time.  This is not just changes
to R itself; trailing backslashes *are* used in some packages (e.g.
geoparser), so those packages would need to be identified and modified
and resubmitted to CRAN.


I am totally with you on this "runtime vs. implementation time" issue. 
That is why I proposed the patch as I did: it seemed to require only 
minor changes to base R and I didn't see how it could be incompatible 
with existing code.


Actually, I still cannot see how a package could have potentially *used* 
backslashes immediately followed by newlines up to now, since those 
backslashes were just ignored by the parser (and changes to the function 
StringValue only concern the parser, don't they?). Of course I cannot 
rule out the possibility that there is code like

var <- "aaa\
bbb"
around, but this would be based on the undocumented(?) features that 
"backslash newline" is a valid escape sequence and that it is treated as 
"newline".


Maybe it's a good idea to show some more examples of how the patched parser 
behaves. There should only be a difference from the current implementation 
if a string literal spans multiple lines and a line ends in an odd 
number of backslashes (see the last example):


> "aaa\\
+ bbb"
[1] "aaa\\\nbbb"

> "aaa\\nbbb"
[1] "aaa\\nbbb"

> "aaa\\\nbbb"
[1] "aaa\\\nbbb"

> "aaa\\"
[1] "aaa\\"

> "aaa\\\n"
[1] "aaa\\\n"

> "aaa"
[1] "aaa"

> "aaa\n"
[1] "aaa\n"

> "aaa
+ bbb"
[1] "aaa\nbbb"

> "aaa\\\
+ bbb"
[1] "aaa\\bbb"

Andreas


Core changes to existing behaviour need really strong arguments, and I'm
just not seeing those here.

Duncan Murdoch


The other reason is style: I think it is cleaner if we can construct
such a long string literal without the need for a function call.

Andreas



Currently, if a string literal spans multiple lines, there is no way to
inhibit the introduction of newline characters:

 > "aaa
+ bbb"
[1] "aaa\nbbb"


If a line ends with a backslash, it is just ignored:

 > "aaa\
+ bbb"
[1] "aaa\nbbb"


We could use this fact to implement string splitting in a fairly
backward-compatible way, since currently such trailing backslashes
should hardly be used as they do not have any effect. The attached
patch
makes the parser ignore a newline character directly following a
backslash:

 > "aaa\
+ bbb"
[1] "aaabbb"


I personally would also prefer if leading blanks (spaces and tabs) in
the second line are ignored to allow for proper indentation:

 >   "aaa \
+bbb"
[1] "aaa bbb"

 >   "aaa\
+\ bbb"
[1] "aaa bbb"

This is also implemented by this patch.


An alternative approach could be to have something like

("aaa "
"bbb")

or

("aaa ",
"bbb")

be interpreted as "aaa bbb".

I don't know the ins and outs of the parser of R (hence: please very
carefully review the attached patch), but I guess this would be more
work to implement!?


What do you think? Is there anybody else who is missing this feature in
the first place?

Regards,
Andreas



__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel









__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] [WISH / PATCH] possibility to split string literals across multiple lines

2017-06-14 Thread Andreas Kersting

 Original Message 
From: Hadley Wickham [mailto:h.wick...@gmail.com]
Sent: Wednesday, Jun 14, 2017 2:51 PM GMT
To: Simon Urbanek
Cc: Andreas Kersting; r-devel@r-project.org
Subject: [Rd] [WISH / PATCH] possibility to split string literals across 
multiple lines



On Wed, Jun 14, 2017 at 8:48 AM, Simon Urbanek
 wrote:

As I recall this has been discussed at least a few times (unfortunately I'm 
traveling so can't check the references), but the justification was never 
satisfactory.

Personally, I wouldn't mind string continuation supported since it makes for 
more readable code (I had one of my packages raise a NOTE in examples because 
there is no way in R to split a long hash into multiple lines), but I would be 
strongly against random removal of whitespaces as it's counter-intuitive, 
misleading and makes it impossible to continue spaces on the next line. None of 
the languages that I can think of with multiline strings do that as that's way 
too dangerous.


Julia does, but uses triple quotes:
https://docs.julialang.org/en/stable/manual/strings/#triple-quoted-string-literals

Hadley



If we consider bash a programming language: here documents 
(http://tldp.org/LDP/abs/html/here-docs.html) can have leading tabs 
removed (see Example 19-4).


__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


[Rd] Specifying C Standard in Package's Makevars File

2020-09-28 Thread Andreas Kersting
Hi,

what is the correct way to specify a C standard in a package's Makevars file?

Building a package with e.g. PKG_CFLAGS = -std=gnu11 does work but R CMD check 
issues a warning:

* checking compilation flags in Makevars ... WARNING
Non-portable flags in variable 'PKG_CFLAGS':
  -std=gnu11

(Same for -std=c11.)

Thanks! Regards,
Andreas Kersting
__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] Specifying C Standard in Package's Makevars File

2020-10-02 Thread Andreas Kersting
Thanks, that was very helpful. The C11 features I use do actually work in C99 
mode, so I will stick with that. I just thought it was kind of "cleaner" to 
specify C11 mode when using features from that standard.

2020-09-29 16:35 GMT+02:00 "Prof Brian Ripley" :
> On 28/09/2020 12:44, Andreas Kersting wrote:
>> Hi,
>>
>> what is the correct way to specify a C standard in a package's Makevars
>> file?
>>
>> Building a package with e.g. PKG_CFLAGS = -std=gnu11 does work but R CMD
>> check issues a warning:
> 
> for some unstated value of 'work' ...
> 
>> * checking compilation flags in Makevars ... WARNING
>> Non-portable flags in variable 'PKG_CFLAGS':
>>   -std=gnu11
>>
>> (Same for -std=c11.)
>>
>> Thanks! Regards,
>> Andreas Kersting
> 
> Those flags are not portable, as 'check' correctly says.  Furthermore, on 
> some platforms there may be no flag which can be added -- R documents that 
> 'CC' specifies a C99 compiler, and that or CC+CFLAGS are likely to specify 
> flags which are incompatible with -std=c11 (true on Solaris where -xc99 is 
> used).
> 
> So, like all such overrides (see 'Writing R Extensions') you need to write a 
> configure script (preferably using autoconf) to
> 
> - select an appropriate C compiler+flags
> - substitute them into src/Makefile.in
> 
> For the new features I have used in C11, all known compilers make them 
> available in C99 mode and a configure script could be used to test for their 
> presence (as R itself does).  That is, it is rare to actually need to specify 
> C11 mode.
> 
> -- 
> Brian D. Ripley,  rip...@stats.ox.ac.uk
> Emeritus Professor of Applied Statistics, University of Oxford
> 
> __
> R-devel@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel
> 
__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


[Rd] custom allocators, Valgrind and uninitialized memory

2021-03-26 Thread Andreas Kersting
Hi,

In my package bettermc, I use a custom allocator, which hands out already 
defined/initialized memory (mmap of a POSIX shared memory object).

If my code is run in R which was configured --with-valgrind-instrumentation > 
0, Valgrind will (correctly) complain about uninitialized memory being used, 
e.g.

==813836== Conditional jump or move depends on uninitialised value(s)
==813836==at 0x4F0A9D: getvar (svn/R-devel/src/main/eval.c:5171)
==813836==by 0x4D9B38: bcEval (svn/R-devel/src/main/eval.c:6867)
==813836==by 0x4F0077: Rf_eval (svn/R-devel/src/main/eval.c:727)
==813836==by 0x4F09AF: forcePromise (svn/R-devel/src/main/eval.c:555)
==813836==by 0x4F0C57: FORCE_PROMISE (svn/R-devel/src/main/eval.c:5136)
==813836==by 0x4F0C57: getvar (svn/R-devel/src/main/eval.c:5177)
==813836==by 0x4D9B38: bcEval (svn/R-devel/src/main/eval.c:6867)
==813836==by 0x4F0077: Rf_eval (svn/R-devel/src/main/eval.c:727)
==813836==by 0x4F1A8D: R_execClosure (svn/R-devel/src/main/eval.c:1897)
==813836==by 0x4F2783: Rf_applyClosure (svn/R-devel/src/main/eval.c:1823)
==813836==by 0x4DF61D: bcEval (svn/R-devel/src/main/eval.c:7083)
==813836==by 0x4F0077: Rf_eval (svn/R-devel/src/main/eval.c:727)
==813836==by 0x4F1A8D: R_execClosure (svn/R-devel/src/main/eval.c:1897)
==813836==  Uninitialised value was created by a client request
==813836==at 0x52D5CF: Rf_allocVector3 (svn/R-devel/src/main/memory.c:2892)
==813836==by 0x16B415EA: allocate_from_shm 
(packages/tests-vg/bettermc/src/copy2shm.c:289)
==813836==by 0x49D123: R_doDotCall (svn/R-devel/src/main/dotcode.c:614)
==813836==by 0x4DA36D: bcEval (svn/R-devel/src/main/eval.c:7671)
==813836==by 0x4F0077: Rf_eval (svn/R-devel/src/main/eval.c:727)
==813836==by 0x4F1A8D: R_execClosure (svn/R-devel/src/main/eval.c:1897)
==813836==by 0x4F2783: Rf_applyClosure (svn/R-devel/src/main/eval.c:1823)
==813836==by 0x4F0243: Rf_eval (svn/R-devel/src/main/eval.c:850)
==813836==by 0x49B68F: do_External (svn/R-devel/src/main/dotcode.c:573)
==813836==by 0x4D3566: bcEval (svn/R-devel/src/main/eval.c:7115)
==813836==by 0x4F0077: Rf_eval (svn/R-devel/src/main/eval.c:727)
==813836==by 0x4F1A8D: R_execClosure (svn/R-devel/src/main/eval.c:1897)

(allocate_from_shm() is my function calling allocVector3() with a custom 
allocator.) Valgrind is correct, because allocVector3() explicitly calls 
VALGRIND_MAKE_MEM_UNDEFINED() on the memory my custom allocator returns.

- Should allocVector3() call VALGRIND_MAKE_MEM_UNDEFINED() also if a custom 
allocator is used? For some custom allocators this is correct, for others not.

- Or should the code using a custom allocator call VALGRIND_MAKE_MEM_DEFINED() 
on the DATAPTR() returned by allocVector3()? E.g.

...
ret = PROTECT(allocVector3(asInteger(type), asReal(length), &allocator));
VALGRIND_MAKE_MEM_DEFINED(DATAPTR(ret), size);
...

For the latter to work also on systems without Valgrind installed, I need to 
include both valgrind.h and memcheck.h in the src of my package and include 
these (rather than the system headers), correct? Should I best take these 
headers directly from R (src/include/vg)?

Thanks! Regards,
Andreas
__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] custom allocators, Valgrind and uninitialized memory

2021-03-26 Thread Andreas Kersting
Another idea for the second option. Instead of including the Valgrind headers, 
the following could be enough:

#if __has_include(<valgrind/memcheck.h>)
#include <valgrind/memcheck.h>
#else
#define VALGRIND_MAKE_MEM_DEFINED(_qzz_addr,_qzz_len)  \
  do { \
(_qzz_addr);   \
(_qzz_len);\
  } while (0)
#endif

I guess the packages are built on the same CRAN machine which also runs the 
tests under Valgrind, i.e. valgrind/memcheck.h is available during the build of 
the package!? 

Not sure though if Oracle Developer Studio on Solaris supports __has_include() 
...

2021-03-26 08:40 GMT+01:00 "Andreas Kersting" :
> Hi,
> 
> In my package bettermc, I use a custom allocator, which hands out already 
> defined/initialized memory (mmap of a POSIX shared memory object).
> 
> If my code is run in R which was configured --with-valgrind-instrumentation > 
> 0, Valgrind will (correctly) complain about uninitialized memory being used, 
> e.g.
> 
> ==813836== Conditional jump or move depends on uninitialised value(s)
> ==813836==at 0x4F0A9D: getvar (svn/R-devel/src/main/eval.c:5171)
> ==813836==by 0x4D9B38: bcEval (svn/R-devel/src/main/eval.c:6867)
> ==813836==by 0x4F0077: Rf_eval (svn/R-devel/src/main/eval.c:727)
> ==813836==by 0x4F09AF: forcePromise (svn/R-devel/src/main/eval.c:555)
> ==813836==by 0x4F0C57: FORCE_PROMISE (svn/R-devel/src/main/eval.c:5136)
> ==813836==by 0x4F0C57: getvar (svn/R-devel/src/main/eval.c:5177)
> ==813836==by 0x4D9B38: bcEval (svn/R-devel/src/main/eval.c:6867)
> ==813836==by 0x4F0077: Rf_eval (svn/R-devel/src/main/eval.c:727)
> ==813836==by 0x4F1A8D: R_execClosure (svn/R-devel/src/main/eval.c:1897)
> ==813836==by 0x4F2783: Rf_applyClosure (svn/R-devel/src/main/eval.c:1823)
> ==813836==by 0x4DF61D: bcEval (svn/R-devel/src/main/eval.c:7083)
> ==813836==by 0x4F0077: Rf_eval (svn/R-devel/src/main/eval.c:727)
> ==813836==by 0x4F1A8D: R_execClosure (svn/R-devel/src/main/eval.c:1897)
> ==813836==  Uninitialised value was created by a client request
> ==813836==at 0x52D5CF: Rf_allocVector3 
> (svn/R-devel/src/main/memory.c:2892)
> ==813836==by 0x16B415EA: allocate_from_shm 
> (packages/tests-vg/bettermc/src/copy2shm.c:289)
> ==813836==by 0x49D123: R_doDotCall (svn/R-devel/src/main/dotcode.c:614)
> ==813836==by 0x4DA36D: bcEval (svn/R-devel/src/main/eval.c:7671)
> ==813836==by 0x4F0077: Rf_eval (svn/R-devel/src/main/eval.c:727)
> ==813836==by 0x4F1A8D: R_execClosure (svn/R-devel/src/main/eval.c:1897)
> ==813836==by 0x4F2783: Rf_applyClosure (svn/R-devel/src/main/eval.c:1823)
> ==813836==by 0x4F0243: Rf_eval (svn/R-devel/src/main/eval.c:850)
> ==813836==by 0x49B68F: do_External (svn/R-devel/src/main/dotcode.c:573)
> ==813836==by 0x4D3566: bcEval (svn/R-devel/src/main/eval.c:7115)
> ==813836==by 0x4F0077: Rf_eval (svn/R-devel/src/main/eval.c:727)
> ==813836==by 0x4F1A8D: R_execClosure (svn/R-devel/src/main/eval.c:1897)
> 
> (allocate_from_shm() is my function calling allocVector3() with a custom 
> allocator.) Valgrind is correct, because allocVector3() explicitly calls 
> VALGRIND_MAKE_MEM_UNDEFINED() on the memory my custom allocator returns.
> 
> - Should allocVector3() call VALGRIND_MAKE_MEM_UNDEFINED() also if a custom 
> allocator is used? For some custom allocators this is correct, for others not.
> 
> - Or should the code using a custom allocator call 
> VALGRIND_MAKE_MEM_DEFINED() on the DATAPTR() returned by allocVector3()? E.g.
> 
> ...
> ret = PROTECT(allocVector3(asInteger(type), asReal(length), &allocator));
> VALGRIND_MAKE_MEM_DEFINED(DATAPTR(ret), size);
> ...
> 
> For the latter to work also on systems without Valgrind installed, I need to 
> include both valgrind.h and memcheck.h in the src of my package and include 
> these (rather than the system headers), correct? Should I best take these 
> headers directly from R (src/include/vg)?
> 
> Thanks! Regards,
> Andreas
> __
> R-devel@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel
> 
__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] custom allocators, Valgrind and uninitialized memory

2021-03-26 Thread Andreas Kersting
Hi Dirk,

Sure, let me try to explain:

CRAN ran the tests of my package using R which was configured 
--with-valgrind-instrumentation > 0. Valgrind reported many errors related to 
the use of supposedly uninitialized memory and the CRAN team asked me to tackle 
these.

These errors are false positives, because I pass a custom allocator to 
allocVector3() which hands out memory which is already initialized. However, 
this memory is explicitly marked for Valgrind as uninitialized by 
allocVector3(), and I do not initialize it subsequently, so Valgrind complains.

Now I am asking if it is correct that allocVector3() marks memory as 
uninitialized/undefined, even if it comes from a custom allocator. This is 
because allocVector3() cannot know if the memory might already be initialized.

If this is the intended behavior of allocVector3(), then I am looking for the 
proper way to get rid of these false Valgrind errors, so as to be able to more 
easily spot the true ones ...

Which section of _Writing R Extensions_ do you have in mind? I cannot find 
anything on custom allocators there, but maybe I am using the wrong search 
terms. No, these objects are returned to R and I am not aware that this is a 
problem / not allowed.

Regards, Andreas

2021-03-26 19:51 GMT+01:00 "Dirk Eddelbuettel" :
> 
> Andreas,
> 
> Can you briefly describe what it is you are trying to do?
> 
> In general, no R package would use valgrind directly; it is an optional
> debugger. Also note _Writing R Extensions_ has a few things to say about how
> memory destined for R object can and cannot be allocated -- I presume your
> custom allocator is only for objects you managed and do not return to R?
> 
> Best, Dirk
> 
> -- 
> https://dirk.eddelbuettel.com | @eddelbuettel | e...@debian.org
> 
__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] custom allocators, Valgrind and uninitialized memory

2021-03-29 Thread Andreas Kersting
Hi Tomas,

Thanks for sharing your view on this! I understand your point, but still I 
think that the current situation is somewhat unfortunate:

I would argue that mmap() is a natural candidate to be used together with 
allocVector3(); it is even mentioned explicitly here: 
https://github.com/wch/r-source/blob/trunk/src/main/memory.c#L2575-L2576

However, when using a non-anonymous mapping, i.e. we want mmap() to initialize 
the memory e.g. from a file or a POSIX shared memory object, this means that we 
need to use MAP_FIXED in case we are obliged to initialize the memory AFTER 
allocVector3() returned it; at least I cannot think of a different way to 
achieve this.

The use of MAP_FIXED
- is discouraged (e.g. 
https://developer.apple.com/library/archive/documentation/System/Conceptual/ManPages_iPhoneOS/man2/mmap.2.html)
- requires two calls to mmap(): (1) to obtain the (anonymous) memory to be 
handed out by the custom allocator and (2) to actually map the file "over" the 
just allocated vector (using MAP_FIXED), which will overwrite the vector 
header; hence, we need to first back it up to later restore it

I have implemented my function using MAP_FIXED here: 
https://github.com/gfkse/bettermc/commit/f34c4f4c45c9ab11abe9b9e9b8b48064f128d731#diff-7098a5dde34efab163bbef27fe32f95c29e76236649479985d09c70100e4c737R278-R323

This solution, to me, is much more complicated and hacky than my previous one, 
which assumed it is OK to hand out already initialized memory directly from 
allocVector3().

Regards,
Andreas


2021-03-29 10:41 GMT+02:00 "Tomas Kalibera" :
> Hi Andreas,
> On 3/26/21 8:48 PM, Andreas Kersting wrote:
>> Hi Dirk,
>>
>> Sure, let me try to explain:
>>
>> CRAN ran the tests of my package using R which was configured
>> --with-valgrind-instrumentation > 0. Valgrind reported many errors
>> related to the use of supposedly uninitialized memory and the CRAN
>> team asked me to tackle these.
>>
>> These errors are false positives, because I pass a custom allocator
>> to allocVector3() which hands out memory which is already
>> initialized. However, this memory is explicitly marked for Valgrind
>> as uninitialized by allocVector3(), and I do not initialize it
>> subsequently, so Valgrind complains.
>>
>> Now I am asking if it is correct that allocVector3() marks memory as
>> uninitialized/undefined, even if it comes from a custom allocator.
>> This is because allocVector3() cannot know if the memory might
>> already be initialized.
> I think the semantics of allocVector/allocVector3 should be the same 
> regardless of whether custom allocators are used. The semantics of 
> allocVector is to provide uninitialized memory (non-pointer types, Writing R 
> Extensions 5.9.2). Therefore, it is the caller who needs to take care of 
> initialization. This is also the semantics of "malloc" and Rallocators.h says 
> "custom_alloc_t mem_alloc; /* malloc equivalent */".
> 
> So I think that your code using your custom allocator needs to initialize 
> allocated memory to be correct. If your allocator initializes the memory, 
> that is fine, but unnecessary.
> 
> So technically speaking, the valgrind reports are not false alarms. I think 
> your call sites should initialize.
> 
> Best
> Tomas
> 
> 
> 
__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] custom allocators, Valgrind and uninitialized memory

2021-03-29 Thread Andreas Kersting
Hi Simon,

Yes, if this was acceptable on CRAN, I would agree that calling 
VALGRIND_MAKE_MEM_DEFINED() in my code would be sufficient. 

But since Tomas said, "So I think that your code using your custom allocator 
needs to initialize allocated memory to be correct. If your allocator 
initializes the memory, that is fine, but unnecessary.", I am not sure if it is 
acceptable.

Regards,
Andreas

2021-03-30 00:39 GMT+02:00 "Simon Urbanek" :
> Andreas,
> 
> correct me if I'm wrong, but the issue here is not initialisation but rather 
> valgrind flagging. You simply have to call VALGRIND_MAKE_MEM_DEFINED() in 
> your code after allocVector3() to declare that you have initialised the 
> memory - or am I missing something?
> 
> Cheers,
> Simon
> 
> 
> 
>> On 30/03/2021, at 9:18 AM, Andreas Kersting  wrote:
>> 
>> Hi Tomas,
>> 
>> Thanks for sharing your view on this! I understand your point, but still I 
>> think that the current situation is somewhat unfortunate:
>> 
>> I would argue that mmap() is a natural candidate to be used together with 
>> allocVector3(); it is even mentioned explicitly here: 
>> https://github.com/wch/r-source/blob/trunk/src/main/memory.c#L2575-L2576
>> 
>> However, when using a non-anonymous mapping, i.e. we want mmap() to 
>> initialize the memory e.g. from a file or a POSIX shared memory object, this 
>> means that we need to use MAP_FIXED in case we are obliged to initialize the 
>> memory AFTER allocVector3() returned it; at least I cannot think of a 
>> different way to achieve this.
>> 
>> The use of MAP_FIXED
>> - is discouraged (e.g. 
>> https://developer.apple.com/library/archive/documentation/System/Conceptual/ManPages_iPhoneOS/man2/mmap.2.html)
>> - requires two calls to mmap(): (1) to obtain the (anonymous) memory to be 
>> handed out by the custom allocator and (2) to actually map the file "over" 
>> the just allocated vector (using MAP_FIXED), which will overwrite the vector 
>> header; hence, we need to first back it up to later restore it
>> 
>> I have implemented my function using MAP_FIXED here: 
>> https://github.com/gfkse/bettermc/commit/f34c4f4c45c9ab11abe9b9e9b8b48064f128d731#diff-7098a5dde34efab163bbef27fe32f95c29e76236649479985d09c70100e4c737R278-R323
>> 
>> This solution, to me, is much more complicated and hacky than my previous 
>> one, which assumed it is OK to hand out already initialized memory directly 
>> from allocVector3().
>> 
>> Regards,
>> Andreas
>> 
>> 
>> 2021-03-29 10:41 GMT+02:00 "Tomas Kalibera" :
>>> Hi Andreas,
>>> On 3/26/21 8:48 PM, Andreas Kersting wrote:
>>>> Hi Dirk,
>>>>
>>>> Sure, let me try to explain:
>>>>
>>>> CRAN ran the tests of my package using R which was configured
>>>> --with-valgrind-instrumentation > 0. Valgrind reported many errors
>>>> related to the use of supposedly uninitialized memory and the CRAN
>>>> team asked me to tackle these.
>>>>
>>>> These errors are false positives, because I pass a custom allocator
>>>> to allocVector3() which hands out memory which is already
>>>> initialized. However, this memory is explicitly marked for Valgrind
>>>> as uninitialized by allocVector3(), and I do not initialize it
>>>> subsequently, so Valgrind complains.
>>>>
>>>> Now I am asking if it is correct that allocVector3() marks memory as
>>>> uninitialized/undefined, even if it comes from a custom allocator.
>>>> This is because allocVector3() cannot know if the memory might
>>>> already be initialized.
>>> I think the semantics of allocVector/allocVector3 should be the same 
>>> regardless of whether custom allocators are used. The semantics of 
>>> allocVector is to provide uninitialized memory (non-pointer types, Writing 
>>> R Extensions 5.9.2). Therefore, it is the caller who needs to take care of 
>>> initialization. This is also the semantics of "malloc" and Rallocators.h 
>>> says "custom_alloc_t mem_alloc; /* malloc equivalent */".
>>> 
>>> So I think that your code using your custom allocator needs to initialize 
>>> allocated memory to be correct. If your allocator initializes the memory, 
>>> that is fine, but unnecessary.
>>> 
>>> So technically speaking, the valgrind reports are not false alarms. I think 
>>> your call sites should initialize.
>>> 
>>> Best
>>> Tomas
>>> 
>>> 
>>> 
>> __
>> R-devel@r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-devel
>> 
> 
> 
__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] custom allocators, Valgrind and uninitialized memory

2021-03-30 Thread Andreas Kersting
Hi Simon, hi Tomas,

Let me try to wrap up this discussion:

- "What does any of this to do with CRAN?"
Not much, I agree. It is just that this whole issue arose because the CRAN team 
asked me to fix the use of uninitialized memory as reported by Valgrind. Sorry 
for mixing things up here.

- "I don't quite get your earlier response about allocating *after* the call 
since that makes no sense to me"
I was talking about *initializing* after the call as originally suggested by 
Tomas and - as I wrote - I also do not like my proposal involving MAP_FIXED.

- bottom line
allocVector3() is correctly marking memory as uninitialized because it cannot 
safely assume otherwise. It is ok for a custom allocator to return already 
initialized memory and inform Valgrind about this fact.

I hope, this summarizes it well.

Thanks for your time and support, Tomas and Simon. Very much appreciated!

Regards,
Andreas


2021-03-30 10:03 GMT+02:00 "Simon Urbanek" :
> Andreas,
> 
> What does any of this have to do with CRAN? This is not the CRAN list - we're 
> discussing the proper approach to using valgrind, and R can only assume that 
> the memory is uninitialised (since it cannot safely assume anything else) so 
> it is up to you to declare the memory as initialised if you can guarantee 
> that being true. I don't quite get your earlier response about allocating 
> *after* the call since that makes no sense to me - the whole point of a 
> custom allocator is to allow you to allocate the memory, so whether it is 
> initialised or not is under your control - but that means also it is your 
> responsibility to flag the state accordingly. Note, however, that this is not 
> merely just true by the virtue of using mmap - the memory content is only 
> valid (initialised) if you used mmap with previously initialised content. 
> Again, entirely up to you to decide what the semantics are since you are the 
> author of the custom allocator. Does that make sense?
> 
> Cheers,
> Simon
> 
> 
> 
>> On Mar 30, 2021, at 18:27, Andreas Kersting  wrote:
>> 
>> Hi Simon,
>> 
>> Yes, if this was acceptable on CRAN, I would agree that calling 
>> VALGRIND_MAKE_MEM_DEFINED() in my code would be sufficient. 
>> 
>> But since Tomas said, "So I think that your code using your custom allocator 
>> needs to initialize allocated memory to be correct. If your allocator 
>> initializes the memory, that is fine, but unnecessary.", I am not sure if it 
>> is acceptable.
>> 
>> Regards,
>> Andreas
>> 
>> 2021-03-30 00:39 GMT+02:00 "Simon Urbanek" :
>>> Andreas,
>>> 
>>> correct me if I'm wrong, but the issue here is not initialisation but 
>>> rather valgrind flagging. You simply have to call 
>>> VALGRIND_MAKE_MEM_DEFINED() in your code after allocVector3() to declare 
>>> that you have initialised the memory - or am I missing something?
>>> 
>>> Cheers,
>>> Simon
>>> 
>>> 
>>> 
>>>> On 30/03/2021, at 9:18 AM, Andreas Kersting  wrote:
>>>> 
>>>> Hi Tomas,
>>>> 
>>>> Thanks for sharing your view on this! I understand your point, but still I 
>>>> think that the current situation is somewhat unfortunate:
>>>> 
>>>> I would argue that mmap() is a natural candidate to be used together with 
>>>> allocVector3(); it is even mentioned explicitly here: 
>>>> https://github.com/wch/r-source/blob/trunk/src/main/memory.c#L2575-L2576
>>>> 
>>>> However, when using a non-anonymous mapping, i.e. we want mmap() to 
>>>> initialize the memory e.g. from a file or a POSIX shared memory object, 
>>>> this means that we need to use MAP_FIXED in case we are obliged to 
>>>> initialize the memory AFTER allocVector3() returned it; at least I cannot 
>>>> think of a different way to achieve this.
>>>> 
>>>> The use of MAP_FIXED
>>>> - is discouraged (e.g. 
>>>> https://developer.apple.com/library/archive/documentation/System/Conceptual/ManPages_iPhoneOS/man2/mmap.2.html)
>>>> - requires two calls to mmap(): (1) to obtain the (anonymous) memory to be 
>>>> handed out by the custom allocator and (2) to actually map the file "over" 
>>>> the just allocated vector (using MAP_FIXED), which will overwrite the 
>>>> vector header; hence, we need to first back it up to later restore it
>>>> 
>>>> I have implemented my function using MAP_FIXED here: 
>>>> https://github.com/gfkse/bettermc/commit/f34c4f4c45c9ab11abe9b9e9b8b48064f128d731#d

[Rd] memory consumption of nested (un)serialize of sys.frames()

2021-04-07 Thread Andreas Kersting
Hi,

please consider the following minimal reproducible example:

Create a new R package which just contains the following two (exported) objects:


crash_dumps <- new.env()

f <- function() {
  x <- runif(1e5)
  dump <- lapply(1:2, function(i) unserialize(serialize(sys.frames(), NULL)))
  assign("last.dump", dump, crash_dumps)
}


WARNING: the following will probably eat all your RAM!

Attach this package and run:

for (i in 1:100) {
  print(i)
  f()
}

You will notice that with each iteration the execution of f() slows down 
significantly while the memory consumption of the R process (v4.0.5 on Linux) 
quickly explodes.

I am having a hard time understanding what exactly is happening here. Something 
w.r.t. too deeply nested environments? Could someone please enlighten me? 
Thanks!

Regards,
Andreas


Background:
In an R package I store crash dumps on error in parallel processes in a way 
similar to what I have just shown (hence the (un)serialize(), which happens as 
part of returning the objects to the parent process). The first 2 or 3 times I 
do so in a session everything is fine, but afterwards it takes very long and I 
soon run out of memory.

Some more observations:
- If I omit `x <- runif(1e5)`, the issues seem to be less pronounced.
- If I assign to .GlobalEnv instead of crash_dumps, there seems to be no issue 
- probably because .GlobalEnv is not included in sys.frames(), while 
crash_dumps is reachable indirectly, via the namespace of the package being the 
parent.env of some of the sys.frames()!? (See the sketch below the workaround 
for a way to inspect this.)
- If I omit the lapply(...), i.e. use `dump <- 
unserialize(serialize(sys.frames(), NULL))` directly, there seems to be no 
issue. The immediate consequence is that there are fewer sys.frames() and - in 
particular - there is no frame which has the base namespace as its parent.env.
- If I make crash_dumps a list and use assignInMyNamespace() to store the dump 
in it, there also seems to be no issue. I will probably use this as a 
workaround:

crash_dumps <- list()

f <- function() {
  x <- runif(1e5)
  dump <- lapply(1:2, function(i) unserialize(serialize(sys.frames(), NULL)))
  crash_dumps[["last.dump"]] <- dump
  assignInMyNamespace("crash_dumps", crash_dumps)
}
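
To inspect the parent.env hypothesis from the observations above, one can walk 
the chain of enclosing environments of each frame (a minimal sketch, meant to 
be run inside f()):

chain <- function(e) {
  nms <- character(0)
  while (!identical(e, emptyenv())) {
    nms <- c(nms, environmentName(e))
    e <- parent.env(e)
  }
  nms
}
str(lapply(sys.frames(), chain))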

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] [External] memory consumption of nested (un)serialize of sys.frames()

2021-04-07 Thread Andreas Kersting
Hi Luke,

Please see https://github.com/akersting/dumpTest for the package.

Here a session showing my issue:

> library(dumpTest)
> sessionInfo()
R version 4.0.5 (2021-03-31)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Debian GNU/Linux 10 (buster)

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.8.0
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.8.0

locale:
 [1] LC_CTYPE=en_US.UTF-8   LC_NUMERIC=C  
 [3] LC_TIME=en_US.UTF-8LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8   LC_NAME=C 
 [9] LC_ADDRESS=C   LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C   

attached base packages:
[1] stats graphics  grDevices utils datasets  methods   base 

other attached packages:
[1] dumpTest_0.1.0

loaded via a namespace (and not attached):
[1] compiler_4.0.5
> for (i in 1:100) {
+   print(i)
+   print(system.time(f()))
+ }
[1] 1
   user  system elapsed 
  0.028   0.004   0.034 
[1] 2
   user  system elapsed 
  0.067   0.008   0.075 
[1] 3
   user  system elapsed 
  0.176   0.000   0.176 
[1] 4
   user  system elapsed 
  0.335   0.012   0.349 
[1] 5
   user  system elapsed 
  0.745   0.023   0.770 
[1] 6
   user  system elapsed 
  1.495   0.060   1.572 
[1] 7
   user  system elapsed 
  2.902   0.136   3.040 
[1] 8
   user  system elapsed 
  5.753   0.272   6.034 
[1] 9
   user  system elapsed 
 11.807   0.708  12.597 
[1] 10
^C
Timing stopped at: 6.638 0.549 7.214

I had to interrupt in iteration 10 because I was running low on RAM.

Regards,
Andreas

2021-04-07 15:28 GMT+02:00 luke-tier...@uiowa.edu:
> On Wed, 7 Apr 2021, Andreas Kersting wrote:
> 
>> Hi,
>>
>> please consider the following minimal reproducible example:
>>
>> Create a new R package which just contains the following two (exported) 
>> objects:
> 
> I would not expect this behavior and I don't see it when I make such a
> package (in R 4.0.3 or R-devel on Ubuntu).  You will need to provide a
> more complete reproducible example if you want help with what you are
> trying to do; also sessionInfo() would help.
> 
> Best,
> 
> luke
> 
>>
>>
>> crash_dumps <- new.env()
>>
>> f <- function() {
>>  x <- runif(1e5)
>>  dump <- lapply(1:2, function(i) unserialize(serialize(sys.frames(), NULL)))
>>  assign("last.dump", dump, crash_dumps)
>> }
>>
>>
>> WARNING: the following will probably eat all your RAM!
>>
>> Attach this package and run:
>>
>> for (i in 1:100) {
>>  print(i)
>>  f()
>> }
>>
>> You will notice that with each iteration the execution of f() slows down 
>> significantly while the memory consumption of the R process (v4.0.5 on 
>> Linux) quickly explodes.
>>
>> I am having a hard time to understand what exactly is happening here. 
>> Something w.r.t. too deeply nested environments? Could someone please 
>> enlighten me? Thanks!
>>
>> Regards,
>> Andreas
>>
>>
>> Background:
>> In an R package I store crash dumps on error in a parallel processes in a 
>> way similar to what I have just shown (hence the (un)serialize(), which 
>> happens as part of returning the objects to the parent process). The first 2 
>> or 3 times I do so in a session everything is fine, but afterwards it takes 
>> very long and I soon run out of memory.
>>
>> Some more observations:
>> - If I omit `x <- runif(1e5)`, the issues seem to be less pronounced.
>> - If I assign to .GlobalEnv instead of crash_dumps, there seems to be no 
>> issue - probably because .GlobalEnv is not included in sys.frames(), while 
>> crash_dumps is indirectly via the namespace of the package being the 
>> parent.env of some of the sys.frames()!?
>> - If I omit the lapply(...), i.e. use `dump <- 
>> unserialize(serialize(sys.frames(), NULL))` directly, there seems to be no 
>> issue. The immediate consequence is that there are less sys.frames and - in 
>> particular - there is no frame which has the base namespace as its 
>> parent.env.
>> - If I make crash_dumps a list and use assignInMyNamespace() to store the 
>> dump in it, there also seems to be no issue. I will probably use this as a 
>> workaround:
>>
>> crash_dumps <- list()
>>
>> f <- function() {
>>  x <- runif(1e5)
>>  dump <- lapply(1:2, function(i) unserialize(serialize(sys.frames(), NULL)))
>>  crash_dumps[["last.dump"]] <- dump
>>  assignInMyNamespace("crash_dumps", crash_dumps)
>> }

Re: [Rd] [External] memory consumption of nested (un)serialize of sys.frames()

2021-04-07 Thread Andreas Kersting
Hi Dirk, hi Luke,

Thanks for checking!

I could narrow it down further. I have the issue only if I install 
--with-keep.source, i.e.

R CMD INSTALL --with-keep.source dumpTest

Since this is the default in RStudio when clicking "Install and Restart", I was 
always having the issue - also from base R. If I install using e.g. 
devtools::install_github() directly it is also fine for me.

Could you please confirm? Thanks!

Regards,
Andreas

2021-04-07 16:20 GMT+02:00 "Dirk Eddelbuettel" :
> 
> On 7 April 2021 at 16:06, Andreas Kersting wrote:
> | Hi Luke,
> | 
> | Please see https://github.com/akersting/dumpTest for the package.
> | 
> | Here a session showing my issue:
> | 
> | > library(dumpTest)
> | > sessionInfo()
> | R version 4.0.5 (2021-03-31)
> | Platform: x86_64-pc-linux-gnu (64-bit)
> | Running under: Debian GNU/Linux 10 (buster)
> | 
> | Matrix products: default
> | BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.8.0
> | LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.8.0
> | 
> | locale:
> |  [1] LC_CTYPE=en_US.UTF-8   LC_NUMERIC=C  
> |  [3] LC_TIME=en_US.UTF-8LC_COLLATE=en_US.UTF-8
> |  [5] LC_MONETARY=en_US.UTF-8LC_MESSAGES=en_US.UTF-8   
> |  [7] LC_PAPER=en_US.UTF-8   LC_NAME=C 
> |  [9] LC_ADDRESS=C   LC_TELEPHONE=C
> | [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C   
> | 
> | attached base packages:
> | [1] stats graphics  grDevices utils datasets  methods   base 
> | 
> | other attached packages:
> | [1] dumpTest_0.1.0
> | 
> | loaded via a namespace (and not attached):
> | [1] compiler_4.0.5
> | > for (i in 1:100) {
> | +   print(i)
> | +   print(system.time(f()))
> | + }
> | [1] 1
> |user  system elapsed 
> |   0.028   0.004   0.034 
> | [1] 2
> |user  system elapsed 
> |   0.067   0.008   0.075 
> | [1] 3
> |user  system elapsed 
> |   0.176   0.000   0.176 
> | [1] 4
> |user  system elapsed 
> |   0.335   0.012   0.349 
> | [1] 5
> |user  system elapsed 
> |   0.745   0.023   0.770 
> | [1] 6
> |user  system elapsed 
> |   1.495   0.060   1.572 
> | [1] 7
> |user  system elapsed 
> |   2.902   0.136   3.040 
> | [1] 8
> |user  system elapsed 
> |   5.753   0.272   6.034 
> | [1] 9
> |user  system elapsed 
> |  11.807   0.708  12.597 
> | [1] 10
> | ^C
> | Timing stopped at: 6.638 0.549 7.214
> | 
> | I had to interrupt in iteration 10 because I was running low on RAM.
> 
> No issue here.  Ubuntu 20.10, R 4.0.5 'from CRAN' i.e. Michael's PPA build
> off my Debian package, hence instrumentation as in the Debian package.
> 
> edd@rob:~$ installGithub.r akersting/dumpTest
> Using github PAT from envvar GITHUB_PAT
> Downloading GitHub repo akersting/dumpTest@HEAD
> ✔  checking for file 
> ‘/tmp/remotes3f9af733166ccd/akersting-dumpTest-3bed8e2/DESCRIPTION’ ...
> ─  preparing ‘dumpTest’:
> ✔  checking DESCRIPTION meta-information ...
> ─  checking for LF line-endings in source and make files and shell scripts
> ─  checking for empty or unneeded directories
> ─  building ‘dumpTest_0.1.0.tar.gz’
>
> Installing package into ‘/usr/local/lib/R/site-library’
> (as ‘lib’ is unspecified)
> * installing *source* package ‘dumpTest’ ...
> ** using staged installation
> ** R
> ** byte-compile and prepare package for lazy loading
> ** help
> No man pages found in package  ‘dumpTest’ 
> *** installing help indices
> ** building package indices
> ** testing if installed package can be loaded from temporary location
> ** testing if installed package can be loaded from final location
> ** testing if installed package keeps a record of temporary installation path
> * DONE (dumpTest)
> edd@rob:~$ Rscript -e 'system.time({for (i in 1:100) dumpTest::f()})'
>user  system elapsed 
>   0.481   0.019   0.500 
> edd@rob:~$
> 
> (I also ran the variant you showed with the dual print statements, it just
> consumes more screen real estate and ends on
> 
> [...]
> [1] 97  
>user  system elapsed 
>   0.004   0.000   0.005 
> [1] 98
>
>user  system elapsed 
>   0.004   0.000   0.005   
> [1] 99 
>user  system elapsed
>   0.004   0.000   0.004   
>
> [1] 100   
>
>user  system elapsed   
>
>   0.005   0.000   0.005 
> edd@rob:~$ )
> 
> Dirk
> 
> -- 
> https://dirk.eddelbuettel.com | @eddelbuettel | e...@debian.org
> 
__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] [External] memory consumption of nested (un)serialize of sys.frames()

2021-04-08 Thread Andreas Kersting
Hi,

For (hopefully) full reproducibility:

docker run rocker/tidyverse:4.0.5 Rscript -e 
'devtools::install_github("akersting/dumpTest", INSTALL_opts = 
"--with-keep.source"); library(dumpTest); for (i in 1:100) {print(i); 
print(system.time(f()))}'

Regards,
Andreas

2021-04-07 17:09 GMT+02:00 "Andreas Kersting" :
> Hi Dirk, hi Luke,
> 
> Thanks for checking!
> 
> I could narrow it down further. I have the issue only if I install 
> --with-keep.source, i.e.
> 
> R CMD INSTALL --with-keep.source dumpTest
> 
> Since this is the default in RStudio when clicking "Install and Restart", I 
> was always having the issue - also from base R. If I install using e.g. 
> devtools::install_github() directly it is also fine for me.
> 
> Could you please confirm? Thanks!
> 
> Regards,
> Andreas
> 
> 2021-04-07 16:20 GMT+02:00 "Dirk Eddelbuettel" :
>> 
>> On 7 April 2021 at 16:06, Andreas Kersting wrote:
>> | Hi Luke,
>> | 
>> | Please see https://github.com/akersting/dumpTest for the package.
>> | 
>> | Here a session showing my issue:
>> | 
>> | > library(dumpTest)
>> | > sessionInfo()
>> | R version 4.0.5 (2021-03-31)
>> | Platform: x86_64-pc-linux-gnu (64-bit)
>> | Running under: Debian GNU/Linux 10 (buster)
>> | 
>> | Matrix products: default
>> | BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.8.0
>> | LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.8.0
>> | 
>> | locale:
>> |  [1] LC_CTYPE=en_US.UTF-8   LC_NUMERIC=C  
>> |  [3] LC_TIME=en_US.UTF-8LC_COLLATE=en_US.UTF-8
>> |  [5] LC_MONETARY=en_US.UTF-8LC_MESSAGES=en_US.UTF-8   
>> |  [7] LC_PAPER=en_US.UTF-8   LC_NAME=C 
>> |  [9] LC_ADDRESS=C   LC_TELEPHONE=C
>> | [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C   
>> | 
>> | attached base packages:
>> | [1] stats graphics  grDevices utils datasets  methods   base 
>> | 
>> | other attached packages:
>> | [1] dumpTest_0.1.0
>> | 
>> | loaded via a namespace (and not attached):
>> | [1] compiler_4.0.5
>> | > for (i in 1:100) {
>> | +   print(i)
>> | +   print(system.time(f()))
>> | + }
>> | [1] 1
>> |    user  system elapsed 
>> |   0.028   0.004   0.034 
>> | [1] 2
>> |    user  system elapsed 
>> |   0.067   0.008   0.075 
>> | [1] 3
>> |    user  system elapsed 
>> |   0.176   0.000   0.176 
>> | [1] 4
>> |    user  system elapsed 
>> |   0.335   0.012   0.349 
>> | [1] 5
>> |    user  system elapsed 
>> |   0.745   0.023   0.770 
>> | [1] 6
>> |    user  system elapsed 
>> |   1.495   0.060   1.572 
>> | [1] 7
>> |    user  system elapsed 
>> |   2.902   0.136   3.040 
>> | [1] 8
>> |    user  system elapsed 
>> |   5.753   0.272   6.034 
>> | [1] 9
>> |    user  system elapsed 
>> |  11.807   0.708  12.597 
>> | [1] 10
>> | ^C
>> | Timing stopped at: 6.638 0.549 7.214
>> | 
>> | I had to interrupt in iteration 10 because I was running low on RAM.
>> 
>> No issue here.  Ubuntu 20.10, R 4.0.5 'from CRAN' i.e. Michael's PPA build
>> off my Debian package, hence instrumentation as in the Debian package.
>> 
>> edd@rob:~$ installGithub.r akersting/dumpTest
>> Using github PAT from envvar GITHUB_PAT
>> Downloading GitHub repo akersting/dumpTest@HEAD
>> ✔  checking for file 
>> ‘/tmp/remotes3f9af733166ccd/akersting-dumpTest-3bed8e2/DESCRIPTION’ ...
>> ─  preparing ‘dumpTest’:
>> ✔  checking DESCRIPTION meta-information ...
>> ─  checking for LF line-endings in source and make files and shell scripts
>> ─  checking for empty or unneeded directories
>> ─  building ‘dumpTest_0.1.0.tar.gz’
>>
>> Installing package into ‘/usr/local/lib/R/site-library’
>> (as ‘lib’ is unspecified)
>> * installing *source* package ‘dumpTest’ ...
>> ** using staged installation
>> ** R
>> ** byte-compile and prepare package for lazy loading
>> ** help
>> No man pages found in package  ‘dumpTest’ 
>> *** installing help indices
>> ** building package indices
>> ** testing if installed package can be loaded from temporary location
>> ** testing if installed package can be loaded from final location
>> ** testing if installed package keeps a record of temporary installation path
>> * DONE (dumpTest)
>> edd@rob:~$ Rscript -e 

[Rd] GC: parallelizing the CHARSXP cache maintenance

2021-10-07 Thread Andreas Kersting
Hi all,

As part of RunGenCollect() (in src/main/memory.c), some maintenance is done on 
the CHARSXP cache: unmarked nodes/CHARSXPs are removed from the hash chains. 
This always touches every CHARSXP in the cache, irrespective of the number of 
generations which were just garbage collected. In a session with a big CHARSXP 
cache, this significantly slows down gc even when only the youngest generation 
is collected.
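
For illustration, the cleanup has roughly this shape (a simplified sketch using 
the usual internal macros, not the exact memory.c code; the function name is 
mine):

static void clean_charsxp_cache(void)
{
    /* walk every bucket of R_StringHash and unlink unmarked CHARSXPs;
       the buckets are independent of each other */
    for (int i = 0; i < LENGTH(R_StringHash); i++) {
        SEXP t = R_NilValue;                  /* previous kept entry */
        SEXP s = VECTOR_ELT(R_StringHash, i);
        while (s != R_NilValue) {
            if (! NODE_IS_MARKED(CXHEAD(s))) {    /* dead: unlink */
                if (t == R_NilValue)              /* head of chain */
                    SET_VECTOR_ELT(R_StringHash, i, CXTAIL(s));
                else
                    SET_CXTAIL(t, CXTAIL(s));
            } else
                t = s;
            s = CXTAIL(s);
        }
    }
}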

However, this part of RunGenCollect() seems to be one of the few which can 
easily be parallelized without any need for thread synchronization, and it 
seems to be the part that profits most from doing so.

Attached patch (parallel_CHARSXP_cache.diff) implements parallelization over 
the elements of R_StringHash and gives the following performance improvements 
on my system when using 4 threads compared to R devel (revision 81008):

Elapsed time for 200 non-full gc in a session after

x <- as.character(runif(1e6))[]
gc(full = TRUE)

8sec -> 2.5sec.

AND

Elapsed time for five non-full gc in a session after

x <- as.character(runif(5e7))[]
gc(full = TRUE)

21sec -> 6sec.

In the patch, I dropped the two lines 

FORWARD_NODE(s);
FORWARD_NODE(CXHEAD(s));

because they are currently both no-ops (and would require synchronization if 
they were not). They are no-ops because we have

# define CXHEAD(x) (x)  // in Defn.h

and hence FORWARD_NODE(s)/FORWARD_NODE(CXHEAD(s)) is only called when s is 
already marked, in which case FORWARD_NODE() does nothing.
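
For reference, the forwarding macro has essentially this shape (simplified 
sketch; the guard is what makes the dropped calls no-ops for marked nodes):

#define FORWARD_NODE(s) do { \
    SEXP fn__n__ = (s); \
    if (fn__n__ && ! NODE_IS_MARKED(fn__n__)) { \
        MARK_NODE(fn__n__); \
        /* ... unsnap the node and push it onto the forwarding list ... */ \
    } \
} while (0)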

I used OpenMP despite the known issues some of its implementations have with 
hanging after a fork, mostly because it was the easiest thing to do for a PoC. 
I worked around this similarly to e.g. data.table, by using only one thread in 
forked children.

It might be worth considering making the parallelization conditional on the 
size of the CHARSXP cache and using only the main thread while the cache is 
(still) small.

In the second attached patch (parallel_CHARSXP_cache_no_forward.diff) I 
additionally no longer call FORWARD_NODE(R_StringHash), because that would make 
the following call to PROCESS_NODES() iterate through all elements of 
R_StringHash again, which is unnecessary: all elements are either R_NilValue or 
an already marked CHARSXP. Instead, I directly mark & snap R_StringHash. In 
contrast to the parallelization, this only affects full gcs, since R_StringHash 
will quickly belong to the oldest generation.
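
Directly marking & snapping means roughly the following (simplified sketch, 
using the macros from memory.c):

/* mark R_StringHash and move it straight into its old generation,
   rather than pushing it through FORWARD_NODE()/PROCESS_NODES() */
if (! NODE_IS_MARKED(R_StringHash)) {
    MARK_NODE(R_StringHash);
    UNSNAP_NODE(R_StringHash);
    SNAP_NODE(R_StringHash,
              R_GenHeap[NODE_CLASS(R_StringHash)].Old[NODE_GENERATION(R_StringHash)]);
    R_GenHeap[NODE_CLASS(R_StringHash)].OldCount[NODE_GENERATION(R_StringHash)]++;
}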

Attached gc_test.R is the script I used to get the previously mentioned and 
more gc timings.

To me this looks like a significant performance improvement, especially given 
the small changeset. What do you think?

Best regards,
Andreas

Index: src/main/memory.c
===
--- src/main/memory.c	(revision 81008)
+++ src/main/memory.c	(working copy)
@@ -90,6 +90,7 @@
 #include <R_ext/Rallocators.h> /* for R_allocator_t structure */
 #include <Rmath.h> // R_pow_di
 #include <Print.h> // R_print
+#include <pthread.h> // pthread_atfork
 
 /* malloc uses size_t.  We are assuming here that size_t is at least
as large as unsigned long.  Changed from int at 1.6.0 to (i) allow
@@ -100,6 +101,22 @@
 static int gc_reporting = 0;
 static int gc_count = 0;
 
+static int gc_threads = 4;
+static int pre_fork_gc_threads;
+
+void when_fork() {
+    pre_fork_gc_threads = gc_threads;
+    gc_threads = 1;
+}
+
+void after_fork() {
+    gc_threads = pre_fork_gc_threads;
+}
+
+void avoid_openmp_hang_within_fork() {
+    pthread_atfork(&when_fork, &after_fork, NULL);
+}
+
 /* Report error encountered during garbage collection where for detecting
problems it is better to abort, but for debugging (or some production runs,
where external validation of results is possible) it may be preferred to
@@ -406,6 +423,18 @@
 }
 }
 
+static void init_gc_threads()
+{
+    char *arg;
+
+    arg = getenv("R_GC_THREADS");
+
+    if (arg != NULL) {
+        int threads = (int) atof(arg);
+        if (threads >= 1) gc_threads = threads;
+    }
+}
+
 /* Maximal Heap Limits.  These variables contain upper limits on the
heap sizes.  They could be made adjustable from the R level,
perhaps by a handler for a recoverable error.
@@ -1815,6 +1844,7 @@
 {
 	SEXP t;
 	int nc = 0;
+#pragma omp parallel for num_threads(gc_threads) reduction(+:nc) private(s, t) schedule(static)
         for (i = 0; i < LENGTH(R_StringHash); i++) {
             s = VECTOR_ELT(R_StringHash, i);
             t = R_NilValue;
@@ -1827,8 +1857,6 @@
                     s = CXTAIL(s);
                     continue;
                 }
-                FORWARD_NODE(s);
-                FORWARD_NODE(CXHEAD(s));
                 t = s;
                 s = CXTAIL(s);
             }
@@ -2124,6 +2152,7 @@
 
     init_gctorture();
     init_gc_grow_settings();
+    init_gc_threads();
 
     arg = getenv("_R_GC_FAIL_ON_ERROR_");
     if (arg != NULL && StringTrue(arg))
@@ -2215,6 +2244,8 @@
     R_LogicalNAValue = allocVector(LGLSXP, 1);
     LOGICAL(R_LogicalNAValue)[0] = NA_LOGICAL;
     MARK_NOT_MUTABLE(R_LogicalNAValue);
+
+    avoid_openmp_hang_within_fork();
 }
 
 /* Since memory allocat

[Rd] GC: improving the marking performance for STRSXPs

2021-10-07 Thread Andreas Kersting
Hi all,

In GC (in src/main/memory.c), FORWARD_CHILDREN() (called by PROCESS_NODES()) 
treats a STRSXP just like a VECSXP, i.e. it calls FORWARD_NODE() for all of its 
children. I claim that this is unnecessarily inefficient, since the children of 
a STRSXP can legitimately only be (atomic) CHARSXPs and could hence be marked 
directly in the call to FORWARD_CHILDREN() on the STRSXP.
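
In essence (a simplified sketch; the attached patch implements this via a 
specialized STRSXP case in the child-iteration macro):

/* the children of a STRSXP can only be CHARSXPs, which have nothing of
   their own to trace (see the ATTRIB assumption below), so they can be
   marked on the spot instead of being pushed onto the forwarding list */
for (R_xlen_t i = 0; i < XLENGTH(s); i++) {
    SEXP c = VECTOR_ELT(s, i);
    if (c != NULL && ! NODE_IS_MARKED(c)) {
        MARK_NODE(c);
        /* ... unsnap c and snap it into its old generation ... */
    }
}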

Attached patch (atomic_CHARSXP.diff) implements this and gives the following 
performance improvements on my system compared to R devel (revision 81008):

Elapsed time for two full gc in a session after

x <- as.character(runif(5e7))[]

19sec -> 15sec.

This is the best-case scenario for the patch: very many unique/unmarked CHARSXP 
in the STRSXP. For already marked CHARSXP there is no performance gain since 
FORWARD_NODE() is a no-op for them.

The relative performance gain is even bigger if iterating through the STRSXP 
produces many cache misses, as e.g. after

x <- as.character(runif(5e7))[]
x <- sample(x, length(x))

Elapsed time for two full gc here: 83sec -> 52sec. This is because with the 
patch we incur fewer cache misses per CHARSXP.

This patch additionally assumes that the ATTRIBs of a CHARSXP are not to be 
traced, because they are just used for maintaining the CHARSXP hash chains.
The second attached patch (atomic_CHARSXP_safe_unlikely.diff) checks both 
assumptions and calls gc_error() if they are violated; it is still noticeably 
faster than R devel: 19sec -> 17sec and 83sec -> 54sec, respectively.
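
Conceptually, the added check looks like this (an assumed form only, since the 
attachment is truncated below):

/* "safe" variant: verify that the element really is a CHARSXP before
   treating it as atomic; the ATTRIB assumption is checked analogously */
#define FC_PROCESS_CHARSXP_SAFE(s) do { \
    SEXP cs__n__ = (s); \
    if (cs__n__ && TYPEOF(cs__n__) != CHARSXP) \
        gc_error("STRSXP element is not a CHARSXP"); \
    /* ... then mark & snap as in the fast variant ... */ \
} while (0)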

Attached gc_test.R is the script I used to get the previously mentioned and 
more gc timings.

Do you think that this is a reasonable change? It does make the code more 
complex, and I am not sure whether there might be situations in which the 
assumptions are violated, even though SET_STRING_ELT() and installAttrib() do 
enforce them.

Best regards,
Andreas

Index: src/main/memory.c
===
--- src/main/memory.c	(revision 81008)
+++ src/main/memory.c	(working copy)
@@ -678,7 +678,7 @@
 #define FREE_FORWARD_CASE
 #endif
 /*** assume for now all ALTREP nodes are based on CONS nodes */
-#define DO_CHILDREN(__n__,dc__action__,dc__extra__) do { \
+#define DO_CHILDREN4(__n__,dc__action__,dc__str__action__,dc__extra__) do { \
   if (HAS_GENUINE_ATTRIB(__n__)) \
 dc__action__(ATTRIB(__n__), dc__extra__); \
   if (ALTREP(__n__)) {	\
@@ -701,6 +701,12 @@
   case S4SXP: \
 break; \
   case STRSXP: \
+{ \
+  R_xlen_t i; \
+  for (i = 0; i < XLENGTH(__n__); i++) \
+	  dc__str__action__(VECTOR_ELT(__n__, i), dc__extra__); \
+} \
+break; \
   case EXPRSXP: \
   case VECSXP: \
 { \
@@ -740,6 +746,7 @@
   } \
 } while(0)
 
+#define DO_CHILDREN(__n__,dc__action__,dc__extra__) DO_CHILDREN4(__n__,dc__action__,dc__action__,dc__extra__)
 
 /* Forwarding Nodes.  These macros mark nodes or children of nodes and
place them on the forwarding list.  The forwarding list is assumed
@@ -758,8 +765,22 @@
 } while (0)
 
 #define FC_FORWARD_NODE(__n__,__dummy__) FORWARD_NODE(__n__)
-#define FORWARD_CHILDREN(__n__) DO_CHILDREN(__n__,FC_FORWARD_NODE, 0)
 
+#define PROCESS_CHARSXP(s) do { \
+  SEXP fn__n__ = (s); \
+  if (fn__n__ && ! NODE_IS_MARKED(fn__n__)) { \
+    CHECK_FOR_FREE_NODE(fn__n__) \
+    MARK_NODE(fn__n__); \
+    UNSNAP_NODE(fn__n__); \
+    SNAP_NODE(fn__n__, R_GenHeap[NODE_CLASS(fn__n__)].Old[NODE_GENERATION(fn__n__)]); \
+    R_GenHeap[NODE_CLASS(fn__n__)].OldCount[NODE_GENERATION(fn__n__)]++; \
+  } \
+} while (0)
+
+#define FC_PROCESS_CHARSXP(__n__,__dummy__) PROCESS_CHARSXP(__n__)
+
+#define FORWARD_CHILDREN(__n__) DO_CHILDREN4(__n__,FC_FORWARD_NODE, FC_PROCESS_CHARSXP, 0)
+
 /* This macro should help localize where a FREESXP node is encountered
in the GC */
 #ifdef PROTECTCHECK

Index: src/main/memory.c
===
--- src/main/memory.c	(revision 81008)
+++ src/main/memory.c	(working copy)
@@ -678,7 +678,7 @@
 #define FREE_FORWARD_CASE
 #endif
 /*** assume for now all ALTREP nodes are based on CONS nodes */
-#define DO_CHILDREN(__n__,dc__action__,dc__extra__) do { \
+#define DO_CHILDREN4(__n__,dc__action__,dc__str__action__,dc__extra__) do { \
   if (HAS_GENUINE_ATTRIB(__n__)) \
 dc__action__(ATTRIB(__n__), dc__extra__); \
   if (ALTREP(__n__)) {	\
@@ -701,6 +701,12 @@
   case S4SXP: \
 break; \
   case STRSXP: \
+{ \
+  R_xlen_t i; \
+  for (i = 0; i < XLENGTH(__n__); i++) \
+	  dc__str__action__(VECTOR_ELT(__n__, i), dc__extra__); \
+} \
+break; \
   case EXPRSXP: \
   case VECSXP: \
 { \
@@ -740,6 +746,7 @@
   } \
 } while(0)
 
+#define DO_CHILDREN(__n__,dc__action__,dc__extra__) DO_CHILDREN4(__n__,dc__action__,dc__action__,dc__extra__)
 
 /* Forwarding Nodes.  These macros mark nodes or children of nodes and
place them on the forwarding list.  The forwarding list is assumed
@@ -758,8 +765,34 @@
 } while (0)
 
 #define FC_FORWARD_NODE(__n__,__dummy__

[Rd] GC: speeding-up the CHARSXP cache maintenance, 2nd try

2021-11-04 Thread Andreas Kersting
Hi,

In https://stat.ethz.ch/pipermail/r-devel/2021-October/081147.html I proposed 
to speed up the CHARSXP cache maintenance during GC using threading. This was 
rejected by Luke in 
https://stat.ethz.ch/pipermail/r-devel/2021-October/081172.html.

Here I want to propose an alternative approach to significantly speed up 
CHARSXP cache maintenance during partial GCs. A patch which passes `make 
check-devel` is attached. Compared to R devel (revision 81110) I get the 
following performance improvements on my system:

Elapsed time for five non-full gc in a session after

x <- as.character(runif(5e7))[]
gc(full = TRUE)

+20sec -> ~1sec.


This patch introduces (theoretical) overheads to mkCharLenCE() and full GCs. 
However, I did not measure dramatic differences:

y <- "old_CHARSXP" 

after

x <- "old_CHARSXP"; gc(); gc()

takes a median 32 nanoseconds with and without the patch.


gc(full = TRUE)

in a new session takes a median 16 milliseconds with and 14 without the patch.


The basic idea is to maintain the CHARSXP cache using subtables in 
R_StringHash, one for each of the (NUM_GC_GENERATIONS := NUM_OLD_GENERATIONS + 
1) GC generations. New CHARSXPs are added by mkCharLenCE() to the subtable of 
the youngest generation. After a partial GC, only the chains anchored at the 
subtables of the youngest (num_old_gens_to_collect + 1) generations need to be 
searched for and cleaned of unmarked nodes. Afterwards, these chains need to be 
merged into those of the respective next generation, if any. This approach 
relies on the fact that an object/CHARSXP can never become younger again. It is 
OK though if an object/CHARSXP "skips" a GC generation.

R_StringHash, which is now of length (NUM_GC_GENERATIONS * char_hash_size), is 
structured such that the chains for the same hashcode but for different 
generations are anchored at slots of R_StringHash which are next to each other 
in memory. This is because we often need to access two or more (i.e. currently 
all three) of them for one operation and this avoids cache misses.
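
In code, the anchor slot for a given hashcode and generation is computed along 
these lines (an assumed form, consistent with the new_slot variable visible in 
the attached patch; generation is illustrative and runs from 0 to 
NUM_GC_GENERATIONS - 1):

unsigned int hashcode = char_hash(CHAR(val), LENGTH(val)) & newmask;
unsigned int new_slot = hashcode * NUM_GC_GENERATIONS + generation;
SEXP chain = VECTOR_ELT(R_StringHash, new_slot);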

HASHPRI, i.e. the number of occupied primary slots, is computed and stored as 
NUM_GC_GENERATIONS times the number of slots which are occupied in at least one 
of the subtables. This is done because in mkCharLenCE() we need to iterate 
through one or more chains if and only if there is a chain for the particular 
hashcode in at least one subtable.
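
In other words (an assumed sketch of the counting scheme, not the patch code):

/* a hashcode counts as occupied if at least one generation's subtable
   has a non-empty chain for it; HASHPRI then moves in steps of
   NUM_GC_GENERATIONS */
int occupied = 0;
for (int gen = 0; gen < NUM_GC_GENERATIONS; gen++)
    if (! ISNULL(VECTOR_ELT(R_StringHash,
                            hashcode * NUM_GC_GENERATIONS + gen)))
        occupied = 1;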

I tried to keep the patch as minimal as possible. In particular, I did not add 
long vector support to R_StringHash; instead, I reduced the max value of 
char_hash_size from 2^30 to 2^29, assuming that NUM_OLD_GENERATIONS is (not 
larger than) 2. I also did not yet adjust do_show_cache() and do_write_cache(), 
but I could do so if the patch is accepted.

Thanks for your consideration and feedback.

Regards,
Andreas


P.S. I had a hard time getting the indentation right in the patch due to the 
mix of tabs and spaces. Sorry if I screwed this up.

Index: src/include/Defn.h
===
--- src/include/Defn.h	(revision 81110)
+++ src/include/Defn.h	(working copy)
@@ -94,6 +94,7 @@
  */
 
 #define NAMED_BITS 16
+#define NUM_GC_GENERATIONS 3
 
 /* Flags */
 
Index: src/main/envir.c
===
--- src/main/envir.c	(revision 81110)
+++ src/main/envir.c	(working copy)
@@ -4079,7 +4079,7 @@
 
 void attribute_hidden InitStringHash()
 {
-    R_StringHash = R_NewHashTable(char_hash_size);
+    R_StringHash = R_NewHashTable(char_hash_size * NUM_GC_GENERATIONS);
 }
 
 /* #define DEBUG_GLOBAL_STRING_HASH 1 */
@@ -4089,7 +4089,7 @@
 {
     SEXP old_table = R_StringHash;
     SEXP new_table, chain, new_chain, val, next;
-    unsigned int counter, new_hashcode, newmask;
+    unsigned int counter, new_slot, newmask;
 #ifdef DEBUG_GLOBAL_STRING_HASH
     unsigned int oldsize = HASHSIZE(R_StringHash);
     unsigned int oldpri = HASHPRI(R_StringHash);
@@ -4103,27 +4103,38 @@
     /* When using the ATTRIB fields to maintain the chains the chain
        moving is destructive and does not involve allocation.  This is
        therefore the only point where GC can occur. */
-    new_table = R_NewHashTable(newsize);
+    new_table = R_NewHashTable(newsize * NUM_GC_GENERATIONS);
     newmask = newsize - 1;
 
-    /* transfer chains from old table to new table */
-    for (counter = 0; counter < LENGTH(old_table); counter++) {
-        chain = VECTOR_ELT(old_table, counter);
-        while (!ISNULL(chain)) {
-            val = CXHEAD(chain);
-            next = CXTAIL(chain);
-            new_hashcode = char_hash(CHAR(val), LENGTH(val)) & newmask;
-            new_chain = VECTOR_ELT(new_table, new_hashcode);
-            /* If using a primary slot then increase HASHPRI */
-            if (ISNULL(new_chain))
-                SET_HASHPRI(new_table, HASHPRI(new_table) + 1);
-            /* move the current chain link to the new chain */
-            /* this is a destructive modification */
-            new_chain = SET_CXTAIL(val, new_chain);
-            SET_VECTOR_ELT(new_tab

Re: [Rd] GC: parallelizing the CHARSXP cache maintenance

2021-11-07 Thread Andreas Kersting
I have now added this to the wishlist in Bugzilla: 
https://bugs.r-project.org/show_bug.cgi?id=18234

2021-10-07 09:26 GMT+02:00 "Andreas Kersting" :
> Hi all,
> 
> As part of RunGenCollect() (in src/main/memory.c), some maintenance is done
> on the CHARSXP cache: unmarked nodes/CHARSXPs are removed from the hash
> chains. This always touches every CHARSXP in the cache, irrespective of the
> number of generations which were just garbage collected. In a session with a
> big CHARSXP cache, this significantly slows down gc even when only the
> youngest generation is collected.
> 
> However, this part of RunGenCollect() seems to be one of the few which can
> easily be parallelized without any need for thread synchronization, and it
> seems to be the part that profits most from doing so.
> 
> Attached patch (parallel_CHARSXP_cache.diff) implements parallelization over 
> the elements of R_StringHash and gives the following performance improvements 
> on my system when using 4 threads compared to R devel (revision 81008):
> 
> Elapsed time for 200 non-full gc in a session after
> 
> x <- as.character(runif(1e6))[]
> gc(full = TRUE)
> 
> 8sec -> 2.5sec.
> 
> AND
> 
> Elapsed time for five non-full gc in a session after
> 
> x <- as.character(runif(5e7))[]
> gc(full = TRUE)
> 
> 21sec -> 6sec.
> 
> In the patch, I dropped the two lines 
> 
> FORWARD_NODE(s);
> FORWARD_NODE(CXHEAD(s));
> 
> because they are currently both no-ops (and would require synchronization if 
> they were not). They are no-ops because we have
> 
> # define CXHEAD(x) (x)  // in Defn.h
> 
> and hence FORWARD_NODE(s)/FORWARD_NODE(CXHEAD(s)) is only called when s is 
> already marked, in which case FORWARD_NODE() does nothing.
> 
> I used OpenMP despite the known issues some of its implementations have with
> hanging after a fork, mostly because it was the easiest thing to do for a
> PoC. I worked around this similarly to e.g. data.table, by using only one
> thread in forked children.
> 
> It might be worth considering making the parallelization conditional on the
> size of the CHARSXP cache and using only the main thread while the cache is
> (still) small.
> 
> In the second attached patch (parallel_CHARSXP_cache_no_forward.diff) I
> additionally no longer call FORWARD_NODE(R_StringHash), because that would
> make the following call to PROCESS_NODES() iterate through all elements of
> R_StringHash again, which is unnecessary: all elements are either R_NilValue
> or an already marked CHARSXP. Instead, I directly mark & snap R_StringHash.
> In contrast to the parallelization, this only affects full gcs, since
> R_StringHash will quickly belong to the oldest generation.
> 
> Attached gc_test.R is the script I used to get the previously mentioned and 
> more gc timings.
> 
> To me this looks like a significant performance improvement, especially given
> the small changeset. What do you think?
> 
> Best regards,
> Andreas
__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] [External] GC: speeding-up the CHARSXP cache maintenance, 2nd try

2021-11-07 Thread Andreas Kersting
Hi Luke,

As proposed by you, I have just added this to the wishlist in Bugzilla: 
https://bugs.r-project.org/show_bug.cgi?id=18233

Best,
Andreas

2021-11-04 17:04 GMT+01:00 luke-tier...@uiowa.edu:
> Can you please submit this as a wishlist item to bugzilla? it is
> easier to keep track of there. You could also submit your threads
> based suggestion there, again to keep it easier to keep track of and
> possibly get back to in the future.
> 
> I will have a look at your approach when I get a chance, but I am
> exploring a different approach to avoid scanning old generations that
> may be simpler.
> 
> Best,
> 
> luke
> 
> On Wed, 3 Nov 2021, Andreas Kersting wrote:
> 
>> Hi,
>>
>> In https://stat.ethz.ch/pipermail/r-devel/2021-October/081147.html I 
>> proposed to speed up the CHARSXP cache maintenance during GC using 
>> threading. This was rejected by Luke in 
>> https://stat.ethz.ch/pipermail/r-devel/2021-October/081172.html.
>>
>> Here I want to propose an alternative approach to significantly speed up 
>> CHARSXP cache maintenance during partial GCs. A patch which passes `make 
>> check-devel` is attached. Compared to R devel (revision 81110) I get the 
>> following performance improvements on my system:
>>
>> Elapsed time for five non-full gc in a session after
>>
>> x <- as.character(runif(5e7))[]
>> gc(full = TRUE)
>>
>> +20sec -> ~1sec.
>>
>>
>> This patch introduces (theoretical) overheads to mkCharLenCE() and full GCs. 
>> However, I did not measure dramatic differences:
>>
>> y <- "old_CHARSXP"
>>
>> after
>>
>> x <- "old_CHARSXP"; gc(); gc()
>>
>> takes a median 32 nanoseconds with and without the patch.
>>
>>
>> gc(full = TRUE)
>>
>> in a new session takes a median 16 milliseconds with and 14 without the 
>> patch.
>>
>>
>> The basic idea is to maintain the CHARSXP cache using subtables in 
>> R_StringHash, one for each of the (NUM_GC_GENERATIONS := NUM_OLD_GENERATIONS 
>> + 1) GC generations. New CHARSXPs are added by mkCharLenCE() to the subtable 
>> of the youngest generation. After a partial GC, only the chains anchored at 
>> the subtables of the youngest (num_old_gens_to_collect + 1) generations need 
>> to be searched for and cleaned of unmarked nodes. Afterwards, these chains 
>> need to be merged into those of the respective next generation, if any. This 
>> approach relies on the fact that an object/CHARSXP can never become younger 
>> again. It is OK though if an object/CHARSXP "skips" a GC generation.
>>
>> R_StringHash, which is now of length (NUM_GC_GENERATIONS * char_hash_size), 
>> is structured such that the chains for the same hashcode but for different 
>> generations are anchored at slots of R_StringHash which are next to each 
>> other in memory. This is because we often need to access two or more (i.e. 
>> currently all three) of them for one operation and this avoids cache misses.
>>
>> HASHPRI, i.e. the number of occupied primary slots, is computed and stored 
>> as NUM_GC_GENERATIONS times the number of slots which are occupied in at 
>> least one of the subtables. This is done because in mkCharLenCE() we need to 
>> iterate through one or more chains if and only if there is a chain for the 
>> particular hashcode in at least one subtable.
>>
>> I tried to keep the patch as minimal as possible. In particular, I did not
>> add long vector support to R_StringHash; instead, I reduced the max value of
>> char_hash_size from 2^30 to 2^29, assuming that NUM_OLD_GENERATIONS is (not
>> larger than) 2. I also did not yet adjust do_show_cache() and
>> do_write_cache(), but I could do so if the patch is accepted.
>>
>> Thanks for your consideration and feedback.
>>
>> Regards,
>> Andreas
>>
>>
>> P.S. I had a hard time getting the indentation right in the patch due to the
>> mix of tabs and spaces. Sorry if I screwed this up.
> 
> -- 
> Luke Tierney
> Ralph E. Wareham Professor of Mathematical Sciences
> University of Iowa                  Phone:             319-335-3386
> Department of Statistics and        Fax:               319-335-3017
>    Actuarial Science
> 241 Schaeffer Hall                  email:   luke-tier...@uiowa.edu
> Iowa City, IA 52242                 WWW:  http://www.stat.uiowa.edu
> 
__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel