Re: [R-pkg-devel] [External] Re: UTF-8 and raw strings in package code

Peter Dalgaard Mon, 01 Dec 2025 06:12:40 -0800

> On 1 Dec 2025, at 02.53, Mark Bravington <[email protected]> wrote:
> 
> On Sun, Nov 30, 2025, at 13:10, [email protected] wrote:
>> Keeping the ASCII-only restriction for code is important as it makes
>> the code easier to understand by a wider audience.
>> 
>> Allowing non-ASCII characters in literal strings, raw or regular, does
>> seem reasonable to me in principle, but others may see issues I am not
>> aware of.
>> 
>> But checking for non-ASCII characters in code while allowing non-ASCII
>> characters in string literals needs much more sophisticated check code
>> than we currently have. If you or anyone else want to see this happen
>> you can explore creating a patch and submit to bugzilla for
>> consideration.
> 
> Fair enough. It might be easier than you suspect, though, since the parser 
> already does the heavy lifting--- code below.
> 
> (i) If the file doesn't even parse, that's a more serious problem! 
> 
> (ii) If the file does parse OK, then AFAICS the only places that non-ASCII 
> characters might be lurking are: (a) in comments, where they are somewhat 
> grudgingly allowed IIRC; (b) in string literals, where we would like to allow 
> them;  and of course (c) in symbols (variable names; see notes below), where 
> we DON'T want them if it's a package. And this can all be checked easily from 
> $parseData. My specimen function below does it in ~20 lines of "real" code.
> 
> A couple of notes:
> 
> #1 I didn't realize that it is even possible to have a "normal" (ie 
> non-backticked) variable name with non-ASCII letters (see ?Quotes, "Names and 
> Identifiers"). And indeed I can run the following in my (Anglo) Windows RGUI:
> 
> français <- 'bon'
> 
> Crikey, that's actually scary... Anyway,  the intention is clearly to NOT 
> allow that in package code, at least not yet.

Maybe scary, but part of the R idiom is that plots etc. get auto-labeled with 
the names of the variables. If I want to do a child-vs-parents' income chart in 
Danish, the labels become "børn" and "forældre". And such names can be column 
names in datasets, etc. You can work around it, but why should you? 
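
A small illustration of the auto-labeling point, as a sketch only: the figures 
are invented, and the names are written with \u escapes purely so that the 
snippet itself stays ASCII. For a two-column data frame, plot() takes its 
default axis labels from the column names.

```r
# Invented figures, for illustration only
income <- data.frame(c(310, 420, 515), c(280, 390, 600))

# \u escapes keep this snippet ASCII; the stored names are the Danish
# words for "parents" and "children"
names(income) <- c("for\u00e6ldre", "b\u00f8rn")

grDevices::pdf(NULL)             # off-screen device, nothing written to disk
plot(income)                     # axis labels default to the column names
invisible(grDevices::dev.off())
```

Renaming the columns to ASCII would work, but then every plot needs explicit 
xlab/ylab arguments to get the Danish labels back.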

So, for local usage, it is quite sensible to allow extended character sets. 

For packages (and other distributed materials) probably not so. It is probably 
the language and not actually the character set you want to restrict, though. 
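
To make the symbols-versus-strings distinction concrete, here is a minimal 
sketch that uses only the documented utils::getParseData(). It is not Mark's 
function and not anything in tools: find_nonascii_symbols() is a made-up name, 
and the sketch assumes a UTF-8 session locale so that bare non-ASCII names 
parse at all.

```r
# Sketch: report symbols containing non-ASCII bytes, ignoring string
# literals and comments. find_nonascii_symbols() is a hypothetical helper.
find_nonascii_symbols <- function(code, check_backticks = FALSE) {
  exprs <- parse(text = code, keep.source = TRUE)
  pd <- utils::getParseData(exprs)
  if (is.null(pd)) return(character())

  # Terminal tokens whose type mentions SYMBOL
  # (SYMBOL, SYMBOL_FUNCTION_CALL, SYMBOL_FORMALS, ...)
  syms <- pd$text[pd$terminal & grepl("SYMBOL", pd$token, fixed = TRUE)]

  if (!check_backticks) {
    # Drop names quoted with backticks at both ends
    syms <- syms[!(startsWith(syms, "`") & endsWith(syms, "`"))]
  }

  # A symbol is non-ASCII if any of its bytes is above 127
  bad <- vapply(
    syms,
    function(s) any(as.integer(charToRaw(s)) > 127L),
    logical(1),
    USE.NAMES = FALSE
  )
  unique(syms[bad])
}
```

Fed the lines of Mark's test snippet, this should report only the bare 
français symbol; the string literal and the comment are never looked at, 
because they have token types other than SYMBOL.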

-pd

> 
> #2 Should packages nevertheless be allowed to use backticked identifiers 
> containing non-ASCII characters? (IME backticks are often used for funny 
> names with all-ASCII characters but in the wrong places.) Personally I'd vote 
> no, but it's well above my pay grade--- and there's no voting in R. Anyhow, 
> my code below has an option to check/not-check backticked symbols.
> 
> Is this likely to be acceptable? If so I'll try to submit a formal patch.
> 
> cheers
> Mark
> 
> 
> ## My function:
> 
> check_ASCII_code_MVB <- function( 
>    file, pp= NULL, check_backticks= FALSE
> ){
>  # Checks that any non-ASCII UTF-8 characters are confined to 
>  # string-literals & comments
> 
>  # Can directly supply results of previous parse(), for speed
>  if( is.null( pp)){ # ... or, if not:
>    pp <- try( parse( file=file, keep.source=TRUE, encoding='UTF-8'))
>    if( inherits( pp, 'try-error')){
>      warning( "Can't even parse, let alone check for non-ASCII")
>      return( FALSE)
>    }
>  }
> 
>  # Get tokens of "leaf" (terminal) elements, and associated text
>  # This mimics utils::getParseData()
>  ppd <- pp |> attr( 'wholeSrcref') |> attr( 'srcfile') |>
>    _$parseData |> attributes() |> _[ c( 'tokens', 'text')]
> 
>  symbols <- with( ppd, 
>      text[ grepl( 'SYMBOL', tokens, fixed=TRUE)])
> 
>  if( !check_backticks){
>    # Not obvious whether to allow UTF-8 in backticked names
> 
>    # AFAICS backticks can only occur both at start and end of a parsable
>    # symbol
>    backy <- startsWith( symbols, r"{`}") & endsWith( symbols, r"{`}")
>    symbols <- symbols[ !backy]
>  }
> 
>  non_ASCII <- .Call( tools:::C_nonASCII, symbols)
> 
>  OK <- !any( non_ASCII)
>  if( !OK){
>    attr( OK, 'offending_symbols') <- unique( symbols[ non_ASCII])
>  }
>  return( OK)
> }
> 
> ## A snippet to save into a file, for testing. Note the raw string:
> ## irrelevant, but useful.
> 
> nonASCII_R <- r"--{
>  français <- 'bon'
>  `français` <- 'bon'
>  lingo <- "français"
>  # Nothing wrong with a bit of français in comments
> }--" |> strsplit( '\n') |> _[[1]]
> 
> writeLines( nonASCII_R, <file of your choice>)
> 
> 
> ## Possible patch of tools::.check_package_ASCII_code :
> 
> .check_package_ASCII_code_patch <- function (
>  dir, respect_quotes = FALSE
> ){
>    if (!dir.exists(dir)) 
>        stop(gettextf("directory '%s' does not exist", dir), 
>            domain = NA)
>    dir <- file_path_as_absolute(dir)
>    wrong_things <- character()
>    for (f in c(file.path(dir, "NAMESPACE"),
>        list_files_with_type(file.path(dir, "R"), "code",
>            OS_subdirs = c("unix", "windows")))) {
> ## OLD        
>        #text <- readLines(f, warn = FALSE)
>        # if (.Call(C_check_nonASCII, text, respect_quotes)) 
> ## NEW        
>        if( !check_ASCII_code_MVB( f))
>            wrong_things <- c(wrong_things, f)
>    }
>    if (length(wrong_things)) {
>        wrong_things <- substring(wrong_things, nchar(dir) + 
>            2L)
>        cat(wrong_things, sep = "\n")
>    }
>    invisible(wrong_things)
> }
> 
> ______________________________________________
> [email protected] mailing list
> https://stat.ethz.ch/mailman/listinfo/r-package-devel

-- 
Peter Dalgaard, Professor,
Center for Statistics, Copenhagen Business School
Solbjerg Plads 3, 2000 Frederiksberg, Denmark
Phone: (+45)38153501
Office: A 4.23
Email: [email protected]  Priv: [email protected]

______________________________________________
[email protected] mailing list
https://stat.ethz.ch/mailman/listinfo/r-package-devel