Re: [R-pkg-devel] [External] Re: UTF-8 and raw strings in package code

Luke Tierney via R-package-devel Wed, 03 Dec 2025 14:20:52 -0800

On Mon, 1 Dec 2025, Mark Bravington wrote:



On Sun, Nov 30, 2025, at 13:10, [email protected] wrote:

Keeping the ASCII-only restriction for code is important as it makes
the code easier to understand by a wider audience.

Allowing non-ASCII characters in literal strings, raw or regular, does
seem reasonable to me in principle, but others may see issues I am not
aware of.

But checking for non-ASCII characters in code while allowing non-ASCII
characters in string literals needs much more sophisticated check code
than we currently have. If you or anyone else wants to see this happen,
you can explore creating a patch and submitting it to Bugzilla for
consideration.


Fair enough. It might be easier than you suspect, though, since the parser 
already does the heavy lifting; code is below.

(i) If the file doesn't even parse, that's a more serious problem!


You still have to handle it in a way that is consistent with the rest
of the checking process, which I believe means catching the error and
returning FALSE. I would use tryCatch for that.
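
A minimal sketch of the tryCatch suggestion above. The helper name
safe_parse is hypothetical (it is not part of tools), and mapping a parse
failure to NULL is just one possible convention; a checker could then
turn NULL into FALSE.

```r
# Hypothetical helper: parse a file, returning NULL instead of signalling
# an error when the file does not parse.
safe_parse <- function(file) {
    tryCatch(
        parse(file = file, keep.source = TRUE, encoding = "UTF-8"),
        error = function(e) NULL
    )
}

# One file with a syntax error, one that parses cleanly
bad <- tempfile(fileext = ".R")
writeLines("f <- function( {", bad)
good <- tempfile(fileext = ".R")
writeLines("f <- function(x) x + 1", good)

is.null(safe_parse(bad))   # the broken file is reported via NULL
is.null(safe_parse(good))  # the clean file returns an expression vector
```

Unlike try(), this prints nothing to the console on failure, which may
fit better with the rest of the checking process.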

(ii) If the file does parse OK, then AFAICS the only places that non-ASCII characters 
might be lurking are: (a) in comments, where they are somewhat grudgingly allowed IIRC; 
(b) in string literals, where we would like to allow them; and of course (c) in symbols 
(variable names; see notes below), where we DON'T want them if it's a package. And this 
can all be checked easily from $parseData. My specimen function below does it in ~20 
lines of "real" code.


It isn't quite right though: symbols can appear in a few other
places. Look at

    function(x = y) g(z = w)

I believe you are only picking up two of the five symbols you want.

Also you can simplify your code by using getParseData. I would also
avoid using the pipe operator since it isn't consistent with the
coding style in the file you are proposing to change.
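
One way to see which tokens the parser assigns in that example is to
inspect utils::getParseData directly, without the pipe. This is only a
sketch: the helper name symbol_texts is hypothetical, and selecting every
terminal token whose name starts with "SYMBOL" is an assumption about
what a check would want to cover. For this expression the parse data
should report SYMBOL_FORMALS for x, SYMBOL_FUNCTION_CALL for g, SYMBOL_SUB
for z, and plain SYMBOL for y and w.

```r
# Hypothetical helper: text of every terminal token whose token name
# starts with "SYMBOL" (SYMBOL, SYMBOL_FORMALS, SYMBOL_SUB, ...)
symbol_texts <- function(exprs) {
    pd <- utils::getParseData(exprs)
    pd$text[pd$terminal & startsWith(pd$token, "SYMBOL")]
}

exprs <- parse(text = "function(x = y) g(z = w)", keep.source = TRUE)

# Show the token name next to each piece of source text
pd <- utils::getParseData(exprs)
print(pd[pd$terminal, c("token", "text")])

# Collect the symbol-like tokens
symbol_texts(exprs)
```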

A couple of notes:

#1 I didn't realize that it is even possible to have a "normal" (ie non-backticked) 
variable name with non-ASCII letters (see ?Quotes, "Names and Identifiers"). And indeed I 
can run the following in my (Anglo) Windows RGUI:

français <- 'bon'

Crikey, that's actually scary... Anyway, the intention is clearly to NOT allow 
that in package code, at least not yet.

#2 Should packages nevertheless be allowed to use backticked identifiers 
containing non-ASCII characters? (IME backticks are often used for funny names 
with all-ASCII characters but in the wrong places.) Personally I'd vote no, but 
it's well above my pay grade, and there's no voting in R. Anyhow, my code 
below has an option to check/not-check backticked symbols.
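
The backtick option can be illustrated on a plain character vector of
symbol texts. This is a sketch only: has_non_ascii and offending_symbols
are hypothetical names, and the byte-level test (any byte above 127) is a
stand-in that avoids calling an unexported entry point; it is not claimed
to match the internal check exactly.

```r
# Hypothetical helper: TRUE for each string that contains a byte above 127
has_non_ascii <- function(x) {
    vapply(x, function(s) any(as.integer(charToRaw(s)) > 127L),
           logical(1), USE.NAMES = FALSE)
}

# Hypothetical helper: symbols (as written in the source) that contain
# non-ASCII characters, optionally ignoring backticked names
offending_symbols <- function(symbols, check_backticks = FALSE) {
    if (!check_backticks) {
        backticked <- startsWith(symbols, "`") & endsWith(symbols, "`")
        symbols <- symbols[!backticked]
    }
    unique(symbols[has_non_ascii(symbols)])
}

# \u escapes keep this snippet itself ASCII-only
symbols <- c("lingo", "fran\u00e7ais", "`fran\u00e7ais`", "`odd name`")

offending_symbols(symbols)                          # bare name only
offending_symbols(symbols, check_backticks = TRUE)  # backticked one too
```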

Is this likely to be acceptable? If so I'll try to submit a formal patch.


It is worth putting together a clean and well-tested patch that can be
easily reviewed and tested by others. There are folks who spend much
more time than I do on the QC code and may see reasons why going down
this road is a bad idea, or how to do this better, but we'll see.

Best,

luke


cheers
Mark


## My function:

check_ASCII_code_MVB <- function(
   file, pp= NULL, check_backticks= FALSE
){
 # Checks that any non-ASCII UTF-8 characters are confined to
 # string-literals & comments

 # Can directly supply results of previous parse(), for speed
 if( is.null( pp)){ # ... or, if not:
   pp <- try( parse( file=file, keep.source=TRUE, encoding='UTF-8'))
   if( inherits( pp, 'try-error')){
     warning( "Can't even parse, let alone check for non-ASCII")
     return( FALSE)
   }
 }

 # Get tokens of "leaf" (terminal) elements, and associated text
 # This mimics utils::getParseData()
 ppd <- pp |> attr( 'wholeSrcref') |> attr( 'srcfile') |>
   _$parseData |> attributes() |> _[ c( 'tokens', 'text')]

 symbols <- with( ppd,
     text[ grepl( 'SYMBOL', tokens, fixed=TRUE)])

 if( !check_backticks){
   # Not obvious whether to allow UTF-8 in backticked names

   # AFAICS backticks can only occur both at start and end of a parsable symbol
   backy <- startsWith( symbols, r"{`}") & endsWith( symbols, r"{`}")
   symbols <- symbols[ !backy]
 }

 non_ASCII <- .Call( tools:::C_nonASCII, symbols)

 OK <- !any( non_ASCII)
 if( !OK){
   attr( OK, 'offending_symbols') <- unique( symbols[ non_ASCII])
 }
 return( OK)
}

## A snippet to save into a file, for testing. Note the raw string: irrelevant,
## but useful.

nonASCII_R <- r"--{
 français <- 'bon'
 `français` <- 'bon'
 lingo <- "français"
 # Nothing wrong with a bit of français in comments
}--" |> strsplit( '\n') |> _[[1]]

writeLines( nonASCII_R, <file of your choice>)


## Possible patch of tools::.check_package_ASCII_code :

.check_package_ASCII_code_patch <- function (
 dir, respect_quotes = FALSE
){
   if (!dir.exists(dir))
       stop(gettextf("directory '%s' does not exist", dir),
           domain = NA)
   dir <- file_path_as_absolute(dir)
   wrong_things <- character()
   for (f in c(file.path(dir, "NAMESPACE"), list_files_with_type(file.path(dir,
       "R"), "code", OS_subdirs = c("unix", "windows")))) {
## OLD
       #text <- readLines(f, warn = FALSE)
       # if (.Call(C_check_nonASCII, text, respect_quotes))
## NEW
       if( !check_ASCII_code_MVB( f))
           wrong_things <- c(wrong_things, f)
   }
   if (length(wrong_things)) {
       wrong_things <- substring(wrong_things, nchar(dir) +
           2L)
       cat(wrong_things, sep = "\n")
   }
   invisible(wrong_things)
}


--
Luke Tierney
Ralph E. Wareham Professor of Mathematical Sciences
University of Iowa                  Phone:             319-335-3386
Department of Statistics and        Fax:               319-335-3017
   Actuarial Science
241 Schaeffer Hall                  email:   [email protected]
Iowa City, IA 52242                 WWW:  http://www.stat.uiowa.edu
______________________________________________
[email protected] mailing list
https://stat.ethz.ch/mailman/listinfo/r-package-devel
