Thanks, that's helpful. I'll tidy up and submit the patch and que sera, sera.
FWIW my code does already trap parse-errors, and in fact it will check all 5 "symboloids" in your snippet function(x = y) g(z = w) . The $parseData reports SYMBOL, SYMBOL_SUB, SYMBOL_FUNCTION_CALL, SYMBOL_FORMALS for the different cases, which are all caught by grepl(). I initially did use getParseData() but then realized it wasn't adding anything (except time). Anyway, I'll leave both options in there, de-piped, for people to consider. cheers Mark On Thu, Dec 4, 2025, at 09:20, [email protected] wrote: > On Mon, 1 Dec 2025, Mark Bravington wrote: > > > [You don't often get email from [email protected]. Learn why this > > is important at https://aka.ms/LearnAboutSenderIdentification ] > > > > On Sun, Nov 30, 2025, at 13:10, [email protected] wrote: > >> Keeping the ASCII-only restriction for code is important as it makes > >> the code easier to understand by a wider audience. > >> > >> Allowing non-ASCII characters in literal strings, raw or regular, does > >> seem reasonable to me in principle, but others may see issues I am not > >> aware of. > >> > >> But checking for non-ASCII characters in code while allowing non-ASCII > >> characters in string literals needs much more sophisticated check code > >> than we currently have. If you or anyone else want to see this happen > >> you can explore creating a patch and submit to bugzilla for > >> consideration. > > > > Fair enough. It might be easier than you suspect, though, since the parser > > already does the heavy lifting--- code below. > > > > (i) If the file doesn't even parse, that's a more serious problem! > > You still have to handle it in a way that is consistent with the rest > of the checking process, which I believe means catching the error and > returning FALSE. I would use tryCatch for that. > > > (ii) If the file does parse OK, then AFAICS the only places that non-ASCII > > characters might be lurking are: (a) in comments, where they are somewhat > > grudgingly allowed IIRC; (b) in string literals, where we would like to > > allow them; and of course (c) in symbols (variable names; see notes > > below), where we DON'T want them if it's a package. And this can all be > > checked easily from $parseData. My specimen function below does it in ~20 > > lines of "real" code. > > It isn't quite right though: symbols can appear in a few other > places. Look at > > function(x = y) g(z = w) > > I believe you are only picking up two of the five symbols you want. > > Also you can simplify your code by using getParseData. I would also > avoid using the pipe operator since it isn't consistent with the > coding style in the file you are proposing to change. > > > A couple of notes: > > > > #1 I didn't realize that it is even possible to have a "normal" (ie > > non-backticked) variable name with non-ASCII letters (see ?Quotes, "Names > > and Identifiers"). And indeed I can run the following in my (Anglo) Windows > > RGUI: > > > > français <- 'bon' > > > > Crikey, that's actually scary... Anyway, the intention is clearly to NOT > > allow that in package code, at least not yet. > > > > #2 Should packages nevertheless be allowed to use backticked identifiers > > containing non-ASCII characters? (IME backticks are often used for funny > > names with all-ASCII characters but in the wrong places.) Personally I'd > > vote no, but it's well above my pay grade--- and there's no voting in R. > > Anyhow, my code below has an option to check/not-check backticked symbols. > > > > Is this likely to be acceptable? If so I'll try to submit a formal patch. > > It is worth putting together a clean and well-tested patch that can be > easily reviewed and tested by others. There are folks who spend much > more time than I do on the QC code and may see reasons why going down > this road is a bad idea, or how to do this better, but we'll see. > > Best, > > luke > > > > > cheers > > Mark > > > > > > ## My function: > > > > check_ASCII_code_MVB <- function( > > file, pp= NULL, check_backticks= FALSE > > ){ > > # Checks that any non-ASCII UTF-8 characters are confined to > > # string-literals & comments > > > > # Can directly supply results of previous parse(), for speed > > if( is.null( pp)){ # ... or, if not: > > pp <- try( parse( file=file, keep.source=TRUE, encoding='UTF-8')) > > if( inherits( pp, 'try-error')){ > > warning( "Can't even parse, let alone check for non-ASCII") > > return( FALSE) > > } > > } > > > > # Get tokens of "leaf" (terminal) elements, and associated text > > # This mimicks utils::getParseData() > > ppd <- pp |> attr( 'wholeSrcref') |> attr( 'srcfile') |> > > _$parseData |> attributes() |> _[ c( 'tokens', 'text')] > > > > symbols <- with( ppd, > > text[ grepl( 'SYMBOL', tokens, fixed=TRUE)]) > > > > if( !check_backticks){ > > # Not obvious whether to allow UTF-8 in backticked names > > > > # AFAICS backticks can only occur both at start and end of a parsable > > symbol > > backy <- startsWith( symbols, r"{`}") & endsWith( symbols, r"{`}") > > symbols <- symbols[ !backy] > > } > > > > non_ASCII <- .Call( tools:::C_nonASCII, symbols) > > > > OK <- !any( non_ASCII) > > if( !OK){ > > attr( OK, 'offending_symbols') <- unique( symbols[ non_ASCII]) > > } > > return( OK) > > } > > > > ## A snippet to save into a file, for testing. Note the raw string: > > irrelevant, but useful. > > > > nonASCII_R <- r"--{ > > français <- 'bon' > > `français` <- 'bon' > > lingo <- "français" > > # Nothing wrong with a bit of français in comments > > }--" |> strsplit( '\n') |> _[[1]] > > > > writeLines( nonASCII_R, <file of your choice>) > > > > > > ## Possible patch of tools::.check_package_ASCII_code : > > > > .check_package_ASCII_code_patch <- function ( > > dir, respect_quotes = FALSE > > ){ > > if (!dir.exists(dir)) > > stop(gettextf("directory '%s' does not exist", dir), > > domain = NA) > > dir <- file_path_as_absolute(dir) > > wrong_things <- character() > > for (f in c(file.path(dir, "NAMESPACE"), > > list_files_with_type(file.path(dir, > > "R"), "code", OS_subdirs = c("unix", "windows")))) { > > ## OLD > > #text <- readLines(f, warn = FALSE) > > # if (.Call(C_check_nonASCII, text, respect_quotes)) > > ## NEW > > if( !check_ASCII_code_MVB( f)) > > wrong_things <- c(wrong_things, f) > > } > > if (length(wrong_things)) { > > wrong_things <- substring(wrong_things, nchar(dir) + > > 2L) > > cat(wrong_things, sep = "\n") > > } > > invisible(wrong_things) > > } > > > > > > -- > Luke Tierney > Ralph E. Wareham Professor of Mathematical Sciences > University of Iowa Phone: 319-335-3386 > Department of Statistics and Fax: 319-335-3017 > Actuarial Science > 241 Schaeffer Hall email: [email protected] > Iowa City, IA 52242 WWW: http://www.stat.uiowa.edu [[alternative HTML version deleted]] ______________________________________________ [email protected] mailing list https://stat.ethz.ch/mailman/listinfo/r-package-devel
