Next year one of my R programming assignments will read like this: "Here is some R code written by a multi-million-dollar AI system to compute [something]. It doesn't work. Fix the bugs, then ask the AI to write a letter to its creators apologising for how rubbish it is at coding. Collect one million dollars."

On Sun, Dec 18, 2022 at 7:01 PM Boris Steipe <boris.ste...@utoronto.ca> wrote:
>
> Technically not a help question. But crucial to be aware of, especially for
> those of us in academia, or otherwise teaching R. I am not aware of a
> suitable alternate forum. If this does not interest you, please simply ignore
> - I already know that this may be somewhat OT.
>
> Thanks.
> ------------------------------------------------------
>
> You very likely have heard of ChatGPT, the conversation interface on top of
> the GPT-3 large language model and that it can generate code. I thought it
> doesn't do R - I was wrong. Here is a little experiment:
>
> MY PROMPT:
> ==========
> Please write me some R code that will read a tsv file, remove all rows that
> have duplicates in the column "names", and store the median values of the
> remaining rows in a variable called "Medians". Of course, that calculation
> should only be done with columns that are numeric. Print the last variable in
> order from largest to smallest.
>
>
> AI RESPONSE:
> ============
>
> Here is some R code that should accomplish what you've described:
>
> (Note: the code is formatted in a little terminal window, complete with a
> button to "copy code" at the top.)
>
> ----
> # Read the tsv file
> data <- read.table("file.tsv", header = TRUE, sep = "\t")
>
> # Remove rows with duplicates in the "names" column
> data <- data[!duplicated(data$names), ]
>
> # Select only the numeric columns
> num_columns <- sapply(data, is.numeric)
> numeric_data <- data[, num_columns]
>
> # Calculate the medians for each column
> Medians <- apply(numeric_data, 1, median)
>
> # Sort Medians in descending order
> Medians <- sort(Medians, decreasing = TRUE)
>
> # Print Medians
> print(Medians)
> ----
>
> This code assumes that the tsv file is called "file.tsv" and is stored in the
> current working directory.
> It also assumes that the first row of the file contains the column names.
>
>
>
> MY PROMPT:
> ==========
> Can you provide me with a bit of sample data that I can use to test this code?
>
> AI RESPONSE:
> ============
> Certainly! Here is some sample data that you can use to test the code I
> provided:
>
> ----
> name     age  height  weight
> Alice    23   5.6     120
> Bob      32   6.2     180
> Charlie  29   5.9     160
> Alice    23   5.6     120
> David    25   5.8     140
> Eve      27   5.5     110
> ----
>
> You can save this data as a tsv file and use it to test the code.
>
>
> ------------------------------------------------------
>
> Notes:
> 1) The code, as written, does not work with the test data. The reason is that
> the test data specifies a column "name", but the code assumes "names" (which
> I specified in the instructions). Once this is fixed in the test data, the
> code executes.
>
> 2) The instructions are not accurately translated: I had asked for row
> medians. But changing the margin in the apply() statement from 2 to 1 does
> the correct thing.
>
> 3) Note that the test data contains both numeric and non-numeric columns.
> Also, the name "Alice" is duplicated in the test data, which is subtle, and
> the right thing to do.
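
For anyone who wants to check notes 1 to 3 above without creating a file, here is a minimal sketch. It is not from the ChatGPT session. It assumes the sample data with the header renamed from "name" to "names", passes it to read.table() through the text argument instead of a file, and computes both row and column medians so that the effect of the apply() margin is visible. The names sample_lines, row_medians and col_medians are made up for this illustration.

```r
# Sample data from the thread as tab-separated lines. The header "name" has
# been renamed to "names" (an assumption, following note 1) so that it
# matches the column the prompt asked for.
sample_lines <- c(
  "names\tage\theight\tweight",
  "Alice\t23\t5.6\t120",
  "Bob\t32\t6.2\t180",
  "Charlie\t29\t5.9\t160",
  "Alice\t23\t5.6\t120",
  "David\t25\t5.8\t140",
  "Eve\t27\t5.5\t110"
)

# Read from the character vector rather than from a file on disk
data <- read.table(text = sample_lines, header = TRUE, sep = "\t")

# Keep the first occurrence of each name; the second "Alice" row is dropped
data <- data[!duplicated(data$names), ]

# Restrict to numeric columns; drop = FALSE keeps a data frame even if
# only a single column is numeric
numeric_data <- data[, sapply(data, is.numeric), drop = FALSE]

# MARGIN = 1 yields one median per row, MARGIN = 2 one median per column
row_medians <- apply(numeric_data, 1, median)
col_medians <- apply(numeric_data, 2, median)

# Print the column medians from largest to smallest
print(sort(col_medians, decreasing = TRUE))
```

With this data every row median equals the age value, because age lies between height and weight in each row, so the two margins give visibly different answers. Note also that duplicated() keeps the first "Alice" row; if the prompt is read as "drop every row whose name occurs more than once", something like duplicated(x) | duplicated(x, fromLast = TRUE) would be needed instead.
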
>
>
> ------------------------------------------------------
>
> A similar experiment, run on the beta-site of OpenAI's Codex code-completion
> tool, gives the following code (complete with the comments):
>
> ----
> # Read in the data
> data <- read.table("mydata.tsv", sep = "\t", header = TRUE)
>
> # Remove all rows that have duplicates in the column "names"
> unique_names <- unique(data$names)
> data_unique <- data[data$names %in% unique_names,]
>
> # Calculate medians
> Medians <- sapply(data_unique[, sapply(data_unique, is.numeric)], median,
>                   na.rm = TRUE)
>
> # Print last variable in order form largest to smallest
> rev(sort(Medians[length(Medians)]))
> ----
>
> Note that the strategy is quite different (e.g. using %in%, not
> duplicated()), and the interpretation of "last variable" is technically
> correct but not what I had in mind (ChatGPT got that right though).
>
>
> Changing my prompts slightly resulted in it going for a dplyr solution
> instead, complete with %>% idioms etc ... again, syntactically correct but
> not giving me the fully correct results.
>
> ------------------------------------------------------
>
> Bottom line: The AI's ability to translate natural language instructions into
> code is astounding. Errors the AI makes are subtle and probably not easy to
> fix if you don't already know what you are doing. But the way that this can
> be "confidently incorrect" and plausible makes it nearly impossible to detect
> unless you actually run the code (you may have noticed that when you read the
> code).
>
> Will our students use it? Absolutely.
>
> Will they successfully cheat with it? That depends on the assignment. We
> probably need to _encourage_ them to use it rather than sanction - but
> require them to attribute the AI, document prompts, and identify their own,
> additional contributions.
>
> Will it help them learn? When you are aware of the issues, it may be quite
> useful.
> It may be especially useful to teach them to specify their code carefully
> and completely, and to ask questions in the right way. Test cases are
> crucial.
>
> How will it affect what we do as instructors? I don't know. Really.
>
> And the future? I am not pleased to extrapolate to a job market in which they
> compete with knowledge workers who work 24/7 without benefits, vacation pay,
> or even a salary. They'll need to rethink the value of their investment in an
> academic education. We'll need to rethink what we do to provide value above
> and beyond what AIs can do. (Nb. all of the arguments I hear about why
> humans will always be better etc. are easily debunked, but that's even more
> OT :-)
>
> --------------------------------------------------------
>
> If you have thoughts to share on how your institution is thinking about
> academic integrity in this situation, or creative ideas on how to integrate
> this into teaching, I'd love to hear from you.
>
>
> All the best!
> Boris
>
>
> --
> Boris Steipe MD, PhD
> University of Toronto
>
> ______________________________________________
> R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.