Re: [R] Intended use-case for data.matrix

Philip Charles Wed, 04 Nov 2020 13:45:25 -0800

Hi Duncan,

Thanks; that's really useful info. Now that you point it out, I completely 
agree that the description of the frame argument does make my original use 
invalid - I will pay closer attention to such details in future.  Would you 
suggest that sapply(..., as.numeric) is the most 'R'-y way of converting a 
character data frame to a numeric matrix, or is there a cleaner pattern?
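
For concreteness, here is a minimal sketch of two base-R patterns that could do this conversion. The two-column data frame is a cut-down stand-in for quant.df.char from the original post, and the names m1 and m2 are illustrative only; neither pattern is claimed to be the recommended one.

```r
# Cut-down stand-in for the character quantitation data frame
quant.df.char <- data.frame(
  Intensity.SampleA = c('919948', '1346170', '15870'),
  Intensity.SampleB = c('1625540', '710272', '83624'),
  stringsAsFactors = FALSE)

# Pattern 1: sapply() simplifies equal-length numeric columns to a matrix
# (rows stay rows, so no transpose is needed here)
m1 <- sapply(quant.df.char, as.numeric)

# Pattern 2: coerce to a character matrix first, then change the storage mode;
# this keeps the matrix shape even when the data frame has a single row
m2 <- as.matrix(quant.df.char)
storage.mode(m2) <- "double"
```

One caveat with the first pattern: for a one-row data frame sapply() simplifies to a named vector rather than a matrix, which the second pattern avoids.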


Best wishes,

Phil

On 4 Nov 2020, at 20:37, Duncan Murdoch <murdoch.dun...@gmail.com> wrote:

You can see the change to the help page here:

https://github.com/wch/r-source/commit/d1d3863d72613660727379dd5dffacad32ac9c35#diff-9143902e81e6ad39faace2d926725c4c72b078dd13fbb1223c4a35f833b58ee6

Before the change, it said the input should be

a data frame whose components are logical vectors, factors or numeric vectors

which suggests your input was invalid. But later it says

Logical and factor columns are converted to integers. Any other
column which is not numeric (according to \code{\link{is.numeric}}) is
converted by \code{\link{as.numeric}} or, for S4 objects,
\code{\link{as}(, "numeric")}.

which suggests what you were doing was supported.
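
To make the difference between the two readings concrete, here is a small sketch using the first intensity column from the original post. It only illustrates the character-to-factor-to-integer route described in the R 4.0.0 NEWS entry next to a direct as.numeric() conversion; the variable names are illustrative.

```r
x <- c('919948', '1346170', '15870')

# Route taken for character columns from R 4.0.0 on: factor, then integer codes.
# Levels are sorted as strings ("1346170" < "15870" < "919948"),
# so the codes reflect string order, not the numeric values.
codes <- as.integer(factor(x))   # 3 1 2

# Direct numeric conversion, which is what the original workflow relied on
values <- as.numeric(x)          # 919948 1346170 15870
```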

It's unfortunate that you didn't know about this change, but it was made in 
August 2019, and appeared on the news feed here:

https://developer.r-project.org/blosxom.cgi/R-devel/NEWS/2019/08/08#n2019-08-08

so some of the blame for this goes to you for not paying attention and testing 
unreleased R versions.

To protect yourself against this kind of unpleasant surprise in the future, I'd 
suggest this:

- Follow the news feed.

- Put your code in a package, and test it against R-devel now and then. (If 
your package is on CRAN the testing will happen automatically; if it's not on 
CRAN and not in a package, you could still test against R-devel, but why make 
your life more difficult by *not* putting it in a package?)
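
As a rough sketch of the kind of check that could catch such a change when run against R-devel: the helper name char_df_to_matrix and the fixture below are hypothetical, not something from this thread, and in a package the same assertions would normally live in the package's tests.

```r
# Hypothetical helper wrapping the conversion the workflow depends on
char_df_to_matrix <- function(df) {
  m <- as.matrix(df)            # character matrix, shape preserved
  storage.mode(m) <- "double"   # parse the strings as numbers
  m
}

# Small fixture with known values (taken from the original post)
fixture <- data.frame(
  Intensity.SampleA = c('919948', '1346170', '15870'),
  stringsAsFactors = FALSE)

result <- char_df_to_matrix(fixture)

# Fails loudly if the conversion ever stops returning the actual values
stopifnot(
  is.matrix(result),
  is.double(result),
  isTRUE(all.equal(result[, 1], c(919948, 1346170, 15870))))
```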

Duncan Murdoch

On 04/11/2020 6:48 a.m., Philip Charles wrote:
> Hi R gurus,
>
> We do a lot of work with biological -omics datasets (genomics, proteomics 
> etc). The text file inputs to R typically contain a mixture of (mostly) 
> character data and numeric data. The number of columns (both character and 
> numeric data) in the file varies with the number of samples measured (which 
> makes use of colClasses impractical), so a typical approach might be
>
> 1) read in the whole file as character matrix
>
> # simulated result of read.table (with stringsAsFactors=FALSE)
> raw <- data.frame(
>   Accession = c('P04637', 'P01375', 'P00761'),
>   Description = c('Cellular tumor antigen p53', 'Tumor necrosis factor', 'Trypsin'),
>   Species = c('Homo sapiens', 'Homo sapiens', 'Sus scrofa'),
>   Intensity.SampleA = c('919948', '1346170', '15870'),
>   Intensity.SampleB = c('1625540', '710272', '83624'),
>   Intensity.SampleC = c('1232780', '1481040', '62548'))
>
> 2) use grep to identify numeric columns based on column names and split the 
> raw matrix
>
> QUANT_COLS <- grepl('^Intensity\\.',colnames(raw))
> META_COLS <- !QUANT_COLS
> quant.df.char <- raw[,QUANT_COLS]
> meta.df <- raw[, META_COLS]
>
> 3) convert the quantitation data frame to a numeric matrix
>
> Prior to R version 4, my standard method for step 3 was data.matrix(). 
> After recently updating from v3.6.3, I've found that all my workflows using 
> this function were giving wildly incorrect results. I figured out that 
> data.matrix now yields a matrix of integer factor codes rather than the 
> actual numeric values:
>
> > quant.df.char
>   Intensity.SampleA Intensity.SampleB Intensity.SampleC
> 1            919948           1625540           1232780
> 2           1346170            710272           1481040
> 3             15870             83624             62548
>
> > data.matrix(quant.df.char)
>      Intensity.SampleA Intensity.SampleB Intensity.SampleC
> [1,]                 3                 1                 1
> [2,]                 1                 2                 2
> [3,]                 2                 3                 3
>
> The change in behaviour of this function is documented in the R v4.0.0 
> changelog, so it is clearly intentional:
>
> "data.matrix() now converts character columns to factors and from this to 
> integers."
>
> Now, I know there are other ways to achieve the same conversion, e.g. 
> sapply(quant.df.char, as.numeric). They aren't quite as straightforward to 
> read in the code as data.matrix (with sapply/lapply in particular I have to 
> think through whether there will be a need to transpose the result!), but the 
> fact that this base function has been changed (without a way to replicate the 
> previous behaviour) leads me to suspect that I have probably not previously 
> been using data.matrix in the intended manner - and I may therefore be making 
> similar mistakes elsewhere! I've certainly distributed/handed out R scripting 
> examples in the past that will now give incorrect results when run on v4+ R.
>
> What is even more confusing to me (but possibly related as regards an answer) 
> is that R v4 broke with long-standing convention to change 
> default.stringsAsFactors() to FALSE. So on one hand the update took away what 
> was (at least, from our perspective, with our data - I am sure some here may 
> disagree!) a perennial source of confusion/bugs for R learners, by not 
> introducing string factorisation during data import, and then on the other 
> hand changed a base function to explicitly introduce string factorisation. I 
> can't see when converting a character dataset straight to integer factor 
> codes, rather than to factors, might be that useful (but of course that 
> doesn't mean it isn't!).
>
> I've had a look through r-help and r-devel archives and couldn't spot any 
> discussion of this, so apologies if this has been asked before. I'm also 
> pretty sure my misunderstanding is with the intended use-case of data.matrix 
> and R ethos around strings/factors rather than the rationale for the change, 
> which is why I'm asking here.
>
> Best wishes,
>
> Phil
>
> Philip Charles
> Target Discovery Institute, Nuffield Department Of Medicine
> University of Oxford
>
>
>
>
> [[alternative HTML version deleted]]
>
> ______________________________________________
> R-help@r-project.org mailing list -- To 
> UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide 
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>




