OK Jim, I will put very simple messages in (one liners) that will simply state whether the relationship between keys and the requested columns was 1:1, 1:many, many:1, or many:many. Hopefully this will represent an acceptable compromise.

Marc

On 06/05/2015 08:37 AM, James W. MacDonald wrote:
> I agree that a warning is probably not the way to go, as it does imply
> that there might have been something wrong with either the input or
> output. Plus, not everybody understands the distinction between error
> and warning.
>
> And having additional documentation can't possibly hurt. But that
> assumes that most/some/all of the end users both peruse and understand
> the documentation, which we all know is not the case. The main issue,
> for me at least, is that a significant proportion of people seem to
> think there is some sort of uniqueness imposed on things like Entrez
> Gene IDs and Hugo symbols, etc. While that is the ultimate goal, we do
> not have and maybe never will achieve unique IDs for each annotatable
> object.
>
> I used to work for a PI who was a very smart, well informed
> statistical geneticist who was absolutely shocked when I informed her
> that a) there are SNPs in dbSNP that have more than one RS ID, and
> that b) there are RS IDs in dbSNP that have been assigned to multiple
> SNPs. She just assumed that there was a one-to-one RS ID -> SNP mapping.
>
> So this is to me the crux of the problem. It is perfectly valid to
> return one-to-many mappings, and that is what should be expected by
> those of us who already understand such things. But for those of us
> who are ignorant of the details, and those who assume uniqueness of
> IDs, it would be really nice if they got a message telling them
> something like
>
>     Please note that there are one-to-many mappings between the input and
>     output IDs, so the output is longer than your input vector. Please see
>     ?select for more detail.
>
> And if the message is objectionable to some, you could give the option
> for people to set a global flag to shut it off.
> Something like
>
>     if(!pleaseMakeItStop)
>         message(<message goes here>)
>
> and they could set
>
>     pleaseMakeItStop = TRUE in their .Rprofile
>
> Is that a reasonable compromise?
>
> Jim
>
> On Thu, Jun 4, 2015 at 6:06 PM, Marc Carlson <mcarl...@fredhutch.org> wrote:
>
>     Hi Jim,
>
>     I do agree that the warning was protective for that (this is why I
>     put it there).
>
>     But it was also annoying for many and a source of some confusion
>     because when people see a warning() they think that something has
>     gone wrong with the code that was just run. And in this case the
>     select method was actually doing exactly what it was supposed to
>     be doing. What it was actually warning you about was what you did
>     separately in that assignment to fit2... Which is the step right
>     after the select method already did its work. And I can
>     understand why that seems a little bit confusing since you are
>     basically telling someone to be careful with the data you just
>     gave them.
>
>     Now I could replace it with a message() I guess, but in cases like
>     this where the warning is about something that happens outside of
>     the function you are calling, shouldn't that probably be handled
>     by documentation? Or at least, that is the argument that finally
>     persuaded me to remove it. That and the fact that almost every
>     call to select() ended up accompanied by the warning you
>     mentioned, because it turns out that perfect 1:1 relationships are
>     pretty rare for annotation data. Very often, you are going to get
>     back multiple results.
>
>     But I didn't just remove the warning, I also supplied an
>     alternative for people who have a real need for consistent 1:1
>     mapping.
>
>     The mapIds() method takes most of the same arguments as select,
>     except that unlike select(), it only looks up one column and it
>     always returns a vector that is the same size as the vector that
>     came in.
>
>     So for your example, you could do something like this pseudocode here:
>
>         mapIds(<chippackage>, featureNames(eset), column="ENTREZID",
>                keytype="PROBEID")
>
>     And mapIds will follow a rule specified by the default value for
>     the multiVals argument so that you can get back your results in a
>     1:1 manner. And if you don't like any of the options available
>     for the multiVals argument, you can make your own function and
>     pass it in.
>
>     Anyhow please continue to let us know what you think?
>
>     Marc
>
>     On 06/04/2015 10:50 AM, James W. MacDonald wrote:
>
>         In the last release, the warning message from select() telling
>         people that their results include one-to-many mappings was
>         removed. While some may find this warning annoying, I think
>         silently returning something unexpected to our users is
>         dangerous.
>
>         In other words, for me it is a common practice to do something
>         like this:
>
>             fit <- lmFit(eset, design)
>             fit2 <- eBayes(fit)
>             gns <- select(<chippackage>, featureNames(eset),
>                           c("ENTREZID","SYMBOL"))
>             gns <- gns[!duplicated(gns[,1]),]
>             fit2$genes <- gns
>
>         I add in the step where dups are removed because I already
>         know they are there. But a naive user might instead do
>
>             fit2$genes <- select(<chippackage>, featureNames(eset),
>                                  c("ENTREZID","SYMBOL"))
>
>         Which will work just fine, but then all the annotation (except
>         for the first few lines) will now be completely incorrect, and
>         there wasn't a warning to let the end user know that they may
>         have made a mistake.
>
>         lmFit() will parse the featureData slot of an ExpressionSet
>         and use those data for annotation, so that gives some
>         hypothetical protections, for those who first put their
>         annotation data into their ExpressionSet. However, ?eSet says:
>
>             ‘featureData’: Contains variables describing features
>             (i.e., rows in ‘assayData’) unique to this experiment.
>             Use the ‘annotation’ slot to efficiently reference
>             feature data common to the annotation package used in
>             the experiment.
>             Class: ‘AnnotatedDataFrame-class’
>
>         Which to me indicates that the featureData slot isn't really
>         intended to contain annotation data, but instead some unique
>         information that pertains to a given experiment. But maybe I
>         misunderstand.
>
>         Is the featureData slot actually intended for annotation data?
>         If not, what is the intended pipeline for annotating data in an
>         ExpressionSet? Am I alone in being concerned about this?
>
>         Best,
>
>         Jim
>
>     _______________________________________________
>     Bioc-devel@r-project.org mailing list
>     https://stat.ethz.ch/mailman/listinfo/bioc-devel
>
> --
> James W. MacDonald, M.S.
> Biostatistician
> University of Washington
> Environmental and Occupational Health Sciences
> 4225 Roosevelt Way NE, # 100
> Seattle WA 98105-6099
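
The one-to-many behaviour discussed in this thread can be illustrated without any annotation package. What follows is only a base-R sketch under stated assumptions: the lookup table, the classify_mapping() and report_mapping() helpers, the message wording, and the annot.quiet option are all hypothetical names made up for illustration, and none of them is part of AnnotationDbi.

```r
# Hypothetical lookup table standing in for an annotation package;
# the probe and gene IDs are made up for illustration.
annot <- data.frame(
  PROBEID  = c("p1", "p2", "p2", "p3"),
  ENTREZID = c("10", "20", "21", "30"),
  stringsAsFactors = FALSE
)

# Classify the relationship between a key column and a value column.
classify_mapping <- function(keys, values) {
  pairs <- unique(data.frame(k = keys, v = values, stringsAsFactors = FALSE))
  key_to_many   <- any(duplicated(pairs$k))  # some key has more than one value
  value_to_many <- any(duplicated(pairs$v))  # some value has more than one key
  if (key_to_many && value_to_many) {
    "many:many"
  } else if (key_to_many) {
    "1:many"
  } else if (value_to_many) {
    "many:1"
  } else {
    "1:1"
  }
}

# Emit a one-line message unless the (hypothetical) option annot.quiet is TRUE.
report_mapping <- function(keys, values) {
  rel <- classify_mapping(keys, values)
  if (!isTRUE(getOption("annot.quiet", FALSE))) {
    message("lookup returned ", rel, " mapping between keys and columns")
  }
  invisible(rel)
}

rel <- report_mapping(annot$PROBEID, annot$ENTREZID)

# Keep only the first match per key, so the result has the same length
# and order as the input keys.
keys  <- c("p1", "p2", "p3")
first <- annot$ENTREZID[match(keys, annot$PROBEID)]
```

Setting options(annot.quiet = TRUE) in .Rprofile would silence the message in this sketch, along the lines of the global flag suggested above. The match() step keeps the first hit per key, which is the same effect as the !duplicated() step in the original example.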