Re: [R] When is interactive data visualization useful to use?

Claudia Beleites Fri, 11 Feb 2011 11:23:57 -0800

Dear Tal, dear list,

I think the importance of interactive graphics has a lot to do with how visual your scientific discipline is. I'm a spectroscopist, and I think we are very visually oriented: when I think of a spectrum, I mentally see a graph.

So for that kind of work, I need a lot of interaction (of the type: plot, change a bit, plot again).

One example is the removal of spikes from Raman spectra (caused e.g. by cosmic rays hitting the detector). It is fairly easy to compute a list of suspicious signals. It is already much more complicated to find the actual beginning and end of the spike. And it is really difficult to avoid false positives with an automatic procedure, because the spectra can look very different for different samples. It would just take me far longer to find a computational description of what a spike is than to interactively accept or reject the automatically marked suspects. Even though it feels like slave work ;-)
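
To make the "easy" first step concrete, here is a minimal sketch of how a list of suspicious signals could be computed. It is not the procedure I actually use: the function name, the running-median window, and the threshold are assumptions chosen for illustration, and it only looks for upward deviations. Finding the true start and end of each spike, and deciding which candidates are real, is still left to the eye.

```python
from statistics import median


def flag_spike_candidates(intensities, half_window=3, threshold=6.0):
    """Return indices whose intensity lies far above a running median.

    Hypothetical helper: half_window and threshold are illustrative
    defaults, not recommended values.
    """
    n = len(intensities)

    # Running median as a crude, spike-resistant estimate of the local signal.
    local = []
    for i in range(n):
        lo = max(0, i - half_window)
        hi = min(n, i + half_window + 1)
        local.append(median(intensities[lo:hi]))

    residuals = [y - m for y, m in zip(intensities, local)]

    # Median absolute deviation (MAD) as a robust noise scale;
    # 1.4826 makes it comparable to a standard deviation for Gaussian noise.
    center = median(residuals)
    mad = median([abs(r - center) for r in residuals])
    scale = 1.4826 * mad if mad > 0 else 1e-12

    # Only upward deviations are flagged (assumption: spikes add intensity).
    return [i for i, r in enumerate(residuals)
            if (r - center) / scale > threshold]


# Toy spectrum with one obvious spike; real spectra are far less clear-cut.
toy = [10, 11, 10, 12, 11, 10, 11, 250, 12, 10, 11, 12, 10, 11, 10]
candidates = flag_spike_candidates(toy)
```

The returned indices are only candidates: each one would then be plotted and accepted or rejected interactively.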

Roughly the same applies to the choice of pre-processing like baseline correction. A number of different physical causes can produce different kinds of baselines, and usually you don't know which process contributes to what extent. In practice, experience suggests a method, I apply it and look whether the result looks as expected. I'm not aware of any performance measure that would indicate success here.

The next point where interaction is needed pops up because my data has e.g. spatial and spectral dimensions. So do the models, usually: e.g. in a PCA, the loadings would usually capture the spectroscopic direction, whereas the scores belong to the spatial domain. So I have "connected" graphs: the spatial distribution (intensity map, score map, etc.), and the spectra (or loadings).

As soon as I have such connections I wish for interactive visualization:

I go back and forth between the plots: what is the spectrum that belongs to this region of the map? Where on the sample are high intensities of this band? What is the substance behind that: if it is x, the intensities at that other spectral band should correlate. And then I want to compare this to the scatterplot (pairs plot of the PCA scores) or to a dendrogram of HCA...

Also, exploration is not just a prerequisite for models; frequently it already is the proper scientific work (particularly in basic science). The more so if you include exploring the models: Now, which of the bands are actually used by my predictive models? Which samples get their predictions because of which spectral feature?

And the "statistical outliers" may very well be just the interesting part of the sample. And the outlier statistics cannot interpret the data in terms of interesting vs. crap.

For presentation* of results, I personally think that most of the time a careful selection of static graphs is much better than live interaction.

* The thing where you talk to an audience far away from your work computer, as opposed to sitting down with your client/colleague and analysing the data together.

> It could be argued that the interactive part is good for exploring (for
> example) a different behavior of different groups/clusters in the data. But
> when (in practice) I approached such situation, what I tended to do was to
> run the relevant statistical procedures (and post-hoc tests)

As long as the relevant measure exists, sure.

Yet as a non-statistician, my work is focused on the physical/chemical interpretation. Summary statistics are one set of tools for me, and interactive visualisation is another set of tools (overlapping though).

I may want to subtract the influence of the overall unchanging sample matrix (that would be the minimal intensity for each wavelength). But the minimum spectrum is too noisy. So I use a quantile. Which one? Depends on the data. I'll have a look at a series (say, the 2nd to 10th percentile) and decide, trading off noise against whether any new signals appear. I honestly think there's nothing gained if I sit down and try to write a function scoring the similarity to the minimum spectrum and the noise level: the more so as it just shifts the need for a decision (how much noise outweighs what intensity of real signal being subtracted?). It is a decision I need to take. By number or by eye. And after all, my professional training was meant to enable me to take this decision, and I'm paid (also) for being able to take it efficiently (i.e. making a reasonably good choice within not too long a time).
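
As an illustration of the percentile series, here is a minimal sketch. It assumes the spectra are stored as rows of equal length (one intensity per wavelength channel); the function name and the toy numbers are made up for the example. It only computes the candidate spectra; choosing among them is still done by looking at the plots.

```python
from statistics import quantiles


def percentile_spectrum(spectra, percent):
    """Per-wavelength percentile across spectra.

    Hypothetical helper: spectra is a list of equal-length rows,
    percent an integer from 1 to 99.
    """
    if not 1 <= percent <= 99:
        raise ValueError("percent must be an integer from 1 to 99")
    result = []
    # zip(*spectra) iterates over wavelength channels (columns).
    for channel in zip(*spectra):
        # 99 cut points; "inclusive" interpolates between the observed
        # minimum and maximum of the channel.
        cuts = quantiles(channel, n=100, method="inclusive")
        result.append(cuts[percent - 1])
    return result


# Toy data: 5 spectra x 3 wavelength channels (made-up numbers).
toy_spectra = [
    [0.0, 10.0, 5.0],
    [1.0, 20.0, 5.5],
    [2.0, 30.0, 6.0],
    [3.0, 40.0, 6.5],
    [4.0, 50.0, 7.0],
]

# Minimum spectrum and the series of 2nd to 10th percentile spectra.
minimum_spectrum = [min(channel) for channel in zip(*toy_spectra)]
series = {p: percentile_spectrum(toy_spectra, p) for p in range(2, 11)}
```

Each spectrum in `series` would then be plotted next to the minimum spectrum to judge the noise and to check whether new signals appear.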

After all, it may also have to do with a complaint a colleague from a computational data analysis group once had. He said the bad thing with us spectroscopists is that our problems are either so easy that there's no fun in solving them, or they are too hard to solve.

> - and what I
> found to be significant I would then plot with colors clearly dividing the
> data to the relevant groups. From what I've seen, this is a safer approach
> than "wandering around" the data (which could easily lead to data dredging
> (where the scope of the multiple comparison needed for correction is not even
> clear)).

Sure, yet:

- Isn't that what validation was invented for (I mean with a proper, new, [double] blind test set after you decided your parameters)?
- Summarizing a whole data set into a few numbers without having looked at the data itself may not be safe, either.
- The few comparisons shouldn't come at the cost of risking a bad modelling strategy and fitting parameters because the data was not properly examined.


My 2 ct,

Claudia (who in practice warns far more frequently of multiple comparisons and validation sets being compromised (not independent) than of too little data exploration ;-) )


--
Claudia Beleites
Dipartimento dei Materiali e delle Risorse Naturali
Università degli Studi di Trieste
Via Alfonso Valerio 6/a
I-34127 Trieste

phone: +39 0 40 5 58-37 68
email: cbelei...@units.it

