On 7 January 2013 17:58, Steven D'Aprano
<steve+comp.lang.pyt...@pearwood.info> wrote:
> On Mon, 07 Jan 2013 15:20:57 +0000, Oscar Benjamin wrote:
>
>> There are sometimes good reasons to get a line of best fit by eye. In
>> particular if your data contains clusters that are hard to separate,
>> sometimes it's useful to just pick out roughly where you think a line
>> through a subset of the data is.
>
> Cherry picking subsets of your data as well as line fitting by eye? Two
> wrongs do not make a right.

It depends on what you're doing, though. I wouldn't use an eyeball fit
to get numbers that were an important part of the conclusion of some
study or other. I would very often use it while I'm just in the process
of trying to understand something.

> If you're going to just invent a line based on where you think it should
> be, what do you need the data for? Just declare "this is the line I wish
> to believe in" and save yourself the time and energy of collecting the
> data in the first place. Your conclusion will be no less valid.

An example: Earlier today I was looking at some experimental data. A
simple model of the process underlying the experiment suggests that two
variables x and y will vary in direct proportion to one another, and the
data broadly reflects this. However, at this stage there is some
non-normal variability in the data, caused by experimental difficulties.
A subset of the data appears to closely follow a well-defined linear
pattern, but there are outliers and the pattern breaks down in an
asymmetric way at larger x and y values.

At some later time either the sources of experimental variation will be
reduced or they will be better understood, but for now it is still
useful to estimate the constant of proportionality in order to check
whether it seems consistent with the observed values of z. With this
particular dataset I would have wasted a lot of time if I had tried to
find a computational method to match the line that to me was very
visible, so I chose the line visually.

>
> How do you distinguish between "data contains clusters that are hard to
> separate" from "data doesn't fit a line at all"?
>

In the example I gave it isn't possible to make that distinction with
the currently available data. That doesn't make it meaningless to try to
estimate the parameters of the relationship between the variables using
the preliminary data.

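As a rough illustration of why a plain fit can miss a line that is
obvious by eye, here is a minimal sketch. It is not the method or the
data from the experiment described above (that line was chosen
visually); the numbers are made up, and the two helper functions are
hypothetical names introduced only for this example. It compares an
ordinary least-squares slope through the origin with the median of the
y/x ratios when the pattern breaks down on one side at large x.

```python
import statistics


def slope_least_squares_through_origin(xs, ys):
    """Least-squares slope k for the model y = k*x (line through the origin)."""
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)


def slope_median_of_ratios(xs, ys):
    """Median of y/x, a crude outlier-resistant estimate of k (skips x == 0)."""
    return statistics.median(y / x for x, y in zip(xs, ys) if x != 0)


# Made-up illustration data: y = 2*x holds exactly for the first eight
# points, then the pattern breaks down in one direction only.
xs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
ys = [2, 4, 6, 8, 10, 12, 14, 16, 9, 8]

k_ls = slope_least_squares_through_origin(xs, ys)
k_med = slope_median_of_ratios(xs, ys)

print(f"least squares through origin: k = {k_ls:.3f}")
print(f"median of ratios:             k = {k_med:.3f}")
```

With these made-up points the two low values at large x drag the
least-squares slope well below 2, while the median of the ratios stays
at 2. That is only one of many possible robust estimators, and whether
it suits a given dataset is a separate judgement.
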
> Even if the data actually is linear, on what basis could we distinguish
> between the line you fit by eye (say) y = 2.5x + 3.7, and the line I fit
> by eye (say) y = 3.1x + 4.1? The line you assert on the basis of purely
> subjective judgement can be equally denied on the basis of subjective
> judgement.

It gets a bit easier if the line is constrained to go through the
origin. You seem to be thinking that the important thing is proving that
the line is "real", rather than identifying where it is. Both things are
important but not necessarily in the same problem. In my example, the
"real line" may not be straight and may not go through the origin, but
it is definitely there and if there were no experimental problems then
the data would all be very close to it.

> Anyone can fool themselves into placing a line through a subset of
> non-linear data. Or, sadly more often, *deliberately* cherry picking fake
> clusters in order to fool others. Here is a real world example of what
> happens when people pick out the data clusters that they like based on
> visual inspection:
>
> http://www.skepticalscience.com/images/TempEscalator.gif
>
> And not linear by any means, but related to the cherry picking theme:
>
> http://www.skepticalscience.com/pics/1_ArcticEscalator2012.gif
>
>
> To put it another way, when we fit patterns to data by eye, we can easily
> fool ourselves into seeing patterns that aren't there, or missing the
> patterns which are there. At best line fitting by eye is prone to honest
> errors; at worst, it is open to the most deliberate abuse. We have eyes
> and brains that evolved to spot the ripe fruit in trees, not to spot
> linear trends in noisy data, and fitting by eye is not safe or
> appropriate.

This is all true. But the human brain is also in many ways much better
than a typical computer program at recognising patterns in data when the
data can be depicted visually.

I would very rarely attempt to analyse data without representing it in
some visual form. I also think it would be highly foolish to go so far
in refusing to eyeball data that you would accept the output of some
regression algorithm even when it clearly looks wrong.


Oscar
--
http://mail.python.org/mailman/listinfo/python-list