On Tue, 23 Sep 2014 22:01:51 -0700 (PDT) Miki Tebeka <miki.teb...@gmail.com> wrote:
> On Tuesday, September 23, 2014 7:33:06 PM UTC+3, Rob Gaddi wrote:
>
> > While you're at it, think
> > long and hard about that definition of fuzziness. If you can make it
> > closer to the concept of histogram "bins" you'll get much better
> > performance.
> The problem for me here is that I can't determine the number of bins in
> advance. I'd like to get frequencies. I guess every "new" (don't have any
> previous equal item) can be a bin.
>
> > TL;DR you need to think very hard about your problem definition and
> > what you want to happen before you actually try to implement this.
> Always a good advice :) I'm actually implementing algorithm for someone else
> (in the bio world where I know very little about).

See, THERE's your problem. You've got a scientist trying to make
prescriptions for an engineering problem. He's given you a fuzzy
description of the sort of thing he's trying to do. Your job is to turn
that fuzzy description into a concrete, actual algorithm before you even
write a single line of code, which means understanding what the data is,
and what the desired result of that data is.

Because the thing you keep trying to do, with all of its order
dependencies fundamentally CANNOT be right, regardless of what the
squishy scientist tells you.

The "histogram" bin solution that everyone keeps trying to steer you
towards is almost certainly what you really want. Epsilon is your
resolution. You cannot resolve any information below your resolution
limit. Yes, 1.49 and 1.51 wind up in different bins, whereas 1.51 and
2.49 are in the same one, but that's what it means to have a resolution
of 1; you can't say anything about whether any given count in the "2,
plus or minus a bit" bin is very nearly 1 or very nearly 3.

This doesn't require you to know the number of bins in advance, you can
just create and fill them as needed.

That said, you're trying to solve a physical problem, and so it has
physical limits.
Your biologist should be able to give you an order of magnitude estimate
of how many "bins" you're expecting, and what the ultimate shape is
expected to look like. Normally distributed? Wildly bimodal? Is the
overall span of data going to span 10 epsilon or 10,000 epsilon?

If there are going to be a ton of bins, you may be better served by
putting 1/3 of a count into bins n-1, n, and n+1 rather than just in
bin n; it's the equivalent of squinting a bit when you look at the bins.

But you have to understand the problem to solve it.

-- 
Rob Gaddi, Highland Technology -- www.highlandtechnology.com
Email address domain is currently out of order. See above to fix.
-- 
https://mail.python.org/mailman/listinfo/python-list
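
For readers following along, here is a minimal sketch of the bin approach
described in the message above. It is an illustration only, assuming
Python 3 and assuming that "epsilon" is used as the bin width with each
value rounded to the nearest multiple of it. The names bin_index,
histogram and smoothed_histogram are hypothetical and do not come from
the thread. Bins live in a dict, so they are created as they are first
hit and their number need not be known in advance.

```python
import math
from collections import defaultdict


def bin_index(x, epsilon):
    """Map a value to the index of the nearest multiple of epsilon."""
    # floor(x / epsilon + 0.5) rounds half up, so bin n covers
    # [(n - 0.5) * epsilon, (n + 0.5) * epsilon).
    return math.floor(x / epsilon + 0.5)


def histogram(values, epsilon):
    """Count values per bin, creating bins only when first needed."""
    counts = defaultdict(int)
    for x in values:
        counts[bin_index(x, epsilon)] += 1
    return dict(counts)


def smoothed_histogram(values, epsilon):
    """Put 1/3 of each count into bins n-1, n and n+1 (assumed weights)."""
    counts = defaultdict(float)
    for x in values:
        n = bin_index(x, epsilon)
        for neighbor in (n - 1, n, n + 1):
            counts[neighbor] += 1.0 / 3.0
    return dict(counts)


data = [1.49, 1.51, 2.49, 7.2]
print(histogram(data, epsilon=1.0))  # {1: 1, 2: 2, 7: 1}
# The result does not depend on the order in which values arrive.
print(histogram(data[::-1], epsilon=1.0) == histogram(data, epsilon=1.0))  # True
```

With a resolution of 1, 1.49 falls in bin 1 while 1.51 and 2.49 share
bin 2, matching the example in the message. The smoothed variant keeps
the total count unchanged and only spreads it across neighbouring bins.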