On Mon, Jan 13, 2014 at 6:23 AM, Larry Martell <larry.mart...@gmail.com> wrote: > I have an python app that queries a MySQL DB. The query has this form: > > SELECT a, b, c, d, AVG(e), STD(e), CONCAT(x, ',', y) as f > FROM t > GROUP BY a, b, c, d, f > > x and y are numbers (378.18, 2213.797 or 378.218, 2213.949 or > 10053.490, 2542.094). > > The business issue is that if either x or y in 2 rows that are in the > same a, b, c, d group are within 1 of each other then they should be > grouped together. And to make it more complicated, the tolerance is > applied as a rolling continuum. For example, if the x and y in a set > of grouped rows are: > > row 1: 1.5, 9.5 > row 2: 2.4, 20.8 > row 3: 3.3, 40.6 > row 4: 4.2, 2.5 > row 5: 5.1, 10.1 > row 6: 6.0, 7.9 > row 7: 8.0, 21.0 > row 8: 100, 200 > > 1 through 6 get combined because all their X values are within the > tolerance of some other X in the set that's been combined. 7's Y value > is within the tolerance of 2's Y, so that should be combined as well. > 8 is not combined because neither the X or Y value is within the > tolerance of any X or Y in the set that was combined.
Trying to get my head around this a bit more. Are columns a/b/c/d treated as a big category (eg type, brand, category, model), such that nothing will ever be grouped that has any difference in those four columns? If so, we can effectively ignore them and pretend we have a table with exactly one set (eg stick a WHERE clause onto the query that stipulates their values). Then what you have is this: * Aggregate based on proximity of x and y * Emit results derived from e Is that correct? So here's my way of writing it. * Subselect: List all values for x, in order, and figure out which ones are less than the previous value plus one * Subselect: Ditto, for y. * Outer select: Somehow do an either-or group. I'm not quite sure how to do that part, actually! A PGSQL window function would cover the two subselects - at least, I'm fairly sure it would. I can't quite get the whole thing, though; I can get a true/false flag that says whether it's near to the previous one (that's easy), and creating a grouping column value should be possible from that but I'm not sure how. But an either-or grouping is a bit trickier. The best I can think of is to collect all the y values for each group of x values, and then if any two groups 'overlap' (ie have points within 1.0 of each other), merge the groups. That's going to be seriously tricky to do in SQL, I think, so you may have to go back to Python on that one. My analysis suggests that, whatever happens, you're going to need every single y value somewhere. So it's probably not worth trying to do any grouping/aggregation in SQL, since you need to further analyze all the individual data points. I can't think of any way better than just leafing through the whole table (either in Python or in a stored procedure - if you can run your script on the same computer that's running the database, I'd do that, otherwise consider a stored procedure to reduce network transfers) and building up mappings. Of course, "I can't think of a way" does not equate to "There is no way". There may be some magic trick that I didn't think of, or some arcane incantation that gets what you want. Who knows? If you can produce an ASCII art Mandelbrot set [1] in pure SQL, why not this! ChrisA [1] http://wiki.postgresql.org/wiki/Mandelbrot_set -- https://mail.python.org/mailman/listinfo/python-list