Hi Chris, Responses are inline
--- Jeremiah Peschka - Founder, Brent Ozar Unlimited MCITP: SQL Server 2008, MVP Cloudera Certified Developer for Apache Hadoop On Sun, Apr 14, 2013 at 4:47 PM, Chris Corbyn <ch...@w3style.co.uk> wrote: > All, > > Just copying this from my stackoverflow post, as the riak tag doesn't get > much love over there :) It's fine to just outright say that Riak is never > going to work efficiently in this case because I'm inherently depending on > MapReduce. > MapReduce works well when: a) you know the set of objects you're going to MapReduce over b) you need to return the entire object c) when you plan to manipulate a lot of data inside Riak - read all sales records and write them out as shipping invoices > > Everywhere I read, people say you shouldn't use Riak's MapReduce over an > entire bucket and that there are other ways of achieving your goals. I'm > not sure how, though. I'm also not clear on why using an entire bucket is > slow, if you only have one bucket in the entire system, so either way, you > need to go over all the entries. Passing in a list of key-bucket pairs has > the same effect. Maybe the rule should be "don't use MapReduce with more > than a handful of keys". Which makes me wonder (apart from link traversal) > what use it really has in the real world. > >From my limited understanding of things, MR across an entire bucket is slow because you need to scan the entire key space. Much like a SELECT with no WHERE clause, a MapReduce without a key list is going to sweep across the entire bucket pulling all data off of disk in the process. Someone better versed in Riak internals would have to chime in on why this is so slow. I do know, however, that having on master bucket o' things may not be the best idea. If you're storing sales data, a better approach may be to segment buckets out by primary querying criteria. > > I have a list of 500K+ documents that represent sales data. I need to view > this data in different ways: for example, how much revenue was made in each > month the business was operating? How much revenue did each product raise? > How many of each product were sold in a given month? I always thought > MapReduce was supposed to be good at solving these types of aggregate > problems. That sounds like a myth now though, unless we're just looking at > Hadoop. > MapReduce can aggregate well. RDBMSes can aggregate well, too. But there are practical limits to each paradigm that you have to be aware of. From watching the list and working with people in the real world there are some paradigms that you want to use in Riak that don't immediately seem obvious coming from the RDBMS world. - Know your queries. Examine your application and make a list of common queries that you're going to be running. You'll want to be familiar with this list because you're going to try to write your data in a way that answers these questions. - Know your query pattern. Understand how these queries are working with the data itself. If many queries retrieve data aggregated for a day, week, month, store, sales region, or other logical division then you'll want to make sure that you can answer them quickly. (See Choosing the Right Tool [3]) - Pre-aggregate your data. If you know you're going to have to perform averages, you may want to look interesting ways to write your data (see Rolling with Eventual Consistency [1] and Little-Known Awesome Algorithms: Fenwick Trees – Rapidly Find Cumulative Frequency Sums [2]). MR can be slow since all records need to be streamed through the originating node. Using pre-reduce[4] can ease the load [1]: http://basho.com/rolling-with-eventual-consistency-2/ [2]: http://www.swageroo.com/wordpress/little-known-awesome-algorithms-fenwick-range-trees-rapidly-find-cumulative-frequency-sums/ [3]: http://basho.com/choosing-the-right-tool/ [4]: http://docs.basho.com/riak/1.1.4/references/appendices/MapReduce-Implementation/ > > My documents are all in a bucket named 'sales' and they are records with > the following fields (as native Erlang records, not JSON): > > {"id":1, "product_key": "cyber-pet-toy", "price": "10.00", "tax": > "1.00", "created_at": 1365931758}. > > Let's take the example where I need to report the total revenue for each > product in each month over the past 4 years (that's basically the entire > bucket, but that's just the requirement), how does one use Riak's MapReduce > to do that efficiently? Even just trying to use an identity map operation > on the data I get a timeout after ~30 seconds, which MySQL handles in > milliseconds. > > I'm doing this in Erlang (using the protocol buffers client), but any > language is fine for an explanation. > > The equivalent SQL (MySQL) would be: > > SELECT SUM(price) AS revenue, > FROM_UNIXTIME(created_at, '%Y-%m') AS month, > product_key > FROM sales > GROUP BY month, product_key > ORDER BY month ASC; > > Even with secondary indexes, there's still a MapReduce involved in this > query, once you have the list of keys to process. > > (Ordering not important right now). > > Cheers, > > Chris > > _______________________________________________ > riak-users mailing list > riak-users@lists.basho.com > http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com > >
_______________________________________________ riak-users mailing list riak-users@lists.basho.com http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com