Josh, is there a specific use pattern you think is served well by Phoenix + Spark? Just curious.
On Fri, Apr 25, 2014 at 3:17 PM, Josh Mahonin <jmaho...@filetrek.com> wrote:

> Phoenix generally presents itself as an endpoint using JDBC, which in my
> testing seems to play nicely with JdbcRDD.
>
> However, a few days ago a patch landed in Phoenix that implements Pig
> support via a custom Hadoop InputFormat, which means it now has Spark
> support too.
>
> Here's a code snippet that sets up an RDD for a specific query:
>
> --
> val phoenixConf = new PhoenixPigConfiguration(new Configuration())
> phoenixConf.setSelectStatement("SELECT EVENTTYPE,EVENTTIME FROM EVENTS
> WHERE EVENTTYPE = 'some_type'")
> phoenixConf.setSelectColumns("EVENTTYPE,EVENTTIME")
> phoenixConf.configure("servername", "EVENTS", 100L)
>
> val phoenixRDD = sc.newAPIHadoopRDD(
>   phoenixConf.getConfiguration(),
>   classOf[PhoenixInputFormat],
>   classOf[NullWritable],
>   classOf[PhoenixRecord])
> --
>
> I'm still very new to Spark and even less experienced with Phoenix, but
> I'm hoping there's an advantage over JdbcRDD in terms of partitioning.
> JdbcRDD seems to implement partitioning based on a user-defined query
> predicate, but I think Phoenix's InputFormat is able to figure out the
> splits itself, which Spark can then leverage. I don't really know how to
> verify whether this is the case, though, so if anyone else is looking
> into this, I'd love to hear their thoughts.
>
> Josh
>
>
> On Tue, Apr 8, 2014 at 1:00 PM, Nicholas Chammas <
> nicholas.cham...@gmail.com> wrote:
>
>> Just took a quick look at the overview here
>> <http://phoenix.incubator.apache.org/> and the quick start guide here
>> <http://phoenix.incubator.apache.org/Phoenix-in-15-minutes-or-less.html>.
>>
>> It looks like Apache Phoenix aims to provide flexible SQL access to
>> data, both for transactional and analytic purposes, and at interactive
>> speeds.
>>
>> Nick
>>
>>
>> On Tue, Apr 8, 2014 at 12:38 PM, Bin Wang <binwang...@gmail.com> wrote:
>>
>>> First, I have not tried it myself. However, from what I have heard it
>>> has some basic SQL features, so you can query your HBase table much
>>> like you query content on HDFS using Hive.
>>> So it is not just "query a simple column"; I believe you can do joins
>>> and other SQL queries. Maybe you can spin up an EMR cluster with HBase
>>> preconfigured and give it a try.
>>>
>>> Sorry I cannot provide a more detailed explanation or help.
>>>
>>>
>>> On Tue, Apr 8, 2014 at 10:17 AM, Flavio Pompermaier <
>>> pomperma...@okkam.it> wrote:
>>>
>>>> Thanks for the quick reply, Bin. Phoenix is something I'm going to
>>>> try for sure, but it seems somewhat redundant if I can use Spark.
>>>> Probably, as you said, Phoenix uses memory more effectively since it
>>>> keeps a dedicated data structure within each HBase table, but if I
>>>> need to deserialize data stored in an HBase cell I still have to read
>>>> that object into memory, and thus I need Spark. From what I
>>>> understood, Phoenix is good if I have to query a simple column of
>>>> HBase, but things get really complicated if I have to add an index
>>>> for each column in my table and I store complex objects within the
>>>> cells. Is that correct?
>>>>
>>>> Best,
>>>> Flavio
>>>>
>>>>
>>>> On Tue, Apr 8, 2014 at 6:05 PM, Bin Wang <binwang...@gmail.com> wrote:
>>>>
>>>>> Hi Flavio,
>>>>>
>>>>> I am currently attending the 2014 Apache Conf, where I heard about a
>>>>> project called "Apache Phoenix", which fully leverages HBase and is
>>>>> supposed to be 1000x faster than Hive. And it is not memory-bound,
>>>>> which is a limitation for Spark. It is still in the incubator, and
>>>>> the "stats" functions Spark has already implemented are still on
>>>>> Phoenix's roadmap. I am not sure whether it will be good, but it
>>>>> might be something interesting to check out.
>>>>>
>>>>> /usr/bin
>>>>>
>>>>>
>>>>> On Tue, Apr 8, 2014 at 9:57 AM, Flavio Pompermaier <
>>>>> pomperma...@okkam.it> wrote:
>>>>>
>>>>>> Hi everybody,
>>>>>>
>>>>>> These days I have been looking a bit at the recent evolution of the
>>>>>> big data stacks, and it seems that HBase is somehow fading away in
>>>>>> favour of Spark+HDFS. Am I correct?
>>>>>> Do you think that Spark and HBase should work together or not?
>>>>>>
>>>>>> Best regards,
>>>>>> Flavio
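
For comparison with the PhoenixInputFormat snippet above, here is a minimal
sketch of the JdbcRDD approach Josh mentions. The Phoenix JDBC URL, the
numeric ID column, and the bounds are assumptions for illustration; the point
is that the caller must hand JdbcRDD a partitioning predicate and key range,
whereas the InputFormat can derive its splits on its own.

--
import java.sql.{DriverManager, ResultSet}
import org.apache.spark.rdd.JdbcRDD

// Assumed: a live SparkContext `sc`, a Phoenix JDBC endpoint on
// "servername", and a numeric ID column on EVENTS (hypothetical).
val eventsRDD = new JdbcRDD(
  sc,
  () => DriverManager.getConnection("jdbc:phoenix:servername"),
  // JdbcRDD requires exactly two '?' placeholders; it fills them with
  // per-partition bounds computed from lowerBound/upperBound below.
  "SELECT EVENTTYPE, EVENTTIME FROM EVENTS WHERE ID >= ? AND ID <= ?",
  1L,        // lowerBound of the user-supplied key range
  1000000L,  // upperBound of the user-supplied key range
  10,        // numPartitions: the range is sliced evenly into 10 queries
  (rs: ResultSet) => (rs.getString(1), rs.getTimestamp(2)))
--

Because the slices are computed arithmetically over the user-supplied range,
skew in the ID column translates directly into skewed partitions, which is
one reason server-side split computation would be attractive.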