Thank you for sharing. Phoenix for realtime queries and Spark for more complex batch processing sound like a potentially good combo.
I wonder if Spark's future will include support for the same kinds of workloads that Phoenix is being built for. This little tidbit <http://databricks.com/blog/2014/03/26/Spark-SQL-manipulating-structured-data-using-Spark.html> about the future of Spark SQL seems to suggest just that (noting for others reading that Phoenix is basically a SQL skin over HBase):

> Look for future blog posts on the following topics:
>
> - ...
> - Reading and writing data using other formats and systems, including
>   Avro and HBase

It would certainly be nice to have one big data framework to rule them all.

Nick


On Sat, Apr 26, 2014 at 10:00 AM, Josh Mahonin <jmaho...@filetrek.com> wrote:

> We're still in the infancy stages of the architecture for the project I'm
> on, but presently we're investigating an HBase / Phoenix data store for its
> realtime query abilities, and being able to expose data over a JDBC
> connector is attractive for us.
>
> Much of our data is event based, and many of the reports we'd like to do
> can be accomplished using simple SQL queries on that data - assuming they
> are performant. Thus far, the evidence shows that they are, even across
> many millions of rows.
>
> However, there are a number of models we have that today exist as a
> combination of Pig and Python batch jobs that I'd like to replace with
> Spark, which thus far has shown to be more than adequate for what we're
> doing today.
>
> As far as using Phoenix as an endpoint for a batch load, the only real
> advantage I see over using straight HBase is that I can specify a query to
> prefilter the data before attaching it to an RDD. I haven't run the
> numbers yet to see how this compares to more traditional methods though.
>
> The only worry I have is that the Phoenix input format doesn't adequately
> split the data across multiple nodes, so that's something I will need to
> look at further.
>
> Josh
>
>
> On Apr 25, 2014, at 6:33 PM, Nicholas Chammas <nicholas.cham...@gmail.com> wrote:
>
> Josh, is there a specific use pattern you think is served well by Phoenix
> + Spark? Just curious.
>
>
> On Fri, Apr 25, 2014 at 3:17 PM, Josh Mahonin <jmaho...@filetrek.com> wrote:
>
>> Phoenix generally presents itself as an endpoint using JDBC, which in my
>> testing seems to play nicely with JdbcRDD.
>>
>> However, a few days ago a patch was made against Phoenix to implement
>> support via Pig using a custom Hadoop InputFormat, which means now it has
>> Spark support too.
>>
>> Here's a code snippet that sets up an RDD for a specific query:
>>
>> --
>> val phoenixConf = new PhoenixPigConfiguration(new Configuration())
>> phoenixConf.setSelectStatement(
>>   "SELECT EVENTTYPE,EVENTTIME FROM EVENTS WHERE EVENTTYPE = 'some_type'")
>> phoenixConf.setSelectColumns("EVENTTYPE,EVENTTIME")
>> phoenixConf.configure("servername", "EVENTS", 100L)
>>
>> val phoenixRDD = sc.newAPIHadoopRDD(
>>   phoenixConf.getConfiguration(),
>>   classOf[PhoenixInputFormat],
>>   classOf[NullWritable],
>>   classOf[PhoenixRecord])
>> --
>>
>> I'm still very new at Spark and even less experienced with Phoenix, but
>> I'm hoping there's an advantage over the JdbcRDD in terms of
>> partitioning. The JdbcRDD seems to implement partitioning based on a
>> user-defined query predicate, but I think Phoenix's InputFormat is able
>> to figure out the splits, which Spark is able to leverage. I don't really
>> know how to verify whether this is the case or not, so if anyone else is
>> looking into this, I'd love to hear their thoughts.
>>
>> Josh
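For anyone weighing the two approaches Josh describes: below is a minimal sketch of the JdbcRDD route, assuming the same made-up EVENTS table, a numeric epoch-style EVENTTIME column, and a Phoenix JDBC driver on the classpath. The two '?' placeholders are the user-defined partitioning predicate he mentions; Spark divides the [lowerBound, upperBound] range across the partitions and binds each slice into them.

--
import java.sql.{DriverManager, ResultSet}
import org.apache.spark.rdd.JdbcRDD

// Each of the 10 partitions scans only its slice of the EVENTTIME range,
// bound into the two '?' placeholders. Table, columns, URL, and bounds
// are assumptions for illustration.
val jdbcRDD = new JdbcRDD(
  sc, // existing SparkContext
  () => DriverManager.getConnection("jdbc:phoenix:servername"),
  "SELECT EVENTTYPE, EVENTTIME FROM EVENTS " +
    "WHERE EVENTTYPE = 'some_type' AND EVENTTIME >= ? AND EVENTTIME <= ?",
  0L,       // lower bound of the partitioning range
  1000000L, // upper bound of the partitioning range
  10,       // number of partitions
  (rs: ResultSet) => (rs.getString(1), rs.getLong(2)))
--

Note these splits are logical ranges over one column, not HBase region boundaries; whether Phoenix's InputFormat instead derives its splits from the actual regions is exactly the open question in the thread.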
>> On Tue, Apr 8, 2014 at 1:00 PM, Nicholas Chammas <nicholas.cham...@gmail.com> wrote:
>>
>>> Just took a quick look at the overview here
>>> <http://phoenix.incubator.apache.org/> and the quick start guide here
>>> <http://phoenix.incubator.apache.org/Phoenix-in-15-minutes-or-less.html>.
>>>
>>> It looks like Apache Phoenix aims to provide flexible SQL access to
>>> data, both for transactional and analytic purposes, and at interactive
>>> speeds.
>>>
>>> Nick
>>>
>>>
>>> On Tue, Apr 8, 2014 at 12:38 PM, Bin Wang <binwang...@gmail.com> wrote:
>>>
>>>> First, I have not tried it myself. However, from what I have heard, it
>>>> has some basic SQL features, so you can query your HBase table the way
>>>> you query content on HDFS using Hive.
>>>> So it is not just "query a simple column"; I believe you can do joins
>>>> and other SQL queries. Maybe you can spin up an EMR cluster with HBase
>>>> preconfigured and give it a try.
>>>>
>>>> Sorry I cannot provide a more detailed explanation and help.
>>>>
>>>>
>>>> On Tue, Apr 8, 2014 at 10:17 AM, Flavio Pompermaier <pomperma...@okkam.it> wrote:
>>>>
>>>>> Thanks for the quick reply Bin. Phoenix is something I'm going to try
>>>>> for sure, but it seems somehow useless if I can use Spark.
>>>>> Probably, as you said, since Phoenix uses a dedicated data structure
>>>>> within each HBase table, it has more effective memory usage. But if I
>>>>> need to deserialize data stored in an HBase cell, I still have to read
>>>>> that object into memory, and thus I need Spark. From what I
>>>>> understood, Phoenix is good if I have to query a simple column of
>>>>> HBase, but things get really complicated if I have to add an index for
>>>>> each column in my table and I store complex objects within the cells.
>>>>> Is that correct?
>>>>>
>>>>> Best,
>>>>> Flavio
>>>>>
>>>>>
>>>>> On Tue, Apr 8, 2014 at 6:05 PM, Bin Wang <binwang...@gmail.com> wrote:
>>>>>
>>>>>> Hi Flavio,
>>>>>>
>>>>>> I happen to be attending the 2014 Apache Conf, where I heard about a
>>>>>> project called "Apache Phoenix", which fully leverages HBase and is
>>>>>> supposed to be 1000x faster than Hive. And it is not memory bound,
>>>>>> whereas memory sets up a limit for Spark. It is still in the
>>>>>> incubating group, and the "stats" functions Spark has already
>>>>>> implemented are still on its roadmap. I am not sure whether it will
>>>>>> be good, but it might be something interesting to check out.
>>>>>>
>>>>>> /usr/bin
>>>>>>
>>>>>>
>>>>>> On Tue, Apr 8, 2014 at 9:57 AM, Flavio Pompermaier <pomperma...@okkam.it> wrote:
>>>>>>
>>>>>>> Hi to everybody,
>>>>>>>
>>>>>>> in these days I have looked a bit at the recent evolution of the
>>>>>>> big data stacks, and it seems that HBase is somehow fading away in
>>>>>>> favour of Spark+HDFS. Am I correct?
>>>>>>> Do you think that Spark and HBase should work together or not?
>>>>>>>
>>>>>>> Best regards,
>>>>>>> Flavio
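Closing the loop on Flavio's question: for contrast with the Phoenix snippets above, here is a minimal sketch of the "straight HBase" route Josh refers to, using Spark's newAPIHadoopRDD with HBase's TableInputFormat. The table, column family, and qualifier names are invented, and the map step is where Flavio's concern about deserializing complex objects stored in cells would surface.

--
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.Result
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableInputFormat
import org.apache.hadoop.hbase.util.Bytes

val hbaseConf = HBaseConfiguration.create()
hbaseConf.set(TableInputFormat.INPUT_TABLE, "EVENTS") // hypothetical table

// No SQL prefiltering here: the whole table is scanned, one RDD element
// per row, split along HBase region boundaries.
val hbaseRDD = sc.newAPIHadoopRDD(
  hbaseConf,
  classOf[TableInputFormat],
  classOf[ImmutableBytesWritable],
  classOf[Result])

val eventTypes = hbaseRDD.map { case (_, result) =>
  // A real application would plug its own deserializer in here for
  // complex cell contents; plain string bytes are assumed in this sketch.
  Bytes.toString(result.getValue(Bytes.toBytes("d"), Bytes.toBytes("EVENTTYPE")))
}
--

Unlike the Phoenix versions, any filtering happens on the Spark side after every cell has been read and deserialized, which is the trade-off Josh and Flavio are both circling.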