Thank you for sharing. Phoenix for realtime queries and Spark for more complex batch processing sound like a potentially good combo.
I wonder if Spark's future will include support for the same kinds of workloads that Phoenix is being built for. This little tidbit <http://databricks.com/blog/2014/03/26/Spark-SQL-manipulating-structured-data-using-Spark.html> about the future of Spark SQL seems to suggest just that (noting for others reading that Phoenix is basically a SQL skin over HBase):

> Look for future blog posts on the following topics:
>
> - ...
> - Reading and writing data using other formats and systems, including
>   Avro and HBase

It would certainly be nice to have one big data framework to rule them all.

Nick


On Sat, Apr 26, 2014 at 10:00 AM, Josh Mahonin <jmaho...@filetrek.com> wrote:

> We're still in the infancy stages of the architecture for the project I'm
> on, but presently we're investigating an HBase / Phoenix data store for its
> realtime query abilities, and being able to expose data over a JDBC
> connector is attractive for us.
>
> Much of our data is event based, and many of the reports we'd like to do
> can be accomplished using simple SQL queries on that data - assuming they
> are performant. Thus far, the evidence shows that they are, even across
> many millions of rows.
>
> However, there are a number of models we have that today exist as a
> combination of Pig and Python batch jobs that I'd like to replace with
> Spark, which thus far has shown to be more than adequate for what we're
> doing today.
>
> As far as using Phoenix as an endpoint for a batch load, the only real
> advantage I see over using straight HBase is that I can specify a query to
> prefilter the data before attaching it to an RDD. I haven't run the
> numbers yet to see how this compares to more traditional methods though.
>
> The only worry I have is that the Phoenix input format doesn't adequately
> split the data across multiple nodes, so that's something I will need to
> look at further.
>
> Josh
>
>
> On Apr 25, 2014, at 6:33 PM, Nicholas Chammas <nicholas.cham...@gmail.com> wrote:
>
> Josh, is there a specific use pattern you think is served well by Phoenix
> + Spark? Just curious.
>
>
> On Fri, Apr 25, 2014 at 3:17 PM, Josh Mahonin <jmaho...@filetrek.com> wrote:
>
>> Phoenix generally presents itself as an endpoint using JDBC, which in my
>> testing seems to play nicely with JdbcRDD.
>>
>> However, a few days ago a patch was made against Phoenix to implement
>> support via Pig using a custom Hadoop InputFormat, which means now it has
>> Spark support too.
>>
>> Here's a code snippet that sets up an RDD for a specific query:
>>
>> --
>> val phoenixConf = new PhoenixPigConfiguration(new Configuration())
>> phoenixConf.setSelectStatement(
>>   "SELECT EVENTTYPE,EVENTTIME FROM EVENTS WHERE EVENTTYPE = 'some_type'")
>> phoenixConf.setSelectColumns("EVENTTYPE,EVENTTIME")
>> phoenixConf.configure("servername", "EVENTS", 100L)
>>
>> val phoenixRDD = sc.newAPIHadoopRDD(
>>   phoenixConf.getConfiguration(),
>>   classOf[PhoenixInputFormat],
>>   classOf[NullWritable],
>>   classOf[PhoenixRecord])
>> --
>>
>> I'm still very new at Spark and even less experienced with Phoenix, but
>> I'm hoping there's an advantage over the JdbcRDD in terms of
>> partitioning. The JdbcRDD seems to implement partitioning based on a
>> user-defined query predicate, but I think Phoenix's InputFormat is able
>> to figure out the splits, which Spark is able to leverage. I don't really
>> know how to verify whether this is the case or not, so if anyone else is
>> looking into this, I'd love to hear their thoughts.
>>
>> Josh
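For anyone weighing the two approaches Josh describes: below is a minimal sketch of the JdbcRDD route, assuming the same made-up EVENTS table, a numeric epoch-style EVENTTIME column, and a Phoenix JDBC driver on the classpath. The two '?' placeholders are the user-defined partitioning predicate he mentions; Spark divides the [lowerBound, upperBound] range across the partitions and binds each slice into them.

--
import java.sql.{DriverManager, ResultSet}
import org.apache.spark.rdd.JdbcRDD

// Each of the 10 partitions scans only its slice of the EVENTTIME range,
// bound into the two '?' placeholders. Table, columns, URL, and bounds
// are assumptions for illustration.
val jdbcRDD = new JdbcRDD(
  sc, // existing SparkContext
  () => DriverManager.getConnection("jdbc:phoenix:servername"),
  "SELECT EVENTTYPE, EVENTTIME FROM EVENTS " +
    "WHERE EVENTTYPE = 'some_type' AND EVENTTIME >= ? AND EVENTTIME <= ?",
  0L,       // lower bound of the partitioning range
  1000000L, // upper bound of the partitioning range
  10,       // number of partitions
  (rs: ResultSet) => (rs.getString(1), rs.getLong(2)))
--

Note these splits are logical ranges over one column, not HBase region boundaries; whether Phoenix's InputFormat instead derives its splits from the actual regions is exactly the open question in the thread.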
>> On Tue, Apr 8, 2014 at 1:00 PM, Nicholas Chammas <nicholas.cham...@gmail.com> wrote:
>>
>>> Just took a quick look at the overview here
>>> <http://phoenix.incubator.apache.org/> and the quick start guide here
>>> <http://phoenix.incubator.apache.org/Phoenix-in-15-minutes-or-less.html>.
>>>
>>> It looks like Apache Phoenix aims to provide flexible SQL access to
>>> data, both for transactional and analytic purposes, and at interactive
>>> speeds.
>>>
>>> Nick
>>>
>>>
>>> On Tue, Apr 8, 2014 at 12:38 PM, Bin Wang <binwang...@gmail.com> wrote:
>>>
>>>> First, I have not tried it myself. However, from what I have heard, it
>>>> has some basic SQL features, so you can query your HBase table the way
>>>> you query content on HDFS using Hive.
>>>> So it is not just "query a simple column"; I believe you can do joins
>>>> and other SQL queries. Maybe you can spin up an EMR cluster with HBase
>>>> preconfigured and give it a try.
>>>>
>>>> Sorry I cannot provide a more detailed explanation and help.
>>>>
>>>>
>>>> On Tue, Apr 8, 2014 at 10:17 AM, Flavio Pompermaier <pomperma...@okkam.it> wrote:
>>>>
>>>>> Thanks for the quick reply Bin. Phoenix is something I'm going to try
>>>>> for sure, but it seems somehow useless if I can use Spark.
>>>>> Probably, as you said, since Phoenix uses a dedicated data structure
>>>>> within each HBase table, it has more effective memory usage. But if I
>>>>> need to deserialize data stored in an HBase cell, I still have to read
>>>>> that object into memory, and thus I need Spark. From what I
>>>>> understood, Phoenix is good if I have to query a simple column of
>>>>> HBase, but things get really complicated if I have to add an index for
>>>>> each column in my table and I store complex objects within the cells.
>>>>> Is that correct?
>>>>>
>>>>> Best,
>>>>> Flavio
>>>>>
>>>>>
>>>>> On Tue, Apr 8, 2014 at 6:05 PM, Bin Wang <binwang...@gmail.com> wrote:
>>>>>
>>>>>> Hi Flavio,
>>>>>>
>>>>>> I happen to be attending the 2014 Apache Conf, where I heard about a
>>>>>> project called "Apache Phoenix", which fully leverages HBase and is
>>>>>> supposed to be 1000x faster than Hive. And it is not memory bound,
>>>>>> whereas memory sets up a limit for Spark. It is still in the
>>>>>> incubating group, and the "stats" functions Spark has already
>>>>>> implemented are still on its roadmap. I am not sure whether it will
>>>>>> be good, but it might be something interesting to check out.
>>>>>>
>>>>>> /usr/bin
>>>>>>
>>>>>>
>>>>>> On Tue, Apr 8, 2014 at 9:57 AM, Flavio Pompermaier <pomperma...@okkam.it> wrote:
>>>>>>
>>>>>>> Hi to everybody,
>>>>>>>
>>>>>>> in these days I have looked a bit at the recent evolution of the
>>>>>>> big data stacks, and it seems that HBase is somehow fading away in
>>>>>>> favour of Spark+HDFS. Am I correct?
>>>>>>> Do you think that Spark and HBase should work together or not?
>>>>>>>
>>>>>>> Best regards,
>>>>>>> Flavio
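Closing the loop on Flavio's question: for contrast with the Phoenix snippets above, here is a minimal sketch of the "straight HBase" route Josh refers to, using Spark's newAPIHadoopRDD with HBase's TableInputFormat. The table, column family, and qualifier names are invented, and the map step is where Flavio's concern about deserializing complex objects stored in cells would surface.

--
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.Result
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableInputFormat
import org.apache.hadoop.hbase.util.Bytes

val hbaseConf = HBaseConfiguration.create()
hbaseConf.set(TableInputFormat.INPUT_TABLE, "EVENTS") // hypothetical table

// No SQL prefiltering here: the whole table is scanned, one RDD element
// per row, split along HBase region boundaries.
val hbaseRDD = sc.newAPIHadoopRDD(
  hbaseConf,
  classOf[TableInputFormat],
  classOf[ImmutableBytesWritable],
  classOf[Result])

val eventTypes = hbaseRDD.map { case (_, result) =>
  // A real application would plug its own deserializer in here for
  // complex cell contents; plain string bytes are assumed in this sketch.
  Bytes.toString(result.getValue(Bytes.toBytes("d"), Bytes.toBytes("EVENTTYPE")))
}
--

Unlike the Phoenix versions, any filtering happens on the Spark side after every cell has been read and deserialized, which is the trade-off Josh and Flavio are both circling.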