Josh, is there a specific use pattern you think is served well by Phoenix + Spark? Just curious.
On Fri, Apr 25, 2014 at 3:17 PM, Josh Mahonin <jmaho...@filetrek.com> wrote:

> Phoenix generally presents itself as an endpoint using JDBC, which in my
> testing seems to play nicely with JdbcRDD.
>
> However, a few days ago a patch landed in Phoenix that implements Pig
> support via a custom Hadoop InputFormat, which means it now has Spark
> support too.
>
> Here's a code snippet that sets up an RDD for a specific query:
>
> --
> val phoenixConf = new PhoenixPigConfiguration(new Configuration())
> phoenixConf.setSelectStatement("SELECT EVENTTYPE,EVENTTIME FROM EVENTS
> WHERE EVENTTYPE = 'some_type'")
> phoenixConf.setSelectColumns("EVENTTYPE,EVENTTIME")
> phoenixConf.configure("servername", "EVENTS", 100L)
>
> val phoenixRDD = sc.newAPIHadoopRDD(
>   phoenixConf.getConfiguration(),
>   classOf[PhoenixInputFormat],
>   classOf[NullWritable],
>   classOf[PhoenixRecord])
> --
>
> I'm still very new to Spark and even less experienced with Phoenix, but
> I'm hoping there's an advantage over JdbcRDD in terms of partitioning.
> JdbcRDD seems to implement partitioning based on a user-defined query
> predicate, but I think Phoenix's InputFormat is able to figure out the
> splits itself, which Spark can then leverage. I don't really know how to
> verify whether this is the case, though, so if anyone else is looking
> into this, I'd love to hear their thoughts.
>
> Josh
>
>
> On Tue, Apr 8, 2014 at 1:00 PM, Nicholas Chammas <
> nicholas.cham...@gmail.com> wrote:
>
>> Just took a quick look at the overview here
>> <http://phoenix.incubator.apache.org/> and the quick start guide here
>> <http://phoenix.incubator.apache.org/Phoenix-in-15-minutes-or-less.html>.
>>
>> It looks like Apache Phoenix aims to provide flexible SQL access to
>> data, both for transactional and analytic purposes, and at interactive
>> speeds.
>>
>> Nick
>>
>>
>> On Tue, Apr 8, 2014 at 12:38 PM, Bin Wang <binwang...@gmail.com> wrote:
>>
>>> First, I have not tried it myself. However, from what I have heard it
>>> has some basic SQL features, so you can query your HBase table much
>>> like you query content on HDFS using Hive.
>>> So it is not just "query a simple column"; I believe you can do joins
>>> and other SQL queries. Maybe you can spin up an EMR cluster with HBase
>>> preconfigured and give it a try.
>>>
>>> Sorry I cannot provide a more detailed explanation or help.
>>>
>>>
>>> On Tue, Apr 8, 2014 at 10:17 AM, Flavio Pompermaier <
>>> pomperma...@okkam.it> wrote:
>>>
>>>> Thanks for the quick reply, Bin. Phoenix is something I'm going to
>>>> try for sure, but it seems somewhat redundant if I can use Spark.
>>>> Probably, as you said, Phoenix uses memory more effectively since it
>>>> keeps a dedicated data structure within each HBase table, but if I
>>>> need to deserialize data stored in an HBase cell I still have to read
>>>> that object into memory, and thus I need Spark. From what I
>>>> understood, Phoenix is good if I have to query a simple column of
>>>> HBase, but things get really complicated if I have to add an index
>>>> for each column in my table and I store complex objects within the
>>>> cells. Is that correct?
>>>>
>>>> Best,
>>>> Flavio
>>>>
>>>>
>>>> On Tue, Apr 8, 2014 at 6:05 PM, Bin Wang <binwang...@gmail.com> wrote:
>>>>
>>>>> Hi Flavio,
>>>>>
>>>>> I am currently attending the 2014 Apache Conf, where I heard about a
>>>>> project called "Apache Phoenix", which fully leverages HBase and is
>>>>> supposed to be 1000x faster than Hive. And it is not memory-bound,
>>>>> which is a limitation for Spark. It is still in the incubator, and
>>>>> the "stats" functions Spark has already implemented are still on
>>>>> Phoenix's roadmap. I am not sure whether it will be good, but it
>>>>> might be something interesting to check out.
>>>>>
>>>>> /usr/bin
>>>>>
>>>>>
>>>>> On Tue, Apr 8, 2014 at 9:57 AM, Flavio Pompermaier <
>>>>> pomperma...@okkam.it> wrote:
>>>>>
>>>>>> Hi everybody,
>>>>>>
>>>>>> These days I have been looking a bit at the recent evolution of the
>>>>>> big data stacks, and it seems that HBase is somehow fading away in
>>>>>> favour of Spark+HDFS. Am I correct?
>>>>>> Do you think that Spark and HBase should work together or not?
>>>>>>
>>>>>> Best regards,
>>>>>> Flavio
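
For comparison with the PhoenixInputFormat snippet above, here is a minimal
sketch of the JdbcRDD approach Josh mentions. The Phoenix JDBC URL, the
numeric ID column, and the bounds are assumptions for illustration; the point
is that the caller must hand JdbcRDD a partitioning predicate and key range,
whereas the InputFormat can derive its splits on its own.

--
import java.sql.{DriverManager, ResultSet}
import org.apache.spark.rdd.JdbcRDD

// Assumed: a live SparkContext `sc`, a Phoenix JDBC endpoint on
// "servername", and a numeric ID column on EVENTS (hypothetical).
val eventsRDD = new JdbcRDD(
  sc,
  () => DriverManager.getConnection("jdbc:phoenix:servername"),
  // JdbcRDD requires exactly two '?' placeholders; it fills them with
  // per-partition bounds computed from lowerBound/upperBound below.
  "SELECT EVENTTYPE, EVENTTIME FROM EVENTS WHERE ID >= ? AND ID <= ?",
  1L,        // lowerBound of the user-supplied key range
  1000000L,  // upperBound of the user-supplied key range
  10,        // numPartitions: the range is sliced evenly into 10 queries
  (rs: ResultSet) => (rs.getString(1), rs.getTimestamp(2)))
--

Because the slices are computed arithmetically over the user-supplied range,
skew in the ID column translates directly into skewed partitions, which is
one reason server-side split computation would be attractive.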