Hi Alaa Ali,

That's right: when using the PhoenixInputFormat, you can apply simple 'WHERE'
clauses and then perform whatever aggregate functions you'd like from within
Spark. Those aggregations won't be quite as fast as Phoenix's native
aggregate queries, but once the data is available as an RDD you can also do
a lot more with it than the Phoenix functions alone provide.
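
For instance, a SQL-style GROUP BY / COUNT can be mimicked with plain RDD
operations once the rows are loaded. A minimal sketch from the spark-shell
(the sample data below is just a stand-in for rows actually loaded from
Phoenix, and the column layout is assumed):

    // Stand-in for (ts, ename) rows loaded via the PhoenixInputFormat
    val rows = sc.parallelize(Seq(
      ("2014-11-21 10:00:00", "alice"),
      ("2014-11-21 10:05:00", "bob"),
      ("2014-11-21 10:10:00", "alice")))

    // Roughly "SELECT ename, COUNT(*) FROM ... GROUP BY ename"
    val countsByName = rows
      .map { case (_, ename) => (ename, 1L) }  // key each row by ename
      .reduceByKey(_ + _)                      // sum the counts per key
    countsByName.collect().foreach(println)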

I don't know whether this works with PySpark, but assuming its
'newHadoopRDD' functionality works with other input formats, it should work
for Phoenix as well.

Josh

On Fri, Nov 21, 2014 at 5:12 PM, Alaa Ali <contact.a...@gmail.com> wrote:

> Awesome, thanks Josh, I missed that previous post of yours! But your code
> snippet shows a select statement, so what I can do is just run a simple
> select with a where clause if I want to, and then run my data processing on
> the RDD to mimic the aggregation I want to do with SQL, right? Also,
> another question, I still haven't tried this out, but I'll actually be
> using this with PySpark, so I'm guessing the PhoenixPigConfiguration and
> newHadoopRDD can be defined in PySpark as well?
>
> Regards,
> Alaa Ali
>
> On Fri, Nov 21, 2014 at 4:34 PM, Josh Mahonin <jmaho...@interset.com>
> wrote:
>
>> Hi Alaa Ali,
>>
>> In order for Spark to split the JDBC query in parallel, it expects an
>> upper and lower bound for your input data, as well as a number of
>> partitions so that it can split the query across multiple tasks.
>>
>> For example, depending on your data distribution, you could set an upper
>> and lower bound on your timestamp range, and Spark should be able to create
>> new sub-queries to split up the data.
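>>
>> As a rough, untested sketch of what that could look like (it assumes 'ts'
>> can be compared against epoch-millisecond longs, since JdbcRDD's bounds
>> are longs -- in practice you may need a numeric column or a CAST, and the
>> bound values below are just example placeholders):
>>
>>     import java.sql.DriverManager
>>     import org.apache.spark.rdd.JdbcRDD
>>
>>     val url = "jdbc:phoenix:zookeeper"
>>     // Two '?' placeholders are required: JdbcRDD substitutes each
>>     // partition's lower and upper bound into them.
>>     val sql = "SELECT ts, ename FROM random_data_date WHERE ts >= ? AND ts <= ?"
>>     val lower = 1416524400000L  // example range start (epoch millis)
>>     val upper = 1416610800000L  // example range end (epoch millis)
>>     val numPartitions = 4       // how many sub-queries/tasks to split into
>>     val myRDD = new JdbcRDD(sc, () => DriverManager.getConnection(url),
>>       sql, lower, upper, numPartitions,
>>       r => r.getString("ts") + ", " + r.getString("ename"))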
>>
>> Another option is to load up the whole table using the PhoenixInputFormat
>> as a NewHadoopRDD. It doesn't yet support many of Phoenix's aggregate
>> functions, but it does let you load up whole tables as RDDs.
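>>
>> A rough sketch of that approach is below. The class and method names
>> (PhoenixPigConfiguration, PhoenixInputFormat, PhoenixRecord and their
>> setters) are assumptions based on the phoenix-pig module, so please treat
>> them as such and check the example linked below for the exact calls:
>>
>>     import org.apache.hadoop.conf.Configuration
>>     import org.apache.hadoop.io.NullWritable
>>     import org.apache.phoenix.pig.PhoenixPigConfiguration
>>     import org.apache.phoenix.pig.hadoop.{PhoenixInputFormat, PhoenixRecord}
>>
>>     // NOTE: class/method names here are assumptions; see the linked post.
>>     // The simple WHERE clause is pushed down to Phoenix; anything fancier
>>     // happens on the resulting RDD afterwards.
>>     val phoenixConf = new PhoenixPigConfiguration(new Configuration())
>>     phoenixConf.setSelectStatement(
>>       "SELECT ts, ename FROM random_data_date WHERE ename = 'alice'")
>>     phoenixConf.setSelectColumns("ts,ename")
>>     phoenixConf.configure("zookeeper", "random_data_date", 100L)
>>
>>     val phoenixRDD = sc.newAPIHadoopRDD(
>>       phoenixConf.getConfiguration(),
>>       classOf[PhoenixInputFormat],
>>       classOf[NullWritable],
>>       classOf[PhoenixRecord])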
>>
>> I've previously posted example code here:
>>
>> http://mail-archives.apache.org/mod_mbox/spark-user/201404.mbox/%3CCAJ6CGtA1DoTdadRtT5M0+75rXTyQgu5gexT+uLccw_8Ppzyt=q...@mail.gmail.com%3E
>>
>> There's also an example library implementation here, although I haven't
>> had a chance to test it yet:
>> https://github.com/simplymeasured/phoenix-spark
>>
>> Josh
>>
>> On Fri, Nov 21, 2014 at 4:14 PM, Alaa Ali <contact.a...@gmail.com> wrote:
>>
>>> I want to run queries on Apache Phoenix which has a JDBC driver. The
>>> query that I want to run is:
>>>
>>>     select ts,ename from random_data_date limit 10
>>>
>>> But I'm having issues with the JdbcRDD upper and lowerBound parameters
>>> (that I don't actually understand).
>>>
>>> Here's what I have so far:
>>>
>>> import org.apache.spark.rdd.JdbcRDD
>>> import java.sql.{Connection, DriverManager, ResultSet}
>>>
>>> val url="jdbc:phoenix:zookeeper"
>>> val sql = "select ts,ename from random_data_date limit ?"
>>> val myRDD = new JdbcRDD(sc, () => DriverManager.getConnection(url), sql,
>>> 5, 10, 2, r => r.getString("ts") + ", " + r.getString("ename"))
>>>
>>> But this doesn't work because the sql expression that the JdbcRDD
>>> expects has to have two ?s to represent the lower and upper bound.
>>>
>>> How can I run my query through the JdbcRDD?
>>>
>>> Regards,
>>> Alaa Ali
>>>
>>
>>
>
