Re: Custom UDFs with zero parameters support

2015-07-28 Thread Sachith Withana
Sure. Will do. Thanks a lot for the help. On Wed, Jul 29, 2015 at 12:08 PM, Reynold Xin wrote: > BTW for 1.5, there is already a now()-like function being added, so it > should work out of the box in 1.5.0, to be released end of Aug/early Sep. > > > On Tue, Jul 28, 2015 at 11:38 PM, Reynold Xin

Re: Custom UDFs with zero parameters support

2015-07-28 Thread Reynold Xin
Yup - would you be willing to submit a patch to add UDF0? Should be pretty easy (really just add a new Java class, and then add a new function to registerUDF) On Tue, Jul 28, 2015 at 11:36 PM, Sachith Withana wrote: > That's what I'm doing right now. > I'm implementing UDF1 for the now() UDF a
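The patch Reynold sketches above (a new Java class plus a registration hook) could look roughly like the following. This is an illustrative sketch only: `UDF0` and `NowUDF` are hypothetical names mirroring the shape of Spark's existing `UDF1`/`UDF2` Java adapters, not Spark's actual API.

```scala
// Hypothetical zero-argument UDF adapter, mirroring Spark's UDF1/UDF2 shape.
trait UDF0[R] extends Serializable {
  def call(): R
}

// Example implementation: a now() UDF returning the current time in millis.
object NowUDF extends UDF0[Long] {
  override def call(): Long = System.currentTimeMillis()
}

println(NowUDF.call()) // prints the current epoch time in milliseconds
```

Registering it would then just need one more overload of `registerUDF` on the Java side, analogous to the existing one-arg version.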

Re: Reminder about Spark 1.5.0 code freeze deadline of Aug 1st

2015-07-28 Thread Sean Owen
Right now, 603 issues have been resolved for 1.5.0. 424 are still targeted for 1.5.0, of which 33 are marked Blocker and 60 Critical. This count is not supposed to be 0 at this point, but must conceptually get to 0 at the time of 1.5.0's release. Most will simply be un-targeted or pushed down the r

Re: Custom UDFs with zero parameters support

2015-07-28 Thread Reynold Xin
BTW for 1.5, there is already a now()-like function being added, so it should work out of the box in 1.5.0, to be released end of Aug/early Sep. On Tue, Jul 28, 2015 at 11:38 PM, Reynold Xin wrote: > Yup - would you be willing to submit a patch to add UDF0? > > Should be pretty easy (really just

Re: Custom UDFs with zero parameters support

2015-07-28 Thread Sachith Withana
That's what I'm doing right now. I'm implementing UDF1 for the now() UDF, and in the UDF registration I'm registering UDFs with zero parameters as UDF1s. For the above example, although I add the now() UDF as is, since it's registered as a UDF1, I need to provide an empty parameter in the query

Re: Custom UDFs with zero parameters support

2015-07-28 Thread Reynold Xin
We should add UDF0 to it. For now, can you just create a one-arg UDF and not use the argument? On Tue, Jul 28, 2015 at 10:59 PM, Sachith Withana wrote: > Hi Reynold, > > I'm implementing the interfaces given here ( > https://github.com/apache/spark/tree/master/sql/core/src/main/java/org/apa
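The interim workaround Reynold describes, a one-argument UDF that ignores its input, can be sketched without Spark at all. The `UDF1` trait below only stands in for the shape of Spark's Java interface and is not the real class:

```scala
// Stand-in for Spark's Java UDF1 interface, for illustration only.
trait UDF1[T, R] extends Serializable {
  def call(t: T): R
}

// now() as a one-arg UDF that ignores its argument; callers would have to
// pass a dummy value in SQL, e.g. SELECT now(1).
val nowUdf = new UDF1[Any, Long] {
  override def call(ignored: Any): Long = System.currentTimeMillis()
}

println(nowUdf.call(null)) // argument is unused
```

This is exactly the workaround Sachith describes below: it works, at the cost of forcing an empty/dummy parameter into every query.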

Re: Custom UDFs with zero parameters support

2015-07-28 Thread Sachith Withana
Hi Reynold, I'm implementing the interfaces given here ( https://github.com/apache/spark/tree/master/sql/core/src/main/java/org/apache/spark/sql/api/java ). But currently there is no UDF0 adapter. Any suggestions? I'm new to Spark and any help would be appreciated. -- Thanks, Sachith Withana O

Reminder about Spark 1.5.0 code freeze deadline of Aug 1st

2015-07-28 Thread Reynold Xin
Hey All, Just a friendly reminder that Aug 1st is the feature freeze for Spark 1.5, meaning major outstanding changes will need to land this week. After Aug 1st we'll package a release for testing and then go into the normal triage process where bugs are prioritized and some smaller featur

Re: Broadcast variable of size 1 GB fails with negative memory exception

2015-07-28 Thread Mike Hynes
Hi Imran, Thanks for your reply. I have double-checked the code I ran to generate an nxn matrix and nx1 vector for n = 2^27. There was unfortunately a bug in it, where instead of having typed 134,217,728 for n = 2^27, I included a third '7' by mistake, making the size 10x larger. However, even af
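The arithmetic of the typo works out to roughly a 10x blow-up. The exact mistyped number below is hypothetical (the email doesn't say where the extra '7' landed), but any extra digit turns the nine-digit 2^27 into a ten-digit value:

```scala
// 2^27 is 134217728 (nine digits, two '7's). Slipping in a third '7' next to
// the existing pair -- a hypothetical placement -- gives a ten-digit number
// roughly 10x larger.
val intended = 1L << 27
assert(intended == 134217728L)

val mistyped = 1342177728L // hypothetical: extra '7' inserted
println(mistyped.toDouble / intended) // ~10x
```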

Fwd: Writing streaming data to cassandra creates duplicates

2015-07-28 Thread Priya Ch
Hi TD, Thanks for the info. I have the scenario like this. I am reading the data from kafka topic. Let's say kafka has 3 partitions for the topic. In my streaming application, I would configure 3 receivers with 1 thread each such that they would receive 3 dstreams (from 3 partitions of kafka to

Re: DataFrame#rdd doesn't respect DataFrame#cache, slowing down CrossValidator

2015-07-28 Thread Michael Armbrust
Can you add your description of the problem as a comment to that ticket and we'll make sure to test both cases and break it out if the root cause ends up being different. On Tue, Jul 28, 2015 at 2:48 PM, Justin Uang wrote: > Sweet! Does this cover DataFrame#rdd also using the cached query from >

Re: DataFrame#rdd doesn't respect DataFrame#cache, slowing down CrossValidator

2015-07-28 Thread Justin Uang
Sweet! Does this cover DataFrame#rdd also using the cached query from DataFrame#cache? I think the ticket 9141 is mainly concerned with whether a derived DataFrame (B) of a cached DataFrame (A) uses the cached query of A, not whether the rdd from A.rdd or B.rdd uses the cached query of A. On Tue, J

Re: DataFrame#rdd doesn't respect DataFrame#cache, slowing down CrossValidator

2015-07-28 Thread Joseph Bradley
Thanks for bringing this up! I talked with Michael Armbrust, and it sounds like this is from a bug in DataFrame caching: https://issues.apache.org/jira/browse/SPARK-9141 It's marked as a blocker for 1.5. Joseph On Tue, Jul 28, 2015 at 2:36 AM, Justin Uang wrote: > Hey guys, > > I'm running in

Re: Broadcast variable of size 1 GB fails with negative memory exception

2015-07-28 Thread Imran Rashid
Hi Mike, are you sure the size isn't off by 2x somehow? I just tried to reproduce with a simple test in BlockManagerSuite: test("large block") { store = makeBlockManager(4e9.toLong) val arr = new Array[Double](1 << 28) println(arr.size) val blockId = BlockId("rdd_3_10") val result =
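For reference, the array in Imran's test works out to 2 GiB on paper, which is where the "off by 2x" suspicion comes from:

```scala
// Array[Double](1 << 28) holds 2^28 doubles at 8 bytes each = 2 GiB --
// twice the size of a 1 GiB broadcast.
val nElems = 1L << 28
val bytes  = nElems * 8L
println(bytes) // 2147483648
assert(bytes == 2L * 1024 * 1024 * 1024)
```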

Re: Rebase and Squash Commits to Revise PR?

2015-07-28 Thread Meihua Wu
Thanks Sean. Very helpful! On Tue, Jul 28, 2015 at 1:49 PM, Sean Owen wrote: > You only need to rebase if your branch/PR now conflicts with master. > you don't need to squash since the merge script will do that in the > end for you. You can squash commits and force-push if you think it > would he

Re: Rebase and Squash Commits to Revise PR?

2015-07-28 Thread Sean Owen
You only need to rebase if your branch/PR now conflicts with master. You don't need to squash, since the merge script will do that in the end for you. You can squash commits and force-push if you think it would help clean up your intent, but often it's clearer to leave the review and commit history

Rebase and Squash Commits to Revise PR?

2015-07-28 Thread Meihua Wu
I am planning to update my PR to incorporate comments from reviewers. Do I need to rebase/squash the commits into a single one? Thanks! -MW

Re: update on git timeouts for jenkins builds

2015-07-28 Thread shane knapp
git caches are set up on all workers for the pull request builder, and builds are building w/the cache... however in the build logs it doesn't seem to be actually *hitting* the cache, so i guess i'll be doing some more poking and prodding to see wtf is going on. On Tue, Jul 28, 2015 at 12:49 PM,

Re: update on git timeouts for jenkins builds

2015-07-28 Thread shane knapp
btw, the directory perm issue was only happening on amp-jenkins-worker-04 and -05. both of the broken dirs were clobbered, so we won't be seeing any more of these again. On Tue, Jul 28, 2015 at 12:28 PM, shane knapp wrote: > ++joshrosen > > ok, i found out some of what's going on. some builds w

Re: update on git timeouts for jenkins builds

2015-07-28 Thread shane knapp
++joshrosen ok, i found out some of what's going on. some builds were failing as such: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/38749/console note that it's unable to remove the target/ directory during the build... this is caused by 'git clean -fdx' running, and deep

Re: Opinion on spark-class script simplification and posix compliance

2015-07-28 Thread Marcelo Vanzin
On Tue, Jul 28, 2015 at 12:13 PM, Félix-Antoine Fortin < felix-antoine.for...@calculquebec.ca> wrote: > The while loop cannot be executed with sh, while the single line can > be. Since on my system, sh is simply a link on bash, with some options > activated, I guess this simply means that the whil

Opinion on spark-class script simplification and posix compliance

2015-07-28 Thread Félix-Antoine Fortin
Hi, Out of curiosity, I have tried to replace the dependence on bash with sh in the different scripts that launch Spark daemons and jobs. So far, most scripts work with sh, except "bin/spark-class". The culprit is the while loop that composes the final command by parsing the output of the launcher library.

Re: Two joins in GraphX Pregel implementation

2015-07-28 Thread Ankur Dave
On 27 Jul 2015, at 16:42, Ulanov, Alexander wrote: > It seems that the mentioned two joins can be rewritten as one outer join You're right. In fact, the outer join can be streamlined further using a method from GraphOps: g = g.joinVertices(messages)(vprog).cache() Then, instead of passing new
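The single-join semantics of `joinVertices` that make this streamlining possible (vertices with a message are updated by `vprog`; all others keep their attribute unchanged) can be sketched on plain Scala maps, no Spark required. The types and `vprog` here are illustrative toy stand-ins for the GraphX versions:

```scala
// Toy model of joinVertices: an outer-join-style update where unmatched
// vertices keep their old attribute.
type VertexId = Long
val vertices = Map(1L -> 10, 2L -> 20, 3L -> 30) // id -> attribute
val messages = Map(1L -> 5, 3L -> 1)             // id -> incoming message

def vprog(id: VertexId, attr: Int, msg: Int): Int = attr + msg

val updated = vertices.map { case (id, attr) =>
  id -> messages.get(id).map(m => vprog(id, attr, m)).getOrElse(attr)
}

println(updated) // Map(1 -> 15, 2 -> 20, 3 -> 31)
```

Because unmatched vertices pass through untouched, the separate inner join the original Pregel implementation performed is redundant.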

update on git timeouts for jenkins builds

2015-07-28 Thread shane knapp
hey all, i'm just back in from my wedding weekend (woot!) and am working on figuring out what's happening w/the git timeouts for pull request builds. TL;DR: if your build fails due to a timeout, please retrigger your builds. i know this isn't the BEST solution, but until we get some stuff implem

Broadcast variable of size 1 GB fails with negative memory exception

2015-07-28 Thread Mike Hynes
Hello Devs, I am investigating how matrix vector multiplication can scale for an IndexedRowMatrix in mllib.linalg.distributed. Currently, I am broadcasting the vector to be multiplied on the right. The IndexedRowMatrix is stored across a cluster with up to 16 nodes, each with >200 GB of memory. T

Re: Custom UDFs with zero parameters support

2015-07-28 Thread Reynold Xin
I think we do support 0 arg UDFs: https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/functions.scala#L2165 How are you using UDFs? On Tue, Jul 28, 2015 at 2:15 AM, Sachith Withana wrote: > Hi all, > > Currently I need to support custom UDFs with sparkSQL q

Re: Generalised Spark-HBase integration

2015-07-28 Thread Ted Malaska
Sorry, this is more correct:

RDD and DStream Functions
1. BulkPut
2. BulkGet
3. BulkDelete
4. Foreach with connection
5. Map with connection
6. Distributed Scan
7. BulkLoad

DataFrame Functions
1. BulkPut
2. BulkGet
3. Foreach with connection
4. Map with connection
5. Distributed Scan
6. BulkLoad

Re: Generalised Spark-HBase integration

2015-07-28 Thread Ted Malaska
Stuff that people are using is here: https://github.com/cloudera-labs/SparkOnHBase The stuff going into HBase is here: https://issues.apache.org/jira/browse/HBASE-13992 If you want to add things to the HBase ticket, let's do it in another JIRA, like these: https://issues.apache.org/jira/browse

Re: ReceiverTrackerSuite failing in master build

2015-07-28 Thread Patrick Wendell
Thanks Ted for pointing this out. CC to Ryan and TD. On Tue, Jul 28, 2015 at 8:25 AM, Ted Yu wrote: > Hi, > I noticed that ReceiverTrackerSuite is failing in master Jenkins build for > both hadoop profiles. > > The failure seems to start with: > https://amplab.cs.berkeley.edu/jenkins/job/Spark-Mas

Re: Generalised Spark-HBase integration

2015-07-28 Thread Ted Malaska
Yup, you should be able to do that with the APIs that are going into HBase. Let me know if you need to chat about the problem and how to implement it with the HBase APIs. We have tried to cover any possible way to use HBase with Spark. Let us know if we missed anything; if we did, we will add it.

Re: Generalised Spark-HBase integration

2015-07-28 Thread Michal Haris
Cool, will revisit. Is your latest code visible publicly somewhere? On 28 July 2015 at 17:14, Ted Malaska wrote: > Yup you should be able to do that with the APIs that are going into HBase. > > Let me know if you need to chat about the problem and how to implement it > with the HBase apis. > >

Re: Generalised Spark-HBase integration

2015-07-28 Thread Michal Haris
Oops, yes, I'm still messing with the repo on a daily basis.. fixed On 28 July 2015 at 17:11, Ted Yu wrote: > I got a compilation error: > > [INFO] /home/hbase/s-on-hbase/src/main/scala:-1: info: compiling > [INFO] Compiling 18 source files to /home/hbase/s-on-hbase/target/classes > at 143809956

Re: Generalised Spark-HBase integration

2015-07-28 Thread Michal Haris
Hi Ted, yes, the Cloudera blog and your code were my starting point, but I needed something more Spark-centric rather than HBase-centric. Basically doing a lot of ad-hoc transformations with RDDs that were based on HBase tables and then mutating them after a series of iterative (BSP-like) steps. On 28 July 2

Re: Generalised Spark-HBase integration

2015-07-28 Thread Ted Yu
I got a compilation error: [INFO] /home/hbase/s-on-hbase/src/main/scala:-1: info: compiling [INFO] Compiling 18 source files to /home/hbase/s-on-hbase/target/classes at 1438099569598 [ERROR] /home/hbase/s-on-hbase/src/main/scala/org/apache/spark/hbase/examples/simple/HBaseTableSimple.scala:36: err

Re: Generalised Spark-HBase integration

2015-07-28 Thread Ted Malaska
Thanks Michal. Just to share what I'm working on in a related topic: a long time ago I built SparkOnHBase and put it into Cloudera Labs at this link. http://blog.cloudera.com/blog/2014/12/new-in-cloudera-labs-sparkonhbase/ Also recently I am working on getting this into HBase core. It will h

Re: Generalised Spark-HBase integration

2015-07-28 Thread Jules Damji
Brilliant! Will check it out. Cheers Jules -- The Best Ideas Are Simple Jules Damji Developer Relations & Community Outreach jda...@hortonworks.com http://hortonworks.com On 7/28/15, 8:59 AM, "Michal Haris" <michal.ha...@visualdna.com> wrote: Hi all, last couple of months I've been wor

Generalised Spark-HBase integration

2015-07-28 Thread Michal Haris
Hi all, for the last couple of months I've been working on a large graph-analytics project, and along the way have written from scratch an HBase-Spark integration, as none of the ones out there worked either in terms of scale or in the way they integrated with the RDD interface. This week I have generalised it into a

ReceiverTrackerSuite failing in master build

2015-07-28 Thread Ted Yu
Hi, I noticed that ReceiverTrackerSuite is failing in master Jenkins build for both hadoop profiles. The failure seems to start with: https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-Maven-with-YARN/3104/ FYI

Re: PySpark on PyPi

2015-07-28 Thread Justin Uang
// ping do we have any signoff from the pyspark devs to submit a PR to publish to PyPI? On Fri, Jul 24, 2015 at 10:50 PM Jeremy Freeman wrote: > Hey all, great discussion, just wanted to +1 that I see a lot of value in > steps that make it easier to use PySpark as an ordinary python library. >

[Spark SQL]Could not read parquet table after recreating it with the same table name

2015-07-28 Thread StanZhai
Hi all, I'm using Spark SQL in Spark 1.4.1. I encountered an error when using a parquet table after recreating it; we can reproduce the error as follows: ```scala // hc is an instance of HiveContext hc.sql("select * from b").show() // this is ok and b is a parquet table val df = hc.sql("sel

DataFrame#rdd doesn't respect DataFrame#cache, slowing down CrossValidator

2015-07-28 Thread Justin Uang
Hey guys, I'm running into some pretty bad performance issues when it comes to using a CrossValidator, because of caching behavior of DataFrames. The root of the problem is that while I have cached my DataFrame representing the features and labels, it is caching at the DataFrame level, while Cros

Custom UDFs with zero parameters support

2015-07-28 Thread Sachith Withana
Hi all, Currently I need to support custom UDFs with Spark SQL queries which have no parameters, e.g. now(), which returns the current time in milliseconds. Spark currently has support for UDFs having 1 or more parameters but does not contain a UDF0 adaptor. Is there a way to implement this? Or

RE: Two joins in GraphX Pregel implementation

2015-07-28 Thread Ulanov, Alexander
I’ve found two PRs (almost identical) for replacing mapReduceTriplets with aggregateMessages: https://github.com/apache/spark/pull/3782 https://github.com/apache/spark/pull/3883 The first was closed at Dave’s suggestion; the second is stale. Also there is a PR for the new Pregel API, which is also closed.

Re: ReceiverStream SPARK not able to cope up with 20,000 events /sec .

2015-07-28 Thread Akhil Das
You need to find the bottleneck here; it could be your network (if the data is huge), or your producer code isn't pushing at 20k/s. If you are able to produce at 20k/s, then make sure you are able to receive at that rate (try it without Spark). Thanks Best Regards On Sat, Jul 25, 2015 at 3:29 PM, ansh
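One quick way to rule Spark out, as Akhil suggests, is to time the producer loop on its own. In this sketch `produce` is a stand-in for whatever call actually pushes an event to Kafka; the no-op below just measures the loop's upper bound:

```scala
// Measure how many events/sec a producer callback can sustain, independent
// of Spark. Swap the no-op below for the real Kafka producer call.
def measureRate(produce: () => Unit, n: Int): Double = {
  val start = System.nanoTime()
  var i = 0
  while (i < n) { produce(); i += 1 }
  n / ((System.nanoTime() - start) / 1e9)
}

val rate = measureRate(() => (), 1000000) // no-op producer: best case
println(f"$rate%.0f events/sec")
```

If the measured rate with the real producer call is already below 20k/s, the bottleneck is upstream of Spark.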

Re: Package Release Annoucement: Spark SQL on HBase "Astro"

2015-07-28 Thread Debasish Das
That's awesome, Yan. I was considering Phoenix for SQL calls to HBase, since Cassandra supports CQL but HBase QL support was lacking. I will get back to you as I start using it on our loads. I am assuming the latencies won't be much different from accessing HBase through tsdb asynchbase, as that's on