how to implement my own datasource?

2015-06-24 Thread 诺铁
Hi, I can't find documentation about the data source API or how to implement a custom data source. Any hint is appreciated. Thanks.
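There was no official guide for this at the time; the extension point is the org.apache.spark.sql.sources package introduced in Spark 1.3. A minimal read-only sketch, with all class and option names hypothetical rather than taken from the thread:

    import org.apache.spark.rdd.RDD
    import org.apache.spark.sql.{Row, SQLContext}
    import org.apache.spark.sql.sources.{BaseRelation, RelationProvider, TableScan}
    import org.apache.spark.sql.types.{IntegerType, StructField, StructType}

    // Entry point, looked up by name: sqlContext.load("mypkg.DefaultSource", options)
    class DefaultSource extends RelationProvider {
      override def createRelation(
          sqlContext: SQLContext,
          parameters: Map[String, String]): BaseRelation =
        new RangeRelation(sqlContext, parameters.getOrElse("rows", "10").toInt)
    }

    // A relation that produces the integers 0 until n as single-column rows.
    class RangeRelation(val sqlContext: SQLContext, n: Int)
        extends BaseRelation with TableScan {
      override def schema: StructType =
        StructType(StructField("value", IntegerType) :: Nil)
      override def buildScan(): RDD[Row] =
        sqlContext.sparkContext.parallelize(0 until n).map(Row(_))
    }

In Spark 1.4 the same source can be loaded with sqlContext.read.format("mypkg.DefaultSource").option("rows", "100").load(). Richer sources mix in PrunedScan or PrunedFilteredScan to push column pruning and filters down into the scan.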

Re: Error in invoking a custom StandaloneRecoveryModeFactory in java env (Spark v1.3.0)

2015-06-24 Thread Niranda Perera
Thanks Josh, this looks very similar to my problem. On Thu, Jun 25, 2015 at 11:32 AM, Josh Rosen wrote: > This sounds like https://issues.apache.org/jira/browse/SPARK-7436, which > has been fixed in Spark 1.4+ and in branch-1.3 (for Spark 1.3.2). > > On Wed, Jun 24, 2015 at 10:57 PM, Niranda Pe

Re: Error in invoking a custom StandaloneRecoveryModeFactory in java env (Spark v1.3.0)

2015-06-24 Thread Josh Rosen
This sounds like https://issues.apache.org/jira/browse/SPARK-7436, which has been fixed in Spark 1.4+ and in branch-1.3 (for Spark 1.3.2). On Wed, Jun 24, 2015 at 10:57 PM, Niranda Perera wrote: > Hi all, > > I'm trying to implement a custom StandaloneRecoveryModeFactory in the Java > environmen

Error in invoking a custom StandaloneRecoveryModeFactory in java env (Spark v1.3.0)

2015-06-24 Thread Niranda Perera
Hi all, I'm trying to implement a custom StandaloneRecoveryModeFactory in the Java environment. Please find the implementation here [1]. I'm new to Scala, hence I'm trying to use the Java environment as much as possible. When I start a master with the spark.deploy.recoveryMode.factory property set to "CUST
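For context, a sketch of the plug-in points involved, using the Spark 1.4 signatures; in 1.3.0 the factory's second constructor argument was an akka.serialization.Serialization rather than a Spark Serializer, which is exactly what made it awkward from Java and what SPARK-7436 changed. The class name and the in-memory store are illustrative only:

    import scala.reflect.ClassTag
    import org.apache.spark.SparkConf
    import org.apache.spark.deploy.master.{LeaderElectable, LeaderElectionAgent,
      PersistenceEngine, StandaloneRecoveryModeFactory}
    import org.apache.spark.serializer.Serializer

    class MyRecoveryModeFactory(conf: SparkConf, serializer: Serializer)
        extends StandaloneRecoveryModeFactory(conf, serializer) {

      // A real engine would persist to durable storage (ZooKeeper, a DB, a filesystem).
      override def createPersistenceEngine(): PersistenceEngine =
        new PersistenceEngine {
          private val store = scala.collection.mutable.Map[String, Object]()
          override def persist(name: String, obj: Object): Unit = store(name) = obj
          override def unpersist(name: String): Unit = store -= name
          override def read[T: ClassTag](prefix: String): Seq[T] =
            store.collect { case (k, v) if k.startsWith(prefix) => v.asInstanceOf[T] }.toSeq
        }

      // Single-master stub: immediately elect the given master as leader.
      override def createLeaderElectionAgent(master: LeaderElectable): LeaderElectionAgent =
        new LeaderElectionAgent {
          override val masterInstance: LeaderElectable = master
          masterInstance.electedLeader()
        }
    }

The master picks this up when spark.deploy.recoveryMode=CUSTOM and spark.deploy.recoveryMode.factory is set to the factory's fully qualified class name.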

Re: Loss of data due to congestion

2015-06-24 Thread anshu shukla
Thanks, I am talking about streaming. On 25 Jun 2015 5:37 am, "ayan guha" wrote: > Can you elaborate little more? Are you talking about receiver or streaming? > On 24 Jun 2015 23:18, "anshu shukla" wrote: > >> How spark guarantees that no RDD will fail /lost during its life cycle . >> Is there

Re: Problem with version compatibility

2015-06-24 Thread jimfcarroll
Hi Sean, I'm running a Mesos cluster. My driver app is built with Maven against the Spark 1.4.0 dependency. The Mesos slave machines have the Spark distribution installed from the distribution link. I have a hard time understanding how this isn't a standard app deployment, but maybe I'm missing

Re: parallelize method v.s. textFile method

2015-06-24 Thread Reynold Xin
How did you exclude it? I am not sure if it is possible since each task needs to contain the chunk of data. > On Jun 24, 2015, at 6:07 PM, xing wrote: > > When we compare the performance, we already excluded this part of time > difference. > > > > -- > View this message in context: > http://a

Re: Problem with version compatibility

2015-06-24 Thread Sean Owen
They are different classes even. Your problem isn't class-not-found though. You're also comparing different builds really. You should not be including Spark code in your app. On Wed, Jun 24, 2015, 9:48 PM jimfcarroll wrote: > These jars are simply incompatible. You can see this by looking at tha
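In practice Sean's advice means marking the Spark artifacts so they are compiled against but never shipped inside the application jar; in Maven that is <scope>provided</scope> on each Spark dependency. The sbt equivalent, as a sketch:

    // build.sbt: compile against Spark, but let the cluster's own
    // distribution supply it at runtime, so the app jar never bundles
    // (possibly mismatched) Spark classes.
    libraryDependencies ++= Seq(
      "org.apache.spark" %% "spark-core" % "1.4.0" % "provided",
      "org.apache.spark" %% "spark-sql"  % "1.4.0" % "provided"
    )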

Re: parallelize method v.s. textFile method

2015-06-24 Thread xing
When we compare the performance, we already excluded this part of the time difference. -- View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/parallelize-method-v-s-textFile-method-tp12871p12873.html

Re: parallelize method v.s. textFile method

2015-06-24 Thread Reynold Xin
If you read the file one by one and then use parallelize, it is read by a single thread on a single machine. On Wednesday, June 24, 2015, xing wrote: > We have a large file and we used to read chunks and then use parallelize > method (distData = sc.parallelize(chunk)) and then do the map/reduce

parallelize method v.s. textFile method

2015-06-24 Thread xing
We have a large file, and we used to read chunks and then use the parallelize method (distData = sc.parallelize(chunk)) to do the map/reduce chunk by chunk. Recently we read the whole file using the textFile method and found the map/reduce job is much faster. Can anybody help us understand why?
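A sketch of the two patterns under discussion (paths hypothetical); the first funnels all the I/O through the driver, the second lets each executor read its own blocks, which is the difference Reynold points out above:

    // Driver-side read + parallelize: one thread on one machine reads the
    // whole file, and the data is then shipped out to the executors.
    val chunk = scala.io.Source.fromFile("/data/big.txt").getLines().toVector
    val distData = sc.parallelize(chunk)

    // textFile: each HDFS block becomes a partition, read in parallel
    // directly on the executors.
    val lines = sc.textFile("hdfs:///data/big.txt")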

Re: Python UDF performance at large scale

2015-06-24 Thread Justin Uang
Correct, I was running with a batch size of about 100 when I did the tests, because I was worried about deadlocks. Do you have any concerns regarding the batched synchronous version of communication between the Java and Python processes, and if not, should I file a ticket and start writing it?

Re: Python UDF performance at large scale

2015-06-24 Thread Davies Liu
From your comment, the 2x improvement only happens when the batch size is 1, right? On Wed, Jun 24, 2015 at 12:11 PM, Justin Uang wrote: > FYI, just submitted a PR to Pyrolite to remove their StopException. > https://github.com/irmen/Pyrolite/pull/30 > > With my benchmark, removing it ba

Re: Force inner join to shuffle the smallest table

2015-06-24 Thread Stephen Carman
Have you tried shuffle compression? spark.shuffle.compress (true|false). Also, if your filesystem is capable, I've noticed file consolidation helps disk usage a bit: spark.shuffle.consolidateFiles (true|false). Steve On Jun 24, 2015, at 3:27 PM, Ulanov, Alexander mailto:alexander.ula...@hp.co
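Both settings Steve names go on the application's SparkConf; in the 1.x line spark.shuffle.compress defaults to true and spark.shuffle.consolidateFiles to false. A minimal sketch:

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .set("spark.shuffle.compress", "true")         // compress shuffle outputs
      .set("spark.shuffle.consolidateFiles", "true") // fewer shuffle files on disk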

RE: Force inner join to shuffle the smallest table

2015-06-24 Thread Ulanov, Alexander
It also fails, as I mentioned in the original question. From: CC GP [mailto:chandrika.gopalakris...@gmail.com] Sent: Wednesday, June 24, 2015 12:08 PM To: Ulanov, Alexander Cc: dev@spark.apache.org Subject: Re: Force inner join to shuffle the smallest table Try below and see if it makes a differe

Re: Python UDF performance at large scale

2015-06-24 Thread Justin Uang
FYI, just submitted a PR to Pyrolite to remove their StopException. https://github.com/irmen/Pyrolite/pull/30 With my benchmark, removing it basically made it about 2x faster. On Wed, Jun 24, 2015 at 8:33 AM Punyashloka Biswal wrote: > Hi Davies, > > In general, do we expect people to use CPyth

Re: Force inner join to shuffle the smallest table

2015-06-24 Thread CC GP
Try below and see if it makes a difference: val result = sqlContext.sql("select big.f1, big.f2 from small inner join big on big.s=small.s and big.d=small.d") On Wed, Jun 24, 2015 at 11:35 AM, Ulanov, Alexander wrote: > Hi, > > > > I am trying to inner join two tables on two fields (string and doub

Re: Problem with version compatibility

2015-06-24 Thread jimfcarroll
These jars are simply incompatible. You can see this by looking at that class in both the Maven repo for 1.4.0, here: http://central.maven.org/maven2/org/apache/spark/spark-core_2.10/1.4.0/spark-core_2.10-1.4.0.jar, and the spark-assembly jar inside the .tgz file you can get from the officia

Force inner join to shuffle the smallest table

2015-06-24 Thread Ulanov, Alexander
Hi, I am trying to inner join two tables on two fields (string and double). One table is 2B rows, the second is 500K. They are stored in HDFS in Parquet. Spark v1.4. val big = sqlContext.parquetFile("hdfs://big") big.registerTempTable("big") val small = sqlContext.parquetFile("hdfs://small") small.reg
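Given the sizes (2B rows against 500K), the usual answer at this scale is to avoid shuffling the big side at all and broadcast the small table instead. In Spark 1.4 that is driven by spark.sql.autoBroadcastJoinThreshold, applied when the planner can estimate that one side is smaller than the threshold (Parquet relations report a size estimate). A sketch with an illustrative value:

    // Raise the broadcast threshold (in bytes) above the small table's size
    // so the planner broadcasts it to every executor instead of shuffling.
    sqlContext.setConf("spark.sql.autoBroadcastJoinThreshold",
      (200L * 1024 * 1024).toString)

    val result = sqlContext.sql(
      "select big.f1, big.f2 from big inner join small " +
      "on big.s = small.s and big.d = small.d")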

Problem with version compatibility

2015-06-24 Thread jimfcarroll
Hello all, I have a strange problem. I have a Mesos Spark cluster with Spark 1.4.0/Hadoop 2.4.0 installed, and a client application that uses Maven to include the same versions. However, I'm getting a serialVersionUID problem on: ERROR Remoting - org.apache.spark.storage.BlockManagerMessages$RegisterB

Re: [VOTE] Release Apache Spark 1.4.1

2015-06-24 Thread Josh Rosen
At least a couple of those issues are mistargeted; some of the flaky test JIRAs + test improvement tasks should probably be targeted for 1.5.0 instead. On Wed, Jun 24, 2015 at 8:56 AM, Patrick Wendell wrote: > Hey Sean, > > This is being shipped now because there is a severe bug in 1.4.0 that >

Re: how can I write a language "wrapper"?

2015-06-24 Thread Shivaram Venkataraman
The SparkR code is in the `R` directory i.e. https://github.com/apache/spark/tree/master/R Shivaram On Wed, Jun 24, 2015 at 8:45 AM, Vasili I. Galchin wrote: > Matei, > > Last night I downloaded the Spark bundle. > In order to save me time, can you give me the name of the SparkR example >

Re: OK to add committers active on JIRA to JIRA admin role?

2015-06-24 Thread Imran Rashid
+1 (partially b/c I would like jira admin myself) On Tue, Jun 23, 2015 at 3:47 AM, Sean Owen wrote: > There are some committers who are active on JIRA and sometimes need to > do things that require JIRA admin access -- in particular thinking of > adding a new person as "Contributor" in order to

Re: [VOTE] Release Apache Spark 1.4.1

2015-06-24 Thread Patrick Wendell
Hey Sean, This is being shipped now because there is a severe bug in 1.4.0 that can cause data corruption for Parquet users. There are no blockers targeted for 1.4.1 - so I don't see that JIRA is inconsistent with shipping a release now. The goal of having every single targeted JIRA cleared by th

Re: how can I write a language "wrapper"?

2015-06-24 Thread Vasili I. Galchin
Matei, Last night I downloaded the Spark bundle. In order to save me time, can you give me the name of the SparkR example and where it is in the Spark tree? Thanks, Bill On Tuesday, June 23, 2015, Matei Zaharia wrote: > Just FYI, it would be easiest to follow SparkR's example and add

Re: [GraphX] Graph 500 graph generator

2015-06-24 Thread Burak Yavuz
Hi Ryan, If you can get past the paperwork, I'm sure this can make a great Spark Package (http://spark-packages.org). People then can use it for benchmarking purposes, and I'm sure people will be looking for graph generators! Best, Burak On Wed, Jun 24, 2015 at 7:55 AM, Carr, J. Ryan wrote: >

Spark SQL 1.3 Exception

2015-06-24 Thread Debasish Das
Hi, I have an Impala-created table with the following I/O format and SerDe: inputFormat:parquet.hive.DeprecatedParquetInputFormat, outputFormat:parquet.hive.DeprecatedParquetOutputFormat, serdeInfo:SerDeInfo(name:null, serializationLib:parquet.hive.serde.ParquetHiveSerDe, parameters:{}) I am trying t
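A common workaround when the deprecated Hive Parquet SerDe path misbehaves is to bypass the SerDe and read the table's Parquet files directly; a sketch, with the warehouse path hypothetical:

    // Read the Parquet data underneath the Impala/Hive table directly,
    // bypassing parquet.hive.serde.ParquetHiveSerDe.
    val df = sqlContext.parquetFile("hdfs:///user/hive/warehouse/mytable")
    df.registerTempTable("mytable")
    sqlContext.sql("select count(*) from mytable").show()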

[GraphX] Graph 500 graph generator

2015-06-24 Thread Carr, J. Ryan
Hi Spark Devs, As part of a project at work, I have written a graph generator for RMAT graphs consistent with the specifications in the Graph 500 benchmark (http://www.graph500.org/specifications). We had originally planned to use the rmatGenerator function in GraphGenerators, but found that
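The built-in generator being compared against can be exercised like this (graph sizes illustrative):

    import org.apache.spark.graphx.util.GraphGenerators

    // GraphX's stock RMAT generator: ~2^16 vertices, 2^20 edges.
    val graph = GraphGenerators.rmatGenerator(sc, 1 << 16, 1 << 20)
    println(s"vertices=${graph.vertices.count()} edges=${graph.edges.count()}")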

Loss of data due to congestion

2015-06-24 Thread anshu shukla
How does Spark guarantee that no RDD will fail or be lost during its life cycle? Is there something like acking in Storm, or does it happen by default? -- Thanks & Regards, Anshu Shukla
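For batch RDDs the guarantee comes from lineage: a lost partition is recomputed from its parent RDDs rather than acknowledged Storm-style. For receiver-based streaming (the case Anshu confirms above), data received but not yet processed can be lost on failure unless checkpointing and the write-ahead log (available since Spark 1.2) are enabled. A minimal sketch:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf = new SparkConf()
      .setAppName("reliable-streaming")
      // Log received blocks durably before they are acknowledged.
      .set("spark.streaming.receiver.writeAheadLog.enable", "true")
    val ssc = new StreamingContext(conf, Seconds(1))
    ssc.checkpoint("hdfs:///checkpoints/app") // required for WAL recovery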

Re: Python UDF performance at large scale

2015-06-24 Thread Punyashloka Biswal
Hi Davies, In general, do we expect people to use CPython only for "heavyweight" UDFs that invoke an external library? Are there any examples of using Jython, especially performance comparisons to Java/Scala and CPython? When using Jython, do you expect the driver to send code to the executor as a

Re: [VOTE] Release Apache Spark 1.4.1

2015-06-24 Thread Sean Owen
There are 44 issues still targeted for 1.4.1. None are Blockers; 12 are Critical. ~80% were opened and/or set by committers. Compare with 90 issues resolved for 1.4.1. I'm concerned that committers are targeting lots more for a release even in the short term than realistically can go in. On its fa

Re: [SparkSQL 1.4]Could not use concat with UDF in where clause

2015-06-24 Thread StanZhai
Hi Michael Armbrust, I have filed an issue on JIRA for this: https://issues.apache.org/jira/browse/SPARK-8588