The in-process JNI only works when the R process comes up first and we launch a JVM inside it. In many deploy modes like YARN (or really anything using spark-submit), the JVM comes up first and we launch R after that. Using an inter-process solution lets us cover both use cases.
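To make that concrete, here is a minimal, hypothetical sketch of the inter-process pattern (not SparkR's actual RBackend code): the JVM side opens a plain socket server and then launches the R process itself, handing it the port, so it no longer matters which runtime the cluster manager starts first. The script name and environment variable below are made up for illustration.

import java.net.ServerSocket

// Hypothetical sketch of a socket-based backend on the JVM side.
object ToyRBackend {
  def main(args: Array[String]): Unit = {
    val server = new ServerSocket(0)          // bind to any free port
    val port = server.getLocalPort
    // Launch R and tell it where to connect.
    val rProc = new ProcessBuilder("Rscript", "backend_client.R")
    rProc.environment().put("TOY_BACKEND_PORT", port.toString)
    rProc.inheritIO().start()
    val conn = server.accept()                // wait for the R side to dial in
    // ... exchange serialized commands and results over conn's streams ...
    conn.close()
    server.close()
  }
}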
Thanks
Shi
Why did SparkR eventually choose an inter-process socket solution on the driver side instead of the in-process JNI shown in one of its docs below (around page 20)?
https://spark-summit.org/wp-content/uploads/2014/07/SparkR-Interactive-R-Programs-at-Scale-Shivaram-Vankataraman-Zongheng-Yang.pdf
Any reason why you need exactly a certain number of partitions?
One way we can make that work is for RangePartitioner to return a bunch of
empty partitions if the number of distinct elements is small. That would
require changing Spark.
If you want a quick workaround, you can also append some ran
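For reference, a small hypothetical sketch of the current behavior described above: with only a few distinct keys, asking RangePartitioner for many partitions yields far fewer than requested.

import org.apache.spark.{RangePartitioner, SparkConf, SparkContext}

object RangePartitionerDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("range-demo").setMaster("local[*]"))
    // 300 records but only 3 distinct keys, while 10 partitions are requested.
    val pairs = sc.parallelize(Seq("a", "b", "c").flatMap(k => Seq.fill(100)(k -> 1)))
    val partitioner = new RangePartitioner(10, pairs)
    // Far fewer than 10, since the range bounds collapse onto the few distinct keys.
    println(partitioner.numPartitions)
    println(pairs.partitionBy(partitioner).partitions.length)
    sc.stop()
  }
}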
Hi,
I'm looking for open source workflow tools/engines that allow us to schedule Spark jobs on a Cassandra cluster. Since there are tons of alternatives out there like Oozie, Azkaban, Luigi, Chronos, etc., I wanted to check with people here to see what they are using today.
Some of the requiremen
We could do that after 1.5 is released; it will have the same release cycle as Spark in the future.
On Tue, Jul 28, 2015 at 5:52 AM, Olivier Girardot
wrote:
> +1 (once again :) )
>
> 2015-07-28 14:51 GMT+02:00 Justin Uang :
>>
>> // ping
>>
>> do we have any signoff from the pyspark devs to submit a PR
Hey all.
I was trying to understand Spark internals by looking into (and hacking) the code. I was basically trying to explore the buckets which are generated when we partition the output of each map task and then let the reduce side fetch them on the basis of partitionId. I went into the write() m
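In case it helps anyone following along, here is a small hypothetical snippet (not taken from the shuffle write path itself) showing how a record's key maps to the bucket / partitionId that the reduce side later fetches:

import org.apache.spark.HashPartitioner

// Each map-side record lands in the bucket whose index is getPartition(key);
// reducer i later fetches every map output written for bucket i.
val partitioner = new HashPartitioner(4)
Seq("apple", "banana", "cherry", "date").foreach { key =>
  println(s"key=$key -> partitionId=${partitioner.getPartition(key)}")
}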
Eron,
Thanks for sending out this list! We can make some of the critical ones
public for 1.5, but they will be marked DeveloperApi since they may require
changes in the future. Just made the JIRA: https://issues.apache.org/jira/browse/SPARK-9704 and I'll send a PR soon.
Joseph
On Mon, Aug 3
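For anyone unfamiliar with the marker mentioned above, a hypothetical example of what a DeveloperApi-annotated class looks like (the class itself is made up):

import org.apache.spark.annotation.DeveloperApi

// Public, but flagged as a developer API: it may change between releases
// without the usual compatibility guarantees.
@DeveloperApi
class MyExperimentalModel(val weights: Array[Double]) {
  def predict(features: Array[Double]): Double =
    weights.zip(features).map { case (w, x) => w * x }.sum
}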
PythonRDD.scala has a number of PySpark-specific conventions (for example, worker reuse, exceptions, etc.) and PySpark-specific protocols (e.g. for communicating accumulators and broadcasts between the JVM and Python). While it might be possible to refactor the two classes
to share some more code I
Hello!
I think I found a nice, performant solution based on take's source code:
def exists[T](rdd: RDD[T])(qualif: T => Boolean, num: Int): Boolean = {
  if (num == 0) {
    true
  } else {
    var count: Int = 0
    val totalParts: Int = rdd.partitions.length
    var partsScanned: Int = 0
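The snippet above is cut off in the digest; one possible completion, assuming the remainder follows the same incremental partition scanning that RDD.take uses (the batch of 4 partitions per pass is an arbitrary choice for illustration):

import org.apache.spark.rdd.RDD

def exists[T](rdd: RDD[T])(qualif: T => Boolean, num: Int): Boolean = {
  if (num == 0) {
    true
  } else {
    var count: Int = 0
    val totalParts: Int = rdd.partitions.length
    var partsScanned: Int = 0
    // Scan a few partitions per job until enough qualifying elements are found,
    // instead of counting over the whole RDD.
    while (count < num && partsScanned < totalParts) {
      val parts = partsScanned until math.min(partsScanned + 4, totalParts)
      count += rdd.sparkContext.runJob(rdd, (it: Iterator[T]) => it.count(qualif), parts).sum
      partsScanned = parts.end
    }
    count >= num
  }
}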
What is the JIRA number, if a JIRA has been logged for this?
Thanks
> On Jan 20, 2015, at 11:30 AM, Cheng Lian wrote:
>
> Hey Yi,
>
> I'm quite unfamiliar with Hadoop/HDFS auth mechanisms for now, but would like
> to investigate this issue later. Would you please open a JIRA for it? Thanks
On behalf of Renyi Xiong -
When reading the Spark codebase, it looks to me like PythonRDD.scala is reusable, so I wonder why SparkR chose to implement its own RRDD.scala?
thanks
Daniel