Re: Data source API | Support for dynamic schema

2015-01-29 Thread Aniket Bhatnagar
Thanks Reynold and Cheng. It does seem like quite a bit of heavy lifting to have
a schema per row. I will for now settle with building a union schema of
all the schema versions and complaining about any incompatibilities :-)
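
To make that concrete, here is a minimal sketch of the kind of union-schema
helper I have in mind (the helper name and error handling are made up; it only
assumes Spark SQL's StructType/StructField, whose package location changes
around 1.3):

    import org.apache.spark.sql.types._   // sql.types as of 1.3; adjust for older versions

    // Merge several schema versions into one union schema; fail fast when the
    // same column appears with incompatible types across versions.
    def unionSchema(versions: Seq[StructType]): StructType = {
      val merged = scala.collection.mutable.LinkedHashMap.empty[String, StructField]
      for (schema <- versions; field <- schema.fields) {
        merged.get(field.name) match {
          case None =>
            // Columns absent from some versions must be nullable in the union.
            merged(field.name) = field.copy(nullable = true)
          case Some(existing) if existing.dataType == field.dataType =>
            () // same type everywhere: nothing to do
          case Some(existing) =>
            sys.error(s"Incompatible types for column '${field.name}': " +
              s"${existing.dataType} vs ${field.dataType}")
        }
      }
      StructType(merged.values.toSeq)
    }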

Looking forward to doing great things with the API!

Thanks,
Aniket

On Thu Jan 29 2015 at 01:09:15 Reynold Xin  wrote:

> It's an interesting idea, but there are major challenges with per row
> schema.
>
> 1. Performance - the query optimizer and execution engine use assumptions about
> schema and data to generate optimized query plans. Having to re-reason
> about the schema for each row can substantially slow down the engine, both due
> to lost optimization opportunities and due to the overhead of the schema
> information associated with each row.
>
> 2. Data model: per-row schema is fundamentally a different data model. The
> current relational model has gone through 40 years of research and has
> very well-defined semantics. I don't think there are well-defined semantics
> for a per-row schema data model. For example, what are the semantics of a
> UDF that operates on a data cell whose schema is incompatible?
> Should we coerce or convert the data type? If so, will that lead to
> conflicting semantics with some other rules? We need to answer questions
> like this in order to have a robust data model.
>
>
>
>
>
> On Wed, Jan 28, 2015 at 11:26 AM, Cheng Lian 
> wrote:
>
>> Hi Aniket,
>>
>> In general the schema of all rows in a single table must be the same. This is
>> a basic assumption made by Spark SQL. Schema union does make sense, and
>> we're planning to support this for Parquet. But as you've mentioned, it
>> doesn't help if types of different versions of a column differ from each
>> other. Also, you need to reload the data source table after schema changes
>> happen.
>>
>> Cheng
>>
>>
>> On 1/28/15 2:12 AM, Aniket Bhatnagar wrote:
>>
>>> I saw the talk on Spark data sources and, looking at the interfaces, it
>>> seems that the schema needs to be provided upfront. This works for many
>>> data sources, but I have a situation in which I would need to integrate a
>>> system that supports schema evolution by allowing users to change the
>>> schema without affecting existing rows. Basically, each row contains a
>>> schema hint (id and version), and this allows developers to evolve the
>>> schema over time and perform migration at will. Since the schema needs to
>>> be specified upfront in the data source API, one possible way would be to
>>> build a union of all schema versions and handle populating row values
>>> appropriately. This works in case columns have been added or deleted in
>>> the schema but doesn't work if types have changed. I was wondering if it
>>> is possible to change the API to provide a schema for each row instead of
>>> expecting the data source to provide the schema upfront?
>>>
>>> Thanks,
>>> Aniket
>>>
>>>
>>
>> -
>> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>> For additional commands, e-mail: dev-h...@spark.apache.org
>>
>>
>


Re: emergency jenkins restart soon

2015-01-29 Thread shane knapp
the master builds triggered around 1am last night (according to the logs),
so it looks like we're back in business.

On Wed, Jan 28, 2015 at 10:32 PM, shane knapp  wrote:

> np!  the master builds haven't triggered yet, but let's give the rube
> goldberg machine a minute to get its bearings.
>
> On Wed, Jan 28, 2015 at 10:31 PM, Reynold Xin  wrote:
>
>> Thanks for doing that, Shane!
>>
>>
>> On Wed, Jan 28, 2015 at 10:29 PM, shane knapp 
>> wrote:
>>
>>> jenkins is back up and all builds have been retriggered...  things are
>>> building and looking good, and i'll keep an eye on the spark master
>>> builds
>>> tonite and tomorrow.
>>>
>>> On Wed, Jan 28, 2015 at 9:56 PM, shane knapp 
>>> wrote:
>>>
>>> > the spark master builds stopped triggering ~yesterday and the logs
>>> don't
>>> > show anything.  i'm going to give the current batch of spark pull
>>> request
>>> > builder jobs a little more time (~30 mins) to finish, then kill
>>> whatever is
>>> > left and restart jenkins.  anything that was queued or killed will be
>>> > retriggered once jenkins is back up.
>>> >
>>> > sorry for the inconvenience, we'll get this sorted asap.
>>> >
>>> > thanks,
>>> >
>>> > shane
>>> >
>>>
>>
>>
>


Re: [VOTE] Release Apache Spark 1.2.1 (RC2)

2015-01-29 Thread Robert C Senkbeil

+1

I verified that the REPL jars published work fine with the Spark Kernel
project (can build/test against them).

Signed,
Chip Senkbeil



From:   Krishna Sankar 
To: Sean Owen 
Cc: Patrick Wendell , "dev@spark.apache.org"

Date:   01/28/2015 02:52 PM
Subject:Re: [VOTE] Release Apache Spark 1.2.1 (RC2)



+1 (non-binding, of course)

1. Compiled OSX 10.10 (Yosemite) OK Total time: 12:22 min
 mvn clean package -Pyarn -Dyarn.version=2.6.0 -Phadoop-2.4
-Dhadoop.version=2.6.0
-Phive -DskipTests
2. Tested pyspark, mllib - running as well as comparing results with 1.1.x &
1.2.0
2.1. statistics (min,max,mean,Pearson,Spearman) OK
2.2. Linear/Ridge/Lasso Regression OK
2.3. Decision Tree, Naive Bayes OK
2.4. KMeans OK
   Center And Scale OK
   Fixed : org.apache.spark.SparkException in zip !
2.5. rdd operations OK
  State of the Union Texts - MapReduce, Filter,sortByKey (word count)
2.6. Recommendation (Movielens medium dataset ~1 M ratings) OK
   Model evaluation/optimization (rank, numIter, lmbda) with itertools
OK

Cheers


On Wed, Jan 28, 2015 at 5:17 AM, Sean Owen  wrote:

> +1 (nonbinding). I verified that all the hash / signing items I
> mentioned before are resolved.
>
> The source package compiles on Ubuntu / Java 8. I ran tests and they
> passed. Well, actually I see the same failure I've been seeing locally on
> OS X and on Ubuntu for a while, but I think nobody else has seen this?
>
> MQTTStreamSuite:
> - mqtt input stream *** FAILED ***
>   org.eclipse.paho.client.mqttv3.MqttException: Too many publishes in
> progress
>   at
> org.eclipse.paho.client.mqttv3.internal.ClientState.send
(ClientState.java:423)
>
> Doesn't happen on Jenkins. If nobody else is seeing this, I suspect it
> is something perhaps related to my env that I haven't figured out yet,
> so should not be considered a blocker.
>
> On Wed, Jan 28, 2015 at 10:06 AM, Patrick Wendell 
> wrote:
> > Please vote on releasing the following candidate as Apache Spark version 1.2.1!
> >
> > The tag to be voted on is v1.2.1-rc1 (commit b77f876):
> > https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=b77f87673d1f9f03d4c83cf583158227c551359b
> >
> > The release files, including signatures, digests, etc. can be found at:
> > http://people.apache.org/~pwendell/spark-1.2.1-rc2/
> >
> > Release artifacts are signed with the following key:
> > https://people.apache.org/keys/committer/pwendell.asc
> >
> > The staging repository for this release can be found at:
> > https://repository.apache.org/content/repositories/orgapachespark-1062/
> >
> > The documentation corresponding to this release can be found at:
> > http://people.apache.org/~pwendell/spark-1.2.1-rc2-docs/
> >
> > Changes from rc1:
> > This has no code changes from RC1. Only minor changes to the release
> script.
> >
> > Please vote on releasing this package as Apache Spark 1.2.1!
> >
> > The vote is open until Saturday, January 31, at 10:04 UTC and passes
> > if a majority of at least 3 +1 PMC votes are cast.
> >
> > [ ] +1 Release this package as Apache Spark 1.2.1
> > [ ] -1 Do not release this package because ...
> >
> > For a list of fixes in this release, see http://s.apache.org/Mpn.
> >
> > To learn more about Apache Spark, please see
> > http://spark.apache.org/
> >
> > -
> > To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> > For additional commands, e-mail: dev-h...@spark.apache.org
> >
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>
>


Re: RDD.combineBy without intermediate (k,v) pair allocation

2015-01-29 Thread Mohit Jaggi
Francois,
RDD.aggregate() does not support aggregation by key. But, indeed, that is the
kind of implementation I am looking for: one that does not allocate
intermediate space for storing (K,V) pairs. When working with large datasets,
this type of intermediate memory allocation wreaks havoc with garbage
collection, not to mention unnecessarily increasing the working memory
requirement of the program.
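
For concreteness, this is roughly what the current API forces us to write today
(the Event type and the aggregation are made up for illustration): a Tuple2 is
allocated for every element just so PairRDDFunctions.combineByKey can be reached.

    import org.apache.spark.SparkContext._   // implicit conversion to PairRDDFunctions
    import org.apache.spark.rdd.RDD

    case class Event(id: String, payload: Long)

    def sumByKey(events: RDD[Event]): RDD[(String, Long)] = {
      events
        .map(e => (e.id, e.payload))          // intermediate (K, V) pair per element
        .combineByKey[Long](
          (v: Long) => v,                     // createCombiner
          (acc: Long, v: Long) => acc + v,    // mergeValue
          (a: Long, b: Long) => a + b)        // mergeCombiners
    }

What I'd like is a hypothetical combineBy(keyFunc) that takes the key extractor
directly and skips the map step (and its per-element allocation) above.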

I wonder if someone has already noticed this and there is an effort underway to 
optimize this. If not, I will take a shot at adding this functionality.

Mohit.

> On Jan 27, 2015, at 1:52 PM, francois.garil...@typesafe.com wrote:
> 
> Have you looked at the `aggregate` function in the RDD API ? 
> 
> If your way of extracting the “key” (identifier) and “value” (payload) parts
> of the RDD elements is uniform (a function), it’s unclear to me how this
> would be more efficient than extracting the key and value and then using
> combine, however.
> 
> —
> FG
> 
> 
> On Tue, Jan 27, 2015 at 10:17 PM, Mohit Jaggi  > wrote:
> 
> Hi All, 
> I have a use case where I have an RDD (not a k,v pair) where I want to do a 
> combineByKey() operation. I can do that by creating an intermediate RDD of 
> k,v pairs and using PairRDDFunctions.combineByKey(). However, I believe it 
> will be more efficient if I can avoid this intermediate RDD. Is there a way I 
> can do this by passing in a function that extracts the key, like in 
> RDD.groupBy()? [oops, RDD.groupBy seems to create the intermediate RDD 
> anyway, maybe a better implementation is possible for that too?] 
> If not, is it worth adding to the Spark API? 
> 
> Mohit. 
> - 
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org 
> For additional commands, e-mail: user-h...@spark.apache.org 
> 
> 
> 



Re: renaming SchemaRDD -> DataFrame

2015-01-29 Thread Evan Chan
+1 having proper NA support is much cleaner than using null, at
least the Java null.

On Wed, Jan 28, 2015 at 6:10 PM, Evan R. Sparks  wrote:
> You've got to be a little bit careful here. "NA" in systems like R or pandas
> may have special meaning that is distinct from "null".
>
> See, e.g. http://www.r-bloggers.com/r-na-vs-null/
>
>
>
> On Wed, Jan 28, 2015 at 4:42 PM, Reynold Xin  wrote:
>>
>> Isn't that just "null" in SQL?
>>
>> On Wed, Jan 28, 2015 at 4:41 PM, Evan Chan 
>> wrote:
>>
>> > I believe that most DataFrame implementations out there, like Pandas,
>> > supports the idea of missing values / NA, and some support the idea of
>> > Not Meaningful as well.
>> >
>> > Does Row support anything like that?  That is important for certain
>> > applications.  I thought that Row worked by being a mutable object,
>> > but haven't looked into the details in a while.
>> >
>> > -Evan
>> >
>> > On Wed, Jan 28, 2015 at 4:23 PM, Reynold Xin 
>> > wrote:
>> > > It shouldn't change the data source api at all because data sources
>> > create
>> > > RDD[Row], and that gets converted into a DataFrame automatically
>> > (previously
>> > > to SchemaRDD).
>> > >
>> > >
>> >
>> > https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/sources/interfaces.scala
>> > >
>> > > One thing that will break the data source API in 1.3 is the location
>> > > of
>> > > types. Types were previously defined in sql.catalyst.types, and now
>> > moved to
>> > > sql.types. After 1.3, sql.catalyst is hidden from users, and all
>> > > public
>> > APIs
>> > > have first class classes/objects defined in sql directly.
>> > >
>> > >
>> > >
>> > > On Wed, Jan 28, 2015 at 4:20 PM, Evan Chan 
>> > wrote:
>> > >>
>> > >> Hey guys,
>> > >>
>> > >> How does this impact the data sources API?  I was planning on using
>> > >> this for a project.
>> > >>
>> > >> +1 that many things from spark-sql / DataFrame is universally
>> > >> desirable and useful.
>> > >>
>> > >> By the way, one thing that prevents the columnar compression stuff in
>> > >> Spark SQL from being more useful is, at least from previous talks
>> > >> with
>> > >> Reynold and Michael et al., that the format was not designed for
>> > >> persistence.
>> > >>
>> > >> I have a new project that aims to change that.  It is a
>> > >> zero-serialisation, high performance binary vector library, designed
>> > >> from the outset to be a persistent storage friendly.  May be one day
>> > >> it can replace the Spark SQL columnar compression.
>> > >>
>> > >> Michael told me this would be a lot of work, and recreates parts of
>> > >> Parquet, but I think it's worth it.  LMK if you'd like more details.
>> > >>
>> > >> -Evan
>> > >>
>> > >> On Tue, Jan 27, 2015 at 4:35 PM, Reynold Xin 
>> > wrote:
>> > >> > Alright I have merged the patch (
>> > >> > https://github.com/apache/spark/pull/4173
>> > >> > ) since I don't see any strong opinions against it (as a matter of
>> > fact
>> > >> > most were for it). We can still change it if somebody lays out a
>> > strong
>> > >> > argument.
>> > >> >
>> > >> > On Tue, Jan 27, 2015 at 12:25 PM, Matei Zaharia
>> > >> > 
>> > >> > wrote:
>> > >> >
>> > >> >> The type alias means your methods can specify either type and they
>> > will
>> > >> >> work. It's just another name for the same type. But Scaladocs and
>> > such
>> > >> >> will
>> > >> >> show DataFrame as the type.
>> > >> >>
>> > >> >> Matei
>> > >> >>
>> > >> >> > On Jan 27, 2015, at 12:10 PM, Dirceu Semighini Filho <
>> > >> >> dirceu.semigh...@gmail.com> wrote:
>> > >> >> >
>> > >> >> > Reynold,
>> > >> >> > But with type alias we will have the same problem, right?
>> > >> >> > If the methods doesn't receive schemardd anymore, we will have
>> > >> >> > to
>> > >> >> > change
>> > >> >> > our code to migrade from schema to dataframe. Unless we have an
>> > >> >> > implicit
>> > >> >> > conversion between DataFrame and SchemaRDD
>> > >> >> >
>> > >> >> >
>> > >> >> >
>> > >> >> > 2015-01-27 17:18 GMT-02:00 Reynold Xin :
>> > >> >> >
>> > >> >> >> Dirceu,
>> > >> >> >>
>> > >> >> >> That is not possible because one cannot overload return types.
>> > >> >> >>
>> > >> >> >> SQLContext.parquetFile (and many other methods) needs to return
>> > some
>> > >> >> type,
>> > >> >> >> and that type cannot be both SchemaRDD and DataFrame.
>> > >> >> >>
>> > >> >> >> In 1.3, we will create a type alias for DataFrame called
>> > >> >> >> SchemaRDD
>> > >> >> >> to
>> > >> >> not
>> > >> >> >> break source compatibility for Scala.
>> > >> >> >>
>> > >> >> >>
>> > >> >> >> On Tue, Jan 27, 2015 at 6:28 AM, Dirceu Semighini Filho <
>> > >> >> >> dirceu.semigh...@gmail.com> wrote:
>> > >> >> >>
>> > >> >> >>> Can't the SchemaRDD remain the same, but deprecated, and be
>> > removed
>> > >> >> >>> in
>> > >> >> the
>> > >> >> >>> release 1.5(+/- 1)  for example, and the new code been added
>> > >> >> >>> to
>> > >> >> DataFrame?
>> > >> >> >>> With this, we don't impact in existing code for th

TimeoutException on tests

2015-01-29 Thread Dirceu Semighini Filho
Hi All,
I'm trying to use a locally built Spark, adding PR 1290 to the 1.2.0
build, and after I do the build my tests start to fail.
 should create labeledpoint *** FAILED *** (10 seconds, 50 milliseconds)
[info]   java.util.concurrent.TimeoutException: Futures timed out after
[1 milliseconds]

It seems that this is related to a netty problem. I've already tried to
change the netty version, but it didn't solve my problem (migrated from
3.4.0.Final to 3.10.0.Final). Does anyone here know how to fix it?

Kind Regards,
Dirceu


Re: Any interest in 'weighting' VectorTransformer which does component-wise scaling?

2015-01-29 Thread Octavian Geagla
Thanks for the responses.  Would something like HadamardProduct or a similar
name work, in order to keep it explicit?  It would still be a VectorTransformer,
so the name and trait would hopefully lead to a somewhat self-documenting
class.
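
To make the proposal concrete, here is a rough sketch of the vector-vector case
(the class name is just a placeholder, and it assumes the existing
mllib.feature.VectorTransformer trait; a real implementation would also want to
preserve sparsity for SparseVector input):

    import org.apache.spark.mllib.feature.VectorTransformer
    import org.apache.spark.mllib.linalg.{Vector, Vectors}

    class HadamardProduct(val scalingVector: Vector) extends VectorTransformer {
      override def transform(vector: Vector): Vector = {
        require(vector.size == scalingVector.size, "vector sizes must match")
        val values = vector.toArray.clone()   // copy so the input is not mutated
        var i = 0
        while (i < values.length) {
          values(i) *= scalingVector(i)       // component-wise (Hadamard) product
          i += 1
        }
        Vectors.dense(values)
      }
    }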

Xiangrui, do you mean the Hadamard product or the Hadamard transform?  My
initial proposal was only a vector-vector product, but I can extend this to
matrices. The transform would require a bit more work, which I'm willing to
do, but I'm not sure where FFT comes in; can you elaborate?



--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/Any-interest-in-weighting-VectorTransformer-which-does-component-wise-scaling-tp10265p10355.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



How to speed PySpark to match Scala/Java performance

2015-01-29 Thread rtshadow
Hi,

In my company, we've been trying to use PySpark to run ETLs on our data.
Alas, it turned out to be terribly slow compared to the Java or Scala API (which
we ended up using to meet performance criteria).

To be more quantitative, let's consider a simple case:
I've generated a test file (848MB): seq 1 1 > /tmp/test

and tried to run simple computation on it, which includes three steps: read
-> multiply each row by 2 -> take max
Code in python: sc.textFile("/tmp/test").map(lambda x: x * 2).max()
Code in scala: sc.textFile("/tmp/test").map(x => x * 2).max()

Here are the results of this simple benchmark:
CPython - 59s
PyPy - 26s
Scala version - 7s

I didn't dig into what exactly contributes to the execution times of CPython /
PyPy, but it seems that serialization / deserialization when sending data
to the worker may be the issue.
I know some people have already been asking about using Jython
(http://apache-spark-developers-list.1001551.n3.nabble.com/Jython-importing-pyspark-td8654.html#a8658,
http://apache-spark-developers-list.1001551.n3.nabble.com/PySpark-Driver-from-Jython-td7142.html),
but it seems that no one has really done this with Spark.
It looks like the performance gain from using Jython could be huge - you wouldn't
need to spawn PythonWorkers; all the code would just be executed inside
the SparkExecutor JVM, using Python code compiled to Java bytecode. Do you think
that's possible to achieve? Do you see any obvious obstacles? Of course,
Jython doesn't have C extensions, but if one doesn't need them, then it
should fit here nicely.

I'm willing to try to marry Spark with Jython and see how it goes.

What do you think about this?





--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/How-to-speed-PySpark-to-match-Scala-Java-performance-tp10356.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: How to speed PySpark to match Scala/Java performance

2015-01-29 Thread Davies Liu
Hey,

Without having Python as fast as Scala/Java, I think it's impossible to get
similar performance in PySpark as in Scala/Java. Jython is also much slower than
Scala/Java.

With Jython, we can avoid the cost of managing multiple processes and RPC,
but we may still need to do data conversion between Java and Python.
Given the fact that Jython is not widely used in production, it may introduce
more trouble than the performance gain is worth.

Spark jobs can easily be sped up by scaling out (by adding more resources).
I think the biggest advantage of PySpark is that it lets you prototype quickly.
Once you have your ETL finalized, it's not that hard to translate your pure
Python jobs into Scala to reduce the cost (it's optional).

Nowadays, engineer time is much more expensive than CPU time, so I think we
should focus more on the former.

That's my 2 cents.

Davies

On Thu, Jan 29, 2015 at 12:45 PM, rtshadow
 wrote:
> Hi,
>
> In my company, we've been trying to use PySpark to run ETLs on our data.
> Alas, it turned out to be terribly slow compared to Java or Scala API (which
> we ended up using to meet performance criteria).
>
> To be more quantitative, let's consider simple case:
> I've generated test file (848MB): /seq 1 1 > /tmp/test/
>
> and tried to run simple computation on it, which includes three steps: read
> -> multiply each row by 2 -> take max
> Code in python: /sc.textFile("/tmp/test").map(lambda x: x * 2).max()/
> Code in scala: /sc.textFile("/tmp/test").map(x => x * 2).max()/
>
> Here are the results of this simple benchmark:
> CPython - 59s
> PyPy - 26s
> Scala version - 7s
>
> I didn't dig into what exactly contributes to execution times of CPython /
> PyPy, but it seems that serialization / deserialization, when sending data
> to the worker may be the issue.
> I know some guys already have been asking about using Jython
> (http://apache-spark-developers-list.1001551.n3.nabble.com/Jython-importing-pyspark-td8654.html#a8658,
> http://apache-spark-developers-list.1001551.n3.nabble.com/PySpark-Driver-from-Jython-td7142.html),
> but it seems, that no one have really done this with Spark.
> It looks like performance gain from using jython can be huge - you wouldn't
> need to spawn PythonWorkers, all the code would be just executed inside
> SparkExecutor JVM, using python code compiled to java bytecode. Do you think
> that's possible to achieve? Do you see any obvious obstacles? Of course,
> jython doesn't have C extensions, but if one doesn't need them, then it
> should fit here nicely.
>
> I'm willing to try to marry Spark with Jython and see how it goes.
>
> What do you think about this?
>
>
>
>
>
> --
> View this message in context: 
> http://apache-spark-developers-list.1001551.n3.nabble.com/How-to-speed-PySpark-to-match-Scala-Java-performance-tp10356.html
> Sent from the Apache Spark Developers List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: How to speed PySpark to match Scala/Java performance

2015-01-29 Thread Reynold Xin
Once the data frame API is released in 1.3, you can write your thing in
Python and get the same performance. It can't express everything, but for
basic things like projection, filter, join, aggregate and simple numeric
computation, it should work pretty well.
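
For reference, this is the kind of code I mean, shown with the Scala flavor of
the (still pre-release) 1.3 API; the Python API mirrors it, and the df/column
names below are only placeholders. Because these operations are turned into
Catalyst plans and executed inside the JVM, the frontend language largely stops
mattering for performance.

    import org.apache.spark.sql.functions._   // avg, col, ...

    // df is assumed to be an existing DataFrame with "dept", "age" and "salary" columns.
    val result = df
      .filter(col("age") > 21)                // filter
      .select("dept", "salary")               // projection
      .groupBy("dept")                        // aggregate
      .agg(avg("salary"))
    result.collect()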


On Thu, Jan 29, 2015 at 12:45 PM, rtshadow 
wrote:

> Hi,
>
> In my company, we've been trying to use PySpark to run ETLs on our data.
> Alas, it turned out to be terribly slow compared to Java or Scala API
> (which
> we ended up using to meet performance criteria).
>
> To be more quantitative, let's consider simple case:
> I've generated test file (848MB): /seq 1 1 > /tmp/test/
>
> and tried to run simple computation on it, which includes three steps: read
> -> multiply each row by 2 -> take max
> Code in python: /sc.textFile("/tmp/test").map(lambda x: x * 2).max()/
> Code in scala: /sc.textFile("/tmp/test").map(x => x * 2).max()/
>
> Here are the results of this simple benchmark:
> CPython - 59s
> PyPy - 26s
> Scala version - 7s
>
> I didn't dig into what exactly contributes to execution times of CPython /
> PyPy, but it seems that serialization / deserialization, when sending data
> to the worker may be the issue.
> I know some guys already have been asking about using Jython
> (
> http://apache-spark-developers-list.1001551.n3.nabble.com/Jython-importing-pyspark-td8654.html#a8658
> ,
>
> http://apache-spark-developers-list.1001551.n3.nabble.com/PySpark-Driver-from-Jython-td7142.html
> ),
> but it seems, that no one have really done this with Spark.
> It looks like performance gain from using jython can be huge - you wouldn't
> need to spawn PythonWorkers, all the code would be just executed inside
> SparkExecutor JVM, using python code compiled to java bytecode. Do you
> think
> that's possible to achieve? Do you see any obvious obstacles? Of course,
> jython doesn't have C extensions, but if one doesn't need them, then it
> should fit here nicely.
>
> I'm willing to try to marry Spark with Jython and see how it goes.
>
> What do you think about this?
>
>
>
>
>
> --
> View this message in context:
> http://apache-spark-developers-list.1001551.n3.nabble.com/How-to-speed-PySpark-to-match-Scala-Java-performance-tp10356.html
> Sent from the Apache Spark Developers List mailing list archive at
> Nabble.com.
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>
>


Re: renaming SchemaRDD -> DataFrame

2015-01-29 Thread Koert Kuipers
to me the word DataFrame does come with certain expectations. one of them
is that the data is stored columnar. in R, data.frame internally uses a list
of sequences i think, but since lists can have labels it's more like a
SortedMap[String, Array[_]]. this makes certain operations very cheap (such
as adding a column).
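
(a toy illustration of that point, not Spark code, with made-up names: a frame
as an ordered map from column name to a column array, so adding a column never
touches the existing per-row data)

    import scala.collection.immutable.ListMap

    case class ToyFrame(columns: ListMap[String, Array[Any]]) {
      // Adding a column just adds a map entry; no per-row copying or rewriting.
      def withColumn(name: String, values: Array[Any]): ToyFrame =
        ToyFrame(columns + (name -> values))
    }

    val df = ToyFrame(ListMap(
      "id"   -> Array[Any](1, 2, 3),
      "name" -> Array[Any]("a", "b", "c")))
    val df2 = df.withColumn("score", Array[Any](0.5, 0.7, 0.9))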

in Spark the closest thing would be a data structure where per Partition
the data is also stored columnar. does spark SQL already use something like
that? Evan mentioned "Spark SQL columnar compression", which sounds like
it. where can i find that?

thanks

On Thu, Jan 29, 2015 at 2:32 PM, Evan Chan  wrote:

> +1 having proper NA support is much cleaner than using null, at
> least the Java null.
>
> On Wed, Jan 28, 2015 at 6:10 PM, Evan R. Sparks 
> wrote:
> > You've got to be a little bit careful here. "NA" in systems like R or
> pandas
> > may have special meaning that is distinct from "null".
> >
> > See, e.g. http://www.r-bloggers.com/r-na-vs-null/
> >
> >
> >
> > On Wed, Jan 28, 2015 at 4:42 PM, Reynold Xin 
> wrote:
> >>
> >> Isn't that just "null" in SQL?
> >>
> >> On Wed, Jan 28, 2015 at 4:41 PM, Evan Chan 
> >> wrote:
> >>
> >> > I believe that most DataFrame implementations out there, like Pandas,
> >> > supports the idea of missing values / NA, and some support the idea of
> >> > Not Meaningful as well.
> >> >
> >> > Does Row support anything like that?  That is important for certain
> >> > applications.  I thought that Row worked by being a mutable object,
> >> > but haven't looked into the details in a while.
> >> >
> >> > -Evan
> >> >
> >> > On Wed, Jan 28, 2015 at 4:23 PM, Reynold Xin 
> >> > wrote:
> >> > > It shouldn't change the data source api at all because data sources
> >> > create
> >> > > RDD[Row], and that gets converted into a DataFrame automatically
> >> > (previously
> >> > > to SchemaRDD).
> >> > >
> >> > >
> >> >
> >> >
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/sources/interfaces.scala
> >> > >
> >> > > One thing that will break the data source API in 1.3 is the location
> >> > > of
> >> > > types. Types were previously defined in sql.catalyst.types, and now
> >> > moved to
> >> > > sql.types. After 1.3, sql.catalyst is hidden from users, and all
> >> > > public
> >> > APIs
> >> > > have first class classes/objects defined in sql directly.
> >> > >
> >> > >
> >> > >
> >> > > On Wed, Jan 28, 2015 at 4:20 PM, Evan Chan  >
> >> > wrote:
> >> > >>
> >> > >> Hey guys,
> >> > >>
> >> > >> How does this impact the data sources API?  I was planning on using
> >> > >> this for a project.
> >> > >>
> >> > >> +1 that many things from spark-sql / DataFrame is universally
> >> > >> desirable and useful.
> >> > >>
> >> > >> By the way, one thing that prevents the columnar compression stuff
> in
> >> > >> Spark SQL from being more useful is, at least from previous talks
> >> > >> with
> >> > >> Reynold and Michael et al., that the format was not designed for
> >> > >> persistence.
> >> > >>
> >> > >> I have a new project that aims to change that.  It is a
> >> > >> zero-serialisation, high performance binary vector library,
> designed
> >> > >> from the outset to be a persistent storage friendly.  May be one
> day
> >> > >> it can replace the Spark SQL columnar compression.
> >> > >>
> >> > >> Michael told me this would be a lot of work, and recreates parts of
> >> > >> Parquet, but I think it's worth it.  LMK if you'd like more
> details.
> >> > >>
> >> > >> -Evan
> >> > >>
> >> > >> On Tue, Jan 27, 2015 at 4:35 PM, Reynold Xin 
> >> > wrote:
> >> > >> > Alright I have merged the patch (
> >> > >> > https://github.com/apache/spark/pull/4173
> >> > >> > ) since I don't see any strong opinions against it (as a matter
> of
> >> > fact
> >> > >> > most were for it). We can still change it if somebody lays out a
> >> > strong
> >> > >> > argument.
> >> > >> >
> >> > >> > On Tue, Jan 27, 2015 at 12:25 PM, Matei Zaharia
> >> > >> > 
> >> > >> > wrote:
> >> > >> >
> >> > >> >> The type alias means your methods can specify either type and
> they
> >> > will
> >> > >> >> work. It's just another name for the same type. But Scaladocs
> and
> >> > such
> >> > >> >> will
> >> > >> >> show DataFrame as the type.
> >> > >> >>
> >> > >> >> Matei
> >> > >> >>
> >> > >> >> > On Jan 27, 2015, at 12:10 PM, Dirceu Semighini Filho <
> >> > >> >> dirceu.semigh...@gmail.com> wrote:
> >> > >> >> >
> >> > >> >> > Reynold,
> >> > >> >> > But with type alias we will have the same problem, right?
> >> > >> >> > If the methods doesn't receive schemardd anymore, we will have
> >> > >> >> > to
> >> > >> >> > change
> >> > >> >> > our code to migrade from schema to dataframe. Unless we have
> an
> >> > >> >> > implicit
> >> > >> >> > conversion between DataFrame and SchemaRDD
> >> > >> >> >
> >> > >> >> >
> >> > >> >> >
> >> > >> >> > 2015-01-27 17:18 GMT-02:00 Reynold Xin :
> >> > >> >> >
> >> > >> >> >> Dirceu,
> >> > >> >> >>
> >> > >> >> >> That is not pos

Re: renaming SchemaRDD -> DataFrame

2015-01-29 Thread Cheng Lian
Forgot to mention that you can find it here.


On 1/29/15 1:59 PM, Cheng Lian wrote:

Yes, when a DataFrame is cached in memory, it's stored in an efficient 
columnar format. And you can also easily persist it on disk using 
Parquet, which is also columnar.


Cheng

On 1/29/15 1:24 PM, Koert Kuipers wrote:
to me the word DataFrame does come with certain expectations. one of them
is that the data is stored columnar. in R data.frame internally uses a list
of sequences i think, but since lists can have labels its more like a
SortedMap[String, Array[_]]. this makes certain operations very cheap (such
as adding a column).

in Spark the closest thing would be a data structure where per Partition
the data is also stored columnar. does spark SQL already use something like
that? Evan mentioned "Spark SQL columnar compression", which sounds like
it. where can i find that?

thanks

On Thu, Jan 29, 2015 at 2:32 PM, Evan Chan  
wrote:



+1 having proper NA support is much cleaner than using null, at
least the Java null.

On Wed, Jan 28, 2015 at 6:10 PM, Evan R. Sparks 
wrote:

You've got to be a little bit careful here. "NA" in systems like R or

pandas

may have special meaning that is distinct from "null".

See, e.g. http://www.r-bloggers.com/r-na-vs-null/



On Wed, Jan 28, 2015 at 4:42 PM, Reynold Xin 

wrote:

Isn't that just "null" in SQL?

On Wed, Jan 28, 2015 at 4:41 PM, Evan Chan 
wrote:

I believe that most DataFrame implementations out there, like 
Pandas,
supports the idea of missing values / NA, and some support the 
idea of

Not Meaningful as well.

Does Row support anything like that?  That is important for certain
applications.  I thought that Row worked by being a mutable object,
but haven't looked into the details in a while.

-Evan

On Wed, Jan 28, 2015 at 4:23 PM, Reynold Xin 
wrote:

It shouldn't change the data source api at all because data sources

create

RDD[Row], and that gets converted into a DataFrame automatically

(previously

to SchemaRDD).




https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/sources/interfaces.scala 

One thing that will break the data source API in 1.3 is the 
location

of
types. Types were previously defined in sql.catalyst.types, and now

moved to

sql.types. After 1.3, sql.catalyst is hidden from users, and all
public

APIs

have first class classes/objects defined in sql directly.



On Wed, Jan 28, 2015 at 4:20 PM, Evan Chan 
wrote:

Hey guys,

How does this impact the data sources API?  I was planning on 
using

this for a project.

+1 that many things from spark-sql / DataFrame is universally
desirable and useful.

By the way, one thing that prevents the columnar compression stuff

in

Spark SQL from being more useful is, at least from previous talks
with
Reynold and Michael et al., that the format was not designed for
persistence.

I have a new project that aims to change that. It is a
zero-serialisation, high performance binary vector library,

designed

from the outset to be a persistent storage friendly.  May be one

day

it can replace the Spark SQL columnar compression.

Michael told me this would be a lot of work, and recreates 
parts of

Parquet, but I think it's worth it.  LMK if you'd like more

details.

-Evan

On Tue, Jan 27, 2015 at 4:35 PM, Reynold Xin 

wrote:

Alright I have merged the patch (
https://github.com/apache/spark/pull/4173
) since I don't see any strong opinions against it (as a matter

of

fact

most were for it). We can still change it if somebody lays out a

strong

argument.

On Tue, Jan 27, 2015 at 12:25 PM, Matei Zaharia

wrote:


The type alias means your methods can specify either type and

they

will

work. It's just another name for the same type. But Scaladocs

and

such

will
show DataFrame as the type.

Matei


On Jan 27, 2015, at 12:10 PM, Dirceu Semighini Filho <

dirceu.semigh...@gmail.com> wrote:

Reynold,
But with type alias we will have the same problem, right?
If the methods doesn't receive schemardd anymore, we will have
to
change
our code to migrade from schema to dataframe. Unless we have

an

implicit
conversion between DataFrame and SchemaRDD



2015-01-27 17:18 GMT-02:00 Reynold Xin :


Dirceu,

That is not possible because one cannot overload return

types.

SQLContext.parquetFile (and many other methods) needs to

return

some

type,

and that type cannot be both SchemaRDD and DataFrame.

In 1.3, we will create a type alias for DataFrame called
SchemaRDD
to

not

break source compatibility for Scala.


On Tue, Jan 27, 2015 at 6:28 AM, Dirceu Semighini Filho <
dirceu.semigh...@gmail.com> wrote:


Can't the SchemaRDD remain the same, but deprecated, and be

removed

in

the

release 1.5(+/- 1)  for example, and the new code been added
to

DataFrame?

With this, we don't impact in existing code 

Re: renaming SchemaRDD -> DataFrame

2015-01-29 Thread Cheng Lian
Yes, when a DataFrame is cached in memory, it's stored in an efficient 
columnar format. And you can also easily persist it on disk using 
Parquet, which is also columnar.
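
A minimal sketch of both options with the current API (the JSON input and the
paths are just placeholders):

    import org.apache.spark.sql.SQLContext

    val sqlContext = new SQLContext(sc)              // assumes an existing SparkContext `sc`
    val people = sqlContext.jsonFile("people.json")  // placeholder input

    people.cache()                                   // kept in the in-memory columnar format
    people.registerTempTable("people")
    sqlContext.sql("SELECT COUNT(*) FROM people").collect()

    people.saveAsParquetFile("people.parquet")       // columnar on-disk format
    val reloaded = sqlContext.parquetFile("people.parquet")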


Cheng

On 1/29/15 1:24 PM, Koert Kuipers wrote:

to me the word DataFrame does come with certain expectations. one of them
is that the data is stored columnar. in R data.frame internally uses a list
of sequences i think, but since lists can have labels its more like a
SortedMap[String, Array[_]]. this makes certain operations very cheap (such
as adding a column).

in Spark the closest thing would be a data structure where per Partition
the data is also stored columnar. does spark SQL already use something like
that? Evan mentioned "Spark SQL columnar compression", which sounds like
it. where can i find that?

thanks

On Thu, Jan 29, 2015 at 2:32 PM, Evan Chan  wrote:


+1 having proper NA support is much cleaner than using null, at
least the Java null.

On Wed, Jan 28, 2015 at 6:10 PM, Evan R. Sparks 
wrote:

You've got to be a little bit careful here. "NA" in systems like R or

pandas

may have special meaning that is distinct from "null".

See, e.g. http://www.r-bloggers.com/r-na-vs-null/



On Wed, Jan 28, 2015 at 4:42 PM, Reynold Xin 

wrote:

Isn't that just "null" in SQL?

On Wed, Jan 28, 2015 at 4:41 PM, Evan Chan 
wrote:


I believe that most DataFrame implementations out there, like Pandas,
supports the idea of missing values / NA, and some support the idea of
Not Meaningful as well.

Does Row support anything like that?  That is important for certain
applications.  I thought that Row worked by being a mutable object,
but haven't looked into the details in a while.

-Evan

On Wed, Jan 28, 2015 at 4:23 PM, Reynold Xin 
wrote:

It shouldn't change the data source api at all because data sources

create

RDD[Row], and that gets converted into a DataFrame automatically

(previously

to SchemaRDD).





https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/sources/interfaces.scala

One thing that will break the data source API in 1.3 is the location
of
types. Types were previously defined in sql.catalyst.types, and now

moved to

sql.types. After 1.3, sql.catalyst is hidden from users, and all
public

APIs

have first class classes/objects defined in sql directly.



On Wed, Jan 28, 2015 at 4:20 PM, Evan Chan 
wrote:

Hey guys,

How does this impact the data sources API?  I was planning on using
this for a project.

+1 that many things from spark-sql / DataFrame is universally
desirable and useful.

By the way, one thing that prevents the columnar compression stuff

in

Spark SQL from being more useful is, at least from previous talks
with
Reynold and Michael et al., that the format was not designed for
persistence.

I have a new project that aims to change that.  It is a
zero-serialisation, high performance binary vector library,

designed

from the outset to be a persistent storage friendly.  May be one

day

it can replace the Spark SQL columnar compression.

Michael told me this would be a lot of work, and recreates parts of
Parquet, but I think it's worth it.  LMK if you'd like more

details.

-Evan

On Tue, Jan 27, 2015 at 4:35 PM, Reynold Xin 

wrote:

Alright I have merged the patch (
https://github.com/apache/spark/pull/4173
) since I don't see any strong opinions against it (as a matter

of

fact

most were for it). We can still change it if somebody lays out a

strong

argument.

On Tue, Jan 27, 2015 at 12:25 PM, Matei Zaharia

wrote:


The type alias means your methods can specify either type and

they

will

work. It's just another name for the same type. But Scaladocs

and

such

will
show DataFrame as the type.

Matei


On Jan 27, 2015, at 12:10 PM, Dirceu Semighini Filho <

dirceu.semigh...@gmail.com> wrote:

Reynold,
But with type alias we will have the same problem, right?
If the methods doesn't receive schemardd anymore, we will have
to
change
our code to migrade from schema to dataframe. Unless we have

an

implicit
conversion between DataFrame and SchemaRDD



2015-01-27 17:18 GMT-02:00 Reynold Xin :


Dirceu,

That is not possible because one cannot overload return

types.

SQLContext.parquetFile (and many other methods) needs to

return

some

type,

and that type cannot be both SchemaRDD and DataFrame.

In 1.3, we will create a type alias for DataFrame called
SchemaRDD
to

not

break source compatibility for Scala.


On Tue, Jan 27, 2015 at 6:28 AM, Dirceu Semighini Filho <
dirceu.semigh...@gmail.com> wrote:


Can't the SchemaRDD remain the same, but deprecated, and be

removed

in

the

release 1.5(+/- 1)  for example, and the new code been added
to

DataFrame?

With this, we don't impact in existing code for the next few
releases.



2015-01-27 0:02 GMT-02:00 Kushal Datta
:


I want to address the issue that Matei raised about the

heavy

lifting
required for a full SQL support. It is amazing that even
after

30

years

of

research there is not a single good open sou

Re: How to speed PySpark to match Scala/Java performance

2015-01-29 Thread Sasha Kacanski
Hi Reynold,
In my project I want to use Python API too.
When you mention DFs, are we talking about pandas, or is this something
internal to the Spark Python API?
If you could elaborate a bit on this or point me to the relevant
documentation, that would be great.
Thanks much --sasha

On Thu, Jan 29, 2015 at 4:12 PM, Reynold Xin  wrote:

> Once the data frame API is released for 1.3, you can write your thing in
> Python and get the same performance. It can't express everything, but for
> basic things like projection, filter, join, aggregate and simple numeric
> computation, it should work pretty well.
>
>
> On Thu, Jan 29, 2015 at 12:45 PM, rtshadow  >
> wrote:
>
> > Hi,
> >
> > In my company, we've been trying to use PySpark to run ETLs on our data.
> > Alas, it turned out to be terribly slow compared to Java or Scala API
> > (which
> > we ended up using to meet performance criteria).
> >
> > To be more quantitative, let's consider simple case:
> > I've generated test file (848MB): /seq 1 1 > /tmp/test/
> >
> > and tried to run simple computation on it, which includes three steps:
> read
> > -> multiply each row by 2 -> take max
> > Code in python: /sc.textFile("/tmp/test").map(lambda x: x * 2).max()/
> > Code in scala: /sc.textFile("/tmp/test").map(x => x * 2).max()/
> >
> > Here are the results of this simple benchmark:
> > CPython - 59s
> > PyPy - 26s
> > Scala version - 7s
> >
> > I didn't dig into what exactly contributes to execution times of CPython
> /
> > PyPy, but it seems that serialization / deserialization, when sending
> data
> > to the worker may be the issue.
> > I know some guys already have been asking about using Jython
> > (
> >
> http://apache-spark-developers-list.1001551.n3.nabble.com/Jython-importing-pyspark-td8654.html#a8658
> > ,
> >
> >
> http://apache-spark-developers-list.1001551.n3.nabble.com/PySpark-Driver-from-Jython-td7142.html
> > ),
> > but it seems, that no one have really done this with Spark.
> > It looks like performance gain from using jython can be huge - you
> wouldn't
> > need to spawn PythonWorkers, all the code would be just executed inside
> > SparkExecutor JVM, using python code compiled to java bytecode. Do you
> > think
> > that's possible to achieve? Do you see any obvious obstacles? Of course,
> > jython doesn't have C extensions, but if one doesn't need them, then it
> > should fit here nicely.
> >
> > I'm willing to try to marry Spark with Jython and see how it goes.
> >
> > What do you think about this?
> >
> >
> >
> >
> >
> > --
> > View this message in context:
> >
> http://apache-spark-developers-list.1001551.n3.nabble.com/How-to-speed-PySpark-to-match-Scala-Java-performance-tp10356.html
> > Sent from the Apache Spark Developers List mailing list archive at
> > Nabble.com.
> >
> > -
> > To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> > For additional commands, e-mail: dev-h...@spark.apache.org
> >
> >
>



-- 
Aleksandar Kacanski


Re: How to speed PySpark to match Scala/Java performance

2015-01-29 Thread Reynold Xin
It is something like this: https://issues.apache.org/jira/browse/SPARK-5097

On the master branch, we have a Pandas-like API already.


On Thu, Jan 29, 2015 at 4:31 PM, Sasha Kacanski  wrote:

> Hi Reynold,
> In my project I want to use Python API too.
> When you mention DF's are we talking about pandas or this is something
> internal to spark py api.
> If you could elaborate a bit on this or point me to alternate
> documentation.
> Thanks much --sasha
>
> On Thu, Jan 29, 2015 at 4:12 PM, Reynold Xin  wrote:
>
>> Once the data frame API is released for 1.3, you can write your thing in
>> Python and get the same performance. It can't express everything, but for
>> basic things like projection, filter, join, aggregate and simple numeric
>> computation, it should work pretty well.
>>
>>
>> On Thu, Jan 29, 2015 at 12:45 PM, rtshadow <
>> pastuszka.przemys...@gmail.com>
>> wrote:
>>
>> > Hi,
>> >
>> > In my company, we've been trying to use PySpark to run ETLs on our data.
>> > Alas, it turned out to be terribly slow compared to Java or Scala API
>> > (which
>> > we ended up using to meet performance criteria).
>> >
>> > To be more quantitative, let's consider simple case:
>> > I've generated test file (848MB): /seq 1 1 > /tmp/test/
>> >
>> > and tried to run simple computation on it, which includes three steps:
>> read
>> > -> multiply each row by 2 -> take max
>> > Code in python: /sc.textFile("/tmp/test").map(lambda x: x * 2).max()/
>> > Code in scala: /sc.textFile("/tmp/test").map(x => x * 2).max()/
>> >
>> > Here are the results of this simple benchmark:
>> > CPython - 59s
>> > PyPy - 26s
>> > Scala version - 7s
>> >
>> > I didn't dig into what exactly contributes to execution times of
>> CPython /
>> > PyPy, but it seems that serialization / deserialization, when sending
>> data
>> > to the worker may be the issue.
>> > I know some guys already have been asking about using Jython
>> > (
>> >
>> http://apache-spark-developers-list.1001551.n3.nabble.com/Jython-importing-pyspark-td8654.html#a8658
>> > ,
>> >
>> >
>> http://apache-spark-developers-list.1001551.n3.nabble.com/PySpark-Driver-from-Jython-td7142.html
>> > ),
>> > but it seems, that no one have really done this with Spark.
>> > It looks like performance gain from using jython can be huge - you
>> wouldn't
>> > need to spawn PythonWorkers, all the code would be just executed inside
>> > SparkExecutor JVM, using python code compiled to java bytecode. Do you
>> > think
>> > that's possible to achieve? Do you see any obvious obstacles? Of course,
>> > jython doesn't have C extensions, but if one doesn't need them, then it
>> > should fit here nicely.
>> >
>> > I'm willing to try to marry Spark with Jython and see how it goes.
>> >
>> > What do you think about this?
>> >
>> >
>> >
>> >
>> >
>> > --
>> > View this message in context:
>> >
>> http://apache-spark-developers-list.1001551.n3.nabble.com/How-to-speed-PySpark-to-match-Scala-Java-performance-tp10356.html
>> > Sent from the Apache Spark Developers List mailing list archive at
>> > Nabble.com.
>> >
>> > -
>> > To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>> > For additional commands, e-mail: dev-h...@spark.apache.org
>> >
>> >
>>
>
>
>
> --
> Aleksandar Kacanski
>


Re: How to speed PySpark to match Scala/Java performance

2015-01-29 Thread Sasha Kacanski
thanks for the quick reply, I will check the link.
Hopefully, with the conversion to py3 (or 3.4), we could take advantage of
asyncio and other cool new stuff ...

On Thu, Jan 29, 2015 at 7:41 PM, Reynold Xin  wrote:

> It is something like this:
> https://issues.apache.org/jira/browse/SPARK-5097
>
> On the master branch, we have a Pandas like API already.
>
>
> On Thu, Jan 29, 2015 at 4:31 PM, Sasha Kacanski 
> wrote:
>
>> Hi Reynold,
>> In my project I want to use Python API too.
>> When you mention DF's are we talking about pandas or this is something
>> internal to spark py api.
>> If you could elaborate a bit on this or point me to alternate
>> documentation.
>> Thanks much --sasha
>>
>> On Thu, Jan 29, 2015 at 4:12 PM, Reynold Xin  wrote:
>>
>>> Once the data frame API is released for 1.3, you can write your thing in
>>> Python and get the same performance. It can't express everything, but for
>>> basic things like projection, filter, join, aggregate and simple numeric
>>> computation, it should work pretty well.
>>>
>>>
>>> On Thu, Jan 29, 2015 at 12:45 PM, rtshadow <
>>> pastuszka.przemys...@gmail.com>
>>> wrote:
>>>
>>> > Hi,
>>> >
>>> > In my company, we've been trying to use PySpark to run ETLs on our
>>> data.
>>> > Alas, it turned out to be terribly slow compared to Java or Scala API
>>> > (which
>>> > we ended up using to meet performance criteria).
>>> >
>>> > To be more quantitative, let's consider simple case:
>>> > I've generated test file (848MB): /seq 1 1 > /tmp/test/
>>> >
>>> > and tried to run simple computation on it, which includes three steps:
>>> read
>>> > -> multiply each row by 2 -> take max
>>> > Code in python: /sc.textFile("/tmp/test").map(lambda x: x * 2).max()/
>>> > Code in scala: /sc.textFile("/tmp/test").map(x => x * 2).max()/
>>> >
>>> > Here are the results of this simple benchmark:
>>> > CPython - 59s
>>> > PyPy - 26s
>>> > Scala version - 7s
>>> >
>>> > I didn't dig into what exactly contributes to execution times of
>>> CPython /
>>> > PyPy, but it seems that serialization / deserialization, when sending
>>> data
>>> > to the worker may be the issue.
>>> > I know some guys already have been asking about using Jython
>>> > (
>>> >
>>> http://apache-spark-developers-list.1001551.n3.nabble.com/Jython-importing-pyspark-td8654.html#a8658
>>> > ,
>>> >
>>> >
>>> http://apache-spark-developers-list.1001551.n3.nabble.com/PySpark-Driver-from-Jython-td7142.html
>>> > ),
>>> > but it seems, that no one have really done this with Spark.
>>> > It looks like performance gain from using jython can be huge - you
>>> wouldn't
>>> > need to spawn PythonWorkers, all the code would be just executed inside
>>> > SparkExecutor JVM, using python code compiled to java bytecode. Do you
>>> > think
>>> > that's possible to achieve? Do you see any obvious obstacles? Of
>>> course,
>>> > jython doesn't have C extensions, but if one doesn't need them, then it
>>> > should fit here nicely.
>>> >
>>> > I'm willing to try to marry Spark with Jython and see how it goes.
>>> >
>>> > What do you think about this?
>>> >
>>> >
>>> >
>>> >
>>> >
>>> > --
>>> > View this message in context:
>>> >
>>> http://apache-spark-developers-list.1001551.n3.nabble.com/How-to-speed-PySpark-to-match-Scala-Java-performance-tp10356.html
>>> > Sent from the Apache Spark Developers List mailing list archive at
>>> > Nabble.com.
>>> >
>>> > -
>>> > To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>>> > For additional commands, e-mail: dev-h...@spark.apache.org
>>> >
>>> >
>>>
>>
>>
>>
>> --
>> Aleksandar Kacanski
>>
>
>


-- 
Aleksandar Kacanski