[Spark Core] [Feature] unionByName parameters

2022-02-05 Thread Daniel Davies
Hello dev@, I had a quick question about the unionByName function. This function currently seems to accept a parameter, "allowMissingColumns", that allows some tolerance when merging datasets with different schemas [e.g. here
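A minimal sketch of the parameter in use (assuming Spark 3.1+, where allowMissingColumns was added; the column names are hypothetical):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    df1 = spark.createDataFrame([(1, "a")], ["id", "col_a"])
    df2 = spark.createDataFrame([(2, "b")], ["id", "col_b"])

    # With allowMissingColumns=True, columns missing from either side are
    # filled with nulls instead of raising an AnalysisException.
    df1.unionByName(df2, allowMissingColumns=True).show()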

[Spark Core]: Support for un-pivoting data ('melt')

2022-01-02 Thread Daniel Davies
just caught out by this, and thought it would be useful to raise. I did see a thread in the Pony archive about this issue, but it looks like it didn't go anywhere. Does anyone else have context on this <https://lists.apache.org/list?dev@spark.apache.org:lte=60M:unpivot>? Kind Regards, -- *Daniel Davies*
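Until a built-in unpivot/melt lands, one hedged workaround is the stack() SQL function, which approximates melt for a known set of columns (the column names below are hypothetical):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([(1, 10.0, 20.0), (2, 30.0, 40.0)],
                               ["id", "metric_a", "metric_b"])

    # stack(n, k1, v1, k2, v2, ...) emits n (key, value) rows per input row,
    # which is effectively an unpivot of the listed columns.
    df.selectExpr(
        "id",
        "stack(2, 'metric_a', metric_a, 'metric_b', metric_b) as (metric, value)",
    ).show()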

Support for arrays parquet vectorized reader

2019-04-16 Thread Mick Davies
have been considered or whether this work is something that could be useful to the wider community. Regards, Mick Davies

Re: Will higher order functions in spark SQL be pushed upstream?

2018-04-19 Thread Michael Davies
On Thu, Apr 19, 2018 at 11:20 AM, Mick Davies <michael.belldav...@gmail.com> wrote: > Hi, > Regarding higher order functions >> Yes, we intend to contribute this to open source. > It doesn't look like this is in 2.3.0, at least I can'

Re: Will higher order functions in spark SQL be pushed upstream?

2018-04-19 Thread Mick Davies
Hi, Regarding higher order functions: > Yes, we intend to contribute this to open source. It doesn't look like this is in 2.3.0, at least I can't find it. Do you know when it might reach open source? Thanks, Mick
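For reference, SQL higher-order functions did land in a later release (Spark 2.4); a small sketch of the lambda syntax, assuming 2.4+ (transform, filter, aggregate and exists all follow this pattern):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([([1, 2, 3],)], ["xs"])

    # Higher-order functions operate on array columns with x -> ... lambdas.
    df.selectExpr(
        "transform(xs, x -> x + 1) AS incremented",
        "filter(xs, x -> x % 2 = 1) AS odds",
    ).show()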

Re: Mesos Spark Fine Grained Execution - CPU count

2016-12-24 Thread Davies Liu
On Mon, Dec 19, 2016 at 11:26 AM, Timothy Chen wrote:
> Hi Chawla,
> One possible reason is that Mesos fine-grained mode also takes up cores to run the executor per host, so if you have 20 agents running fine-grained executors it will take up 20 cores while they are still running.
> Tim
>
> On Fri, Dec 16, 2016 at 8:41 AM, Chawla, Sumit <sumitkcha...@gmail.com> wrote:
>> Hi,
>> I am using Spark 1.6 and have one question about the fine-grained model in Spark. I have a simple Spark application which transforms A -> B. It is a single-stage application that starts with 48 partitions. When the program starts running, the Mesos UI shows 48 tasks and 48 CPUs allocated to the job. As the tasks get done, the number of active tasks decreases; however, the number of CPUs does not decrease proportionally. When the job was about to finish, there was a single remaining task, yet the CPU count was still 20.
>> My questions: why is there no one-to-one mapping between tasks and CPUs in fine-grained mode, and how can these CPUs be released when the job is done, so that other jobs can start?
>> Regards,
>> Sumit Chawla

-- Davies
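If releasing cores promptly matters more than fine-grained sharing, one option at the time was coarse-grained mode with a cap on total cores; a hedged sketch (the Mesos master URL and core count are placeholders, not values from this thread):

    from pyspark import SparkConf, SparkContext

    # In coarse-grained mode executors hold a fixed number of cores for the
    # lifetime of the job; spark.cores.max caps the total claimed from Mesos.
    conf = (SparkConf()
            .setMaster("mesos://zk://master:2181/mesos")  # placeholder master URL
            .set("spark.mesos.coarse", "true")
            .set("spark.cores.max", "48"))
    sc = SparkContext(conf=conf)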

Re: [VOTE] Release Apache Spark 1.6.3 (RC2)

2016-11-03 Thread Davies Liu
+1 On Wed, Nov 2, 2016 at 5:40 PM, Reynold Xin wrote: > Please vote on releasing the following candidate as Apache Spark version > 1.6.3. The vote is open until Sat, Nov 5, 2016 at 18:00 PDT and passes if a > majority of at least 3 +1 PMC votes are cast. > > [ ] +1 Release this package as Apache S

Re: Straw poll: dropping support for things like Scala 2.10

2016-10-27 Thread Davies Liu
+1 for Matei's point. On Thu, Oct 27, 2016 at 8:36 AM, Matei Zaharia wrote: > Just to comment on this, I'm generally against removing these types of > things unless they create a substantial burden on project contributors. It > doesn't sound like Python 2.6 and Java 7 do that yet -- Scala 2.10 mi

Re: [VOTE] Release Apache Spark 2.0.2 (RC1)

2016-10-27 Thread Davies Liu
+1 On Thu, Oct 27, 2016 at 12:18 AM, Reynold Xin wrote: > Greetings from Spark Summit Europe at Brussels. > > Please vote on releasing the following candidate as Apache Spark version > 2.0.2. The vote is open until Sun, Oct 30, 2016 at 00:30 PDT and passes if a > majority of at least 3 +1 PMC vote

Re: [VOTE] Release Apache Spark 2.0.1 (RC3)

2016-09-26 Thread Davies Liu
+1 (non-binding) On Mon, Sep 26, 2016 at 9:36 AM, Joseph Bradley wrote: > +1 > > On Mon, Sep 26, 2016 at 7:47 AM, Denny Lee wrote: >> >> +1 (non-binding) >> On Sun, Sep 25, 2016 at 23:20 Jeff Zhang wrote: >>> >>> +1 >>> >>> On Mon, Sep 26, 2016 at 2:03 PM, Shixiong(Ryan) Zhu >>> wrote: >>

Re: Making BatchPythonEvaluation actually Batch

2016-03-31 Thread Davies Liu
@Justin, it's fixed by https://github.com/apache/spark/pull/12057 On Thu, Feb 11, 2016 at 11:26 AM, Davies Liu wrote: > Had a quick look at your commit; I think that makes sense. Could you > send a PR for that, then we can review it. > > In order to support 2), we need to chan

Re: HashedRelation Memory Pressure on Broadcast Joins

2016-03-07 Thread Davies Liu
The underlying buffer for UnsafeRow is reused in UnsafeProjection. On Thu, Mar 3, 2016 at 9:11 PM, Rishi Mishra wrote: > Hi Davies, > When you say "UnsafeRow could come from UnsafeProjection, so We should copy > the rows for safety." do you intend to say that the underlying s

Re: HashedRelation Memory Pressure on Broadcast Joins

2016-03-02 Thread Davies Liu
vs. memory. > > -Matt Cheah > > On 3/2/16, 10:15 AM, "Davies Liu" wrote: > >>UnsafeHashedRelation and HashedRelation could also be used in Executor >>(for non-broadcast hash join), then the UnsafeRow could come from >>UnsafeProjection, >>so We should

Re: HashedRelation Memory Pressure on Broadcast Joins

2016-03-02 Thread Davies Liu
UnsafeHashedRelation and HashedRelation could also be used in Executor (for non-broadcast hash join), then the UnsafeRow could come from UnsafeProjection, so we should copy the rows for safety. We could have a smarter copy() for UnsafeRow (avoid the copy if it's already copied), but I don't think
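A Python analogy (not the Spark internals) of why rows coming from a reused projection buffer must be copied before being stored in a hash relation: the source mutates the same object in place on every iteration.

    def row_iterator():
        row = {}                  # one mutable "row" object, reused per iteration
        for i in range(3):
            row["id"] = i         # overwritten in place, like a reused buffer
            yield row

    unsafe = [r for r in row_iterator()]       # every element is the same object
    safe = [dict(r) for r in row_iterator()]   # explicit copy per row

    print(unsafe)   # [{'id': 2}, {'id': 2}, {'id': 2}]
    print(safe)     # [{'id': 0}, {'id': 1}, {'id': 2}]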

Re: Making BatchPythonEvaluation actually Batch

2016-02-11 Thread Davies Liu
I had a quick look at your commit; I think that makes sense. Could you send a PR for that, then we can review it? In order to support 2), we need to change the serialized Python function from `f(iter)` to `f(x)`, processing one row at a time (not a partition); then we can easily combine them together:
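A rough sketch of that idea (not the actual Spark internals): per-row functions compose, so chained UDFs can be fused into a single function and evaluated in one pass over a partition.

    def compose(*fs):
        """Fuse per-row functions: compose(f, g)(x) == f(g(x))."""
        def fused(x):
            for f in reversed(fs):
                x = f(x)
            return x
        return fused

    double = lambda x: x * 2
    add_one = lambda x: x + 1

    fused_udf = compose(add_one, double)          # add_one(double(x))
    partition = range(5)                          # stand-in for one partition
    print([fused_udf(x) for x in partition])      # [1, 3, 5, 7, 9]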

Re: [discuss] dropping Python 2.6 support

2016-01-05 Thread Davies Liu
Created JIRA: https://issues.apache.org/jira/browse/SPARK-12661 On Tue, Jan 5, 2016 at 2:49 PM, Koert Kuipers wrote: > i do not think so. > > does the python 2.7 need to be installed on all slaves? if so, we do not > have direct access to those. > > also, spark is easy for us to ship with our sof

Re: [discuss] dropping Python 2.6 support

2016-01-05 Thread Davies Liu
+1 On Tue, Jan 5, 2016 at 5:45 AM, Nicholas Chammas wrote: > +1 > > Red Hat supports Python 2.6 on RHEL 5 until 2020, but otherwise yes, Python > 2.6 is ancient history and the core Python developers stopped supporting it > in 2013. RHEL 5 is not a good enough reason to continue support for Pytho

Re: [VOTE] Release Apache Spark 1.6.0 (RC1)

2015-12-03 Thread Davies Liu
Is https://github.com/apache/spark/pull/10134 a valid fix? (still worse than 1.5) On Thu, Dec 3, 2015 at 8:45 AM, mkhaitman wrote: > I reported this in the 1.6 preview thread, but wouldn't mind if someone can > confirm that ctrl-c is not keyboard interrupting / clearing the current line >

Re: pyspark with pypy not work for spark-1.5.1

2015-11-13 Thread Davies Liu
We already test CPython 2.6, CPython 3.4 and PyPy 2.5; it takes more than 30 min to run (without parallelization), which I think should be enough. PyPy 2.2 is too old and we don't have enough resources to support it. On Fri, Nov 6, 2015 at 2:27 AM, Chang Ya-Hsuan wrote: > Hi, I run ./python/run-tests

Re: ShuffledHashJoin Possible Issue

2015-10-19 Thread Davies Liu
Can you reproduce it on master? I can't reproduce it with the following code: >>> t2 = sqlContext.range(50).selectExpr("concat('A', id) as id") >>> t1 = sqlContext.range(10).selectExpr("concat('A', id) as id") >>> t1.join(t2).where(t1.id == t2.id).explain() ShuffledHashJoin [id#21], [id#19], Buil

Re: StructType has more rows, than corresponding Row has objects.

2015-10-05 Thread Davies Liu
Could you tell us a way to reproduce this failure? Reading from JSON or Parquet? On Mon, Oct 5, 2015 at 4:28 AM, Eugene Morozov wrote: > Hi, > > We're building our own framework on top of spark and we give users pretty > complex schema to work with. That requires from us to build dataframes by >

Re: pyspark streaming DStream compute

2015-09-15 Thread Davies Liu
On Tue, Sep 15, 2015 at 1:46 PM, Renyi Xiong wrote: > Can anybody help understand why pyspark streaming uses a py4j callback to > execute Python code while pyspark batch uses worker.py? There are two kinds of callbacks in PySpark Streaming: 1) one operates on RDDs; it takes an RDD and returns a new RDD
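A minimal sketch of the two paths, assuming a socket source on localhost:9999 (hypothetical): the RDD-to-RDD function passed to transform() is invoked on the driver once per batch via the Py4J callback, while the per-record lambda inside map() is shipped to the executors and run by worker.py.

    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext

    sc = SparkContext("local[2]", "callback-demo")
    ssc = StreamingContext(sc, batchDuration=1)

    lines = ssc.socketTextStream("localhost", 9999)   # hypothetical source

    # Driver-side callback: rdd -> rdd, called once per batch.
    counts = lines.transform(
        lambda rdd: rdd.map(lambda line: (line, 1))   # worker-side per-record code
                       .reduceByKey(lambda a, b: a + b))
    counts.pprint()

    # ssc.start(); ssc.awaitTermination()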

Re: Pyspark DataFrame TypeError

2015-09-08 Thread Davies Liu
I tried with Python 2.7/3.4 and Spark 1.4.1/1.5-RC3, they all work as expected: ``` >>> from pyspark.mllib.linalg import Vectors >>> df = sqlContext.createDataFrame([(1.0, Vectors.dense([1.0])), (0.0, >>> Vectors.sparse(1, [], []))], ["label", "featuers"]) >>> df.show() +-+-+ |label|

Re: [VOTE] Release Apache Spark 1.5.0 (RC3)

2015-09-04 Thread Davies Liu
Could you update the notebook to use the built-in SQL functions month and year instead of Python UDFs? (They were introduced in 1.5.) Once those two UDFs are removed, it runs successfully and is also much faster. On Fri, Sep 4, 2015 at 2:22 PM, Krishna Sankar wrote: > Yin, > It is the > https://github.com/xsan
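A small sketch of the suggested swap (the column name is hypothetical); the built-in functions run inside the JVM and avoid the per-row Python serialization a UDF incurs:

    import datetime
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, month, year

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([(datetime.date(2015, 9, 4),)], ["d"])

    # Built-in month()/year() instead of equivalent Python UDFs.
    df.select(year(col("d")).alias("yr"), month(col("d")).alias("mo")).show()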

Re: [VOTE] Release Apache Spark 1.5.0 (RC3)

2015-09-03 Thread Davies Liu
+1, built 1.5 from source and ran TPC-DS locally and on clusters, and ran performance benchmarks for aggregation and join at different scales; all worked well. On Thu, Sep 3, 2015 at 10:05 AM, Michael Armbrust wrote: > +1 Ran TPC-DS and ported several jobs over to 1.5 > > On Thu, Sep 3, 2015 at 9:57 A

Re: PySpark on PyPi

2015-08-10 Thread Davies Liu
> Brian > > On Thu, Aug 6, 2015 at 3:14 PM, Davies Liu wrote: >> We could do that after 1.5 released, it will have same release cycle >> as Spark in the future. >> >> On Tue, Jul 28, 2015 at 5:52 AM, Olivier Girardot >> wrote: >>> +1 (once again :) )

Re: PySpark on PyPi

2015-08-06 Thread Davies Liu
transitive dependencies (Pandas, Py4J) in a >>> way that pip can use. Contrast this with the current situation, where >>> df.toPandas() exists in the Spark API but doesn't actually work until you >>> install Pandas. >>> >>> Punya >>> On Wed, Jul 2

Re: PySpark GroupByKey implementation question

2015-07-15 Thread Davies Liu
ell, then? If we implement external-group-by should we implement it with > the map-side-combine semantics that Pyspark does? > -Matt Cheah > > On 7/15/15, 8:21 AM, "Davies Liu" wrote: > >>If the map-side-combine is not that necessary, given the fact that it >>ca

Re: PySpark GroupByKey implementation question

2015-07-15 Thread Davies Liu
Map-side combine is not that necessary here, given that it cannot reduce the size of the data for shuffling much (the key still needs to be serialized for each value), though it can reduce the number of key-value pairs and potentially reduce the number of operations later (repartition and groupBy). On Tu
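A short illustration of the trade-off in PySpark (a sketch, not a benchmark): groupByKey ships every pair through the shuffle, while reduceByKey combines values on the map side first.

    from pyspark import SparkContext

    sc = SparkContext("local[2]", "combine-demo")
    pairs = sc.parallelize([("a", 1), ("b", 1), ("a", 1)] * 1000)

    # groupByKey: all (key, value) pairs cross the shuffle boundary.
    grouped = pairs.groupByKey().mapValues(lambda vs: len(list(vs)))

    # reduceByKey: values are combined per partition before shuffling.
    reduced = pairs.reduceByKey(lambda a, b: a + b)

    print(sorted(grouped.collect()))
    print(sorted(reduced.collect()))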

Re: pyspark.sql.tests: is test_time_with_timezone a flaky test?

2015-07-12 Thread Davies Liu
Will be fixed by https://github.com/apache/spark/pull/7363 On Sun, Jul 12, 2015 at 7:45 PM, Davies Liu wrote: > Thanks for reporting this, I'm working on it. It turned out that it's > a bug when run with Python 3.4; will send out a fix soon. > > On Sun, Jul 12, 201

Re: pyspark.sql.tests: is test_time_with_timezone a flaky test?

2015-07-12 Thread Davies Liu
as a hot fix for this test case, and I already have > it in the commit log- > > commit 05ac023dc8d9004a27c2f06ee875b0ff3743ccdd > > Author: Davies Liu > Date: Fri Jul 10 13:05:23 2015 -0700 > [HOTFIX] fix flaky test in PySpark SQL > > I looked at the test code, and

Re: [PySpark DataFrame] When a Row is not a Row

2015-07-12 Thread Davies Liu
We finally fixed this in 1.5 (the next release), see https://github.com/apache/spark/pull/7301 On Sat, Jul 11, 2015 at 10:32 PM, Jerry Lam wrote: > Hi guys, > > I just hit the same problem. It is very confusing when Row is not the same > Row type at runtime. The worst thing is that when I use Spark in

Re: Python UDF performance at large scale

2015-06-25 Thread Davies Liu
00 when I did the tests, > because I was worried about deadlocks. Do you have any concerns regarding > the batched synchronous version of communication between the Java and Python > processes, and if not, should I file a ticket and starting writing it? > > On Wed, Jun 24, 2015

Re: Python UDF performance at large scale

2015-06-24 Thread Davies Liu
rk, removing it basically made it about 2x faster. > > On Wed, Jun 24, 2015 at 8:33 AM Punyashloka Biswal > wrote: >> >> Hi Davies, >> >> In general, do we expect people to use CPython only for "heavyweight" UDFs >> that invoke an external library? A

Re: Python UDF performance at large scale

2015-06-23 Thread Davies Liu
the correctness and performance characteristics of the synchronous > blocking solution. > > > On Tue, Jun 23, 2015 at 7:21 PM Davies Liu wrote: >> >> Thanks for looking into it, I'd like the idea of having >> ForkingIterator. If we have unlimited buffer in it,

Re: Python UDF performance at large scale

2015-06-23 Thread Davies Liu
Thanks for looking into it; I like the idea of having a ForkingIterator. If we have an unlimited buffer in it, then we will not have the deadlock problem, I think. The writing thread will be blocked by the Python process, so not many rows will be buffered (though it could still be a cause of OOM). At least, this
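A rough, generic sketch of the buffering argument (not the Spark code): with an unbounded queue the writer thread never blocks on the consumer, so the two sides cannot deadlock waiting on each other, at the cost of potentially unbounded memory if the consumer falls behind.

    import threading
    import queue

    buffer = queue.Queue()        # maxsize=0 means unbounded
    SENTINEL = object()

    def writer(rows):
        for row in rows:
            buffer.put(row)       # never blocks on an unbounded queue
        buffer.put(SENTINEL)

    threading.Thread(target=writer, args=(range(10),), daemon=True).start()

    results = []
    while True:
        item = buffer.get()
        if item is SENTINEL:
            break
        results.append(item * 2)  # stand-in for the Python UDF work

    print(results)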

Unit tests can generate spurious shutdown messages

2015-06-02 Thread Mick Davies
If I write unit tests that indirectly initialize org.apache.spark.util.Utils, for example by using SQL types, but produce no logging, I get the following unpleasant stack trace in my test output. This is caused by the Utils class adding a shutdown hook which logs the message logDebug("Shutdown hook ca

Re: Spark 1.4.0 pyspark and pylint breaking

2015-05-26 Thread Davies Liu
I think relative imports cannot help in this case. When you run scripts in pyspark/sql, Python doesn't know anything about the pyspark.sql package; it just sees types.py as a separate module. On Tue, May 26, 2015 at 12:44 PM, Punyashloka Biswal wrote: > Davies: Can we use relative imports (import .t
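A minimal reproduction of the clash, assuming the working directory (and hence the front of sys.path) is python/pyspark/sql: the local types.py shadows the standard-library module of the same name.

    # Run from inside python/pyspark/sql (hypothetical scenario):
    import types
    print(types.__file__)   # resolves to pyspark/sql/types.py, not the stdlib module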

Re: Spark 1.4.0 pyspark and pylint breaking

2015-05-26 Thread Davies Liu
> Thanks for clarifying! I don't understand python package and module names > that well, but I thought that the package namespacing would've helped, since > you are in pyspark.sql.types. I guess not? > > On Tue, May 26, 2015 at 3:03 PM Davies Liu wrote: >> >>

Re: Spark 1.4.0 pyspark and pylint breaking

2015-05-26 Thread Davies Liu
There is a module called 'types' in python 3: davies@localhost:~/work/spark$ python3 Python 3.4.1 (v3.4.1:c0e311e010fc, May 18 2014, 00:54:21) [GCC 4.2.1 (Apple Inc. build 5666) (dot 3)] on darwin Type "help", "copyright", "credits" or "license"

Re: Tungsten's Vectorized Execution

2015-05-21 Thread Davies Liu
We have not started to prototype the vectorized one yet; it will be evaluated in 1.5 and may be targeted for 1.6. We'd be glad to hear feedback/suggestions/comments from your side! On Thu, May 21, 2015 at 9:37 AM, Yijie Shen wrote: > Hi all, > > I’ve seen the Blog of Project Tungsten here, it sounds awe

Re: [SparkR] is toDF() necessary

2015-05-17 Thread Davies Liu
toDF() was first introduced in Scala and Python (because createDataFrame is too long) and is used in lots of places; I think it's useful. On Fri, May 8, 2015 at 11:03 AM, Shivaram Venkataraman wrote: > Agree that toDF is not very useful. In fact it was removed from the > namespace in a recent change > h

Re: [PySpark DataFrame] When a Row is not a Row

2015-05-12 Thread Davies Liu
(called `Row`). -- Davies Liu On Tuesday, May 12, 2015 at 4:49 AM, Nicholas Chammas wrote: > This is really strange. > > # Spark 1.3.1 > >

Re: Query regarding inferring data types in pyspark

2015-04-15 Thread Davies Liu
lter that compares the dates. > The query I am using is : > df.filter(df.Datecol > datetime.date(2015,1,1)).show() > > I do not want to use date as a string to compare them. Please suggest. > > > On Tue, Apr 14, 2015 at 4:59 AM, Davies Liu wrote: >> >&g

Re: extended jenkins downtime, thursday april 9th 7am-noon PDT (moving to anaconda python & more)

2015-04-14 Thread Davies Liu
Hey Shane, Have you updated all the jenkins slaves? There is a run with old configurations (no Python 3, with 130 minutes timeout), see https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/666/consoleFull Davies On Thu, Apr 9, 2015 at 10:18 AM, shane knapp wrote: > ok, we

Re: Query regarding inferring data types in pyspark

2015-04-13 Thread Davies Liu
Hey Suraj, You should use "date" for the DataType: df.withColumn("DateCol", df.DateCol.cast("date")) Davies On Sat, Apr 11, 2015 at 10:57 PM, Suraj Shetiya wrote: > Humble reminder > > On Sat, Apr 11, 2015 at 12:16 PM, Suraj Shetiya > wrote: >> >> Hi, >>
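A fuller sketch of the suggestion (the column name and sample values are hypothetical): cast the string column to a date, then compare it against a Python datetime.date rather than a string.

    import datetime
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([("2015-06-01",), ("2014-12-31",)], ["DateCol"])

    # Cast to a real date type, then filter with a date literal.
    df = df.withColumn("DateCol", col("DateCol").cast("date"))
    df.filter(col("DateCol") > datetime.date(2015, 1, 1)).show()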

Re: Query regarding inferring data types in pyspark

2015-04-10 Thread Davies Liu
What's the format you have in the JSON file? On Fri, Apr 10, 2015 at 6:57 PM, Suraj Shetiya wrote: > Hi, > > In pyspark, when I read a JSON file using sqlContext I find that the date > field is not inferred as a date; instead it is converted to a string. And when I > try to convert it to a date using df.wi

Re: Haskell language Spark support

2015-04-03 Thread Davies Liu
The PR for integrating SparkR into Spark may help: https://github.com/apache/spark/pull/5096 -- Davies Liu On Wednesday, March 25, 2015 at 7:35 PM, danilo2 wrote: > Hi! > I'm a Haskell developer and I have created many haskel

Re: Iterative pyspark / scala codebase development

2015-03-27 Thread Davies Liu
The IPython shell is stateful; it will have unexpected behavior when you reload the library. > 2015-03-27 10:21 GMT-07:00 Davies Liu : > >> put these lines in your ~/.bash_profile >> >> export SPARK_PREPEND_CLASSES=true >> export SPARK_HOME=path_to_spark >> expor

Re: Iterative pyspark / scala codebase development

2015-03-27 Thread Davies Liu
stop this Then in another terminal you could run python tests as $ cd python/pyspark/ $ python rdd.py cc to dev list On Fri, Mar 27, 2015 at 10:15 AM, Stephen Boesch wrote: > Which aspect of that page are you suggesting provides a more optimized > alternative? > > 2015-03-27 10:1

Re: Iterative pyspark / scala codebase development

2015-03-27 Thread Davies Liu
I usually just open a terminal to run `build/sbt ~compile`, code in IntelliJ, then run the Python tests in another terminal once it has compiled successfully. On Fri, Mar 27, 2015 at 10:11 AM, Reynold Xin wrote: > Python is tough if you need to change Scala at the same time. > > sbt/sbt assembly/assembl

Re: Iterative pyspark / scala codebase development

2015-03-27 Thread Davies Liu
see https://cwiki.apache.org/confluence/display/SPARK/Useful+Developer+Tools On Fri, Mar 27, 2015 at 10:02 AM, Stephen Boesch wrote: > I am iteratively making changes to the scala side of some new pyspark code > and re-testing from the python/pyspark side. > > Presently my only solution is to reb

Re: functools.partial as UserDefinedFunction

2015-03-25 Thread Davies Liu
It’s good to support functools.partial; could you file a JIRA for it? On Wednesday, March 25, 2015 at 5:42 AM, Karlson wrote: > > Hi all, > > passing a functools.partial function as a UserDefinedFunction to > DataFrame.select raises an AttributeError, because functools.partial > does
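A hedged workaround until partial objects are supported directly: wrap the partial in a plain function or a lambda, which gives the UDF machinery the attributes (such as __name__) it expects. The add/add_ten names are illustrative only.

    import functools
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import udf
    from pyspark.sql.types import IntegerType

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([(1,), (2,)], ["x"])

    def add(a, b):
        return a + b

    add_ten = functools.partial(add, b=10)

    # Wrapping in a lambda sidesteps the missing __name__ on partial objects.
    add_ten_udf = udf(lambda a: add_ten(a), IntegerType())
    df.select(add_ten_udf(df.x).alias("x_plus_10")).show()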

Re: Caching tables at column level

2015-02-13 Thread Mick Davies
Thanks - we have tried this and it works nicely.

Re: Optimize encoding/decoding strings when using Parquet

2015-02-13 Thread Mick Davies
I have put in a PR on Parquet to support dictionaries when filters are pushed down, which should reduce binary conversion overhead when Spark pushes down string predicates on columns that are dictionary encoded. https://github.com/apache/incubator-parquet-mr/pull/117 It's blocked at the moment as

Re: CallbackServer in PySpark Streaming

2015-02-11 Thread Davies Liu
Yes. On Wed, Feb 11, 2015 at 5:44 PM, Todd Gao wrote: > Thanks Davies. > I am not quite familiar with Spark Streaming. Do you mean that the compute > routine of DStream is only invoked in the driver node, > while only the compute routines of RDD are distributed to the slaves? > &

Re: CallbackServer in PySpark Streaming

2015-02-11 Thread Davies Liu
The CallbackServer is part of Py4J; it's only used in the driver, not in the slaves or workers. On Wed, Feb 11, 2015 at 12:32 AM, Todd Gao wrote: > Hi all, > > I am reading the code of PySpark and its Streaming module. > > In PySpark Streaming, when the `compute` method of the instance of > PythonTr

Caching tables at column level

2015-02-01 Thread Mick Davies
I have been working a lot recently with denormalised tables with lots of columns, nearly 600. We are using this form to avoid joins. I have tried to use cache table with this data, but it proves too expensive as it seems to try to cache all the data in the table. For data sets such as the one I
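A hedged sketch of one way to approximate column-level caching: cache only a projection of the columns a workload actually needs rather than the full ~600-column table. The table and column names below are hypothetical.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    wide = spark.table("denormalised_events")            # hypothetical wide table
    narrow = wide.select("user_id", "event_time", "amount")
    narrow.cache()
    narrow.count()                                        # materialise the cache

    # Roughly equivalent SQL form:
    # spark.sql("CACHE TABLE events_narrow AS SELECT user_id, event_time, amount FROM denormalised_events")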

Re: How to speed PySpark to match Scala/Java performance

2015-01-29 Thread Davies Liu
u got your ETL finalized, it's not that hard to translate your pure Python jobs into Scala to reduce the cost (it's optional). Nowadays, engineer time is much more expensive than CPU time, so I think we should focus more on the former. That's my 2 cents. Davies On Thu, Jan

Re: Optimize encoding/decoding strings when using Parquet

2015-01-23 Thread Michael Davies
, 2015 at 10:10 AM, Mick Davies <mailto:michael.belldav...@gmail.com>> wrote: > > Looking at Parquet code - it looks like hooks are already in place to > support this. > > In particular PrimitiveConverter has methods hasDictionarySupport and > addValueFro

Are there any plans to run Spark on top of Succinct

2015-01-22 Thread Mick Davies
http://succinct.cs.berkeley.edu/wp/wordpress/ Looks like a really interesting piece of work that could dovetail well with Spark. I have been trying recently to optimize some queries I have running on Spark on top of Parquet but the support from Parquet for predicate push down especially for dict

Re: Optimize encoding/decoding strings when using Parquet

2015-01-19 Thread Mick Davies
Looking at Parquet code - it looks like hooks are already in place to support this. In particular PrimitiveConverter has methods hasDictionarySupport and addValueFromDictionary for this purpose. These are not used by CatalystPrimitiveConverter. I think that it would be pretty straightforward to

Re: Optimize encoding/decoding strings when using Parquet

2015-01-19 Thread Mick Davies
Here are some timings showing the effect of caching the last Binary->String conversion. Query times are reduced significantly, and the reduction in timing variation due to less garbage is very significant. The sample queries select various columns, apply some filtering and then aggregate. Spark 1.2.0

Re: Optimize encoding/decoding strings when using Parquet

2015-01-19 Thread Mick Davies
Added a JIRA to track https://issues.apache.org/jira/browse/SPARK-5309

Optimize encoding/decoding strings when using Parquet

2015-01-16 Thread Mick Davies
Hi, It seems that a reasonably large proportion of query time using Spark SQL is spent decoding Parquet Binary objects to produce Java Strings. Has anyone considered trying to optimize these conversions, as many are duplicated? Details are outlined in the conversation on the user mailing
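A tiny Python sketch of the underlying idea (not the Parquet/Spark code): when a column is dictionary encoded, the same binary values recur many times, so memoising the bytes-to-string conversion avoids repeated decoding and the garbage it creates.

    from functools import lru_cache

    @lru_cache(maxsize=None)
    def decode(raw: bytes) -> str:
        return raw.decode("utf-8")

    # Hypothetical dictionary-encoded column chunk with many repeated values.
    column_chunk = [b"GB", b"US", b"GB", b"GB", b"US"]
    print([decode(v) for v in column_chunk])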

Re: Use of MapConverter, ListConverter in python to java object conversion

2015-01-13 Thread Davies Liu
It's not necessary; I will create a PR to remove them. For larger dict/list/tuple objects, the pickle approach may need fewer RPC calls and give better performance. Davies On Tue, Jan 13, 2015 at 4:53 AM, Meethu Mathew wrote: > Hi all, > > In the python object to java conversion done in the meth

Re: Python to Java object conversion of numpy array

2015-01-13 Thread Davies Liu
it did not break any tests. Could you do it in your PR, or should I create a PR for it separately? > Hope it's clear now. > > Regards, > Meethu > > On Monday 12 January 2015 11:35 PM, Davies Liu wrote: > > On Sun, Jan 11, 2015 at 10:21 PM, Meethu Mathew > wrote: > Hi, &

Re: Python to Java object conversion of numpy array

2015-01-12 Thread Davies Liu
What does the Java API look like? All the arguments of findPredict should be converted into Java objects, so what should `mu` be converted to? > Regards, > Meethu > On Monday 12 January 2015 11:46 AM, Davies Liu wrote: > > Could you post a piece of code here? > > On Sun, Jan 1

Re: Python to Java object conversion of numpy array

2015-01-11 Thread Davies Liu
Could you post a piece of code here? On Sun, Jan 11, 2015 at 9:28 PM, Meethu Mathew wrote: > Hi, > Thanks Davies . > > I added a new class GaussianMixtureModel in clustering.py and the method > predict in it and trying to pass numpy array from this method.I converted it > to D

Re: Python to Java object conversion of numpy array

2015-01-09 Thread Davies Liu
Hey Meethu, The Java API accepts only Vector, so you should convert the numpy array into a pyspark.mllib.linalg.DenseVector. BTW, which class are you using? KMeansModel.predict() accepts numpy.array and will do the conversion for you. Davies On Fri, Jan 9, 2015 at 4:45 AM, Meethu Mathew
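A minimal sketch of that conversion (the array values are illustrative):

    import numpy as np
    from pyspark.mllib.linalg import Vectors

    arr = np.array([0.1, 0.2, 0.3])

    # Vectors.dense accepts a numpy array directly and yields a DenseVector
    # that can be passed to the Java-backed model methods.
    vec = Vectors.dense(arr)
    print(vec)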

Re: Help, pyspark.sql.List flatMap results become tuple

2014-12-30 Thread Davies Liu
This should be fixed in 1.2, could you try it? On Mon, Dec 29, 2014 at 8:04 PM, guoxu1231 wrote: > Hi pyspark guys, > > I have a json file, and its struct like below: > > {"NAME":"George", "AGE":35, "ADD_ID":1212, "POSTAL_AREA":1, > "TIME_ZONE_ID":1, "INTEREST":[{"INTEREST_NO":1, "INFO":"x"}, > {

Re: Adding third party jars to classpath used by pyspark

2014-12-30 Thread Davies Liu
On Mon, Dec 29, 2014 at 7:39 PM, Jeremy Freeman wrote: > Hi Stephen, it should be enough to include > >> --jars /path/to/file.jar > > in the command line call to either pyspark or spark-submit, as in > >> spark-submit --master local --jars /path/to/file.jar myfile.py Unfortunately, you also need

Re: [VOTE] Designating maintainers for some Spark components

2014-11-07 Thread Davies Liu
I'd revert my vote to +1. Sorry for this. Davies On Fri, Nov 7, 2014 at 3:18 PM, Davies Liu wrote: > -1 (not binding, +1 for maintainer, -1 for sign off) > > Agree with Greg and Vinod. In the beginning, everything is better > (more efficient, more focused), but after some time, fighting

Re: [VOTE] Designating maintainers for some Spark components

2014-11-07 Thread Davies Liu
erent components will have different code styles (among other things). Right now, maintainers are kind of the first or best contacts, the best people to review a PR in that component. We could announce it, so new contributors can easily find the right person to review. My 2 cents. Davies On Thu, Nov 6, 2014

Re: short jenkins downtime -- trying to get to the bottom of the git fetch timeouts

2014-10-18 Thread Davies Liu
Cool, the 4 most recent builds have used the new configs, thanks! Let's run more builds. Davies On Fri, Oct 17, 2014 at 11:06 PM, Josh Rosen wrote: > I think that the fix was applied. Take a look at > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21874/consoleFull

Re: short jenkins downtime -- trying to get to the bottom of the git fetch timeouts

2014-10-17 Thread Davies Liu
How can we know the changes have been applied? I checked several recent builds; they all use the original configs. Davies On Fri, Oct 17, 2014 at 6:17 PM, Josh Rosen wrote: > FYI, I edited the Spark Pull Request Builder job to try this out. Let’s see > if it works (I’ll be around to

Re: short jenkins downtime -- trying to get to the bottom of the git fetch timeouts

2014-10-17 Thread Davies Liu
a try? Davies [1] https://wiki.jenkins-ci.org/display/JENKINS/GitHub+pull+request+builder+plugin On Fri, Oct 17, 2014 at 5:00 PM, shane knapp wrote: > actually, nvm, you have to run that command from our servers to affect > our limit. run it all you want from your own machines! :P

Re: TorrentBroadcast slow performance

2014-10-07 Thread Davies Liu
Could you create a JIRA for it? Maybe it's a regression after https://issues.apache.org/jira/browse/SPARK-3119. We would appreciate it if you could tell us how to reproduce it. On Mon, Oct 6, 2014 at 1:27 AM, Guillaume Pitel wrote: > Hi, > > I've had no answer to this on u...@spark.apache.org, so

Re: A Comparison of Platforms for Implementing and Running Very Large Scale Machine Learning Algorithms

2014-08-13 Thread Davies Liu
On Wed, Aug 13, 2014 at 2:31 PM, Davies Liu wrote: > On Wed, Aug 13, 2014 at 2:16 PM, Ignacio Zendejas > wrote: >> Yep, I thought it was a bogus comparison. >> >> I should rephrase my question as it was poorly phrased: on average, how >> much faster is Spark v.

Re: A Comparison of Platforms for Implementing and Running Very Large Scale Machine Learning Algorithms

2014-08-13 Thread Davies Liu
On Wed, Aug 13, 2014 at 2:16 PM, Ignacio Zendejas wrote: > Yep, I thought it was a bogus comparison. > > I should rephrase my question as it was poorly phrased: on average, how > much faster is Spark v. PySpark (I didn't really mean Scala v. Python)? > I've only used Spark and don't have a chance

Re: better compression codecs for shuffle blocks?

2014-07-14 Thread Davies Liu
Maybe we could try LZ4 [1], which has better performance and a smaller footprint than LZF and Snappy. In fast scan mode, the performance is 1.5 - 2x higher than LZF [2], while the memory used is 10x smaller than LZF (16k vs 190k). [1] https://github.com/jpountz/lz4-java [2] http://ning.github.io/jvm-compr
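Switching the block compression codec used for shuffle and broadcast data is a one-line configuration change; a hedged sketch using the lz4 short name:

    from pyspark import SparkConf, SparkContext

    # lz4, lzf and snappy are accepted short names for the block codec.
    conf = SparkConf().set("spark.io.compression.codec", "lz4")
    sc = SparkContext(conf=conf)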

Re: PySpark Driver from Jython

2014-07-10 Thread davies
The function run in the worker is serialized in the driver, so the driver and worker should run the same Python interpreter. If you do not need C extension support, then Jython will be better than CPython, because the cost of serialization is much lower. Davies