Re: [VOTE] Release Apache Spark 1.5.0 (RC2)

2015-08-27 Thread Reynold Xin
Marcelo - please submit a patch anyway. If we don't include it in this release, it will go into 1.5.1. On Thu, Aug 27, 2015 at 4:56 PM, Marcelo Vanzin wrote: > On Thu, Aug 27, 2015 at 4:42 PM, Marcelo Vanzin > wrote: > > The Windows issue Sen raised could be considered a regression / > > blo

Re: Tungsten off heap memory access for C++ libraries

2015-08-29 Thread Reynold Xin
Supporting non-JVM code without memory copying and serialization is actually one of the motivations behind Tungsten. We didn't talk much about it since it is not end-user-facing and it is still too early. There are a few challenges still: 1. Spark cannot run entirely in off-heap mode (by entirely

Re: Research of Spark scalability / performance issues

2015-08-29 Thread Reynold Xin
Both 2 and 3 are pretty good topics for a master's project, I think. You can also look into how one can improve Spark's scheduler throughput. A couple of years ago Kay measured it, but things have changed. It would be great to start with measurement, and then look at where the bottlenecks are, and see how

Re: Tungsten off heap memory access for C++ libraries

2015-08-31 Thread Reynold Xin
On Sun, Aug 30, 2015 at 5:58 AM, Paul Weiss wrote: > > Also, is this work being done on a branch I could look into further and > try out? > > We don't have a branch yet -- because there is no code or design for this yet. As I said, it is one of the motivations behind Tungsten, but it is fairly e

Re: Tungsten off heap memory access for C++ libraries

2015-08-31 Thread Reynold Xin
:12 AM, Reynold Xin wrote: > > On Sun, Aug 30, 2015 at 5:58 AM, Paul Weiss > wrote: > >> >> Also, is this work being done on a branch I could look into further and >> try out? >> >> > We don't have a branch yet -- because there is no code nor

Re: [VOTE] Release Apache Spark 1.5.0 (RC2)

2015-08-31 Thread Reynold Xin
5.1. com.databricks.spark.csv - read/write OK >> (--packages com.databricks:spark-csv_2.11:1.2.0-s_2.11 didn’t work. But >> com.databricks:spark-csv_2.11:1.2.0 worked) >> 6.0. DataFrames >> 6.1. cast,dtypes OK >> 6.2. groupBy,avg,crosstab,corr,isNull,na.drop OK

Re: Tungsten off heap memory access for C++ libraries

2015-09-01 Thread Reynold Xin
Please do. Thanks. On Mon, Aug 31, 2015 at 5:00 AM, Paul Weiss wrote: > Sounds good, want me to create a jira and link it to SPARK-9697? Will put > down some ideas to start. > On Aug 31, 2015 4:14 AM, "Reynold Xin" wrote: > >> BTW if you are interested in this,

[VOTE] Release Apache Spark 1.5.0 (RC3)

2015-09-01 Thread Reynold Xin
Please vote on releasing the following candidate as Apache Spark version 1.5.0. The vote is open until Friday, Sep 4, 2015 at 21:00 UTC and passes if a majority of at least 3 +1 PMC votes are cast. [ ] +1 Release this package as Apache Spark 1.5.0 [ ] -1 Do not release this package because ... To

Re: Code generation for GPU

2015-09-03 Thread Reynold Xin
See responses inline. On Thu, Sep 3, 2015 at 1:58 AM, kiran lonikar wrote: > Hi, > > 1. I found where the code generation > happens > in

Re: [VOTE] Release Apache Spark 1.5.0 (RC3)

2015-09-04 Thread Reynold Xin
mes are lowercase (i.e. now ‘sum(OrderPrice)’; >> previously ‘SUM(OrderPrice)’, now ‘avg(Total)’; previously 'AVG(Total)'). >> So programs that depend on the case of the synthetic column names would >> fail. >> 2. orders_3.groupBy("Year","Month"

Re: (Spark SQL) partition-scoped UDF

2015-09-04 Thread Reynold Xin
Can you say more about your transformer? This is a good idea, and indeed we are doing it for R already (the latest way to run UDFs in R is to pass the entire partition as a local R dataframe for users to run on). However, what works for R for simple data processing might not work for your high per

Re: [VOTE] Release Apache Spark 1.5.0 (RC3)

2015-09-05 Thread Reynold Xin
Thanks, Krishna, for the report. We should fix your problem using the Python UDFs in 1.6 too. I'm going to close this vote now. Thanks everybody for voting. This vote passes with 8 +1 votes (3 binding) and no 0 or -1 votes. +1: Reynold Xin* Tom Graves* Burak Yavuz Michael Armbrust* Davie

Re: Fast Iteration while developing

2015-09-07 Thread Reynold Xin
I usually write a test case for what I want to test, and then run sbt/sbt "~module/test:test-only *MyTestSuite" On Mon, Sep 7, 2015 at 6:02 PM, Justin Uang wrote: > Hi, > > What is the normal workflow for the core devs? > > - Do we need to build the assembly jar to be able to run it from the
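
For readers unfamiliar with the workflow, a minimal sketch of the kind of suite that command targets (the suite name and assertion are illustrative, not from the thread):

```scala
import org.scalatest.FunSuite

// Hypothetical suite matching the command in the message:
//   sbt/sbt "~module/test:test-only *MyTestSuite"
// The leading ~ makes sbt re-run the suite on every source change,
// so no assembly jar is needed while iterating.
class MyTestSuite extends FunSuite {
  test("the behavior under development") {
    assert(1 + 1 === 2) // replace with the change being iterated on
  }
}
```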

Re: groupByKey() and keys with many values

2015-09-08 Thread Reynold Xin
On Tue, Sep 8, 2015 at 6:51 AM, Antonio Piccolboni wrote: > As far as the DB writes, remember spark can retry a computation, so your > writes have to be idempotent (see this thread > , in > which Reynold is a bit optimistic about f

[ANNOUNCE] Announcing Spark 1.5.0

2015-09-09 Thread Reynold Xin
Hi All, Spark 1.5.0 is the sixth release on the 1.x line. This release represents 1400+ patches from 230+ contributors and 80+ institutions. To download Spark 1.5.0, visit the downloads page. Huge thanks go to all of the individuals and organizations involved in development and testing of this r

Re: Did the 1.5 release complete?

2015-09-09 Thread Reynold Xin
Dev/user announcement was made just now. For Maven, I did publish it this afternoon (so it's been a few hours). If it is still not there tomorrow morning, I will look into it. On Wed, Sep 9, 2015 at 2:42 AM, Sean Owen wrote: > I saw the end of the RC3 vote: > > https://mail-archives.apache.or

Re: ClassCastException using DataFrame only when num-executors > 2 ...

2015-09-10 Thread Reynold Xin
Does this still happen on 1.5.0 release? On Mon, Aug 31, 2015 at 9:31 AM, Olivier Girardot wrote: > tested now against Spark 1.5.0 rc2, and same exceptions happen when > num-executors > 2 : > > 15/08/25 10:31:10 WARN scheduler.TaskSetManager: Lost task 0.1 in stage > 5.0 (TID 501, xxx): jav

Re: Spark 1.5.x: Java files in src/main/scala and vice versa

2015-09-10 Thread Reynold Xin
There isn't really any difference, I think, in where you put them. Did you run into a problem? On Thu, Sep 10, 2015 at 6:38 AM, Sean Owen wrote: > I feel like I knew the answer to this but have forgotten. Reynold do > you know about this file? looks like you added it. > > On Thu, Sep 10, 2015 at 1:1

Re: [ANNOUNCE] Announcing Spark 1.5.0

2015-09-11 Thread Reynold Xin
It is already there, but the search index is not updated. Not sure what's going on with Maven Central search. http://repo1.maven.org/maven2/org/apache/spark/spark-parent_2.10/1.5.0/ On Fri, Sep 11, 2015 at 10:21 AM, Ryan Williams < ryan.blake.willi...@gmail.com> wrote: > Any idea why 1.5.0 is not i

Re: Spark 1.5.x: Java files in src/main/scala and vice versa

2015-09-12 Thread Reynold Xin
Most of these files are just package-info.java, there to provide a good package index for JavaDoc. If we move them, we will need to create a folder in the java one for each package that exposes any documentation. And it is very likely we will forget to update package-info.java when we update package.sc

Spark 1.5.1 release

2015-09-14 Thread Reynold Xin
Hi devs, FYI - we have already accumulated an "interesting" list of issues found with the 1.5.0 release. I will work on an RC in the next week or two, depending on how many blocker/critical issues are fixed. https://issues.apache.org/jira/issues/?filter=1221

Re: JDBC Dialect tests

2015-09-14 Thread Reynold Xin
The SPARK-9818 issue you link to actually links to a pull request trying to bring them back. On Mon, Sep 14, 2015 at 1:34 PM, Luciano Resende wrote: > I was looking for the code mentioned in SPARK-9818 and SPARK-6136 that > supposedly is testing MySQL and PostgreSQL using Docker and it seems that > this

Re: Unable to acquire memory errors in HiveCompatibilitySuite

2015-09-14 Thread Reynold Xin
Is this on latest master / branch-1.5? Out of the box we reserve only 16% (0.2 * 0.8) of the memory for execution (e.g. aggregate, join) / shuffle sorting. With a 3GB heap, that's 480MB. So each task gets 480MB / 32 = 15MB, and each operator reserves at least one page for execution. If your page s
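
The arithmetic in the message, as a runnable sketch (the heap size and task count are the values quoted above):

```scala
// Back-of-the-envelope: default execution memory per task
// with a 3GB heap and 32 concurrent tasks.
object PageSizeMath {
  def main(args: Array[String]): Unit = {
    val heapBytes         = 3L * 1024 * 1024 * 1024 // 3 GB
    val executionFraction = 0.2 * 0.8               // 16% reserved for execution/shuffle
    val concurrentTasks   = 32
    val perTaskBytes      = (heapBytes * executionFraction / concurrentTasks).toLong
    println(s"${perTaskBytes / (1024 * 1024)} MB per task") // ~15 MB
  }
}
```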

Re: Unable to acquire memory errors in HiveCompatibilitySuite

2015-09-14 Thread Reynold Xin
Pete - can you do me a favor? https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/shuffle/ShuffleMemoryManager.scala#L174 Print the parameters that are passed into the getPageSize function, and check their values. On Mon, Sep 14, 2015 at 4:32 PM, Reynold Xin wrote

Re: And.eval short circuiting

2015-09-14 Thread Reynold Xin
rxin=# select null and true;
 ?column?
----------

(1 row)

rxin=# select null and false;
 ?column?
----------
 f
(1 row)

null and false should return false. On Mon, Sep 14, 2015 at 9:12 PM, Zack Sampson wrote: > It seems like And.eval can avoid calculating right.eval if left.eval > returns n
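
For context, a minimal Scala sketch of SQL's three-valued (Kleene) AND — not Spark's actual And.eval — showing why the transcript above comes out the way it does:

```scala
// None models SQL NULL ("unknown"); false dominates regardless of NULLs.
def and3(left: Option[Boolean], right: Option[Boolean]): Option[Boolean] =
  (left, right) match {
    case (Some(false), _) | (_, Some(false)) => Some(false)
    case (Some(true), Some(true))            => Some(true)
    case _                                   => None
  }

assert(and3(None, Some(true))  == None)        // null AND true  -> null
assert(and3(None, Some(false)) == Some(false)) // null AND false -> false
```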

Re: Unable to acquire memory errors in HiveCompatibilitySuite

2015-09-14 Thread Reynold Xin
wrote: > Reynold, thanks for replying. > > getPageSize parameters: maxMemory=515396075, numCores=0 > Calculated values: cores=8, default=4194304 > > So am I getting a large page size as I only have 8 cores? > > On 15 September 2015 at 00:40, Reynold Xin wrote: >

Re: Unable to acquire memory errors in HiveCompatibilitySuite

2015-09-15 Thread Reynold Xin
cores to 32. >> >> object TestHive >> extends TestHiveContext( >> new SparkContext( >> System.getProperty("spark.sql.test.master", "local[32]"), >> >> >> On Mon, Sep 14, 2015 at 11:22 PM, Reynold Xin >> wrote: >

Re: Unable to acquire memory errors in HiveCompatibilitySuite

2015-09-15 Thread Reynold Xin
cores there will be some contention and so they >> remain active for longer. >> >> So I think this is a test case issue configuring the number of executors >> too high. >> >> On 15 September 2015 at 18:54, Reynold Xin wrote: >> >>> Maybe we ca

Re: Unable to acquire memory errors in HiveCompatibilitySuite

2015-09-16 Thread Reynold Xin
experiment with the page size calculation to see what effect it has. > > Cheers, > > > > On 16 September 2015 at 06:53, Reynold Xin wrote: > >> It is exactly the issue here, isn't it? >> >> We are using memory / N, where N should be the maximum number of acti

Re: RDD API patterns

2015-09-16 Thread Reynold Xin
I'm not sure what we can do here. Nested RDDs are a pain to implement, support, and explain. The programming model is not well explored. Maybe a UDAF interface that allows going through the data twice? On Mon, Sep 14, 2015 at 4:36 PM, sim wrote: > I'd like to get some feedback on an API design

Re: JENKINS: downtime next week, wed and thurs mornings (9-23 and 9-24)

2015-09-16 Thread Reynold Xin
Thanks Shane and Jon for the heads up. On Wednesday, September 16, 2015, shane knapp wrote: > good morning, denizens of the aether! > > your hard working build system (and some associated infrastructure) > has been in need of some updates and housecleaning for quite a while > now. we will be sp

Re: Unable to acquire memory errors in HiveCompatibilitySuite

2015-09-16 Thread Reynold Xin
.scala:66) >> at org.apache.spark.scheduler.Task.run(Task.scala:88) >> at >> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214) >> at >> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1153) >>

Re: SparkR streaming source code

2015-09-16 Thread Reynold Xin
You should reach out to the speakers directly. On Wed, Sep 16, 2015 at 9:52 AM, Renyi Xiong wrote: > SparkR streaming is mentioned at about page 17 in below pdf, can anyone > share source code? (could not find it on GitHub) > > > > https://spark-summit.org/2015-east/wp-content/uploads/2015/03/S

Re: And.eval short circuiting

2015-09-16 Thread Reynold Xin
o the second filter. Even weirder is that if you call collect() after the > first filter you won't see nulls, and if you write the data to disk and > reread it, the NPE won't happen. > > It's bewildering! Is this the intended behavior? >

Re: New Spark json endpoints

2015-09-16 Thread Reynold Xin
Do we need to increment the version number if it is just strict additions? On Wed, Sep 16, 2015 at 7:10 PM, Kevin Chen wrote: > Just wanted to bring this email up again in case there were any thoughts. > Having all the information from the web UI accessible through a supported > json API is ver

Re: RDD: Execution and Scheduling

2015-09-17 Thread Reynold Xin
Your understanding is mostly correct. Replies inline. On Thu, Sep 17, 2015 at 5:23 AM, gsvic wrote: > After reading some parts of Spark source code I would like to make some > questions about RDD execution and scheduling. > > At first, please correct me if I am wrong at the following: > 1) The n

Re: And.eval short circuiting

2015-09-17 Thread Reynold Xin
er should not reorder the > filters for correctness. Please correct me if I have an incorrect > assumption about the guarantees of the optimizer. > > Is there a bug filed that tracks the change you suggested below, btw? I’d > like to follow the issue, if there’s one. > > Thanks,

Re: Re: bug in Worker.scala, ExecutorRunner is not serializable

2015-09-18 Thread Reynold Xin
Maybe we should add some inline comment explaining why it is ok for that message to be not serializable. On Thu, Sep 17, 2015 at 4:08 AM, Huangguowei wrote: > Thanks for your reply. I just want to do some monitoring, never mind! > > > > *From:* Shixiong Zhu [mailto:zsxw...@gmail.com] > *Sent:* 201

Re: Re: bug in Worker.scala, ExecutorRunner is not serializable

2015-09-18 Thread Reynold Xin
ards, > Shixiong Zhu > > 2015-09-18 15:10 GMT+08:00 Reynold Xin : > >> Maybe we should add some inline comment explaining why it is ok for that >> message to be not serializable. >> >> >> On Thu, Sep 17, 2015 at 4:08 AM, Huangguowei >> wrote: >&

Re: One element per node

2015-09-18 Thread Reynold Xin
Use a global atomic boolean and return nothing from that partition if the boolean is true. Note that your result won't be deterministic. On Sep 18, 2015, at 4:11 PM, Ulanov, Alexander wrote: Thank you! How can I guarantee that I have only one element per executor (per worker, or per physical no
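
A minimal sketch of that suggestion (the object and helper names are mine, for illustration): the flag lives in a singleton object, which is initialized once per executor JVM, so each executor emits at most one element — and, as noted, which element you get is non-deterministic:

```scala
import java.util.concurrent.atomic.AtomicBoolean
import scala.reflect.ClassTag
import org.apache.spark.rdd.RDD

// One flag per executor JVM: the object is loaded once per JVM, not per task.
object OncePerExecutor {
  val taken = new AtomicBoolean(false)
}

def onePerExecutor[T: ClassTag](rdd: RDD[T]): RDD[T] =
  rdd.mapPartitions { iter =>
    // Only the first partition to flip the flag on this executor emits an element.
    if (OncePerExecutor.taken.compareAndSet(false, true) && iter.hasNext) Iterator(iter.next())
    else Iterator.empty
  }
```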

Re: One element per node

2015-09-18 Thread Reynold Xin
global long value and get the element on partition only if > someFunction(partitionId, globalLong)==true? Or by using some specific > partitioner that creates such partitionIds that can be decomposed into > nodeId and number of partitions per node? > > > > *From:* Reynold Xin [mai

Re: spark-shell 1.5 doesn't seem to work in local mode

2015-09-19 Thread Reynold Xin
Maybe you have an hdfs-site.xml lying around somewhere? On Sat, Sep 19, 2015 at 9:14 AM, Madhu wrote: > I downloaded spark-1.5.0-bin-hadoop2.6.tgz recently and installed on > CentOS. > All my local Spark code works fine locally. > > For some odd reason, spark-shell doesn't work in local mode. >

Re: BUILD SYSTEM: fire and power event at UC berkeley's IST colo, jenkins offline

2015-09-19 Thread Reynold Xin
Great! Jon / Shane: Thanks for handling this. On Saturday, September 19, 2015, shane knapp wrote: > we're up and building! time for breakfast... :) > > https://amplab.cs.berkeley.edu/jenkins/ > > On Sat, Sep 19, 2015 at 7:35 AM, shane knapp > wrote: > > it was definitely one of our servers..

Re: RDD: Execution and Scheduling

2015-09-20 Thread Reynold Xin
On Sun, Sep 20, 2015 at 3:58 PM, gsvic wrote: > Concerning answers 1 and 2: > > 1) How Spark determines a node as a "slow node" and how slow is that? > There are two cases here: 1. If a node is busy (e.g. all slots are already occupied), the scheduler cannot schedule anything on it. See "Delay

Re: Join operation on DStreams

2015-09-20 Thread Reynold Xin
stream.map(record => (keyFunction(record), record)) For future reference, this question should go to the user list, not the dev list. On Sun, Sep 20, 2015 at 11:47 PM, guoxu1231 wrote: > Hi Spark Experts, > > I'm trying to use join(otherStream, [numTasks]) on DStreams, and it > requires called o
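
A self-contained sketch of that pattern (the socket sources and the "key,value" line format are assumptions for illustration): join is only defined on pair DStreams, so both sides are keyed first:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("dstream-join").setMaster("local[2]")
val ssc = new StreamingContext(conf, Seconds(10))

// Turn each stream into DStream[(K, V)] before joining.
val left  = ssc.socketTextStream("localhost", 9999)
  .map { line => val Array(k, v) = line.split(","); (k, v) }
val right = ssc.socketTextStream("localhost", 9998)
  .map { line => val Array(k, v) = line.split(","); (k, v) }

left.join(right).print() // DStream[(String, (String, String))]

ssc.start()
ssc.awaitTermination()
```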

Re: Null Value in DecimalType column of DataFrame

2015-09-21 Thread Reynold Xin
+dev list Hi Dirceu, The answer to whether throwing an exception or returning null is better depends on your use case. If you are debugging and want to find bugs in your program, you might prefer throwing an exception. However, if you are running on a large real-world dataset (i.e. data is dirt

Re: DataFrames Aggregate does not spill?

2015-09-21 Thread Reynold Xin
What's the plan if you run explain? In 1.5 the default should be TungstenAggregate, which does spill (switching from hash-based aggregation to sort-based aggregation). On Mon, Sep 21, 2015 at 5:34 PM, Matt Cheah wrote: > Hi everyone, > > I’m debugging some slowness and apparent memory pressure

Re: Why Filter return a DataFrame object in DataFrame.scala?

2015-09-22 Thread Reynold Xin
There is an implicit conversion in scope https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/DataFrame.scala#L153 /** * An implicit conversion function internal to this class for us to avoid doing * "new DataFrame(...)" everywhere. */ @inline pri
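
A toy illustration of the idiom (deliberately not Spark's real classes): with a private implicit conversion in scope, methods can return the plan type and have it silently wrapped:

```scala
import scala.language.implicitConversions

object ImplicitReturnDemo {
  case class Plan(desc: String)

  class Frame(val plan: Plan) {
    // Private to the class, so only Frame's own methods benefit from the conversion.
    @inline private implicit def toFrame(p: Plan): Frame = new Frame(p)

    // The body has type Plan; the implicit turns it into a Frame at the return.
    def filter(cond: String): Frame = Plan(s"Filter($cond) <- ${plan.desc}")
  }

  def main(args: Array[String]): Unit =
    println(new Frame(Plan("Scan")).filter("x > 1").plan.desc) // Filter(x > 1) <- Scan
}
```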

[VOTE] Release Apache Spark 1.5.1 (RC1)

2015-09-24 Thread Reynold Xin
Please vote on releasing the following candidate as Apache Spark version 1.5.1. The vote is open until Sun, Sep 27, 2015 at 10:00 UTC and passes if a majority of at least 3 +1 PMC votes are cast. [ ] +1 Release this package as Apache Spark 1.5.1 [ ] -1 Do not release this package because ... The

Re: [VOTE] Release Apache Spark 1.5.1 (RC1)

2015-09-24 Thread Reynold Xin
I forked a new thread for this. Please discuss NOTICE file related things there so it doesn't hijack this thread. On Thu, Sep 24, 2015 at 10:51 AM, Sean Owen wrote: > On Thu, Sep 24, 2015 at 6:45 PM, Richard Hillegas > wrote: > > Under your guidance, I would be happy to help compile a NOTICE f

[Discuss] NOTICE file for transitive "NOTICE"s

2015-09-24 Thread Reynold Xin
Richard, Thanks for bringing this up and this is a great point. Let's start another thread for it so we don't hijack the release thread. On Thu, Sep 24, 2015 at 10:51 AM, Sean Owen wrote: > On Thu, Sep 24, 2015 at 6:45 PM, Richard Hillegas > wrote: > > Under your guidance, I would be happy t

Re: [VOTE] Release Apache Spark 1.5.1 (RC1)

2015-09-24 Thread Reynold Xin
I'm going to +1 this myself. Tested on my laptop. On Thu, Sep 24, 2015 at 10:56 AM, Reynold Xin wrote: > I forked a new thread for this. Please discuss NOTICE file related things > there so it doesn't hijack this thread. > > > On Thu, Sep 24, 2015 at 10:51 AM, Sean

Re: [VOTE] Release Apache Spark 1.5.1 (RC1)

2015-09-24 Thread Reynold Xin
getY work now (Thanks). Slower; saturates the CPU. A > non-scientific snapshot below. I know that this really has to be done more > rigorously, on a bigger machine, with more cores et al.. > > On Thu, Sep 24, 2015 at 12:27 AM, Reynold X

Re: [VOTE] Release Apache Spark 1.5.1 (RC1)

2015-09-27 Thread Reynold Xin
Thanks everybody for voting. I'm going to close the vote now. The vote passes with 17 +1 votes and 1 -1 vote. I will work on packaging this asap. +1: Reynold Xin* Sean Owen Hossein Falaki Xiangrui Meng* Krishna Sankar Joseph Bradley Sean McNamara* Luciano Resende Doug Balog Eugene Zhu

Re: Dataframe nested schema inference from Json without type conflicts

2015-10-01 Thread Reynold Xin
You can pass the schema into json directly, can't you? On Thu, Oct 1, 2015 at 10:33 AM, Ewan Leith wrote: > Hi all, > > > > We really like the ability to infer a schema from JSON contained in an > RDD, but when we’re using Spark Streaming on small batches of data, we > sometimes find that Spark
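
A sketch of that suggestion against the 1.5-era API (the field names and the jsonRdd input are assumptions for illustration):

```scala
import org.apache.spark.sql.types.{LongType, StringType, StructField, StructType}

// Fixing the schema up front means small streaming batches don't each re-infer
// (and potentially disagree on) column types.
val schema = StructType(Seq(
  StructField("id", LongType, nullable = true),
  StructField("payload", StringType, nullable = true)
))

// Assumes sqlContext: SQLContext and jsonRdd: RDD[String] are in scope.
val df = sqlContext.read.schema(schema).json(jsonRdd)
```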

[ANNOUNCE] Announcing Spark 1.5.1

2015-10-01 Thread Reynold Xin
Hi All, Spark 1.5.1 is a maintenance release containing stability fixes. This release is based on the branch-1.5 maintenance branch of Spark. We *strongly recommend* all 1.5.0 users to upgrade to this release. The full list of bug fixes is here: http://s.apache.org/spark-1.5.1 http://spark.apach

Re: [Build] repo1.maven.org: spark libs 1.5.0 for scala 2.10 poms are broken (404)

2015-10-02 Thread Reynold Xin
Both work for me. It's possible maven.org is having problems with some servers. On Fri, Oct 2, 2015 at 11:08 AM, Ted Yu wrote: > Andy: > 1.5.1 has been released. > > Maybe you can use this: > > https://repo1.maven.org/maven2/org/apache/spark/spark-streaming_2.10/1.5.1/spark-streaming_2.10-1.5.1

Re: Python UDAFs

2015-10-02 Thread Reynold Xin
No, not yet. On Fri, Oct 2, 2015 at 12:20 PM, Justin Uang wrote: > Hi, > > Is there a Python API for UDAFs? > > Thanks! > > Justin >

Re: IllegalArgumentException: Size exceeds Integer.MAX_VALUE

2015-10-05 Thread Reynold Xin
You can write the data to local hdfs (or local disk) and just load it from there. On Mon, Oct 5, 2015 at 4:37 PM, Jegan wrote: > Thanks for your suggestion Ted. > > Unfortunately at this point of time I cannot go beyond 1000 partitions. I > am writing this data to BigQuery and it has a limit of

Re: IllegalArgumentException: Size exceeds Integer.MAX_VALUE

2015-10-05 Thread Reynold Xin
> Jegan > > On Mon, Oct 5, 2015 at 4:42 PM, Reynold Xin wrote: > >> You can write the data to local hdfs (or local disk) and just load it >> from there. >> >> >> On Mon, Oct 5, 2015 at 4:37 PM, Jegan wrote: >>> Thanks for your suggestion Ted. >

Re: Pyspark dataframe read

2015-10-06 Thread Reynold Xin
I think the problem is that a comma is actually a legitimate character in a file name, and as a result ... On Tuesday, October 6, 2015, Josh Rosen wrote: > Could someone please file a JIRA to track this? > https://issues.apache.org/jira/browse/SPARK > > On Tue, Oct 6, 2015 at 1:21 AM, Koert Kuipers

multiple count distinct in SQL/DataFrame?

2015-10-06 Thread Reynold Xin
The current implementation of multiple count distinct in a single query is quite poor in terms of performance and robustness, and it is also hard to guarantee its correctness through some of the refactorings for Tungsten. Supporting a better version of it is possible in the future,

Re: multiple count distinct in SQL/DataFrame?

2015-10-06 Thread Reynold Xin
(distinct colA, colB) from foo; On Tue, Oct 6, 2015 at 5:51 PM, Reynold Xin wrote: > The current implementation of multiple count distinct in a single query is > very inferior in terms of performance and robustness, and it is also hard > to guarantee correctness of the implementation in so

Fwd: multiple count distinct in SQL/DataFrame?

2015-10-07 Thread Reynold Xin
Adding user list too. -- Forwarded message -- From: Reynold Xin Date: Tue, Oct 6, 2015 at 5:54 PM Subject: Re: multiple count distinct in SQL/DataFrame? To: "dev@spark.apache.org" To provide more context, if we do remove this feature, the following SQL query woul

Re: Scala 2.11 builds broken/ Can the PR build run also 2.11?

2015-10-08 Thread Reynold Xin
The problem only applies to the sbt build because it treats warnings as errors. @Iulian - how about we disable warnings -> errors for 2.11? That would seem better until we switch 2.11 to be the default build. On Thu, Oct 8, 2015 at 7:55 AM, Ted Yu wrote: > I tried building with Scala 2.11 on L
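
One way to express that toggle in sbt, as a hedged sketch (this is standard sbt syntax, not necessarily how Spark's build wires its flags):

```scala
// In build.sbt: fatal warnings on 2.10, plain warnings on 2.11,
// until 2.11 becomes the default build.
scalacOptions ++= {
  if (scalaBinaryVersion.value.startsWith("2.11")) Seq.empty
  else Seq("-Xfatal-warnings")
}
```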

Re: spark over drill

2015-10-08 Thread Reynold Xin
You probably saw that in a presentation given by the Drill team. You should check with them on that. On Thu, Oct 8, 2015 at 11:51 AM, Pranay Tonpay wrote: > hi ,, > Is spark-drill integration already done ? if yes, which spark version > supports it ... it was in the "upcoming list for 2015" is wh

a few major changes / improvements for Spark 1.6

2015-10-12 Thread Reynold Xin
Hi Spark devs, It is hard to track everything going on in Spark with so many pull requests and JIRA tickets. Below are 4 major improvements that will likely be in Spark 1.6. We have already done prototyping for all of them, and want feedback on their design. 1. SPARK-9850 Adaptive query executio

Re: [SQL] Memory leak with spark streaming and spark sql in spark 1.5.1

2015-10-14 Thread Reynold Xin
+dev list On Wed, Oct 14, 2015 at 1:07 AM, Terry Hoo wrote: > All, > > Does anyone meet memory leak issue with spark streaming and spark sql in > spark 1.5.1? I can see the memory is increasing all the time when running > this simple sample: > > val sc = new SparkContext(conf) >

If you use Spark 1.5 and disabled Tungsten mode ...

2015-10-14 Thread Reynold Xin
Can you reply to this email and provide us with reasons why you disable it? Thanks.

Re: Should enforce the uniqueness of field name in DataFrame ?

2015-10-15 Thread Reynold Xin
That could break a lot of applications. In particular, a lot of input data sources (csv, json) don't have clean schemas and can have duplicate column names. For the case of join, maybe a better solution is to ask for a left/right prefix/suffix in the user code, similar to what Pandas does. On Wed,
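
Until a Pandas-style suffixes option exists, here is a user-side sketch of the workaround (the column and key names are illustrative):

```scala
import org.apache.spark.sql.DataFrame

// Rename every column with a side-specific suffix before joining, similar in
// spirit to Pandas' merge(..., suffixes = ("_l", "_r")).
def joinWithSuffixes(left: DataFrame, right: DataFrame, key: String): DataFrame = {
  val l = left.toDF(left.columns.map(_ + "_l"): _*)
  val r = right.toDF(right.columns.map(_ + "_r"): _*)
  l.join(r, l(key + "_l") === r(key + "_r"))
}
```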

Re: Spark Implicit Functions

2015-10-16 Thread Reynold Xin
Thanks for sharing, Bill. On Fri, Oct 16, 2015 at 2:06 PM, Bill Bejeck wrote: > All, > > I just did a post on adding groupByKeyToList and groupByKeyUnique using > implicit classes. I thought it might be useful to someone. > > http://codingjunkie.net/learning-scala-implicits-with-spark/ > > Tha

flaky test "map stage submission with multiple shared stages and failures"

2015-10-17 Thread Reynold Xin
I just saw this happening:

[info] - map stage submission with multiple shared stages and failures *** FAILED *** (566 milliseconds)
[info] java.lang.IndexOutOfBoundsException: 2
[info] at scala.collection.mutable.ResizableArray$class.apply(ResizableArray.scala:43)
[info] at scala.collection.

Re: MapStatus too large for drvier

2015-10-20 Thread Reynold Xin
How big is your driver heap size? And any reason why you'd need 200k map and 200k reduce tasks? On Mon, Oct 19, 2015 at 11:59 PM, yaoqin wrote: > Hi everyone, > > When I run a spark job that contains quite a lot of tasks (in my case > 200,000*200,000), the driver hit OOM, mainly caused by t

Re: If you use Spark 1.5 and disabled Tungsten mode ...

2015-10-20 Thread Reynold Xin
Jerry - I think that's been fixed in 1.5.1. Do you still see it? On Tue, Oct 20, 2015 at 2:11 PM, Jerry Lam wrote: > I disabled it because of the "Could not acquire 65536 bytes of memory". It > happens to fail the job. So for now, I'm not touching it. > > On Tue, Oct 20, 2015 at 4:48 PM, charmee

Fwd: If you use Spark 1.5 and disabled Tungsten mode ...

2015-10-20 Thread Reynold Xin
With Jerry's permission, sending this back to the dev list to close the loop. -- Forwarded message -- From: Jerry Lam Date: Tue, Oct 20, 2015 at 3:54 PM Subject: Re: If you use Spark 1.5 and disabled Tungsten mode ... To: Reynold Xin Yup, coarse grained mode works just

Re: Exception when using cosh

2015-10-21 Thread Reynold Xin
I think we made a mistake and forgot to register the function in the registry: https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala Do you mind submitting a pull request to fix this? Should be a one-line change. I fi

Re: If you use Spark 1.5 and disabled Tungsten mode ...

2015-10-21 Thread Reynold Xin
at > org.apache.spark.sql.execution.UnsafeKVExternalSorter.<init>(UnsafeKVExternalSorter.java:56) at > org.apache.spark.sql.execution.datasources.DynamicPartitionWriterContainer.writeRows(WriterContainer.scala:339) > > On Tue, Oct 20, 2015 at 9:10 PM, Reynold Xin wrote: >> With Jerry'

Re: repartitionAndSortWithinPartitions task shuffle phase is very slow

2015-10-22 Thread Reynold Xin
Why do you do a glom? It seems unnecessarily expensive to materialize each partition in memory. On Thu, Oct 22, 2015 at 2:02 AM, 周千昊 wrote: > Hi, spark community > I have an application which I try to migrate from MR to Spark. > It will do some calculations from Hive and output to h

[VOTE] Release Apache Spark 1.5.2 (RC1)

2015-10-25 Thread Reynold Xin
Please vote on releasing the following candidate as Apache Spark version 1.5.2. The vote is open until Wed Oct 28, 2015 at 08:00 UTC and passes if a majority of at least 3 +1 PMC votes are cast. [ ] +1 Release this package as Apache Spark 1.5.2 [ ] -1 Do not release this package because ... The

Re: [VOTE] Release Apache Spark 1.5.2 (RC1)

2015-10-27 Thread Reynold Xin
kar > wrote: > >> Guys, >>The sc.version returns 1.5.1 in python and scala. Is anyone getting >> the same results ? Probably I am doing something wrong. >> Cheers >> >> >> On Sun, Oct 25, 2015 at 12:07 AM, Reynold Xin > > wrote: >> >>

Re: Exception when using some aggregate operators

2015-10-27 Thread Reynold Xin
Try count(distinct columnName). In SQL, distinct is not part of the function name. On Tuesday, October 27, 2015, Shagun Sodhani wrote: > Oops seems I made a mistake. The error message is : Exception in thread > "main" org.apache.spark.sql.AnalysisException: undefined function > countDistinct > On

Re: Pickle Spark DataFrame

2015-10-28 Thread Reynold Xin
What are you trying to accomplish by pickling a Spark DataFrame? If your dataset is large, it doesn't make much sense to pickle it. If your dataset is small, maybe it's best to just pickle a Pandas dataframe. On Tue, Oct 27, 2015 at 9:47 PM, agg212 wrote: > Hi, I'd like to "pickle" a Spark DataFr

Re: Exception when using some aggregate operators

2015-10-28 Thread Reynold Xin
)""") >>>>>> sqlContext.sql("""select avg(a) from partitionedParquet""").show() >>>>>> >>>>>> Cheers >>>>>> >>>>>> On Tue, Oct 27, 2015 at 10:12 AM, Shagun Sod

Re: Exception when using some aggregate operators

2015-10-28 Thread Reynold Xin
ctions$> > to > be treated as sql operators as well? I do see that these are mentioned as > Functions > available for DataFrame > <http://spark.apache.org/docs/latest/api/scala/org/apache/spark/sql/DataFrame.html> > but > it would be great if you can clarify this. >

Re: If you use Spark 1.5 and disabled Tungsten mode ...

2015-11-01 Thread Reynold Xin
>>> at >>> org.apache.spark.sql.execution.TungstenSort$$anonfun$doExecute$3.apply(sort.scala:169) >>> at >>> org.apache.spark.sql.execution.TungstenSort$$anonfun$doExecute$3.apply(sort.scala:169) >>> at >>> org.apache.spark.rdd.MapPartitionsWithPreparati

Re: Unchecked contribution (JIRA and PR)

2015-11-03 Thread Reynold Xin
Sergio, Usually it takes a lot of effort to get something merged into Spark itself, especially for relatively new algorithms that might not have established themselves yet. I will leave it to the MLlib maintainers to comment on the specifics of the individual algorithms proposed here. Just another genera

Re: Off-heap storage and dynamic allocation

2015-11-03 Thread Reynold Xin
I don't think there is any special handling w.r.t. Tachyon vs in-heap caching. As a matter of fact, I think the current off-heap caching implementation is pretty bad, because: 1. There is no namespace sharing in off-heap mode 2. Similar to 1, you cannot recover the off-heap memory once the Spark driver o

Re: Off-heap storage and dynamic allocation

2015-11-03 Thread Reynold Xin
am using tachyon for caching, if an executor is lost, then > that partition is lost for the purposes of spark? > > On Tue, Nov 3, 2015 at 5:53 PM Reynold Xin wrote: > >> I don't think there is any special handling w.r.t. Tachyon vs in-heap >> caching. As a matte

Re: Off-heap storage and dynamic allocation

2015-11-03 Thread Reynold Xin
contexts, > but where the notebooks can be idle for long periods of time while holding > onto cached rdds. > > On Tue, Nov 3, 2015 at 10:15 PM Reynold Xin wrote: > >> It is lost unfortunately (although can be recomputed automatically). >> >> >> On Tue, Nov 3

[VOTE] Release Apache Spark 1.5.2 (RC2)

2015-11-03 Thread Reynold Xin
Please vote on releasing the following candidate as Apache Spark version 1.5.2. The vote is open until Sat Nov 7, 2015 at 00:00 UTC and passes if a majority of at least 3 +1 PMC votes are cast. [ ] +1 Release this package as Apache Spark 1.5.2 [ ] -1 Do not release this package because ... The r

Please reply if you use Mesos fine grained mode

2015-11-03 Thread Reynold Xin
If you are using Spark with Mesos fine grained mode, can you please respond to this email explaining why you use it over the coarse grained mode? Thanks.

Re: Please reply if you use Mesos fine grained mode

2015-11-03 Thread Reynold Xin
ecutor, causing entire > stages to be retried. In fine-grained mode, only the task fails and > subsequently gets retried without taking out an entire stage or worse. > > On Tue, Nov 3, 2015 at 3:54 PM, Reynold Xin wrote: > >> If you are using Spark with Mesos fine grained mode, can yo

Re: Codegen In Shuffle

2015-11-04 Thread Reynold Xin
GenerateUnsafeProjection -- projects any internal row data structure directly into bytes (UnsafeRow). On Wed, Nov 4, 2015 at 12:21 AM, 牛兆捷 wrote: > Dear all: > > The Tungsten project has mentioned that they are applying code generation > to speed up the conversion of data from in-memory binary f

Re: Sort Merge Join from the filesystem

2015-11-04 Thread Reynold Xin
It's not supported yet, and I'm not sure if there is a ticket for it. I don't think there is anything fundamentally hard here either. On Wed, Nov 4, 2015 at 6:37 AM, Alex Nastetsky < alex.nastet...@vervemobile.com> wrote: > (this is kind of a cross-post from the user list) > > Does Spark support doi

Re: How to force statistics calculation of Dataframe?

2015-11-04 Thread Reynold Xin
Can you use the broadcast hint? e.g. df1.join(broadcast(df2)) the broadcast function is in org.apache.spark.sql.functions On Wed, Nov 4, 2015 at 10:19 AM, Charmee Patel wrote: > Hi, > > If I have a hive table, analyze table compute statistics will ensure Spark > SQL has statistics of that t
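
The hint in context, as a small sketch (the DataFrame names are illustrative):

```scala
import org.apache.spark.sql.functions.broadcast

// Assumes largeDF and smallDF are DataFrames sharing a "key" column.
// broadcast() marks smallDF as broadcastable regardless of table statistics,
// so the planner chooses a broadcast join and ships the small side to every executor.
val joined = largeDF.join(broadcast(smallDF), "key")
```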

Re: How to force statistics calculation of Dataframe?

2015-11-05 Thread Reynold Xin
n dataframe api. > > On Wed, Nov 4, 2015 at 6:49 PM Reynold Xin wrote: > >> Can you use the broadcast hint? >> >> e.g. >> >> df1.join(broadcast(df2)) >> >> the broadcast function is in org.apache.spark.sql.functions >> >> >> >

Re: Need advice on hooking into Sql query plan

2015-11-05 Thread Reynold Xin
You can hack around this by constructing logical plans yourself and then creating a DataFrame in order to execute them. Note that this all depends on internals of the framework and can break when Spark upgrades. On Thu, Nov 5, 2015 at 4:18 PM, Yana Kadiyska wrote: > I don't think a view wo

Re: Looking for the method executors uses to write to HDFS

2015-11-06 Thread Reynold Xin
Are you looking for this? https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetRelation.scala#L69 On Wed, Nov 4, 2015 at 5:11 AM, Tóth Zoltán wrote: > Hi, > > I'd like to write a parquet file from the driver. I could use

Re: [VOTE] Release Apache Spark 1.5.2 (RC2)

2015-11-07 Thread Reynold Xin
on using yesterday's code, > build locally. > > Regression running in Yarn Cluster mode against few internal ML ( logistic > regression, linear regression, random forest and statistic summary) as well > Mlib KMeans. all seems to work fine. > > Chester > > > On Tue,
