Re: Stickers and Swag

2022-06-14 Thread Reynold Xin
Nice! Going to order a few items myself ... On Tue, Jun 14, 2022 at 7:54 PM, Gengliang Wang <ltn...@gmail.com> wrote: > FYI now you can find the shopping information on https://spark.apache.org/community as well :) > Gengliang

Re: A proposal for creating a Knowledge Sharing Hub for Apache Spark Community

2024-03-18 Thread Reynold Xin
One of the problems in the past when something like this was brought up was that the ASF couldn't officially bless venues beyond the already approved ones. So that's something to look into. Now of course you are welcome to run unofficial things unblessed as long as they follow trademark r

Re: DataFrame Column Alias problem

2015-05-21 Thread Reynold Xin
In 1.4 it actually shows col1 by default. In 1.3, you can add "col1" to the output, i.e. df.groupBy($"col1").agg($"col1", count($"col1").as("c")).show() On Thu, May 21, 2015 at 11:22 PM, SLiZn Liu wrote: > However this returns a single column of c, without showing the original > col1.

Re: rdd.sample() methods very slow

2015-05-21 Thread Reynold Xin
You can do something like this: val myRdd = ... val rddSampledByPartition = PartitionPruningRDD.create(myRdd, i => Random.nextDouble() < 0.1) // this samples 10% of the partitions rddSampledByPartition.mapPartitions { iter => iter.take(10) } // take the first 10 elements out of each partition
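
Filled out as a runnable sketch (assumes a live SparkContext named sc; PartitionPruningRDD lives in org.apache.spark.rdd):

    import scala.util.Random
    import org.apache.spark.rdd.PartitionPruningRDD

    val myRdd = sc.parallelize(1 to 1000000, 100)

    // keep each partition with probability 0.1, i.e. sample ~10% of the partitions
    val rddSampledByPartition = PartitionPruningRDD.create(myRdd, _ => Random.nextDouble() < 0.1)

    // then take the first 10 elements out of each surviving partition
    rddSampledByPartition.mapPartitions(_.take(10)).collect()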

Re: Why is RDD to PairRDDFunctions only via implicits?

2015-05-22 Thread Reynold Xin
I'm not sure if it is possible to overload the map function twice, once for just KV pairs, and another for K and V separately. On Fri, May 22, 2015 at 10:26 AM, Justin Pihony wrote: > This ticket improved > the RDD API, but it could be even mor

Re: Exception when using CLUSTER BY or ORDER BY

2015-06-12 Thread Reynold Xin
Tom, Can you file a JIRA and attach a small reproducible test case if possible? On Tue, May 19, 2015 at 1:50 PM, Thomas Dudziak wrote: > Under certain circumstances that I haven't yet been able to isolate, I get > the following error when doing a HQL query using HiveContext (Spark 1.3.1 > on M

Re: Building scaladoc using "build/sbt unidoc" failure

2015-06-12 Thread Reynold Xin
Try build/sbt clean first. On Tue, May 26, 2015 at 4:45 PM, Justin Yip wrote: > Hello, > > I am trying to build scala doc from the 1.4 branch. But it failed due to > [error] (sql/compile:compile) java.lang.AssertionError: assertion failed: > List(object package$DebugNode, object package$DebugNo

Re: Grouping runs of elements in a RDD

2015-06-30 Thread Reynold Xin
Try mapPartitions, which gives you an iterator, and you can produce an iterator back. On Tue, Jun 30, 2015 at 11:01 AM, RJ Nowling wrote: > Hi all, > > I have a problem where I have a RDD of elements: > > Item1 Item2 Item3 Item4 Item5 Item6 ... > > and I want to run a function over them to deci
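
A minimal sketch of the pattern (hypothetical data, assumes a SparkContext sc):

    val items = sc.parallelize(Seq("Item1", "Item2", "Item3", "Item4", "Item5", "Item6"), 2)

    // each task sees its partition as an iterator and can emit a new iterator;
    // grouped(3) stands in for whatever run-detection logic the job needs
    // (note: runs that straddle partition boundaries need extra handling)
    val runs = items.mapPartitions { iter =>
      iter.grouped(3).map(_.mkString("+"))
    }
    runs.collect() // e.g. Array("Item1+Item2+Item3", "Item4+Item5+Item6")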

Re: Selecting column in dataframe created with incompatible schema causes AnalysisException

2016-03-02 Thread Reynold Xin
Are you looking for a "relaxed" mode that simply returns nulls for fields that don't exist or have an incompatible schema? On Wed, Mar 2, 2016 at 11:12 AM, Ewan Leith wrote: > Thanks Michael, it's not a great example really, as the data I'm working with > has some source files that do fit the sche

Re: Selecting column in dataframe created with incompatible schema causes AnalysisException

2016-03-02 Thread Reynold Xin
I don't think that exists right now, but it's definitely a good option to have. I myself have run into this issue a few times. Can you create a JIRA ticket so we can track it? Would be even better if you are interested in working on a patch! Thanks. On Wed, Mar 2, 2016 at 11:51 AM, Ewan Leith w

Re: Selecting column in dataframe created with incompatible schema causes AnalysisException

2016-03-02 Thread Reynold Xin
Thanks. Once you create the jira just reply to this email with the link. On Wednesday, March 2, 2016, Ewan Leith wrote: > Thanks, I'll create the JIRA for it. Happy to help contribute to a patch if > we can, not sure if my own scala skills will be up to it but perhaps one of > my colleagues' w

Re: Spark Scheduler creating Straggler Node

2016-03-08 Thread Reynold Xin
You just want to be able to replicate hot cached blocks right? On Tuesday, March 8, 2016, Prabhu Joseph wrote: > Hi All, > > When a Spark Job is running, and one of the Spark Executor on Node A > has some partitions cached. Later for some other stage, Scheduler tries to > assign a task to No

[discuss] making SparkEnv private in Spark 2.0

2016-03-19 Thread Reynold Xin
Any objections? Please articulate your use case. SparkEnv is a weird one because it was documented as "private" but not actually marked private in class visibility. * NOTE: This is not intended for external use. This is exposed for Shark and may be made private * in a future release. I do see Hive

Re: [discuss] making SparkEnv private in Spark 2.0

2016-03-19 Thread Reynold Xin
On Wed, Mar 16, 2016 at 3:29 PM, Mridul Muralidharan wrote: > b) Shuffle manager (to get shuffle reader) > What's the use case for shuffle manager/reader? This seems like using super internal APIs in applications.

Re: df.dtypes -> pyspark.sql.types

2016-03-20 Thread Reynold Xin
We probably should have the alias. Is this still a problem on master branch? On Wed, Mar 16, 2016 at 9:40 AM, Ruslan Dautkhanov wrote: > Running following: > > #fix schema for gaid which should not be Double >> from pyspark.sql.types import * >> customSchema = StructType() >> for (col,typ) in ts

Re: Eliminating shuffle write and spill disk IO reads/writes in Spark

2016-04-01 Thread Reynold Xin
spark.shuffle.spill actually has nothing to do with whether we write shuffle files to disk. Currently it is not possible to not write shuffle files to disk, and typically it is not a problem because the network fetch throughput is lower than what disks can sustain. In most cases, especially with SS

Re: Eliminating shuffle write and spill disk IO reads/writes in Spark

2016-04-01 Thread Reynold Xin
> others. > Hence, the performance of Spark is gated by the performance of > spark.local.dir, even on large memory systems. > "Currently it is not possible to not write shuffle files to disk." > Wh

Re: Eliminating shuffle write and spill disk IO reads/writes in Spark

2016-04-01 Thread Reynold Xin
ence, the performance of Spark is gated by the performance of > spark.local.dir, even on large memory systems. > "Currently it is not possible to not write shuffle files to disk." > What changes >would< make it possible? > The onl

Re: Switch RDD-based MLlib APIs to maintenance mode in Spark 2.0

2016-04-05 Thread Reynold Xin
+1 This is a no brainer IMO. On Tue, Apr 5, 2016 at 7:32 PM, Joseph Bradley wrote: > +1 By the way, the JIRA for tracking (Scala) API parity is: > https://issues.apache.org/jira/browse/SPARK-4591 > > On Tue, Apr 5, 2016 at 4:58 PM, Matei Zaharia > wrote: > >> This sounds good to me as well.

Re: Executor shutdown hooks?

2016-04-06 Thread Reynold Xin
On Wed, Apr 6, 2016 at 4:39 PM, Sung Hwan Chung wrote: > My option so far seems to be using JVM's shutdown hook, but I was > wondering if Spark itself had an API for tasks. > Spark would be using that under the hood anyway, so you might as well just use the JVM shutdown hook directly.
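
A hedged sketch of that direct approach: a lazy val in an object registers the hook at most once per executor JVM, no matter how many tasks touch it:

    import scala.reflect.ClassTag
    import org.apache.spark.rdd.RDD

    object ExecutorCleanup {
      // lazy val is evaluated at most once per JVM, so the hook is registered once
      lazy val hook: Unit = Runtime.getRuntime.addShutdownHook(new Thread {
        override def run(): Unit = println("executor JVM exiting; cleaning up")
      })
    }

    def withCleanup[T: ClassTag](rdd: RDD[T]): RDD[T] =
      rdd.mapPartitions { iter => ExecutorCleanup.hook; iter }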

Re: How Spark handles dead machines during a job.

2016-04-08 Thread Reynold Xin
The driver has the data and wouldn't need to rerun. On Friday, April 8, 2016, Sung Hwan Chung wrote: > Hello, > > Say, that I'm doing a simple rdd.map followed by collect. Say, also, that > one of the executors finish all of its tasks, but there are still other > executors running. > > If the ma

Re: Do transformation functions on RDD invoke a Job [sc.runJob]?

2016-04-24 Thread Reynold Xin
Usually no - but sortByKey does because it needs the range boundary to be built in order to have the RDD. It is a long standing problem that's unfortunately very difficult to solve without breaking the RDD API. In DataFrame/Dataset we don't have this issue though. On Sun, Apr 24, 2016 at 10:54 P

Re: Spark 2.0 - SQL Subqueries.

2016-05-21 Thread Reynold Xin
https://issues.apache.org/jira/browse/SPARK-15078 was just a bunch of test harness changes and added no new functionality. To reduce confusion, I just backported it into branch-2.0, so SPARK-15078 is now in 2.0 too. Can you paste a query you were testing? On Sat, May 21, 2016 at 10:49 AM, Kamalesh Nair

Re: feedback on dataset api explode

2016-05-25 Thread Reynold Xin
Based on this discussion I'm thinking we should deprecate the two explode functions. On Wednesday, May 25, 2016, Koert Kuipers wrote: > wenchen, > that definition of explode seems identical to flatMap, so you dont need it > either? > > michael, > i didn't know about the column expression version
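
The replacements implied here, sketched (assumes a SparkSession named spark):

    import org.apache.spark.sql.functions.{explode, split}
    import spark.implicits._

    val ds = Seq("a b", "c d").toDS()

    // Dataset flavor: flatMap covers what Dataset.explode did
    val words = ds.flatMap(_.split(" "))

    // column-expression flavor: functions.explode on a DataFrame column
    val exploded = ds.toDF("line").select(explode(split($"line", " ")).as("word"))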

Re: Pros and Cons

2016-05-25 Thread Reynold Xin
On Wed, May 25, 2016 at 9:52 AM, Jörn Franke wrote: > Spark is more for machine learning working iteratively over the whole same > dataset in memory. Additionally it has streaming and graph processing > capabilities that can be used together. > Hi Jörn, The first part is actually not true. Spark c

Re: feedback on dataset api explode

2016-05-25 Thread Reynold Xin
ily replaced by .flatMap (to do explosion) and > .select (to rename output columns) > Cheng > On 5/25/16 12:30 PM, Reynold Xin wrote: >> Based on this discussion I'm thinking we should deprecate the two explode >> functions. >> On Wednesday, May 25, 2016, Ko

Re: JDBC Dialect for saving DataFrame into Vertica Table

2016-05-26 Thread Reynold Xin
It's probably a good idea to have the vertica dialect too, since it doesn't seem like it'd be too difficult to maintain. It is not going to be as performant as the native Vertica data source, but is going to be much lighter weight. On Thu, May 26, 2016 at 3:09 PM, Mohammed Guller wrote: > Verti
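
A minimal sketch of what such a dialect might look like (the URL prefix and quoting rule are assumptions, not verified Vertica behavior):

    import org.apache.spark.sql.jdbc.{JdbcDialect, JdbcDialects}

    object VerticaDialect extends JdbcDialect {
      override def canHandle(url: String): Boolean =
        url.toLowerCase.startsWith("jdbc:vertica")

      // assumption: Vertica quotes identifiers with double quotes
      override def quoteIdentifier(colName: String): String =
        s""""$colName""""
    }

    JdbcDialects.registerDialect(VerticaDialect)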

Re: Spark 2.0 Release Date

2016-06-07 Thread Reynold Xin
It'd be great to cut an RC as soon as possible. Looking at the blocker/critical issue list, majority of them are API audits. I think people will get back to those once Spark Summit is over, and then we should see some good progress towards an RC. On Tue, Jun 7, 2016 at 6:20 AM, Jacek Laskowski wr

Re: Thanks For a Job Well Done !!!

2016-06-18 Thread Reynold Xin
Thanks for the kind words, Krishna! Please keep the feedback coming. On Saturday, June 18, 2016, Krishna Sankar wrote: > Hi all, >Just wanted to thank all for the dataset API - most of the times we see > only bugs in these lists ;o). > >- Putting some context, this weekend I was updating

[ANNOUNCE] Announcing Spark 1.6.2

2016-06-27 Thread Reynold Xin
We are happy to announce the availability of Spark 1.6.2! This maintenance release includes fixes across several areas of Spark. You can find the list of changes here: https://s.apache.org/spark-1.6.2 And download the release here: http://spark.apache.org/downloads.html

Re: Logical Plan

2016-06-30 Thread Reynold Xin
Which version are you using here? If the underlying files change, technically we should go through optimization again. Perhaps the real "fix" is to figure out why logical plan creation is so slow for 700 columns. On Thu, Jun 30, 2016 at 1:58 PM, Darshan Singh wrote: > Is there a way I can use

Re: ml and mllib persistence

2016-07-12 Thread Reynold Xin
Also Java serialization isn't great for cross platform compatibility. On Tuesday, July 12, 2016, aka.fe2s wrote: > Okay, I think I found an answer on my question. Some models (for instance > org.apache.spark.mllib.recommendation.MatrixFactorizationModel) hold RDDs, > so just serializing these ob

Re: Spark Website

2016-07-13 Thread Reynold Xin
Thanks for reporting. This is due to https://issues.apache.org/jira/servicedesk/agent/INFRA/issue/INFRA-12055 On Wed, Jul 13, 2016 at 11:52 AM, Pradeep Gollakota wrote: > Worked for me if I go to https://spark.apache.org/site/ but not > https://spark.apache.org > > On Wed, Jul 13, 2016 at 11:4

Re: transtition SQLContext to SparkSession

2016-07-18 Thread Reynold Xin
Good idea. https://github.com/apache/spark/pull/14252 On Mon, Jul 18, 2016 at 12:16 PM, Michael Armbrust wrote: > + dev, reynold > > Yeah, thats a good point. I wonder if SparkSession.sqlContext should be > public/deprecated? > > On Mon, Jul 18, 2016 at 8:37 AM, Koert Kuipers wrote: > >> in

Re: transtition SQLContext to SparkSession

2016-07-19 Thread Reynold Xin
Yes. But in order to access methods available only in HiveContext a user cast is required. On Tuesday, July 19, 2016, Maciej Bryński wrote: > @Reynold Xin, > How this will work with Hive Support ? > SparkSession.sqlContext return HiveContext ? > > 2016-07-19 0:26 GMT+02

Re: where I can find spark-streaming-kafka for spark2.0

2016-07-25 Thread Reynold Xin
The presentation at Spark Summit SF was probably referring to Structured Streaming. The existing Spark Streaming (dstream) in Spark 2.0 has the same production stability level as Spark 1.6. There is also Kafka 0.10 support in dstream. On July 25, 2016 at 10:26:49 AM, Andy Davidson ( a...@santacruz

[ANNOUNCE] Announcing Apache Spark 2.0.0

2016-07-26 Thread Reynold Xin
Hi all, Apache Spark 2.0.0 is the first release of Spark 2.x line. It includes 2500+ patches from 300+ contributors. To download Spark 2.0, head over to the download page: http://spark.apache.org/downloads.html To view the release notes: http://spark.apache.org/releases/spark-release-2-0-0.html

Re: RDD vs Dataset performance

2016-07-28 Thread Reynold Xin
The performance difference is coming from the need to serialize and deserialize data to AnnotationText. The extra stage is probably very quick and shouldn't impact much. If you try to cache the RDD using serialized mode, it would slow down a lot too. On Thu, Jul 28, 2016 at 9:52 AM, Darin McBeath

Re: [SPARK-2.0][SQL] UDF containing non-serializable object does not work as expected

2016-08-08 Thread Reynold Xin
That is unfortunately the way the Scala compiler captures (and defines) closures. Nothing is really final in the JVM. You can always use reflection or unsafe to modify the value of fields. On Mon, Aug 8, 2016 at 8:16 PM, Simon Scott wrote: > But does the "notSer" object have to be serialized? >
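
A common workaround, sketched (hypothetical Helper class; the idea is to construct the non-serializable object inside the closure instead of capturing it):

    import org.apache.spark.sql.functions.udf

    // stand-in for some class that is not Serializable
    class Helper { def process(s: String): String = s.toUpperCase }

    // nothing non-serializable is captured from the enclosing scope;
    // only the closure itself is shipped to executors
    val safeUdf = udf { (s: String) => new Helper().process(s) }

If per-row construction is too costly, a lazy val in a top-level object gives one instance per executor JVM instead.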

Re: [SPARK-2.0][SQL] UDF containing non-serializable object does not work as expected

2016-08-08 Thread Reynold Xin
rks. > The workaround I can imagine is just to cache and materialize `df` by > `df.cache.count()`, and then call `df.filter(...).show()`. > It should work, just a little bit tedious. > On Mon, Aug 8, 2016 at 10:00 PM, Reynold Xin wrote: >> That is unfortun

[discuss] dropping Python 2.6 support

2016-01-04 Thread Reynold Xin
Does anybody here care about us dropping support for Python 2.6 in Spark 2.0? Python 2.6 is ancient, and is pretty slow in many aspects (e.g. json parsing) when compared with Python 2.7. Some libraries that Spark depends on stopped supporting 2.6. We can still convince the library maintainers to su

Re: XML column not supported in Database

2016-01-11 Thread Reynold Xin
Can you file a JIRA ticket? Thanks. The URL is issues.apache.org/jira/browse/SPARK On Mon, Jan 11, 2016 at 1:44 AM, Gaini Rajeshwar < raja.rajeshwar2...@gmail.com> wrote: > Hi All, > > I am using PostgreSQL database. I am using the following jdbc call to > access a customer table (*customer_id i

[discuss] dropping Hadoop 2.2 and 2.3 support in Spark 2.0?

2016-01-13 Thread Reynold Xin
We've dropped Hadoop 1.x support in Spark 2.0. There is also a proposal to drop Hadoop 2.2 and 2.3, i.e. the minimal Hadoop version we support would be Hadoop 2.4. The main advantage is then we'd be able to focus our Jenkins resources (and the associated maintenance of Jenkins) to create builds fo

Spark Summit San Francisco 2016 call for presentations (CFP)

2016-02-11 Thread Reynold Xin
FYI, Call for presentations is now open for Spark Summit. The event will take place on June 6-8 in San Francisco. Submissions are welcome across a variety of Spark-related topics, including applications, development, data science, business value, spark ecosystem and research. Please submit by Febr

Spark Summit (San Francisco, June 6-8) call for presentations due in less than a week

2016-02-24 Thread Reynold Xin
Just want to send a reminder in case people don't know about it. If you are working on (or with, using) Spark, consider submitting your work to Spark Summit, coming up in June in San Francisco. https://spark-summit.org/2016/call-for-presentations/ Cheers.

Re: DirectFileOutputCommiter

2016-02-26 Thread Reynold Xin
It could lose data in speculation mode, or if any job fails. On Fri, Feb 26, 2016 at 3:45 AM, Igor Berman wrote: > Takeshi, do you know the reason why they wanted to remove this commiter in > SPARK-10063? > the jira has no info inside > as far as I understand the direct committer can't be used w

Re: Is spark.driver.maxResultSize used correctly ?

2016-02-27 Thread Reynold Xin
But sometimes you might have skew, and almost all the result data are in one or a few tasks. On Friday, February 26, 2016, Jeff Zhang wrote: > > My job gets this exception very easily even when I set a large value of > spark.driver.maxResultSize. After checking the spark code, I found > spark

Re: Is spark.driver.maxResultSize used correctly ?

2016-03-01 Thread Reynold Xin
ble, but not the common case. I think we should > design for the common case; for the skew case, we may set some > parameter of fraction to allow the user to tune it. > On Sat, Feb 27, 2016 at 4:51 PM, Reynold Xin wrote: >> But sometimes you might have skew and almost

Re: [Proposal] Enabling time series analysis on spark metrics

2016-03-01 Thread Reynold Xin
Is the suggestion just to use a different config (and maybe fallback to appid) in order to publish metrics? Seems reasonable. On Tue, Mar 1, 2016 at 8:17 AM, Karan Kumar wrote: > +dev mailing list > > Time series analysis on metrics becomes quite useful when running spark > jobs using a workflo

[ANNOUNCE] Announcing Spark 1.5.1

2015-10-01 Thread Reynold Xin
Hi All, Spark 1.5.1 is a maintenance release containing stability fixes. This release is based on the branch-1.5 maintenance branch of Spark. We *strongly recommend* all 1.5.0 users to upgrade to this release. The full list of bug fixes is here: http://s.apache.org/spark-1.5.1 http://spark.apach

Fwd: multiple count distinct in SQL/DataFrame?

2015-10-07 Thread Reynold Xin
Adding user list too. -- Forwarded message -- From: Reynold Xin Date: Tue, Oct 6, 2015 at 5:54 PM Subject: Re: multiple count distinct in SQL/DataFrame? To: "d...@spark.apache.org" To provide more context, if we do remove this feature, the following SQL query woul

Re: [SQL] Memory leak with spark streaming and spark sql in spark 1.5.1

2015-10-14 Thread Reynold Xin
+dev list On Wed, Oct 14, 2015 at 1:07 AM, Terry Hoo wrote: > All, > > Does anyone meet memory leak issue with spark streaming and spark sql in > spark 1.5.1? I can see the memory is increasing all the time when running > this simple sample: > > val sc = new SparkContext(conf) >

If you use Spark 1.5 and disabled Tungsten mode ...

2015-10-14 Thread Reynold Xin
Can you reply to this email and provide us with reasons why you disable it? Thanks.

Re: If you use Spark 1.5 and disabled Tungsten mode ...

2015-11-01 Thread Reynold Xin
>>> at org.apache.spark.sql.execution.TungstenSort$$anonfun$doExecute$3.apply(sort.scala:169)
>>> at org.apache.spark.sql.execution.TungstenSort$$anonfun$doExecute$3.apply(sort.scala:169)
>>> at org.apache.spark.rdd.MapPartitionsWithPreparati

Please reply if you use Mesos fine grained mode

2015-11-03 Thread Reynold Xin
If you are using Spark with Mesos fine grained mode, can you please respond to this email explaining why you use it over the coarse grained mode? Thanks.

Re: Please reply if you use Mesos fine grained mode

2015-11-03 Thread Reynold Xin
ecutor, causing entire > stages to be retried. In fine-grained mode, only the task fails and > subsequently gets retried without taking out an entire stage or worse. > On Tue, Nov 3, 2015 at 3:54 PM, Reynold Xin wrote: >> If you are using Spark with Mesos fine grained mode, can yo

Re: Codegen In Shuffle

2015-11-04 Thread Reynold Xin
GenerateUnsafeProjection -- projects any internal row data structure directly into bytes (UnsafeRow). On Wed, Nov 4, 2015 at 12:21 AM, 牛兆捷 wrote: > Dear all: > > Tungsten project has mentioned that they are applying code generation is > to speed up the conversion of data from in-memory binary f

Re: Looking for the method executors uses to write to HDFS

2015-11-06 Thread Reynold Xin
Are you looking for this? https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetRelation.scala#L69 On Wed, Nov 4, 2015 at 5:11 AM, Tóth Zoltán wrote: > Hi, > > I'd like to write a parquet file from the driver. I could use

[ANNOUNCE] Announcing Spark 1.5.2

2015-11-10 Thread Reynold Xin
Hi All, Spark 1.5.2 is a maintenance release containing stability fixes. This release is based on the branch-1.5 maintenance branch of Spark. We *strongly recommend* all 1.5.x users to upgrade to this release. The full list of bug fixes is here: http://s.apache.org/spark-1.5.2 http://spark.apach

Re: Hive on Spark Vs Spark SQL

2015-11-15 Thread Reynold Xin
It's a completely different path. On Sun, Nov 15, 2015 at 10:37 PM, kiran lonikar wrote: > I would like to know if Hive on Spark uses or shares the execution code > with Spark SQL or DataFrames? > > More specifically, does Hive on Spark benefit from the changes made to > Spark SQL, project Tung

Re: Hive on Spark Vs Spark SQL

2015-11-15 Thread Reynold Xin
No it does not -- although it'd benefit from some of the work to make shuffle more robust. On Sun, Nov 15, 2015 at 10:45 PM, kiran lonikar wrote: > So does not benefit from Project Tungsten right? > > > On Mon, Nov 16, 2015 at 12:07 PM, Reynold Xin wrote: > >> It

Re: orc read issue n spark

2015-11-18 Thread Reynold Xin
What do you mean by "starts delay scheduling"? Are you saying it is no longer doing local reads? If that's the case you can increase the spark.locality.wait timeout. On Wednesday, November 18, 2015, Renu Yadav wrote: > Hi , > I am using spark 1.4.1 and saving orc file using > df.write.format("orc

Re: Please add us to the Powered by Spark page

2015-11-24 Thread Reynold Xin
I just updated the page to say "email dev" instead of "email user". On Tue, Nov 24, 2015 at 1:16 AM, Sean Owen wrote: > Not sure who generally handles that, but I just made the edit. > > On Mon, Nov 23, 2015 at 6:26 PM, Sujit Pal wrote: > > Sorry to be a nag, I realize folks with edit rights o

Re: Memory allocation error with Spark 1.5

2015-08-05 Thread Reynold Xin
In Spark 1.5, we have a new way to manage memory (part of Project Tungsten). The default unit of memory allocation is 64MB, which is way too high when you have 1G of memory allocated in total and have more than 4 threads. We will reduce the default page size before releasing 1.5. For now, you can

Re: Tungsten and sun.misc.Unsafe

2015-08-21 Thread Reynold Xin
I'm actually somewhat involved with the Google Docs you linked to. I don't think Oracle will remove Unsafe in JVM 9. As you said, JEP 260 already proposes making Unsafe available. Given the widespread use of Unsafe for performance and advanced functionalities, I don't think Oracle can just remove

Re: DataFrame. SparkPlan / Project serialization issue: ArrayIndexOutOfBounds.

2015-08-21 Thread Reynold Xin
You've probably hit this bug: https://issues.apache.org/jira/browse/SPARK-7180 It's fixed in Spark 1.4.1+. Try setting spark.serializer.extraDebugInfo to false and see if it goes away. On Fri, Aug 21, 2015 at 3:37 AM, Eugene Morozov wrote: > Hi, > > I'm using spark 1.3.1 built against hadoop 1

Re: How to avoid shuffle errors for a large join ?

2015-08-29 Thread Reynold Xin
Can you try 1.5? This should work much, much better in 1.5 out of the box. For 1.4, I think you'd want to turn on sort-merge-join, which is off by default. However, the sort-merge join in 1.4 can still trigger a lot of garbage, making it slower. SMJ performance is probably 5x - 1000x better in 1.5
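
For the 1.4 switch, something along these lines (hedged, setting name as of that era):

    // enable sort-merge join in Spark 1.4, where it is off by default
    sqlContext.setConf("spark.sql.planner.sortMergeJoin", "true")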

Re: How to avoid shuffle errors for a large join ?

2015-09-05 Thread Reynold Xin
it takes about 12h to finish (with > 1 shuffle partitions). My hunch is that the reason for that is this: > > INFO ExternalSorter: Thread 3733 spilling in-memory map of 174.9 MB to > disk (62 times so far) > > (and lots more where this comes from). > > On Sat, Aug 29, 2

Re: Problems with Tungsten in Spark 1.5.0-rc2

2015-09-07 Thread Reynold Xin
On Wed, Sep 2, 2015 at 12:03 AM, Anders Arpteg wrote: > > BTW, is it possible (or will it be) to use Tungsten with dynamic > allocation and the external shuffle manager? > > Yes - I think this already works. There isn't anything specific here related to Tungsten.

Re: Best way to import data from Oracle to Spark?

2015-09-09 Thread Reynold Xin
Using the JDBC data source is probably the best way. http://spark.apache.org/docs/1.4.1/sql-programming-guide.html#jdbc-to-other-databases On Tue, Sep 8, 2015 at 10:11 AM, Cui Lin wrote: > What's the best way to import data from Oracle to Spark? Thanks! > > > -- > Best regards! > > Lin,Cui >
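
A sketch of that read path (hypothetical connection details; the Oracle JDBC driver must be on the classpath):

    val oracleDf = sqlContext.read.format("jdbc")
      .option("url", "jdbc:oracle:thin:@//dbhost:1521/ORCL") // hypothetical host/service
      .option("dbtable", "MY_SCHEMA.MY_TABLE")               // hypothetical table
      .option("user", "scott")
      .option("password", "tiger")
      .load()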

[ANNOUNCE] Announcing Spark 1.5.0

2015-09-09 Thread Reynold Xin
Hi All, Spark 1.5.0 is the sixth release on the 1.x line. This release represents 1400+ patches from 230+ contributors and 80+ institutions. To download Spark 1.5.0 visit the downloads page. A huge thanks go to all of the individuals and organizations involved in development and testing of this r

Re: Driver OOM after upgrading to 1.5

2015-09-09 Thread Reynold Xin
Java 7 / 8? On Wed, Sep 9, 2015 at 10:10 AM, Sandy Ryza wrote: > I just upgraded the spark-timeseries > project to run on top of > 1.5, and I'm noticing that tests are failing with OOMEs. > > I ran a jmap -histo on the process and discovered the top

Re: Driver OOM after upgrading to 1.5

2015-09-09 Thread Reynold Xin
Sandy Ryza wrote: > Java 7. > FWIW I was just able to get it to work by increasing MaxPermSize to 256m. > -Sandy > On Wed, Sep 9, 2015 at 11:37 AM, Reynold Xin wrote: >> Java 7 / 8? >> On Wed, Sep 9, 2015 at 10:10 AM, Sandy Ryza >> wro

Re: Perf impact of BlockManager byte[] copies

2015-09-10 Thread Reynold Xin
This is one problem I'd like to address soon - providing a binary block management interface for shuffle (and maybe other things) that avoids serialization/copying. On Fri, Feb 27, 2015 at 3:39 PM, Paul Wais wrote: > Dear List, > > I'm investigating some problems related to native code integrat

Re: How to avoid shuffle errors for a large join ?

2015-09-16 Thread Reynold Xin
Only SQL and DataFrame for now. We are thinking about how to apply that to a more general distributed collection based API, but it's not in 1.5. On Sat, Sep 5, 2015 at 11:56 AM, Gurvinder Singh wrote: > On 09/05/2015 11:22 AM, Reynold Xin wrote: > > Try increase the shuffle memor

Re: in joins, does one side stream?

2015-09-18 Thread Reynold Xin
Yes for RDD -- both are materialized. No for DataFrame/SQL - one side streams. On Thu, Sep 17, 2015 at 11:21 AM, Koert Kuipers wrote: > in scalding we join with the smaller side on the left, since the smaller > side will get buffered while the bigger side streams through the join. > > looking a

Re: in joins, does one side stream?

2015-09-19 Thread Reynold Xin
aborate on this. I thought RDD also opens only an > iterator. Does it get materialized for joins? > Rishi > On Saturday, September 19, 2015, Reynold Xin wrote: >> Yes for RDD -- both are materialized. No for DataFrame/SQL - one side >> streams.

Re: in joins, does one side stream?

2015-09-20 Thread Reynold Xin
>> they don't seem specific to structured data analysis to me. >> On Sun, Sep 20, 2015 at 2:41 AM, Rishitesh Mishra <rishi80.mis...@gmail.com> wrote: >>> Got it..thnx Reynold.. >>> On 20 Sep 2015 07:08, "Reynold Xin"

Re: Null Value in DecimalType column of DataFrame

2015-09-21 Thread Reynold Xin
+dev list Hi Dirceu, The answer to whether throwing an exception or returning null is better depends on your use case. If you are debugging and want to find bugs in your program, you might prefer throwing an exception. However, if you are running on a large real-world dataset (i.e. data is dirt

[ANNOUNCE] Announcing Spark 2.0.1

2016-10-04 Thread Reynold Xin
We are happy to announce the availability of Spark 2.0.1! Apache Spark 2.0.1 is a maintenance release containing 300 stability and bug fixes. This release is based on the branch-2.0 maintenance branch of Spark. We strongly recommend all 2.0.0 users to upgrade to this stable release. To download A

Re: [ANNOUNCE] Announcing Spark 2.0.1

2016-10-04 Thread Reynold Xin
They were published yesterday, but it can take a while to propagate. On Tue, Oct 4, 2016 at 12:58 PM, Prajwal Tuladhar wrote: > Hi, > It seems like the 2.0.1 artifact hasn't been published to Maven Central. Can > anyone confirm? > On Tue, Oct 4, 2016 at 5:39 PM, Reyn

Re: Kafaka 0.8, 0.9 in Structured Streaming

2016-10-07 Thread Reynold Xin
Does Kafka 0.10 work on a Kafka 0.8/0.9 cluster? On Fri, Oct 7, 2016 at 1:14 PM, Jeremy Smith wrote: > +1 > > We're on CDH, and it will probably be a while before they support Kafka > 0.10. At the same time, we don't use their Spark and we're looking forward > to upgrading to 2.0.x and using st

Re: This Exception has been really hard to trace

2016-10-09 Thread Reynold Xin
You should probably check with DataStax, who build the Cassandra connector for Spark. On Sun, Oct 9, 2016 at 8:13 PM, kant kodali wrote: > I tried SpanBy but it looks like there is a strange error happening no > matter which way I try, like the one described here for the Java solution. > http:/

SPARK-17845 - window function frame boundary API

2016-10-09 Thread Reynold Xin
Hi all, I tried to use the window function DataFrame API this weekend and found it awkward to use, especially with respect to specifying frame boundaries. I wrote down some options here and am curious to hear your thoughts. If you have suggestions on the API beyond what's already listed in the JIRA ticket

Mark DataFrame/Dataset APIs stable

2016-10-12 Thread Reynold Xin
I took a look at all the public APIs we expose in o.a.spark.sql tonight, and realized we still have a large number of APIs that are marked experimental. Most of these haven't really changed, except in 2.0 we merged DataFrame and Dataset. I think it's long overdue to mark them stable. I'm tracking

[ANNOUNCE] Announcing Apache Spark 1.6.3

2016-11-07 Thread Reynold Xin
We are happy to announce the availability of Spark 1.6.3! This maintenance release includes fixes across several areas of Spark, and we encourage users on the 1.6.x line to upgrade to 1.6.3. Head to the project's download page to download the new version: http://spark.apache.org/downloads.html

[ANNOUNCE] Apache Spark 2.0.2

2016-11-14 Thread Reynold Xin
We are happy to announce the availability of Spark 2.0.2! Apache Spark 2.0.2 is a maintenance release containing 90 bug fixes along with Kafka 0.10 support and runtime metrics for Structured Streaming. This release is based on the branch-2.0 maintenance branch of Spark. We strongly recommend all 2

Re: How do I convert json_encoded_blob_column into a data frame? (This may be a feature request)

2016-11-17 Thread Reynold Xin
Adding a new data type is an enormous undertaking and very invasive. I don't think it is worth it in this case given there are clear, simple workarounds. On Thu, Nov 17, 2016 at 12:24 PM, kant kodali wrote: > Can we have a JSONType for Spark SQL? > > On Wed, Nov 16, 2016 at 8:41 PM, Nathan Land
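
One such workaround, sketched (assumes Spark 2.1+ with from_json, a SparkSession spark, and a hypothetical json_blob column):

    import org.apache.spark.sql.functions.from_json
    import org.apache.spark.sql.types.{LongType, StringType, StructType}
    import spark.implicits._

    val df = Seq("""{"id": 1, "name": "a"}""").toDF("json_blob")
    val schema = new StructType().add("id", LongType).add("name", StringType)

    // parse the string column into a struct, then flatten it
    val parsed = df.select(from_json($"json_blob", schema).as("parsed")).select("parsed.*")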

Re: Re: Multiple streaming aggregations in structured streaming

2016-11-20 Thread Reynold Xin
Can you use the approximate count distinct? On Sun, Nov 20, 2016 at 11:51 PM, Xinyu Zhang wrote: > > MapWithState is also very useful. > I want to calculate UV in real time, but "distinct count" and "multiple > streaming aggregations" are not supported. > Is there any method to calculate real-t

Re: Re: Re: Multiple streaming aggregations in structured streaming

2016-11-22 Thread Reynold Xin
It's just the "approx_count_distinct" aggregate function. On Tue, Nov 22, 2016 at 6:51 PM, Xinyu Zhang wrote: > Could you please tell me how to use the approximate count distinct? Is > there any docs? > > Thanks > > > At 2016-11-21 15:56:21, "
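
Usage, sketched (Spark 2.1+ naming; earlier releases call it approxCountDistinct, and the column name is hypothetical):

    import org.apache.spark.sql.functions.approx_count_distinct
    import spark.implicits._

    val df = Seq("u1", "u2", "u1").toDF("user_id")

    // an optional rsd argument trades accuracy for memory
    val uv = df.agg(approx_count_distinct("user_id").as("uv"))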

Re: Third party library

2016-11-25 Thread Reynold Xin
bcc dev@ and add user@ This is more a user@ list question rather than a dev@ list question. You can do something like this: object MySimpleApp { def loadResources(): Unit = // define some idempotent way to load resources, e.g. with a flag or lazy val def main() = { ... sc.paralleli
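
One hedged way to fill out that pattern (the lazy val flavor):

    object MySimpleApp {
      // evaluated at most once per JVM (driver or executor), so the load is idempotent
      lazy val resourcesLoaded: Unit = {
        // e.g. System.loadLibrary("mylib") or reading a model file (assumption)
      }

      def run(sc: org.apache.spark.SparkContext): Unit =
        sc.parallelize(1 to 100).foreachPartition { _ =>
          MySimpleApp.resourcesLoaded // forces the one-time load on each executor
        }
    }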

Re: Third party library

2016-11-26 Thread Reynold Xin
Am I missing something? If possible, can you point me to an existing > implementation which I can refer to. > Thanks again. > On Fri, Nov 25, 2016 at 3:32 PM, Reynold Xin wrote: >> bcc dev@ and add user@ >> This is more a user@ list

Re: Bit-wise AND operation between integers

2016-11-28 Thread Reynold Xin
Bcc dev@ and add user@ The dev list is not meant for users to ask questions on how to use Spark. For that you should use StackOverflow or the user@ list.

scala> sql("select 1 & 2").show()
+-------+
|(1 & 2)|
+-------+
|      0|
+-------+

scala> sql("select 1 & 3").show()
+-------+
|(1 & 3)|
+-

Re: Can't read tables written in Spark 2.1 in Spark 2.0 (and earlier)

2016-11-30 Thread Reynold Xin
This should fix it: https://github.com/apache/spark/pull/16080 On Wed, Nov 30, 2016 at 10:55 AM, Timur Shenkao wrote: > Hello, > > Yes, I used hiveContext, sqlContext, sparkSession from Java, Scala, > Python. > Via spark-shell, spark-submit, IDE (PyCharm, Intellij IDEA). > Everything is perfec

Re: Output Side Effects for different chain of operations

2016-12-15 Thread Reynold Xin
You can just write some files out directly (and idempotently) in your map/mapPartitions functions. It is just a function that you can run arbitrary code after all. On Thu, Dec 15, 2016 at 11:33 AM, Chawla,Sumit wrote: > Any suggestions on this one? > > Regards > Sumit Chawla > > > On Tue, Dec 1
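
A hedged sketch of that pattern (hypothetical output path; deterministic per-partition file names keep retries idempotent):

    import java.nio.charset.StandardCharsets
    import java.nio.file.{Files, Paths}

    val rdd = sc.parallelize(Seq("a", "b", "c", "d"), 2)

    rdd.mapPartitionsWithIndex { (partitionId, iter) =>
      // the same partition always writes the same file, so a retried task
      // simply overwrites its own output (assumes /tmp/out exists on executors)
      val path = Paths.get(s"/tmp/out/part-$partitionId.txt")
      Files.write(path, iter.mkString("\n").getBytes(StandardCharsets.UTF_8))
      Iterator.single(partitionId)
    }.count() // an action forces the side effect to run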

Re: is dataframe thread safe?

2017-02-13 Thread Reynold Xin
Yes your use case should be fine. Multiple threads can transform the same data frame in parallel since they create different data frames. On Sun, Feb 12, 2017 at 9:07 AM Mendelson, Assaf wrote: > Hi, > > I was wondering if dataframe is considered thread safe. I know the spark > session and spar
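
That pattern, sketched (assumes a SparkSession spark):

    import scala.concurrent.{Await, Future}
    import scala.concurrent.ExecutionContext.Implicits.global
    import scala.concurrent.duration._

    val shared = spark.range(0, 1000).toDF("id")

    // each thread derives and runs its own DataFrame; the shared parent
    // is only read, never mutated, which is what makes this safe
    val jobs = Seq(
      Future(shared.filter("id % 2 = 0").count()),
      Future(shared.filter("id % 3 = 0").count())
    )
    Await.result(Future.sequence(jobs), 5.minutes)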

Re: the dependence length of RDD, can its size be greater than 1 pleaae?

2017-06-15 Thread Reynold Xin
A join? On Thu, Jun 15, 2017 at 1:11 AM 萝卜丝炒饭 <1427357...@qq.com> wrote: > Hi all, > > The RDD code keeps a member as below: > dependencies_ : seq[Dependency[_]] > > It is a seq, that means it can keep more than one dependency. > > I have an issue about this. > Is it possible that its size is gre
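
A quick way to see multi-parent dependencies from the shell (hedged sketch; a join's underlying CoGroupedRDD likewise depends on both parents):

    val a = sc.parallelize(1 to 3)
    val b = sc.parallelize(4 to 6)

    println(a.union(b).dependencies.length)     // 2: one RangeDependency per parent
    println(a.cartesian(b).dependencies.length) // 2: a narrow dependency on each parent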

Re: Question on Spark code

2017-07-23 Thread Reynold Xin
It means the same object ("this") is returned. On Sun, Jul 23, 2017 at 8:16 PM, tao zhan wrote: > Hello, > I am new to scala and spark. > What is the "this.type" in the set function for? > https://github.com/apache/spark/blob/481f0792944d9a77f0fe8b5e2596da > 1d600b9d0a/mllib/src/main/sca

Re: Question on Spark code

2017-07-23 Thread Reynold Xin
>> Doesn't it mean the return type will be type of "this" class. So, it >> doesn't have to be this instance of the class but it has to be type of this >> instance of the class. When you have a stack of inheritance and call that >> function, it will retur
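
A compact illustration (hedged sketch with hypothetical Base/Derived classes):

    class Base {
      private var name: String = _
      def setName(n: String): this.type = { name = n; this }
    }

    class Derived extends Base {
      private var age: Int = _
      def setAge(a: Int): this.type = { age = a; this }
    }

    // because setName returns this.type, the static type stays Derived,
    // so setAge can still be chained after it
    val d: Derived = new Derived().setName("x").setAge(1)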

Re: SQL specific documentation for recent Spark releases

2017-08-11 Thread Reynold Xin
This PR should help you in the next release: https://github.com/apache/spark/pull/18702 On Thu, Aug 10, 2017 at 7:46 PM, Stephen Boesch wrote: > > The correct link is https://docs.databricks.com/ > spark/latest/spark-sql/index.html . > > This link does have the core syntax such as the BNF for
