Re: spark 1.4 GC issue

2015-11-15 Thread Ted Yu
Please take a look at http://www.infoq.com/articles/tuning-tips-G1-GC
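
As a starting point, a minimal sketch of how G1 options like the ones in that article can be passed to the executors through SparkConf; the flag values below are illustrative placeholders to tune per workload, not recommendations:

import org.apache.spark.{SparkConf, SparkContext}

// Enable G1 GC on the executors via extra JVM options (values are placeholders;
// the master is assumed to come from spark-submit).
val conf = new SparkConf()
  .setAppName("g1-gc-example")
  .set("spark.executor.extraJavaOptions",
    "-XX:+UseG1GC -XX:MaxGCPauseMillis=200 -XX:G1HeapRegionSize=16m " +
    "-XX:+PrintGCDetails -XX:+PrintGCDateStamps")
val sc = new SparkContext(conf)

The same string can also be passed with --conf "spark.executor.extraJavaOptions=..." on spark-submit.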

Cheers

On Sat, Nov 14, 2015 at 10:03 PM, Renu Yadav  wrote:

> I have tried with G1 GC. Could anyone please share their GC settings?
> At the code level I am:
> 1. reading an ORC table into a DataFrame
> 2. mapping the DataFrame to an RDD of my case class
> 3. converting that RDD to a paired RDD
> 4. applying combineByKey
> 5. saving the result to an ORC file
>
> Please suggest
>
> Regards,
> Renu Yadav
>
> On Fri, Nov 13, 2015 at 8:01 PM, Renu Yadav  wrote:
>
>> I am using Spark 1.4 and my application is spending most of its time in GC,
>> around 60-70% of the time for each task.
>>
>> I am using the parallel GC.
>> Could somebody please help as soon as possible?
>>
>> Thanks,
>> Renu
>>
>
>
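
For reference, a rough Scala sketch of the pipeline Renu describes (ORC read, map to a case class, combineByKey, ORC write). The table path, case class and aggregation logic are assumptions for illustration only, and sc is an existing SparkContext:

import org.apache.spark.sql.hive.HiveContext

case class Event(key: String, value: Long)  // hypothetical record type

val hiveContext = new HiveContext(sc)

// 1. read the ORC table into a DataFrame (path is a placeholder)
val df = hiveContext.read.format("orc").load("/data/events_orc")

// 2./3. map the DataFrame to a paired RDD keyed by the first column
val paired = df.map(r => (r.getString(0), Event(r.getString(0), r.getLong(1))))

// 4. combineByKey -- here simply summing the values per key as a stand-in
val combined = paired.combineByKey[Long](
  (e: Event) => e.value,
  (acc: Long, e: Event) => acc + e.value,
  (a: Long, b: Long) => a + b)

// 5. write the result back out as ORC
import hiveContext.implicits._
combined.toDF("key", "total").write.format("orc").save("/data/events_agg_orc")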


Are map tasks spilling data to disk?

2015-11-15 Thread gsvic
According to this paper
<http://www.cs.berkeley.edu/~kubitron/courses/cs262a-F13/projects/reports/project16_report.pdf>,
Spark's map tasks write their results to disk.

My actual question is: in BroadcastHashJoin
<https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/BroadcastHashJoin.scala#L100>,
the doExecute() method calls mapPartitions at line 109
<https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/BroadcastHashJoin.scala#L109>.
At this step, Spark will schedule a number of tasks to perform the hash join
operation. Will the results of these tasks be written to each worker's disk?



--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/Are-map-tasks-spilling-data-to-disk-tp15216.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.




Re: Are map tasks spilling data to disk?

2015-11-15 Thread Reynold Xin
It depends on what the next operator is. If the next operator is just an
aggregation, then no, the hash join won't write anything to disk. It will
just stream the data through to the next operator. If the next operator is
shuffle (exchange), then yes.
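
A hedged Scala/SQL illustration of the two cases, assuming two registered temp tables, big and small, with small below the broadcast threshold so the join is planned as a BroadcastHashJoin (all names are placeholders):

// Case 1: the join output feeds an aggregation in the same stage. The joined
// rows are streamed straight into the aggregate operator; the join output
// itself is never written to disk (only the small partial aggregates go
// through a later exchange for the GROUP BY).
val agg = sqlContext.sql(
  "SELECT b.k, COUNT(*) AS cnt FROM big b JOIN small s ON b.k = s.k GROUP BY b.k")

// Case 2: the join output feeds an exchange, e.g. an explicit repartition or
// a shuffle required downstream. Here the joined rows become the map output
// of that shuffle and are written to each worker's local disk.
val shuffled = sqlContext.sql(
  "SELECT b.k, b.v FROM big b JOIN small s ON b.k = s.k").repartition(200)

// explain() shows the physical plan and whether an Exchange follows the join.
shuffled.explain()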

On Sun, Nov 15, 2015 at 10:52 AM, gsvic  wrote:

> According to  this paper
> <
> http://www.cs.berkeley.edu/~kubitron/courses/cs262a-F13/projects/reports/project16_report.pdf
> >
> Spark's map tasks write their results to disk.
>
> My actual question is, in  BroadcastHashJoin
> <
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/BroadcastHashJoin.scala#L100
> >
> doExecute() method at line  109 the mapPartitions
> <
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/BroadcastHashJoin.scala#L109
> >
> method is called. At this step, Spark will schedule a number of tasks for
> execution in order to perform the hash join operation. The results of these
> tasks will be written to each worker's disk?
>
>
>
> --
> View this message in context:
> http://apache-spark-developers-list.1001551.n3.nabble.com/Are-map-tasks-spilling-data-to-disk-tp15216.html
> Sent from the Apache Spark Developers List mailing list archive at
> Nabble.com.
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>
>


Re: A proposal for Spark 2.0

2015-11-15 Thread Prashant Sharma
Hey Matei,


> Regarding Scala 2.12, we should definitely support it eventually, but I
> don't think we need to block 2.0 on that because it can be added later too.
> Has anyone investigated what it would take to run on there? I imagine we
> don't need many code changes, just maybe some REPL stuff.


Our REPL-specific changes were merged in scala/scala and are available as
part of 2.11.7, and will hopefully be part of 2.12 too. If I am not wrong, the
REPL stuff is taken care of; we don't need to keep upgrading the REPL code for
every Scala release now. http://www.scala-lang.org/news/2.11.7

I am +1 on the proposal for Spark 2.0.

Thanks,


Prashant Sharma



On Thu, Nov 12, 2015 at 3:02 AM, Matei Zaharia 
wrote:

> I like the idea of popping out Tachyon to an optional component too to
> reduce the number of dependencies. In the future, it might even be useful
> to do this for Hadoop, but it requires too many API changes to be worth
> doing now.
>
> Regarding Scala 2.12, we should definitely support it eventually, but I
> don't think we need to block 2.0 on that because it can be added later too.
> Has anyone investigated what it would take to run on there? I imagine we
> don't need many code changes, just maybe some REPL stuff.
>
> Needless to say, but I'm all for the idea of making "major" releases as
> undisruptive as possible in the model Reynold proposed. Keeping everyone
> working with the same set of releases is super important.
>
> Matei
>
> > On Nov 11, 2015, at 4:58 AM, Sean Owen  wrote:
> >
> > On Wed, Nov 11, 2015 at 12:10 AM, Reynold Xin 
> wrote:
> >> to the Spark community. A major release should not be very different
> from a
> >> minor release and should not be gated based on new features. The main
> >> purpose of a major release is an opportunity to fix things that are
> broken
> >> in the current API and remove certain deprecated APIs (examples follow).
> >
> > Agree with this stance. Generally, a major release might also be a
> > time to replace some big old API or implementation with a new one, but
> > I don't see obvious candidates.
> >
> > I wouldn't mind turning attention to 2.x sooner than later, unless
> > there's a fairly good reason to continue adding features in 1.x to a
> > 1.7 release. The scope as of 1.6 is already pretty darned big.
> >
> >
> >> 1. Scala 2.11 as the default build. We should still support Scala 2.10,
> but
> >> it has been end-of-life.
> >
> > By the time 2.x rolls around, 2.12 will be the main version, 2.11 will
> > be quite stable, and 2.10 will have been EOL for a while. I'd propose
> > dropping 2.10. Otherwise it's supported for 2 more years.
> >
> >
> >> 2. Remove Hadoop 1 support.
> >
> > I'd go further to drop support for <2.2 for sure (2.0 and 2.1 were
> > sort of 'alpha' and 'beta' releases) and even <2.6.
> >
> > I'm sure we'll think of a number of other small things -- shading a
> > bunch of stuff? reviewing and updating dependencies in light of
> > simpler, more recent dependencies to support from Hadoop etc?
> >
> > Farming out Tachyon to a module? (I felt like someone proposed this?)
> > Pop out any Docker stuff to another repo?
> > Continue that same effort for EC2?
> > Farming out some of the "external" integrations to another repo (?
> > controversial)
> >
> > See also anything marked version "2+" in JIRA.
> >
> > -
> > To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> > For additional commands, e-mail: dev-h...@spark.apache.org
> >
>
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>
>


Hive Context incompatible with Sentry enabled Cluster

2015-11-15 Thread Charmee Patel
Hi,

We have recently run into this issue:
https://issues.apache.org/jira/browse/SPARK-9042

My organization's application reads raw data from files, processes/cleanses
it and pushes the results to Hive tables. To keep reads efficient, we have
partitioned our tables. In a Sentry-enabled cluster, our writes to Hive
tables fail, as Hive Context tries to edit partitions in the metastore directly
and Sentry has disabled direct edits to the Hive metastore.

After discussing our options with Cloudera Support, the current workaround for
us is to generate a bunch of files at the end of the Spark process and open a
separate connection to HiveServer2 to load those files. We can change our
tables to external tables to reduce data movement. Regardless, it's a stopgap
measure, as we need to open a separate connection to HiveServer2 to manage the
partitions.
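
For context, a minimal Scala sketch of that kind of workaround, under stated assumptions: the DataFrame, staging path, table name, JDBC URL and credentials below are all placeholders, and the Hive JDBC driver is assumed to be on the classpath:

import java.sql.DriverManager

// 1. Spark writes the processed output to a staging location only
//    (no metastore calls from Spark itself).
val cleansedDf = sqlContext.table("staging_events")  // stand-in for the cleansed DataFrame
cleansedDf.write.parquet("/staging/events/dt=2015-11-15")

// 2. A separate HiveServer2 connection, which Sentry can authorize, attaches
//    the new data to the partitioned (external) table.
Class.forName("org.apache.hive.jdbc.HiveDriver")
val conn = DriverManager.getConnection(
  "jdbc:hive2://hiveserver2-host:10000/default", "app_user", "")
val stmt = conn.createStatement()
stmt.execute(
  "ALTER TABLE events ADD IF NOT EXISTS PARTITION (dt='2015-11-15') " +
  "LOCATION '/staging/events/dt=2015-11-15'")
stmt.close()
conn.close()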

This also affects all Hive CTAS statements and DDL supported from within Hive
Context. We'd like to know where Hive support within Spark is headed with
security products like Sentry or Ranger in place.

Thanks,
Charmee


Re: Support for local disk columnar storage for DataFrames

2015-11-15 Thread Reynold Xin
This (updates) is something we are going to think about in the next release
or two.
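
For reference, a minimal sketch of the manual workaround for the checkpointing part of the request quoted below (materialize the DataFrame and read it back to truncate the logical plan); the path, names and use of Parquet are assumptions for illustration:

// "Checkpoint" a DataFrame by writing it out and re-reading it, replacing the
// accumulated logical plan with a simple scan. Unlike a real checkpoint this
// is synchronous and nothing is pruned automatically.
def checkpointDf(df: org.apache.spark.sql.DataFrame,
                 sqlContext: org.apache.spark.sql.SQLContext,
                 path: String): org.apache.spark.sql.DataFrame = {
  df.write.mode("overwrite").parquet(path)
  sqlContext.read.parquet(path)
}

// Hypothetical usage inside an iterative loop:
// bigDf = checkpointDf(bigDf, sqlContext, s"/tmp/df-checkpoints/iter-$i")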

On Thu, Nov 12, 2015 at 8:57 AM, Cristian O  wrote:

> Sorry, apparently only replied to Reynold, meant to copy the list as well,
> so I'm self replying and taking the opportunity to illustrate with an
> example.
>
> Basically I want to conceptually do this:
>
> val bigDf = sqlContext.sparkContext.parallelize((1 to 100)).map(i => (i, 1)).toDF("k", "v")
> val deltaDf = sqlContext.sparkContext.parallelize(Array(1, 5)).map(i => (i, 1)).toDF("k", "v")
>
> bigDf.cache()
>
> bigDf.registerTempTable("big")
> deltaDf.registerTempTable("delta")
>
> val newBigDf = sqlContext.sql(
>   "SELECT big.k, big.v + IF(delta.v is null, 0, delta.v) FROM big LEFT JOIN delta on big.k = delta.k")
>
> newBigDf.cache()
> bigDf.unpersist()
>
>
> This is essentially an update of keys "1" and "5" only, in a dataset
> of 1 million keys.
>
> This could be achieved efficiently if the join preserved the cached blocks
> that are unaffected and only copied and mutated the 2 affected blocks
> corresponding to the matching join keys.
>
> Statistics can determine which blocks actually need mutating. Note also
> that shuffling is not required assuming both dataframes are pre-partitioned
> by the same key K.
>
> In SQL this could actually be expressed as an UPDATE statement or for a
> more generalized use as a MERGE UPDATE:
> https://technet.microsoft.com/en-us/library/bb522522(v=sql.105).aspx
>
> While this may seem like a very special case optimization, it would
> effectively implement UPDATE support for cached DataFrames, for both
> optimal and non-optimal usage.
>
> I appreciate there's quite a lot here, so thank you for taking the time to
> consider it.
>
> Cristian
>
>
>
> On 12 November 2015 at 15:49, Cristian O 
> wrote:
>
>> Hi Reynold,
>>
>> Thanks for your reply.
>>
>> Parquet may very well be used as the underlying implementation, but this
>> is about more than a particular storage representation.
>>
>> There are a few things here that are inter-related and open different
>> possibilities, so it's hard to structure, but I'll give it a try:
>>
>> 1. Checkpointing DataFrames - while a DF can be saved locally as parquet,
>> just using that as a checkpoint would currently require explicitly reading
>> it back. A proper checkpoint implementation would just save (perhaps
>> asynchronously) and prune the logical plan while allowing one to continue
>> using the same DF, now backed by the checkpoint.
>>
>> It's important to prune the logical plan to avoid all kinds of issues
>> that may arise from unbounded expansion with iterative use-cases, like this
>> one I encountered recently:
>> https://issues.apache.org/jira/browse/SPARK-11596
>>
>> But really what I'm after here is:
>>
>> 2. Efficient updating of cached DataFrames - The main use case here is
>> keeping a relatively large dataset cached and updating it iteratively from
>> streaming. For example one would like to perform ad-hoc queries on an
>> incrementally updated, cached DataFrame. I expect this is already becoming
>> an increasingly common use case. Note that the dataset may require merging
>> (like adding) or overriding values by key, so simply appending is not
>> sufficient.
>>
>> This is very similar in concept with updateStateByKey for regular RDDs,
>> i.e. an efficient copy-on-write mechanism, albeit perhaps at CachedBatch
>> level  (the row blocks for the columnar representation).
>>
>> This can currently be simulated with UNION or (OUTER) JOINs; however, this is
>> very inefficient, as it requires copying and recaching the entire dataset and
>> unpersisting the original one. There are also the aforementioned problems
>> with unbounded logical plans (physical plans are fine).
>>
>> These two together, checkpointing and updating cached DataFrames, would
>> give fault-tolerant efficient updating of DataFrames, meaning streaming
>> apps can take advantage of the compact columnar representation and Tungsten
>> optimisations.
>>
>> I'm not quite sure if something like this can be achieved by other means
>> or has been investigated before, hence why I'm looking for feedback here.
>>
>> While one could use external data stores, they would have the added IO
>> penalty, plus most of what's available at the moment is either HDFS
>> (extremely inefficient for updates) or key-value stores that have 5-10x
>> space overhead over columnar formats.
>>
>> Thanks,
>> Cristian
>>
>>
>>
>>
>>
>>
>> On 12 November 2015 at 03:31, Reynold Xin  wrote:
>>
>>> Thanks for the email. Can you explain what the difference is between
>>> this and existing formats such as Parquet/ORC?
>>>
>>>
>>> On Wed, Nov 11, 2015 at 4:59 AM, Cristian O <
>>> cristian.b.op...@googlemail.com> wrote:
>>>
 Hi,

 I was wondering if there's any planned support for local disk columnar
 storage.

 This could be an extension of the in-memory columnar store, or possibly
 something similar to the recently add

Hive on Spark Vs Spark SQL

2015-11-15 Thread kiran lonikar
I would like to know whether Hive on Spark uses or shares execution code with
Spark SQL or DataFrames.

More specifically, does Hive on Spark benefit from the changes made to Spark
SQL, i.e. Project Tungsten? Or is it a completely different execution path,
where it creates its own plan and executes on RDDs?

-Kiran


Re: Hive on Spark Vs Spark SQL

2015-11-15 Thread Reynold Xin
It's a completely different path.


On Sun, Nov 15, 2015 at 10:37 PM, kiran lonikar  wrote:

> I would like to know if Hive on Spark uses or shares the execution code
> with Spark SQL or DataFrames?
>
> More specifically, does Hive on Spark benefit from the changes made to
> Spark SQL, project Tungsten? Or is it completely different execution path
> where it creates its own plan and executes on RDD?
>
> -Kiran
>
>


Re: Hive on Spark Vs Spark SQL

2015-11-15 Thread kiran lonikar
So it does not benefit from Project Tungsten, right?


On Mon, Nov 16, 2015 at 12:07 PM, Reynold Xin  wrote:

> It's a completely different path.
>
>
> On Sun, Nov 15, 2015 at 10:37 PM, kiran lonikar  wrote:
>
>> I would like to know if Hive on Spark uses or shares the execution code
>> with Spark SQL or DataFrames?
>>
>> More specifically, does Hive on Spark benefit from the changes made to
>> Spark SQL, project Tungsten? Or is it completely different execution path
>> where it creates its own plan and executes on RDD?
>>
>> -Kiran
>>
>>
>


Re: Hive on Spark Vs Spark SQL

2015-11-15 Thread Reynold Xin
No it does not -- although it'd benefit from some of the work to make
shuffle more robust.


On Sun, Nov 15, 2015 at 10:45 PM, kiran lonikar  wrote:

> So does not benefit from Project Tungsten right?
>
>
> On Mon, Nov 16, 2015 at 12:07 PM, Reynold Xin  wrote:
>
>> It's a completely different path.
>>
>>
>> On Sun, Nov 15, 2015 at 10:37 PM, kiran lonikar 
>> wrote:
>>
>>> I would like to know if Hive on Spark uses or shares the execution code
>>> with Spark SQL or DataFrames?
>>>
>>> More specifically, does Hive on Spark benefit from the changes made to
>>> Spark SQL, project Tungsten? Or is it completely different execution path
>>> where it creates its own plan and executes on RDD?
>>>
>>> -Kiran
>>>
>>>
>>
>


releasing Spark 1.4.2

2015-11-15 Thread Niranda Perera
Hi,

I am wondering when Spark 1.4.2 will be released.

Is it in the voting stage at the moment?

rgds

-- 
Niranda
@n1r44 
+94-71-554-8430
https://pythagoreanscript.wordpress.com/