Re: Difference between Data set and Data Frame in Spark 2

2016-09-01 Thread Ovidiu-Cristian MARCU
…where he makes an eloquent case for their merits and motivation, while also elaborating on RDDs. > https://youtu.be/1a4pgYzeFwE > Cheers > Jules

Re: Difference between Data set and Data Frame in Spark 2

2016-09-01 Thread Ovidiu-Cristian MARCU
Thank you, I like and agree with your point. RDDs evolved into Datasets by means of an optimizer. I just wonder what the use cases for RDDs are (other than the current version of GraphX leveraging RDDs)? Best, Ovidiu > On 01 Sep 2016, at 16:26, Sean Owen wrote: > Here's my paraphrase: > Dataset…
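For readers skimming the thread, a minimal sketch (spark-shell style, Spark 2.0; the Person case class is hypothetical, not from the thread) of the contrast being discussed:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder.appName("DatasetVsRdd").getOrCreate()
    import spark.implicits._

    case class Person(name: String, age: Int)

    // Dataset: typed, and column expressions go through the Catalyst optimizer.
    val people = Seq(Person("Ann", 30), Person("Bob", 17)).toDS()
    val adults = people.filter($"age" >= 18)

    // RDD: no optimizer, but full low-level control over partitioning and objects.
    val pairs = people.rdd.map(p => (p.name, p.age))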

Re: Have I done everything correctly when subscribing to Spark User List

2016-08-08 Thread Ovidiu-Cristian MARCU
The yellow warning message is probably even more confusing than not receiving an answer/opinion on his post. Best, Ovidiu > On 08 Aug 2016, at 20:10, Sean Owen wrote: > I also don't know what's going on with the "This post has NOT been accepted by the mailing list yet" message, because…

Re: ORC v/s Parquet for Spark 2.0

2016-07-26 Thread Ovidiu-Cristian MARCU
Interesting opinion, thank you. Still, according to its website, Parquet is basically inspired by Dremel (Google) [1], and parts of ORC have been enhanced while deployed at Facebook and Yahoo [2]. Other than this presentation [3], do you know of any other benchmarks? [1] https://parquet.apache.org/documentation…

Re: ORC v/s Parquet for Spark 2.0

2016-07-26 Thread Ovidiu-Cristian MARCU
So did you actually try to run your use case with Spark 2.0 and ORC files? It's hard to understand your 'apparently…'. Best, Ovidiu > On 26 Jul 2016, at 13:10, Gourav Sengupta wrote: > If you have ever tried to use ORC via SPARK you will know that SPARK's promise of accessing ORC files i…
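For anyone wanting to reproduce the comparison, a minimal sketch; the input path and column are hypothetical, and in Spark 2.0 the ORC path still goes through the Hive libraries:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder.appName("OrcVsParquet").enableHiveSupport().getOrCreate()
    val df = spark.read.json("/tmp/events.json") // hypothetical input

    // Parquet is Spark's default columnar format; filter pushdown is on by default.
    df.write.mode("overwrite").parquet("/tmp/events_parquet")

    // For ORC, predicate pushdown must be enabled explicitly in 2.0.
    spark.conf.set("spark.sql.orc.filterPushdown", "true")
    df.write.mode("overwrite").orc("/tmp/events_orc")

    // Same query against both copies, for a like-for-like timing.
    spark.read.parquet("/tmp/events_parquet").where("value > 0").count()
    spark.read.orc("/tmp/events_orc").where("value > 0").count()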

Re: Strategies for properly load-balanced partitioning

2016-06-03 Thread Ovidiu-Cristian MARCU
I suppose you are running 1.6. I guess you need a solution based on the [1], [2] features, which are coming in 2.0. [1] https://issues.apache.org/jira/browse/SPARK-12538 / [2] https://issues.apache.org/jira/browse/SPARK-12394
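A minimal sketch, assuming Spark 2.0 and a hypothetical `events` DataFrame, of two write-time ways to influence partition balance (bucketed writes being the kind of feature the linked JIRAs track):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder.appName("Partitioning").getOrCreate()
    val events = spark.read.parquet("/tmp/events") // hypothetical input

    // Hash-repartition by a well-distributed key to even out partition sizes.
    val balanced = events.repartition(64, events("userId"))

    // Bucketed output (Spark 2.0): pre-hash rows into a fixed number of buckets
    // so later joins/aggregations on the key can avoid a shuffle.
    balanced.write
      .bucketBy(64, "userId")
      .sortBy("userId")
      .saveAsTable("events_bucketed")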

Re: Splitting RDD to exact number of partitions

2016-05-31 Thread Ovidiu-Cristian MARCU
Hi Ted, any chance of elaborating on the SQLConf parameters, i.e. adding more explanation of when to change these settings? Not all of them are made clear in the descriptions. Thanks! Best, Ovidiu > On 31 May 2016, at 16:30, Ted Yu wrote: > Maciej: > You can refer to the doc in > sql/…
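In the meantime, a minimal sketch (Spark 2.0 API) of inspecting and overriding one such SQLConf setting per session; the threshold value here is illustrative:

    val spark = org.apache.spark.sql.SparkSession.builder.appName("ConfDemo").getOrCreate()

    // Read the current value (10 MB by default).
    println(spark.conf.get("spark.sql.autoBroadcastJoinThreshold"))

    // Override for this session only; the SQL SET command is equivalent.
    spark.conf.set("spark.sql.autoBroadcastJoinThreshold", (20 * 1024 * 1024).toString)
    spark.sql("SET spark.sql.autoBroadcastJoinThreshold=20971520")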

Re: Using Spark on Hive with Hive also using Spark as its execution engine

2016-05-30 Thread Ovidiu-Cristian MARCU
Could Spark relate to Tez the way a Flink runner relates to Apache Beam? The Tez use case may be interesting, however (though the current implementation is YARN-based only?). Spark is efficient (or faster) for a number of reasons, including its ‘in-memory’ execution (from my understanding and experiments).
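On the ‘in-memory’ point, a minimal sketch of what such experiments typically exercise, assuming the shell-provided SparkContext `sc` and a hypothetical input path:

    import org.apache.spark.storage.StorageLevel

    val lines = sc.textFile("/tmp/input.txt").persist(StorageLevel.MEMORY_ONLY)

    // The first action materializes the cache; later jobs reuse the in-memory
    // partitions instead of re-reading the source.
    lines.count()
    lines.filter(_.contains("ERROR")).count()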

Re: What / Where / When / How questions in Spark 2.0 ?

2016-05-21 Thread Ovidiu-Cristian MARCU
…link to the “Technical Vision” paper, so there it is: > https://docs.google.com/document/d/1y4qlQinjjrusGWlgq-mYmbxRW2z7-_X5Xax-GG0YsC0/edit?usp=sharing > From: "Sela, Amit" <ans...@paypal.com> > Date: Saturday, May 21, 2016 at 11:52 PM > To: Ovidiu…

Re: Spark.default.parallelism cannot set reduce number

2016-05-20 Thread Ovidiu-Cristian MARCU
You can check org.apache.spark.sql.internal.SQLConf for other default settings as well:

    val SHUFFLE_PARTITIONS = SQLConfigBuilder("spark.sql.shuffle.partitions")
      .doc("The default number of partitions to use when shuffling data for joins or aggregations.")
      .intConf
      .createWithDefault(200)
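This is why spark.default.parallelism does not set the reducer count for DataFrame/SQL queries: those shuffles read spark.sql.shuffle.partitions instead. A sketch with illustrative values:

    val spark = org.apache.spark.sql.SparkSession.builder
      .appName("ReducerCount")
      .config("spark.default.parallelism", "8")    // applies to RDD shuffles
      .config("spark.sql.shuffle.partitions", "8") // applies to SQL/DataFrame shuffles
      .getOrCreate()

    import org.apache.spark.sql.functions.col
    val grouped = spark.range(1000).groupBy((col("id") % 10).as("k")).count()
    println(grouped.rdd.getNumPartitions) // 8, taken from spark.sql.shuffle.partitions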

Re: What / Where / When / How questions in Spark 2.0 ?

2016-05-16 Thread Ovidiu-Cristian MARCU
…streaming-102> > On 16 May 2016, at 14:18, Ovidiu-Cristian MARCU wrote: > Hi, We can see in [2] many interesting (and expected!) improvements (promises) like extended SQL support, unified API (DataFrames, Datasets), improved engine (Tungsten rela…

What / Where / When / How questions in Spark 2.0 ?

2016-05-16 Thread Ovidiu-Cristian MARCU
Hi, We can see in [2] many interesting (and expected!) improvements (promises) like extended SQL support, a unified API (DataFrames, Datasets), an improved engine (Tungsten draws on ideas from modern compilers and MPP databases, similar to Flink [3]), structured streaming, etc. It seems we somehow…
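On the structured streaming item, a minimal sketch of the Spark 2.0 API, using the toy socket source for experimentation:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder.appName("StructuredStreaming").getOrCreate()
    import spark.implicits._

    // Read a stream of lines from a socket.
    val lines = spark.readStream
      .format("socket")
      .option("host", "localhost")
      .option("port", 9999)
      .load()

    // Classic word count, expressed as an ordinary DataFrame aggregation.
    val counts = lines.as[String].flatMap(_.split(" ")).groupBy("value").count()

    val query = counts.writeStream
      .outputMode("complete")
      .format("console")
      .start()

    query.awaitTermination()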

Re: Apache Flink

2016-04-17 Thread Ovidiu-Cristian MARCU

Re: Apache Flink

2016-04-17 Thread Ovidiu-Cristian MARCU

Re: Apache Flink

2016-04-17 Thread Ovidiu-Cristian MARCU

Re: Apache Flink

2016-04-17 Thread Ovidiu-Cristian MARCU
…without an answer from the Spark team: do they plan to do something similar? > On 17 Apr 2016, at 15:33, Silvio Fiorito wrote: > Actually there were multiple responses to it on the GitHub project, including a PR to improve the Spark code, but they weren’t acknowledged. > …

Re: Apache Flink

2016-04-17 Thread Ovidiu-Cristian MARCU
You probably read this benchmark from Yahoo; any comments from Spark? https://yahooeng.tumblr.com/post/135321837876/benchmarking-streaming-computation-engines-at > On 17 Apr 2016, at 12:41, andy petrella…

Re: Graphx

2016-03-11 Thread Ovidiu-Cristian MARCU
Hi, I wonder what Spark version and parameter configuration you used. I was able to run CC on 1.8bn edges in about 8 minutes (23 iterations) using 16 nodes with around 80GB of RAM each (Spark 1.5, default parameters). John: I suppose your C++ app (algorithm) does not scale if you used o…
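For comparison, a minimal sketch of such a run, assuming the shell-provided SparkContext `sc`; the path and partition count are illustrative:

    import org.apache.spark.graphx.{GraphLoader, PartitionStrategy}

    val graph = GraphLoader
      .edgeListFile(sc, "hdfs:///data/edges.txt", numEdgePartitions = 256)
      .partitionBy(PartitionStrategy.RandomVertexCut)

    val cc = graph.connectedComponents()
    println(cc.vertices.map(_._2).distinct().count()) // number of components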

Re: off-heap certain operations

2016-02-16 Thread Ovidiu-Cristian MARCU
> …developer to know whether to use it, and if you're a developer and curious, you can just grep the code for this flag, and/or read into what Tungsten does. > Personally, I would leave this off. > On Fri, Feb 12, 2016 at 6:10 PM, Ovidiu-Cristian MARCU wrote: >> …

Lost executors, failed job: unable to execute Spark examples Triangle Count (Analytics triangles)

2016-02-16 Thread Ovidiu-Cristian MARCU
Hi, I am able to run the Triangle Count example with some smaller graphs, but when I use http://snap.stanford.edu/data/com-Friendster.html I am not able to get the job to finish. For some reason Spark loses its executors. No matter what…
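A minimal sketch of the computation in question, assuming the shell-provided SparkContext `sc`; the HDFS path, partition count, and strategy are illustrative, not tuned recommendations:

    import org.apache.spark.graphx.{GraphLoader, PartitionStrategy}

    val graph = GraphLoader
      .edgeListFile(sc, "hdfs:///data/com-friendster.txt",
        canonicalOrientation = true, // required by TriangleCount
        numEdgePartitions = 512)
      .partitionBy(PartitionStrategy.RandomVertexCut)

    val triangles = graph.triangleCount()
    // Each triangle is counted once per vertex, hence the division by 3.
    println(triangles.vertices.map(_._2.toLong).reduce(_ + _) / 3)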

Spark examples Analytics ConnectedComponents - keeps running, nothing in output

2016-02-16 Thread Ovidiu-Cristian MARCU
Hi, I’m trying to run Analytics cc (ConnectedComponents), but it runs without ending. The logs look fine, but I just keep getting “Job xyz finished, reduce took …”:
INFO DAGScheduler: Job 29 finished: reduce at VertexRDDImpl.scala:90, took 14.828033 s
INFO DAGScheduler: Job 30 finished…
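Each Pregel superstep in ConnectedComponents runs as its own job, which is why the DAGScheduler prints a new “Job N finished: reduce at VertexRDDImpl” line per iteration; convergence on a large graph can legitimately take many of these. A hedged sketch, assuming a GraphX version whose ConnectedComponents.run accepts a maxIterations argument, of capping a slowly-converging run:

    import org.apache.spark.graphx.lib.ConnectedComponents

    // Stop after 50 supersteps even if labels have not fully converged;
    // check that your Spark version has this overload before relying on it.
    val cc = ConnectedComponents.run(graph, maxIterations = 50)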

Re: off-heap certain operations

2016-02-12 Thread Ovidiu-Cristian MARCU
I found nothing about the “certain operations”. Still not clear; “certain” is poor documentation. Can someone give an answer so I can consider using this new release? spark.memory.offHeap.enabled: “If true, Spark will attempt to use off-heap memory for certain operations.” > On 12 Feb 2016, at 13:21, …

off-heap certain operations

2016-02-11 Thread Ovidiu-Cristian MARCU
Hi, Reading through the latest documentation on memory management, I can see that the parameter spark.memory.offHeap.enabled (false by default) is described with ‘If true, Spark will attempt to use off-heap memory for certain operations’ [1]. Can you please describe the certain operations you are…
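For anyone experimenting with the flag, a minimal sketch of enabling it; note that spark.memory.offHeap.size must be set to a positive value alongside it:

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("OffHeapExperiment")
      .set("spark.memory.offHeap.enabled", "true")
      .set("spark.memory.offHeap.size", (1L << 30).toString) // 1 GiB, in bytes

    val sc = new SparkContext(conf)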