Re: How to share large resources like dictionaries while processing data with Spark?

2015-06-04 Thread Yiannis Gkoufas
Hi there, I would recommend checking out https://github.com/spark-jobserver/spark-jobserver which I think gives the functionality you are looking for. I haven't tested it though. BR On 5 June 2015 at 01:35, Olivier Girardot wrote: > You can use it as a broadcast variable, but if it's "too" lar
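
A minimal sketch of the broadcast-variable approach mentioned in the quoted reply, assuming the dictionary fits comfortably in executor memory (the paths and field layout are illustrative, not from the thread):

    // Hypothetical sketch: build the dictionary once on the driver,
    // then ship a single read-only copy to each executor via a broadcast.
    val dictionary: Map[String, String] = sc.textFile("/dict.tsv")
      .map(_.split("\t"))
      .map(fields => (fields(0), fields(1)))
      .collectAsMap()
      .toMap

    val dictBroadcast = sc.broadcast(dictionary)

    val enriched = sc.textFile("/data.txt").map { line =>
      // Tasks read the executor-local copy instead of serializing the map with every closure.
      (line, dictBroadcast.value.getOrElse(line, "UNKNOWN"))
    }

If the dictionary is too large to broadcast, the usual fallback is to load it as a second RDD and join, or to query an external store from the tasks.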

Sorted Multiple Outputs

2015-07-14 Thread Yiannis Gkoufas
Hi there, I have been using the approach described here: http://stackoverflow.com/questions/23995040/write-to-multiple-outputs-by-key-spark-one-spark-job In addition to that, I was wondering if there is a way to customize the order of the values contained in each file. Thanks a lot!
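
One possible way to impose an order on each key's values before they are written out, sketched on top of the groupByKey-style approach from the linked answer (input format and paths are illustrative):

    // Hypothetical sketch: sort each key's values explicitly before writing.
    // Assumes all values for a single key fit in memory, as groupByKey requires.
    val pairs = sc.textFile("/input").map { line =>
      val fields = line.split(",")
      (fields(0), fields(1))                 // (key, value)
    }

    val sortedPerKey = pairs
      .groupByKey()
      .mapValues(_.toSeq.sorted)             // customize the per-file value order here

    // The per-key files themselves would still be produced with a MultipleTextOutputFormat
    // as in the linked Stack Overflow answer; this step only guarantees the value order.
    sortedPerKey.flatMapValues(identity).saveAsTextFile("/output")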

Re: Sorted Multiple Outputs

2015-07-16 Thread Yiannis Gkoufas
t all values for specific key will > reside in just one partition, but it might happen that one partition might > contain more, than one key (with values). This I’m not sure, but that > shouldn’t be a big deal as you would iterate over tuple Iterable> and store one key to a specific file. &

Re: Spark Metrics Framework?

2016-04-01 Thread Yiannis Gkoufas
Hi Mike, I am forwarding you a mail I sent a while ago regarding some related work I did, hope you find it useful. Hi all, I recently sent a mail to the dev mailing list about this contribution, but I thought it might be useful to post it here, since I have seen a lot of people asking about OS-level met

Executors killed in Workers with Error: invalid log directory

2016-06-22 Thread Yiannis Gkoufas
Hi there, I have been getting a strange error in spark-1.6.1. The submitted job uses only the executor launched on the Master node while the other workers are idle. When I check the errors from the web UI to investigate the killed executors I see the error: Error: invalid log directory /disk/spa

Networking problems in Spark 1.6.0

2016-01-05 Thread Yiannis Gkoufas
Hi there, I have been using Spark 1.5.2 on my cluster without a problem and wanted to try Spark 1.6.0. I have the exact same configuration on both clusters. I am able to start the Standalone Cluster but I fail to submit a job, getting errors like the following: 16/01/05 14:24:14 INFO AppClient$Cli

Re: Networking problems in Spark 1.6.0

2016-01-05 Thread Yiannis Gkoufas
un Java 8? > > Dean Wampler, Ph.D. > Author: Programming Scala, 2nd Edition > <http://shop.oreilly.com/product/0636920033073.do> (O'Reilly) > Typesafe <http://typesafe.com> > @deanwampler <http://twitter.com/deanwampler> > http://polyglotprogramming.com > &g

Re: Networking problems in Spark 1.6.0

2016-01-05 Thread Yiannis Gkoufas
afe.com> > @deanwampler <http://twitter.com/deanwampler> > http://polyglotprogramming.com > > On Tue, Jan 5, 2016 at 9:01 AM, Yiannis Gkoufas > wrote: > >> Hi Dean, >> >> thanks so much for the response! It works without a problem now! >> >&g

Trying to understand dynamic resource allocation

2016-01-11 Thread Yiannis Gkoufas
Hi, I am exploring the dynamic resource allocation provided by the Standalone Cluster Mode and I was wondering whether the behavior I am experiencing is expected. In my configuration I have 3 slaves with 24 cores each. I have in my spark-defaults.conf: spark.shuffle.service.enabled true sp
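
For reference, a typical spark-defaults.conf fragment for dynamic allocation on a standalone cluster looks roughly like the following; the exact limits are illustrative and not taken from the original mail:

    spark.shuffle.service.enabled        true
    spark.dynamicAllocation.enabled      true
    spark.dynamicAllocation.minExecutors 1
    spark.dynamicAllocation.maxExecutors 9

In standalone mode the external shuffle service runs inside each Worker once spark.shuffle.service.enabled is set to true in the workers' configuration, so executors can be released without losing their shuffle files.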

Re: spark metrics question

2016-02-03 Thread Yiannis Gkoufas
Hi Matt, there is some related work I recently did at IBM Research on visualizing the metrics produced. You can read about it here: http://www.spark.tc/sparkoscope-enabling-spark-optimization-through-cross-stack-monitoring-and-visualization-2/ We recently open-sourced it if you are interested to ha

SparkOscope: Enabling Spark Optimization through Cross-stack Monitoring and Visualization

2016-02-03 Thread Yiannis Gkoufas
Hi all, I recently sent a mail to the dev mailing list about this contribution, but I thought it might be useful to post it here, since I have seen a lot of people asking about OS-level metrics for Spark. This is the result of the work we have been doing recently at IBM Research around Spark. Essentially,

Re: spark metrics question

2016-02-03 Thread Yiannis Gkoufas
pre-deployed on all Executor nodes? > > On Wed, Feb 3, 2016 at 10:36 AM, Yiannis Gkoufas > wrote: > >> Hi Matt, >> >> there is some related work I recently did in IBM Research for visualizing >> the metrics produced. >> You can read about it here >>

Avoid Shuffling on Partitioned Data

2015-12-04 Thread Yiannis Gkoufas
Hi there, I have my data stored in HDFS partitioned by month in Parquet format. The directory looks like this: -month=201411 -month=201412 -month=201501 - I want to compute some aggregates for every timestamp. How is it possible to achieve that by taking advantage of the existing partitionin
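
A minimal sketch of reading such a layout so that only the needed month directories get scanned, assuming Spark's partition discovery picks up the month=... directories (paths and column names are illustrative):

    // Hypothetical sketch: partition pruning on a month-partitioned Parquet dataset.
    val df = sqlContext.read.parquet("/data")          // discovers the month=... partitions

    // A filter on the partition column lets Spark skip the other month directories
    // entirely rather than scanning and shuffling the full dataset.
    val january = df.filter(df("month") === 201501)

    val aggregates = january.groupBy("timestamp").count()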

Re: Sorted Multiple Outputs

2015-08-14 Thread Yiannis Gkoufas
ld be to try to split RDD into multiple RDDs by key. > You can get distinct keys, collect them on driver and have a loop over they > keys and filter out new RDD out of the original one by that key. > > for( key : keys ) { > RDD.filter( key ).saveAsTextfile() > } > >

Setting the number of executors in standalone mode

2015-02-20 Thread Yiannis Gkoufas
Hi there, I am trying to increase the number of executors per worker in standalone mode but have failed to achieve that. I loosely followed the instructions of this thread: http://stackoverflow.com/questions/26645293/spark-configuration-memory-instance-cores and did this: spark.executor.memory 1g S
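
For reference, at that time the usual way to get more than one executor per machine in standalone mode was to run several worker daemons per node via spark-env.sh; the values below are illustrative:

    # spark-env.sh on each worker node (illustrative values)
    SPARK_WORKER_INSTANCES=3     # three worker daemons per machine
    SPARK_WORKER_CORES=8         # cores offered by each worker
    SPARK_WORKER_MEMORY=8g       # memory offered by each worker

Each application then gets at most one executor per worker, so with three workers per node it can run up to three executors per machine, each capped by spark.executor.memory.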

Re: Setting the number of executors in standalone mode

2015-02-20 Thread Yiannis Gkoufas
Spark on each worker node. Nothing to do with > # of executors. > > > > > > Mohammed > > > > *From:* Yiannis Gkoufas [mailto:johngou...@gmail.com] > *Sent:* Friday, February 20, 2015 4:55 AM > *To:* user@spark.apache.org > *Subject:* Setting the numb

Re: Worker and Nodes

2015-02-21 Thread Yiannis Gkoufas
Hi, I have experienced the same behavior. You are talking about standalone cluster mode, right? BR On 21 February 2015 at 14:37, Deep Pradhan wrote: > Hi, > I have been running some jobs in my local single node stand alone cluster. > I am varying the worker instances for the same job, and the t

Broadcast Variable updated from one transformation and used from another

2015-02-24 Thread Yiannis Gkoufas
Hi all, I am trying to do the following. val myObject = new MyObject(); val myObjectBroadcasted = sc.broadcast(myObject); val rdd1 = sc.textFile("/file1").map(e => { myObject.insert(e._1); (e._1,1) }); rdd.cache.count(); //to make sure it is transformed. val rdd2 = sc.textFile("/file2").map(e

Re: Broadcast Variable updated from one transformation and used from another

2015-02-24 Thread Yiannis Gkoufas
ng to modify myObjrct directly which won't work because you > are modifying the serialized copy on the executor. You want to do > myObjectBroadcasted.value.insert and myObjectBroadcasted.value.lookup. > > > > Sent with Good (www.good.com) > > > > -Original Me
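
A sketch of the corrected pattern the reply describes, with the caveat that even through .value the insert only mutates the executor-local copy, so a lookup is only guaranteed to see it within the same executor (MyObject, insert and lookup are the hypothetical names used in the thread):

    // Hypothetical sketch following the reply: always go through the broadcast handle,
    // never the original driver-side object captured in the closure.
    val myObject = new MyObject()
    val myObjectBroadcasted = sc.broadcast(myObject)

    val rdd1 = sc.textFile("/file1").map { e =>
      myObjectBroadcasted.value.insert(e)   // mutates the copy living on that executor only
      (e, 1)
    }
    rdd1.cache().count()                    // force the insertions to actually run

    val rdd2 = sc.textFile("/file2").map { e =>
      (e, myObjectBroadcasted.value.lookup(e))  // reads whatever that executor's copy holds
    }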

Re: Running out of space (when there's no shortage)

2015-02-24 Thread Yiannis Gkoufas
Hi there, I assume you are using Spark 1.2.1, right? I faced the exact same issue and switched to 1.1.1 with the same configuration and it was solved. On 24 Feb 2015 19:22, "Ted Yu" wrote: > Here is a tool which may give you some clue: > http://file-leak-detector.kohsuke.org/ > > Cheers > > On Tu

Re: Running out of space (when there's no shortage)

2015-02-24 Thread Yiannis Gkoufas
e's a bug report for this regression? For some > other (possibly connected) reason I upgraded from 1.1.1 to 1.2.1, but I > can't remember what the bug was. > > Joe > > > > > On 24 February 2015 at 19:26, Yiannis Gkoufas > wrote: > >> Hi there, >> &

Re: Broadcast Variable updated from one transformation and used from another

2015-02-25 Thread Yiannis Gkoufas
, Yiannis Gkoufas wrote: > Sorry for the mistake, I actually have it this way: > > val myObject = new MyObject(); > val myObjectBroadcasted = sc.broadcast(myObject); > > val rdd1 = sc.textFile("/file1").map(e => > { > myObjectBroadcasted.value.insert(e._1); >

Problems running version 1.3.0-rc1

2015-03-02 Thread Yiannis Gkoufas
Hi all, I have downloaded version 1.3.0-rc1 from https://github.com/apache/spark/archive/v1.3.0-rc1.zip, extracted it and built it using: mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.5.0 -DskipTests clean package It doesn't complain about any issues, but when I call sbin/start-all.sh I get in the logs:

Re: Problems running version 1.3.0-rc1

2015-03-04 Thread Yiannis Gkoufas
t really the error? > > java.lang.NoClassDefFoundError: L akka/event/LogSou > > Looks like an incomplete class name. Is something corrupt, maybe a config > file? > > On Tue, Mar 3, 2015 at 2:13 AM, Yiannis Gkoufas > wrote: > > Hi all, > > > > I have downlo

DataFrame operation on parquet: GC overhead limit exceeded

2015-03-18 Thread Yiannis Gkoufas
Hi there, I was trying the new DataFrame API with some basic operations on a Parquet dataset. I have 7 nodes with 12 cores each and 8GB RAM allocated to each worker in standalone cluster mode. The code is the following: val people = sqlContext.parquetFile("/data.parquet"); val res = people.groupBy("na
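
A sketch of the kind of job described, combined with the spark.sql.shuffle.partitions setting that is tried later in the thread; the aggregation column and the value 1000 are illustrative, since the original snippet is truncated:

    // Hypothetical sketch: more shuffle partitions mean smaller per-task aggregation
    // state, which is the knob discussed later in the thread for the GC overhead error.
    sqlContext.sql("set spark.sql.shuffle.partitions=1000")

    val people = sqlContext.parquetFile("/data.parquet")
    val res = people.groupBy("name").count()   // illustrative aggregation
    res.show()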

Re: DataFrame operation on parquet: GC overhead limit exceeded

2015-03-18 Thread Yiannis Gkoufas
//spark.apache.org/docs/latest/configuration.html > > Cheng > > > On 3/18/15 9:15 PM, Yiannis Gkoufas wrote: > >> Hi there, >> >> I was trying the new DataFrame API with some basic operations on a >> parquet dataset. >> I have 7 nodes of 12 cores and 8GB RA

Re: DataFrame operation on parquet: GC overhead limit exceeded

2015-03-18 Thread Yiannis Gkoufas
aggregation in the > second stage. Can you take a look at the numbers of tasks launched in these > two stages? > > Thanks, > > Yin > > On Wed, Mar 18, 2015 at 11:42 AM, Yiannis Gkoufas > wrote: > >> Hi there, I set the executor memory to 8g but it didn't help

Re: DataFrame operation on parquet: GC overhead limit exceeded

2015-03-19 Thread Yiannis Gkoufas
nd increase it until the OOM disappears. Hopefully this > will help. > > Thanks, > > Yin > > > On Wed, Mar 18, 2015 at 4:46 PM, Yiannis Gkoufas > wrote: > >> Hi Yin, >> >> Thanks for your feedback. I have 1700 parquet files, sized 100MB each. >> T

How to handle under-performing nodes in the cluster

2015-03-20 Thread Yiannis Gkoufas
Hi all, I have 6 nodes in the cluster and one of the nodes is clearly under-performing. I was wondering what the impact of such an issue is. Also, what is the recommended way to work around it? Thanks a lot, Yiannis

Re: DataFrame operation on parquet: GC overhead limit exceeded

2015-03-20 Thread Yiannis Gkoufas
number of tasks and again the same error. :( Thanks a lot! On 19 March 2015 at 17:40, Yiannis Gkoufas wrote: > Hi Yin, > > thanks a lot for that! Will give it a shot and let you know. > > On 19 March 2015 at 16:30, Yin Huai wrote: > >> Was the OOM thrown during the exe

Re: DataFrame operation on parquet: GC overhead limit exceeded

2015-03-20 Thread Yiannis Gkoufas
Actually I realized that the correct way is: sqlContext.sql("set spark.sql.shuffle.partitions=1000") but I am still experiencing the same behavior/error. On 20 March 2015 at 16:04, Yiannis Gkoufas wrote: > Hi Yin, > > the way I set the configuration is: >

Re: DataFrame operation on parquet: GC overhead limit exceeded

2015-03-23 Thread Yiannis Gkoufas
k applications to use on the machine, > e.g. 1000m, 2g (default: total memory minus 1 GB); note that each > application's individual memory is configured using its > spark.executor.memory property. > > > On Fri, Mar 20, 2015 at 9:25 AM, Yiannis Gkoufas > wr

Start ThriftServer Error

2015-04-22 Thread Yiannis Gkoufas
Hi all, I am trying to start the thriftserver and I get some errors. I have Hive running and placed hive-site.xml under the conf directory. From the logs I can see that the error is: "Call From localhost to localhost:54310 failed". I am assuming that it tries to connect to the wrong port for the n

Re: Start ThriftServer Error

2015-04-22 Thread Yiannis Gkoufas
t 3:52 PM, Yiannis Gkoufas > wrote: > >> Hi all, >> >> I am trying to start the thriftserver and I get some errors. >> I have hive running and placed hive-site.xml under the conf directory. >> From the logs I can see that the error is: >> >> Call Fro