Re: Running Javascript from scala spark

2015-05-26 Thread andy petrella
Yop, why not use, like you said, a JS engine like Rhino? But then I would suggest using mapPartitions instead, so there is only one engine per partition. Broadcasting the script is probably also a good thing to do. I guess it's for ad hoc transformations passed by a remote client, otherwise you could simply c
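
A minimal sketch of the mapPartitions idea above — one (expensive) engine per partition rather than per record. `Engine` is a hypothetical stand-in for a script engine like Rhino, and the Spark call is only shown in a comment so the sketch runs without a cluster:

```scala
// Stand-in for an expensive-to-create script engine (e.g. Rhino); the real
// thing would compile the broadcast JS script once.
class Engine(script: String) {
  def eval(x: Int): Int = x * 2 // pretend this runs the script on x
}

// With Spark you'd write: rdd.mapPartitions(onePerPartition) so the engine
// below is instantiated once per partition, not once per element.
def onePerPartition(partition: Iterator[Int]): Iterator[Int] = {
  val engine = new Engine("function f(x) { return x * 2 }") // built once here
  partition.map(engine.eval)
}

val out = onePerPartition(Iterator(1, 2, 3)).toList
```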

Re: [Streaming] Configure executor logging on Mesos

2015-05-30 Thread andy petrella
e spark context ourselves and relying on the Mesos scheduler backend only. Unless the spark.executor.uri (or another one) can take more than one downloadable path. my 2¢ andy On Fri, May 29, 2015 at 5:09 PM Gerard Maas wrote: > Hi Tim, > > Thanks for the info. We (Andy Petrel

Re: Machine Learning on GraphX

2015-06-18 Thread andy petrella
I guess that belief propagation could help here (at least, I find the ideas similar enough), thus this article might be a good start: http://arxiv.org/pdf/1004.1003.pdf (it's on my todo list, hence I cannot really help further ^^) On Thu, Jun 18, 2015 at 11:44 AM Timothée Rebours wrote: > Thanks

Re: Real-time data visualization with Zeppelin

2015-07-12 Thread andy petrella
Heya, You might be looking for something like this I guess: https://www.youtube.com/watch?v=kB4kRQRFAVc. The Spark-Notebook (https://github.com/andypetrella/spark-notebook/) can bring that to you actually, it uses fully reactive bilateral communication streams to update data and viz, plus it hide

Re: Apache Flink

2016-04-17 Thread andy petrella
Just adding one thing to the mix: `that the latency for streaming data is eliminated` is insane :-D On Sun, Apr 17, 2016 at 12:19 PM Mich Talebzadeh wrote: > It seems that Flink argues that the latency for streaming data is > eliminated whereas with Spark RDD there is this latency. > > I notice

[Spark 2.0.0] Structured Stream on Kafka

2016-06-14 Thread andy petrella
Heya folks, Just wondering if there are some doc regarding using kafka directly from the reader.stream? Has it been integrated already (I mean the source)? Sorry if the answer is RTFM (but then I'd appreciate a pointer anyway^^) thanks, cheers andy -- andy

Re: [Spark 2.0.0] Structured Stream on Kafka

2016-06-14 Thread andy petrella
> On Tue, Jun 14, 2016 at 9:21 AM, andy petrella > wrote: > > Heya folks, > > > > Just wondering if there are some doc regarding using kafka directly from > the > > reader.stream? > > Has it been integrated already (I mean the source)?

Re: Spark 2.0 release date

2016-06-15 Thread andy petrella
> tomorrow lunch time Which TZ :-) → I'm working on the update of some materials that Dean Wampler and I will give tomorrow at Scala Days (well, tomorrow CEST). Hence, I'm upgrading the materials on spark 2.0.0-previ

Re: Spark 2.0 release date

2016-06-15 Thread andy petrella
e in Mich's response. > > To my knowledge, there hasn't been an RC for the 2.0 release yet. > Once we have an RC, it goes through the normal voting process. > > FYI > > On Wed, Jun 15, 2016 at 7:38 AM, andy petrella > wrote: > >> > tomorrow lunch

Re: spark and plot data

2016-07-23 Thread andy petrella
Heya, Might be worth checking the spark-notebook I guess, it offers custom and reactive dynamic charts (scatter, line, bar, pie, graph, radar, parallel, pivot, …) for any kind of data from an intuitive and easy Scala API (with server side, incl. spark based, sampling if

Re: Scala from Jupyter

2016-02-16 Thread andy petrella
Hello Alex! Rajeev is right, come over the spark notebook gitter room, you'll be helped by many experienced people if you have some troubles: https://gitter.im/andypetrella/spark-notebook The spark notebook has many integrated, reactive (scala) and extendable (scala) plotting capabilities. cheer

Re: Scala from Jupyter

2016-02-16 Thread andy petrella
otebook. > > On Tue, Feb 16, 2016 at 3:37 PM, andy petrella > wrote: > >> Hello Alex! >> >> Rajeev is right, come over the spark notebook gitter room, you'll be >> helped by many experienced people if you have some troubles: >> https://gitter.im/andypetr

Re: StructType for oracle.sql.STRUCT

2015-11-28 Thread andy petrella
Warf... such a heavy task, man! I'd love to follow your work on that (I've a long XP in geospatial too); is there a repo available already for that? The hard part will be to support all descendant types I guess (lines, multilines, and so on), then creating the spatial operators. The only project

Re: Graph visualization tool for GraphX

2015-12-08 Thread andy petrella
Hello Lin, This is indeed a tough scenario when you have many vertices and (even worse) many edges... So two-fold answer: First, technically, there is graph plotting support in the spark notebook (https://github.com/andypetrella/spark-notebook/ → check this notebook: https://github.com/andypetr

Re: Re: Real-time data visualization with Zeppelin

2015-08-06 Thread andy petrella
> > At 2015-07-13 02:45:57, "andy petrella" wrote: > > Heya, > > You might be looking for something like this I guess: > https://www.youtube.com/watch?v=kB4kRQRFAVc. > > The Spark-Notebook (https://github.com/andypetrella/spark-notebook/) can > bring that to

Re: Spark is in-memory processing, how then can Tachyon make Spark faster?

2015-08-07 Thread andy petrella
Exactly! The sharing part is used in the Spark Notebook (this one) so we can share stuff between notebooks which are different SparkContexts (in diff JVMs). OTOH, we have a project that creates micro services

Re: Spark + Jupyter (IPython Notebook)

2015-08-18 Thread andy petrella
Hey, Actually, for Scala, I'd rather use https://github.com/andypetrella/spark-notebook/ It's deployed at several places like *Alibaba*, *EBI*, *Cray* and is supported by both the Scala community and the company Data Fellas. For instance, it was part of the Big Scala Pipeline training given thi

Re: Scala Vs Python

2016-09-02 Thread andy petrella
looking at the examples, indeed they make no sense :D On Fri, 2 Sep 2016 16:48 Mich Talebzadeh, wrote: > Right so. We are back into religious arguments. Best of luck > > > > Dr Mich Talebzadeh > > > > LinkedIn * > https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

Re: Using Zeppelin with Spark FP

2016-09-11 Thread andy petrella
Heya, probably worth giving the Spark Notebook a go then. It can plot any scala data (collection, rdd, df, ds, custom, ...), all are reactive so they can deal with any sort of incoming data. You can ask on the gitter

Re: Has anyone installed the scala kernel for Jupyter notebook

2016-09-22 Thread andy petrella
heya, I'd say if you wanna go the spark and scala way, please yourself and go for the Spark Notebook (check http://spark-notebook.io/ for pre-built distro or build your own) hth On Thu, Sep 22, 2016 at 12:45 AM Arif,Mubaraka wrote: > we installed

Re: Supported Notebooks (and other viz tools) for Spark 0.9.1?

2015-02-03 Thread andy petrella
Hello Adamantios, Thanks for the poke and the interest. Actually, you're the second person asking about backporting it. Yesterday (late), I created a branch for it... and the simple local spark test worked! \o/. However, it'll be the 'old' UI :-/. Since I didn't port the code using play 2.2.6 to the ne

Re: Supported Notebooks (and other viz tools) for Spark 0.9.1?

2015-02-03 Thread andy petrella
ked out in this branch and launch `sbt run`. HTH, andy On Tue Feb 03 2015 at 2:45:43 PM andy petrella wrote: > Hello Adamantios, > > Thanks for the poke and the interest. > Actually, you're the second asking about backporting it. Yesterday (late), > I created a branch for it

Re: iPython Notebook + Spark + Accumulo -- best practice?

2015-03-26 Thread andy petrella
That's purely awesome! Don't hesitate to contribute your notebook back to the spark notebook repo, even rough; I'll help clean it up if needed. The vagrant is also appealing 😆 Congrats! Le jeu 26 mars 2015 22:22, David Holiday a écrit : > w0t! that did it! t/y so much! > > I'm

Re: Spark 1.2.0 with Play/Activator

2015-04-06 Thread andy petrella
Hello Manish, you can take a look at the spark-notebook build, it's a bit tricky to get rid of some clashes but at least you can refer to this build to have ideas. LSS, I have stripped out akka from play deps. ref: https://github.com/andypetrella/spark-notebook/blob/master/build.sbt https://githu

Re: Spark 1.2.0 with Play/Activator

2015-04-07 Thread andy petrella
e.spark#spark-repl_2.10;1.2.0-cdh5.3.0: not found > > > > Any suggestions? > > > > Thanks > > > > *From:* Manish Gupta 8 [mailto:mgupt...@sapient.com] > *Sent:* Tuesday, April 07, 2015 12:04 PM > *To:* andy petrella; user@spark.apache.org > *Subject:* RE:

Re: Processing Large Images in Spark?

2015-04-07 Thread andy petrella
Heya, You might be interested in looking at GeoTrellis. They use RDDs of Tiles to process big images like Landsat ones can be (especially 8). However, I see you have only 1G per file, so I guess you only care about a single band? Or is it a reboxed pic? Note: I think the GeoTrellis image format is s

Re: Spark and accumulo

2015-04-21 Thread andy petrella
Hello Madvi, Some work has been done by @pomadchin using the spark notebook, maybe you should come on https://gitter.im/andypetrella/spark-notebook and poke him? There are some discoveries he made that might be helpful to know. Also you can poke @lossyrob from Azavea, he did that for geotrellis

Re: solr in spark

2015-04-28 Thread andy petrella
AFAIK Datastax is heavily looking at it; they have a good integration of Cassandra with it. The next step was clearly to have a strong combination of the three in one of the coming releases. Le mar. 28 avr. 2015 18:28, Jeetendra Gangele a écrit : > Does anyone tried using solr inside spark? > below is

Re: Spark and RDF

2014-06-20 Thread andy petrella
For RDF, may GraphX be particularly appropriate? aℕdy ℙetrella about.me/noootsab [image: aℕdy ℙetrella on about.me] On Thu, Jun 19, 2014 at 4:49 PM, Flavio Pompermaier wrote: > Hi guys, > > I'm analyzing the possibility to use Spark to analyze RDF files and define

Re: Spark and RDF

2014-06-20 Thread andy petrella
to SparkSQL it would be slightly hard but much better effort would > be to shift Gremlin to Spark (though a much beefier one :) ) > > Mayur Rustagi > Ph: +1 (760) 203 3257 > http://www.sigmoidanalytics.com > @mayur_rustagi <https://twitter.com/mayur_rustagi> > >

Re: Spark and RDF

2014-06-20 Thread andy petrella
> or a seperate RDD for sparql operations ala SchemaRDD .. operators for > sparql can be defined thr.. not a bad idea :) > > Mayur Rustagi > Ph: +1 (760) 203 3257 > http://www.sigmoidanalytics.com > @mayur_rustagi <https://twitter.com/mayur_rustagi> > > > > O

Re: Generic Interface between RDD and DStream

2014-07-11 Thread andy petrella
A while ago, I wrote this: ``` package com.virdata.core.compute.common.api import org.apache.spark.rdd.RDD import org.apache.spark.SparkContext import org.apache.spark.streaming.dstream.DStream import org.apache.spark.streaming.StreamingContext sealed trait SparkEnvironment extends Serializable

Re: relationship of RDD[Array[String]] to Array[Array[String]]

2014-07-21 Thread andy petrella
heya, Without a bit of gymnastics at the type level, nope. Actually RDD doesn't share any functions with the scala lib (the simple reason I could see is that Spark's ones are lazy, while the default implementations in Scala aren't). However, it'd be possible by implementing an implicit converter fro
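
The implicit-converter idea can be sketched like this; `MiniRDD` is a made-up stand-in (no Spark on the classpath here), but the same enrichment pattern applies to a real `RDD`:

```scala
// A made-up, minimal RDD-like wrapper standing in for org.apache.spark.rdd.RDD.
final case class MiniRDD[A](data: Seq[A])

// The enrichment ("pimp my library") pattern: an implicit class adds a
// Scala-collection-style method that the wrapped type doesn't define itself.
implicit class RichMiniRDD[A](rdd: MiniRDD[A]) {
  def countByValueLocal: Map[A, Int] =
    rdd.data.groupBy(identity).map { case (k, vs) => k -> vs.size }
}

// The method now appears to live on MiniRDD, thanks to the implicit conversion.
val counts = MiniRDD(Seq("a", "b", "a")).countByValueLocal
```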

Re: Debugging "Task not serializable"

2014-07-28 Thread andy petrella
Also check the guides for the JVM option that prints messages for such problems. Sorry, sent from phone and don't know it by heart :/ Le 28 juil. 2014 18:44, "Akhil Das" a écrit : > A quick fix would be to implement java.io.Serializable in those classes > which are causing this exception. > > > >
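
The JVM option he could not recall from his phone is most likely `-Dsun.io.serialization.extendedDebugInfo=true` (an assumption on my part), which makes `NotSerializableException` traces show the chain of fields leading to the offending object. The app class and jar names below are placeholders:

```shell
# Hypothetical spark-submit invocation; only the -D flag is the point here.
spark-submit \
  --driver-java-options "-Dsun.io.serialization.extendedDebugInfo=true" \
  --conf "spark.executor.extraJavaOptions=-Dsun.io.serialization.extendedDebugInfo=true" \
  --class my.example.App my-app.jar
```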

Re: [GraphX] How to access a vertex via vertexId?

2014-07-29 Thread andy petrella
'lookup' on RDD (pair) maybe? Le 29 juil. 2014 12:04, "Yifan LI" a écrit : > Hi Bin, > > Maybe you could get the vertex, for instance, which id is 80, by using: > > *graph.vertices.filter{case(id, _) => id==80}.collect* > > but I am not sure this is the exactly efficient way.(it will scan the > w
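
The `lookup` suggestion, sketched on a plain `Seq` of pairs (a stand-in, since this runs without Spark); on a real pair RDD you would call `graph.vertices.lookup(80L)`:

```scala
// Vertices as (id, payload) pairs, as in graph.vertices.
val vertices: Seq[(Long, String)] = Seq(42L -> "a", 80L -> "b", 99L -> "c")

// What PairRDDFunctions.lookup does, locally: all values for one key.
def lookup[K, V](pairs: Seq[(K, V)], key: K): Seq[V] =
  pairs.collect { case (k, v) if k == key => v }

val res = lookup(vertices, 80L)
```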

Re: Spark streaming vs. spark usage

2014-07-29 Thread andy petrella
Yep, But RDD/DStream would hardly fit the Monad contract (discussed several times, and still under discussion here and there ;)) For instance, look at the signature of flatMap in both traits. Albeit, an RDD that can generate other RDDs (flatMap) is rather something like a DStream or 'CRDD

Re: iScala or Scala-notebook

2014-07-29 Thread andy petrella
Some people started some work on that topic using the notebook (the original or the n8han one, cannot remember)... Some issues have been created already ^^ Le 29 juil. 2014 19:59, "Nick Pentreath" a écrit : > IScala itself seems to be a bit dead unfortunately. > > I did come across this today: htt

Re: [GraphX] How to access a vertex via vertexId?

2014-07-29 Thread andy petrella
Oh I was almost sure that lookup was optimized using the partition info Le 29 juil. 2014 21:25, "Ankur Dave" a écrit : > Yifan LI writes: > > Maybe you could get the vertex, for instance, which id is 80, by using: > > > > graph.vertices.filter{case(id, _) => id==80}.collect > > > > but I am not

Re: [GraphX] How to access a vertex via vertexId?

2014-07-29 Thread andy petrella
👍thx! Le 29 juil. 2014 22:09, "Ankur Dave" a écrit : > andy petrella writes: > > Oh I was almost sure that lookup was optimized using the partition info > > It does use the partitioner to run only one task, but within that task it > has to scan the entire parti

Re: creating a distributed index

2014-08-01 Thread andy petrella
Hey, There is some work that started on IndexedRDD (on master I think). Meanwhile, checking what has been done in GraphX regarding vertex index in partitions could be worthwhile I guess Hth Andy Le 1 août 2014 22:50, "Philip Ogren" a écrit : > > Suppose I want to take my large text data input and

Re: New sbt plugin to deploy jobs to EC2

2014-09-05 Thread andy petrella
\o/ => will test it soon or sooner, gr8 idea btw aℕdy ℙetrella about.me/noootsab On Fri, Sep 5, 2014 at 12:37 PM, Felix Garcia Borrego wrote: > As far as I know, in order to deploy and execute jobs in EC2 you need to > assembly you

Re: Spark & NLP

2014-09-10 Thread andy petrella
never tried it but it might fit your need: http://www.scalanlp.org/ It's the parent project of both breeze (already part of spark) and epic. However you'll have to train it for IT (not part of the supported list) (actually I never used it because for my very small needs, I generally just perform a small n

Re: Serving data

2014-09-13 Thread andy petrella
however, the cache is not guaranteed to remain: if other jobs are launched in the cluster and require more memory than what's left in the overall caching memory, previous RDDs will be discarded. Using an off-heap cache like Tachyon as a dump repo can help. In general, I'd say that using a persist

Re: Serving data

2014-09-15 Thread andy petrella
t quick enough I’ll go > the usual route with either read-only or normal database. > > On 13.09.2014, at 12:45, andy petrella wrote: > > however, the cache is not guaranteed to remain, if other jobs are launched > in the cluster and require more memory than what's left in t

Re: Serving data

2014-09-15 Thread andy petrella
15.09.2014, at 13:50, andy petrella wrote: > > I'm using Parquet in ADAM, and I can say that it works pretty fine! > Enjoy ;-) > > aℕdy ℙetrella > about.me/noootsab > [image: aℕdy ℙetrella on about.me] > > <http://about.me/noootsab> > > On Mon, Sep 15, 2

Re: Example of Geoprocessing with Spark

2014-09-20 Thread andy petrella
It's probably slow as you say because it's actually also doing the map phase that will do the RTree search and so on, and only then saving to hdfs on 60 partitions. If you want to see the time spent saving to hdfs, you could do a count for instance before saving. Also saving from 60 partition

Re: REPL like interface for Spark

2014-09-29 Thread andy petrella
Heya, I started to port the scala-notebook to Spark some weeks ago (but doing it in my sparse time and for my Spark talks ^^). It's a WIP but works quite fine ftm, you can check my fork and branch over here: https://github.com/andypetrella/scala-notebook/tree/spark Feel free to ask any questions,

Re: REPL like interface for Spark

2014-09-29 Thread andy petrella
estions, request > features to the mailing list. > > Thanks. > > - moon > > On Mon, Sep 29, 2014 at 5:27 PM, andy petrella > wrote: > >> Heya, >> >> I started to port the scala-notebook to Spark some weeks ago (but doing >> it in my spa

Re: REPL like interface for Spark

2014-09-29 Thread andy petrella
However (I must say ^^) that it's funny that it has been built using usual plain old Java stuff :-D. aℕdy ℙetrella about.me/noootsab <http://about.me/noootsab> On Mon, Sep 29, 2014 at 10:51 AM, andy petrella wrote: > Cool!!! I'll

Re: GraphX: Types for the Nodes and Edges

2014-10-01 Thread andy petrella
I'll try my best ;-). 1/ you could create an abstract type for the types (1 on top of Vs, 1 other on top of Es types) then use the subclasses as payload in your VertexRDD or in your Edge. Regarding storage and files, it doesn't really matter (unless you want to use the OOTB loading method, thus you

Re: Interactive interface tool for spark

2014-10-08 Thread andy petrella
Heya You can check Zeppelin or my fork of the Scala notebook. I'm going this weekend to push some effort on the doc, because it supports realtime graphing, Scala, SQL, dynamic loading of dependencies, and I started this morning a widget to track the progress of the jobs. I'm quite happy with

Re: Interactive interface tool for spark

2014-10-08 Thread andy petrella
s awesome. Please keep us posted. Meanwhile, can you share a > link to your project? I wasn't able to find it. > > Cheers, > > Michael > > On Oct 8, 2014, at 3:38 AM, andy petrella wrote: > > Heya > > You can check Zeppellin or my fork of the Scala note

Re: Interactive interface tool for spark

2014-10-09 Thread andy petrella
raries in > the Scala Notebook? If not, what is suggested approaches for visualization > there? Thanks. > > Kelvin > > On Wed, Oct 8, 2014 at 9:14 AM, andy petrella > wrote: > >> Sure! I'll post updates as well in the ML :-) >> I'm doing it on twitter

Re: Interactive interface tool for spark

2014-10-12 Thread andy petrella
; On Wed, Oct 8, 2014 at 4:57 PM, Michael Allman wrote: > Hi Andy, > > This sounds awesome. Please keep us posted. Meanwhile, can you share a > link to your project? I wasn't able to find it. > > Cheers, > > Michael > > On Oct 8, 2014, at 3:38 AM, andy petrella

Re: Interactive interface tool for spark

2014-10-12 Thread andy petrella
about Hue http://gethue.com ? > > On Sun, Oct 12, 2014 at 1:26 PM, andy petrella > wrote: > >> Dear Sparkers, >> >> As promised, I've just updated the repo with a new name (for the sake of >> clarity), default branch but specially with a dedicated README con

Re: Scala Spark IDE help

2014-10-28 Thread andy petrella
Also, I'm following two master students at the University of Liège (one computing prob conditional density on massive data and the other implementing a Markov Chain method on georasters). I proposed them to use the Spark-Notebook to learn the framework; they're quite happy with it (so far at lea

Re: Getting spark job progress programmatically

2014-11-18 Thread andy petrella
I started some quick hack for that in the notebook, you can head to: https://github.com/andypetrella/spark-notebook/blob/master/common/src/main/scala/notebook/front/widgets/SparkInfo.scala On Tue Nov 18 2014 at 2:44:48 PM Aniket Bhatnagar < aniket.bhatna...@gmail.com> wrote: > I am writing yet an

Re: Getting spark job progress programmatically

2014-11-18 Thread andy petrella
ersion JobProgressListener that stores stageId to > group Id mapping. > > I will submit a JIRA ticket and seek spark dev's opinion on this. Many > thanks for your prompt help Andy. > > Thanks, > Aniket > > > On Tue Nov 18 2014 at 19:40:06 andy petrella > wro

Re: Getting spark job progress programmatically

2014-11-20 Thread andy petrella
>> experiences (& hacks) and submit them as a feature request for public API. >>> >>> On Tue Nov 18 2014 at 20:35:00 andy petrella >>> wrote: >>>> yep, we should also propose to add this stuffs in the public API. >>>>

[GraphX] Mining GeoData (OSM)

2014-11-20 Thread andy petrella
Guys, After talking with Ankur, it turned out that sharing the talk we gave at ScalaIO (France) would be worthy. So there you go, and don't hesitate to share your thoughts ;-) http://www.slideshare.net/noootsab/machine-learning-and-graphx Greetz, andy

Re: Using TF-IDF from MLlib

2014-11-20 Thread andy petrella
/Someone will correct me if I'm wrong./ Actually, TF-IDF scores terms for a given document, and specifically TF. Internally, these things are holding a Vector (hopefully sparse) representing all the possible words (up to 2²⁰) per document. So each document, after applying TF, will be transformed in

Re: Spark Streaming Metrics

2014-11-21 Thread andy petrella
Yo, I've discussed with some guys from cloudera that are working (only oO) on spark-core and streaming. The streaming one was telling me the same thing about the scheduling part. Do you have some nice screenshots and info about stages running, task time, akka health and things like these -- I said th

Re: Parsing a large XML file using Spark

2014-11-21 Thread andy petrella
Actually, it's a real On Tue Nov 18 2014 at 2:52:00 AM Tobias Pfeiffer wrote: > Hi, > > see https://www.mail-archive.com/dev@spark.apache.org/msg03520.html for > one solution. > > One issue with those XML files is that they cannot be processed line by > line in parallel; plus you inherently need

Re: Parsing a large XML file using Spark

2014-11-21 Thread andy petrella
(sorry about the previous spam... google inbox didn't allow me to cancel the miserable send action :-/) So what I was about to say: it's a real PAIN in the ass to parse the wikipedia articles in the dump due to these multiline articles... However, there is a way to manage that "quite" easily, a

Re: Using TF-IDF from MLlib

2014-11-21 Thread andy petrella
Yeah, I initially used zip but I was wondering how reliable it is. I mean, is the order guaranteed? What if some node fails, and the data is pulled from different nodes? And even if it can work, I found this implicit semantic quite uncomfortable, don't you? My 0.2c Le ven 21 nov. 2014 15:26,

Re: Is Spark the right tool for me?

2014-12-01 Thread andy petrella
Not quite sure which geo processing you're doing: is it raster, vector? More info would be appreciated for me to help you further. Meanwhile I can try to give some hints; for instance, did you consider GeoMesa? Since you need a WMS (or alike), did you

Re: Is Spark the right tool for me?

2014-12-01 Thread andy petrella
add the database > location as map source once the task starts. > > Cheers > Ben > > [1] http://www.deep-map.com > > Von: andy petrella > Datum: Montag, 1. Dezember 2014 15:07 > An: Benjamin Stadin , " > user@spark.apache.org" > Betreff: Re: Is Spark the

Re: Is Spark the right tool for me?

2014-12-01 Thread andy petrella
not applicable to your problem, but interesting enough to share on this thread: http://basepub.dauphine.fr/bitstream/handle/123456789/5260/SD-Rtree.PDF?sequence=2 On Mon Dec 01 2014 at 3:48:14 PM andy petrella wrote: > Indeed. However, I guess the important load and stress is in

Re: Is Spark the right tool for me?

2014-12-02 Thread andy petrella
gn all of the above as a workflow, runnable (or assignable) as >part of a user session. => Oozie > > Do you think this is ok? > > ~Ben > > > Von: andy petrella > Datum: Montag, 1. Dezember 2014 15:48 > > An: Benjamin Stadin , " > user@spark.apache.org

Re: Is Spark the right tool for me?

2014-12-02 Thread andy petrella
). After all parts have been preprocessed and > gathered in one place, the initial creation of the preview geo file is > taking a fraction of the time (inserting all data in one transaction, > taking somewhere between sub-second and < 10 seconds for very large > projects). It’s currentl

Re: Implementing a spark version of Haskell's partition

2014-12-17 Thread andy petrella
yo, First, here is the scala version: http://www.scala-lang.org/api/current/index.html#scala.collection.Seq@partition(p:A=>Boolean):(Repr,Repr) Second: RDD is distributed, so what you'll have to do is to partition each partition each partition (:-D) or create two RDDs by filtering twice → he
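
Both halves of the answer, sketched; the RDD variant is shown on a plain `Seq` stand-in since no cluster is assumed here:

```scala
// 1) Scala's eager, single-pass partition on a collection:
val (evens, odds) = (1 to 10).partition(_ % 2 == 0)

// 2) The RDD-style trick: there is no single-pass partition on RDDs, so
//    filter twice. On Spark: (rdd.filter(p), rdd.filter(x => !p(x))),
//    i.e. two lazy, distributed passes over the data.
def partitionByTwoFilters[A](data: Seq[A])(p: A => Boolean): (Seq[A], Seq[A]) =
  (data.filter(p), data.filterNot(p))

val (big, small) = partitionByTwoFilters(Seq(1, 2, 3, 4))(_ > 2)
```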

Re: Implementing a spark version of Haskell's partition

2014-12-18 Thread andy petrella
nstead of partitions of a single > RDD, so I could use RDD methods like stats() on the fragments. But I think > there is currently no RDD method that returns more than one RDD for a > single input RDD, so maybe there is some design limitation on Spark that > prevents this? > > Again,

Re: spark-repl_1.2.0 was not uploaded to central maven repository.

2014-12-21 Thread andy petrella
Actually yes, things like interactive notebooks f.i. On Sun Dec 21 2014 at 11:35:18 AM Sean Owen wrote: > I'm only speculating, but I wonder if it was on purpose? would people > ever build an app against the REPL? > > On Sun, Dec 21, 2014 at 5:50 AM, Peng Cheng wrote: > > Everything else is the

Re: Discourse: A proposed alternative to the Spark User list

2014-12-25 Thread andy petrella
Nice idea, although it needs a plan on their hosting, or spark to host it if I'm not wrong. I've been using Slack for discussions, it's not exactly the same of discourse, the ML or SO but offers interesting features. It's more in the mood of IRC integrated with external services. my2c On Wed Dec

Re: Using TF-IDF from MLlib

2014-12-29 Thread andy petrella
Here is what I did for this case : https://github.com/andypetrella/tf-idf Le lun 29 déc. 2014 11:31, Sean Owen a écrit : > Given (label, terms) you can just transform the values to a TF vector, > then TF-IDF vector, with HashingTF and IDF / IDFModel. Then you can > make a LabeledPoint from (labe
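
The recipe Sean describes boils down to simple arithmetic; here is a toy, pure-Scala version of the scoring (not MLlib's `HashingTF`/`IDF` API — just the math, with a smoothed `log((N+1)/(df+1))` idf as an assumption):

```scala
import scala.math.log

val docs = Seq(
  Seq("spark", "rdd", "spark"),
  Seq("spark", "sql"),
  Seq("graphx")
)

// term frequency: raw counts per document
def tf(doc: Seq[String]): Map[String, Double] =
  doc.groupBy(identity).map { case (t, ts) => t -> ts.size.toDouble }

// inverse document frequency, smoothed
def idf(corpus: Seq[Seq[String]]): Map[String, Double] = {
  val n = corpus.size.toDouble
  corpus.flatMap(_.distinct).groupBy(identity).map {
    case (t, occurrences) => t -> log((n + 1) / (occurrences.size + 1))
  }
}

def tfIdf(doc: Seq[String], idfs: Map[String, Double]): Map[String, Double] =
  tf(doc).map { case (t, f) => t -> f * idfs.getOrElse(t, 0.0) }

// "spark" appears twice in doc 0 and in 2 of the 3 docs
val scores = tfIdf(docs.head, idf(docs))
```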

[re-cont] map and flatMap

2014-03-12 Thread andy petrella
Folks, I want just to point something out... I didn't have time yet to sort it out and to think enough to give a valuable strict explanation -- even though, intuitively, I feel they are a lot ===> need spark people or time to move forward. But here is the thing regarding *flatMap*. Actually, it lo

Re: [re-cont] map and flatMap

2014-03-15 Thread andy petrella
oitot Dev < pascal.voitot@gmail.com> wrote: > On Wed, Mar 12, 2014 at 3:06 PM, andy petrella >wrote: > > > Folks, > > > > I want just to pint something out... > > I didn't had time yet to sort it out and to think enough to give valuable > > st

Re: Relation between DStream and RDDs

2014-03-20 Thread andy petrella
also consider creating pairs and use *byKey* operators, and then the key will be the structure that will be used to consolidate or deduplicate your data my2c On Thu, Mar 20, 2014 at 11:50 AM, Pascal Voitot Dev < pascal.voitot@gmail.com> wrote: > Actually it's quite simple... > > DStream[T] i
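
The *byKey* consolidation idea in miniature — key each record, then keep one value per key. `Event` is a made-up record type; on a real stream this would be something like `dstream.map(e => e.id -> e).reduceByKey((l, _) => l)`:

```scala
// Hypothetical record type whose id is the deduplication key.
final case class Event(id: Long, payload: String)

val events = Seq(Event(1L, "a"), Event(2L, "b"), Event(1L, "a"))

// Pair by the key, then reduce each group to a single record (here: groupBy
// stands in for Spark's reduceByKey, which does the same per-key collapse).
val deduped: Set[Event] =
  events.map(e => e.id -> e)
        .groupBy(_._1)
        .map { case (_, group) => group.head._2 }
        .toSet
```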

Re: Relation between DStream and RDDs

2014-03-20 Thread andy petrella
in the StreamingContext to run a batch job on the data. But I'm only thinking out loud, so I might be completely wrong. hth Andy Petrella Belgium (Liège) * * Data Engineer in *NextLab <http://nextlab.be/> sprl* (owner) Engaged Citizen Coder for *WAJ

Re: Announcing Spark SQL

2014-03-27 Thread andy petrella
I hijack the thread, but my2c is that this feature is also important to enable ad-hoc queries which are done at runtime. It doesn't remove interest in such macros for precompiled jobs of course, but it may not be the first use case envisioned with this Spark SQL. Again, only my 0.2c (ok I divided b

Re: Announcing Spark SQL

2014-03-27 Thread andy petrella
< pascal.voitot@gmail.com> wrote: > > Le 27 mars 2014 09:47, "andy petrella" a écrit : > > > > > I hijack the thread, but my2c is that this feature is also important to > enable ad-hoc queries which is done at runtime. It doesn't remove interests > for s

Re: Announcing Spark SQL

2014-03-27 Thread andy petrella
nope (what I said :-P) On Thu, Mar 27, 2014 at 11:05 AM, Pascal Voitot Dev < pascal.voitot@gmail.com> wrote: > > > > On Thu, Mar 27, 2014 at 10:22 AM, andy petrella > wrote: > >> I just mean queries sent at runtime ^^, like for any RDBMS. >> In our proj

Re: Announcing Spark SQL

2014-03-27 Thread andy petrella
> > > Original message > From: andy petrella > Date: 03/27/2014 6:08 AM (GMT-05:00) > To: user@spark.apache.org > Subject: Re: Announcing Spark SQL > > nope (what I said :-P) > > > On Thu, Mar 27, 2014 at 11:05 AM, Pascal Voitot Dev < > pascal.

Re: Calling Spahk enthusiasts in Boston

2014-04-02 Thread andy petrella
Count me in! On Wed, Apr 2, 2014 at 10:32 AM, Pierre Borckmans < pierre.borckm...@realimpactanalytics.com> wrote: > There’s at least one in Johannesburg, I confirm that ;) > > We are actually both in Brussels(Belgium) and Jo’burg… > > We would love to host a meetup in Brussels, but we don’t know

Re: ActorNotFound problem for mesos driver

2014-04-02 Thread andy petrella
Heya, Yep this is a problem in the Mesos scheduler implementation that has been fixed after 0.9.0 (https://spark-project.atlassian.net/browse/SPARK-1052 => MesosSchedulerBackend). So several options, like applying the patch or upgrading to 0.9.1 :-/ Cheers, Andy On Wed, Apr 2, 2014 at 5:30 PM, Le

Re: ActorNotFound problem for mesos driver

2014-04-02 Thread andy petrella
np ;-) On Wed, Apr 2, 2014 at 5:50 PM, Leon Zhang wrote: > Aha, thank you for your kind reply. > > Upgrading to 0.9.1 is a good choice. :) > > > On Wed, Apr 2, 2014 at 11:35 PM, andy petrella wrote: > >> Heya, >> >> Yep this is a problem in the Mesos

Re: Need suggestions

2014-04-02 Thread andy petrella
TL;DR Your classes are missing on the workers; pass the jar containing the "class" main.scala.Utils to the SparkContext. Longer: I miss some information, like how the SparkContext is configured, but my best guess is that you didn't provide the jars (addJars on SparkConf or use the SC's constructor
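
A hedged sketch of "passing the jar to the SparkContext"; the master URL, app name, and jar path are placeholders, and this is a configuration fragment that naturally needs Spark on the classpath:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Ship the application jar to the workers so they can load main.scala.Utils.
val conf = new SparkConf()
  .setAppName("my-app")                                     // placeholder
  .setMaster("spark://master:7077")                         // placeholder
  .setJars(Seq("/abs/path/to/simple-project_2.10-1.0.jar")) // absolute path is safest

val sc = new SparkContext(conf)
```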

Re: Need suggestions

2014-04-02 Thread andy petrella
I cannot access your repo, however my gut feeling is that *"target/scala-2.10/simple-project_2.10-1.0.jar"* is not enough (say, that your cwd is not the folder containing *target*). I'd say that it'd be easier to put an absolute path... My2c - On Wed, Apr 2, 2014 at 11:07 PM, yh18190 wrote

Re: Need suggestions

2014-04-02 Thread andy petrella
Sorry I was not clear perhaps, anyway, could you try with the path in the *List* to be the absolute one; e.g. List("/home/yh/src/pj/spark-stuffs/target/scala-2.10/simple-project_2.10-1.0.jar") In order to provide a relative path, you need first to figure out your CWD, so you can do (to be really s

Re: Error when run Spark on mesos

2014-04-02 Thread andy petrella
Hello, It's indeed due to a known bug, but using another IP for the driver won't be enough (other problems will pop up). An easy solution would be to switch to 0.9.1 ... see http://apache-spark-user-list.1001560.n3.nabble.com/ActorNotFound-problem-for-mesos-driver-td3636.html Hth Andy Le 3 avr. 20

Re: what does SPARK_EXECUTOR_URI in spark-env.sh do ?

2014-04-03 Thread andy petrella
tasks set (from the driver). So this env var is used by these guys, however we prefer to set "spark.executor.uri" in our SparkConf so that it's very specific to the job and easier to control. hth Andy Petrella Belgium (Liège) * * Data Engineer in *NextLab <

Re: How to use addJar for adding external jars in spark-0.9?

2014-04-03 Thread andy petrella
Could you try by building an "uber jar" with all deps... using mvn shade or sbt assembly (or whatever that can do that ^^). I think it'll be easier as a first step than trying to add deps independently (and reduce the "number" of network traffic) And

Re: what does SPARK_EXECUTOR_URI in spark-env.sh do ?

2014-04-03 Thread andy petrella
Indeed, it's how mesos works actually. So the tarball just has to be somewhere accessible by the mesos slaves. That's why it is often put in hdfs. Le 3 avr. 2014 18:46, "felix" a écrit : > So, if I set this parameter, there is no need to copy the spark tarball to > every mesos slave nodes? am I

Re: ETL for postgres to hadoop

2014-04-08 Thread andy petrella
ing such import tool for Oracle Spatial cartridge would be great as well :-P. my2c, Andy Petrella Belgium (Liège) * * Data Engineer in *NextLab <http://nextlab.be/> sprl* (owner) Engaged Citizen Coder for *WAJUG <http://wajug.be/>* (co-founder) Author of *

Re: Spark 0.9.1 core dumps on Mesos 0.18.0

2014-04-17 Thread andy petrella
"com.google.protobuf" % "protobuf-java" % "2.5.0" That's only my 0.02c in order to try to move things forward until someone from the Spark team steps in with more clear and strong advice... kr, Andy Petrella Belgium (Liège) * * Data Eng

Re: Spark 0.9.1 core dumps on Mesos 0.18.0

2014-04-17 Thread andy petrella
No of course, but I was guessing some native libs imported (to communicate with Mesos) in the project that... could miserably crash the JVM. Anyway, so you tell us that using this Oracle version you don't have any issues when using Spark on Mesos 0.18.0; that's interesting 'cause AFAIR, my last t

Re: Spark 0.9.1 core dumps on Mesos 0.18.0

2014-04-17 Thread andy petrella
; # If you would like to submit a bug report, please visit: > > # http://bugreport.sun.com/bugreport/crash.jsp > > > Steve > > > -- > *From:* andy petrella [andy.petre...@gmail.com] > *Sent:* Thursday, April 17, 2014 3:21 PM > *To:* user@sp

Re: Use SparkListener to get overall progress of an action

2014-05-22 Thread andy petrella
SparkListener offers good stuffs. But I also completed it with another metrics stuffs on my own that use Akka to aggregate metrics from anywhere I'd like to collect them (without any deps on ganglia yet on Codahale). However, this was useful to gather some custom metrics (from within the tasks then

Re: Use SparkListener to get overall progress of an action

2014-05-22 Thread andy petrella
<http://user/SendEmail.jtp?type=node&node=6259&i=0> > > *FR *+32 485 91 87 31 *| **Skype* pierre.borckmans > > > > > > > On 22 May 2014, at 16:58, andy petrella [via Apache Spark User List] <[hidden email] <http://user/SendEmail.jtp?type=node&node=62