Tuple join

2015-04-17 Thread Flavio Pompermaier
Hi to all, I have 2 RDDs, D1 and D2, like: D1: (A,p1,a1) (A,p2,a2) (A,p3,X) (B,p3,Y) (B,p1,b1); D2: (X,s,V) (X,r,2) (Y,j,k). I'd like to have a single RDD D3 (Tuple4) like (A,X,a1,a2) (B,Y,b1,null), basically joining where D1.f2 == D2.f0 and filling missing values with null. Is that possible, and how? Could you show me a simple snippet? Thanks in advance
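A minimal pure-Python sketch of the grouping logic being asked for, using the data from the post (assuming "a" was meant to be "a1"). In Spark this would typically be expressed with `keyBy` plus `leftOuterJoin` or `cogroup`; this sketch only illustrates the semantics:

```python
from collections import defaultdict

# Hypothetical data taken from the example in the post.
d1 = [("A", "p1", "a1"), ("A", "p2", "a2"), ("A", "p3", "X"),
      ("B", "p3", "Y"), ("B", "p1", "b1")]
d2 = [("X", "s", "V"), ("X", "r", "2"), ("Y", "j", "k")]

d2_keys = {row[0] for row in d2}

# Group D1 by its first field; the f2 value that matches a D2 key becomes
# the second tuple element, the remaining f2 values fill the rest.
groups = defaultdict(lambda: {"link": None, "values": []})
for f0, _, f2 in d1:
    if f2 in d2_keys:
        groups[f0]["link"] = f2
    else:
        groups[f0]["values"].append(f2)

d3 = []
for key in sorted(groups):
    g = groups[key]
    vals = (g["values"] + [None, None])[:2]   # pad to Tuple4 with null/None
    d3.append((key, g["link"], vals[0], vals[1]))
# d3 == [("A", "X", "a1", "a2"), ("B", "Y", "b1", None)]
```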

Re: Batch of updates

2014-10-28 Thread Flavio Pompermaier
/21185092/apache-spark-map-vs-mappartitions> > and here > <http://bzhangusc.wordpress.com/2014/06/19/optimize-map-performamce-with-mappartitions/>). > So for simpler code, you can go with map, and for efficiency, you can go > with mapPartitions. > > Regards, > Kamal >

Batch of updates

2014-10-27 Thread Flavio Pompermaier
Hi to all, I'm trying to convert my old MapReduce job to a Spark one but I have some doubts. My application basically buffers a batch of updates and every 100 elements it flushes the batch to a server. This is very easy in MapReduce but I don't know how you can do that in Scala. For example, if
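A sketch of the buffering pattern in plain Python (`flush_batch` is a hypothetical stand-in for the server call). In Spark, the body of `process_partition` below is what you would pass to `rdd.foreachPartition` so each partition flushes its own batches:

```python
BATCH_SIZE = 100
sent_batches = []

def flush_batch(batch):
    # Stand-in for the call that pushes a batch of updates to the server.
    sent_batches.append(list(batch))

def process_partition(records):
    buffer = []
    for record in records:
        buffer.append(record)
        if len(buffer) >= BATCH_SIZE:
            flush_batch(buffer)
            buffer.clear()
    if buffer:                 # don't lose the trailing partial batch
        flush_batch(buffer)

process_partition(range(250))
# sent_batches now holds batches of 100, 100 and 50 elements
```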

Re: Dedup

2014-10-08 Thread Flavio Pompermaier
Maybe you could implement something like this (I don't know if something similar already exists in Spark): http://www.cs.berkeley.edu/~jnwang/papers/icde14_massjoin.pdf Best, Flavio On Oct 8, 2014 9:58 PM, "Nicholas Chammas" wrote: > Multiple values may be different, yet still be considered dup
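A toy sketch of the idea being discussed. This is not the MassJoin algorithm from the linked paper, just naive pairwise similarity dedup to illustrate what "different, yet still considered duplicate" means:

```python
from difflib import SequenceMatcher

def dedup(items, threshold=0.8):
    """Keep an item only if it is not similar to one already kept."""
    kept = []
    for item in items:
        if not any(SequenceMatcher(None, item, k).ratio() >= threshold
                   for k in kept):
            kept.append(item)
    return kept

result = dedup(["apache spark", "apache  spark", "hadoop"])
# result == ["apache spark", "hadoop"]
```

The quadratic pairwise comparison is exactly what the MassJoin-style similarity joins avoid at scale.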

RE: Does HiveContext support Parquet?

2014-08-16 Thread Flavio Pompermaier
Hi to all, sorry for not being fully on topic, but I have 2 quick questions about Parquet tables registered in Hive/Spark: 1) where are the created tables stored? 2) If I have multiple HiveContexts (one per application) using the same Parquet table, is there any problem if inserting concurrently fr

Re: Save an RDD to a SQL Database

2014-08-07 Thread Flavio Pompermaier
Isn't Sqoop export meant for that? http://hadooped.blogspot.it/2013/06/apache-sqoop-part-3-data-transfer.html?m=1 On Aug 7, 2014 7:59 PM, "Nicholas Chammas" wrote: > Vida, > > What kind of database are you trying to write to? > > For example, I found that for loading into Redshift, by far the ea

Streaming on different store types

2014-07-30 Thread Flavio Pompermaier
Hi everybody, I have a scenario where I would like to stream data to different persistence types (i.e. SQL DB, graph DB, HDFS, etc.) and perform some filtering and transformation as the data comes in. The problem is to maintain consistency between all datastores (maybe some operation could fail)

Shark vs Impala

2014-06-22 Thread Flavio Pompermaier
Hi folks, I was looking at the benchmark provided by Cloudera at http://blog.cloudera.com/blog/2014/05/new-sql-choices-in-the-apache-hadoop-ecosystem-why-impala-continues-to-lead/ . Is it true that Shark cannot execute some queries if you don't have enough memory? And is it true/reliable that Impala

Spark and RDF

2014-06-19 Thread Flavio Pompermaier
Hi guys, I'm analyzing the possibility to use Spark to analyze RDF files and define reusable Shark operators on them (custom filtering, transforming, aggregating, etc). Is that possible? Any hint? Best, Flavio

Re: Spark streaming and rate limit

2014-06-19 Thread Flavio Pompermaier
/src/main/scala/org/apache/spark/streaming/kafka/KafkaInputDStream.scala> > to > see how they manage it using several worker threads. > > My suggestion would be to knock-up a basic custom receiver and give it a > shot! > > MC > > > On 19 June 2014 09:31, Flavio Pompe

Re: Spark streaming and rate limit

2014-06-19 Thread Flavio Pompermaier

Re: Spark streaming and rate limit

2014-06-18 Thread Flavio Pompermaier
of event and only send events to Spark streams when it's > ready to process more messages. > > Hope this helps. > > -Soumya > > > > > On Wed, Jun 18, 2014 at 6:50 PM, Flavio Pompermaier > wrote: > >> Thanks for the quick reply soumya. Un

Re: Spark streaming and rate limit

2014-06-18 Thread Flavio Pompermaier
This component can control the input rate to Spark. > > > On Jun 18, 2014, at 6:13 PM, Flavio Pompermaier > wrote: > > > > Hi to all, > > in my use case I'd like to receive events and call an external service > as they pass through. Is it possible to limit the

Spark streaming and rate limit

2014-06-18 Thread Flavio Pompermaier
Hi to all, in my use case I'd like to receive events and call an external service as they pass through. Is it possible to limit the number of concurrent calls to that service (to avoid DoS) using Spark Streaming? If so, limiting the rate implies possible buffer growth... how can I control the
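For the record, later 1.x releases of Spark Streaming gained a receiver-side knob for this (`spark.streaming.receiver.maxRate`). The per-call throttling asked about here can be sketched as a token bucket; this is not Spark API, just the idea, with an injectable clock so the behaviour is deterministic:

```python
import time

class TokenBucket:
    """Allow at most `rate` calls per second, with bursts up to `capacity`."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.clock = clock
        self.last = clock()

    def allow(self):
        now = self.clock()
        # Refill tokens in proportion to the elapsed time.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

fake_time = [0.0]
bucket = TokenBucket(rate=5, capacity=5, clock=lambda: fake_time[0])
burst = sum(bucket.allow() for _ in range(20))       # burst == 5
fake_time[0] = 1.0                                   # one second later
refilled = sum(bucket.allow() for _ in range(20))    # refilled == 5
```

Calls rejected by `allow()` are exactly the buffer-growth problem the post mentions: they must be queued, dropped, or pushed back to the source.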

Re: Using Spark to analyze complex JSON

2014-05-22 Thread Flavio Pompermaier
Is there a way to query fields by similarity (like Lucene or using a similarity metric) to be able to query something like WHERE language LIKE "it~0.5" ? Best, Flavio On Thu, May 22, 2014 at 8:56 AM, Michael Cutler wrote: > Hi Nick, > > Here is an illustrated example which extracts certain fie
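Spark SQL has no such fuzzy LIKE operator as far as this thread establishes. A plain-Python sketch of what a predicate like `it~0.5` could mean, using the stdlib `difflib` (note Lucene's `~` is edit-distance based, so `SequenceMatcher.ratio` is only a rough analogue):

```python
from difflib import SequenceMatcher

def fuzzy_like(value, pattern, threshold):
    """Rough analogue of a Lucene-style term~threshold fuzzy match."""
    return SequenceMatcher(None, value, pattern).ratio() >= threshold

rows = [{"language": "it"}, {"language": "ita"}, {"language": "en"}]
matches = [r for r in rows if fuzzy_like(r["language"], "it", 0.5)]
# matches the "it" and "ita" rows but not "en"
```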

Spark and Solr indexing

2014-05-17 Thread Flavio Pompermaier
Hi to all, I've read about how to create an Elastiscearch index using Spark at http://loads.pickle.me.uk/2013/11/12/spark-and-elasticsearch.html. I have 2 questions: 1 - How is Elasticsearch able to autodetect that the hdfs index files have changed? 2 - Is there anybody that has done the same for

Re: Schema view of HadoopRDD

2014-05-16 Thread Flavio Pompermaier
Is there any Spark plugin/add-on that facilitates querying JSON content? Best, Flavio On Thu, May 15, 2014 at 6:53 PM, Michael Armbrust wrote: > Here is a link with more info: > http://people.apache.org/~pwendell/catalyst-docs/sql-programming-guide.html > > > On Wed, May 7, 2014 at 10:09 PM
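For the archives: later Spark SQL releases added `SQLContext.jsonFile` / `jsonRDD` for exactly this, inferring a schema and letting you run SQL over it. As a language-level illustration, a tiny dotted-path accessor over parsed JSON (hypothetical helper, stdlib only):

```python
import json

doc = json.loads('{"user": {"name": "Flavio", "langs": ["it", "en"]}}')

def get_path(obj, path):
    """Follow a dotted path like 'user.name' through nested dicts."""
    for key in path.split("."):
        obj = obj[key]
    return obj

name = get_path(doc, "user.name")    # "Flavio"
langs = get_path(doc, "user.langs")  # ["it", "en"]
```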

Re: A new resource for getting examples of Spark RDD API calls

2014-05-13 Thread Flavio Pompermaier
Great work! Thanks! On May 13, 2014 3:16 AM, "zhen" wrote: > Hi Everyone, > > I found it quite difficult to find good examples for Spark RDD API calls. > So > my student and I decided to go through the entire API and write examples > for > the vast majority of API calls (basically examples for any

Re: RDD collect help

2014-04-18 Thread Flavio Pompermaier
t > want. But sure it is debatable and it's more my personal opinion. > > > 2014-04-17 23:28 GMT+02:00 Flavio Pompermaier : > > Thanks again Eugen! I don't get the point... why do you prefer to avoid Kryo >> ser for closures? Is there any problem with that? >>

Re: RDD collect help

2014-04-17 Thread Flavio Pompermaier
tion you reference an object outside of it and it is > getting serialized with your task. To enable Kryo serialization for closures set the > spark.closure.serializer property. But usually I don't, as it allows me to > detect such unwanted references. > On Apr 17, 2014 at 22:17, "Flavio Pompermaier" > wrote

Re: RDD collect help

2014-04-17 Thread Flavio Pompermaier
Now I have another problem: I have to pass one of these non-serializable objects to a PairFunction and I get another non-serializable exception. It seems that Kryo doesn't work within Functions. Am I wrong, or is this a limit of Spark? On Apr 15, 2014 1:36 PM, "Flavio Pompermaier"

Re: RDD collect help

2014-04-15 Thread Flavio Pompermaier
FS etc. And even if they were not lazy, no > serialization would happen. Serialization occurs only when data will be > transferred (collect, shuffle, maybe persist to disk - but I am not sure for > this one). > > > 2014-04-15 0:34 GMT+02:00 Flavio Pompermaier : > > Ok, t

Re: RDD collect help

2014-04-14 Thread Flavio Pompermaier
utes it's due to the fact that Java serialization > does not ser/deser attributes from classes that don't impl. Serializable > (in your case the parent classes). > > > 2014-04-14 23:17 GMT+02:00 Flavio Pompermaier : > >> Thanks Eugen for the reply. Could you expla

Re: RDD collect help

2014-04-14 Thread Flavio Pompermaier
n.html > > Eugen > > > 2014-04-14 18:21 GMT+02:00 Flavio Pompermaier : > >> Hi to all, >> >> in my application I read objects that are not serializable because I >> cannot modify the sources. >> So I tried to do a workaround creating a dummy cla

RDD collect help

2014-04-14 Thread Flavio Pompermaier
Hi to all, in my application I read objects that are not serializable because I cannot modify the sources. So I tried a workaround: creating a dummy class that extends the unmodifiable one but implements Serializable. All attributes of the parent class are Lists of objects (some of them are s
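A loose Python analogue of the workaround being described (the thread concerns Java serialization; here pickle plays that role). The point that carries over, and that the replies in this thread confirm, is that the wrapper must explicitly rebuild any state the serializer cannot carry:

```python
import pickle

class Unmodifiable:
    """Stand-in for the third-party class whose source can't be changed."""
    def __init__(self):
        self.handle = lambda: "live resource"   # lambdas can't be pickled

class SerializableWrapper(Unmodifiable):
    def __getstate__(self):
        return {"placeholder": True}   # drop the state pickle can't handle...

    def __setstate__(self, state):
        self.__init__()                # ...and rebuild it on deserialization

w = pickle.loads(pickle.dumps(SerializableWrapper()))
result = w.handle()                    # works again after the round trip
```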

Re: Spark operators on Objects

2014-04-10 Thread Flavio Pompermaier
? Is there any suggestion about how to start? On Wed, Apr 9, 2014 at 11:37 PM, Flavio Pompermaier wrote: > Any help about this...? > On Apr 9, 2014 9:19 AM, "Flavio Pompermaier" wrote: > >> Hi to everybody, >> >> In my current scenario I have complex objec

Re: Spark on YARN performance

2014-04-10 Thread Flavio Pompermaier
anage resources & share cluster. > > Mayur Rustagi > Ph: +1 (760) 203 3257 > http://www.sigmoidanalytics.com > @mayur_rustagi <https://twitter.com/mayur_rustagi> > > > > On Wed, Apr 9, 2014 at 12:10 AM, Flavio Pompermaier > wrote: > >> Hi to everybody, >

Re: Spark operators on Objects

2014-04-09 Thread Flavio Pompermaier
Any help about this...? On Apr 9, 2014 9:19 AM, "Flavio Pompermaier" wrote: > Hi to everybody, > > In my current scenario I have complex objects stored as xml in an HBase > Table. > What's the best strategy to work with them? My final goal would be to > define

Spark operators on Objects

2014-04-09 Thread Flavio Pompermaier
Hi to everybody, In my current scenario I have complex objects stored as xml in an HBase Table. What's the best strategy to work with them? My final goal would be to define operators on those objects (like filter, equals, append, join, merge, etc) and then work with multiple RDDs to perform some k
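A sketch of what such reusable operators could look like over parsed XML records (hypothetical schema; in Spark each parsed dict would live in an RDD and the predicates below would be passed to `rdd.filter` / `rdd.map`):

```python
import xml.etree.ElementTree as ET

records = [
    "<person><name>Flavio</name><age>30</age></person>",
    "<person><name>Nick</name><age>25</age></person>",
]

def parse(xml_str):
    """Flatten a one-level XML record into a dict."""
    return {child.tag: child.text for child in ET.fromstring(xml_str)}

def age_at_least(minimum):
    """A reusable filter operator, as the post asks for."""
    return lambda person: int(person["age"]) >= minimum

people = [parse(r) for r in records]
adults = [p for p in people if age_at_least(28)(p)]
# adults == [{"name": "Flavio", "age": "30"}]
```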

Spark on YARN performance

2014-04-09 Thread Flavio Pompermaier
Hi to everybody, I'm new to Spark and I'd like to know if running Spark on top of YARN or Mesos could affect (and how much) its performance. Is there any doc about this? Best, Flavio

Re: Spark and HBase

2014-04-08 Thread Flavio Pompermaier
> out. > > /usr/bin > > > On Tue, Apr 8, 2014 at 9:57 AM, Flavio Pompermaier > wrote: > >> Hi to everybody, >> >> in these days I looked a bit at the recent evolution of the big data >> stacks and it seems that HBase is somehow fading away in favour

Spark and HBase

2014-04-08 Thread Flavio Pompermaier
Hi to everybody, these days I've been looking a bit at the recent evolution of the big data stacks, and it seems that HBase is somehow fading away in favour of Spark+HDFS. Am I correct? Do you think that Spark and HBase should work together or not? Best regards, Flavio