Hi to all,
I have 2 RDDs, D1 and D2, like:
D1:
A,p1,a
A,p2,a2
A,p3,X
B,p3,Y
B,p1,b1
D2:
X,s,V
X,r,2
Y,j,k
I'd like to have a single RDD D3 (of Tuple4) like:
A,X,a1,a2
B,Y,b1,null
basically joining where D1.f2 == D2.f0 and filling with null when a value is
missing.
Is that possible and how? Could you show me a simple snippet?
Thanks in advance
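Just to make the intent concrete, here is a rough sketch of what I have in mind (assuming the p1 value for A is a1, as in the expected output, and that the "link" value of each id is the one that also appears as a key in D2 - all names below are only illustrative):

val d1 = sc.parallelize(Seq(
  ("A", "p1", "a1"), ("A", "p2", "a2"), ("A", "p3", "X"),
  ("B", "p3", "Y"), ("B", "p1", "b1")))
val d2 = sc.parallelize(Seq(("X", "s", "V"), ("X", "r", "2"), ("Y", "j", "k")))

val d2Keys = d2.map(_._1).distinct().collect().toSet // assumed small enough to broadcast
val d2KeysBc = sc.broadcast(d2Keys)

val d3 = d1.groupBy(_._1).map { case (id, rows) =>
  // the value that also appears as a key in D2 becomes the second field
  val link = rows.collectFirst { case (_, _, v) if d2KeysBc.value.contains(v) => v }.orNull
  val p1   = rows.collectFirst { case (_, "p1", v) => v }.orNull
  val p2   = rows.collectFirst { case (_, "p2", v) => v }.orNull
  (id, link, p1, p2)
}
// d3: (A,X,a1,a2), (B,Y,b1,null)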
/21185092/apache-spark-map-vs-mappartitions>
> and here
> <http://bzhangusc.wordpress.com/2014/06/19/optimize-map-performamce-with-mappartitions/>).
> So for simpler code, you can go with map, and for efficiency, you can go
> with mapPartitions.
>
> Regards,
> Kamal
>
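For example, something along these lines (rdd, createConnection and process are just placeholders for an expensive per-record setup):

// Per-element: the setup cost is paid once per record.
val out1 = rdd.map { record =>
  val conn = createConnection()
  try process(conn, record) finally conn.close()
}

// Per-partition: the setup cost is paid once per partition.
val out2 = rdd.mapPartitions { records =>
  val conn = createConnection()
  records.map(r => process(conn, r)) // note: the iterator is lazy, so closing conn here is tricky
}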
Hi to all,
I'm trying to convert my old MapReduce job to a Spark one, but I have some
doubts...
My application basically buffers a batch of updates and every 100 elements
it flushes the batch to a server. This is very easy in MapReduce but I
don't know how you can do that in Scala...
For example, if
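A rough sketch of what I'd like to express, assuming the flush is a plain client call (rdd and flushBatch below are placeholders):

rdd.foreachPartition { records =>
  // grouped(100) yields batches of at most 100 elements within each partition
  records.grouped(100).foreach { batch =>
    flushBatch(batch) // placeholder: send the batch to the server
  }
}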
Maybe you could implement something like this (I don't know if something
similar already exists in Spark):
http://www.cs.berkeley.edu/~jnwang/papers/icde14_massjoin.pdf
Best,
Flavio
On Oct 8, 2014 9:58 PM, "Nicholas Chammas"
wrote:
> Multiple values may be different, yet still be considered dup
Hi to all, sorry for not being fully on topic, but I have 2 quick questions
about Parquet tables registered in Hive/Spark:
1) where are the created tables stored?
2) If I have multiple hiveContexts (one per application) using the same
parquet table, is there any problem if inserting concurrently fr
Isn't Sqoop export meant for that?
http://hadooped.blogspot.it/2013/06/apache-sqoop-part-3-data-transfer.html?m=1
On Aug 7, 2014 7:59 PM, "Nicholas Chammas"
wrote:
> Vida,
>
> What kind of database are you trying to write to?
>
> For example, I found that for loading into Redshift, by far the ea
Hi everybody,
I have a scenario where I would like to stream data to different
persistence types (i.e. SQL db, graph db, HDFS, etc.) and perform some
filtering and transformation as the data comes in.
The problem is how to maintain consistency between all the datastores (maybe
some operation could fail).
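A rough sketch of what I mean, fanning one stream out to several sinks (events, isValid, transform, writeToSql and writeToGraphDb are placeholders; this doesn't solve cross-store atomicity, every write can still fail independently):

val filtered = events.filter(isValid _).map(transform _)

filtered.foreachRDD { rdd =>
  rdd.foreachPartition { part =>
    val batch = part.toList
    writeToSql(batch)     // placeholder sink 1
    writeToGraphDb(batch) // placeholder sink 2
  }
}
filtered.saveAsTextFiles("hdfs:///data/events") // keep a copy on HDFS (path is illustrative)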
Hi folks,
I was looking at the benchmark provided by Cloudera at
http://blog.cloudera.com/blog/2014/05/new-sql-choices-in-the-apache-hadoop-ecosystem-why-impala-continues-to-lead/
.
Is it true that Shark cannot execute some queries if you don't have enough
memory?
And is it true/reliable that Impala
Hi guys,
I'm evaluating the possibility of using Spark to analyze RDF files and define
reusable Shark operators on them (custom filtering, transforming,
aggregating, etc.). Is that possible? Any hints?
Best,
Flavio
/src/main/scala/org/apache/spark/streaming/kafka/KafkaInputDStream.scala>
> to
> see how they manage it using several worker threads.
>
> My suggestion would be to knock-up a basic custom receiver and give it a
> shot!
>
> MC
>
>
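For reference, a minimal custom receiver sketch (the source url and fetchNextEvent are placeholders for the real source):

import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver

class SimpleReceiver(url: String)
  extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) {

  def onStart(): Unit = {
    // Receive on a separate thread so that onStart() returns immediately.
    new Thread("Simple Receiver") {
      override def run(): Unit = {
        while (!isStopped()) {
          store(fetchNextEvent(url)) // hand each received item to Spark
        }
      }
    }.start()
  }

  def onStop(): Unit = {} // nothing to clean up in this sketch

  private def fetchNextEvent(source: String): String = ??? // placeholder blocking call
}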
> On 19 June 2014 09:31, Flavio Pompe
of event and only send events to Spark streams when it's
> ready to process more messages.
>
> Hope this helps.
>
> -Soumya
>
>
>
>
> On Wed, Jun 18, 2014 at 6:50 PM, Flavio Pompermaier
> wrote:
>
>> Thanks for the quick reply Soumya. Un
This component can control the input rate to Spark.
>
> > On Jun 18, 2014, at 6:13 PM, Flavio Pompermaier
> wrote:
> >
> > Hi to all,
> > in my use case I'd like to receive events and call an external service
> as they pass through. Is it possible to limit the
Hi to all,
in my use case I'd like to receive events and call an external service as
they pass through. Is it possible to limit the number of concurrent
calls to that service (to avoid a DoS) using Spark Streaming? If so, limiting
the rate implies possible buffer growth... how can I control the
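One workaround I'm considering is a sketch like the following, capping the number of in-flight calls per executor with a semaphore (events, callService and the limit of 10 are just assumptions):

import java.util.concurrent.Semaphore

object Throttle {
  val permits = new Semaphore(10) // assumed limit of 10 concurrent calls per executor JVM
}

events.foreachRDD { rdd =>
  rdd.foreachPartition { iter =>
    iter.foreach { event =>
      Throttle.permits.acquire()
      try callService(event)
      finally Throttle.permits.release()
    }
  }
}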
Is there a way to query fields by similarity (like Lucene or using a
similarity metric) to be able to query something like WHERE language LIKE
"it~0.5" ?
Best,
Flavio
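The workaround I can think of is filtering with a similarity function on the RDD, e.g. a normalized Levenshtein distance (the 0.5 threshold and the rows/language schema below are only illustrative):

// Normalized similarity based on Levenshtein edit distance.
def levenshtein(a: String, b: String): Int = {
  val d = Array.tabulate(a.length + 1, b.length + 1) { (i, j) =>
    if (i == 0) j else if (j == 0) i else 0
  }
  for (i <- 1 to a.length; j <- 1 to b.length) {
    val cost = if (a(i - 1) == b(j - 1)) 0 else 1
    d(i)(j) = math.min(math.min(d(i - 1)(j) + 1, d(i)(j - 1) + 1), d(i - 1)(j - 1) + cost)
  }
  d(a.length)(b.length)
}

def similarity(a: String, b: String): Double =
  1.0 - levenshtein(a, b).toDouble / math.max(a.length, b.length).max(1)

// rows is an RDD of records with a `language` field (placeholder schema)
val matches = rows.filter(r => similarity(r.language, "it") >= 0.5)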
On Thu, May 22, 2014 at 8:56 AM, Michael Cutler wrote:
> Hi Nick,
>
> Here is an illustrated example which extracts certain fie
Hi to all,
I've read about how to create an Elasticsearch index using Spark at
http://loads.pickle.me.uk/2013/11/12/spark-and-elasticsearch.html.
I have 2 questions:
1 - How is Elasticsearch able to autodetect that the HDFS index files have
changed?
2 - Is there anybody that has done the same for
Is there any Spark plugin/add-on that facilitates querying JSON
content?
Best,
Flavio
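The generic approach I know of is to parse each line with a JSON library inside a map, e.g. with json4s (the lines RDD and the "name" field are just assumptions):

import org.json4s._
import org.json4s.jackson.JsonMethods.parse

implicit val formats = DefaultFormats

// lines is an RDD[String], one JSON document per line (assumed layout).
val names = lines.map { line =>
  val json = parse(line)
  (json \ "name").extract[String] // "name" is just an illustrative field
}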
On Thu, May 15, 2014 at 6:53 PM, Michael Armbrust wrote:
> Here is a link with more info:
> http://people.apache.org/~pwendell/catalyst-docs/sql-programming-guide.html
>
>
> On Wed, May 7, 2014 at 10:09 PM
Great work! Thanks!
On May 13, 2014 3:16 AM, "zhen" wrote:
> Hi Everyone,
>
> I found it quite difficult to find good examples for Spark RDD API calls.
> So
> my student and I decided to go through the entire API and write examples
> for
> the vast majority of API calls (basically examples for any
t
> want. But sure, it is debatable and it's more my personal opinion.
>
>
> 2014-04-17 23:28 GMT+02:00 Flavio Pompermaier :
>
> Thanks again Eugen! I don't get the point... why do you prefer to avoid Kryo
>> serialization for closures? Is there any problem with that?
>>
tion you reference an object outside of it and it is
> getting serialized with your task. To enable Kryo serialization for closures,
> set the spark.closure.serializer property. But usually I don't, as it allows
> me to detect such unwanted references.
> On Apr 17, 2014 at 22:17, "Flavio Pompermaier" wrote:
Now I have another problem... I have to pass one of these non-serializable
objects to a PairFunction and I got another non-serializable
exception... it seems that Kryo doesn't work within Functions. Am I wrong, or
is this a limitation of Spark?
On Apr 15, 2014 1:36 PM, "Flavio Pompermaier"
FS etc. And even if they were not lazy, no
> serialization would happen. Serialization occurs only when data will be
> transferred (collect, shuffle, maybe persist to disk - but I am not sure
> about this one).
>
>
> 2014-04-15 0:34 GMT+02:00 Flavio Pompermaier :
>
> Ok, t
utes it's due to the fact that Java serialization
> does not serialize/deserialize attributes from classes that don't implement
> Serializable (in your case the parent classes).
>
>
> 2014-04-14 23:17 GMT+02:00 Flavio Pompermaier :
>
>> Thanks Eugen for the reply. Could you expla
n.html
>
> Eugen
>
>
> 2014-04-14 18:21 GMT+02:00 Flavio Pompermaier :
>
>> Hi to all,
>>
>> in my application I read objects that are not serializable because I
>> cannot modify the sources.
>> So I tried to do a workaround creating a dummy cla
Hi to all,
in my application I read objects that are not serializable because I cannot
modify the sources.
So I tried a workaround: creating a dummy class that extends the
unmodifiable one but implements Serializable.
All attributes of the parent class are Lists of objects (some of them are
s
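A rough sketch of the workaround I'm trying, plus the Kryo registration I've seen suggested (ThirdPartyThing is a placeholder for the unmodifiable class):

import com.esotericsoftware.kryo.Kryo
import org.apache.spark.SparkConf
import org.apache.spark.serializer.KryoRegistrator

// Option 1: a wrapper that only adds Serializable. Note that plain Java
// serialization still skips the fields declared in the non-Serializable parent.
class SerializableThing extends ThirdPartyThing with Serializable

// Option 2: switch data serialization to Kryo and register the class.
class MyRegistrator extends KryoRegistrator {
  override def registerClasses(kryo: Kryo): Unit = {
    kryo.register(classOf[ThirdPartyThing])
  }
}

val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryo.registrator", "MyRegistrator") // fully qualified class name of the registrator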
? Is there any suggestion about how to start?
On Wed, Apr 9, 2014 at 11:37 PM, Flavio Pompermaier wrote:
> Any help about this...?
> On Apr 9, 2014 9:19 AM, "Flavio Pompermaier" wrote:
>
>> Hi to everybody,
>>
>> In my current scenario I have complex objec
anage resources & share cluster.
>
> Mayur Rustagi
> Ph: +1 (760) 203 3257
> http://www.sigmoidanalytics.com
> @mayur_rustagi <https://twitter.com/mayur_rustagi>
>
>
>
> On Wed, Apr 9, 2014 at 12:10 AM, Flavio Pompermaier
> wrote:
>
>> Hi to everybody,
>
Any help about this...?
On Apr 9, 2014 9:19 AM, "Flavio Pompermaier" wrote:
> Hi to everybody,
>
> In my current scenario I have complex objects stored as xml in an HBase
> Table.
> What's the best strategy to work with them? My final goal would be to
> define
Hi to everybody,
In my current scenario I have complex objects stored as XML in an HBase
table.
What's the best strategy to work with them? My final goal would be to
define operators on those objects (like filter, equals, append, join,
merge, etc) and then work with multiple RDDs to perform some k
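A rough sketch of what I mean, reading the table into an RDD with the standard TableInputFormat and parsing the XML (the table name, column family/qualifier and parseXml are placeholders):

import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.Result
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableInputFormat
import org.apache.hadoop.hbase.util.Bytes

val hbaseConf = HBaseConfiguration.create()
hbaseConf.set(TableInputFormat.INPUT_TABLE, "my_table") // placeholder table name

val raw = sc.newAPIHadoopRDD(hbaseConf, classOf[TableInputFormat],
  classOf[ImmutableBytesWritable], classOf[Result])

// Extract the xml column and parse it into a domain object (parseXml is a placeholder).
val objects = raw.map { case (_, result) =>
  val xml = Bytes.toString(result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("xml")))
  parseXml(xml)
}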
Hi to everybody,
I'm new to Spark and I'd like to know whether running Spark on top of YARN or
Mesos could affect its performance (and by how much). Is there any doc about
this?
Best,
Flavio
> out.
>
> /usr/bin
>
>
> On Tue, Apr 8, 2014 at 9:57 AM, Flavio Pompermaier
> wrote:
>
>> Hi to everybody,
>>
>> in these days I looked a bit at the recent evolution of the big data
>> stacks and it seems that HBase is somehow fading away in favour
Hi to everybody,
lately I've looked a bit at the recent evolution of the big data stacks,
and it seems that HBase is somehow fading away in favour of Spark+HDFS. Am
I correct?
Do you think that Spark and HBase should work together or not?
Best regards,
Flavio