I am not sure what you are trying to achieve here. Have you thought about
using Flume? Or maybe something like rsync?
On Sat, Sep 12, 2015 at 0:02, Varadhan, Jawahar wrote:
> Hi all,
> I have coded a custom receiver which receives Kafka messages. These
> Kafka messages have FTP
Inspired by this post:
http://eugenezhulenev.com/blog/2015/07/15/interactive-audience-analytics-with-spark-and-hyperloglog/,
I've started putting together something based on the Spark 1.5 UDAF
interface: https://gist.github.com/MLnick/eca566604f2e4e3c6141
Some questions -
1. How do I get the UDAF
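For context, a minimal skeleton of the Spark 1.5 UserDefinedAggregateFunction interface being discussed; this is an illustrative long-sum rather than the code from the gist, and the class and field names are made up:

import org.apache.spark.sql.Row
import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
import org.apache.spark.sql.types._

// Minimal UDAF sketch: a long sum, only to show the required overrides.
class LongSum extends UserDefinedAggregateFunction {
  // Schema of the input column(s) the UDAF is applied to.
  def inputSchema: StructType = StructType(StructField("value", LongType) :: Nil)
  // Schema of the intermediate aggregation buffer.
  def bufferSchema: StructType = StructType(StructField("sum", LongType) :: Nil)
  // Type of the value returned by evaluate().
  def dataType: DataType = LongType
  def deterministic: Boolean = true

  def initialize(buffer: MutableAggregationBuffer): Unit = {
    buffer(0) = 0L
  }

  def update(buffer: MutableAggregationBuffer, input: Row): Unit = {
    if (!input.isNullAt(0)) buffer(0) = buffer.getLong(0) + input.getLong(0)
  }

  def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit = {
    buffer1(0) = buffer1.getLong(0) + buffer2.getLong(0)
  }

  def evaluate(buffer: Row): Any = buffer.getLong(0)
}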
Hello Nick,
I have been working on a (UDT-less) implementation of HLL++. You can find
the PR here: https://github.com/apache/spark/pull/8362. This currently
implements the dense version of HLL++, which is a further development of
HLL. It returns a Long, but it shouldn't be too hard to return a Row
co
Can I ask why you've done this as a custom implementation rather than using
StreamLib, which is already implemented and widely used? It seems more
portable to me to use a library; for example, I'd like to export the
grouped data with raw HLLs to, say, Elasticsearch, and then do further
on-demand agg
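For reference, a rough sketch of the kind of stream-lib usage being suggested, working with HyperLogLogPlus directly; the keys and the precision value below are made up for illustration:

import com.clearspring.analytics.stream.cardinality.HyperLogLogPlus

// Two sketches built independently (e.g. on different partitions or groups).
val hll1 = new HyperLogLogPlus(12)   // precision p = 12
val hll2 = new HyperLogLogPlus(12)
(1 to 1000).foreach(i => hll1.offer(s"user-$i"))
(500 to 1500).foreach(i => hll2.offer(s"user-$i"))

// Merge the two sketches and estimate the combined distinct count.
hll1.addAll(hll2)
println(hll1.cardinality())          // approximately 1500

// Serialize the raw sketch, e.g. to store next to the grouped keys and
// re-aggregate on demand later.
val bytes: Array[Byte] = hll1.getBytes
val restored = HyperLogLogPlus.Builder.build(bytes)

Storing the serialized registers alongside each group is what would make the result portable to an external store such as Elasticsearch.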
Hi,
I am using the Spark 1.4.1 DataFrame API to read JSON data and then save it as
ORC. The code is very simple:
DataFrame json = sqlContext.read().json(input);
json.write().format("orc").save(output);
The job failed. What is wrong, given this exception? Thanks.
Exception in thread "main" org.apache.spark.sq
I should add that surely the idea behind UDTs is exactly that they can (a) fit
automatically into DataFrames and Tungsten, and (b) be used efficiently
when writing one's own UDTs and UDAFs?
On Sat, Sep 12, 2015 at 11:05 AM, Nick Pentreath
wrote:
> Can I ask why you've done this as a custom implem
There are actually 33 Java files in src/main/scala -- I
opened https://issues.apache.org/jira/browse/SPARK-10576 to track a
discussion and decision.
On Fri, Sep 11, 2015 at 3:10 PM, lonikar wrote:
> It does not cause any problem when building with Maven. But when doing
> eclipse:ec
Is it possible that Canonical_URL occurs more than once in your JSON?
Can you check your JSON input?
Thanks
On Sat, Sep 12, 2015 at 2:05 AM, Fengdong Yu
wrote:
> Hi,
>
> I am using the Spark 1.4.1 DataFrame API to read JSON data and then save it as
> ORC. The code is very simple:
>
> DataFrame json = sql
Hi Ted,
I checked the JSON; there are no duplicated keys in it.
Azuryy Yu
Sr. Infrastructure Engineer
cell: 158-0164-9103
WeChat: azuryy
On Sat, Sep 12, 2015 at 5:52 PM, Ted Yu wrote:
> Is it possible that Canonical_URL occurs more than once in your JSON?
>
> Can you check your json input
Thanks. Yes, that's exactly what I would like to do: copy large amounts of
data to GPU RAM, perform computation, and get bulk rows back for map/filter
or reduce results. It is true that non-trivial operations benefit more. Even
streaming data to GPU RAM and interleaving computation with data transfer
w
Thanks for pointing to the YARN JIRA. For now, it is good material for my talk,
since it shows that the Hadoop and big data community is already aware of
GPUs and is making an effort to exploit them.
Good luck with your talk. That fear is lurking in my mind too :)
On 10-Sep-2015 2:08 pm, "Steve Loughran"
Can you take a look at SPARK-5278, where ambiguity is shown between field
names which differ only by case? (See the small sketch below this message.)
Cheers
On Sat, Sep 12, 2015 at 3:40 AM, Fengdong Yu
wrote:
> Hi Ted,
> I checked the JSON; there are no duplicated keys in it.
>
>
> Azuryy Yu
> Sr. Infrastructure Engineer
>
> cel: 158-0
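As a small, hypothetical illustration of the kind of case-only difference SPARK-5278 describes (whether this resembles the actual failing input is a guess):

// JSON with two keys that differ only by case.
val rdd = sc.parallelize(Seq("""{"Canonical_URL": "http://a", "canonical_url": "http://b"}"""))
val df = sqlContext.read.json(rdd)
df.printSchema()                    // both fields show up in the inferred schema
df.select("canonical_url").show()   // resolution can be ambiguous depending on case sensitivity settings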
Good Day!
I think there are some problems between ORC and AWS EMRFS.
When I was trying to read ORC files over 150M from S3, an
ArrayIndexOutOfBoundsException occurred.
I'm sure that it's an AWS-side issue because there was no exception when reading
from HDFS or S3NativeFileSystem.
Parquet runs ordinari
I am typically all for code re-use. The reason for writing this is to
avoid the indirection of a UDT and work directly against memory. A UDT
will work fine at the moment because we still use
GenericMutableRow/SpecificMutableRow as aggregation buffers. However, if you
would use an UnsafeRow as an A
Hello All,
When I push messages into Kafka and read them in a streaming application, I see
the following exception.
I am running the application on YARN and am not broadcasting the message
anywhere within the application. I am simply reading the message, parsing it,
populating fields in a class and then prin
Ok, that makes sense. So this is (a) more efficient, since as far as I can
see it is updating the HLL registers directly in the buffer for each value,
and (b) would be "Tungsten-compatible" as it can work against UnsafeRow? Is
it currently possible to specify an UnsafeRow as a buffer in a UDAF?
So
Hi Nick,
The buffer exposed by the UDAF interface is just a view of an underlying buffer
(this underlying buffer is shared by different aggregate functions, and
every function takes one or multiple slots). If you need a UDAF, extending
UserDefinedAggregateFunction is the preferred
approach. AggregateFun
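A short usage sketch, assuming a SQLContext named sqlContext and a UDAF class like the LongSum skeleton shown earlier in the thread (the table and column names are made up):

import org.apache.spark.sql.functions.col

val df = sqlContext.createDataFrame(Seq(("a", 1L), ("a", 2L), ("b", 5L))).toDF("group_id", "value")

// Apply the UDAF directly in a DataFrame aggregation.
val longSum = new LongSum
df.groupBy("group_id").agg(longSum(col("value")).as("total")).show()

// The same instance can be registered for use from SQL.
sqlContext.udf.register("long_sum", longSum)
df.registerTempTable("events")
sqlContext.sql("SELECT group_id, long_sum(value) AS total FROM events GROUP BY group_id").show()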
Most of these files are just package-info.java, there to provide a good package
index for the JavaDoc. If we move them, we will need to create a folder under the
java tree for each package that exposes any documentation. And it is very
likely we will forget to update package-info.java when we update
package.sc
Thanks Yin.
So how does one ensure a UDAF works with Tungsten and UnsafeRow buffers? Or is
this something that will be included in the UDAF interface in the future?
Is there a performance difference between extending UDAF vs Aggregate2?
It's also not clear to me how to handle inputs of dif