FWIW, CSV has the same problem that renders it immune to naive partitioning.
Consider the following RFC 4180 compliant record:
1,2,"
all,of,these,are,just,one,field
",4,5
Now, it's probably a terrible idea to give a file system awareness of
actual file types, but couldn't HDFS handle this near
both Flink and Spark into one. This eases industry adoption instead.
Thanking you.
With Regards
Sree
On Wednesday, April 29, 2015 3:21 AM, Ewan Higgs wrote:
Hi all,
A quick question about Tungsten. The announcement of the Tungsten
project is on the back of Hadoop Summit in Brussels where some of the
Flink devs were giving talks [1] on how Flink manages memory using byte
arrays and the like to avoid the overhead of all the Java types [2]. Is
there a
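For anyone unfamiliar with the technique being referenced, here is a
rough sketch of the byte-array idea (my own illustration, not Flink or
Tungsten code): records are packed into one flat buffer, so the JVM
sees a single object instead of one boxed object per record.

import java.nio.ByteBuffer

// Pack 1000 (Long key, Long value) records into one buffer: no
// per-record object headers, no per-record GC pressure.
val recordSize = 16 // two 8-byte longs
val buf = ByteBuffer.allocate(recordSize * 1000)
for (i <- 0 until 1000) {
  buf.putLong(i.toLong)     // key
  buf.putLong(i.toLong * 2) // value
}
// Records are addressed by offset arithmetic instead of pointer chasing:
val key   = buf.getLong(42 * recordSize)
val value = buf.getLong(42 * recordSize + 8)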
WIP branch
Date: Wed, 14 Jan 2015 14:33:45 +0100
From: Ewan Higgs
To: dev@spark.apache.org
To add to Sean and Reynold's point:
Please correct me if I'm wrong, but Spark depends on hadoop-common which
also uses jetty in the HttpServer2 code. So even if you remove jetty
from Spark by making it an optional dependency, it will be pulled in by
Hadoop.
So you'll still see that your prog
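If someone does need to keep Jetty off their own classpath, the usual
workaround is an explicit exclusion on the Hadoop dependency. A
build.sbt sketch (the version number is illustrative; under Hadoop 2.x,
Jetty 6 ships under the org.mortbay.jetty group):

// Exclude the Jetty that hadoop-common drags in transitively.
libraryDependencies += "org.apache.hadoop" % "hadoop-common" % "2.6.0" excludeAll (
  ExclusionRule(organization = "org.mortbay.jetty")
)

As noted above, though, anything in hadoop-common that actually
instantiates HttpServer2 will then fail at runtime.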
ing it
there [1]. I put it on the back burner until someone can get back to me
on it.
Yours,
Ewan Higgs
[1]
http://apache-spark-developers-list.1001551.n3.nabble.com/SparkSpark-perf-terasort-WIP-branch-tt10105.html
On 02/02/15 23:26, Kannan Rajah wrote:
Is there a recommended performance test
nd [2]. Then
we should be able to get slurm, pbs, and sge in one shot rather than
implementing some wire formats for RPC.
Thanks,
Ewan Higgs
[1] https://hadoop.apache.org/docs/r1.2.1/hod_scheduler.html
https://github.com/glennklockwood/hpchadoop
http://jaliyacgl.blogspot.be/2008/08/hadoop-as-batc
ystem implementation that overrides the listStatus
method, and then in Hadoop Conf set the fs.file.impl to that.
Shouldn't be too hard. Would you be interested in working on it?
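A minimal sketch of that idea (the class name and wiring are my
assumptions, not existing Spark or Hadoop code):

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileStatus, Path, RawLocalFileSystem}

// A local FileSystem whose directory listings come back in name order,
// matching the ordering HDFS gives.
class SortedLocalFileSystem extends RawLocalFileSystem {
  override def listStatus(f: Path): Array[FileStatus] =
    super.listStatus(f).sortBy(_.getPath.getName)
}

// Wire it in through the Hadoop configuration, as suggested above:
val conf = new Configuration()
conf.set("fs.file.impl", classOf[SortedLocalFileSystem].getName)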
On Fri, Jan 16, 2015 at 3:36 PM, Ewan Higgs <ewan.hi...@ugent.be> wrote:
Yes, I am running on
local file system right? HDFS orders the files
based on names, but local file systems often don't. I think that's why
there's a difference.
We might be able to do a sort and order the partitions when we create
an RDD to make this universal, though.
On Fri, Jan 16, 2015 at 8:26 AM,
Hi all,
Quick one: when reading files, is the order of partitions guaranteed
to be preserved? I am finding some weird behaviour where I run
sortByKey() on an RDD (which has 16 byte keys) and write it to disk. If
I open a python shell and run the following:
for part in range(29):
print
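The truncated loop above presumably read each part file back and
checked that the keys were globally ordered. A hypothetical Scala
reconstruction of that kind of check (the directory name, file naming
scheme, and 29-partition count are assumptions from the surrounding
text, not the original script):

import java.nio.file.{Files, Paths}
import scala.math.Ordering.Implicits._

val dir = "sorted-output" // assumed output directory
val firstKeys = (0 until 29).map { part =>
  Files.readAllBytes(Paths.get(dir, f"part-$part%05d"))
    .take(16)            // the 16-byte key of the first record
    .map(_ & 0xff).toSeq // compare bytes as unsigned values
}
// If partition order is preserved, the first key of each successive
// partition should already be ascending:
assert(firstKeys == firstKeys.sorted, "partition order was not preserved")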
Hi all,
I'm trying to build the Spark-perf WIP code but there are some errors to
do with Hadoop APIs. I presume this is because there is some Hadoop
version set and it's referring to that. But I can't seem to find it.
The errors are as follows:
[info] Compiling 15 Scala sources and 2 Java sou
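For what it's worth, sbt-based builds usually pin the Hadoop dependency
somewhere like the fragment below; this is a hypothetical sketch (names
and versions are assumptions, not the actual spark-perf build), but it
shows the kind of setting to look for:

// Pin the Hadoop client the build compiles against; overridable on the
// command line, e.g. sbt -Dhadoop.version=2.6.0 compile
val hadoopVersion = sys.props.getOrElse("hadoop.version", "2.4.0")
libraryDependencies += "org.apache.hadoop" % "hadoop-client" % hadoopVersion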
not be functioning appropriately. If you have trouble with
it, I recommend using the Hadoop version.
Yours,
Ewan
> Thanks,
> Tim
>
>
> On 12/16/14, 12:38 AM, "Ewan Higgs" wrote:
>
>> Hi Tim,
>> run-example is here:
>> https://github.com/ehiggs/spa
Hi Tim,
run-example is here:
https://github.com/ehiggs/spark/blob/terasort/bin/run-example
It should be in the repository that you cloned. So if you were at the
top level of the checkout, run-example would be run as ./bin/run-example.
Yours,
Ewan Higgs
On 12/12/14 01:06, Tim Harsch wrote
great. I think the consensus from last time was that we would
put performance stuff into spark-perf, so it is easy to test different
Spark versions.
On Tue, Nov 11, 2014 at 5:03 AM, Ewan Higgs <ewan.hi...@ugent.be> wrote:
Hi all,
I saw that Reynold Xin had a Terasort e
helped me get
through learning some rudimentary Scala to get this far.
Yours,
Ewan Higgs
[1] https://github.com/apache/spark/pull/1242