Hi folks,
Just a friendly message that we have added Python support to the REST
Spark Job Server project. If you are a Python user looking for a
RESTful way to manage your Spark jobs, please come have a look at our
project!
https://github.com/spark-jobserver/spark-jobserver
-Evan
---
at Mark is running a slightly-modified version of stock Spark.
>>> (He's mentioned this in prior posts, as well.)
>>>
>>> And I have to say that I'm, personally, seeing more and more
>>> slightly-modified versions of Spark being deployed to production to
>> (He's mentioned this in prior posts, as well.)
>>
>> And I have to say that I'm, personally, seeing more and more
>> slightly-modified versions of Spark being deployed to production to
>> work around outstanding PRs and JIRAs.
>>
>> this may not be what peop
One of the premises here is that if you can restrict your workload to
fewer cores - which is easier with FiloDB and careful data modeling -
you can make this work for much higher concurrency and lower latency
than most typical Spark use cases.
The reason why it typically does not work in productio
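A rough sketch of what "restricting the workload to fewer cores" can look like with plain Spark settings (the values below are just illustrative):

import org.apache.spark.{SparkConf, SparkContext}

// Cap this application's share of the cluster so many concurrent
// queries can run side by side instead of one query grabbing everything.
val conf = new SparkConf()
  .setAppName("low-latency-queries")
  .set("spark.cores.max", "4")     // total cores for this app (standalone/Mesos)
  .set("spark.task.cpus", "1")     // one core per task
val sc = new SparkContext(conf)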
Hey folks,
I just saw a recent thread on here (but can't find it anymore) on
using Spark as a web-speed query engine. I want to let you guys know
that this is definitely possible! Most folks don't realize how
low-latency Spark can actually be. Please check out my blog post
below on achieving
I would expect an SQL query on c to fail because c would not be known in
the schema of the older Parquet file.
What I'd be very interested in is how to add a new column as an incremental
new Parquet file, and be able to somehow join the existing and new files in
an efficient way. I.e., somehow
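One possible route is Parquet schema merging, roughly like this (this assumes a newer Spark release than the 1.0.x discussed in this thread; the paths are made up):

// Read the old and new Parquet files together and let Spark SQL merge
// the schemas, so column c simply comes back as null for the old rows.
val merged = sqlContext.read
  .option("mergeSchema", "true")
  .parquet("hdfs:///data/events_v1", "hdfs:///data/events_v2")
merged.registerTempTable("events")
sqlContext.sql("SELECT c FROM events").show()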
Ashwin,
I would say the strategies in general are:
1) Have each user submit a separate Spark app (each with its own Spark
Context), with its own resource settings, and share data through HDFS
or something like Tachyon for speed.
2) Share a single Spark context amongst multiple users, using fair
scheduling (see the sketch below).
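For example, a bare-bones sketch of option 2 (pool names and paths are just illustrative):

import org.apache.spark.{SparkConf, SparkContext}

// One shared SparkContext with the fair scheduler, so a single user's
// long job does not starve everyone else.
val conf = new SparkConf()
  .setAppName("shared-context")
  .set("spark.scheduler.mode", "FAIR")
val sc = new SparkContext(conf)

// Each user (or each request) tags its jobs with a scheduler pool.
sc.setLocalProperty("spark.scheduler.pool", "user_a")
sc.textFile("hdfs:///shared/dataset").count()
sc.setLocalProperty("spark.scheduler.pool", null)  // back to the default pool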
What Sean said.
You should also definitely turn on Kryo serialization. The default
Java serialization is really, really slow if you're gonna move around
lots of data. Also make sure you use a cluster with high network
bandwidth.
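Concretely, something like this in your SparkConf:

import org.apache.spark.SparkConf

// Switch from the default Java serialization to Kryo for shuffles and
// serialized caching.
val conf = new SparkConf()
  .setAppName("kryo-example")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
// For tighter encoding you can also point spark.kryo.registrator at your
// own KryoRegistrator that registers the classes you move around most.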
On Thu, Sep 18, 2014 at 3:06 AM, Sean Owen wrote:
> Base 64 i
Hi Abel,
Pretty interesting. May I ask how big is your point CSV dataset?
It seems you are relying on searching through the FeatureCollection of
polygons for which one intersects your point. This is going to be
extremely slow. I highly recommend using a SpatialIndex, such as the
many that exist (see the sketch below).
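For example, with the JTS STRtree (a rough sketch using the old com.vividsolutions packages; the toy polygon just stands in for the ones parsed from your GeoJSON):

import com.vividsolutions.jts.geom.{Coordinate, Geometry, GeometryFactory}
import com.vividsolutions.jts.index.strtree.STRtree

val gf = new GeometryFactory()

// Toy polygon standing in for your FeatureCollection.
val polygons: Seq[Geometry] = Seq(
  gf.createPolygon(Array(new Coordinate(0, 0), new Coordinate(0, 10),
    new Coordinate(10, 10), new Coordinate(10, 0), new Coordinate(0, 0))))

// Bulk-load the polygon envelopes into the STR-tree once...
val index = new STRtree()
polygons.foreach(p => index.insert(p.getEnvelopeInternal, p))

// ...then each point only runs the exact intersects test against the few
// candidates whose bounding boxes match, instead of scanning everything.
def polygonsContaining(lon: Double, lat: Double): Seq[Geometry] = {
  val pt = gf.createPoint(new Coordinate(lon, lat))
  index.query(pt.getEnvelopeInternal)
    .toArray.map(_.asInstanceOf[Geometry])
    .filter(_.intersects(pt))
    .toSeq
}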
Sweet, that's probably it. Too bad it didn't seem to make 1.1?
On Wed, Sep 17, 2014 at 5:32 PM, Michael Armbrust
wrote:
> The unknown slowdown might be addressed by
> https://github.com/apache/spark/commit/f858f466862541c3faad76a1fa2391f1c17ec9dd
>
> On Sun, Sep 14, 2014
SPARK-1671 looks really promising.
Note that even right now, you don't need to un-cache the existing
table. You can do something like this:
newAdditionRdd.registerTempTable("table2")
sqlContext.cacheTable("table2")
val unionedRdd = sqlContext.table("table1").unionAll(sqlContext.table("table2"))
Filed SPARK-3295.
On Mon, Aug 25, 2014 at 12:49 PM, Michael Armbrust
wrote:
>> So I tried the above (why doesn't union or ++ have the same behavior
>> btw?)
>
>
> I don't think there is a good reason for this. I'd open a JIRA.
>
>>
>> and it works, but is slow because the original Rdds are not
>
There's no way to avoid a shuffle entirely, since the first and last
elements of each partition need to be combined with elements from other
partitions, but I wonder if there is a way to do a minimal shuffle.
On Thu, Aug 21, 2014 at 6:13 PM, cjwang wrote:
> One way is to do zipWithIndex on the RDD. Then use the index as
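Roughly along the lines of the quoted suggestion, in the shell:

import org.apache.spark.SparkContext._

// Index the elements, key by the index, and shift a copy by one so each
// element can be paired with its neighbour; the join is where the
// (hopefully minimal) shuffle happens.
val rdd = sc.parallelize(Seq(3.0, 1.0, 4.0, 1.0, 5.0), 2)
val indexed = rdd.zipWithIndex().map { case (v, i) => (i, v) }
val shifted = indexed.map { case (i, v) => (i + 1, v) }

// (index, (current, previous)) pairs; index 0 drops out, having no previous.
val withPrevious = indexed.join(shifted)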
Dear community,
Wow, I remember when we first open sourced the job server, at the
first Spark Summit in December. Since then, more and more of you have
started using it and contributing to it. It is awesome to see!
If you are not familiar with the spark job server, it is a REST API
for managin
And it worked earlier with a non-Parquet directory.
On Thu, Aug 21, 2014 at 12:22 PM, Evan Chan wrote:
> The underFS is HDFS btw.
>
> On Thu, Aug 21, 2014 at 12:22 PM, Evan Chan wrote:
>> Spark 1.0.2, Tachyon 0.4.1, Hadoop 1.0 (standard EC2 config)
>>
The underFS is HDFS btw.
On Thu, Aug 21, 2014 at 12:22 PM, Evan Chan wrote:
> Spark 1.0.2, Tachyon 0.4.1, Hadoop 1.0 (standard EC2 config)
>
> scala> val gdeltT =
> sqlContext.parquetFile("tachyon://172.31.42.40:19998/gdelt-parquet/1979-2005/")
> 14/08/21 19:07:14
Spark 1.0.2, Tachyon 0.4.1, Hadoop 1.0 (standard EC2 config)
scala> val gdeltT =
sqlContext.parquetFile("tachyon://172.31.42.40:19998/gdelt-parquet/1979-2005/")
14/08/21 19:07:14 INFO :
initialize(tachyon://172.31.42.40:19998/gdelt-parquet/1979-2005,
Configuration: core-default.xml, core-site.xml
I just put up a repo with a write-up on how to import the GDELT public
dataset into Spark SQL and play around with it. It has a lot of notes on
different import methods and observations about Spark SQL. Feel free
to have a look and comment.
http://www.github.com/velvia/spark-sql-gdelt
---
014 at 12:17 AM, Michael Armbrust
wrote:
> I believe this should work if you run srdd1.unionAll(srdd2). Both RDDs must
> have the same schema.
>
>
> On Wed, Aug 20, 2014 at 11:30 PM, Evan Chan wrote:
>>
>> Is it possible to merge two cached Spark SQL tables into a sing
Is it possible to merge two cached Spark SQL tables into a single
table so it can be queried with one SQL statement?
I.e., can you do schemaRdd1.union(schemaRdd2), then register the new
schemaRdd and run a query over it?
Ideally, both schemaRdd1 and schemaRdd2 would be cached, so the union
should run
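In other words, per the unionAll suggestion in the reply above, something like:

// Both tables cached; the schemas must line up.
val merged = schemaRdd1.unionAll(schemaRdd2)
merged.registerTempTable("merged")
sqlContext.sql("SELECT COUNT(*) FROM merged").collect()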
That might not be enough. Reflection is used to determine what the
fields are, thus your class might actually need to have members
corresponding to the fields in the table.
I heard that a more generic method of inputting stuff is coming.
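For the reflection path, a tiny illustration (the case class and its fields are made up for the example):

// The case class's members become the table's columns via reflection
// (Spark 1.0/1.1-era SchemaRDD API).
case class GdeltEvent(eventId: Long, country: String, goldstein: Double)

import sqlContext.createSchemaRDD  // implicit RDD[Product] => SchemaRDD

val events = sc.textFile("hdfs:///gdelt/subset.csv")
  .map(_.split(","))
  .map(f => GdeltEvent(f(0).toLong, f(1), f(2).toDouble))
events.registerTempTable("gdelt")
sqlContext.sql("SELECT country, COUNT(*) FROM gdelt GROUP BY country")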
On Tue, Aug 19, 2014 at 6:43 PM, Tobias Pfeiffer wrote:
>
Hey guys,
I'm using Spark 1.0.2 in AWS with 8 x c3.xlarge machines. I am
working with a subset of the GDELT dataset (57 columns, > 250 million
rows, but my subset is only 4 million) and trying to query it with
Spark SQL.
Since a CSV importer isn't available, my first thought was to use
nested c