spark.hadoop.fs.s3a.secret.key=$s3Secret \
--conf spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem \
${execJarPath}
I am using Spark 2.3.0 with Scala on a standalone cluster
with three workers.
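For reference, the same S3A settings can also be set programmatically on the SparkSession; a minimal sketch, with placeholder credentials read from environment variables (an assumption, not my actual setup):
import org.apache.spark.sql.SparkSession

// Sketch only: the env var names and app name are placeholders.
val spark = SparkSession.builder()
  .appName("s3a-example")
  .config("spark.hadoop.fs.s3a.access.key", sys.env("S3_ACCESS_KEY"))
  .config("spark.hadoop.fs.s3a.secret.key", sys.env("S3_SECRET_KEY"))
  .config("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
  .getOrCreate()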
Cheers
Marius
too large to justify copying them around using addFile. If this is
not possible, I would like to know whether the community would be interested
in such a feature.
Cheers
Marius
nding the s3 handler is not using the provided credentials.
Does anyone have an idea how to fix this?
Cheers and thanks in advance,
Marius
reduceByKey.
>
> On 29 Aug 2016 20:57, "Marius Soutier" <mps@gmail.com> wrote:
> In DataFrames (and thus in 1.5 in general) this is not possible, correct?
>
>> On 11.08.2016, at 05:42, Holden Karau <hol...@pigscanfly.ca> wrote:
In DataFrames (and thus in 1.5 in general) this is not possible, correct?
> On 11.08.2016, at 05:42, Holden Karau wrote:
>
> Hi Luis,
>
> You might want to consider upgrading to Spark 2.0 - but in Spark 1.6.2 you
> can do groupBy followed by a reduce on the GroupedDataset (
> http://spark.apa
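A rough sketch of that groupBy-then-reduce pattern on the Spark 1.6 typed Dataset API; the case class, data, and existing SparkContext `sc` are assumptions for illustration:
case class Sale(shop: String, amount: Double)

val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._

val sales = Seq(Sale("a", 1.0), Sale("a", 2.0), Sale("b", 3.0)).toDS()

// groupBy on a typed Dataset yields a GroupedDataset;
// reduce then combines the values within each key.
val totals = sales
  .groupBy(_.shop)
  .reduce((x, y) => Sale(x.shop, x.amount + y.amount))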
That's to be expected - the application UI is not started by the master, but by
the driver. So the UI will run on the machine that submits the job.
> On 26.07.2016, at 15:49, Jestin Ma wrote:
>
> I did netstat -apn | grep 4040 on machine 6, and I see
>
> tcp0 0 :::4040
faster than joining and subtracting then.
> Anyway, thanks for the hint of the transformWith method!
>
> On 27 June 2016 at 14:32, Marius Soutier <mps@gmail.com> wrote:
> `transformWith` accepts another stream, wouldn't that work?
>
>> On 27.0
This might not help, but I once tried Spark's Random Forest on a Kaggle
competition, and its predictions were terrible compared to R. So maybe you
should look for an external library instead of using MLlib's Random
Forest.
—
http://mariussoutier.com/blog
> On 27.06.2016, at 07:47, Ne
Can't you use `transform` instead of `foreachRDD`?
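A rough sketch of the subtraction via transformWith (mentioned earlier in the thread); the stream names and element type are assumptions:
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.dstream.DStream

// Combine both streams batch by batch and subtract the modified RDD
// from the original one.
def onlyInOriginal(original: DStream[String], modified: DStream[String]): DStream[String] =
  original.transformWith(modified, (o: RDD[String], m: RDD[String]) => o.subtract(m))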
> On 15.06.2016, at 15:18, Matthias Niehoff
> wrote:
>
> Hi,
>
> I want to subtract two DStreams (based on the same input stream) to get all
> elements that exist in the original stream, but not in the modified stream
> (the modified Stream i
> On 04.03.2016, at 22:39, Cody Koeninger wrote:
>
> The only other valid use of messageHandler that I can think of is
> catching serialization problems on a per-message basis. But with the
> new Kafka consumer library, that doesn't seem feasible anyway, and
> could be handled with a custom (de
Found an issue for this:
https://issues.apache.org/jira/browse/SPARK-10251
<https://issues.apache.org/jira/browse/SPARK-10251>
> On 09.09.2015, at 18:00, Marius Soutier wrote:
>
> Hi all,
>
> as indicated in the title, I’m using Kryo with a custom Kryo serializer, but
with Tuple2, which I cannot serialize sanely
for all specialized forms. According to the documentation, this should be
handled by Chill. Is this a bug, or am I missing something?
I’m using Spark 1.4.1.
Cheers
- Marius
If you take the time to actually learn Scala starting from its fundamental
concepts and, quite importantly, get familiar with general functional
programming concepts, you'd immediately realize how many things you'd
really miss going back to Java (8).
On Fri, Jul 17, 2015 at 8:14 AM Wojciech Pituła wr
Hi,
This is an ugly solution because it requires pulling out a row:
val rdd: RDD[Row] = ...
ctx.createDataFrame(rdd, rdd.first().schema)
Is there a better alternative to get a DataFrame from an RDD[Row], since
toDF won't work as Row is not a Product?
Thanks,
Marius
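For reference, the explicit-schema route avoids pulling out a row at all; a rough sketch reusing `ctx` and `rdd` from the question above, with made-up column names and types:
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

// Define the schema up front instead of reading it from the first row.
val schema = StructType(Seq(
  StructField("name", StringType, nullable = true),
  StructField("age", IntegerType, nullable = true)
))
val df = ctx.createDataFrame(rdd, schema)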
suspect there are other technical reasons*).
If anyone knows the depths of the problem, it would be of great help.
Best,
Marius
On Fri, Jul 3, 2015 at 6:43 PM Silvio Fiorito
wrote:
> One thing you could do is a broadcast join. You take your smaller RDD,
> save it as a broadcast variable
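A minimal sketch of that broadcast-join idea; the RDDs and their types below are made up:
// Collect the small RDD to the driver, broadcast it, and join map-side.
val small = sc.parallelize(Seq(("a", 1), ("b", 2)))
val large = sc.parallelize(Seq(("a", "x"), ("c", "y")))

val smallMap = sc.broadcast(small.collectAsMap())
val joined = large.flatMap { case (k, v) =>
  smallMap.value.get(k).map(s => (k, (s, v)))
}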
function, all running in the same state without any
other costs.
Best,
Marius
Turned out that it was sufficient to do repartitionAndSortWithinPartitions
... so far so good ;)
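For reference, a minimal sketch of repartitionAndSortWithinPartitions on a pair RDD; the data and partition count are made up:
import org.apache.spark.HashPartitioner

val pairs = sc.parallelize(Seq(("b", 2), ("a", 1), ("a", 3)))
// Repartitions by key and sorts within each partition in a single shuffle,
// instead of a repartition followed by a separate per-partition sort.
val sorted = pairs.repartitionAndSortWithinPartitions(new HashPartitioner(4))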
On Tue, May 5, 2015 at 9:45 AM Marius Danciu
wrote:
> Hi Imran,
>
> Yes that's what MyPartitioner does. I do see (using traces from
> MyPartitioner) that the key is partitioned o
ByKey
seemed a natural fit ( ... I am aware of its limitations).
Thanks,
Marius
On Mon, May 4, 2015 at 10:45 PM Imran Rashid wrote:
> Hi Marius,
>
> I am also a little confused -- are you saying that myPartitions is
> basically something like:
>
> class MyPartitioner extend
nodes. In my case I see two YARN containers receiving records during a
mapPartition operation applied on the sorted partition. I need to test more
but it seems that applying the same partitioner again right before the
last mapPartition can
help.
Best,
Marius
On Tue, Apr 28, 2015 at 4:40 PM Silvio
explain and f fails.
The overall behavior of this job is that sometimes it succeeds and
sometimes it fails ... apparently due to inconsistent propagation of sorted
records to yarn containers.
If any of this makes any sense to you, please let me know what I am missing.
Best,
Marius
Thank you, Iulian! That's precisely what I discovered today.
Best,
Marius
On Wed, Apr 22, 2015 at 3:31 PM Iulian Dragoș
wrote:
> On Tue, Apr 21, 2015 at 2:38 PM, Marius Danciu
> wrote:
>
>> Hello anyone,
>>
>> I have a question regarding the sort shuffle. Rou
Anyone?
On Tue, Apr 21, 2015 at 3:38 PM Marius Danciu
wrote:
> Hello anyone,
>
> I have a question regarding the sort shuffle. Roughly I'm doing something
> like:
>
> rdd.mapPartitionsToPair(f1).groupByKey().mapPartitionsToPair(f2)
>
> The problem is that in
sts since
Spark 1.3.0 is supposed to use the SORT shuffle manager by default, right?
2. Do I need each key to be a scala.math.Ordered? ... is Java Comparable
used at all?
... btw, I'm using Spark from Java ... don't ask me why :)
Best,
Marius
The processing speed displayed in the UI doesn’t seem to take everything into
account. I also had a low processing time but had to increase batch duration
from 30 seconds to 1 minute because waiting batches kept increasing. Now it
runs fine.
> On 17.04.2015, at 13:30, González Salgado, Miquel
Same problem here...
> On 20.04.2015, at 09:59, Zsolt Tóth wrote:
>
> Hi all,
>
> it looks like the 1.2.2 pre-built version for hadoop2.4 is not available on
> the mirror sites. Am I missing something?
>
> Regards,
> Zsolt
That’s true, spill dirs don’t get cleaned up when something goes wrong. We
restart long-running jobs once in a while for cleanups and have
spark.cleaner.ttl set to a lower value than the default.
> On 14.04.2015, at 17:57, Guillaume Pitel wrote:
>
> Right, I remember now, the only p
It cleans the work dir, and SPARK_LOCAL_DIRS should be cleaned automatically.
From the source code comments:
// SPARK_LOCAL_DIRS environment variable, and deleted by the Worker when the
// application finishes.
> On 13.04.2015, at 11:26, Guillaume Pitel wrote:
>
> Does it also cleanup spark lo
I have set SPARK_WORKER_OPTS in spark-env.sh for that. For example:
export SPARK_WORKER_OPTS="-Dspark.worker.cleanup.enabled=true
-Dspark.worker.cleanup.appDataTtl="
> On 11.04.2015, at 00:01, Wang, Ningjun (LNG-NPV)
> wrote:
>
> Does anybody have an answer for this?
>
> Thanks
> Ningjun
>
eiver per batch. I have 5
actor streams (one per node) with 10 total cores assigned.
Driver has 3 GB RAM, each worker 4 GB.
There is certainly no memory pressure: "Memory Used" is around 100 KB, "Input"
is around 10 MB.
Thanks f
> 1. I don't think textFile is capable of unpacking a .gz file. You need to use
> hadoopFile or newAPIHadoopFile for this.
Sorry that’s incorrect, textFile works fine on .gz files. What it can’t do is
compute splits on gz files, so if you have a single file, you'll have a single
partition.
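A common follow-up is to repartition right after reading; a small sketch with a made-up path and partition count:
// A .gz file arrives as a single partition; repartition to spread the work.
val lines = sc.textFile("/data/input/events.json.gz").repartition(16)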
0-temp-docs/streaming-programming-guide.html#dataframe-and-sql-operations
>
> <http://people.apache.org/~tdas/spark-1.3.0-temp-docs/streaming-programming-guide.html#dataframe-and-sql-operations>
>
> TD
>
> On Wed, Mar 11, 2015 at 12:20 PM, Marius Soutier <mailto:mps@
Forgot to mention, it works when using .foreachRDD(_.saveAsTextFile("")).
> On 11.03.2015, at 18:35, Marius Soutier wrote:
>
> Hi,
>
> I’ve written a Spark Streaming job that inserts into a Parquet table, using
> stream.foreachRDD(_.insertInto("table", overwrite = tru
$sp(ForEachDStream.scala:42)
at
org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:40)
at
org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:40)
Cheers
- Mar
your machine.
>
> On Mon, Feb 23, 2015 at 10:55 PM, Marius Soutier <mps@gmail.com> wrote:
> Hi Sameer,
>
> I’m still using Spark 1.1.1, I think the default is hash shuffle. No external
> shuffle service.
>
> We are processing gzipped JSON files, the parti
> yes they can.
>
> On Wed, Feb 25, 2015 at 10:36 AM, Marius Soutier wrote:
>> Hi,
>>
>> just a quick question about calling persist with the _2 option. Is the 2x
>> replication only useful for fault tolerance, or will it also increase job
>> speed by avoi
Hi,
just a quick question about calling persist with the _2 option. Is the 2x
replication only useful for fault tolerance, or will it also increase job speed
by avoiding network transfers? Assuming I’m doing joins or other shuffle
operations.
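For reference, a _2 level is requested roughly like this (a sketch; the RDD is just a throwaway example):
import org.apache.spark.storage.StorageLevel

val data = sc.parallelize(1 to 1000)
// MEMORY_ONLY_2 keeps two replicas of each cached block on different nodes.
data.persist(StorageLevel.MEMORY_ONLY_2)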
Thanks
--
. Everything above that makes it very likely
it will crash, even on smaller datasets (~300 files). But I’m not sure if this
is related to the above issue.
> On 23.02.2015, at 18:15, Sameer Farooqui wrote:
>
> Hi Marius,
>
> Are you using the sort or hash shuffle?
>
> Also, do
have run, subsequent jobs struggle to complete. There are a lot
of failures without any exception message, only the above-mentioned lost
executor. As soon as I clear out /var/run/spark/work/ and the spill disk,
everything goes back to normal.
Thanks for any hint,
- M
):
java.io.FileNotFoundException:
/tmp/spark-local-20150210030009-b4f1/3f/shuffle_4_655_49 (No space left on
device)
Even though there’s plenty of disk space left.
On 10.02.2015, at 00:09, Muttineni, Vinay wrote:
> Hi Marius,
> Did you find a solution to this problem? I get the same error.
> Thanks
, most
recent failure: Lost task 10.3 in stage 2.0 (TID 20, xxx.compute.internal):
ExecutorLostFailure (executor lost)
Driver stacktrace:
Is there any way to understand what’s going on? The logs don’t show anything.
I’m using Spark 1.1.1.
Thanks
- Marius
scala.Option.foreach(Option.scala:236)
On 15.12.2014, at 22:36, Marius Soutier wrote:
> Ok, maybe these test versions will help me then. I’ll check it out.
>
> On 15.12.2014, at 22:33, Michael Armbrust wrote:
>
>> Using a single SparkContext should not cause this problem. In the SQ
t testing.
>
> On Mon, Dec 15, 2014 at 1:27 PM, Marius Soutier wrote:
> Possible, yes, although I’m trying everything I can to prevent it, i.e. fork
> in Test := true and isolated. Can you confirm that reusing a single
> SparkContext for multiple tests poses a problem as well?
15.12.2014, at 20:22, Michael Armbrust wrote:
> Is it possible that you are starting more than one SparkContext in a single
> JVM with out stopping previous ones? I'd try testing with Spark 1.2, which
> will throw an exception in this case.
>
> On Mon, Dec 15, 2014 at 8:4
HiveContext. It does not seem to have anything to do with
the actual files that I also create during the test run with
SQLContext.saveAsParquetFile.
Cheers
- Marius
PS The full trace:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 19 in
stage 6.0 failed 1 times
Hi,
is there an easy way to “migrate” Parquet files or indicate optional values in
SQL statements? I added a couple of new fields that I also use in a
schemaRDD.sql() which obviously fails for input files that don’t have the new
fields.
Thanks
- Marius
You can also insert into existing tables via .insertInto(tableName, overwrite).
You just have to import sqlContext._
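A short sketch of that pattern (Spark 1.1-era SchemaRDD API; the table and file names below are made up):
import sqlContext._

// Load some Parquet data and append it into an existing table.
val newData = sqlContext.parquetFile("/data/new/part.parquet")
newData.insertInto("events", overwrite = false)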
On 19.11.2014, at 09:41, Daniel Haviv wrote:
> Hello,
> I'm writing a process that ingests JSON files and saves them as Parquet files.
> The process is as such:
>
> val sqlContex
The default value is infinite, so you need to enable it. Personally, I’ve set up a
couple of cron jobs to clean up /tmp and /var/run/spark.
On 06.11.2014, at 08:15, Romi Kuntsman wrote:
> Hello,
>
> Spark has an internal cleanup mechanism
> (defined by spark.cleaner.ttl, see
> http://spark.apache.o
I did some simple experiments with Impala and Spark, and Impala came out ahead.
But it’s also less flexible, couldn’t handle irregular schemas, didn't support
JSON, and so on.
On 01.11.2014, at 02:20, Soumya Simanta wrote:
> I agree. My personal experience with Spark core is that it performs r
Just a wild guess, but I had to exclude “javax.servlet.servlet-api” from my
Hadoop dependencies to run a SparkContext.
In your build.sbt:
"org.apache.hadoop" % "hadoop-common" % “..." exclude("javax.servlet",
"servlet-api"),
"org.apache.hadoop" % "hadoop-hdfs" % “..." exclude("javax.servlet",
Are these /vols formatted? You typically need to format and define a mount
point in /mnt for attached EBS volumes.
I’m not using the ec2 script, so I don’t know what is installed, but there’s
usually an HDFS info service running on port 50070. After changing
hdfs-site.xml, you have to restart t
So, apparently `wholeTextFiles` runs the job again, passing null as argument
list, which in turn blows up my argument parsing mechanics. I never thought I’d
have to check for null again in a pure Scala environment ;)
On 26.10.2014, at 11:57, Marius Soutier wrote:
> I tried that already, s
ed file names locally, or save the whole thing out to a file
>
> From: Marius Soutier [mps@gmail.com]
> Sent: Friday, October 24, 2014 6:35 AM
> To: user@spark.apache.org
> Subject: scala.collection.mutable.ArrayOps$ofRef$.length$extension since
ed to process $fileName, reason
${t.getStackTrace.head}")
}
}
Also since 1.1.0, the printlns are no longer visible on the console, only in
the Spark UI worker output.
Thanks for any help
- Marius
any insights,
- Marius
Can’t install that on our cluster, but I can try locally. Is there a pre-built
binary available?
On 22.10.2014, at 19:01, Davies Liu wrote:
> In the master, you can easily profile you job, find the bottlenecks,
> see https://github.com/apache/spark/pull/2556
>
> Could you try it and show the s
Yeah we’re using Python 2.7.3.
On 22.10.2014, at 20:06, Nicholas Chammas wrote:
> On Wed, Oct 22, 2014 at 11:34 AM, Eustache DIEMERT
> wrote:
>
>
>
> Wild guess maybe, but do you decode the json records in Python ? it could be
> much slower as the default lib is quite slow.
>
>
> Oh yea
One core per worker is permanently used by a job that allows SQL queries over
Parquet files.
On 22.10.2014, at 16:18, Arian Pasquali wrote:
> Interesting thread Marius,
> Btw, I'm curious about your cluster size.
> How small is it in terms of RAM and cores?
>
> Arian
>
Didn’t seem to help:
conf = SparkConf().set("spark.shuffle.spill",
"false").set("spark.default.parallelism", "12")
sc = SparkContext(appName='app_name', conf=conf)
but it still takes as much time.
On 22.10.2014, at 14:17, Nicholas Chammas wrote:
> Total guess without knowing anything about you
rote one of my Scala jobs in Python.
> > From the API-side, everything looks more or less identical. However his
> > jobs take between 5-8 hours to complete! We can also see that the execution
> > plan is quite different, I’m seeing writes to the output much later than in
plan is quite
different, I’m seeing writes to the output much later than in Scala.
Is Python I/O really that slow?
Thanks
- Marius
On 02.10.2014, at 13:32, Mark Mandel wrote:
> How do I store a JAR on a cluster? Is that through spark-submit with a deploy
> mode of "cluster"?
Well, just upload it? scp, ftp, and so on. Ideally your build server would put
it there.
> How do I run an already uploaded JAR with spark-submi
I want to use the MLlib NaiveBayes classifier to predict user responses to
an offer.
I am interested in different types of responses (not just accept/reject),
and I also need the actual probabilities for each prediction (as each
label might come with a different benefit/cost not known at training
Thank you, that works!
On 24.09.2014, at 19:01, Michael Armbrust wrote:
> This behavior is inherited from the parquet input format that we use. You
> could list the files manually and pass them as a comma separated list.
>
> On Wed, Sep 24, 2014 at 7:46 AM, Marius Soutier wr
Hello,
sc.textFile and so on support wildcards in their path, but apparently
sqlc.parquetFile() does not. I always receive "File
/file/to/path/*/input.parquet does not exist". Is this normal or a bug? Is
there a workaround?
Thanks
- M
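A rough sketch of the comma-separated-list workaround from the reply above, assuming the usual SparkContext `sc` and a SQLContext `sqlContext`:
import org.apache.hadoop.fs.{FileSystem, Path}

// Expand the glob manually and hand parquetFile a comma-separated list.
val fs = FileSystem.get(sc.hadoopConfiguration)
val paths = fs.globStatus(new Path("/file/to/path/*/input.parquet"))
  .map(_.getPath.toString)
  .mkString(",")
val parquet = sqlContext.parquetFile(paths)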
..unless I'm missing something in
> your setup.
>
> On Tue, Sep 16, 2014 at 4:18 AM, Marius Soutier wrote:
> Writing to Parquet and querying the result via SparkSQL works great (except
> for some strange SQL parser errors). However the problem remains, how do I
> get that d
Writing to Parquet and querying the result via SparkSQL works great (except for
some strange SQL parser errors). However, the problem remains: how do I get that
data back to a dashboard? So I guess I’ll have to use a database after all.
You can batch up data & store into parquet partitions as we
Nice, I’ll check it out. At first glance, writing Parquet files seems to be a
bit complicated.
On 15.09.2014, at 13:54, andy petrella wrote:
> nope.
> It's an efficient storage for genomics data :-D
>
> aℕdy ℙetrella
> about.me/noootsab
>
>
>
> On Mon,
So you are living the dream of using HDFS as a database? ;)
On 15.09.2014, at 13:50, andy petrella wrote:
> I'm using Parquet in ADAM, and I can say that it works pretty fine!
> Enjoy ;-)
>
> aℕdy ℙetrella
> about.me/noootsab
>
>
>
> On Mon, Sep 15, 2014 a
ta & store into parquet partitions as well. & query it
> using another SparkSQL shell; the JDBC driver in SparkSQL is part of 1.1, I believe.
> --
> Regards,
> Mayur Rustagi
> Ph: +1 (760) 203 3257
> http://www.sigmoidanalytics.com
> @mayur_rustagi
>
>
> On Fri, Se
databases?
Thanks
- Marius