Spark 2.3.0 --files vs. addFile()

2018-05-09 Thread Marius
spark.hadoop.fs.s3a.secret.key=$s3Secret \ --conf spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem \ ${execJarPath} I am using Spark v2.3.0 with Scala on a standalone cluster with three workers. Cheers Marius
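For reference, the same S3A settings can also be applied programmatically; a minimal sketch, assuming s3Access and s3Secret are supplied by the environment:

    import org.apache.spark.SparkConf

    // Sketch: route S3A credentials through spark.hadoop.* properties.
    val conf = new SparkConf()
      .set("spark.hadoop.fs.s3a.access.key", s3Access)
      .set("spark.hadoop.fs.s3a.secret.key", s3Secret)
      .set("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")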

Spark Kubernetes Volumes

2018-04-12 Thread Marius
too large to justify copying them around using addFile. If this is not possible, I would like to know whether the community would be interested in such a feature. Cheers Marius - To unsubscribe e-mail: user-unsubscr...@spark.apache.org

[Spark Core]: S3a with Openstack swift object storage not using credentials provided in sparkConf

2017-11-15 Thread Marius
nding the s3 handler is not using the provided credentials. Does anyone have an idea how to fix this? Cheers and thanks in advance, Marius

Re: Is there a reduceByKey functionality in DataFrame API?

2016-08-29 Thread Marius Soutier
reducebykey. > > On 29 Aug 2016 20:57, "Marius Soutier" <mps@gmail.com> wrote: > In DataFrames (and thus in 1.5 in general) this is not possible, correct? > >> On 11.08.2016, at 05:42, Holden Karau <hol...@pigscanfly.ca> wrote:

Re: Is there a reduceByKey functionality in DataFrame API?

2016-08-29 Thread Marius Soutier
In DataFrames (and thus in 1.5 in general) this is not possible, correct? > On 11.08.2016, at 05:42, Holden Karau wrote: > > Hi Luis, > > You might want to consider upgrading to Spark 2.0 - but in Spark 1.6.2 you > can do groupBy followed by a reduce on the GroupedDataset ( > http://spark.apa
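For reference, a reduceByKey-style aggregation on the typed API, sketched against Spark 2.x (data and names are illustrative):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("reduce-sketch").master("local[*]").getOrCreate()
    import spark.implicits._

    // Sum values per key, analogous to RDD reduceByKey.
    val ds = Seq(("a", 1), ("a", 2), ("b", 3)).toDS()
    val reduced = ds.groupByKey(_._1).reduceGroups((x, y) => (x._1, x._2 + y._2))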

Re: Spark Web UI port 4040 not working

2016-07-27 Thread Marius Soutier
That's to be expected - the application UI is not started by the master, but by the driver. So the UI will run on the machine that submits the job. > On 26.07.2016, at 15:49, Jestin Ma wrote: > > I did netstat -apn | grep 4040 on machine 6, and I see > > tcp 0 0 :::4040

Re: Substract two DStreams

2016-06-28 Thread Marius Soutier
faster than joining and subtracting then. > Anyway, thanks for the hint of the transformWith method! > > On 27 June 2016 at 14:32, Marius Soutier <mps@gmail.com> wrote: > `transformWith` accepts another stream, wouldn't that work? > >> On 27.0

Re: How to convert a Random Forest model built in R to a similar model in Spark

2016-06-27 Thread Marius Soutier
This might not help, but I once tried Spark's Random Forest on a Kaggle competition, and its predictions were terrible compared to R. So maybe you should rather look for an external library instead of using MLlib's Random Forest. — http://mariussoutier.com/blog > On 27.06.2016, at 07:47, Ne

Re: Substract two DStreams

2016-06-27 Thread Marius Soutier
Can't you use `transform` instead of `foreachRDD`? > On 15.06.2016, at 15:18, Matthias Niehoff > wrote: > > Hi, > > i want to subtract 2 DStreams (based on the same Input Stream) to get all > elements that exist in the original stream, but not in the modified stream > (the modified Stream i
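A sketch of the transformWith approach discussed in this thread, assuming two string DStreams on the same batch interval (source and modification are illustrative):

    import org.apache.spark.SparkConf
    import org.apache.spark.rdd.RDD
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val ssc = new StreamingContext(
      new SparkConf().setAppName("subtract-sketch").setMaster("local[2]"), Seconds(30))

    val original = ssc.socketTextStream("localhost", 9999)  // illustrative source
    val modified = original.filter(_.nonEmpty)              // illustrative modification

    // Per batch: elements present in the original RDD but not in the modified one.
    val removed = original.transformWith(modified,
      (o: RDD[String], m: RDD[String]) => o.subtract(m))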

Re: Use cases for kafka direct stream messageHandler

2016-03-08 Thread Marius Soutier
> On 04.03.2016, at 22:39, Cody Koeninger wrote: > > The only other valid use of messageHandler that I can think of is > catching serialization problems on a per-message basis. But with the > new Kafka consumer library, that doesn't seem feasible anyway, and > could be handled with a custom (de

Re: spark.kryo.registrationRequired: Tuple2 is not registered

2015-09-10 Thread Marius Soutier
Found an issue for this: https://issues.apache.org/jira/browse/SPARK-10251 > On 09.09.2015, at 18:00, Marius Soutier wrote: > > Hi all, > > as indicated in the title, I’m using Kryo with a custom Kryo serializer, but

spark.kryo.registrationRequired: Tuple2 is not registered

2015-09-09 Thread Marius Soutier
with Tuple2, which I cannot serialize sanely for all specialized forms. According to the documentation, this should be handled by Chill. Is this a bug or what am I missing? I’m using Spark 1.4.1. Cheers - Marius - To unsubscribe
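For context, plain Tuple2 can be registered directly, while its specialized variants only have runtime class names; a hedged sketch (the specialization shown is one of several):

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .set("spark.kryo.registrationRequired", "true")
      .registerKryoClasses(Array[Class[_]](
        classOf[Tuple2[Any, Any]],             // plain scala.Tuple2
        Class.forName("scala.Tuple2$mcII$sp")  // Int-Int specialization, illustrative
      ))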

Re: Java 8 vs Scala

2015-07-16 Thread Marius Danciu
If you take the time to actually learn Scala starting from its fundamental concepts AND, quite importantly, get familiar with general functional programming concepts, you'd immediately realize the things you'd really miss going back to Java (8). On Fri, Jul 17, 2015 at 8:14 AM Wojciech Pituła wr

DataFrame from RDD[Row]

2015-07-16 Thread Marius Danciu
Hi, This is an ugly solution because it requires pulling out a row: val rdd: RDD[Row] = ... ctx.createDataFrame(rdd, rdd.first().schema) Is there a better alternative to get a DataFrame from an RDD[Row], since toDF won't work as Row is not a Product? Thanks, Marius
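The usual alternative is to supply the schema explicitly rather than pulling it off the first row; a sketch with an illustrative schema and data:

    import org.apache.spark.sql.Row
    import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

    val rdd = sc.parallelize(Seq(Row("ann", 30), Row("bob", 25)))  // illustrative rows

    // Known up front, so no row has to be pulled out of the RDD.
    val schema = StructType(Seq(
      StructField("name", StringType, nullable = true),
      StructField("age", IntegerType, nullable = true)))

    val df = ctx.createDataFrame(rdd, schema)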

Re: Optimizations

2015-07-03 Thread Marius Danciu
suspect there are other technical reasons*). If anyone knows the depths of the problem, it would be of great help. Best, Marius On Fri, Jul 3, 2015 at 6:43 PM Silvio Fiorito wrote: > One thing you could do is a broadcast join. You take your smaller RDD, > save it as a broadcast variable

Optimizations

2015-07-03 Thread Marius Danciu
function, all running in the same stage without any other costs. Best, Marius

Re: Spark partitioning question

2015-05-05 Thread Marius Danciu
Turned out that it was sufficient to do repartitionAndSortWithinPartitions ... so far so good ;) On Tue, May 5, 2015 at 9:45 AM Marius Danciu wrote: > Hi Imran, > > Yes that's what MyPartitioner does. I do see (using traces from > MyPartitioner) that the key is partitioned o
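A minimal sketch of that call, assuming a pair RDD with an ordered key type (partitioner and data are illustrative):

    import org.apache.spark.HashPartitioner

    val pairs = sc.parallelize(Seq((3, "c"), (1, "a"), (2, "b")))

    // One shuffle: partitions by key, then sorts each partition by key.
    val result = pairs.repartitionAndSortWithinPartitions(new HashPartitioner(2))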

Re: Spark partitioning question

2015-05-04 Thread Marius Danciu
ByKey seemed a natural fit ( ... I am aware of its limitations). Thanks, Marius On Mon, May 4, 2015 at 10:45 PM Imran Rashid wrote: > Hi Marius, > > I am also a little confused -- are you saying that myPartitions is > basically something like: > > class MyPartitioner extend

Re: Spark partitioning question

2015-04-28 Thread Marius Danciu
nodes. In my case I see 2 yarn containers receiving records during a mapPartition operation applied on the sorted partition. I need to test more but it seems that applying the same partitioner again right before the last mapPartition can help. Best, Marius On Tue, Apr 28, 2015 at 4:40 PM Silvio

Spark partitioning question

2015-04-28 Thread Marius Danciu
explain and f fails. The overall behavior of this job is that sometimes it succeeds and sometimes it fails ... apparently due to inconsistent propagation of sorted records to yarn containers. If any of this makes any sense to you, please let me know what I am missing. Best, Marius

Re: Shuffle question

2015-04-22 Thread Marius Danciu
Thank you Iulian ! That's precisely what I discovered today. Best, Marius On Wed, Apr 22, 2015 at 3:31 PM Iulian Dragoș wrote: > On Tue, Apr 21, 2015 at 2:38 PM, Marius Danciu > wrote: > >> Hello anyone, >> >> I have a question regarding the sort shuffle. Rou

Re: Shuffle question

2015-04-22 Thread Marius Danciu
Anyone? On Tue, Apr 21, 2015 at 3:38 PM Marius Danciu wrote: > Hello anyone, > > I have a question regarding the sort shuffle. Roughly I'm doing something > like: > > rdd.mapPartitionsToPair(f1).groupByKey().mapPartitionsToPair(f2) > > The problem is that in

Shuffle question

2015-04-21 Thread Marius Danciu
sts since Spark 1.3.0 is supposed to use the SORT shuffle manager by default, right? 2. Do I need each key to be a scala.math.Ordered? ... is Java Comparable used at all? ... btw I'm using Spark from Java ... don't ask me why :) Best, Marius

Re: Streaming problems running 24x7

2015-04-20 Thread Marius Soutier
The processing speed displayed in the UI doesn’t seem to take everything into account. I also had a low processing time but had to increase batch duration from 30 seconds to 1 minute because waiting batches kept increasing. Now it runs fine. > On 17.04.2015, at 13:30, González Salgado, Miquel
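For illustration, the batch duration is fixed when the StreamingContext is created; a sketch assuming an existing SparkConf named conf:

    import org.apache.spark.streaming.{Minutes, StreamingContext}

    // 1-minute batches instead of 30 seconds, as described above.
    val ssc = new StreamingContext(conf, Minutes(1))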

Re: Spark-1.2.2-bin-hadoop2.4.tgz missing

2015-04-20 Thread Marius Soutier
Same problem here... > On 20.04.2015, at 09:59, Zsolt Tóth wrote: > > Hi all, > > it looks like the 1.2.2 pre-built version for hadoop2.4 is not available on > the mirror sites. Am I missing something? > > Regards, > Zsolt

Re: Is the disk space in SPARK_LOCAL_DIRS cleanned up?

2015-04-14 Thread Marius Soutier
That’s true, spill dirs don’t get cleaned up when something goes wrong. We are restarting long-running jobs once in a while for cleanups and have spark.cleaner.ttl set to a lower value than the default. > On 14.04.2015, at 17:57, Guillaume Pitel wrote: > > Right, I remember now, the only p

Re: Is the disk space in SPARK_LOCAL_DIRS cleanned up?

2015-04-14 Thread Marius Soutier
It cleans the work dir, and SPARK_LOCAL_DIRS should be cleaned automatically. From the source code comments: // SPARK_LOCAL_DIRS environment variable, and deleted by the Worker when the // application finishes. > On 13.04.2015, at 11:26, Guillaume Pitel wrote: > > Does it also cleanup spark lo

Re: Is the disk space in SPARK_LOCAL_DIRS cleanned up?

2015-04-13 Thread Marius Soutier
I have set SPARK_WORKER_OPTS in spark-env.sh for that. For example: export SPARK_WORKER_OPTS="-Dspark.worker.cleanup.enabled=true -Dspark.worker.cleanup.appDataTtl=" > On 11.04.2015, at 00:01, Wang, Ningjun (LNG-NPV) > wrote: > > Does anybody have an answer for this? > > Thanks > Ningjun >

actorStream woes

2015-03-30 Thread Marius Soutier
eiver per batch. I have 5 actor streams (one per node) with 10 total cores assigned. Driver has 3 GB RAM, each worker 4 GB. There is certainly no memory pressure, "Memory Used" is around 100kb, "Input" is around 10 MB. Thanks f

Re: Processing of text file in large gzip archive

2015-03-16 Thread Marius Soutier
> 1. I don't think textFile is capable of unpacking a .gz file. You need to use > hadoopFile or newAPIHadoopFile for this. Sorry that’s incorrect, textFile works fine on .gz files. What it can’t do is compute splits on gz files, so if you have a single file, you'll have a single partition. P
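Sketch of the consequence for a single .gz file, with an illustrative path and partition count:

    // A gzipped file is not splittable, so it arrives as a single partition.
    val lines = sc.textFile("hdfs:///data/archive.gz")

    // Repartition after reading to parallelize downstream work.
    val parallel = lines.repartition(64)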

Re: Spark Streaming recover from Checkpoint with Spark SQL

2015-03-12 Thread Marius Soutier
0-temp-docs/streaming-programming-guide.html#dataframe-and-sql-operations > > TD > > On Wed, Mar 11, 2015 at 12:20 PM, Marius Soutier <mps@

Re: Spark Streaming recover from Checkpoint with Spark SQL

2015-03-11 Thread Marius Soutier
Forgot to mention, it works when using .foreachRDD(_.saveAsTextFile("")). > On 11.03.2015, at 18:35, Marius Soutier wrote: > > Hi, > > I’ve written a Spark Streaming Job that inserts into a Parquet table, using > stream.foreachRDD(_.insertInto("table", overwrite = tru
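The working variant, sketched with an illustrative output path:

    stream.foreachRDD { rdd =>
      // One output directory per batch; the naming scheme is illustrative.
      rdd.saveAsTextFile(s"hdfs:///out/batch-${System.currentTimeMillis}")
    }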

Spark Streaming recover from Checkpoint with Spark SQL

2015-03-11 Thread Marius Soutier
$sp(ForEachDStream.scala:42) at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:40) at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:40) Cheers - Mar

Re: Executor lost with too many temp files

2015-02-26 Thread Marius Soutier
your machine. > > On Mon, Feb 23, 2015 at 10:55 PM, Marius Soutier <mps@gmail.com> wrote: > Hi Sameer, > > I’m still using Spark 1.1.1, I think the default is hash shuffle. No external > shuffle service. > > We are processing gzipped JSON files, the parti

Re: Effects of persist(XYZ_2)

2015-02-25 Thread Marius Soutier
> yes they can. > > On Wed, Feb 25, 2015 at 10:36 AM, Marius Soutier wrote: >> Hi, >> >> just a quick question about calling persist with the _2 option. Is the 2x >> replication only useful for fault tolerance, or will it also increase job >> speed by avoi

Effects of persist(XYZ_2)

2015-02-25 Thread Marius Soutier
Hi, just a quick question about calling persist with the _2 option. Is the 2x replication only useful for fault tolerance, or will it also increase job speed by avoiding network transfers? Assuming I’m doing joins or other shuffle operations. Thanks --
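For reference, the storage levels in question; a one-line sketch on an existing RDD:

    import org.apache.spark.storage.StorageLevel

    // _2 levels keep each block on two nodes, primarily for fault tolerance.
    rdd.persist(StorageLevel.MEMORY_AND_DISK_SER_2)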

Re: Executor lost with too many temp files

2015-02-23 Thread Marius Soutier
. Everything above that makes it very likely it will crash, even on smaller datasets (~300 files). But I’m not sure if this is related to the above issue. > On 23.02.2015, at 18:15, Sameer Farooqui wrote: > > Hi Marius, > > Are you using the sort or hash shuffle? > > Also, do

Executor lost with too many temp files

2015-02-23 Thread Marius Soutier
have run, following jobs will struggle with completion. There are a lot of failures without any exception message, only the above mentioned lost executor. As soon as I clear out /var/run/spark/work/ and the spill disk, everything goes back to normal. Thanks for any hint, - M

Re: Executor Lost with StorageLevel.MEMORY_AND_DISK_SER

2015-02-10 Thread Marius Soutier
): java.io.FileNotFoundException: /tmp/spark-local-20150210030009-b4f1/3f/shuffle_4_655_49 (No space left on device) Even though there’s plenty of disk space left. On 10.02.2015, at 00:09, Muttineni, Vinay wrote: > Hi Marius, > Did you find a solution to this problem? I get the same error. > Thanks

Executor Lost with StorageLevel.MEMORY_AND_DISK_SER

2015-02-09 Thread Marius Soutier
, most recent failure: Lost task 10.3 in stage 2.0 (TID 20, xxx.compute.internal): ExecutorLostFailure (executor lost) Driver stacktrace: Is there any way to understand what’s going on? The logs don’t show anything. I’m using Spark 1.1.1. Thanks - Marius

Re: Intermittent test failures

2014-12-17 Thread Marius Soutier
scala.Option.foreach(Option.scala:236) On 15.12.2014, at 22:36, Marius Soutier wrote: > Ok, maybe these test versions will help me then. I’ll check it out. > > On 15.12.2014, at 22:33, Michael Armbrust wrote: > >> Using a single SparkContext should not cause this problem. In the SQ

Re: Intermittent test failures

2014-12-15 Thread Marius Soutier
t testing. > > On Mon, Dec 15, 2014 at 1:27 PM, Marius Soutier wrote: > Possible, yes, although I’m trying everything I can to prevent it, i.e. fork > in Test := true and isolated. Can you confirm that reusing a single > SparkContext for multiple tests poses a problem as well?

Re: Intermittent test failures

2014-12-15 Thread Marius Soutier
15.12.2014, at 20:22, Michael Armbrust wrote: > Is it possible that you are starting more than one SparkContext in a single > JVM with out stopping previous ones? I'd try testing with Spark 1.2, which > will throw an exception in this case. > > On Mon, Dec 15, 2014 at 8:4

Intermittent test failures

2014-12-15 Thread Marius Soutier
HiveContext. It does not seem to have anything to do with the actual files that I also create during the test run with SQLContext.saveAsParquetFile. Cheers - Marius PS The full trace: org.apache.spark.SparkException: Job aborted due to stage failure: Task 19 in stage 6.0 failed 1 times

Migrating Parquet inputs

2014-12-15 Thread Marius Soutier
Hi, is there an easy way to “migrate” parquet files or indicate optional values in sql statements? I added a couple of new fields that I also use in a schemaRDD.sql() which obviously fails for input files that don’t have the new fields. Thanks - Marius

Re: Merging Parquet Files

2014-11-19 Thread Marius Soutier
You can also insert into existing tables via .insertInto(tableName, overwrite). You just have to import sqlContext._ On 19.11.2014, at 09:41, Daniel Haviv wrote: > Hello, > I'm writing a process that ingests json files and saves them as parquet files. > The process is as such: > > val sqlContex
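A sketch of that call on a Spark 1.x SchemaRDD, with an illustrative table name:

    import sqlContext._  // as noted above; brings the implicit conversions into scope

    // parquetData: an existing SchemaRDD; appends into a registered table,
    // or replaces its contents when overwrite = true.
    parquetData.insertInto("events", overwrite = true)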

Re: Snappy temp files not cleaned up

2014-11-06 Thread Marius Soutier
Default value is infinite, so you need to enable it. Personally I’ve set up a couple of cron jobs to clean up /tmp and /var/run/spark. On 06.11.2014, at 08:15, Romi Kuntsman wrote: > Hello, > > Spark has an internal cleanup mechanism > (defined by spark.cleaner.ttl, see > http://spark.apache.o

Re: SparkSQL performance

2014-11-03 Thread Marius Soutier
I did some simple experiments with Impala and Spark, and Impala came out ahead. But it’s also less flexible, couldn’t handle irregular schemas, didn't support Json, and so on. On 01.11.2014, at 02:20, Soumya Simanta wrote: > I agree. My personal experience with Spark core is that it performs r

Re: Submiting Spark application through code

2014-11-02 Thread Marius Soutier
Just a wild guess, but I had to exclude "javax.servlet.servlet-api" from my Hadoop dependencies to run a SparkContext. In your build.sbt: "org.apache.hadoop" % "hadoop-common" % "..." exclude("javax.servlet", "servlet-api"), "org.apache.hadoop" % "hadoop-hdfs" % "..." exclude("javax.servlet",

Re: use additional ebs volumes for hsdf storage with spark-ec2

2014-11-01 Thread Marius Soutier
Are these /vols formatted? You typically need to format and define a mount point in /mnt for attached EBS volumes. I’m not using the ec2 script, so I don’t know what is installed, but there’s usually an HDFS info service running on port 50070. After changing hdfs-site.xml, you have to restart t

Re: scala.collection.mutable.ArrayOps$ofRef$.length$extension since Spark 1.1.0

2014-10-27 Thread Marius Soutier
So, apparently `wholeTextFiles` runs the job again, passing null as argument list, which in turn blows up my argument parsing mechanics. I never thought I had to check for null again in a pure Scala environment ;) On 26.10.2014, at 11:57, Marius Soutier wrote: > I tried that already, s

Re: scala.collection.mutable.ArrayOps$ofRef$.length$extension since Spark 1.1.0

2014-10-26 Thread Marius Soutier
ed file names locally, or save the whole thing out to a file > > From: Marius Soutier [mps@gmail.com] > Sent: Friday, October 24, 2014 6:35 AM > To: user@spark.apache.org > Subject: scala.collection.mutable.ArrayOps$ofRef$.length$extension since

scala.collection.mutable.ArrayOps$ofRef$.length$extension since Spark 1.1.0

2014-10-24 Thread Marius Soutier
ed to process $fileName, reason ${t.getStackTrace.head}") } } Also since 1.1.0, the printlns are no longer visible on the console, only in the Spark UI worker output. Thanks for any help - Marius - T

SparkSQL and columnar data

2014-10-23 Thread Marius Soutier
any insights, - Marius - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org

Re: Python vs Scala performance

2014-10-22 Thread Marius Soutier
Can’t install that on our cluster, but I can try locally. Is there a pre-built binary available? On 22.10.2014, at 19:01, Davies Liu wrote: > In the master, you can easily profile you job, find the bottlenecks, > see https://github.com/apache/spark/pull/2556 > > Could you try it and show the s

Re: Python vs Scala performance

2014-10-22 Thread Marius Soutier
Yeah we’re using Python 2.7.3. On 22.10.2014, at 20:06, Nicholas Chammas wrote: > On Wed, Oct 22, 2014 at 11:34 AM, Eustache DIEMERT > wrote: > > > > Wild guess maybe, but do you decode the json records in Python ? it could be > much slower as the default lib is quite slow. > > > Oh yea

Re: Python vs Scala performance

2014-10-22 Thread Marius Soutier
. One core per worker is permanently used by a job that allows SQL queries over Parquet files. On 22.10.2014, at 16:18, Arian Pasquali wrote: > Interesting thread Marius, > Btw, I'm curious about your cluster size. > How small it is in terms of ram and cores. > > Arian >

Re: Python vs Scala performance

2014-10-22 Thread Marius Soutier
Didn’t seem to help: conf = SparkConf().set("spark.shuffle.spill", "false").set("spark.default.parallelism", "12") sc = SparkContext(appName='app_name', conf = conf) but still taking as much time On 22.10.2014, at 14:17, Nicholas Chammas wrote: > Total guess without knowing anything about you

Re: Python vs Scala performance

2014-10-22 Thread Marius Soutier
rote one of my Scala jobs in Python. > > From the API-side, everything looks more or less identical. However his > > jobs take between 5-8 hours to complete! We can also see that the execution > > plan is quite different, I’m seeing writes to the output much later than in >

Python vs Scala performance

2014-10-22 Thread Marius Soutier
plan is quite different, I’m seeing writes to the output much later than in Scala. Is Python I/O really that slow? Thanks - Marius - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail

Re: Confusion over how to deploy/run JAR files to a Spark Cluster

2014-10-02 Thread Marius Soutier
On 02.10.2014, at 13:32, Mark Mandel wrote: > How do I store a JAR on a cluster? Is that through spark-submit with a deploy > mode of "cluster"? Well, just upload it? scp, ftp, and so on. Ideally your build server would put it there. > How do I run an already uploaded JAR with spark-submi

Fwd: Actual Probabilities when Using Naive Bayes classifier

2014-09-30 Thread Marius FETEANU
I want to use the mllib NaiveBayes classifier to predict user responses to an offer. I am interested in different types of responses (not just accept/reject) and also I need the actual probabilities for each prediction (as each label might come with a different benefit/cost not known at training

Re: parquetFile and wilcards

2014-09-24 Thread Marius Soutier
Thank you, that works! On 24.09.2014, at 19:01, Michael Armbrust wrote: > This behavior is inherited from the parquet input format that we use. You > could list the files manually and pass them as a comma separated list. > > On Wed, Sep 24, 2014 at 7:46 AM, Marius Soutier wr
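A sketch of that workaround: expand the glob yourself and pass the matches as a comma-separated list (the pattern mirrors the question below):

    import org.apache.hadoop.fs.{FileSystem, Path}

    val fs = FileSystem.get(sc.hadoopConfiguration)
    val paths = fs.globStatus(new Path("/file/to/path/*/input.parquet"))
      .map(_.getPath.toString)

    val data = sqlContext.parquetFile(paths.mkString(","))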

parquetFile and wilcards

2014-09-24 Thread Marius Soutier
Hello, sc.textFile and so on support wildcards in their path, but apparently sqlc.parquetFile() does not. I always receive “File /file/to/path/*/input.parquet does not exist”. Is this normal or a bug? Is there a workaround? Thanks - M

Re: Serving data

2014-09-17 Thread Marius Soutier
..unless I'm missing something in > your setup. > > On Tue, Sep 16, 2014 at 4:18 AM, Marius Soutier wrote: > Writing to Parquet and querying the result via SparkSQL works great (except > for some strange SQL parser errors). However the problem remains, how do I > get that d

Re: Serving data

2014-09-16 Thread Marius Soutier
Writing to Parquet and querying the result via SparkSQL works great (except for some strange SQL parser errors). However the problem remains, how do I get that data back to a dashboard. So I guess I’ll have to use a database after all. You can batch up data & store into parquet partitions as we

Re: Serving data

2014-09-15 Thread Marius Soutier
Nice, I’ll check it out. At first glance, writing Parquet files seems to be a bit complicated. On 15.09.2014, at 13:54, andy petrella wrote: > nope. > It's an efficient storage for genomics data :-D > > aℕdy ℙetrella > about.me/noootsab > > > > On Mon,

Re: Serving data

2014-09-15 Thread Marius Soutier
So you are living the dream of using HDFS as a database? ;) On 15.09.2014, at 13:50, andy petrella wrote: > I'm using Parquet in ADAM, and I can say that it works pretty fine! > Enjoy ;-) > > aℕdy ℙetrella > about.me/noootsab > > > > On Mon, Sep 15, 2014 a

Re: Serving data

2014-09-15 Thread Marius Soutier
ta & store into parquet partitions as well. & query it > using another SparkSQL shell, JDBC driver in SparkSQL is part 1.1 i believe. > -- > Regards, > Mayur Rustagi > Ph: +1 (760) 203 3257 > http://www.sigmoidanalytics.com > @mayur_rustagi > > > On Fri, Se

Serving data

2014-09-12 Thread Marius Soutier
databases? Thanks - Marius - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org