Re: spark sql hive-shims

2015-05-13 Thread Lior Chaga
Ultimately it was PermGen out of memory. I somehow missed it in the log On Thu, May 14, 2015 at 9:24 AM, Lior Chaga wrote: > After profiling with YourKit, I see there's an OutOfMemoryException in > context SQLContext.applySchema. Again, it's a very small RDD. Each executor > has 180GB RAM. > > O

Re: kafka + Spark Streaming with checkPointing fails to restart

2015-05-13 Thread NB
The data pipeline (DAG) should not be added to the StreamingContext in the case of a recovery scenario. The pipeline metadata is recovered from the checkpoint folder. That is one thing you will need to fix in your code. Also, I don't think the ssc.checkpoint(folder) call should be made in case of t

Re: spark sql hive-shims

2015-05-13 Thread Lior Chaga
After profiling with YourKit, I see there's an OutOfMemoryException in context SQLContext.applySchema. Again, it's a very small RDD. Each executor has 180GB RAM. On Thu, May 14, 2015 at 8:53 AM, Lior Chaga wrote: > Hi, > > Using spark sql with HiveContext. Spark version is 1.3.1 > When running l

spark sql hive-shims

2015-05-13 Thread Lior Chaga
Hi, Using spark sql with HiveContext. Spark version is 1.3.1 When running local spark everything works fine. When running on spark cluster I get ClassNotFoundError org.apache.hadoop.hive.shims.Hadoop23Shims. This class belongs to hive-shims-0.23, and is a runtime dependency for spark-hive: [INFO]

How to run multiple jobs in one sparkcontext from separate threads in pyspark?

2015-05-13 Thread MEETHU MATHEW
Hi all, quoting the docs: "Inside a given Spark application (SparkContext instance), multiple parallel jobs can run simultaneously if they were submitted from separate threads." How do I run multiple jobs in one SparkContext using separate threads in PySpark? I found some examples in Scala and Java, but co
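For reference, the pattern the quoted documentation describes looks roughly like the Scala sketch below (the thread asks for a PySpark equivalent; app name, data, and thread count here are placeholders):

    import org.apache.spark.{SparkConf, SparkContext}

    object ParallelJobs {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("parallel-jobs"))

        // Each thread submits its own action; the scheduler can run the resulting
        // jobs concurrently (subject to available cores and the scheduling mode).
        val threads = (1 to 3).map { i =>
          new Thread(new Runnable {
            def run(): Unit = {
              val n = sc.parallelize(1 to 1000000).map(_ * i).count()
              println(s"job $i counted $n elements")
            }
          })
        }
        threads.foreach(_.start())
        threads.foreach(_.join())
        sc.stop()
      }
    }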

Re: JavaPairRDD

2015-05-13 Thread Yasemin Kaya
Thank you Tristan. It is totally what I am looking for :) 2015-05-14 5:05 GMT+03:00 Tristan Blakers : > You could use a map() operation, but the easiest way is probably to just > call values() method on the JavaPairRDD to get a JavaRDD. > > See this link: > > https://www.safaribooksonline.com/li

Spark performance in cluster mode using yarn

2015-05-13 Thread sachin Singh
Hi friends, can someone please give me an idea of what the ideal time (complete job execution) for a Spark job should be? I have data in a Hive table; the amount of data is about 1 GB, around 200,000 (2 lakh) rows for the whole month. I want to do monthly aggregation using SQL queries and groupBy. I have only one node, 1 cluster, below

Spark recovery takes long

2015-05-13 Thread NB
Hello Spark gurus, We have a spark streaming application that is consuming from a Flume stream and has some window operations. The input batch sizes are 1 minute and intermediate Window operations have window sizes of 1 minute, 1 hour and 6 hours. I enabled checkpointing and Write ahead log so tha

Re: --jars works in "yarn-client" but not "yarn-cluster" mode, why?

2015-05-13 Thread Fengyun RAO
I look into the "Environment" in both modes. yarn-client: spark.jars local:/opt/cloudera/parcels/CDH/lib/hbase/lib/htrace-core-3.1.0-incubating.jar,file:/home/xxx/my-app.jar yarn-cluster: spark.yarn.secondary.jars local:/opt/cloudera/parcels/CDH/lib/hbase/lib/htrace-core-3.1.0-incubating.jar I won

--jars works in "yarn-client" but not "yarn-cluster" mode, why?

2015-05-13 Thread Fengyun RAO
Hadoop version: CDH 5.4. We need to connect to HBase, thus need extra "/opt/cloudera/parcels/CDH/lib/hbase/lib/htrace-core-3.1.0-incubating.jar" dependency. It works in yarn-client mode: "spark-submit --class xxx.xxx.MyApp --master yarn-client --num-executors 10 --executor-memory 10g --jars /opt/

spark-streaming whit flume error

2015-05-13 Thread ??
Hi all, I want to use Spark Streaming with Flume, but now I am in trouble: I don't know how to configure Flume. I configure Flume like this: a1.sources = r1 a1.channels = c1 c2 a1.sources.r1.type = avro a1.sources.r1.channels = c1 c2 a1.sources.r1.bind = 0.0.0.0 a1.sources.r1.port =

Re: Spark SQL: preferred syntax for column reference?

2015-05-13 Thread Dean Wampler
Is the $"foo" or mydf("foo") or both checked at compile time to verify that the column reference is valid? Thx. Dean On Wednesday, May 13, 2015, Michael Armbrust wrote: > I would not say that either method is preferred (neither is > old/deprecated). One advantage to the second is that you are

Re: JavaPairRDD

2015-05-13 Thread Tristan Blakers
You could use a map() operation, but the easiest way is probably to just call values() method on the JavaPairRDD to get a JavaRDD. See this link: https://www.safaribooksonline.com/library/view/learning-spark/9781449359034/ch04.html Tristan On 13 May 2015 at 23:12, Yasemin Kaya wrote: > Hi,
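For comparison, the same idea in the Scala API is the values method on a pair RDD; a minimal sketch (sc is an existing SparkContext, the second row of data is made up):

    val pairs = sc.parallelize(Seq(
      ("http://www.koctas.com.tr/reyon/el-aletleri/7", "0,1,0,0,0,0,0,0,46551"),
      ("http://example.com/other",                     "1,0,0,0,0,0,0,0,12345")))

    // values drops the keys and keeps only the tuple part, no map() needed.
    val onlyValues = pairs.values
    onlyValues.collect().foreach(println)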

Re: how to use rdd.countApprox

2015-05-13 Thread Tathagata Das
You might get stage information through SparkListener. But I am not sure whether you can use that information to easily kill stages. Though i highly recommend using Spark 1.3.1 (or even Spark master). Things move really fast between releases. 1.1.1 feels really old to me :P TD On Wed, May 13, 201

Re: Trouble trying to run ./spark-ec2 script

2015-05-13 Thread Su She
Hmm, just tried to run it again, but opened the script with python, the cmd line seemed to pop up really quick and exited. On Wed, May 13, 2015 at 2:06 PM, Su She wrote: > Hi Ted, Yes I do have Python 3.5 installed. I just ran "py" from the > ec2 directory and it started up the python shell. > >

Re: Trouble trying to run ./spark-ec2 script

2015-05-13 Thread Ted Yu
Is python installed on the machine where you ran ./spark-ec2 ? Cheers On Wed, May 13, 2015 at 1:33 PM, Su She wrote: > I'm trying to set up my own cluster and am having trouble running this > script: > > ./spark-ec2 --key-pair=xx --identity-file=xx.pem --region=us-west-2 > --zone=us-west-2c --n

Problem with current spark

2015-05-13 Thread Giovanni Paolo Gibilisco
Hi, I'm trying to run an application that uses a Hive context to perform some queries over JSON files. The code of the application is here: https://github.com/GiovanniPaoloGibilisco/spark-log-processor/tree/fca93d95a227172baca58d51a4d799594a0429a1 I can run it on Spark 1.3.1 after rebuilding it wi

Trouble trying to run ./spark-ec2 script

2015-05-13 Thread Su She
I'm trying to set up my own cluster and am having trouble running this script: ./spark-ec2 --key-pair=xx --identity-file=xx.pem --region=us-west-2 --zone=us-west-2c --num-slaves=1 launch my-spark-cluster based off: https://spark.apache.org/docs/latest/ec2-scripts.html It just tries to open the s

Re: how to use rdd.countApprox

2015-05-13 Thread Du Li
I do rdd.countApprox() and rdd.sparkContext.setJobGroup() inside dstream.foreachRDD{...}. After calling cancelJobGroup(), the spark context seems no longer valid, which crashes subsequent jobs. My spark version is 1.1.1. I will do more investigation into this issue, perhaps after upgrading to 1.

Re: how to use rdd.countApprox

2015-05-13 Thread Tathagata Das
That is not supposed to happen :/ That is probably a bug. If you have the log4j logs, would be good to file a JIRA. This may be worth debugging. On Wed, May 13, 2015 at 12:13 PM, Du Li wrote: > Actually I tried that before asking. However, it killed the spark context. > :-) > > Du > > > > On W

PostgreSQL JDBC Classpath Issue

2015-05-13 Thread George Adams
Hey all, I seem to be having an issue with PostgreSQL JDBC jar on my classpath. I’ve outlined the issue on Stack Overflow (http://stackoverflow.com/questions/30221677/spark-sql-postgresql-data-source-issues). I’m not sure how to fix this since I built the uber jar using sbt-assembly and the fin

Re: how to use rdd.countApprox

2015-05-13 Thread Du Li
Actually I tried that before asking. However, it killed the spark context. :-) Du On Wednesday, May 13, 2015 12:02 PM, Tathagata Das wrote: That is a good question. I dont see a direct way to do that.  You could do try the following  val jobGroupId = rdd.sparkContext.setJobGroup(jo

Re: how to use rdd.countApprox

2015-05-13 Thread Tathagata Das
That is a good question. I don't see a direct way to do that. You could try the following val jobGroupId = rdd.sparkContext.setJobGroup(jobGroupId) val approxCount = rdd.countApprox().getInitialValue // job launched with the set job group rdd.sparkContext.cancelJobGroup(jobGroupId)
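Filled out, TD's suggestion might look roughly like this sketch (the timeout, job-group id, and description strings are arbitrary):

    import org.apache.spark.rdd.RDD

    // Run countApprox under a job group so the still-running job can be
    // cancelled once the timed-out estimate has been taken.
    def timedApproxCount(rdd: RDD[_], timeoutMs: Long = 5000): Double = {
      val jobGroupId = s"countApprox-${System.currentTimeMillis()}"
      rdd.sparkContext.setJobGroup(jobGroupId, "approximate count with timeout")

      // countApprox returns a PartialResult; initialValue blocks until the
      // timeout (or completion) and then returns the current estimate.
      val estimate = rdd.countApprox(timeoutMs).initialValue

      // Cancel whatever is still running in this job group so it stops
      // consuming resources after the estimate has been taken.
      rdd.sparkContext.cancelJobGroup(jobGroupId)
      estimate.mean
    }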

Re: how to use rdd.countApprox

2015-05-13 Thread Du Li
Hi TD, Do you know how to cancel the rdd.countApprox(5000) tasks after the timeout? Otherwise it keeps running until completion, producing results not used but consuming resources. Thanks,Du On Wednesday, May 13, 2015 10:33 AM, Du Li wrote: Hi TD, Thanks a lot. rdd.countApprox(50

Word2Vec with billion-word corpora

2015-05-13 Thread Shilad Sen
Hi all, I'm experimenting with Spark's Word2Vec implementation for a relatively large (5B word, vocabulary size 4M, 400-dimensional vectors) corpora. Has anybody had success running it at this scale? Thanks in advance for your guidance! -Shilad -- Shilad W. Sen Associate Professor Mathematics,

Re: Spark on Mesos

2015-05-13 Thread Stephen Carman
Yup, exactly as Tim mentioned too. I went back and tried what you just suggested and that was also perfectly fine. Steve On May 13, 2015, at 1:58 PM, Tim Chen wrote: Hi Stephen, You probably didn't run the Spark driver/shell as root, as Mesos scheduler will

Re: Spark on Mesos

2015-05-13 Thread Tim Chen
Hi Stephen, You probably didn't run the Spark driver/shell as root, as Mesos scheduler will pick up your local user and tries to impersonate as the same user and chown the directory before executing any task. If you try to run Spark driver as root it should resolve the problem. No switch user can

Re: Worker Spark Port

2015-05-13 Thread James King
Indeed, many thanks. On Wednesday, 13 May 2015, Cody Koeninger wrote: > I believe most ports are configurable at this point, look at > > http://spark.apache.org/docs/latest/configuration.html > > search for ".port" > > On Wed, May 13, 2015 at 9:38 AM, James King > wrote: > >> I understated that

Re: Spark on Mesos

2015-05-13 Thread Stephen Carman
Sander, I eventually solved this problem via the --[no-]switch_user flag, which is set to true by default. I set this to false, which would have the user that owns the process run the job, otherwise it was my username (scarman) running the job, which would fail because obviously my username didn

data schema and serialization format suggestions

2015-05-13 Thread Ankur Chauhan
Hi all, I want to get a general idea about best practices around data format serialization. Currently, I am using Avro as the data serialization format but the emitted types aren't very Scala-friendly. So, I was wondering how others deal with this pro

Re: how to use rdd.countApprox

2015-05-13 Thread Du Li
Hi TD, Thanks a lot. rdd.countApprox(5000).initialValue worked! Now my streaming app is standing a much better chance to complete processing each batch within the batch interval. Du On Tuesday, May 12, 2015 10:31 PM, Tathagata Das wrote: From the code it seems that as soon as the

Re: Backward compatibility with org.apache.spark.sql.api.java.Row class

2015-05-13 Thread Michael Armbrust
Sorry for missing that in the upgrade guide. As part of unifying the Java and Scala interfaces we got rid of the Java-specific Row. You are correct in assuming that you want to use Row in org.apache.spark.sql from both Scala and Java now. On Wed, May 13, 2015 at 2:48 AM, Emerson Castañeda wrote
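Concretely, the migration is just a change of import; a minimal sketch:

    // Spark 1.2.x (removed in 1.3):
    //   import org.apache.spark.sql.api.java.Row

    // Spark 1.3.x -- the single Row type used from both Scala and Java:
    import org.apache.spark.sql.Row

    val row = Row(1, "alice")
    val id   = row.getInt(0)
    val name = row.getString(1)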

Re: value toDF is not a member of RDD object

2015-05-13 Thread Michael Armbrust
On Wed, May 13, 2015 at 3:00 AM, SLiZn Liu wrote: > Additionally, after I successfully packaged the code, and submitted via > spark-submit > webcat_2.11-1.0.jar, the following error was thrown at the line where > toDF() been called: > > Exception in thread "main" java.lang.NoSuchMethodError: >

Re: [Spark SQL 1.3.1] data frame saveAsTable returns exception

2015-05-13 Thread Michael Armbrust
I think this is a bug in our date handling that should be fixed in Spark 1.4. On Wed, May 13, 2015 at 8:23 AM, ayan guha wrote: > Your stack trace says it can't convert date to integer. You sure about > column positions? > On 13 May 2015 21:32, "Ishwardeep Singh" > wrote: > >> Hi , >> >> I am u

Re: Spark SQL: preferred syntax for column reference?

2015-05-13 Thread Michael Armbrust
I would not say that either method is preferred (neither is old/deprecated). One advantage to the second is that you are referencing a column from a specific dataframe, instead of just providing a string that will be resolved much like an identifier in a SQL query. This means given: df1 = [id: in
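A small illustration of the difference (assumes the spark-shell's sc; column names and data are made up):

    val sqlContext = new org.apache.spark.sql.SQLContext(sc)
    import sqlContext.implicits._

    val df1 = Seq((1, "alice"), (2, "bob")).toDF("id", "name")
    val df2 = Seq((1, 95), (2, 80)).toDF("id", "score")

    // $"id" is resolved by name at analysis time and would be ambiguous after the join;
    // df1("id") is bound to a column of one specific DataFrame, so it stays unambiguous.
    val joined = df1.join(df2, df1("id") === df2("id"))
    joined.select(df1("id"), $"name", $"score").show()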

Word2Vec with billion-word corpora

2015-05-13 Thread Shilad Sen
Hi all, I'm experimenting with Spark's Word2Vec implementation for a relatively large (5B word, vocabulary size 4M) corpora. Has anybody had success running it at this scale? -Shilad -- Shilad W. Sen Associate Professor Mathematics, Statistics, and Computer Science Dept. Macalester College s...

Spark SQL: preferred syntax for column reference?

2015-05-13 Thread Diana Carroll
I'm just getting started with Spark SQL and DataFrames in 1.3.0. I notice that the Spark API shows a different syntax for referencing columns in a dataframe than the Spark SQL Programming Guide. For instance, the API docs for the select method show this: df.select($"colA", $"colB") Whereas the

Re: force the kafka consumer process to different machines

2015-05-13 Thread Du Li
Alternatively, you may spread your Kafka receivers to multiple machines as discussed in this blog post: "How to spread receivers over worker hosts in Spark streaming". In Spark Streaming, you can spawn multipl

Re: Kryo serialization of classes in additional jars

2015-05-13 Thread Akshat Aranya
I cherry-picked this commit into my local 1.2 branch. It fixed the problem with setting spark.serializer, but I get a similar problem with spark.closure.serializer org.apache.spark.SparkException: Failed to register classes with Kryo at org.apache.spark.serializer.KryoSerializer.newKryo(Kry
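As background, registering application classes with Kryo is normally done on the SparkConf before the context is created; a minimal sketch under that assumption (the Record class is a placeholder, and this does not by itself address classes that live only in extra jars):

    import org.apache.spark.{SparkConf, SparkContext}

    // Hypothetical application class that travels through shuffles/broadcasts.
    case class Record(id: Long, payload: Array[Byte])

    val conf = new SparkConf()
      .setAppName("kryo-example")
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      // Registers the classes with Kryo (available since Spark 1.2).
      .registerKryoClasses(Array(classOf[Record]))

    val sc = new SparkContext(conf)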

Re: force the kafka consumer process to different machines

2015-05-13 Thread Dibyendu Bhattacharya
Or you can use this Receiver as well: http://spark-packages.org/package/dibbhatt/kafka-spark-consumer where you can specify how many Receivers you need for your topic, and it will divide the partitions among the Receivers and return the joined stream for you. Say you specified 20 receivers, in

Re: force the kafka consumer process to different machines

2015-05-13 Thread Akhil Das
With this low-level Kafka API, you can actually specify how many receivers you want to spawn, and most of the time it spawns them evenly. Usually you can put a sleep just after creating the context for the executors to connect to the driver and then

Re: force the kafka consumer process to different machines

2015-05-13 Thread 李森栋
Thank you very much. Sent from Meizu MX4 Pro -- Original message -- From: Cody Koeninger Date: Wed, May 13, 23:52 To: hotdog Cc: user@spark.apache.org Subject: Re: force the kafka consumer process to different machines >I assume you're using the receiver based approach? Have you tried the >createDirectStream api? > >h

Re: force the kafka consumer process to different machines

2015-05-13 Thread Cody Koeninger
I assume you're using the receiver based approach? Have you tried the createDirectStream api? https://spark.apache.org/docs/1.3.0/streaming-kafka-integration.html If you're sticking with the receiver based approach I think your only option would be to create more consumer streams and union them.
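A rough sketch of the receiver-based workaround Cody mentions (ssc is an existing StreamingContext; the ZooKeeper quorum, group, topic, and receiver count are placeholders):

    import org.apache.spark.streaming.StreamingContext
    import org.apache.spark.streaming.kafka.KafkaUtils

    def unionedKafkaStream(ssc: StreamingContext) = {
      val zkQuorum = "zk1:2181,zk2:2181"
      val numReceivers = 8

      // One receiver per stream; more streams spread consumption over more executors.
      val streams = (1 to numReceivers).map { _ =>
        KafkaUtils.createStream(ssc, zkQuorum, "my-consumer-group", Map("my-topic" -> 10))
      }

      // A single DStream to run the rest of the pipeline on.
      ssc.union(streams)
    }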

Re: Spark on Mesos

2015-05-13 Thread Sander van Dijk
Hey all, I seem to be experiencing the same thing as Stephen. I run Spark 1.2.1 with Mesos 0.22.1, with Spark coming from the spark-1.2.1-bin-hadoop2.4.tgz prebuilt package, and Mesos installed from the Mesosphere repositories. I have been running with Spark standalone successfully for a while and

force the kafka consumer process to different machines

2015-05-13 Thread hotdog
I'm using Spark Streaming integrated with streaming-kafka. My Kafka topic has 80 partitions, while my machines have 40 cores. I found that when the job is running, the Kafka consumer processes are only deployed to 2 machines, and the bandwidth of those 2 machines becomes very high. I wonder whether there is any

Re: [Spark SQL 1.3.1] data frame saveAsTable returns exception

2015-05-13 Thread ayan guha
Your stack trace says it can't convert date to integer. You sure about column positions? On 13 May 2015 21:32, "Ishwardeep Singh" wrote: > Hi , > > I am using Spark SQL 1.3.1. > > I have created a dataFrame using jdbc data source and am using > saveAsTable() > method but got the following 2 excep

Re: Worker Spark Port

2015-05-13 Thread Cody Koeninger
I believe most ports are configurable at this point; look at http://spark.apache.org/docs/latest/configuration.html and search for ".port". On Wed, May 13, 2015 at 9:38 AM, James King wrote: > I understand that this port value is randomly selected. > > Is there a way to enforce which Spark port a
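For reference, a sketch of how ports are typically pinned in this era of Spark (the port numbers are arbitrary): the standalone Worker's own listening port goes in conf/spark-env.sh, while ports the application opens can be set through SparkConf.

    import org.apache.spark.SparkConf

    // In conf/spark-env.sh on each worker host:
    //   SPARK_WORKER_PORT=7078
    //   SPARK_WORKER_WEBUI_PORT=8081
    // Ports opened by the application itself, pinned via SparkConf:
    val conf = new SparkConf()
      .setAppName("fixed-ports")
      .set("spark.driver.port", "7001")        // driver RPC port
      .set("spark.blockManager.port", "7003")  // block manager port on driver and executors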

Re: Reading Real Time Data only from Kafka

2015-05-13 Thread Cody Koeninger
You linked to a google mail tab, not a public archive, so I don't know exactly which conversation you're referring to. As far as I know, streaming only runs a single job at a time in the order they were defined, unless you turn on an experimental option for more parallelism (TD or someone more kno

Worker Spark Port

2015-05-13 Thread James King
I understand that this port value is randomly selected. Is there a way to enforce which Spark port a Worker should use?

Re: kafka + Spark Streaming with checkPointing fails to restart

2015-05-13 Thread Cody Koeninger
http://spark.apache.org/docs/latest/streaming-programming-guide.html#checkpointing shows setting up your stream and calling .checkpoint(checkpointDir) inside the functionToCreateContext. It looks to me like you're setting up your stream and calling checkpoint outside, after getOrCreate. I'm not
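The pattern Cody and NB describe — building the DStream graph and calling checkpoint() inside the factory function passed to getOrCreate — might look roughly like this minimal sketch (checkpoint path, source, and processing are placeholders):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    object CheckpointedApp {
      val checkpointDir = "hdfs:///tmp/checkpoint"   // hypothetical path

      // Everything that defines the pipeline lives inside this function,
      // so it only runs when there is no checkpoint to recover from.
      def createContext(): StreamingContext = {
        val conf = new SparkConf().setAppName("checkpointed-app")
        val ssc = new StreamingContext(conf, Seconds(60))
        ssc.checkpoint(checkpointDir)

        val lines = ssc.socketTextStream("localhost", 9999) // placeholder source
        lines.count().print()                               // placeholder processing
        ssc
      }

      def main(args: Array[String]): Unit = {
        // On restart this recovers the graph from the checkpoint instead of
        // calling createContext() again.
        val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
        ssc.start()
        ssc.awaitTermination()
      }
    }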

Re: [Spark SQL 1.3.1] data frame saveAsTable returns exception

2015-05-13 Thread Ishwardeep Singh
Hi, I am using spark-shell and the steps with which I can reproduce the issue are as follows: scala> val dateDimDF = sqlContext.load("jdbc", Map("url" -> "jdbc:teradata://192.168.145.58/DBS_PORT=1025,DATABASE=BENCHQADS,LOB_SUPPORT=OFF,USER=BENCHQADS,PASSWORD=abc", "dbtable" -> "date_dim")) scala>

Re: Reading Real Time Data only from Kafka

2015-05-13 Thread Dibyendu Bhattacharya
Thanks Cody for your email. I think my concern was not about getting the ordering of messages within a partition, which as you said is possible if one knows how Spark works. The issue is how Spark schedules jobs on every batch, which is not in the same order they were generated. So if that is not guaranteed it

Re: Kafka Direct Approach + Zookeeper

2015-05-13 Thread James King
Many thanks Cody! On Wed, May 13, 2015 at 4:22 PM, Cody Koeninger wrote: > In my mind, this isn't really a producer vs consumer distinction, this is > a broker vs zookeeper distinction. > > The producer apis talk to brokers. The low level consumer api (what direct > stream uses) also talks to br

Re: Kafka Direct Approach + Zookeeper

2015-05-13 Thread Cody Koeninger
In my mind, this isn't really a producer vs consumer distinction, this is a broker vs zookeeper distinction. The producer apis talk to brokers. The low level consumer api (what direct stream uses) also talks to brokers. The high level consumer api talks to zookeeper, at least initially. TLDR; do

Spark Sorted DataFrame & Repartitioning

2015-05-13 Thread Night Wolf
Hi guys, If I load a dataframe via a sql context that has a SORT BY in the query and I want to repartition the data frame will it keep the sort order in each partition? I want to repartition because I'm going to run a Map that generates lots of data internally so to avoid Out Of Memory errors I n

Re: Reading Real Time Data only from Kafka

2015-05-13 Thread Cody Koeninger
As far as I can tell, Dibyendu's "cons" boil down to: 1. Spark checkpoints can't be recovered if you upgrade code 2. Some Spark transformations involve a shuffle, which can repartition data It's not accurate to imply that either one of those things are inherently "cons" of the direct stream api.

NullPointerException while creating DataFrame from an S3 Avro Object

2015-05-13 Thread Mohammad Tariq
Hi List, I have just started using Spark and trying to create DataFrame from an Avro file stored in Amazon S3. I am using *Spark-Avro* library for this. The code which I'm using is shown below. Nothing fancy, just the basic prototype as shown on the Spark

Re: Kafka Direct Approach + Zookeeper

2015-05-13 Thread James King
Looking at Consumer Configs in http://kafka.apache.org/documentation.html#consumerconfigs, the properties *metadata.broker.list* and *bootstrap.servers* are not mentioned. Do I need these on the consumer side? On Wed, May 13, 2015 at 3:52 PM, James King wrote: > Many thanks Cody and contributor

Re: value toDF is not a member of RDD object

2015-05-13 Thread Todd Nist
I believe what Dean Wampler was suggesting is to use the sqlContext not the sparkContext (sc), which is where the createDataFrame function resides: https://spark.apache.org/docs/1.3.1/api/scala/index.html#org.apache.spark.sql.SQLContext HTH. -Todd On Wed, May 13, 2015 at 6:00 AM, SLiZn Liu wro
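For reference, the usual way to make toDF available on an RDD of case classes in 1.3 is via the SQLContext implicits; a minimal sketch (assumes the spark-shell's sc; the Person case class is a placeholder):

    case class Person(name: String, age: Int)

    val sqlContext = new org.apache.spark.sql.SQLContext(sc)
    import sqlContext.implicits._   // this import is what makes toDF available on RDDs

    val people = sc.parallelize(Seq(Person("alice", 30), Person("bob", 25)))

    // Either form works once the implicits are in scope:
    val df1 = people.toDF()
    val df2 = sqlContext.createDataFrame(people)
    df1.printSchema()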

Re: Kafka Direct Approach + Zookeeper

2015-05-13 Thread James King
Many thanks Cody and contributors for the help. On Wed, May 13, 2015 at 3:44 PM, Cody Koeninger wrote: > Either one will work, there is no semantic difference. > > The reason I designed the direct api to accept both of those keys is > because they were used to define lists of brokers in pre-exi

how to read lz4 compressed data using fileStream of spark streaming?

2015-05-13 Thread hotdog
In Spark Streaming, I want to use fileStream to monitor a directory. But the files in that directory are compressed using lz4, so the new lz4 files are not detected by the following code. How do I detect these new files? val list_join_action_stream = ssc.fileStream[LongWritable, Text, TextInputF

Re: Kafka Direct Approach + Zookeeper

2015-05-13 Thread Cody Koeninger
Either one will work, there is no semantic difference. The reason I designed the direct api to accept both of those keys is because they were used to define lists of brokers in pre-existing Kafka project apis. I don't know why the Kafka project chose to use 2 different configuration keys. On Wed

Re: how to set random seed

2015-05-13 Thread Charles Hayden
Can you elaborate? Broadcast will distribute the seed, which is only one number. But what construct do I use to "plant" the seed (call random.seed()) once on each worker? From: ayan guha Sent: Tuesday, May 12, 2015 11:17 PM To: Charles Hayden Cc: user Subject:
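One common construct is to seed a generator once per partition, derived from the partition index, rather than trying to seed once per worker process; a minimal Scala sketch of the idea (the thread is about the Python API, where mapPartitionsWithIndex works the same way; sc, the base seed, and the sampling fraction are placeholders):

    import scala.util.Random

    val baseSeed = 42L
    val data = sc.parallelize(1 to 100000, numSlices = 8)

    val sampled = data.mapPartitionsWithIndex { (partitionIndex, rows) =>
      // One generator per partition, seeded deterministically from the index.
      val rng = new Random(baseSeed + partitionIndex)
      rows.filter(_ => rng.nextDouble() < 0.1)
    }
    println(sampled.count())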

Re: Kafka + Direct + Zookeeper

2015-05-13 Thread Ted Yu
Please see Approach 2 in: http://spark.apache.org/docs/latest/streaming-kafka-integration.html > On May 13, 2015, at 6:10 AM, James King wrote: > > I'm trying Kafka Direct approach (for consume) but when I use only this > config: > > kafkaParams.put("group.id", groupdid); > kafkaParams.put

RE: Kafka + Direct + Zookeeper

2015-05-13 Thread Shao, Saisai
Hi James, For the Kafka direct approach, you don’t need to specify “group.id” and “zookeeper.connect”; instead you need to specify the broker list through “metadata.broker.list” or “bootstrap.servers”. Internally the direct approach uses the low-level Kafka API, so it does not involve ZK. Thanks, Jerry
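A minimal sketch of the parameters Jerry describes (ssc is an existing StreamingContext; broker addresses and topic are placeholders, and either key name is accepted by the direct API):

    import kafka.serializer.StringDecoder
    import org.apache.spark.streaming.StreamingContext
    import org.apache.spark.streaming.kafka.KafkaUtils

    def directKafkaStream(ssc: StreamingContext) = {
      // No group.id or zookeeper.connect needed -- just the broker list.
      val kafkaParams = Map("metadata.broker.list" -> "broker1:9092,broker2:9092")
      KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
        ssc, kafkaParams, Set("my-topic"))
    }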

Re: Spark and Flink

2015-05-13 Thread Ted Yu
You can run the following command: mvn dependency:tree And see what jetty versions are brought in. Cheers > On May 13, 2015, at 6:07 AM, Pa Rö wrote: > > hi, > > i use spark and flink in the same maven project, > > now i get a exception on working with spark, flink work well > > the prob

JavaPairRDD

2015-05-13 Thread Yasemin Kaya
Hi, I want to get *JavaPairRDD* from the tuple part of *JavaPairRDD>*. As an example: ( http://www.koctas.com.tr/reyon/el-aletleri/7,(0,1,0,0,0,0,0,0,46551)) in my *JavaPairRDD>* and I want to get *( (46551), (0,1,0,0,0,0,0,0) )*. I tried to split tuple._2() and create a new JavaPairRDD but I can'

Kafka + Direct + Zookeeper

2015-05-13 Thread James King
I'm trying Kafka Direct approach (for consume) but when I use only this config: kafkaParams.put("group.id", groupdid); kafkaParams.put("zookeeper.connect", zookeeperHostAndPort + "/cb_kafka"); I get this Exception in thread "main" org.apache.spark.SparkException: Must specify metadata.broker.lis

Re: Building Spark

2015-05-13 Thread Emre Sevinc
My 2 cents: If you have Java 8, you don't need any extra settings for Maven. -- Emre Sevinç On Wed, May 13, 2015 at 3:02 PM, Stephen Boesch wrote: > Hi Akhil, Building with sbt tends to need around 3.5GB whereas maven > requirements are much lower , around 1.7GB. So try using maven . > > For

Spark and Flink

2015-05-13 Thread Pa Rö
Hi, I use Spark and Flink in the same Maven project. Now I get an exception when working with Spark; Flink works well. The problem is transitive dependencies. Maybe somebody knows a solution, or versions which work together. Best regards, Paul. PS: a Cloudera Maven repo for Flink would be desirable. my

applications are still in progress?

2015-05-13 Thread Yifan LI
Hi, I have some applications that finished (but actually failed earlier), and the WebUI shows "Application myApp is still in progress." Also, in the event log folder, there are several log files like this: app-20150512***.inprogress So, I am wondering what the “inprogress” means… Thanks! :) Best, Yifan

Re: Building Spark

2015-05-13 Thread Stephen Boesch
Hi Akhil, Building with sbt tends to need around 3.5GB whereas maven requirements are much lower , around 1.7GB. So try using maven . For reference I have the following settings and both do compile. sbt would not work with lower values. $echo $SBT_OPTS -Xmx3012m -XX:MaxPermSize=512m -XX:Reser

Building Spark

2015-05-13 Thread Heisenberg Bb
I tried to build Spark on my local machine, Ubuntu 14.04 (4 GB RAM), and my system hangs (freezes). When I monitored system processes, the build process was found to consume 85% of my memory. Why does it need a lot of resources? Is there any efficient method to build Spark? Thanks Akhil

Removing FINISHED applications and shuffle data

2015-05-13 Thread sayantini
Hi, Please help me with below two issues: *Environment:* I am running my spark cluster in stand alone mode. I am initializing the spark context from inside my tomcat server. I am setting below properties in environment.sh in $SPARK_HOME/conf directory SPARK_MASTER_OPTS=-Dspark.dep

[Spark SQL 1.3.1] data frame saveAsTable returns exception

2015-05-13 Thread Ishwardeep Singh
Hi , I am using Spark SQL 1.3.1. I have created a dataFrame using jdbc data source and am using saveAsTable() method but got the following 2 exceptions: java.lang.RuntimeException: Unsupported datatype DecimalType() at scala.sys.package$.error(package.scala:27) at org.apache.spar

com.esotericsoftware.kryo.KryoException: java.io.IOException: Stream is corrupted

2015-05-13 Thread Yifan LI
Hi, I was running our GraphX application (which worked fine on Spark 1.2.0) but it failed on Spark 1.3.1 with the exception below. Does anyone have an idea about this issue? I guess it was caused by using the LZ4 codec? Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 54

Fwd: value toDF is not a member of RDD object

2015-05-13 Thread SLiZn Liu
Are you sure that you are submitting it correctly? Can you post the entire command you are using to run the .jar file via spark-submit? Ok, here it is: /opt/spark-1.3.1-bin-hadoop2.6/bin/spark-submit target/scala-2.11/webcat_2.11-1.0.jar However, on the server somehow I have to specify main clas

Re: value toDF is not a member of RDD object

2015-05-13 Thread SLiZn Liu
No, creating the DF using createDataFrame won’t work: val peopleDF = sqlContext.createDataFrame(people) The code compiles but raises the same error as toDF at the line above. On Wed, May 13, 2015 at 6:22 PM, Sebastian Alfers

Re: value toDF is not a member of RDD object

2015-05-13 Thread SLiZn Liu
Additionally, after I successfully packaged the code, and submitted via spark-submit webcat_2.11-1.0.jar, the following error was thrown at the line where toDF() been called: Exception in thread "main" java.lang.NoSuchMethodError: scala.reflect.api.JavaUniverse.runtimeMirror(Ljava/lang/ClassLoader

Kafka Direct Approach + Zookeeper

2015-05-13 Thread James King
From: http://spark.apache.org/docs/latest/streaming-kafka-integration.html I'm trying to use the direct approach to read messages form Kafka. Kafka is running as a cluster and configured with Zookeeper. On the above page it mentions: "In the Kafka parameters, you must specify either *metadata.

Backward compatibility with org.apache.spark.sql.api.java.Row class

2015-05-13 Thread Emerson Castañeda
Hello everyone, I'm adopting the latest version of Apache Spark in my project, moving from *1.2.x* to *1.3.x*, and the only significant incompatibility so far is related to the *Row* class. Any idea what happened to the *org.apache.spark.sql.api.java.Row* class in Apache Spark 1.3? Migra

Re: how to monitor multi directories in spark streaming task

2015-05-13 Thread Ankur Chauhan
Hi, You could retroactively union an existing DStream with one from a newly created file. Then when another file is "detected", you would need to re-union the stream and create another DStream. It seems like the implementation of FileInputDStream only

Re: How to get applicationId for yarn mode(both yarn-client and yarn-cluster mode)

2015-05-13 Thread thanhtien522
Earthson wrote > Finally, I've found two ways: > > 1. search the output with something like "Submitted application > application_1416319392519_0115" > 2. use specific AppName. We could query the ApplicationID (yarn) Hi Earthson, Can you explain more about case 2? How can we query the ApplicationID

kafka + Spark Streaming with checkPointing fails to restart

2015-05-13 Thread ankurcha
Hi, I have a simple application which fails with the following exception only when the application is restarted (i.e. the checkpointDir has entries from a previous execution): Exception in thread "main" org.apache.spark.SparkException: org.apache.spark.streaming.dstream.ShuffledDStream@2264e43c h

Re: how to monitor multi directories in spark streaming task

2015-05-13 Thread lisendong
But in fact the directories are not all ready at the beginning of my task. For example: /user/root/2015/05/11/data.txt /user/root/2015/05/12/data.txt /user/root/2015/05/13/data.txt and so on, with one new directory each day. How do I create the new DStream for tomorrow’s new directory (/user/root/20

kafka + Spark Streaming with checkPointing fails to restart

2015-05-13 Thread Ankur Chauhan
Hi, I have a simple application which fails with the following exception only when the application is restarted (i.e. the checkpointDir has entries from a previous execution): Exception in thread "main" org.apache.spark.SparkException: org.apache.spa

Re: how to monitor multi directories in spark streaming task

2015-05-13 Thread Ankur Chauhan
I would suggest creating one DStream per directory and then using StreamingContext#union(...) to get a union DStream. -- Ankur On 13/05/2015 00:53, hotdog wrote: > I want to use fileStream in spark streaming to monitor multi > hdfs directories,
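Ankur's suggestion, roughly sketched (ssc is an existing StreamingContext; the directory list and record type are placeholders):

    import org.apache.hadoop.io.{LongWritable, Text}
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
    import org.apache.spark.streaming.StreamingContext
    import org.apache.spark.streaming.dstream.DStream

    def unionedFileStream(ssc: StreamingContext): DStream[String] = {
      val dirs = Seq("/user/root/2015/05/11", "/user/root/2015/05/12", "/user/root/2015/05/13")

      // One fileStream per directory...
      val perDir = dirs.map { dir =>
        ssc.fileStream[LongWritable, Text, TextInputFormat](dir).map(_._2.toString)
      }
      // ...then a single unioned DStream to process.
      ssc.union(perDir)
    }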

Increase maximum amount of columns for covariance matrix for principal components

2015-05-13 Thread Sebastian Alfers
Hello, in order to compute a huge dataset, the number of columns for calculating the covariance matrix is limited: https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/linalg/distributed/RowMatrix.scala#L129 What is the reason behind this limitation, and can it be

how to monitor multi directories in spark streaming task

2015-05-13 Thread hotdog
I want to use fileStream in Spark Streaming to monitor multiple HDFS directories, such as: val list_join_action_stream = ssc.fileStream[LongWritable, Text, TextInputFormat]("/user/root/*/*", check_valid_file(_), false).map(_._2.toString).print By the way, I could not understand the meaning of the t

Re: sparksql running slow while joining_2_tables.

2015-05-13 Thread luohui20001
Hi Hao, I tried broadcast join with the following steps, and found that my query is still running slow; I'm not very sure if I'm doing the broadcast join right: 1. Add "spark.sql.autoBroadcastJoinThreshold 104857600" (100MB) in conf/spark-defaults.conf; 100MB is larger than either of my 2 tables. 2. Start
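For reference, the same threshold can also be set on the SQLContext at runtime; a minimal sketch (sqlContext is an existing HiveContext/SQLContext; table and column names are placeholders):

    // 104857600 bytes = 100 MB; tables estimated below this size get broadcast to the executors.
    sqlContext.setConf("spark.sql.autoBroadcastJoinThreshold", (100 * 1024 * 1024).toString)

    val small = sqlContext.table("small_table")
    val big   = sqlContext.table("big_table")

    // explain() should show a broadcast join in the physical plan if the threshold applies.
    big.join(small, big("key") === small("key")).explain()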