Re: Using Spark like a search engine

2015-05-25 Thread Сергей Мелехин
Hi, Ankur! Thanks for your reply! CVs are just a bunch of IDs; each ID represents some object of some class (e.g. class=JOB, object=SW Developer). We have already processed the texts and extracted all facts, so we don't need to do any text processing in Spark, just to run a scoring function on many many C

Re: Intellij IDEA import spark source code error

2015-05-25 Thread Yi Zhang
I am not sure what happened. According to your screenshot, it just shows a warning message instead of an error. But I suggest you try Maven with: mvn idea:idea. On Monday, May 25, 2015 2:48 PM, huangzheng <1106944...@qq.com> wrote: Hi all, I want to learn spark source code rec

Websphere MQ as a data source for Apache Spark Streaming

2015-05-25 Thread umesh9794
I was digging into the possibilities for Websphere MQ as a data source for spark-streaming because it is needed in one of our use cases. I learned that MQTT is the protocol that supports communication with MQ data structures, but since I am a newbie to spark streaming I

Re: Websphere MQ as a data source for Apache Spark Streaming

2015-05-25 Thread Arush Kharbanda
Hi Umesh, You can connect to Spark Streaming with MQTT; refer to this example: https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/streaming/MQTTWordCount.scala Thanks Arush On Mon, May 25, 2015 at 3:43 PM, umesh9794 wrote: > I was digging into the
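For reference, a minimal sketch of consuming an MQTT stream along those lines (the broker URL and topic are placeholders, and the spark-streaming-mqtt artifact is assumed to be on the classpath):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.mqtt.MQTTUtils

    // Placeholder broker URL and topic for the MQ/MQTT bridge
    val conf = new SparkConf().setAppName("MQStreaming")
    val ssc = new StreamingContext(conf, Seconds(10))
    val lines = MQTTUtils.createStream(ssc, "tcp://mq-host:1883", "queue/updates")
    lines.print()
    ssc.start()
    ssc.awaitTermination()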

spark sql through java code facing issue

2015-05-25 Thread vinayak
Hi All, I am new to Spark and am trying to execute Spark SQL through Java code as below:
package com.ce.sql;
import java.util.List;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.api.java.J

Re: Re: Re: how to distributed run a bash shell in spark

2015-05-25 Thread luohui20001
Thanks, Madhu and Akhil. I modified my code as below; however, I think it is not so distributed. Do you have a better idea to run this app more efficiently and in a more distributed way? I have added some comments with my understanding:
import org.apache.spark._
import www.celloud.com.model._
object GeneCompare3 {

Tasks randomly stall when running on mesos

2015-05-25 Thread Reinis Vicups
Hello, I am using Spark 1.3.1-hadoop2.4 with Mesos 0.22.1 with ZooKeeper, running on a cluster with 3 nodes on 64-bit Ubuntu. My application is compiled with spark 1.3.1 (apparently with a mesos 0.21.0 dependency), hadoop 2.5.1-mapr-1503 and akka 2.3.10. Only with this combination I have suc

DataFrame. Conditional aggregation

2015-05-25 Thread Masf
Hi. In a DataFrame, how can I execute a conditional expression in an aggregation? For example, can I translate this SQL statement to DataFrame?: SELECT name, SUM(IF table.col2 > 100 THEN 1 ELSE table.col1) FROM table GROUP BY name Thanks -- Regards. Miguel

Re: Re: Re: how to distributed run a bash shell in spark

2015-05-25 Thread Akhil Das
Can you tell us what exactly you are trying to achieve? Thanks Best Regards On Mon, May 25, 2015 at 5:00 PM, wrote: > Thanks, Madhu and Akhil > > I modified my code as below; however, I think it is not so distributed. > Do you have a better idea to run this app more efficiently and distribu

Re: How to use zookeeper in Spark Streaming

2015-05-25 Thread Akhil Das
If you want to be notified after every batch is completed, then you can simply implement the StreamingListener interface, which has methods like onBatchCompleted, onBatchStarted, etc., in which
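As a rough sketch (assuming Spark 1.3+), a listener that reacts when each batch finishes could look like this; the notification logic itself is only a stub:

    import org.apache.spark.streaming.scheduler.{StreamingListener, StreamingListenerBatchCompleted}

    class BatchNotifier extends StreamingListener {
      // Called on the driver after each batch finishes; put the notification logic here
      override def onBatchCompleted(batchCompleted: StreamingListenerBatchCompleted): Unit = {
        println(s"Batch for ${batchCompleted.batchInfo.batchTime} completed")
      }
    }

    // Register the listener on the StreamingContext before ssc.start()
    ssc.addStreamingListener(new BatchNotifier)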

Using Log4j for logging messages inside lambda functions

2015-05-25 Thread Spico Florin
Hello! I would like to use the logging mechanism provided by log4j, but I'm getting: Exception in thread "main" org.apache.spark.SparkException: Task not serializable -> Caused by: java.io.NotSerializableException: org.apache.log4j.Logger The code (and the problem) that I'm using resemble

Re: Tasks randomly stall when running on mesos

2015-05-25 Thread Iulian Dragoș
On Mon, May 25, 2015 at 2:43 PM, Reinis Vicups wrote: > Hello, > > I am using Spark 1.3.1-hadoop2.4 with Mesos 0.22.1 with zookeeper and > running on a cluster with 3 nodes on 64bit ubuntu. > > My application is compiled with spark 1.3.1 (apparently with mesos 0.21.0 > dependency), hadoop 2.5.1-

Re: Using Log4j for logging messages inside lambda functions

2015-05-25 Thread Akhil Das
Try this way:
object Holder extends Serializable {
  @transient lazy val log = Logger.getLogger(getClass.getName)
}

val someRdd = spark.parallelize(List(1, 2, 3))
someRdd.map { element =>
  Holder.log.info(s"$element will be processed")
  element + 1
}

Re: Spark updateStateByKey fails with class leak when using case classes - resend

2015-05-25 Thread rsearle
Further experimentation indicates these problems only occur when master is local[*]. There are no issues if a standalone cluster is used. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-updateStateByKey-fails-with-class-leak-when-using-case-classes-r

Re: IPv6 support

2015-05-25 Thread Akhil Das
Hi Kevin, Did you try adding a host name for the IPv6 address? I have a few IPv6 boxes; Spark failed for me when I used just the IPv6 addresses, but it works fine when I use the host names. Here's an entry in my /etc/hosts: 2607:5300:0100:0200::::0a4d hacked.work My spark-env.sh file: expo

Re: Tasks randomly stall when running on mesos

2015-05-25 Thread Reinis Vicups
Hello, I assume I am running spark in a fine-grained mode since I haven't changed the default here. One question regarding 1.4.0-RC1 - is there a mvn snapshot repository I could use for my project config? (I know that I have to download source and make-distribution for executor as well) th

Re: Tasks randomly stall when running on mesos

2015-05-25 Thread Dean Wampler
Here is a link for builds of 1.4 RC2: http://people.apache.org/~pwendell/spark-releases/spark-1.4.0-rc2-bin/ For a mvn repo, I believe the RC2 artifacts are here: https://repository.apache.org/content/repositories/orgapachespark-1104/ A few experiments you might try: 1. Does spark-shell work?
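If it helps, pulling the RC artifacts from that staging repository into an sbt build is just an extra resolver; a sketch (the version string for the RC artifacts is an assumption):

    // build.sbt
    resolvers += "Apache Spark staging" at
      "https://repository.apache.org/content/repositories/orgapachespark-1104/"
    libraryDependencies += "org.apache.spark" %% "spark-core" % "1.4.0"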

scala.ScalaReflectionException when creating SchemaRDD

2015-05-25 Thread vkcelik
Hello, I am trying to create a SchemaRDD from an RDD of case classes. Depending on an argument to the program, these case classes should be different. But this throws an exception. I am using Spark version 1.1.0 and Scala version 2.10.4. The exception can be reproduced by: val table = "type 1" impor

SparkSQL's performance: contacting namenode and datanode to unnecessarily check all partitions for a query of specific partitions

2015-05-25 Thread ogoh
Hello, I am using SparkSQL 1.3.0 and Hive 0.13.1 on AWS & YARN. My Hive table, an external table, is partitioned by date and hour. I expected that a query with certain partitions would read only the data files of those partitions. I turned on TRACE level logging for ThriftServer since the query r

scala.ScalaReflectionException when creating SchemaRDD

2015-05-25 Thread vkcelik
Hello, I am trying to create a SchemaRDD from an RDD of case classes. Depending on an argument to the program, the program reads data of the specified type and maps it to the correct case class. But this throws an exception. I am using Spark version 1.1.0 and Scala version 2.10.4. The exception can be

Re: Tasks randomly stall when running on mesos

2015-05-25 Thread Reinis Vicups
Great hints, you guys! Yes, spark-shell worked fine with mesos as master. I haven't tried to execute multiple RDD actions in a row though (I did a couple of successful counts on the HBase tables I am working with in several experiments, but nothing that would compare to the stuff my spark jobs are d

Implementing custom RDD in Java

2015-05-25 Thread Swaranga Sarma
Hello, I have a custom data source and I want to load the data into Spark to perform some computations. For this I see that I might need to implement a new RDD for my data source. I am a complete Scala noob and I am hoping that I can implement the RDD in Java only. I looked around the internet an

Re: DataFrame. Conditional aggregation

2015-05-25 Thread ayan guha
Case when col2>100 then 1 else col2 end On 26 May 2015 00:25, "Masf" wrote: > Hi. > > In a DataFrame, how can I execute a conditional expression in an > aggregation? For example, can I translate this SQL statement to DataFrame?: > > SELECT name, SUM(IF table.col2 > 100 THEN 1 ELSE table.col1) > FR
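In DataFrame code that roughly becomes the following sketch (assuming Spark 1.4's when/otherwise column functions; column names are taken from the question):

    import org.apache.spark.sql.functions.{sum, when}

    // SUM(IF col2 > 100 THEN 1 ELSE col1) grouped by name
    val result = df.groupBy("name")
      .agg(sum(when(df("col2") > 100, 1).otherwise(df("col1"))).as("conditional_sum"))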

Re: Using Log4j for logging messages inside lambda functions

2015-05-25 Thread Wesley Miao
The reason it didn't work for you is that the function you registered with someRdd.map will be running on the worker/executor side, not in your driver's program. So you need to be careful not to accidentally close over objects instantiated in your driver's program, like the log object in y

Re: Using Spark like a search engine

2015-05-25 Thread Alex Chavez
Сергей, A simple implementation would be to create a DataFrame of CVs by issuing a Spark SQL query against your Postgres database, persist it in memory, and then map F over it at query time and return the top
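A sketch of that approach (the JDBC URL, table name, and score function are placeholders; assumes Spark 1.3+'s JDBC data source with the Postgres driver on the classpath):

    // Load the CVs table from Postgres into a DataFrame and keep it in memory
    val cvs = sqlContext.load("jdbc", Map(
      "url" -> "jdbc:postgresql://db-host/cvdb?user=spark",
      "dbtable" -> "cvs"))
    cvs.cache()

    // At query time, apply the scoring function F to each row and take the top matches
    val top = cvs.map(row => (score(row), row)).sortBy(-_._1).take(20)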

Re: Re: Re: Re: how to distributed run a bash shell in spark

2015-05-25 Thread luohui20001
I am indeed trying to run some shell scripts in my Spark app, hoping they run more concurrently in my Spark cluster. However, I am not sure whether my code will run concurrently on my executors. Diving into my code, you can see that I am trying to: 1. split both db and sample into 21 small files. That

Re: Implementing custom RDD in Java

2015-05-25 Thread Swaranga Sarma
My data is in S3 and is indexed in Dynamo. For example, if I want to load data for a given time range, I will first need to query Dynamo for the S3 file keys for the corresponding time range and then load them in Spark. The files may not always be under the same S3 path prefix, hence sc.textFile("s3://dir
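One simpler alternative (not requiring a custom RDD) is to resolve the keys on the driver and hand sc.textFile a comma-separated list of paths, which it accepts; a sketch where lookupKeysFromDynamo is a hypothetical helper for the Dynamo query:

    // lookupKeysFromDynamo is a stand-in for the Dynamo time-range query
    val s3Paths = lookupKeysFromDynamo(startTime, endTime).map(key => s"s3n://my-bucket/$key")
    // textFile accepts a comma-separated list of paths, so mixed prefixes are fine
    val data = sc.textFile(s3Paths.mkString(","))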

Re: Re: is there any easier way to define a custom RDD in Java

2015-05-25 Thread swaranga
Has this changed now? Can a new RDD be implemented in Java? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/is-there-any-easier-way-to-define-a-custom-RDD-in-Java-tp6917p23027.html Sent from the Apache Spark User List mailing list archive at Nabble.com. ---

Re: Re: is there any easier way to define a custom RDD in Java

2015-05-25 Thread Ted Yu
Please take a look at: core/src/main/scala/org/apache/spark/api/java/JavaRDD.scala core/src/test/java/org/apache/spark/JavaJdbcRDDSuite.java Cheers On Mon, May 25, 2015 at 8:39 PM, swaranga wrote: > Has this changed now? Can a new RDD be implemented in Java? > > > > -- > View this message in c
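For orientation, the heart of any custom RDD, in either language, is a pair of overrides; here is a minimal Scala sketch (a toy single-partition RDD, not from the thread) showing their shape:

    import org.apache.spark.{Partition, SparkContext, TaskContext}
    import org.apache.spark.rdd.RDD

    // A toy RDD exposing an in-memory sequence as a single partition
    class ListRDD(sc: SparkContext, data: Seq[String]) extends RDD[String](sc, Nil) {
      override protected def getPartitions: Array[Partition] =
        Array(new Partition { override def index: Int = 0 })
      override def compute(split: Partition, context: TaskContext): Iterator[String] =
        data.iterator
    }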

Re: Using Spark like a search engine

2015-05-25 Thread Сергей Мелехин
Thanks, I'll give it a try! Best regards, Сергей Мелехин. 2015-05-26 12:56 GMT+10:00 Alex Chavez : > Сергей, > A simple implementation would be to create a DataFrame of CVs by issuing a > Spark SQL query against your Postgres database, persist it in memory, and > then to map F over it at query tim

Re: Spark SQL High GC time

2015-05-25 Thread Nick Travers
Hi Yuming - I was running into the same issue with larger worker nodes a few weeks ago. The way I managed to get around the high GC time, as per the suggestion of some others, was to break each worker node up into individual workers of around 10G in size. Divide your cores accordingly. The other
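For anyone wanting to try the same thing, the split into several smaller workers per node is typically configured in conf/spark-env.sh for standalone mode; the numbers below are only illustrative:

    # conf/spark-env.sh on each worker node (illustrative values)
    export SPARK_WORKER_INSTANCES=6   # run 6 worker JVMs per node
    export SPARK_WORKER_MEMORY=10g    # ~10 GB per worker
    export SPARK_WORKER_CORES=4       # divide the node's cores across the workers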

Re: Re: Re: Re: how to distributed run a bash shell in spark

2015-05-25 Thread Akhil Das
If you open up the driver UI (running on port 4040), you can see multiple tasks per stage, which will be happening concurrently. If it is a single task and you want to increase the parallelism, then you can simply do a repartition. Thanks Best Regards On Tue, May 26, 2015 at 8:27 AM, wrote: > I am
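Putting both ideas in this thread together (more partitions, then an external script run against each partition's data), RDD.pipe is one way to invoke a shell script per partition; the paths and partition count below are placeholders:

    // Spread the input over more partitions so tasks run in parallel,
    // then stream each partition's lines through an external shell script.
    val input = sc.textFile("hdfs:///genes/sample.txt")   // placeholder path
    val piped = input.repartition(21)
      .pipe("/opt/scripts/compare.sh")                     // script must exist on every executor
    piped.saveAsTextFile("hdfs:///genes/output")           // placeholder output path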

Remove COMPLETED applications and shuffle data

2015-05-25 Thread sayantini
Hi All, Please help me with the two issues below: Environment: I am running my spark cluster in standalone mode. I am initializing the spark context from inside my Tomcat server. I am setting the properties below in environment.sh in the $SPARK_HOME/conf directory: SPARK_MASTER_OPTS=-Dspa
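One commonly used setting for cleaning up per-application work directories (including shuffle data of finished applications) on standalone workers is the spark.worker.cleanup.* family, passed via SPARK_WORKER_OPTS; sketched here as an assumption about the intent of the question, with illustrative values:

    # conf/spark-env.sh (standalone mode); values are illustrative
    export SPARK_WORKER_OPTS="-Dspark.worker.cleanup.enabled=true \
      -Dspark.worker.cleanup.interval=1800 \
      -Dspark.worker.cleanup.appDataTtl=86400"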