Re: Streaming scheduling delay

2015-02-12 Thread Tathagata Das
1. Can you try count()? Take often does not force the entire computation. 2. Can you give the full log? From the log it seems that the blocks are added to two nodes but the tasks seem to be launched on different nodes. I don't see any message removing the blocks, so I need the whole log to debug this.
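
A minimal sketch of the count() suggestion, reusing the outdata stream name from later in the thread (the println is only illustrative): take(n) may evaluate just enough partitions to return n elements, while count() scans every partition and therefore forces the full computation.

    outdata.foreachRDD { rdd =>
      // count() touches every partition, so the whole pipeline is executed
      val n = rdd.count()
      println(s"batch contained $n records")
    }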

Re: Streaming scheduling delay

2015-02-12 Thread Cody Koeninger
outdata.foreachRDD( rdd => rdd.foreachPartition(rec => { val writer = new KafkaOutputService(otherConf("kafkaProducerTopic").toString, propsMap) writer.output(rec) }) ) So this is creating a new kafka producer for every n
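
One common fix for this pattern (a sketch of the general idea, not necessarily what Cody goes on to recommend; KafkaSink, the broker address, and the topic name are all assumptions, and this uses the newer kafka-clients producer API) is to hold a single producer per executor JVM instead of constructing one inside every foreachPartition call:

    import java.util.Properties
    import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

    object KafkaSink {
      // One producer per executor JVM, created lazily on first use
      lazy val producer: KafkaProducer[String, String] = {
        val props = new Properties()
        props.put("bootstrap.servers", "broker1:9092") // assumption
        props.put("key.serializer",
          "org.apache.kafka.common.serialization.StringSerializer")
        props.put("value.serializer",
          "org.apache.kafka.common.serialization.StringSerializer")
        new KafkaProducer[String, String](props)
      }
    }

    outdata.foreachRDD { rdd =>
      rdd.foreachPartition { records =>
        // Reuse the per-JVM producer instead of building a new one per partition
        records.foreach { rec =>
          KafkaSink.producer.send(
            new ProducerRecord[String, String]("myTopic", rec.toString))
        }
      }
    }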

Re: Using Spark SQL for temporal data

2015-02-12 Thread Corey Nolet
Michael, I haven't been paying close attention to the JIRA tickets for PrunedFilteredScan but I noticed some weird behavior around the filters being applied when OR expressions were used in the WHERE clause. From what I was seeing, it looks like it could be possible that the "start" and "end" rang

Re: Using Spark SQL for temporal data

2015-02-12 Thread Corey Nolet
Ok. I just verified that this is the case with a little test: WHERE (a = 'v1' and b = 'v2') => PrunedFilteredScan passes down 2 filters; WHERE (a = 'v1' and b = 'v2') or (a = 'v3') => PrunedFilteredScan passes down 0 filters. On Fri, Feb 13, 2015 at 12:28 AM, Corey Nolet wrote: > Michael, > > I haven
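
For reference, a sketch of where those filters arrive in a data source (written against the org.apache.spark.sql.sources API roughly as it looks in 1.3; package paths differed slightly in 1.2, and LoggingRelation is a made-up example):

    import org.apache.spark.rdd.RDD
    import org.apache.spark.sql.{Row, SQLContext}
    import org.apache.spark.sql.sources.{BaseRelation, Filter, PrunedFilteredScan}
    import org.apache.spark.sql.types.{StringType, StructField, StructType}

    class LoggingRelation(override val sqlContext: SQLContext)
        extends BaseRelation with PrunedFilteredScan {

      override def schema: StructType =
        StructType(Seq(StructField("a", StringType), StructField("b", StringType)))

      override def buildScan(requiredColumns: Array[String],
                             filters: Array[Filter]): RDD[Row] = {
        // With "WHERE a = 'v1' AND b = 'v2'" this prints two EqualTo filters;
        // per the test above, adding an OR branch makes the array arrive empty,
        // so a relation must treat pushed-down filters as an optimization only.
        filters.foreach(f => println(s"pushed down: $f"))
        sqlContext.sparkContext.emptyRDD[Row]
      }
    }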

Re: saveAsHadoopFile is not a member of ... RDD[(String, MyObject)]

2015-02-12 Thread Vladimir Protsenko
Thank you. That worked. 2015-02-12 20:03 GMT+04:00 Imran Rashid : > You need to import the implicit conversions to PairRDDFunctions with > > import org.apache.spark.SparkContext._ > > (note that this requirement will go away in 1.3: > https://issues.apache.org/jira/browse/SPARK-4397) > > On Thu,
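
For completeness, a minimal sketch of the fix described above (the output path and dummy data are assumptions):

    import org.apache.hadoop.io.Text
    import org.apache.hadoop.mapred.TextOutputFormat
    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.SparkContext._ // pair-RDD implicits (needed before 1.3)

    object SaveExample {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("save-example"))
        val pairs = sc.parallelize(Seq(("a", "1"), ("b", "2")))
        // Without the import above, saveAsHadoopFile is
        // "not a member of RDD[(String, String)]".
        pairs.saveAsHadoopFile("hdfs:///tmp/pairs-out", // path is an assumption
          classOf[Text], classOf[Text], classOf[TextOutputFormat[Text, Text]])
        sc.stop()
      }
    }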

Re: HiveContext in SparkSQL - concurrency issues

2015-02-12 Thread Felix C
Your earlier call stack clearly states that it fails because the Derby metastore has already been started by another instance, so I think that is explained by your attempt to run this concurrently. Are you running Spark standalone? Do you have a cluster? You should be able to run spark in yarn-

Re: Streaming scheduling delay

2015-02-12 Thread Saisai Shao
Hi Tim, I think this code will still introduce a shuffle even when you call repartition on each input stream. Actually, this style of implementation will generate more jobs (one job per input stream) than unioning them into one stream with DStream.union(), and union normally has no special overhead as
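
A sketch of the union-first pattern being described (the stream count, topic, ZooKeeper address, and partition number are all made up):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka.KafkaUtils

    val ssc = new StreamingContext(
      new SparkConf().setAppName("union-example"), Seconds(20))
    val kafkaStreams = (1 to 5).map { _ =>
      KafkaUtils.createStream(ssc, "zk1:2181", "my-group", Map("myTopic" -> 1))
    }
    // Union produces a single job per batch for all inputs; repartition once on
    // the unified stream rather than on every input stream separately.
    val unified = ssc.union(kafkaStreams).repartition(50)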

Re: saveAsHadoopFile is not a member of ... RDD[(String, MyObject)]

2015-02-12 Thread Vladimir Protsenko
Thanks for the reply. I solved the problem by importing org.apache.spark.SparkContext._ as Imran Rashid suggested. Out of interest, is JavaPairRDD intended for use from Java? What is the purpose of this class? Is my rdd implicitly converted to it in some circumstances? 2015-02-12 19:42 GM

Re: Can spark job server be used to visualize streaming data?

2015-02-12 Thread Su She
Thanks Kevin for the link. I have had issues trying to install Zeppelin, as I believe it is not yet supported for CDH 5.3 and Spark 1.2. Please correct me if I am mistaken. On Thu, Feb 12, 2015 at 7:33 PM, Kevin (Sangwoo) Kim wrote: > Apache Zeppelin also has a scheduler and then you can reload

Re: Using Spark SQL for temporal data

2015-02-12 Thread Michael Armbrust
> > I haven't been paying close attention to the JIRA tickets for > PrunedFilteredScan but I noticed some weird behavior around the filters > being applied when OR expressions were used in the WHERE clause. From what > I was seeing, it looks like it could be possible that the "start" and "end" > ra

Re: 8080 port password protection

2015-02-12 Thread Akhil Das
Just to add to what Arush said, you can go through these links: http://stackoverflow.com/questions/1162375/apache-port-proxy http://serverfault.com/questions/153229/password-protect-and-serve-apache-site-by-port-number Thanks Best Regards On Thu, Feb 12, 2015 at 10:43 PM, Arush Kharbanda < ar..

Re: Master dies after program finishes normally

2015-02-12 Thread Akhil Das
Increasing your driver memory might help. Thanks Best Regards On Fri, Feb 13, 2015 at 12:09 AM, Manas Kar wrote: > Hi, > I have a Hidden Markov Model running with 200MB data. > Once the program finishes (i.e. all stages/jobs are done) the program > hangs for 20 minutes or so before killing ma
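
If the driver is accumulating large job/stage metadata or results at the end of the run, the usual knob is the driver heap at submit time (the 4g value, class name, and jar name below are only placeholders):

    spark-submit --driver-memory 4g --class com.example.HmmJob my-app.jar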

Why are there different "parts" in my CSV?

2015-02-12 Thread Su She
Hello Everyone, I am writing simple word counts to HDFS using messages.saveAsHadoopFiles("hdfs://user/ec2-user/","csv",String.class, String.class, (Class) TextOutputFormat.class); 1) However, every 2 seconds I am getting a new *directory* that is titled as a csv. So I'll have test.csv, which will be

An interesting and serious problem I encountered

2015-02-12 Thread Landmark
Hi folks, My Spark cluster has 8 machines, each of which has 377GB of physical memory, so the total maximum memory that can be used for Spark is more than 2400GB. In my program, I have to deal with 1 billion (key, value) pairs, where the key is an integer and the value is an integer array with 43

How to sum up the values in the columns of a dataset in Scala?

2015-02-12 Thread Carter
I am new to Scala. I have a dataset with many columns, each column has a column name. Given several column names (these column names are not fixed, they are generated dynamically), I need to sum up the values of these columns. Is there an efficient way of doing this? I worked out a way by using f
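
One way to do this in plain Scala (a sketch only; the header, the rows, and the requested column names are invented for illustration): index the header once, then sum the selected columns in a single pass over the rows.

    val header = Array("id", "sales", "returns", "profit") // assumption
    val rows: Seq[Array[Double]] = Seq(                     // assumption
      Array(1.0, 10.0, 2.0, 8.0),
      Array(2.0, 20.0, 5.0, 15.0))

    val wanted = Seq("sales", "profit")                     // dynamically chosen names
    val idx = wanted.map(name => header.indexOf(name))      // column positions

    // Sum each requested column across all rows in one fold.
    val sums = rows.foldLeft(Array.fill(idx.size)(0.0)) { (acc, row) =>
      idx.zipWithIndex.foreach { case (col, i) => acc(i) += row(col) }
      acc
    }
    // sums == Array(30.0, 23.0) for "sales" and "profit"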

Re: Streaming scheduling delay

2015-02-12 Thread Tim Smith
I replaced the writeToKafka statements with a rdd.count() and sure enough, I have a stable app with total delay well within the batch window (20 seconds). Here's the total delay lines from the driver log: 15/02/13 06:14:26 INFO JobScheduler: Total delay: 6.521 s for time 142380806 ms (execution

Re: Streaming scheduling delay

2015-02-12 Thread Saisai Shao
Hi Tim, I think maybe you can try this way: create a Receiver per executor and specify more than one thread per topic, so the total number of consumer threads will be: total consumers = (receiver number) * (thread number). Make sure this total is less than or equal to the number of Kafka partitio
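
A sketch of that arithmetic in code, reusing the ssc and imports from the union example above (all numbers and names are illustrative): with 10 receivers and 2 consumer threads each you get 20 consumer threads, which should not exceed the topic's partition count.

    val numReceivers = 10
    val threadsPerReceiver = 2 // consumer threads per topic, per receiver
    // total consumers = numReceivers * threadsPerReceiver = 20
    val streams = (1 to numReceivers).map { _ =>
      KafkaUtils.createStream(ssc, "zk1:2181", "my-group",
        Map("myTopic" -> threadsPerReceiver))
    }
    val input = ssc.union(streams)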

Re: Streaming scheduling delay

2015-02-12 Thread Tim Smith
Hi Saisai, If I understand correctly, you are suggesting controlling parallelism by having the number of consumers/executors at least 1:1 with the number of Kafka partitions. For example, if I have 50 partitions for a Kafka topic then either have: - 25 or more executors, 25 receivers, each receiver set to

Re: Streaming scheduling delay

2015-02-12 Thread Saisai Shao
Yes, you can try it. For example, if you have a cluster of 10 executors and 60 Kafka partitions, you can try 10 receivers * 2 consumer threads, so each thread will consume 3 partitions ideally; if you increase the threads to 6, each thread will consume 1 partition ideally. What I think imp

Re: Why are there different "parts" in my CSV?

2015-02-12 Thread Akhil Das
For a streaming application, for every batch it will create a new directory and put the data in it. If you don't want to have multiple files inside the directory as part- then you can do a repartition before the saveAs* call. messages.repartition(1).saveAsHadoopFiles("hdfs://user/ec2-user/","cs
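
A completed form of that call (a sketch only: the prefix, suffix, and output-format choice are guesses mirroring the original question). Note that each batch interval still gets its own directory; repartition(1) just leaves a single part file inside each one.

    import org.apache.hadoop.mapred.TextOutputFormat

    messages.repartition(1).saveAsHadoopFiles(
      "hdfs://user/ec2-user/test", "csv",
      classOf[String], classOf[String],
      classOf[TextOutputFormat[String, String]])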
