NoSuchMethodError : org.apache.spark.streaming.scheduler.StreamingListenerBus.start()V

2015-08-04 Thread Deepesh Maheshwari
Hi, I am trying to read data from Kafka and process it using Spark. I have attached my source code and error log. For integrating Kafka, I have added this dependency in pom.xml: org.apache.spark / spark-streaming_2.10 / 1.3.0, org.apache.spark
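A NoSuchMethodError on an internal Spark class such as StreamingListenerBus almost always points to mixed Spark artifact versions on the classpath (this thread declares spark-streaming 1.3.0, while a companion thread from the same author uses spark-core 1.4.1). A minimal sbt sketch with the streaming and Kafka artifacts aligned to a single version (versions illustrative):

    libraryDependencies ++= Seq(
      "org.apache.spark" %% "spark-core"            % "1.4.1" % "provided",
      "org.apache.spark" %% "spark-streaming"       % "1.4.1" % "provided",
      "org.apache.spark" %% "spark-streaming-kafka" % "1.4.1"
    )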

Unable to load native-hadoop library for your platform

2015-08-04 Thread Deepesh Maheshwari
Hi, When I run Spark locally on Windows it gives the Hadoop library error below. I am using this Spark version: org.apache.spark / spark-core_2.10 / 1.4.1. 2015-08-04 12:22:23,463 WARN (org.apache.hadoop.util.NativeCodeLoader:62) - Unable to load nativ

Re: large scheduler delay in pyspark

2015-08-04 Thread Davies Liu
On Mon, Aug 3, 2015 at 9:00 AM, gen tang wrote: > Hi, > > Recently, I met some problems about scheduler delay in pyspark. I worked > several days on this problem, but not success. Therefore, I come to here to > ask for help. > > I have a key_value pair rdd like rdd[(key, list[dict])] and I tried t

Re: Unable to load native-hadoop library for your platform

2015-08-04 Thread Sean Owen
You can ignore it entirely. It just means you haven't installed and configured native libraries for things like accelerated compression, but it has no negative impact otherwise. On Tue, Aug 4, 2015 at 8:11 AM, Deepesh Maheshwari wrote: > Hi, > > When i run the spark locally on windows it gives be

Re: Unable to load native-hadoop library for your platform

2015-08-04 Thread Deepesh Maheshwari
Can you elaborate on the things this native library covers? One you mentioned is accelerated compression. It would be very helpful if you could give a useful link to read more about it. On Tue, Aug 4, 2015 at 12:56 PM, Sean Owen wrote: > You can ignore it entirely. It just means you haven'

Re: Unable to load native-hadoop library for your platform

2015-08-04 Thread Sean Owen
It won't affect you if you're not actually running Hadoop. But it's mainly things like Snappy/LZO compression which are implemented as native libraries under the hood. Spark doesn't necessarily use these anyway; it's from the Hadoop libs. On Tue, Aug 4, 2015 at 8:30 AM, Deepesh Maheshwari wrote:

Re: Unable to load native-hadoop library for your platform

2015-08-04 Thread Deepesh Maheshwari
Hi Sean, Thanks for the information and clarification. On Tue, Aug 4, 2015 at 1:04 PM, Sean Owen wrote: > It won't affect you if you're not actually running Hadoop. But it's > mainly things like Snappy/LZO compression which are implemented as > native libraries under the hood. Spark doesn't ne

Twitter live Streaming

2015-08-04 Thread Sadaf
Hi, Is there any way to get all old tweets since the account was created, using Spark Streaming and Twitter's API? Currently my connector is showing those tweets that get posted after the program runs. I've done this task using Spark Streaming and a custom receiver using the "twitter user api". Than

Re: Twitter live Streaming

2015-08-04 Thread Enno Shioji
If you want to do it through streaming API you have to pay Gnip; it's not free. You can go through non-streaming Twitter API and convert it to stream yourself though. > On 4 Aug 2015, at 09:29, Sadaf wrote: > > Hi > Is there any way to get all old tweets since when the account was created >

Re: How do I Process Streams that span multiple lines?

2015-08-04 Thread Akhil Das
If you are using Kafka, then you can basically push an entire file as a message to Kafka. In that case in your DStream, you will receive the single message which is the contents of the file and it can of course span multiple lines. Thanks Best Regards On Mon, Aug 3, 2015 at 8:27 PM, Spark Enthusi
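A sketch of that approach with the direct Kafka stream, assuming a StreamingContext ssc (broker address and topic name are illustrative): each message arrives as one string holding the whole file, which can then be split back into lines or parsed as a unit.

    import kafka.serializer.StringDecoder
    import org.apache.spark.streaming.kafka.KafkaUtils

    val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")
    val files = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, Set("file-topic")).map(_._2)   // one Kafka message = one whole file
    val lines = files.flatMap(_.split("\n"))           // recover per-line records if needed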

Re: Twitter Connector-Spark Streaming

2015-08-04 Thread Akhil Das
You will have to write your own consumer for pulling your custom feeds, and then you can do a union (customfeedDstream.union(twitterStream)) with the twitter stream api. Thanks Best Regards On Tue, Aug 4, 2015 at 2:28 PM, Sadaf Khan wrote: > Thanks alot :) > > One more thing that i want to ask
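A sketch of the union described above, assuming a StreamingContext ssc and a hypothetical custom Receiver[String] called MyFeedReceiver:

    import org.apache.spark.streaming.twitter.TwitterUtils

    val tweets = TwitterUtils.createStream(ssc, None).map(_.getText)
    val customFeed = ssc.receiverStream(new MyFeedReceiver())   // MyFeedReceiver: hypothetical custom receiver
    val combined = customFeed.union(tweets)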

Re: Writing to HDFS

2015-08-04 Thread Akhil Das
Just to add rdd.take(1) won't trigger the entire computation, it will just pull out the first record. You need to do a rdd.count() or rdd.saveAs*Files to trigger the complete pipeline. How many partitions do you see in the last stage? Thanks Best Regards On Tue, Aug 4, 2015 at 7:10 AM, ayan guha
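In other words, only an action over the whole RDD forces every partition to be computed; a small sketch of the difference (output path illustrative):

    rdd.take(1)                              // only computes enough partitions to return one record
    rdd.count()                              // runs the complete pipeline
    rdd.saveAsTextFile("hdfs:///tmp/out")    // also runs it, writing every partition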

Transform MongoDB Aggregation into Spark Job

2015-08-04 Thread Deepesh Maheshwari
Hi, I am new to Apache Spark and exploring Spark+Kafka integration to process data using Spark, which I did earlier with MongoDB Aggregation. I am not able to figure out how to handle my use case. Mongo Document : { "_id" : ObjectId("55bfb3285e90ecbfe37b25c3"), "url" : " http://www.z.com/ne

Re: About memory leak in spark 1.4.1

2015-08-04 Thread Sea
How many machines are there in your standalone cluster? I am not using Tachyon. GC can not help me... Can anyone help? My configuration: spark.deploy.spreadOut false spark.eventLog.enabled true spark.executor.cores 24 spark.ui.retainedJobs 10 spark.ui.retainedStages 10 spark.history.retai

Re: How to control Spark Executors from getting Lost when using YARN client mode?

2015-08-04 Thread Jeff Zhang
Please check the node manager logs to see why the container is killed. On Mon, Aug 3, 2015 at 11:59 PM, Umesh Kacha wrote: > Hi all any help will be much appreciated my spark job runs fine but in the > middle it starts losing executors because of netafetchfailed exception > saying shuffle not f

Re: Extremely poor predictive performance with RF in mllib

2015-08-04 Thread Yanbo Liang
It looks like the predicted result is just the opposite of the expectation, so could you check whether the label is right? Or could you share some data which can help to reproduce this output? 2015-08-03 19:36 GMT+08:00 Barak Gitsis : > hi, > I've run into some poor RF behavior, although not as pronoun

Total delay per batch in a CSV file

2015-08-04 Thread allonsy
Hi everyone, I'm working with Spark Streaming, and I need to perform some offline performance measures. What I'd like to have is a CSV file that reports something like this: *Batch number/timestamp | Input Size | Total Delay* which is in fact similar to what the UI outputs. I tried t

Fwd: Writing streaming data to cassandra creates duplicates

2015-08-04 Thread Priya Ch
Yes...union would be one solution. I am not doing any aggregation hence reduceByKey would not be useful. If I use groupByKey, messages with same key would be obtained in a partition. But groupByKey is very expensive operation as it involves shuffle operation. My ultimate goal is to write the messag

Re: TFIDF Transformation

2015-08-04 Thread Yanbo Liang
It cannot translate the number back to the word unless you store the mapping yourself. 2015-07-31 1:45 GMT+08:00 hans ziqiu li : > Hello spark users! > > I am having some troubles with the TFIDF in MLlib and was wondering if > anyone can point me to the right direction. > > The data ingestion
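A sketch of keeping such a map yourself, assuming documents is the RDD[Seq[String]] of tokenized text fed to HashingTF (hashing can send several terms to the same index, so the reverse map is one-to-many):

    import org.apache.spark.mllib.feature.HashingTF

    val hashingTF = new HashingTF()
    val indexToTerms = documents.flatMap(identity).distinct()
      .map(term => (hashingTF.indexOf(term), term))
      .groupByKey()                          // index -> all terms that hash to it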

Re: About memory leak in spark 1.4.1

2015-08-04 Thread Igor Berman
Sorry, can't disclose info about my prod cluster. Nothing jumps into my mind regarding your config. We don't use lz4 compression, and I don't know what spark.deploy.spreadOut is (there is no documentation regarding this). If you are sure that you don't have a memory leak in your business logic I would try to

Re: Writing streaming data to cassandra creates duplicates

2015-08-04 Thread Gerard Maas
(removing dev from the to: as not relevant) it would be good to see some sample data and the cassandra schema to have a more concrete idea of the problem space. Some thoughts: reduceByKey could still be used to 'pick' one element. example of arbitrarily choosing the first one: reduceByKey{case (e
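A sketch of that "pick one element per key" idea before the Cassandra write, assuming a pair RDD named pairs and the spark-cassandra-connector (keyspace and table names illustrative):

    import com.datastax.spark.connector._

    val deduped = pairs.reduceByKey { case (first, _) => first }   // arbitrarily keep one value per key
    deduped.saveToCassandra("my_keyspace", "my_table")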

Re: TFIDF Transformation

2015-08-04 Thread clark djilo kuissu
Hi, I had the same problem and I didn't find the solution. I used Word2Vec instead. I am interested in the solution to this problem of how to go back from the TF-IDF hashing to the word. Regards, Clark On Tuesday, 4 August 2015 at 13:03, Yanbo Liang wrote: It can not translate the number

Re: Setting a stage timeout

2015-08-04 Thread William Kinney
Yes I upgraded but I would still like to set an overall stage timeout. Does that exist? On Fri, Jul 31, 2015 at 1:13 PM, Ted Yu wrote: > The referenced bug has been fixed in 1.4.0, are you able to upgrade ? > > Cheers > > On Fri, Jul 31, 2015 at 10:01 AM, William Kinney > wrote: > >> Hi, >> >>

Re: Safe to write to parquet at the same time?

2015-08-04 Thread Cheng Lian
It should be safe for Spark 1.4.1 and later versions. Now Spark SQL adds a job-wise UUID to output file names to distinguish files written by different write jobs. So those two write jobs you gave should play well with each other. And the job committed later will generate a summary file for al

Re: Parquet SaveMode.Append Trouble.

2015-08-04 Thread Cheng Lian
You need to import org.apache.spark.sql.SaveMode Cheng On 7/31/15 6:26 AM, satyajit vegesna wrote: Hi, I am new to using Spark and Parquet files. Below is what I am trying to do, on spark-shell: val df = sqlContext.parquetFile("/data/LM/Parquet/Segment/pages/part-m-0.gz.parquet") Hav
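With that import in place, an append along the lines of the question looks roughly like this (a sketch assuming Spark 1.4's DataFrameWriter API; the output path is illustrative):

    import org.apache.spark.sql.SaveMode

    val df = sqlContext.read.parquet("/data/LM/Parquet/Segment/pages/part-m-0.gz.parquet")
    df.write.mode(SaveMode.Append).parquet("/data/LM/Parquet/Segment/output")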

Delete NA in a dataframe

2015-08-04 Thread clark djilo kuissu
Hello, I try to manage NA in this dataset. I import my dataset with the com.databricks.spark.csv package. When I do this: allyears2k.na.drop() I have no result. Can you help me please? Regards, --- The dataset: https://s3.amazonaws.com/h2o-

Re: Schedule lunchtime today for a free webinar "IoT data ingestion in Spark Streaming using Kaa" 11 a.m. PDT (2 p.m. EDT)

2015-08-04 Thread orozvadovskyy
Hi there! If you missed our webinar on "IoT data ingestion in Spark with KaaIoT", see the video and slides here: http://goo.gl/VMyQ1M We recorded our webinar on “IoT data ingestion in Spark Streaming using Kaa” for those who couldn’t see it live or who would like to refresh what they have le

Re: Twitter live Streaming

2015-08-04 Thread Filli Alem
Hi Sadaf, I'm currently struggling with Twitter Streaming as well. I can't get it working using the simple setup below. I use Spark 1.2 and I replaced twitter4j v3 with v4. Am I doing something wrong? How are you doing this? twitter4j.conf.Configuration conf = new twitter4j.conf.ConfigurationBui

Re: Delete NA in a dataframe

2015-08-04 Thread Peter Rudenko
Hi Clark, the problem is that in this dataset null values are represented as the NA marker. Spark-csv doesn't have a configurable null-value marker (I've made a PR for it some time ago: https://github.com/databricks/spark-csv/pull/76). So one option for you is to do post filtering, something like thi
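A sketch of that post-filtering, assuming the column of interest is called "Origin" (any column name would do): na.drop() only removes true nulls, so the literal "NA" strings have to be filtered explicitly.

    val cleaned = allyears2k.filter(allyears2k("Origin") !== "NA")   // drop rows carrying the NA marker
    cleaned.count()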

Re: Delete NA in a dataframe

2015-08-04 Thread clark djilo kuissu
Thank you Peter, I'll try this. On Tuesday, 4 August 2015 at 15:02, Peter Rudenko wrote: Hi Clark, the problem is that in this dataset null values are represented as the NA marker. Spark-csv doesn't have a configurable null-value marker (I've made a PR for it some time ago: https://github.com/da

Re: About memory leak in spark 1.4.1

2015-08-04 Thread Barak Gitsis
Maybe try reducing spark.executor.cores; perhaps your tasks have large off-heap overhead and it is better to have fewer tasks running in parallel. Is it a streaming job? On Tue, Aug 4, 2015 at 2:14 PM Igor Berman wrote: > sorry, can't disclose info about my prod cluster > > nothing jumps into my mind regardin

Re: spark streaming max receiver rate doubts

2015-08-04 Thread Cody Koeninger
Those jobs will still be created for each valid time, they just may not have many messages in them On Mon, Aug 3, 2015 at 11:11 PM, Shushant Arora wrote: > 1.In spark 1.3(Non receiver) - If my batch interval is 1 sec and I don't > set spark.streaming.kafka.maxRatePerPartition - so default behav

giving offset in spark sql

2015-08-04 Thread Hafiz Mujadid
Hi all! I want to skip the first n rows of a DataFrame. This is done in normal SQL using the OFFSET keyword. How can we achieve this in Spark SQL? Thanks -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/giving-offset-in-spark-sql-tp24130.html Sent from the Apache Spar
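Spark SQL 1.x has no OFFSET keyword; one workaround (only meaningful if the DataFrame has a well-defined ordering) is to index the rows and filter, sketched here with n and the DataFrame name as placeholders:

    val n = 10
    val remaining = df.rdd.zipWithIndex()
      .filter { case (_, idx) => idx >= n }
      .map { case (row, _) => row }
    val withoutFirstN = sqlContext.createDataFrame(remaining, df.schema)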

Re: About memory leak in spark 1.4.1

2015-08-04 Thread Ted Yu
w.r.t. spark.deploy.spreadOut , here is the scaladoc: // As a temporary workaround before better ways of configuring memory, we allow users to set // a flag that will perform round-robin scheduling across the nodes (spreading out each app // among all the nodes) instead of trying to consolid

Re: Does RDD.cartesian involve shuffling?

2015-08-04 Thread Richard Marscher
Yes it does, in fact it's probably going to be one of the more expensive shuffles you could trigger. On Mon, Aug 3, 2015 at 12:56 PM, Meihua Wu wrote: > Does RDD.cartesian involve shuffling? > > Thanks! > > - > To unsubscribe, e

Re: Topology.py -- Cannot run on Spark Gateway on Cloudera 5.4.4.

2015-08-04 Thread Upen N
Hi Guru, It was a no brainer issue. I had to create HDFS user ec2-user to make it work. It worked like a charm after that. Thanks Upender On Mon, Aug 3, 2015 at 10:27 PM, Guru Medasani wrote: > Hi Upen, > > Did you deploy the client configs after assigning the gateway roles? You > should be abl

Re: How to increase parallelism of a Spark cluster?

2015-08-04 Thread Richard Marscher
I think you did a good job of summarizing terminology and describing Spark's operation. However #7 is inaccurate if I am interpreting correctly. The scheduler schedules X tasks from the current stage across all executors, where X is the number of cores assigned to the application (assuming only

Re: Does RDD.cartesian involve shuffling?

2015-08-04 Thread Meihua Wu
Thanks, Richard! I basically have two RDDs, A and B, and I need to compute a value for every pair (a, b) for a in A and b in B. My first thought is cartesian, but it involves an expensive shuffle. Any alternatives? How about I convert B to an array and broadcast it to every node (assuming B is rela

Re: Does RDD.cartesian involve shuffling?

2015-08-04 Thread Richard Marscher
That is the only alternative I'm aware of; if either A or B is small enough to broadcast then you'd at least be doing the cartesian products all locally without needing to also transmit and shuffle A. Unless Spark somehow optimizes cartesian product and only transfers the smaller RDD across the network
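A sketch of that broadcast alternative, assuming B is small enough to collect and compute(a, b) stands in for the per-pair work:

    val bLocal = sc.broadcast(rddB.collect())
    val pairValues = rddA.flatMap { a =>
      bLocal.value.map(b => ((a, b), compute(a, b)))   // compute() is a placeholder for the real per-pair function
    }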

No Twitter Input from Kafka to Spark Streaming

2015-08-04 Thread narendra
My application takes Twitter4j tweets and publishes those to a topic in Kafka. Spark Streaming subscribes to that topic for processing. But in practice, Spark Streaming is not able to receive tweet data from Kafka, so Spark Streaming is running empty batch jobs without input and I am not able to see

Re: No Twitter Input from Kafka to Spark Streaming

2015-08-04 Thread Cody Koeninger
Have you tried using the console consumer to see if anything is actually getting published to that topic? On Tue, Aug 4, 2015 at 11:45 AM, narendra wrote: > My application takes Twitter4j tweets and publishes those to a topic in > Kafka. Spark Streaming subscribes to that topic for processing. B

Re: Spark SQL support for Hive 0.14

2015-08-04 Thread Steve Loughran
Spark 1.3.1 & 1.4 only support Hive 0.13. Spark 1.5 is going to be released against Hive 1.2.1; it'll skip Hive 0.14 support entirely and go straight to the currently supported Hive release. See SPARK-8064 for the gory details > On 3 Aug 2015, at 23:01, Ishwardeep Singh > wrote: > > Hi, > > D

Re: Unable to load native-hadoop library for your platform

2015-08-04 Thread Steve Loughran
Think it may be needed on Windows, certainly if you start trying to work with local files. > On 4 Aug 2015, at 00:34, Sean Owen wrote: > > It won't affect you if you're not actually running Hadoop. But it's > mainly things like Snappy/LZO compression which are implemented as > native librarie

Re: Extremely poor predictive performance with RF in mllib

2015-08-04 Thread Patrick Lam
Yes, I rechecked and the label is correct. As you can see in the code posted, I used the exact same features for all the classifiers so unless rf somehow switches the labels, it should be correct. I have posted a sample dataset and sample code to reproduce what I'm getting here: https://github.co

Re: Transform MongoDB Aggregation into Spark Job

2015-08-04 Thread Jörn Franke
Hi, I think the combination of Mongodb and Spark is a little bit unlucky. Why don't you simply use mongodb? If you want to process a lot of data you should use hdfs or cassandra as storage. Mongodb is not suitable for heterogeneous processing of large scale data. Best regards Best regards, L

Re: Repartition question

2015-08-04 Thread Richard Marscher
Hi, it is possible to control the number of partitions for the RDD without calling repartition by setting the max split size for the hadoop input format used. Tracing through the code, XmlInputFormat extends FileInputFormat which determines the number of splits (which NewHadoopRdd uses to determin
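A sketch of capping the split size (property name per Hadoop 2; the 16 MB value and path are illustrative, and XmlInputFormat is assumed to be the new-API input format class already used in the thread and available on the classpath):

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.io.{LongWritable, Text}

    val conf = new Configuration(sc.hadoopConfiguration)
    conf.set("mapreduce.input.fileinputformat.split.maxsize", (16 * 1024 * 1024).toString)
    val xml = sc.newAPIHadoopFile("hdfs:///data/in.xml",
      classOf[XmlInputFormat], classOf[LongWritable], classOf[Text], conf)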

Re: How does the # of tasks affect # of threads?

2015-08-04 Thread Elkhan Dadashov
Hi Connor, Spark creates cached thread pool in Executor for executing the tasks: // Start worker thread pool *private val threadPool = ThreadUtils.newDaemonCachedThreadPool("Executor task la

Re: Spark SQL support for Hive 0.14

2015-08-04 Thread Michael Armbrust
I'll add that while Spark SQL 1.5 compiles against Hive 1.2.1, it has support for reading from metastores for Hive 0.12 - 1.2.1 On Tue, Aug 4, 2015 at 9:59 AM, Steve Loughran wrote: > Spark 1.3.1 & 1.4 only support Hive 0.13 > > Spark 1.5 is going to be released against Hive 1.2.1; it'll skip Hi

Re: Writing streaming data to cassandra creates duplicates

2015-08-04 Thread gaurav sharma
Ideally the 2 messages read from Kafka must differ on at least some parameter, or else they are logically the same. As a solution to your problem, if the message content is the same, you could create a new field UUID, which might play the role of partition key while inserting the 2 messages in Cassandra. Msg1 -
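A sketch of that surrogate-key idea (how the field maps into the Cassandra row is up to your schema):

    import java.util.UUID

    val keyed = messages.map(msg => (UUID.randomUUID().toString, msg))   // identical payloads now differ by key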

Spark SQL unable to recognize schema name

2015-08-04 Thread Mohammed Guller
Hi - I am running the Thrift JDBC/ODBC server (v1.4.1) and encountered a problem when querying tables using fully qualified table names (schemaName.tableName). The following query works fine from the beeline tool: SELECT * from test; However, the following query throws an exception, even though

Re: Unable to load native-hadoop library for your platform

2015-08-04 Thread Sean Owen
Oh good point, does the Windows integration need native libs for POSIX-y file system access? I know there are some binaries shipped for this purpose but wasn't sure if that's part of what's covered in the native libs message. On Tue, Aug 4, 2015 at 6:01 PM, Steve Loughran wrote: > Think it may be

Re: scheduler delay time

2015-08-04 Thread maxdml
You'd need to provide information such as executor configuration (#cores, memory size). You might have less scheduler delay with smaller, but more numerous executors, than the contrary. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/scheduler-delay-time-tp6

Re: Spark SQL unable to recognize schema name

2015-08-04 Thread Ted Yu
This should have been fixed by: [SPARK-7943] [SPARK-8105] [SPARK-8435] [SPARK-8714] [SPARK-8561] Fixes multi-database support The fix is in the upcoming 1.5.0 FYI On Tue, Aug 4, 2015 at 11:45 AM, Mohammed Guller wrote: > Hi - > > > > I am running the Thrift JDBC/ODBC server (v1.4.1) and en

Re: Problem submitting a .py script against a standalone cluster.

2015-08-04 Thread Ford Farline
The code is very simple, just a couple of lines. When I launch it, it runs locally but not on the cluster. sc = SparkContext("local", "Tech Companies Feedback") beginning_time = datetime.now() time.sleep(60) print datetime.now() - beginning_time sc.stop() Thanks for your interest, Gonzalo On Fri,

Re: How does the # of tasks affect # of threads?

2015-08-04 Thread Connor Zanin
Elkhan, Thank you for the response. This was a great answer. On Tue, Aug 4, 2015 at 1:47 PM, Elkhan Dadashov wrote: > Hi Connor, > > Spark creates cached thread pool in Executor > > for ex

Poor HDFS Data Locality on Spark-EC2

2015-08-04 Thread Jerry Lam
Hi Spark users and developers, I have been trying to use spark-ec2. After I launched the spark cluster (1.4.1) with ephemeral hdfs (using hadoop 2.4.0), I tried to execute a job where the data is stored in the ephemeral hdfs. It does not matter what I tried to do, there is no data locality at all.

Is SPARK-3322 fixed in latest version of Spark?

2015-08-04 Thread Jim Green
*Symptom:* Even a sample job fails: $ MASTER=spark://xxx:7077 run-example org.apache.spark.examples.SparkPi 10 Pi is roughly 3.140636 ERROR ConnectionManager: Corresponding SendingConnection to ConnectionManagerId(xxx,) not found WARN ConnectionManager: All connections not cleaned up Found https

Re: Is SPARK-3322 fixed in latest version of Spark?

2015-08-04 Thread Jim Green
And also https://issues.apache.org/jira/browse/SPARK-3106 This one is still open. On Tue, Aug 4, 2015 at 6:12 PM, Jim Green wrote: > *Symptom:* > Even a sample job fails: > $ MASTER=spark://xxx:7077 run-example org.apache.spark.examples.SparkPi 10 > Pi is roughly 3.140636 > ERROR ConnectionManager

Turn Off Compression for Textfiles

2015-08-04 Thread Brandon White
How do you turn off gz compression for saving as textfiles? Right now, I am reading .gz files and it is saving them as .gz. I would love to not compress them when I save. 1) DStream.saveAsTextFiles() //no compression 2) RDD.saveAsTextFile() //no compression Any ideas?
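For saveAsTextFile, output compression is governed by the optional codec argument and by the Hadoop output-compression settings (e.g. mapred.output.compress) rather than by the input files; a sketch of both forms, with illustrative paths:

    // plain text: no codec argument (and no output compression forced in the Hadoop conf)
    rdd.saveAsTextFile("hdfs:///tmp/plain-out")
    // compressed, for contrast: pass a codec explicitly
    rdd.saveAsTextFile("hdfs:///tmp/gz-out", classOf[org.apache.hadoop.io.compress.GzipCodec])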

Re: Turn Off Compression for Textfiles

2015-08-04 Thread Philip Weaver
The .gz extension indicates that the file is compressed with gzip. Choose a different extension (e.g. .txt) when you save them. On Tue, Aug 4, 2015 at 7:00 PM, Brandon White wrote: > How do you turn off gz compression for saving as textfiles? Right now, I > am reading ,gz files and it is saving

Re: Difference between RandomForestModel and RandomForestClassificationModel

2015-08-04 Thread Yanbo Liang
The old mllib API will use RandomForest.trainClassifier() to train a RandomForestModel; the new mllib API (AKA ML) will use RandomForestClassifier.train() to train a RandomForestClassificationModel. They will produce the same result for a given dataset. 2015-07-31 1:34 GMT+08:00 Bryan Cutler : >
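A sketch of the two entry points side by side (hyper-parameters illustrative; labeledPoints is an RDD[LabeledPoint] and trainingDF a DataFrame with the default label/features columns):

    import org.apache.spark.mllib.tree.RandomForest
    import org.apache.spark.ml.classification.RandomForestClassifier

    // old mllib API -> RandomForestModel
    val oldModel = RandomForest.trainClassifier(labeledPoints, 2, Map[Int, Int](),
      100, "auto", "gini", 5, 32)

    // new ML (pipeline) API -> RandomForestClassificationModel
    val newModel = new RandomForestClassifier().setNumTrees(100).fit(trainingDF)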

Combining Spark Files with saveAsTextFile

2015-08-04 Thread Brandon White
What is the best way to make saveAsTextFile save as only a single file?

RE: Combining Spark Files with saveAsTextFile

2015-08-04 Thread Mohammed Guller
One option is to use the coalesce method in the RDD class. Mohammed From: Brandon White [mailto:bwwintheho...@gmail.com] Sent: Tuesday, August 4, 2015 7:23 PM To: user Subject: Combining Spark Files with saveAsTextFile What is the best way to make saveAsTextFile save as only a single file?

RE: Combining Spark Files with saveAsTextFile

2015-08-04 Thread Mohammed Guller
Just to further clarify, you can first call coalesce with argument 1 and then call saveAsTextFile. For example, rdd.coalesce(1).saveAsTextFile(...) Mohammed From: Mohammed Guller Sent: Tuesday, August 4, 2015 9:39 PM To: 'Brandon White'; user Subject: RE: Combining Spark Files with saveAsText

control the number of reducers for groupby in data frame

2015-08-04 Thread Fang, Mike
Hi, Does anyone know how I could control the number of reducers when we do an operation such as groupBy for a data frame? I could set spark.sql.shuffle.partitions in SQL but am not sure how to do it with the df.groupBy("XX") API. Thanks, Mike
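The same setting governs the DataFrame API's groupBy shuffle; it can be set programmatically before the aggregation (the value is illustrative):

    sqlContext.setConf("spark.sql.shuffle.partitions", "50")
    val grouped = df.groupBy("XX").count()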

Re: Is SPARK-3322 fixed in latest version of Spark?

2015-08-04 Thread Aaron Davidson
ConnectionManager has been deprecated and is no longer used by default (NettyBlockTransferService is the replacement). Hopefully you would no longer see these messages unless you have explicitly flipped it back on. On Tue, Aug 4, 2015 at 6:14 PM, Jim Green wrote: > And also https://issues.apache

Re: Total delay per batch in a CSV file

2015-08-04 Thread Saisai Shao
Hi, Lots of streaming internal state is exposed through StreamingListener, as well as what you see in the web UI, so you could write your own StreamingListener and register it in the StreamingContext to get the internal information of Spark Streaming and write it to a CSV file. You could check the source code he
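A minimal sketch of such a listener that appends one CSV row per completed batch (file path illustrative; further BatchInfo fields can be added the same way):

    import java.io.{FileWriter, PrintWriter}
    import org.apache.spark.streaming.scheduler.{StreamingListener, StreamingListenerBatchCompleted}

    class CsvBatchListener(path: String) extends StreamingListener {
      override def onBatchCompleted(batch: StreamingListenerBatchCompleted): Unit = {
        val info = batch.batchInfo
        val out = new PrintWriter(new FileWriter(path, true))   // append mode
        try out.println(s"${info.batchTime.milliseconds},${info.schedulingDelay.getOrElse(-1L)},${info.totalDelay.getOrElse(-1L)}")
        finally out.close()
      }
    }

    ssc.addStreamingListener(new CsvBatchListener("/tmp/batch-delays.csv"))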

Re: Combining Spark Files with saveAsTextFile

2015-08-04 Thread Igor Berman
Using coalesce might be dangerous, since 1 worker process will need to handle the whole file, and if the file is huge you'll get OOM. However it depends on the implementation; I'm not sure how it will be done. Nevertheless, it is worth trying the coalesce method (please post your results). Another option would be