Re: The Processing loading of Spark streaming on YARN is not in balance

2015-04-29 Thread Saisai Shao
From the chart you pasted, I guess you only have one receiver with storage level two copies, so most of your tasks are located on two executors. You could use repartition to redistribute the data more evenly across the executors. Adding more receivers is another solution. 2015-04-30 14:38 GMT+08:0
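A minimal sketch of the repartition suggestion above, assuming a single receiver-based Kafka stream; the ZooKeeper address, topic map and partition count are illustrative:
  import org.apache.spark.SparkConf
  import org.apache.spark.streaming.{Seconds, StreamingContext}
  import org.apache.spark.streaming.kafka.KafkaUtils
  val ssc = new StreamingContext(new SparkConf().setAppName("repartition-sketch"), Seconds(5))
  // A single receiver writes its blocks to only one or two executors.
  val lines = KafkaUtils.createStream(ssc, "zkhost:2181", "mygroup", Map("mytopic" -> 1)).map(_._2)
  // Spread the records over more partitions (and hence executors) before the heavy processing.
  val balanced = lines.repartition(8)
  balanced.count().print()
  ssc.start()
  ssc.awaitTermination()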

Re: The Processing loading of Spark streaming on YARN is not in balance

2015-04-29 Thread Kyle Lin
Hello Lin Hao, thanks for your reply. I will try to produce more data into Kafka. I run three Kafka brokers. Following is my topic info.
Topic:kyle_test_topic  PartitionCount:3  ReplicationFactor:2  Configs:
Topic: kyle_test_topic  Partition: 0  Leader: 3  Replicas: 3,4  Isr: 3,4
Topic: kyle_test_topic

Re: The Processing loading of Spark streaming on YARN is not in balance

2015-04-29 Thread Lin Hao Xu
It seems that the data size is only 2.9MB, far less than the default RDD size. How about putting more data into Kafka? And what is the number of topic partitions in Kafka? Best regards, Lin Hao XU IBM Research China Email: xulin...@cn.ibm.com My Flickr: http://www.flickr.com/photos/xulinhao/set

The Processing loading of Spark streaming on YARN is not in balance

2015-04-29 Thread Kyle Lin
Hi all, my environment info: Hadoop release version: HDP 2.1, Kafka: 0.8.1.2.1.4.0, Spark: 1.1.0. My question: I ran a Spark Streaming program on YARN. My Spark Streaming program reads data from Kafka and does some processing. But I found there is always only ONE executor under processing. As

Enabling Event Log

2015-04-29 Thread James King
I'm unclear why I'm getting this exception. It seems to have realized that I want to enable event logging but is ignoring where I want it to log to, i.e. file:/opt/cb/tmp/spark-events, which does exist. spark-default.conf # Example: spark.master spark://master1:7077,master2:7077

spark kryo serialization question

2015-04-29 Thread 邓刚 [技术中心]
Hi all, we know that Spark supports Kryo serialization. Suppose there is a map function which maps C to (K, V) (here C, K, V are instances of classes C, K, V). When we register Kryo serialization, should I register all three of these classes? Best Wishes Triones Deng This email may be confidential. If you are not the intended recipient of this email, please
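A hedged sketch of registering all three classes with Kryo; the class bodies below are placeholders standing in for the real C, K, V:
  import org.apache.spark.SparkConf
  case class C(raw: String)      // placeholder definitions for the three classes
  case class K(id: Long)
  case class V(payload: String)
  val conf = new SparkConf()
    .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    // Register every class whose instances get shuffled or cached: the input C and the output K, V.
    .registerKryoClasses(Array(classOf[C], classOf[K], classOf[V]))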

Event generator for SPARK-Streaming from csv

2015-04-29 Thread anshu shukla
I have the real DEBS taxi data in a CSV file. In order to operate over it, how do I simulate a "Spout"-like event generator using the timestamps in the CSV file? -- SERC-IISC Thanks & Regards, Anshu Shukla

Re: Re: implicit function in SparkStreaming

2015-04-29 Thread Lin Hao Xu
For your question, I think the discussion in this link can help: http://apache-spark-user-list.1001560.n3.nabble.com/Error-related-to-serialisation-in-spark-streaming-td6801.html Best regards, Lin Hao XU IBM Research China Email: xulin...@cn.ibm.com My Flickr: http://www.flickr.com/photos/xulinh

Re: How to install spark in spark on yarn mode

2015-04-29 Thread madhvi
Hi, follow the instructions to install at the following link: http://mbonaci.github.io/mbo-spark/ You don't need to install Spark on every node. Just install it on one node, or you can install it on a remote system and make a Spark cluster. Thanks Madhvi On Thursday 30 April 2015 09:31 AM, xiaoh

How to install spark in spark on yarn mode

2015-04-29 Thread xiaohe lan
Hi experts, I see Spark on YARN has yarn-client and yarn-cluster modes. I also have a 5-node Hadoop cluster (Hadoop 2.4). How do I install Spark if I want to try the Spark-on-YARN mode? Do I need to install Spark on each node of the Hadoop cluster? Thanks, Xiaohe

Re: Too many open files when using Spark to consume messages from Kafka

2015-04-29 Thread Cody Koeninger
Use lsof to see what files are actually being held open. That stacktrace looks to me like it's from the driver, not executors. Where in foreach is it being called? The outermost portion of foreachRDD runs in the driver, the innermost portion runs in the executors. From the docs: https://spark.a
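A sketch of the driver/executor split described here; the stream, JDBC URL and the write logic are placeholders:
  import java.sql.DriverManager
  import org.apache.spark.streaming.dstream.DStream
  def saveToMysql(stream: DStream[String], jdbcUrl: String): Unit = {
    stream.foreachRDD { rdd =>
      // This outer block runs on the driver, once per batch.
      rdd.foreachPartition { records =>
        // This inner block runs on the executors: create the connection here
        // so it never has to be serialized and shipped from the driver.
        val conn = DriverManager.getConnection(jdbcUrl)
        try {
          records.foreach { r =>
            // ... write r through conn ...
          }
        } finally {
          conn.close()
        }
      }
    }
  }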

RE: HOw can I merge multiple DataFrame and remove duplicated key

2015-04-29 Thread Wang, Ningjun (LNG-NPV)
As I understand from SQL, group by allows you to do sum(), average(), max(), min(). But how do I select the entire row in the group with the maximum timeStamp column? For example id1, value1, 2015-01-01 id1, value2, 2015-01-02 id2, value3, 2015-01-01 id2, value4, 2015-01-02 I want to return id1,
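One way to keep the whole latest row per id is to drop to the RDD API and reduce by key; a sketch, assuming the three columns can be read into a small case class:
  import org.apache.spark.SparkContext._
  import org.apache.spark.rdd.RDD
  case class Record(id: String, value: String, timeStamp: java.sql.Timestamp)
  def latestPerId(all: RDD[Record]): RDD[Record] = {
    all.keyBy(_.id)
       .reduceByKey((a, b) => if (a.timeStamp.after(b.timeStamp)) a else b) // keep the newest row per id
       .values
  }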

Re: Re: implicit function in SparkStreaming

2015-04-29 Thread guoqing0...@yahoo.com.hk
Thanks for your help, it works. I'm curious why the enclosing class cannot be serialized; does it need to extend java.io.Serializable? If an object is never serialized, how does it work in the task? And is there any association with spark.closure.serializer? guoqing0...@yahoo.com.hk From:

Re: Compute pairwise distance

2015-04-29 Thread Debasish Das
Cross-join shuffle space might not be needed, since most likely you can cut the shuffle space through application-specific logic (topK etc.). Also, the brute-force approach will most likely serve as a benchmark tool to see how much better your clustering-based KNN solution is, since there are several ways you ca

Re: [Spark SQL] Problems creating a table in specified schema/database

2015-04-29 Thread Michael Armbrust
No, sorry this is not supported. Support for more than one database is lacking in several areas (though mostly works for hive tables). I'd like to fix this in Spark 1.5. On Tue, Apr 28, 2015 at 1:54 AM, James Aley wrote: > Hey all, > > I'm trying to create tables from existing Parquet data in

RE: Sort (order by) of the big dataset

2015-04-29 Thread Ulanov, Alexander
After a day of debugging (actually, more), I can answer my own question: the problem is that the default value of 200 for "spark.sql.shuffle.partitions" is too small for sorting 2B rows. It was hard to realize because Spark executors just crash with various exceptions one by one. The other takeaway is that
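A minimal sketch of raising that setting before the big sort; the partition count and table name are illustrative:
  import org.apache.spark.sql.SQLContext
  val sqlContext = new SQLContext(sc)  // assumes an existing SparkContext `sc`
  // The default of 200 gives each reducer ~10M rows of a 2B-row sort; raise it so each
  // shuffle partition stays comfortably within executor memory.
  sqlContext.setConf("spark.sql.shuffle.partitions", "2000")
  val sorted = sqlContext.sql("SELECT * FROM records ORDER BY key")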

Re: Re: implicit function in SparkStreaming

2015-04-29 Thread Tathagata Das
Could you put the implicit def in an object? That should work, as objects are never serialized. On Wed, Apr 29, 2015 at 6:28 PM, guoqing0...@yahoo.com.hk < guoqing0...@yahoo.com.hk> wrote: > Thank you for your pointers , it`s very helpful to me , in this scenario > how can i use the implicit def
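A sketch of that suggestion applied to the str2KeyValue conversion from the original post; the fallback branch for messages without a separator is a guess:
  import scala.language.implicitConversions
  object Conversions {
    // Living in a top-level object, the implicit pulls no enclosing class into the closure.
    implicit def str2KeyValue(s: String): (String, String) = {
      val message = s.split("\\|")
      if (message.length >= 2) (message(0), message(1))
      else (s, "")  // assumed fallback
    }
  }
  // At the call site: import Conversions._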

Re: Re: implicit function in SparkStreaming

2015-04-29 Thread guoqing0...@yahoo.com.hk
Thank you for your pointers, they're very helpful to me. In this scenario, how can I use the implicit def in the enclosing class? From: Tathagata Das Date: 2015-04-30 07:00 To: guoqing0...@yahoo.com.hk CC: user Subject: Re: implicit function in SparkStreaming I believe that the implicit def is

Re: Too many open files when using Spark to consume messages from Kafka

2015-04-29 Thread Bill Jay
This function is called in foreachRDD. I think it should be running in the executors. I added the statement.close() to the code and it is running. I will let you know if this fixes the issue. On Wed, Apr 29, 2015 at 4:06 PM, Tathagata Das wrote: > Is the function ingestToMysql running on the dri

Kryo serialization of classes in additional jars

2015-04-29 Thread Akshat Aranya
Hi, Is it possible to register kryo serialization for classes contained in jars that are added with "spark.jars"? In my experiment it doesn't seem to work, likely because the class registration happens before the jar is shipped to the executor and added to the classloader. Here's the general ide

Re: Join between Streaming data vs Historical Data in spark

2015-04-29 Thread Tathagata Das
Have you taken a look at the join section in the streaming programming guide? http://spark.apache.org/docs/latest/streaming-programming-guide.html#stream-dataset-joins On Wed, Apr 29, 2015 at 7:11 AM, Rendy Bambang Junior < rendy.b.jun...@gmail.com> wrote: > Let say I have transaction data and v
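A sketch of that stream-dataset join for the visit/transaction example in the question, with illustrative types (userId as the join key):
  import org.apache.spark.SparkContext._
  import org.apache.spark.rdd.RDD
  import org.apache.spark.streaming.dstream.DStream
  // visits:       (userId, visitSource) arriving as a stream
  // transactions: (userId, totalPrice)  loaded once as historical data
  def enrich(visits: DStream[(String, String)],
             transactions: RDD[(String, Double)]): DStream[(String, (String, Double))] = {
    visits.transform { batch =>
      batch.join(transactions)  // joined per micro-batch on the executors
    }
  }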

Re: Driver memory leak?

2015-04-29 Thread Tathagata Das
It could be related to this. https://issues.apache.org/jira/browse/SPARK-6737 This was fixed in Spark 1.3.1. On Wed, Apr 29, 2015 at 8:38 AM, Sean Owen wrote: > Not sure what you mean. It's already in CDH since 5.4 = 1.3.0 > (This isn't the place to ask about CDH) > I also don't think that's

Re: Too many open files when using Spark to consume messages from Kafka

2015-04-29 Thread Tathagata Das
Also cc'ing Cody. @Cody maybe there is a reason for doing connection pooling even if there is no performance difference. TD On Wed, Apr 29, 2015 at 4:06 PM, Tathagata Das wrote: > Is the function ingestToMysql running on the driver or on the executors? > Accordingly you can try debugging whil

Re: Too many open files when using Spark to consume messages from Kafka

2015-04-29 Thread Tathagata Das
Is the function ingestToMysql running on the driver or on the executors? Accordingly you can try debugging while running in a distributed manner, with and without calling the function. If you don't get "too many open files" without calling ingestToMysql(), the problem is likely to be in ingestToMys

Re: implicit function in SparkStreaming

2015-04-29 Thread Tathagata Das
I believe that the implicit def is pulling in the enclosing class (in which the def is defined) in the closure which is not serializable. On Wed, Apr 29, 2015 at 4:20 AM, guoqing0...@yahoo.com.hk < guoqing0...@yahoo.com.hk> wrote: > Hi guys, > I`m puzzled why i cant use the implicit function in

Re: Extra stage that executes before triggering computation with an action

2015-04-29 Thread Tom Hubregtsen
"I'm not sure, but I wonder if because you are using the Spark REPL that it may not be representing what a normal runtime execution would look like and is possibly eagerly running a partial DAG once you define an operation that would cause a shuffle. What happens if you setup your same set of comm

Re: multiple programs compilation by sbt.

2015-04-29 Thread Dan Dong
Hi Ted, I will have a look at it, thanks a lot. Cheers, Dan On 2015-04-29 at 5:00 PM, "Ted Yu" wrote: > Have you looked at > http://www.scala-sbt.org/0.13/tutorial/Multi-Project.html ? > > Cheers > > On Wed, Apr 29, 2015 at 2:45 PM, Dan Dong wrote: > >> Hi, >> Following the Quick Start guide: >

Re: Compute pairwise distance

2015-04-29 Thread ayan guha
This is my first thought, please suggest any further improvement: 1. Create an RDD of your dataset. 2. Do a cross join to generate pairs. 3. Apply reduceByKey and compute the distance. You will get an RDD with key pairs and distances. Best Ayan On 30 Apr 2015 06:11, "Driesprong, Fokko" wrote: > Dear Spark
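A brute-force sketch along these lines, keeping only one of each symmetric pair; the point encoding (indexed dense vectors) and the Euclidean distance are illustrative choices, and the distance is computed directly in the map:
  import org.apache.spark.rdd.RDD
  def pairwiseDistances(points: RDD[(Long, Array[Double])]): RDD[((Long, Long), Double)] = {
    points.cartesian(points)
      .filter { case ((i, _), (j, _)) => i < j }  // drop self-pairs and mirrored pairs
      .map { case ((i, a), (j, b)) =>
        val dist = math.sqrt(a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum)
        ((i, j), dist)
      }
  }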

Re: multiple programs compilation by sbt.

2015-04-29 Thread Ted Yu
Have you looked at http://www.scala-sbt.org/0.13/tutorial/Multi-Project.html ? Cheers On Wed, Apr 29, 2015 at 2:45 PM, Dan Dong wrote: > Hi, > Following the Quick Start guide: > https://spark.apache.org/docs/latest/quick-start.html > > I could compile and run a Spark program successfully, now

Re: HOw can I merge multiple DataFrame and remove duplicated key

2015-04-29 Thread ayan guha
It's no different; you would use group by and an aggregate function to do so. On 30 Apr 2015 02:15, "Wang, Ningjun (LNG-NPV)" wrote: > I have multiple DataFrame objects each stored in a parquet file. The > DataFrame just contains 3 columns (id, value, timeStamp). I need to union > all the DataFra

multiple programs compilation by sbt.

2015-04-29 Thread Dan Dong
Hi, following the Quick Start guide: https://spark.apache.org/docs/latest/quick-start.html I could compile and run a Spark program successfully. Now my question is how to compile multiple programs with sbt in one build. E.g., two programs as: ./src ./src/main ./src/main/scala ./src/main/scala/Sim

Re: Too many open files when using Spark to consume messages from Kafka

2015-04-29 Thread Ted Yu
Maybe add statement.close() in finally block ? Streaming / Kafka experts may have better insight. On Wed, Apr 29, 2015 at 2:25 PM, Bill Jay wrote: > Thanks for the suggestion. I ran the command and the limit is 1024. > > Based on my understanding, the connector to Kafka should not open so many
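A sketch of that suggestion applied to ingestToMysql; the signature beyond "data: Arr...", the JDBC URL and the SQL statement are assumptions:
  import java.sql.DriverManager
  def ingestToMysql(data: Array[String]): Unit = {
    val conn = DriverManager.getConnection("jdbc:mysql://host:3306/db", "user", "pass")
    val stmt = conn.prepareStatement("INSERT INTO results (value) VALUES (?)")
    try {
      data.foreach { row => stmt.setString(1, row); stmt.addBatch() }
      stmt.executeBatch()
    } finally {
      stmt.close()  // close the statement even if the batch fails,
      conn.close()  // otherwise sockets and file descriptors accumulate
    }
  }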

Re: Too many open files when using Spark to consume messages from Kafka

2015-04-29 Thread Bill Jay
Thanks for the suggestion. I ran the command and the limit is 1024. Based on my understanding, the connector to Kafka should not open so many files. Do you think there may be a socket leak? BTW, in every batch, which is 5 seconds, I output some results to MySQL: def ingestToMysql(data: Arr

Re: Too many open files when using Spark to consume messages from Kafka

2015-04-29 Thread Ted Yu
Can you run the command 'ulimit -n' to see the current limit ? To configure ulimit settings on Ubuntu, edit /etc/security/limits.conf Cheers On Wed, Apr 29, 2015 at 2:07 PM, Bill Jay wrote: > Hi all, > > I am using the direct approach to receive real-time data from Kafka in the > following li

Too many open files when using Spark to consume messages from Kafka

2015-04-29 Thread Bill Jay
Hi all, I am using the direct approach to receive real-time data from Kafka in the following link: https://spark.apache.org/docs/1.3.0/streaming-kafka-integration.html My code follows the word count direct example: https://github.com/apache/spark/blob/master/examples/scala-2.10/src/main/scala/

Hardware provisioning for Spark SQl

2015-04-29 Thread Pietro Gentile
Hi all, I have to estimate resource requirements for my Hadoop/Spark cluster. In particular, I have to query about 100 TB of HBase tables to do aggregations with Spark SQL. What, approximately, is the most suitable cluster configuration for my use case, in order to query the data quickly? At last

Re: indexing an RDD [Python]

2015-04-29 Thread Sven Krasser
Hey Roberto, You will likely want to use a cogroup() then, but it hinges all on how your data looks, i.e. if you have the index in the key. Here's an example: http://homepage.cs.latrobe.edu.au/zhe/ZhenHeSparkRDDAPIExamples.html#cogroup . Clone: RDDs are immutable, so if you need to make changes t

Compute pairwise distance

2015-04-29 Thread Driesprong, Fokko
Dear Sparkers, I am working on an algorithm which requires the pairwise distance between all points (e.g. DBSCAN, LOF, etc.). Computing this for n points will produce an n^2 matrix. If the distance measure is symmetric, this can be reduced to (n^2)/2. What would be the most efficient way of co

Sort (order by) of the big dataset

2015-04-29 Thread Ulanov, Alexander
Hi, I have a 2 billion record dataset with schema . It is stored in Parquet format in HDFS, size 23GB. Specs: Spark 1.3, Hadoop 1.2.1, 8 nodes with Xeon and 16GB RAM, 1TB disk space; each node has 3 workers with 3GB memory. I keep failing to sort the mentioned dataset in Spark. I do the following

Re: Slower performance when bigger memory?

2015-04-29 Thread Sven Krasser
On Mon, Apr 27, 2015 at 7:36 AM, Shuai Zheng wrote: > Thanks. So may I know what is your configuration for more/smaller > executors on r3.8xlarge, how big of the memory that you eventually decide > to give one executor without impact performance (for example: 64g? ). > We're currently using 16 e

Re: Extra stage that executes before triggering computation with an action

2015-04-29 Thread Richard Marscher
I'm not sure, but I wonder if because you are using the Spark REPL that it may not be representing what a normal runtime execution would look like and is possibly eagerly running a partial DAG once you define an operation that would cause a shuffle. What happens if you setup your same set of comma

Re: Extra stage that executes before triggering computation with an action

2015-04-29 Thread Tom Hubregtsen
Thanks for the responses. "Try removing toDebugString and see what happens. " The toDebugString is performed after [d] (the action), as [e]. By then all stages are already executed. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Extra-stage-that-execute

Re: spark with standalone HBase

2015-04-29 Thread Ted Yu
Can you enable HBase DEBUG logging in log4j.properties so that we can have more clue ? What hbase release are you using ? Cheers On Wed, Apr 29, 2015 at 4:27 AM, Saurabh Gupta wrote: > Hi, > > I am working with standalone HBase. And I want to execute HBaseTest.scala > (in scala examples) . > >

Extra stage that executes before triggering computation with an action

2015-04-29 Thread Tom Hubregtsen
Hi, I am trying to see exactly what happens under the hood of Spark when performing a simple sortByKey. So far I've already discovered the fetch files and both the temp-shuffle and shuffle files being written to disk, but there is still an extra stage that keeps on puzzling me. This is the cod

Performance advantage by loading data from local node over S3.

2015-04-29 Thread Nisrina Luthfiyati
Hi all, I'm new to Spark so I'm sorry if the question is too vague. I'm currently trying to deploy a Spark cluster using YARN on an Amazon EMR cluster. For the data storage I'm currently using S3, but would loading the data into HDFS from the local node give a considerable performance advantage over loadin

RE: Spark on Cassandra

2015-04-29 Thread Huang, Roger
http://planetcassandra.org/getting-started-with-apache-spark-and-cassandra/ http://planetcassandra.org/blog/holy-momentum-batman-spark-and-cassandra-circa-2015-w-datastax-connector-and-java/ https://github.com/datastax/spark-cassandra-connector From: Cody Koeninger [mailto:c...@koeninger.org] Se

Re: Spark on Cassandra

2015-04-29 Thread Cody Koeninger
Hadoop version doesn't matter if you're just using cassandra. On Wed, Apr 29, 2015 at 12:08 PM, Matthew Johnson wrote: > Hi all, > > > > I am new to Spark, but excited to use it with our Cassandra cluster. I > have read in a few places that Spark can interact directly with Cassandra > now, so I

Re: Multiclass classification using Ml logisticRegression

2015-04-29 Thread DB Tsai
Wrapping the old LogisticRegressionWithLBFGS could be a quick solution for 1.4, and it's not too hard, so it could potentially get into 1.4. In the long term, I would like to implement a new version like https://github.com/apache/spark/commit/6a827d5d1ec520f129e42c3818fe7d0d870dcbef which handles the

Spark on Cassandra

2015-04-29 Thread Matthew Johnson
Hi all, I am new to Spark, but excited to use it with our Cassandra cluster. I have read in a few places that Spark can interact directly with Cassandra now, so I decided to download it and have a play – I am happy to run it in standalone cluster mode initially. When I go to download it ( http:/

Re: Parquet error reading data that contains array of structs

2015-04-29 Thread Cheng Lian
Thanks for the detailed information! Now I can confirm that this is a backwards-compatibility issue. The data written by parquet 1.6rc7 follows the standard LIST structure. However, Spark SQL still uses old parquet-avro style two-level structures, which causes the problem. Cheng On 4/27/15

HOw can I merge multiple DataFrame and remove duplicated key

2015-04-29 Thread Wang, Ningjun (LNG-NPV)
I have multiple DataFrame objects each stored in a parquet file. The DataFrame just contains 3 columns (id, value, timeStamp). I need to union all the DataFrame objects together, but for duplicated ids only keep the record with the latest timestamp. How can I do that? I can do this for RDDs b

Re: How Spark SQL supports primary and secondary indexes

2015-04-29 Thread Nikolay Tikhonov
I'm running this query with different parameters on the same RDD and get 0.2s for each query. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-Spark-SQL-supports-primary-and-secondary-indexes-tp22700p22706.html Sent from the Apache Spark User List mailing

Re: Driver memory leak?

2015-04-29 Thread Sean Owen
Not sure what you mean. It's already in CDH since 5.4 = 1.3.0 (This isn't the place to ask about CDH) I also don't think that's the problem. The process did not run out of memory. On Wed, Apr 29, 2015 at 2:08 PM, Serega Sheypak wrote: > >The memory leak could be related to this >

Re: How to stream all data out of a Kafka topic once, then terminate job?

2015-04-29 Thread Dmitry Goldenberg
Ah yes so along the lines of the second approach: http://stackoverflow.com/questions/27251055/stopping-spark-streaming-after-reading-first-batch-of-data . On Wed, Apr 29, 2015 at 10:26 AM, Cody Koeninger wrote: > The idea of peek vs poll doesn't apply to kafka, because kafka is not a > queue. >

Re: solr in spark

2015-04-29 Thread Costin Leau
On 4/29/15 6:02 PM, Jeetendra Gangele wrote: Thanks for detail explanation. My only worry is to search the all combinations of company names through ES looks hard. I'm not sure what makes you think "ES looks hard". Have you tried browsing the Elasticsearch reference or the definitive guide?

Re: Re: solr in spark

2015-04-29 Thread Jeetendra Gangele
Thanks for the detailed explanation. My only worry is that searching all combinations of company names through ES looks hard. In Solr we define everything in XML files, like all the attributes in WordDocumentFilterFactory and the shingle factory. How do we do this in Elasticsearch? On 29 April 2015 at 20:03, Co

Re: How to stream all data out of a Kafka topic once, then terminate job?

2015-04-29 Thread Dmitry Goldenberg
Thanks for the comments, Cody. Granted, Kafka topics aren't queues. I was merely wishing that Kafka's topics supported some queue behaviors, because often that is exactly what one wants. The ability to poll messages off a topic seems like what lots of use cases would want. I'll explore both o

Re: Dataframe filter based on another Dataframe

2015-04-29 Thread Olivier Girardot
You mean after joining ? Sure, my question was more if there was any best practice preferred to joining the other dataframe for filtering. Regards, Olivier. Le mer. 29 avr. 2015 à 13:23, Olivier Girardot a écrit : > Hi everyone, > what is the most efficient way to filter a DataFrame on a colum

Re: Re: solr in spark

2015-04-29 Thread Costin Leau
# disclaimer I'm an employee of Elastic (the company behind Elasticsearch) and lead of Elasticsearch Hadoop integration Some things to clarify on the Elasticsearch side: 1. Elasticsearch is a distributed, real-time search and analytics engine. Search is just one aspect of it and it can work wi

Re: How to stream all data out of a Kafka topic once, then terminate job?

2015-04-29 Thread Cody Koeninger
The idea of peek vs poll doesn't apply to kafka, because kafka is not a queue. There are two ways of doing what you want, either using KafkaRDD or a direct stream The Kafka rdd approach would require you to find the beginning and ending offsets for each partition. For an example of this, see get
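A sketch of the KafkaRDD approach; the broker address, topic and offsets are illustrative, and in practice the beginning/ending offsets would be looked up per partition rather than hard-coded:
  import kafka.serializer.StringDecoder
  import org.apache.spark.SparkContext
  import org.apache.spark.streaming.kafka.{KafkaUtils, OffsetRange}
  def dumpTopicOnce(sc: SparkContext): Unit = {
    val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")
    val ranges = Array(
      OffsetRange("my_topic", 0, 0L, 100000L),  // (topic, partition, fromOffset, untilOffset)
      OffsetRange("my_topic", 1, 0L, 100000L))
    val rdd = KafkaUtils.createRDD[String, String, StringDecoder, StringDecoder](sc, kafkaParams, ranges)
    rdd.map(_._2).saveAsTextFile("hdfs:///tmp/topic_dump")  // then the job simply ends
  }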

RE: How to group multiple row data ?

2015-04-29 Thread Silvio Fiorito
I think you'd probably want to look at combineByKey. I'm on my phone so I can't give you an example, but that's one solution I would try. You would then take the resulting RDD and go back to a DF if needed. From: bipin Sent: 4/29/2015
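A rough sketch of the combineByKey idea, collecting each customer/product's events sorted by time so the 1 -> 2 -> 3 trigger sequences can then be walked; the field names follow the question, and the grouping of the sequences themselves is left out:
  import org.apache.spark.SparkContext._
  import org.apache.spark.rdd.RDD
  case class Event(customerId: Long, supplierId: Long, productId: Long, event: Int, createdOn: java.sql.Timestamp)
  def eventsPerKey(rows: RDD[Event]): RDD[((Long, Long), List[Event])] = {
    rows
      .map(r => ((r.customerId, r.productId), r))
      .combineByKey[List[Event]](
        (r: Event) => List(r),                       // createCombiner
        (acc: List[Event], r: Event) => r :: acc,    // mergeValue
        (a: List[Event], b: List[Event]) => a ::: b) // mergeCombiners
      .mapValues(_.sortBy(_.createdOn.getTime))
  }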

Re: How to group multiple row data ?

2015-04-29 Thread ayan guha
looks like you need this: lst = [[10001, 132, 2002, 1, "2012-11-23"], [10001, 132, 2002, 1, "2012-11-24"], [10031, 102, 223, 2, "2012-11-24"], [10001, 132, 2002, 2, "2012-11-25"], [10001, 132, 2002, 3, "2012-11-26"]] base = sc.parallelize(lst,1).map(

Join between Streaming data vs Historical Data in spark

2015-04-29 Thread Rendy Bambang Junior
Let's say I have transaction data and visit data.
visit:
| userId | Visit source | Timestamp |
| A      | google ads   | 1         |
| A      | facebook ads | 2         |
transaction:
| userId | total price | timestamp |
| A      | 100         | 248384    |
| B      | 200         | 43298739  |
I want

Re: HBase HTable constructor hangs

2015-04-29 Thread Ted Yu
Can you verify whether the hbase release you're using has the following fix ? HBASE-8 non environment variable solution for "IllegalAccessError" Cheers On Tue, Apr 28, 2015 at 10:47 PM, Tridib Samanta wrote: > I turned on the TRACE and I see lot of following exception: > > java.lang.Illegal

Re: How to stream all data out of a Kafka topic once, then terminate job?

2015-04-29 Thread Dmitry Goldenberg
Part of the issue is, when you read messages in a topic, the messages are peeked, not polled, so there'll be no "when the queue is empty", as I understand it. So it would seem I'd want to do KafkaUtils.createRDD, which takes an array of OffsetRanges. Each OffsetRange is characterized by topic, p

Re: How to stream all data out of a Kafka topic once, then terminate job?

2015-04-29 Thread Dmitry Goldenberg
Yes, and Kafka topics are basically queues. So perhaps what's needed is just KafkaRDD with starting offset being 0 and finish offset being a very large number... Sent from my iPhone > On Apr 29, 2015, at 1:52 AM, ayan guha wrote: > > I guess what you mean is not streaming. If you create a st

Re: Driver memory leak?

2015-04-29 Thread Serega Sheypak
>The memory leak could be related to this defect that was resolved in Spark 1.2.2 and 1.3.0. @Sean Will it be backported to CDH? I didn't find that bug in the CDH 5.4 release notes. 2015-04-29 14:51 GMT+02:00 Conor Fennell : > The memory leak could be

Re: MLib KMeans on large dataset issues

2015-04-29 Thread Sam Stoelinga
Guys, great feedback, thanks for pointing out my stupidity :D Rows and columns got intermixed, hence the weird results I was seeing. Ignore my previous issues; I will reformat my data first. On Wed, Apr 29, 2015 at 8:47 PM, Sam Stoelinga wrote: > I'm mostly using example code, see here: > http://paste.opens

Re: DataFrame filter referencing error

2015-04-29 Thread ayan guha
Looks like your DF is based on a MySQL DB using JDBC, and the error is thrown from MySQL. Can you see what SQL is finally getting fired in MySQL? Spark is pushing the predicate down to MySQL, so it's not a Spark problem per se. On Wed, Apr 29, 2015 at 9:56 PM, Francesco Bigarella < francesco.bigare...@gmai

Re: Dataframe filter based on another Dataframe

2015-04-29 Thread ayan guha
You can use .select to project only columns you need On Wed, Apr 29, 2015 at 9:23 PM, Olivier Girardot wrote: > Hi everyone, > what is the most efficient way to filter a DataFrame on a column from > another Dataframe's column. The best idea I had, was to join the two > dataframes : > > > val df1

Re: Driver memory leak?

2015-04-29 Thread Conor Fennell
The memory leak could be related to this defect that was resolved in Spark 1.2.2 and 1.3.0. It also was a HashMap causing the issue. -Conor On Wed, Apr 29, 2015 at 12:01 PM, Sean Owen wrote: > Please use user@, not dev@ > > This message does

Re: MLib KMeans on large dataset issues

2015-04-29 Thread Sam Stoelinga
I'm mostly using example code, see here: http://paste.openstack.org/show/211966/ The data has 799305 dimensions and is separated by spaces. Please note the issues I'm seeing are due to the Scala implementation, IMO, as they also happen when using the Python wrappers. On Wed, Apr 29, 2015 at 8:00

Re: java.io.IOException: No space left on device

2015-04-29 Thread selim namsi
Sorry I put the log messages when creating the thread in http://apache-spark-user-list.1001560.n3.nabble.com/java-io-IOException-No-space-left-on-device-td22702.html but I forgot that raw messages will not be sent in emails. So this is the log related to the error : 15/04/29 02:48:50 INFO CacheMa

Re: How to group multiple row data ?

2015-04-29 Thread Manoj Awasthi
Sorry but I didn't fully understand the grouping. This line: >> The group must only take the closest previous trigger. The first one hence shows alone. Can you please explain further? On Wed, Apr 29, 2015 at 4:42 PM, bipin wrote: > Hi, I have a ddf with schema (CustomerID, SupplierID, Product

Re: MLib KMeans on large dataset issues

2015-04-29 Thread Jeetendra Gangele
How are you passing the feature vectors to k-means? Are they in a 2-D space or a 1-D array? Did you try using streaming k-means? Will you be able to paste code here? On 29 April 2015 at 17:23, Sam Stoelinga wrote: > Hi Sparkers, > > I am trying to run MLlib k-means on a large dataset (50+ GB of data) and a > lar

DataFrame filter referencing error

2015-04-29 Thread Francesco Bigarella
Hi all, I was testing the DataFrame filter functionality and I found what I think is a strange behaviour. My dataframe testDF, obtained by loading a MySQL table via JDBC, has the following schema:
root
 |-- id: long (nullable = false)
 |-- title: string (nullable = true)
 |-- value: string (nullabl

MLib KMeans on large dataset issues

2015-04-29 Thread Sam Stoelinga
Hi Sparkers, I am trying to run MLlib k-means on a large dataset (50+ GB of data) and a large K, but I've encountered the following issues: - The Spark driver runs out of memory and dies because collect gets called as part of KMeans, which loads all data back into the driver's memory. - At the end

Re: java.io.IOException: No space left on device

2015-04-29 Thread Dean Wampler
Or multiple volumes. The LOCAL_DIRS (YARN) and SPARK_LOCAL_DIRS (Mesos, Standalone) environment variables and the spark.local.dir property control where temporary data is written. The default is /tmp. See http://spark.apache.org/docs/latest/configuration.html#runtime-environment for more details.
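A sketch of pointing that scratch space at bigger (or multiple) volumes via spark.local.dir for standalone/Mesos mode; the paths are illustrative, and on YARN the node manager's LOCAL_DIRS takes precedence:
  import org.apache.spark.{SparkConf, SparkContext}
  val conf = new SparkConf()
    .setAppName("local-dirs-sketch")
    // Comma-separated list; shuffle spills and block manager files are spread across these.
    .set("spark.local.dir", "/mnt/disk1/spark-tmp,/mnt/disk2/spark-tmp")
  val sc = new SparkContext(conf)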

Re: java.io.IOException: No space left on device

2015-04-29 Thread Dean Wampler
Makes sense. "/" is where /tmp would be. However, 230G should be plenty of space. If you have INFO logging turned on (set in $SPARK_HOME/conf/log4j.properties), you'll see messages about saving data to disk that will list sizes. The web console also has some summary information about this. dean D

spark with standalone HBase

2015-04-29 Thread Saurabh Gupta
Hi, I am working with standalone HBase. And I want to execute HBaseTest.scala (in scala examples) . I have created a test table with three rows and I just want to get the count using HBaseTest.scala I am getting this issue: 15/04/29 11:17:10 INFO BlockManagerMaster: Registered BlockManager 15/0

Dataframe filter based on another Dataframe

2015-04-29 Thread Olivier Girardot
Hi everyone, what is the most efficient way to filter a DataFrame on a column from another Dataframe's column. The best idea I had, was to join the two dataframes : > val df1 : Dataframe > val df2: Dataframe > df1.join(df2, df1("id") === df2("id"), "inner") But I end up (obviously) with the "id"

Re: java.io.IOException: No space left on device

2015-04-29 Thread selim namsi
This is the output of df -h, so as you can see I'm using only one disk mounted on /
df -h
Filesystem      Size  Used Avail Use% Mounted on
/dev/sda8       276G   34G  229G  13% /
none            4.0K     0  4.0K   0% /sys/fs/cgroup
udev            7.8G  4.0K  7.8G   1% /dev
tmpfs           1.6G  1.4M

implicit function in SparkStreaming

2015-04-29 Thread guoqing0...@yahoo.com.hk
Hi guys, I'm puzzled why I can't use an implicit function in Spark Streaming; it causes Task not serializable. Code snippet: implicit final def str2KeyValue(s:String): (String,String) = { val message = s.split("\\|") if(message.length >= 2) (message(0),message(1)) else if(message.length

Re: java.io.IOException: No space left on device

2015-04-29 Thread Anshul Singhle
Do you have multiple disks? Maybe your work directory is not in the right disk? On Wed, Apr 29, 2015 at 4:43 PM, Selim Namsi wrote: > Hi, > > I'm using spark (1.3.1) MLlib to run random forest algorithm on tfidf > output,the training data is a file containing 156060 (size 8.1M). > > The problem

java.io.IOException: No space left on device

2015-04-29 Thread Selim Namsi
Hi, I'm using Spark (1.3.1) MLlib to run the random forest algorithm on tf-idf output; the training data is a file containing 156060 (size 8.1M). The problem is that when trying to persist a partition into memory and there is not enough memory, the partition is persisted on disk, and despite having 229

How to group multiple row data ?

2015-04-29 Thread bipin
Hi, I have a ddf with schema (CustomerID, SupplierID, ProductID, Event, CreatedOn); the first 3 are Long ints, Event can only be 1, 2, 3, and CreatedOn is a timestamp. How can I make a group triplet/doublet/singlet out of them such that I can infer that a Customer registered an event from 1 to 2, and if p

Re: A problem of using spark streaming to capture network packets

2015-04-29 Thread Dean Wampler
I would use the "ps" command on each machine while the job is running to confirm that every process involved is running as root. Dean Wampler, Ph.D. Author: Programming Scala, 2nd Edition (O'Reilly) Typesafe @deanwampler

Re: Driver memory leak?

2015-04-29 Thread Sean Owen
Please use user@, not dev@ This message does not appear to be from your driver. It also doesn't say you ran out of memory. It says you didn't tell YARN to let it use the memory you want. Look at the memory overhead param and please search first for related discussions. On Apr 29, 2015 11:43 AM, "w

How Spark SQL supports primary and secondary indexes

2015-04-29 Thread Nikolay Tikhonov
Hi all, I execute a simple SQL query and get unacceptable performance. I do the following steps: 1. Apply a schema to an RDD and register a table. 2. Run a SQL query which returns several entries. Running time for this query is 0.2s (the table contains 10 entries). I think that Spark SQL ha
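A minimal sketch of steps 1-2 for Spark 1.x; the column names, sample data and query are illustrative:
  import org.apache.spark.sql.{Row, SQLContext}
  import org.apache.spark.sql.types.{LongType, StringType, StructField, StructType}
  val sqlContext = new SQLContext(sc)  // assumes an existing SparkContext `sc`
  val schema = StructType(Seq(
    StructField("id", LongType, nullable = false),
    StructField("name", StringType, nullable = true)))
  val rowRdd = sc.parallelize(1L to 10L).map(i => Row(i, s"name$i"))
  val df = sqlContext.createDataFrame(rowRdd, schema)
  df.registerTempTable("entries")
  val result = sqlContext.sql("SELECT * FROM entries WHERE id = 5")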

Re: sparksql - HiveConf not found during task deserialization

2015-04-29 Thread Manku Timma
The issue is solved. There was a problem in my hive codebase. Once that was fixed, -Phive-provided spark is working fine against my hive jars. On 27 April 2015 at 08:00, Manku Timma wrote: > Made some progress on this. Adding hive jars to the system classpath is > needed. But looks like it needs

Re: Re: Question about Memory Used and VCores Used

2015-04-29 Thread bit1...@163.com
Thanks Sandy, it is very useful! bit1...@163.com From: Sandy Ryza Date: 2015-04-29 15:24 To: bit1...@163.com CC: user Subject: Re: Question about Memory Used and VCores Used Hi, Good question. The extra memory comes from spark.yarn.executor.memoryOverhead, the space used for the application

Re: How to setup this "false streaming" problem

2015-04-29 Thread Ignacio Blasco
Hi Toni. Given there is more than one measure per (user, hour), what is the measure you want to keep? The sum? The mean? The most recent measure? For the sum or the mean you don't need to care about the timing. And if you want the most recent, then you can include the timestamp in the redu

RE: HBase HTable constructor hangs

2015-04-29 Thread Tridib Samanta
I turned on the TRACE and I see lot of following exception: java.lang.IllegalAccessError: com/google/protobuf/ZeroCopyLiteralByteString at org.apache.hadoop.hbase.protobuf.RequestConverter.buildRegionSpecifier(RequestConverter.java:897) at org.apache.hadoop.hbase.protobuf.RequestConverter.bui

Re: Re: Spark streaming - textFileStream/fileStream - Get file name

2015-04-29 Thread Akhil Das
It is possible to access the filename; it's a bit tricky though. val fstream = ssc.fileStream[LongWritable, IntWritable, SequenceFileInputFormat[LongWritable, IntWritable]]("/home/akhld/input/") fstream.foreach(x =>{ //You can get it with this object. println(x.values.toDebu

Re: Re: Spark streaming - textFileStream/fileStream - Get file name

2015-04-29 Thread Saisai Shao
Yes, it looks like a solution but quite a tricky one. You have to parse the debug string to get the file name, and it also relies on HadoopRDD to get the file name :) 2015-04-29 14:52 GMT+08:00 Akhil Das : > It is possible to access the filename, its a bit tricky though. > > val fstream = ssc.fileStream[LongWri

Re: Multiclass classification using Ml logisticRegression

2015-04-29 Thread selim namsi
Thank you for your Answer! Yes I would like to work on it. Selim On Mon, Apr 27, 2015 at 5:23 AM Joseph Bradley wrote: > Unfortunately, the Pipelines API doesn't have multiclass logistic > regression yet, only binary. It's really a matter of modifying the current > implementation; I just added

Re: How to stream all data out of a Kafka topic once, then terminate job?

2015-04-29 Thread ayan guha
I guess what you mean is not streaming. If you create a stream context at time t, you will receive data coming through starting time t++, not before time t. Looks like you want a queue. Let Kafka write to a queue, consume msgs from the queue and stop when queue is empty. On 29 Apr 2015 14:35, "dg

Re: Question about Memory Used and VCores Used

2015-04-29 Thread Sandy Ryza
Hi, Good question. The extra memory comes from spark.yarn.executor.memoryOverhead, the space used for the application master, and the way the YARN rounds requests up. This explains it in a little more detail: http://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-2/ -Sand
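A sketch of the settings involved; the values are illustrative, and the per-container memory on YARN ends up as executor memory plus the overhead, rounded up to yarn.scheduler.minimum-allocation-mb:
  import org.apache.spark.SparkConf
  val conf = new SparkConf()
    .set("spark.executor.memory", "4g")
    // Off-heap headroom YARN adds on top of the executor heap; the default is a small
    // fraction of executor memory with a 384 MB floor.
    .set("spark.yarn.executor.memoryOverhead", "512")  // in MB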

How to set DEBUG level log of spark executor on Standalone deploy mode

2015-04-29 Thread eric wong
Hi, I want to check the DEBUG log of the Spark executor in standalone deploy mode. But: 1. Setting log4j.properties in the spark/conf folder on the master node and restarting the cluster: none of the above works. 2. Using spark-submit --properties-file log4j: it just prints the debug log to the screen, but the executor log still seems to b

Re: ReduceByKey and sorting within partitions

2015-04-29 Thread Marco
On 04/27/2015 06:00 PM, Ganelin, Ilya wrote: > Marco - why do you want data sorted both within and across partitions? If you > need to take an ordered sequence across all your data you need to either > aggregate your RDD on the driver and sort it, or use zipWithIndex to apply an > ordered inde
