Re:Re: Materials for deep insight into Spark SQL

2015-08-14 Thread Todd
Thanks Ted for the help! At 2015-08-14 12:02:14, "Ted Yu" wrote: You can look under Developer Track: https://spark-summit.org/2015/#day-1 http://www.slideshare.net/jeykottalam/spark-sqlamp-camp2014?related=1 (slightly old) Catalyst design: https://docs.google.com/a/databricks.com/docume

Using unserializable classes in tasks

2015-08-14 Thread mark
I have a Spark job that computes some values and needs to write those values to a data store. The classes that write to the data store are not serializable (eg, Cassandra session objects etc). I don't want to collect all the results at the driver, I want each worker to write the data - what is the

Fwd: Using unserializable classes in tasks

2015-08-14 Thread Dawid Wysakowicz
-- Forwarded message -- From: Dawid Wysakowicz Date: 2015-08-14 9:32 GMT+02:00 Subject: Re: Using unserializable classes in tasks To: mark I am not an expert but first of all check if there is no ready connector (you mentioned Cassandra - check: spark-cassandra-connector

Re: Spark runs into an Infinite loop even if the tasks are completed successfully

2015-08-14 Thread Mridul Muralidharan
From what I understood from Imran's mail (and what was referenced in it), the RDD mentioned seems to be violating some basic contracts on how partitions are used in Spark [1]. They cannot be arbitrarily numbered, have duplicates, etc. Extending RDD to add functionality is typically for niche cases

Re: Always two tasks slower than others, and then job fails

2015-08-14 Thread Jeff Zhang
Data skew? Maybe your partition key has some special value like null or an empty string. On Fri, Aug 14, 2015 at 11:01 AM, randylu wrote: > It is strange that there are always two tasks slower than others, and the > corresponding partitions' data are larger, no matter how many partitions? > > > Ex

Re: Using unserializable classes in tasks

2015-08-14 Thread Dawid Wysakowicz
No, the connector does not need to be serializable because it is constructed on the worker. Only objects shuffled across partitions need to be serializable. 2015-08-14 9:40 GMT+02:00 mark : > I guess I'm looking for a more general way to use complex graphs of > objects that cannot be serialized in
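A minimal sketch of the per-partition pattern described here, with the non-serializable client built on the executor inside foreachPartition. The JDBC URL, table, and record type are placeholders for illustration only, assuming the RDD is an RDD[(String, Int)]:

    import java.sql.DriverManager

    // Hypothetical target store reachable over JDBC.
    val jdbcUrl = "jdbc:postgresql://db-host:5432/mydb"

    rdd.foreachPartition { rows =>
      // The connection is created on the executor, so it never needs to be serialized.
      val conn = DriverManager.getConnection(jdbcUrl)
      val stmt = conn.prepareStatement("INSERT INTO results (k, v) VALUES (?, ?)")
      try {
        rows.foreach { case (k, v) =>
          stmt.setString(1, k)
          stmt.setInt(2, v)
          stmt.executeUpdate()
        }
      } finally {
        stmt.close()
        conn.close()
      }
    }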

Re: matrix inverse and multiplication

2015-08-14 Thread go canal
Correction: I am not able to convert the Scala statement to java.

Re: Spark runs into an Infinite loop even if the tasks are completed successfully

2015-08-14 Thread Akhil Das
Thanks for the clarifications Mridul. Thanks Best Regards On Fri, Aug 14, 2015 at 1:04 PM, Mridul Muralidharan wrote: > What I understood from Imran's mail (and what was referenced in his > mail) the RDD mentioned seems to be violating some basic contracts on > how partitions are used in spark

Re: spark.files.userClassPathFirst=true Return Error - Please help

2015-08-14 Thread Kyle Lin
Hi all, I had similar usage and also got the same problem. I guess Spark uses some class from my user jars when it should actually use the class in spark-assembly-xxx.jar, but I don't know how to fix it. Kyle 2015-07-22 23:03 GMT+08:00 Ashish Soni : > Hi All , > > I am getting below error when i

Re: spark.files.userClassPathFirst=true Return Error - Please help

2015-08-14 Thread Akhil Das
Which version of spark are you using? Did you try with --driver-class-path configuration? Thanks Best Regards On Fri, Aug 14, 2015 at 2:05 PM, Kyle Lin wrote: > Hi all > > I had similar usage and also got the same problem. > > I guess Spark use some class in my user jars but actually it should

Re: saveToCassandra not working in Spark Job but works in Spark Shell

2015-08-14 Thread Akhil Das
Looks like a jar version conflict to me. Thanks Best Regards On Thu, Aug 13, 2015 at 7:59 PM, satish chandra j wrote: > HI, > Please let me know if I am missing anything in the below mail, to get the > issue fixed > > Regards, > Satish Chandra > > On Wed, Aug 12, 2015 at 6:59 PM, satish chandra

Re: RDD.join vs spark SQL join

2015-08-14 Thread Akhil Das
Both work the same way, but with Spark SQL you will get the optimizations etc. done by Catalyst. One important thing to consider is the # of partitions and the key distribution (when you are doing RDD.join). If the keys are not evenly distributed across machines then you can see the process choking

Re: Always two tasks slower than others, and then job fails

2015-08-14 Thread Zoltán Zvara
Data skew is still a problem with Spark. - If you use groupByKey, try to express your logic by not using groupByKey. - If you need to use groupByKey, all you can do is to scale vertically. - If you can, repartition with a finer HashPartitioner. You will have many tasks for each stage, but tasks ar
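Two of those mitigations as a short sketch, on a hypothetical skewed (key, value) RDD; the partition count is just an example:

    import org.apache.spark.HashPartitioner
    import org.apache.spark.rdd.RDD

    val pairs: RDD[(String, Long)] = ???   // your skewed pair data

    // Express the aggregation without groupByKey so values are combined map-side first.
    val counts = pairs.reduceByKey(_ + _)

    // Repartition with a finer-grained HashPartitioner to spread each stage over more tasks.
    val repartitioned = pairs.partitionBy(new HashPartitioner(400))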

Re: Sorted Multiple Outputs

2015-08-14 Thread Yiannis Gkoufas
Hi Eugene, in my case the list of values that I want to sort and write to a separate file is fairly small, so the way I solved it is the following: .groupByKey().foreach(e => { val hadoopConfig = new Configuration() val hdfs = FileSystem.get(hadoopConfig); val newPath = rootPath+"/"+e._1;
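A sketch of that pattern with the truncated part filled in by assumption (one file per key, String values so they sort naturally; record format and paths are illustrative, not the actual code from the thread):

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}

    rdd.groupByKey().foreach { case (key, values) =>
      val hdfs = FileSystem.get(new Configuration())
      // One output file per key; assumes each per-key value list is small enough for one task.
      val out = hdfs.create(new Path(rootPath + "/" + key), true)
      try {
        values.toSeq.sorted.foreach(v => out.write((v + "\n").getBytes("UTF-8")))
      } finally {
        out.close()
      }
    }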

Re: spark streaming map use external variable occur a problem

2015-08-14 Thread Shixiong Zhu
Looks like you compiled the code with one Scala version but ran your app using a different, incompatible version. BTW, you should not use PrintWriter like this to save your results. There may be multiple tasks running on the same host, and your job will fail because you are trying to write to the same
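A sketch of the usual alternative to a shared PrintWriter: let each batch write its own HDFS directory, one part-file per partition. The path pattern below is illustrative:

    dstream.foreachRDD { (rdd, time) =>
      if (!rdd.isEmpty()) {
        // Each batch gets its own directory, so tasks never contend for one file.
        rdd.saveAsTextFile("hdfs:///results/batch-" + time.milliseconds)
      }
    }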

Re: spark.files.userClassPathFirst=true Return Error - Please help

2015-08-14 Thread Kyle Lin
Hello Akhil I use Spark 1.4.2 on HDP 2.1(Hadoop 2.4) I didn't use --driver-class-path. I only use spark.executor.userClassPathFirst=true Kyle 2015-08-14 17:11 GMT+08:00 Akhil Das : > Which version of spark are you using? Did you try with --driver-class-path > configuration? > > Thanks > Best
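For reference, the two related settings being discussed can be set on the SparkConf (or passed as --conf flags to spark-submit, alongside --driver-class-path); this is only a sketch and the app name is made up:

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .setAppName("user-classpath-first-test")
      .set("spark.executor.userClassPathFirst", "true")
      .set("spark.driver.userClassPathFirst", "true")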

Re: saveToCassandra not working in Spark Job but works in Spark Shell

2015-08-14 Thread satish chandra j
Hi Akhil, Which jar version is conflicting and what needs to be done for the fix Regards, Satish Chandra On Fri, Aug 14, 2015 at 2:44 PM, Akhil Das wrote: > Looks like a jar version conflict to me. > > Thanks > Best Regards > > On Thu, Aug 13, 2015 at 7:59 PM, satish chandra j < > jsatishchan..

Spark job endup with NPE

2015-08-14 Thread hide
Hello, I'm using Spark on a YARN cluster and using mongo-hadoop-connector to pull data into Spark for some processing. The job has the following stages: (flatMap -> flatMap -> reduceByKey -> sortByKey). The data in MongoDB is tweets from Twitter. First, I connect to MongoDB and make an RDD as follows: val mongoRD

Error: Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/hadoop/hbase/HBaseConfiguration

2015-08-14 Thread stelsavva
Hello, I am just starting out with Spark Streaming and HBase/Hadoop. I'm writing a simple app to read from Kafka and store to HBase, and I am having trouble submitting my job to Spark. I've downloaded Apache Spark 1.4.1 pre-built for Hadoop 2.6. I am building the project with mvn package and submitt

Re: Spark Streaming: Change Kafka topics on runtime

2015-08-14 Thread Nisrina Luthfiyati
Hi Cody, by start/stopping, do you mean the streaming context or the app entirely? From what I understand once a streaming context has been stopped it cannot be restarted, but I also haven't found a way to stop the app programmatically. The batch duration will probably be around 1-10 seconds. I

Re: Spark RuntimeException hadoop output format

2015-08-14 Thread Ted Yu
Which Spark release are you using ? Can you show us snippet of your code ? Have you checked namenode log ? Thanks > On Aug 13, 2015, at 10:21 PM, Mohit Anchlia wrote: > > I was able to get this working by using an alternative method however I only > see 0 bytes files in hadoop. I've verifi

Re: Error: Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/hadoop/hbase/HBaseConfiguration

2015-08-14 Thread Stephen Boesch
The NoClassDefFoundError differs from ClassNotFoundException: it indicates an error while initializing that class, but the class itself is found on the classpath. Please provide the full stack trace. 2015-08-14 4:59 GMT-07:00 stelsavva : > Hello, I am just starting out with spark streaming and Hbas

Left outer joining big data set with small lookups

2015-08-14 Thread VIJAYAKUMAR JAWAHARLAL
Hi, I am facing a huge performance problem when I am trying to left outer join a very big data set (~140GB) with a bunch of small lookups [star schema type]. I am using data frames in Spark SQL. It looks like data is shuffled and skewed when that join happens. Is there any way to improve performance

Re: Left outer joining big data set with small lookups

2015-08-14 Thread Silvio Fiorito
You could cache the lookup DataFrames, it’ll then do a broadcast join. On 8/14/15, 9:39 AM, "VIJAYAKUMAR JAWAHARLAL" wrote: >Hi > >I am facing huge performance problem when I am trying to left outer join very >big data set (~140GB) with bunch of small lookups [Start schema type]. I am >usin
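A sketch of that suggestion, with invented table and column names: cache and materialize the small dimension table so the optimizer knows it is small enough to broadcast:

    val facts  = sqlContext.read.parquet("/data/facts")        // the ~140GB side
    val lookup = sqlContext.read.parquet("/data/dim_country")  // small lookup table

    lookup.cache()
    lookup.count()   // materialize the cache so size statistics are available

    val joined = facts.join(lookup, facts("country_id") === lookup("id"), "left_outer")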

Re: How to specify column type when saving DataFrame as parquet file?

2015-08-14 Thread Raghavendra Pandey
I think you can try the DataFrame create API that takes an RDD[Row] and a StructType... On Aug 11, 2015 4:28 PM, "Jyun-Fan Tsai" wrote: > Hi all, > I'm using Spark 1.4.1. I create a DataFrame from json file. There is > a column C that all values are null in the json file. I found that > the datatype o
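A sketch of that createDataFrame route, with placeholder column names; the point is that the explicit StructType fixes the type of the all-null column:

    import org.apache.spark.sql.Row
    import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

    val schema = StructType(Seq(
      StructField("id", IntegerType, nullable = false),
      StructField("c",  StringType,  nullable = true)   // the column that is entirely null
    ))

    val rowRdd: org.apache.spark.rdd.RDD[Row] = df.rdd  // rows from the DataFrame read from JSON
    val typed  = sqlContext.createDataFrame(rowRdd, schema)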

Re: Spark job endup with NPE

2015-08-14 Thread Akhil Das
You can put a try..catch around all the transformations that you are doing and catch such exceptions instead of crashing your entire job. Thanks Best Regards On Fri, Aug 14, 2015 at 4:35 PM, hide wrote: > Hello, > > I'm using spark on yarn cluster and using mongo-hadoop-connector to pull > data
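In practice the try/catch usually goes inside the transformation closure, so one bad record is dropped rather than failing the job; extractFields below is only a placeholder for the actual per-tweet logic:

    val parsed = mongoRdd.flatMap { doc =>
      try {
        Some(extractFields(doc))   // placeholder for the real flatMap logic
      } catch {
        case _: NullPointerException => None   // skip records that would NPE
      }
    }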

Re: Spark Streaming: Change Kafka topics on runtime

2015-08-14 Thread Cody Koeninger
There's a long recent thread in this list about stopping apps, subject was "stopping spark stream app" at 1 second I wouldn't run repeated rdds, no. I'd take a look at subclassing, personally (you'll have to rebuild the streaming kafka project since a lot is private), but if topic changes dont ha

Cannot cast to Tuple when running in cluster mode

2015-08-14 Thread Saif.A.Ellafi
Hi All, I have a working program in which I create two big Tuple2s out of the data. This seems to work in local mode, but when I switch over to standalone cluster mode, I get this error at the very beginning: 15/08/14 10:22:25 WARN TaskSetManager: Lost task 4.0 in stage 1.0 (TID 10, 162.101.194.44): j

Re: Maintaining Kafka Direct API Offsets

2015-08-14 Thread dutrow
How do I get beyond the "This post has NOT been accepted by the mailing list yet" message? This message was posted through the nabble interface; one would think that would be enough to get the message accepted. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com

Re: Maintaining Kafka Direct API Offsets

2015-08-14 Thread Cody Koeninger
Use your email client to send a message to the mailing list from the email address you used to subscribe? The message you just sent reached the list On Fri, Aug 14, 2015 at 9:36 AM, dutrow wrote: > How do I get beyond the "This post has NOT been accepted by the mailing > list > yet" message? Th

Re: Maintaining Kafka Direct API Offsets

2015-08-14 Thread dutrow
For those who find this post and may be interested, the most thorough documentation on the subject may be found here: https://github.com/koeninger/kafka-exactly-once -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Maintaining-Kafka-Direct-API-Offsets-tp24246

Re: How to specify column type when saving DataFrame as parquet file?

2015-08-14 Thread Francis Lau
Jyun Fan Here is how I have been doing it. I found that I needed to define the schema when loading the JSON file first Francis import datetime from pyspark.sql.types import * # Define schema upSchema = StructType([ StructField("field 1", StringType(), True), StructField("field 2", LongType(
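The same idea in Scala, for anyone not on PySpark: supply the schema up front when reading the JSON so an all-null column keeps the intended type. Field names and path are placeholders:

    import org.apache.spark.sql.types.{LongType, StringType, StructField, StructType}

    val upSchema = StructType(Seq(
      StructField("field1", StringType, nullable = true),
      StructField("field2", LongType,   nullable = true)
    ))

    val df = sqlContext.read.schema(upSchema).json("/path/to/data.json")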

Another issue with using lag and lead with data frames

2015-08-14 Thread Jerry
So it seems like DataFrames aren't going to give me a break and just work. Now it evaluates, but goes nuts if it runs into a null case OR doesn't know how to get the correct data type when I specify the default value as a string expression. Let me know if anyone has a workaround for this. PLEASE HELP M

Re: Maintaining Kafka Direct API Offsets

2015-08-14 Thread dutrow
In summary, it appears that the use of the DirectAPI was intended specifically to enable exactly-once semantics. This can be achieved for idempotent transformations and with transactional processing using the database to guarantee an "onto" mapping of results based on inputs. For the latter, you ne

Fwd: Graphx - how to add vertices to a HashSet of vertices ?

2015-08-14 Thread Ranjana Rajendran
-- Forwarded message -- From: Ranjana Rajendran Date: Thu, Aug 13, 2015 at 7:37 AM Subject: Graphx - how to add vertices to a HashSet of vertices ? To: d...@spark.apache.org Hi, sampledVertices is a HashSet of vertices var sampledVertices: HashSet[VertexId] = HashSet() I

RE: Spark Job Hangs on our production cluster

2015-08-14 Thread java8964
I still want to check if anyone can provide any help with why Spark 1.2.2 hangs on our production cluster when reading big HDFS data (7800 avro blocks), while it looks fine for small data (769 avro blocks). I enabled the debug level in the Spark log4j config, and attached the log file if it helps

Re: Maintaining Kafka Direct API Offsets

2015-08-14 Thread Cody Koeninger
I don't entirely agree with that assessment. Not paying for extra cores to run receivers was about as important as delivery semantics, as far as motivations for the api. As I said in the jira tickets on the topic, if you want to use the direct api and save offsets to ZK, you can. The right way

Re: Checkpointing doesn't appear to be working for direct streaming from Kafka

2015-08-14 Thread Dmitry Goldenberg
Our additional question on checkpointing is basically the logistics of it -- At which point does the data get written into checkpointing? Is it written as soon as the driver program retrieves an RDD from Kafka (or another source)? Or, is it written after that RDD has been processed and we're bas

Re: Maintaining Kafka Direct API Offsets

2015-08-14 Thread Dan Dutrow
Thanks. Looking at the KafkaCluster.scala code, ( https://github.com/apache/spark/blob/master/external/kafka/src/main/scala/org/apache/spark/streaming/kafka/KafkaCluster.scala#L253), it seems a little hacky for me to alter and recompile spark to expose those methods, so I'll use the receiver API fo

QueueStream Does Not Support Checkpointing

2015-08-14 Thread Asim Jalis
I want to test some Spark Streaming code that is using reduceByKeyAndWindow. If I do not enable checkpointing, I get the error: java.lang.IllegalArgumentException: requirement failed: The checkpoint directory has not been set. Please set it by StreamingContext.checkpoint(). But if I enable che

Re: Another issue with using lag and lead with data frames

2015-08-14 Thread Salih Oztop
Hi Jerry, This blog post is perfect for window functions in Spark: https://databricks.com/blog/2015/07/15/introducing-window-functions-in-spark-sql.html and a generic SQL usage from the oracle-base blog: https://oracle-base.com/articles/misc/lag-lead-analytic-functions It seems you are not using Window
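For comparison, a minimal DataFrame-API version of Jerry's lag expression using a window spec (in Spark 1.4 window functions require a HiveContext); the ordering column and default value simply follow his snippet:

    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions.lag

    val w       = Window.orderBy("A")
    val withLag = df.withColumn("LA", lag(df("A"), 1, "X").over(w))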

Re: Spark RuntimeException hadoop output format

2015-08-14 Thread Mohit Anchlia
Spark 1.3 Code: wordCounts.foreachRDD(new Function2, Time, Void>() { @Override public Void call(JavaPairRDD rdd, Time time) throws IOException { String counts = "Counts at time " + time + " " + rdd.collect(); System.out.println(counts); System.out.println("Appending to " + output

Re: Another issue with using lag and lead with data frames

2015-08-14 Thread Jerry
Hi Salih, Normally I do sort before performing that operation, but since I've been trying to get this working for a week, I'm just loading something simple to test if lag works. Earlier I was having DB issues so it's been a long run of solving one runtime exception after another. Hopefully tho

Re: Another issue with using lag and lead with data frames

2015-08-14 Thread Jerry
Still not cooperating... lag(A,1,'X') OVER (ORDER BY A) as LA ^ at scala.sys.package$.error(package.scala:27) at org.apache.spark.sql.catalyst.SqlParser.parseExpression(SqlParser.scala:45) at org.apache.spark.sql.DataFrame$$anonfun$selectExpr$1.apply(DataFrame.scala:6

Help with persist: Data is requested again

2015-08-14 Thread Saif.A.Ellafi
Hello all, I am writing a program which reads from a database. I run a couple of computations, but in the end I have a while loop in which I make a modification to the persisted data, e.g.: val data = PairRDD... persist() var i = 0 while (i < 10) { val data_mod = data.map(_._1 + 1, _._2)
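A sketch of that loop with one common gotcha made explicit: persist() is lazy, so without an action before the loop the first iteration still goes back to the database. fetchFromDatabase is a placeholder, and numeric keys are assumed as in the mail:

    val data = fetchFromDatabase(sc).persist()   // placeholder for the actual DB read
    data.count()                                 // force materialization of the cache

    var i = 0
    while (i < 10) {
      val dataMod = data.map { case (k, v) => (k + 1, v) }   // assumes numeric keys
      dataMod.count()                                        // whatever action each iteration runs
      i += 1
    }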

Re: Setting up Spark/flume/? to Ingest 10TB from FTP

2015-08-14 Thread Varadhan, Jawahar
Thanks Marcelo. But our problem is a little complicated. We have 10+ FTP sites that we will be transferring data from. The FTP server info, filename, and credentials are all coming via a Kafka message. So, I want to read those Kafka messages and dynamically connect to the FTP site and download those fat

Re: Setting up Spark/flume/? to Ingest 10TB from FTP

2015-08-14 Thread Marcelo Vanzin
On Fri, Aug 14, 2015 at 2:11 PM, Varadhan, Jawahar < varad...@yahoo.com.invalid> wrote: > And hence, I was planning to use Spark Streaming with Kafka or Flume with > Kafka. But flume runs on a JVM and may not be the best option as the huge > file will create memory issues. Please suggest someway t

Re: distributing large matrices

2015-08-14 Thread Rob Sargent
@Koen, If you meant to reply to my question on distributing matrices, could you re-send, as there was no content in your post. Thanks, On 08/07/2015 10:02 AM, Koen Vantomme wrote: Sent from my Sony Xperia™ smartphone iceback wrote: Is this the sort of problem spark ca

Re: Checkpointing doesn't appear to be working for direct streaming from Kafka

2015-08-14 Thread Cody Koeninger
You'll resume and re-process the RDD that didn't finish. On Fri, Aug 14, 2015 at 1:31 PM, Dmitry Goldenberg wrote: > Our additional question on checkpointing is basically the logistics of it > -- > > At which point does the data get written into checkpointing? Is it > written as soon as the drive

Re: worker and executor memory

2015-08-14 Thread James Pirz
Additional comment: I checked the disk usage on the 3 nodes (using iostat) and it seems that reading from HDFS partitions happens on a node-by-node basis. Only one of the nodes shows active IO (as reads) at any given time while the other two nodes are idle IO-wise. I am not sure why the tasks are sch

Re: QueueStream Does Not Support Checkpointing

2015-08-14 Thread Tathagata Das
A hacky workaround is to create a custom InputDStream that creates the right RDDs based on a function. The TestInputDStream does something similar for Spark Streaming unit tes

Re: QueueStream Does Not Support Checkpointing

2015-08-14 Thread Holden Karau
I just pushed some code that does this for spark-testing-base ( https://github.com/holdenk/spark-testing-base ) (it's in master) and will publish an updated artifact with it tonight. On Fri, Aug 14, 2015 at 3:35 PM, Tathagata Das wrote: > A hacky workaround is to create a custom InputDStre

Re: Spark RuntimeException hadoop output format

2015-08-14 Thread Ted Yu
Please take a look at JavaPairDStream.scala: def saveAsHadoopFiles[F <: OutputFormat[_, _]]( prefix: String, suffix: String, keyClass: Class[_], valueClass: Class[_], outputFormatClass: Class[F]) { Did you intend to use outputPath as prefix ? Cheers On Fri, Aug 14
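In other words, the output path goes into the prefix: Spark names each batch's output "prefix-<batch time in ms>.suffix". A Scala sketch of the call, assuming wordCounts is a DStream[(String, Int)] and using an invented HDFS path:

    import org.apache.hadoop.mapred.TextOutputFormat

    wordCounts.saveAsHadoopFiles(
      "hdfs:///user/mohit/output/counts",   // prefix: includes the directory you want
      "txt",                                // suffix
      classOf[String],
      classOf[Int],
      classOf[TextOutputFormat[String, Int]])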

Re: Spark RuntimeException hadoop output format

2015-08-14 Thread Mohit Anchlia
I thought prefix meant the output path? What's the purpose of prefix and where do I specify the path if not in prefix? On Fri, Aug 14, 2015 at 4:36 PM, Ted Yu wrote: > Please take a look at JavaPairDStream.scala: > def saveAsHadoopFiles[F <: OutputFormat[_, _]]( > prefix: String, >

Executors on multiple nodes

2015-08-14 Thread Mohit Anchlia
I am running on Yarn and do have a question on how spark runs executors on different data nodes. Is that primarily decided based on number of receivers? What do I need to do to ensure that multiple nodes are being used for data processing?

Too many files/dirs in hdfs

2015-08-14 Thread Mohit Anchlia
Spark Streaming seems to be creating 0-byte files even when there is no data. Also, I have 2 concerns here: 1) Extra unnecessary files are being created in the output 2) Hadoop doesn't work really well with too many files, and I see that it is creating a directory with a timestamp every 1 second. Is

Re: Spark RuntimeException hadoop output format

2015-08-14 Thread Ted Yu
First you create the file: final File outputFile = new File(outputPath); Then you write to it: Files.append(counts + "\n", outputFile, Charset.defaultCharset()); Cheers On Fri, Aug 14, 2015 at 4:38 PM, Mohit Anchlia wrote: > I thought prefix meant the output path? What's the purpo

Re: QueueStream Does Not Support Checkpointing

2015-08-14 Thread Asim Jalis
I feel the real fix here is to remove the exception from QueueInputDStream class by reverting the fix of https://issues.apache.org/jira/browse/SPARK-8630 I can write another class that is identical to the QueueInputDStream class except it does not throw the exception. But this feels like a convolu

Re: QueueStream Does Not Support Checkpointing

2015-08-14 Thread Asim Jalis
Another fix might be to remove the exception that is thrown when windowing and other stateful operations are used without checkpointing. On Fri, Aug 14, 2015 at 5:43 PM, Asim Jalis wrote: > I feel the real fix here is to remove the exception from QueueInputDStream > class by reverting the fix of

Re: Checkpointing doesn't appear to be working for direct streaming from Kafka

2015-08-14 Thread Dmitry Goldenberg
Thanks, Cody. It sounds like Spark Streaming has enough state info to know how many batches have been processed and if not all of them then the RDD is 'unfinished'. I wonder if it would know whether the last micro-batch has been fully processed successfully. Hypothetically, the driver program could

Re: Left outer joining big data set with small lookups

2015-08-14 Thread Raghavendra Pandey
In spark 1.4 there is a parameter to control that. Its default value is 10 M. So you need to cache your dataframe to hint the size. On Aug 14, 2015 7:09 PM, "VIJAYAKUMAR JAWAHARLAL" wrote: > Hi > > I am facing huge performance problem when I am trying to left outer join > very big data set (~140G
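The parameter being referred to is presumably spark.sql.autoBroadcastJoinThreshold (10MB by default). A sketch of raising it and caching the lookup side so its size is known; names and sizes are illustrative:

    // Allow tables up to ~50MB to be broadcast.
    sqlContext.setConf("spark.sql.autoBroadcastJoinThreshold", (50 * 1024 * 1024).toString)

    lookupDf.cache()
    lookupDf.count()   // materialize so the optimizer has size statistics

    val joined = bigDf.join(lookupDf, bigDf("customer_id") === lookupDf("id"), "left_outer")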

How to save a string to a text file ?

2015-08-14 Thread go canal
Hello again, online resources have sample code for writing an RDD to a file, but I have a simple string; how do I save it to a text file? (My data is actually a DenseMatrix.) Appreciate any help! thanks, canal

Re: How to save a string to a text file ?

2015-08-14 Thread Brandon White
Convert it to an RDD, then save the RDD to a file: val str = "dank memes" sc.parallelize(List(str)).saveAsTextFile("str.txt") On Fri, Aug 14, 2015 at 7:50 PM, go canal wrote: > Hello again, > online resources have sample code for writing RDD to a file, but I have a > simple string, how to save to

Re: How to save a string to a text file ?

2015-08-14 Thread go canal
Thank you very much. Just a quick question - I tried to save the string this way but the file is always empty: val file = Path("sample data/ZN_SPARK.OUT").createFile(true) file.bufferedWriter().write(im.toString()) file.bufferedWriter().flush() file.bufferedWriter().close() anything w
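A likely culprit in that snippet: each call to bufferedWriter() creates a new writer, so the writer that actually did the write is never flushed or closed. A sketch of a fix using plain java.io instead of the Path helper (im is the DenseMatrix from the earlier mail):

    import java.io.{BufferedWriter, FileWriter}

    val writer = new BufferedWriter(new FileWriter("sample data/ZN_SPARK.OUT"))
    try {
      writer.write(im.toString)
    } finally {
      writer.close()   // close() also flushes the buffer
    }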