Re: SparkSQL: Freezing while running TPC-H query 5

2014-09-23 Thread Samay
Hey Dan, Thanks for your reply. I have a couple of questions. 1) Were you able to verify that this is because of GC? If yes, then could you let me know how. 2) If this is GC, then do you know of any tuning I can do to reduce this GC pause? Regards, Samay On Tue, Sep 23, 2014 at 11:15 PM, Dan D

Re: HdfsWordCount only counts some of the words

2014-09-23 Thread aka.fe2s
I guess it's because this example is stateless, so it outputs counts only for the given RDD. Take a look at the stateful word counter StatefulNetworkWordCount.scala On Wed, Sep 24, 2014 at 4:29 AM, SK wrote: > > I execute it as follows: > > $SPARK_HOME/bin/spark-submit --master --class > org.apache.spark
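
For reference, a minimal sketch of the stateful variant using updateStateByKey, loosely modeled on StatefulNetworkWordCount (the batch interval, checkpoint directory, and input path are illustrative, not taken from the thread):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.StreamingContext._

    object StatefulHdfsWordCount {
      def main(args: Array[String]): Unit = {
        // Add this batch's counts to the running total kept for each word
        val updateFunc = (values: Seq[Int], state: Option[Int]) =>
          Some(values.sum + state.getOrElse(0))

        val ssc = new StreamingContext(
          new SparkConf().setAppName("StatefulHdfsWordCount"), Seconds(10))
        ssc.checkpoint("checkpoint")  // updateStateByKey requires a checkpoint dir

        val words = ssc.textFileStream(args(0)).flatMap(_.split(" "))
        val runningCounts = words.map((_, 1)).updateStateByKey[Int](updateFunc)
        runningCounts.print()

        ssc.start()
        ssc.awaitTermination()
      }
    }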

Re: Spark streaming stops computing while the receiver keeps running without any errors reported

2014-09-23 Thread Aniket Bhatnagar
Hi all, I was finally able to get this working by setting SPARK_EXECUTOR_INSTANCES to a high number. However, I am wondering if this is a bug, because the app gets submitted but ceases to run because it can't run the desired number of workers. Shouldn't the app be rejected if it can't be run on the c

Can not see any spark metrics on ganglia-web

2014-09-23 Thread tsingfu
I installed ganglia, and I think it worked well for hadoop and hbase, for I can see hadoop/hbase metrics on ganglia-web. I want to use ganglia to monitor spark, and I followed these steps: 1) first I did a custom compile with -Pspark-ganglia-lgpl, and it succeeded without warnings. ./make-distri

RE: how long does it take executing ./sbt/sbt assembly

2014-09-23 Thread Shao, Saisai
If you have enough memory, the speed will be faster, within one minute, since most of the files are cached. Also, you can build your Spark project on a mounted ramfs in Linux; this will also speed up the process. Thanks Jerry -Original Message- From: Zhan Zhang [mailto:zzh...@hortonwork

java.lang.stackoverflowerror when running Spark shell

2014-09-23 Thread mrshen
I tested the examples according to the docs in the spark sql programming guide, but the java.lang.StackOverflowError occurred every time I called sqlContext.sql("..."). Meanwhile, it worked fine in a hiveContext. The Hadoop version is 2.2.0, the Spark version is 1.1.0, built with Yarn, Hive.. I would b

Re: how long does it take executing ./sbt/sbt assembly

2014-09-23 Thread Tobias Pfeiffer
Hi, http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-with-Kafka-building-project-with-sbt-assembly-is-extremely-slow-td13152.html --> Maybe related to this? Tobias

Re: how long does it take executing ./sbt/sbt assembly

2014-09-23 Thread Zhan Zhang
Definitely something wrong. For me, 10 to 30 minutes. Thanks. Zhan Zhang On Sep 23, 2014, at 10:02 PM, christy <760948...@qq.com> wrote: > This process began yesterday and it has already run for more than 20 hours. > Is it normal? Any one has the same problem? No error throw out yet. > > > >

Re: Converting one RDD to another

2014-09-23 Thread Zhan Zhang
Here is my understanding: def takeOrdered(num: Int)(implicit ord: Ordering[T]): Array[T] = { if (num == 0) { //if 0, return empty array Array.empty } else { mapPartitions { items => //map each partition to a new one whose iterator consists of the single queue, wh
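
The preview above is cut off, so as a hedged illustration of the same idea (not the actual Spark source, which uses a bounded priority queue), here is a self-contained sketch: each partition keeps only its own `num` smallest elements, and the per-partition results are then merged:

    import org.apache.spark.{SparkConf, SparkContext}

    object TakeOrderedSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("takeOrdered-sketch"))
        val rdd = sc.parallelize(Seq(5, 1, 9, 3, 7, 2), 3)
        val num = 3

        val smallest: Array[Int] =
          if (num == 0) Array.empty[Int]
          else rdd
            // each partition contributes only its own `num` smallest elements
            .mapPartitions(items => Iterator.single(items.toArray.sorted.take(num)))
            // merge the partial results, again keeping only `num`
            .reduce((a, b) => (a ++ b).sorted.take(num))

        println(smallest.mkString(", "))  // 1, 2, 3
        sc.stop()
      }
    }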

how long does it take executing ./sbt/sbt assembly

2014-09-23 Thread christy
This process began yesterday and it has already run for more than 20 hours. Is this normal? Does anyone have the same problem? No errors have been thrown yet. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/how-long-does-it-take-executing-sbt-sbt-assembly-tp14975.html Sent

Re: Where can I find the module diagram of SPARK?

2014-09-23 Thread David Rowe
Hi Andrew, I can't speak for Theodore, but I would find that incredibly useful. Dave On Wed, Sep 24, 2014 at 11:24 AM, Andrew Ash wrote: > Hi Theodore, > > What do you mean by module diagram? A high level architecture diagram of > how the classes are organized into packages? > > Andrew > > On

Converting one RDD to another

2014-09-23 Thread Deep Pradhan
Hi, Is it always possible to get one RDD from another? For example, if I do a *top(K)(Ordering)*, I get an Int, right? (In my example the type is Int.) I do not get an RDD. Can anyone explain this to me? Thank You
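
A tiny sketch for the question above: top(k) returns a local Array on the driver, not an RDD, but the result can always be turned back into an RDD with parallelize if needed:

    import org.apache.spark.{SparkConf, SparkContext}

    object TopExample {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("top-example"))
        val rdd = sc.parallelize(Seq(5, 1, 9, 3))

        val top2: Array[Int] = rdd.top(2)  // Array(9, 5): a local array, not an RDD
        val asRdd = sc.parallelize(top2)   // ...but it can be re-parallelized

        println(asRdd.collect().mkString(", "))
        sc.stop()
      }
    }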

java.lang.OutOfMemoryError while running SVD MLLib example

2014-09-23 Thread sbir...@wynyardgroup.com
Hello, I am new to Spark. I have downloaded Spark 1.1.0 and am trying to run the TallSkinnySVD.scala example with different input data sizes. I tried input data with a 1000X1000 matrix and a 5000X5000 matrix. Though I faced some Java heap issues, I added the following parameters in "spark-defaults.conf"

Re: SQL status code to indicate success or failure of query

2014-09-23 Thread Michael Armbrust
An exception should be thrown in the case of failure for DDL commands. On Tue, Sep 23, 2014 at 4:55 PM, Du Li wrote: > Hi, > > After executing sql() in SQLContext or HiveContext, is there a way to > tell whether the query/command succeeded or failed? Method sql() returns > SchemaRDD which eit

Re: Out of memory exception in MLlib's naive baye's classification training

2014-09-23 Thread jatinpreet
Xiangrui, Thanks for replying. I am using the subset of newsgroup20 data. I will send you the vectorized data for analysis shortly. I have tried running in local mode as well but I get the same OOM exception. I started with 4GB of data but then moved to a smaller set to verify that everything was

Re: Why recommend 2-3 tasks per CPU core ?

2014-09-23 Thread Andrew Ash
Also, you'd rather have 2-3 tasks per core than 1 task per core, because if 1 task per core is actually 1.01 tasks per core, then you have one full wave of tasks complete and another wave with very few tasks in it. You get better utilization when you're higher than 1. Aaron Davidson goes i
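
A small sketch of putting that guidance into configuration (the executor and core counts are made up; spark.default.parallelism only affects operations that are not given an explicit partition count):

    import org.apache.spark.{SparkConf, SparkContext}

    object ParallelismSizing {
      def main(args: Array[String]): Unit = {
        val totalCores = 10 * 8  // e.g. 10 executors with 8 cores each
        val conf = new SparkConf()
          .setAppName("parallelism-sizing")
          .set("spark.default.parallelism", (totalCores * 3).toString)
        val sc = new SparkContext(conf)
        // ...or pass an explicit partition count to individual shuffles,
        // e.g. rdd.reduceByKey(_ + _, totalCores * 3)
        sc.stop()
      }
    }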

RE: HdfsWordCount only counts some of the words

2014-09-23 Thread SK
I execute it as follows: $SPARK_HOME/bin/spark-submit --master --class org.apache.spark.examples.streaming.HdfsWordCount target/scala-2.10/spark_stream_examples-assembly-1.0.jar After I start the job, I add a new test file in hdfsdir. It is a large text file which I will not be able to c

Re: Where can I find the module diagram of SPARK?

2014-09-23 Thread Andrew Ash
Hi Theodore, What do you mean by module diagram? A high level architecture diagram of how the classes are organized into packages? Andrew On Tue, Sep 23, 2014 at 12:46 AM, Theodore Si wrote: > Hi, > > Please help me with that. > > BR, > Theodore Si > >

Re: Access resources from jar-local resources folder

2014-09-23 Thread Andrew Ash
Hi Roberto, - Try using MyApp.class.getClassLoader.getResource instead of starting with getClass() to make sure you're pulling from your jar instead of another one - Can you confirm that the perl script is being included in your resulting jar? Check with "jar tf" to get a file list - What path ar
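
A minimal sketch of the first suggestion (MyApp and the script path are placeholders, not names from the thread): resolve the resource through the application class's own class loader and fail loudly if it is missing from the jar:

    object MyApp {
      def main(args: Array[String]): Unit = {
        // Use this class's loader so the lookup resolves against our own jar
        val url = MyApp.getClass.getClassLoader.getResource("scripts/myscript.pl")
        if (url == null) {
          sys.error("resource not on the classpath -- check the jar contents with `jar tf`")
        }
        println(scala.io.Source.fromURL(url).mkString.take(200))
      }
    }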

Re: spark time out

2014-09-23 Thread Andrew Ash
Hi Chen, The fetch failures seem to be happening a lot more to people on 1.1.0 -- there's a bug tracking fetch failures at https://issues.apache.org/jira/browse/SPARK-3633 that might be the same as what you're seeing. Can you take a peek at that bug and if it matches what you're observing follow

Re: General question on persist

2014-09-23 Thread Liquan Pei
For sortByKey, I need more reading to answer your question. For your second question, you can do partitioned.cache() to avoid multiple data shuffling. -Liquan On Tue, Sep 23, 2014 at 6:08 PM, Arun Ahuja wrote: > Thanks for the answers - from what I've read doing a sortByKey still does > not gi

Garbage Collection

2014-09-23 Thread dizzy5112
Hi all, just wondering what sort of GC ratio there should be on a task. I've attached a screen shot from the UI and it seems as though these might be a bit high? -- View this message in context: http://apache-spark-user-list.1

RE: HdfsWordCount only counts some of the words

2014-09-23 Thread Liu, Raymond
It should count all the words, so you probably need to post more details on how you run it and the log, output etc. Best Regards, Raymond Liu -Original Message- From: SK [mailto:skrishna...@gmail.com] Sent: Wednesday, September 24, 2014 5:04 AM To: u...@spark.incubator.apache.org Subje

RE: spark.local.dir and spark.worker.dir not used

2014-09-23 Thread Liu, Raymond
When did you check the dir’s contents? When the application finishes, those dirs will be cleaned. Best Regards, Raymond Liu From: Chitturi Padma [mailto:learnings.chitt...@gmail.com] Sent: Tuesday, September 23, 2014 8:33 PM To: u...@spark.incubator.apache.org Subject: Re: spark.local.dir and sp

Adding extra classpath

2014-09-23 Thread Victor Tso-Guillen
So we have an extra jar that we need to get loaded in the spark executor runtime before log4j loads (yes, you've guessed it, it's a custom appender!). We've tried putting it in spark-defaults.conf and restarting our application, but it didn't work. I'm kinda leery of putting the user classpath firs

Re: Does anyone have experience with using Hadoop InputFormats?

2014-09-23 Thread Liquan Pei
Hi Steve, Did you try the newAPIHadoopFile? That worked for us. Thanks, Liquan On Tue, Sep 23, 2014 at 5:43 PM, Steve Lewis wrote: > Well I had one and tried that - my message tells what I found found > 1) Spark only accepts org.apache.hadoop.mapred.InputFormat > not org.apache.had

Re: Does anyone have experience with using Hadoop InputFormats?

2014-09-23 Thread Andrew Ash
Hi Steve, Hadoop has both old-style and new-style APIs -- Java package has "mapred" vs "mapreduce". Spark supports both of these via the sc.hadoopFile() and sc.newAPIHadoopFile(). Maybe you need to switch to the newAPI versions of those methods? On Tue, Sep 23, 2014 at 5:43 PM, Steve Lewis wro
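
A minimal sketch contrasting the two entry points (the HDFS path and TextInputFormat are just examples):

    import org.apache.hadoop.io.{LongWritable, Text}
    import org.apache.spark.{SparkConf, SparkContext}

    object HadoopApiExample {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("hadoop-api-example"))

        // Old-style API: format classes from org.apache.hadoop.mapred
        val oldApi = sc.hadoopFile[LongWritable, Text,
          org.apache.hadoop.mapred.TextInputFormat]("hdfs:///data/input")

        // New-style API: format classes from org.apache.hadoop.mapreduce.lib.input
        val newApi = sc.newAPIHadoopFile[LongWritable, Text,
          org.apache.hadoop.mapreduce.lib.input.TextInputFormat]("hdfs:///data/input")

        // Writables are not serializable, so convert to plain types before shuffling
        val lines = newApi.map { case (_, text) => text.toString }
        println(lines.count())
        sc.stop()
      }
    }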

Re: Worker Random Port

2014-09-23 Thread Andrew Ash
Hi Paul, There are several ports you need to configure in order to run in a tight network environment. It sounds like the DMZ that contains the spark cluster is wide open internally, but you have to poke holes between it and the driver. You should take a look at the port configuration opti

Re: Exception with SparkSql and Avro

2014-09-23 Thread Zalzberg, Idan (Agoda)
Thanks, I didn't create the tables myself as I have no control over that process. However these tables are read just fine using the Jdbc connection to the hiveserver2, so it should be possible. On Sep 24, 2014 12:48 AM, Michael Armbrust wrote: Can you show me the DDL you are using? Here is an exa

Re: Does anyone have experience with using Hadoop InputFormats?

2014-09-23 Thread Steve Lewis
Well I had one and tried that - my message tells what I found: 1) Spark only accepts org.apache.hadoop.mapred.InputFormat, not org.apache.hadoop.mapreduce.InputFormat 2) Hadoop expects K and V to be Writables - I always use Text - Text is not Serializable and will not work with Spark - StringB

Re: Out of memory exception in MLlib's naive baye's classification training

2014-09-23 Thread Xiangrui Meng
Your dataset is small. NaiveBayes should work under the default settings, even in local mode. Could you try local mode first without changing any Spark settings? Since your dataset is small, could you save the vectorized data (RDD[LabeledPoint]) and send me a sample? I want to take a look at the fea

Re: Does anyone have experience with using Hadoop InputFormats?

2014-09-23 Thread Liquan Pei
Hi Steve, Here is my understanding, as long as you implement InputFormat, you should be able to use hadoopFile API in SparkContext to create an RDD. Suppose you have a customized InputFormat which we call CustomizedInputFormat where K is the key type and V is the value type. You can create an RDD

Re: Spark Code to read RCFiles

2014-09-23 Thread Matei Zaharia
Is your file managed by Hive (and thus present in a Hive metastore)? In that case, Spark SQL (https://spark.apache.org/docs/latest/sql-programming-guide.html) is the easiest way. Matei On September 23, 2014 at 2:26:10 PM, Pramod Biligiri (pramodbilig...@gmail.com) wrote: Hi, I'm trying to re
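
A minimal sketch of that route, assuming the RCFile data is registered as a table in the Hive metastore (the table name is made up):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.hive.HiveContext

    object ReadRcFileTable {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("rcfile-read"))
        val hiveContext = new HiveContext(sc)

        // The RCFile storage details come from the metastore, so a plain query suffices
        hiveContext.sql("SELECT * FROM my_rcfile_table LIMIT 10")
          .collect()
          .foreach(println)
        sc.stop()
      }
    }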

IOException running streaming job

2014-09-23 Thread Emil Gustafsson
I'm trying out some streaming with spark and I'm getting an error that puzzles me, since I'm new to Spark. I get this error all the time; 1-2 batches in the stream are sometimes processed before the job stops, but never the complete job, and often no batch is processed at all. I use Spark 1.1.0. The job is

Re: Multiple Kafka Receivers and Union

2014-09-23 Thread Tim Smith
Maybe post the before-code as in what was the code before you did the loop (that worked)? I had similar situations where reviewing code before (worked) and after (does not work) helped. Also, what helped is the Scala REPL because I can see what are the object types being returned by each statement.

Does anyone have experience with using Hadoop InputFormats?

2014-09-23 Thread Steve Lewis
When I experimented with using an InputFormat I had used in Hadoop for a long time, I found 1) it must extend org.apache.hadoop.mapred.FileInputFormat (the deprecated class, not org.apache.hadoop.mapreduce.lib.input.FileInputFormat) 2) initialize needs to be called in the constructor 3) The

SQL status code to indicate success or failure of query

2014-09-23 Thread Du Li
Hi, After executing sql() in SQLContext or HiveContext, is there a way to tell whether the query/command succeeded or failed? Method sql() returns SchemaRDD which either is empty or contains some Rows of results. However, some queries and commands do not return results by nature; being empty is

Re: Multiple Kafka Receivers and Union

2014-09-23 Thread Matt Narrell
To my eyes, these are functionally equivalent. I’ll try a Scala approach, but this may cause waves for me upstream (e.g., non-Java) Thanks for looking at this. If anyone else can see a glaring issue in the Java approach that would be appreciated. Thanks, Matt On Sep 23, 2014, at 4:13 PM, Tim

Re: Multiple Kafka Receivers and Union

2014-09-23 Thread Tim Smith
Sorry, I am almost Java illiterate, but here's my Scala code to do the equivalent (that I have tested to work): val kInStreams = (1 to 10).map{_ => KafkaUtils.createStream(ssc, "zkhost.acme.net:2182", "myGrp", Map("myTopic" -> 1), StorageLevel.MEMORY_AND_DISK_SER) } //Create 10 receivers across the clus
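
Since the code in the message is truncated, here is a self-contained sketch of the same multi-receiver pattern (the ZooKeeper host, group, and topic are examples): create N receivers and union them into one DStream before further processing:

    import org.apache.spark.SparkConf
    import org.apache.spark.storage.StorageLevel
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka.KafkaUtils

    object MultiReceiverUnion {
      def main(args: Array[String]): Unit = {
        val ssc = new StreamingContext(
          new SparkConf().setAppName("multi-receiver"), Seconds(10))

        // One receiver per createStream call; ten receivers spread over the executors
        val kInStreams = (1 to 10).map { _ =>
          KafkaUtils.createStream(ssc, "zkhost.acme.net:2182", "myGrp",
            Map("myTopic" -> 1), StorageLevel.MEMORY_AND_DISK_SER)
        }

        // Union them into a single DStream before any further processing
        val unified = ssc.union(kInStreams)
        unified.map(_._2).count().print()

        ssc.start()
        ssc.awaitTermination()
      }
    }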

Worker Random Port

2014-09-23 Thread Paul Magid
I am trying to get an edge server up and running connecting to our Spark 1.1 cluster. The edge server is in a different DMZ than the rest of the cluster and we have to specifically open firewall ports between the edge server and the rest of the cluster. I can log on to any node in the cluster

Re: Sorting a Table in Spark RDD

2014-09-23 Thread Victor Tso-Guillen
You could pluck out each column in separate rdds, sort them independently, and zip them :) On Tue, Sep 23, 2014 at 2:40 PM, Areg Baghdasaryan (BLOOMBERG/ 731 LEX -) < abaghdasa...@bloomberg.net> wrote: > Hello, > > So I have crated a table in in RDD in spark in thei format: > col1 col2 >
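
A rough sketch of that suggestion, using made-up data matching the question (note that zip requires matching element counts per partition, which is why this tiny example sorts into a single partition):

    import org.apache.spark.{SparkConf, SparkContext}

    object SortColumnsIndependently {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("sort-columns"))
        val table = sc.parallelize(Seq((10, 11), (12, 8), (9, 13), (2, 3)))

        // Sort each column on its own, then zip the sorted columns back together
        val col1 = table.map(_._1).sortBy(identity, ascending = true, numPartitions = 1)
        val col2 = table.map(_._2).sortBy(identity, ascending = true, numPartitions = 1)

        col1.zip(col2).collect().foreach(println)  // (2,3), (9,8), (10,11), (12,13)
        sc.stop()
      }
    }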

Running Spark from an Intellij worksheet - akka.version error

2014-09-23 Thread adrian
I have an SBT Spark project compiling fine in Intellij. However when I try to create a SparkContext from a worksheet: import org.apache.spark.SparkContext val sc1 = new SparkContext("local[8]", "sc1") I get this error: com.typesafe.config.ConfigException$Missing: No configuration setting found f

Re: clarification for some spark on yarn configuration options

2014-09-23 Thread Andrew Or
Yes... good find. I have filed a JIRA here: https://issues.apache.org/jira/browse/SPARK-3661 and will get to fixing it shortly. Both of these fixes will be available in 1.1.1. Until both of these are merged in, it appears that the only way you can do it now is through --driver-memory. -Andrew 201

Re: RDD data checkpoint cleaning

2014-09-23 Thread RodrigoB
Just a follow-up. Just to make sure about the RDDs not being cleaned up, I just replayed the app both on the windows remote laptop and then on the linux machine and at the same time was observing the RDD folders in HDFS. Confirming the observed behavior: running on the laptop I could see the RDDs

Re: General question on persist

2014-09-23 Thread Arun Ahuja
Thanks Liquan, that makes sense, but if I am only doing the computation once, there will essentially be no difference, correct? I had a second question related to mapPartitions: 1) All of the records of the Iterator[T] that a single function call in mapPartitions processes must fit into memory, correct?

Sorting a Table in Spark RDD

2014-09-23 Thread Areg Baghdasaryan (BLOOMBERG/ 731 LEX -)
Hello, So I have created a table in an RDD in spark in this format: col1 col2 --- 1. 10 11 2. 12 8 3. 9 13 4. 2 3 And the RDD is distributed by the rows (rows 1, 2 on one node and rows 3, 4 on another). I want to sort each column of the table so tha

Sorting a table in Spark

2014-09-23 Thread Areg Baghdasaryan (BLOOMBERG/ 731 LEX -)
Hello,

Re: RDD data checkpoint cleaning

2014-09-23 Thread RodrigoB
Hi TD, thanks for getting back on this. Yes, that's what I was experiencing - data checkpoints were being recovered from a considerable time before the last data checkpoint, probably since the beginning of the first writes; I would have to confirm. I have some development on this though. These results

Spark Code to read RCFiles

2014-09-23 Thread Pramod Biligiri
Hi, I'm trying to read some data in RCFiles using Spark, but can't seem to find a suitable example anywhere. Currently I've written the following bit of code that lets me count() the no. of records, but when I try to do a collect() or a map(), it fails with a ConcurrentModificationException. I'm ru

Re: General question on persist

2014-09-23 Thread Liquan Pei
Hi Arun, The intermediate results like keyedRecordPieces will not be materialized. This indicates that if you run partitoned = keyedRecordPieces.partitionBy(KeyPartitioner) partitoned.mapPartitions(doComputation).save() again, the keyedRecordPieces will be re-computed. In this case, cache or p
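
A minimal, self-contained sketch of that advice (the data and partitioner are made up): calling cache() right after partitionBy keeps the shuffled layout around, so later passes over the partitioned RDD do not redo the shuffle:

    import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}
    import org.apache.spark.SparkContext._

    object PersistPartitioned {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("persist-partitioned"))
        val records = sc.parallelize(Seq("a b", "b c", "c a"))

        val keyedRecordPieces = records.flatMap(_.split(" ")).map(w => (w, 1))
        val partitoned = keyedRecordPieces.partitionBy(new HashPartitioner(4)).cache()

        // The first action materializes (and caches) the partitioned data...
        println(partitoned.mapPartitions(it => Iterator(it.size)).collect().toSeq)
        // ...subsequent passes reuse the cached partitions instead of re-shuffling
        println(partitoned.reduceByKey(_ + _).collect().toSeq)
        sc.stop()
      }
    }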

Re: Multiple Kafka Receivers and Union

2014-09-23 Thread Matt Narrell
So, this is scrubbed some for confidentiality, but the meat of it is as follows. Note, that if I substitute the commented section for the loop, I receive messages from the topic. SparkConf sparkConf = new SparkConf(); sparkConf.set("spark.streaming.unpersist", "true"); sparkConf.set("spark.logC

Re: RDD data checkpoint cleaning

2014-09-23 Thread Tathagata Das
I am not sure what you mean by the data checkpoint continuously increasing, leading to the recovery process taking time. Do you mean that in HDFS you are seeing rdd checkpoint files being continuously written but never being deleted? On Tue, Sep 23, 2014 at 2:40 AM, RodrigoB wrote: > Hi all, > > I've just

General question on persist

2014-09-23 Thread Arun Ahuja
I have a general question on when persisting will be beneficial and when it won't: I have a task that runs as follows keyedRecordPieces = records.flatMap( record => Seq(key, recordPieces)) partitoned = keyedRecordPieces.partitionBy(KeyPartitioner) partitoned.mapPartitions(doComputation).save()

HdfsWordCount only counts some of the words

2014-09-23 Thread SK
Hi, I tried out the HdfsWordCount program in the Streaming module on a cluster. Based on the output, I find that it counts only a few of the words. How can I have it count all the words in the text? I have only one text file in the directory. thanks -- View this message in context: http://a

Re: Spark SQL 1.1.0 - large insert into parquet runs out of memory

2014-09-23 Thread Dan Dietterich
I have only been using spark through the SQL front-end (CLI or JDBC). I don't think I have access to saveAsParquetFile from there, do I? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-1-1-0-large-insert-into-parquet-runs-out-of-memory-tp14924p1492

Re: NullPointerException on reading checkpoint files

2014-09-23 Thread Tathagata Das
This is actually very tricky, as there are two pretty big challenges that need to be solved. (i) Checkpointing for broadcast variables: Unlike RDDs, broadcast variables don't have checkpointing support (that is, you cannot write the content of a broadcast variable to HDFS and recover it automatically w

Re: Multiple Kafka Receivers and Union

2014-09-23 Thread Tim Smith
Posting your code would be really helpful in figuring out gotchas. On Tue, Sep 23, 2014 at 9:19 AM, Matt Narrell wrote: > Hey, > > Spark 1.1.0 > Kafka 0.8.1.1 > Hadoop (YARN/HDFS) 2.5.1 > > I have a five partition Kafka topic. I can create a single Kafka receiver > via KafkaUtils.createStream wi

Re: Spark SQL 1.1.0 - large insert into parquet runs out of memory

2014-09-23 Thread Michael Armbrust
I would hope that things should work for this kind of workflow. I'm curious if you have tried using saveAsParquetFile instead of inserting directly into a hive table (you could still register this as an external table afterwards). Right now inserting into Hive tables is going through their Ser
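
A minimal sketch of the saveAsParquetFile route (the paths and the two-column schema are invented; registering the result as an external table afterwards is left out):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.hive.HiveContext

    case class Record(id: Long, name: String)

    object CsvToParquet {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("csv-to-parquet"))
        val hiveContext = new HiveContext(sc)
        import hiveContext.createSchemaRDD

        // Parse the CSV into case-class records; Spark SQL infers the schema
        val records = sc.textFile("hdfs:///data/input.csv").map { line =>
          val cols = line.split(",")
          Record(cols(0).toLong, cols(1))
        }

        // Write Parquet directly instead of INSERTing through the Hive SerDe path
        records.saveAsParquetFile("hdfs:///data/output.parquet")
        sc.stop()
      }
    }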

Spark SQL 1.1.0 - large insert into parquet runs out of memory

2014-09-23 Thread Dan Dietterich
I am trying to load data from csv format into parquet using Spark SQL. It consistently runs out of memory. The environment is: * standalone cluster using HDFS and Hive metastore from HDP2.0 * spark1.1.0 * parquet jar files (v1.5) explicitly added when starting spark-sql.

Re: Java Implementation of StreamingContext.fileStream

2014-09-23 Thread Michael Quinlan
Thanks very much for the pointer, which validated my initial approach. It turns out that I was creating a tag for the abstract class "InputFormat.class". Using "TextInputFormat.class" instead fixed my issue. Regards, Mike -- View this message in context: http://apache-spark-user-list.1001560.

Re: spark1.0 principal component analysis

2014-09-23 Thread Evan R. Sparks
In its current implementation, the principal components are computed in MLlib in two steps: 1) In a distributed fashion, compute the covariance matrix - the result is a local matrix. 2) On this local matrix, compute the SVD. The sorting comes from the SVD. If you want to get the eigenvalues out, y
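
A small sketch of getting both the principal components and the spectrum from MLlib's RowMatrix (the input vectors are made up); computeSVD exposes the singular values that computePrincipalComponents keeps internal:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.linalg.distributed.RowMatrix

    object PcaSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("pca-sketch"))
        val rows = sc.parallelize(Seq(
          Vectors.dense(1.0, 2.0, 3.0),
          Vectors.dense(2.0, 4.0, 1.0),
          Vectors.dense(3.0, 1.0, 5.0)))
        val mat = new RowMatrix(rows)

        // Steps 1 and 2 above in one call: covariance matrix, then a local SVD
        val pc = mat.computePrincipalComponents(2)

        // To inspect the spectrum itself (e.g. when choosing k), run the SVD directly
        val svd = mat.computeSVD(2, computeU = false)
        println(pc)
        println(svd.s)  // singular values, largest first
        sc.stop()
      }
    }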

Re: access javaobject in rdd map

2014-09-23 Thread jamborta
Great. Thanks a lot. On 23 Sep 2014 18:44, "Davies Liu-2 [via Apache Spark User List]" < ml-node+s1001560n14908...@n3.nabble.com> wrote: > Right now, there is no way to access JVM in Python worker, in order > to make this happen, we need to do: > > 1. setup py4j in Python worker > 2. serialize the

Re: spark1.0 principal component analysis

2014-09-23 Thread st553
sowen wrote > it seems that the singular values from the SVD aren't returned, so I don't > know that you can access this directly It's not clear to me why these aren't returned. The S matrix would be useful to determine a reasonable value for K. -- View this message in context: http://apache-sp

Re: java.lang.NegativeArraySizeException in pyspark

2014-09-23 Thread Davies Liu
Or maybe there is a bug related to the base64 in py4j; could you dump the serialized bytes of the closure to verify this? You could add a line in spark/python/pyspark/rdd.py: ser = CloudPickleSerializer() pickled_command = ser.dumps(command) + print len(pickled_command), repr(pi

Re: How to initialize updateStateByKey operation

2014-09-23 Thread Tathagata Das
At a high level, the suggestion sounds good to me. However, regarding code, it's best to submit a Pull Request on the Spark github page for community review. You will find more information here: https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark On Tue, Sep 23, 2014 at 10:11 PM,

Re: Setup an huge Unserializable Object in a mapper

2014-09-23 Thread matthes
I solved it :) I moved the lookupObject into the function where I create the broadcast and now all works very well! object lookupObject { private var treeFile : org.apache.spark.broadcast.Broadcast[String] = _ def main(args: Array[String]): Unit = { … val treeFile = sc.broadcast(args(0)) o
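
The code in the message above is truncated, so the following is only a hedged sketch of one arrangement that avoids a null broadcast (not necessarily the exact fix described): assign the broadcast from main without shadowing the field, and capture the handle in a local val before using it inside a closure:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.broadcast.Broadcast

    object LookupObject {
      private var treeFile: Broadcast[String] = _

      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("broadcast-lookup"))
        treeFile = sc.broadcast(args(0))  // no `val` here, so the field is really set

        // Capture the handle locally; the closure then ships the broadcast handle
        // itself rather than referencing the singleton, which is uninitialized on executors
        val tree = treeFile
        sc.parallelize(1 to 5).map(i => s"$i -> ${tree.value}").collect().foreach(println)
        sc.stop()
      }
    }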

Re: MLlib, what online(streaming) algorithms are available?

2014-09-23 Thread Liquan Pei
Hi Oleksiy, Right now, only streaming linear regression is available in MLlib. There is work in progress on Streaming K-means and Streaming SVM. Please take a look at the following jiras for more information. Streaming K-means https://issues.apache.org/jira/browse/SPARK-3254 Streaming SVM htt

Transient association error on a 3 nodes cluster

2014-09-23 Thread Edwin
I'm running my application on a three node cluster (8 cores each, 12 G memory each), and I receive the following actor error; does anyone have any idea? 14:31:18,061 ERROR [akka.remote.EndpointWriter] (spark-akka.actor.default-dispatcher-17) Transient association error (association remains live): akk

Re: spark time out

2014-09-23 Thread Chen Song
I am running the job on 500 executors, each with 8G and 1 core. See lots of fetch failures on reduce stage, when running a simple reduceByKey map tasks -> 4000 reduce tasks -> 200 On Mon, Sep 22, 2014 at 12:22 PM, Chen Song wrote: > I am using Spark 1.1.0 and have seen a lot of Fetch Failure

Re: Spark 1.1.0 hbase_inputformat.py not work

2014-09-23 Thread freedafeng
I don't know if it's relevant, but I had to compile spark for my specific hbase and hadoop version to make that hbase_inputformat.py work. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-1-1-0-hbase-inputformat-py-not-work-tp14905p14912.html Sent from

Re: Spark SQL CLI

2014-09-23 Thread Michael Armbrust
A workaround for now would be to save the JSON as parquet and the create a metastore parquet table. Using parquet will be much faster for repeated querying. This function might be helpful: import org.apache.spark.sql.hive.HiveMetastoreTypes def createParquetTable(name: String, file: String, sqlC

Re: Spark SQL CLI

2014-09-23 Thread Michael Armbrust
You can't directly query JSON tables from the CLI or JDBC server since temporary tables only live for the life of the Spark Context. This PR will eventually (targeted for 1.2) let you do what you want in pure SQL: https://github.com/apache/spark/pull/2475 On Mon, Sep 22, 2014 at 4:52 PM, Yin Huai

Re: Exception with SparkSql and Avro

2014-09-23 Thread Michael Armbrust
Can you show me the DDL you are using? Here is an example of a way I got the avro serde to work: https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/TestHive.scala#L246 Also, this isn't ready for primetime yet, but a quick plug for some ongoing work: http

Re: access javaobject in rdd map

2014-09-23 Thread Davies Liu
Right now, there is no way to access the JVM in a Python worker. In order to make this happen, we need to: 1. setup py4j in the Python worker 2. serialize the JVM objects and transfer them to executors 3. link the JVM objects and py4j together to get an interface Before this happens, maybe you could try to

Re: Fails to run simple Spark (Hello World) scala program

2014-09-23 Thread Moshe Beeri
Sure, in local mode it works for me as well; the issue is that I ran the master only, and I needed a worker as well. Many thanks, Moshe Beeri. 054-3133943 Email | linkedin On Mon, Sep 22, 2014 at 9:58 AM, Akhil Das-2 [via Apache Spark User List] < ml-node+s1001560n14785...@n

Re: access javaobject in rdd map

2014-09-23 Thread Tamas Jambor
Hi Davies, Thanks for the reply. I saw that you guys do it that way in the code. Is there no other way? I have implemented all the predict functions in scala, so I prefer not to reimplement the whole thing in python. thanks, On Tue, Sep 23, 2014 at 5:40 PM, Davies Liu wrote: > You should create

Spark 1.1.0 hbase_inputformat.py not work

2014-09-23 Thread Gilberto Lira
Hi, I'm trying to run the hbase_inputformat.py example but it's not working. This is the error: Traceback (most recent call last): File "/root/spark/examples/src/main/python/hbase_inputformat.py", line 70, in conf=conf) File "/root/spark/python/pyspark/context.py", line 471, in newAPIHadoopR

Re: access javaobject in rdd map

2014-09-23 Thread Davies Liu
You should create a pure Python object (copy the attributes from Java object), then it could be used in map. Davies On Tue, Sep 23, 2014 at 8:48 AM, jamborta wrote: > Hi all, > > I have a java object that contains a ML model which I would like to use for > prediction (in python). I just want to

Re: How to initialize updateStateByKey operation

2014-09-23 Thread Soumitra Kumar
I thought I did a good job ;-) OK, so what is the best way to initialize the updateStateByKey operation? I have counts from a previous spark-submit, and want to load them in the next spark-submit job. - Original Message - From: "Soumitra Kumar" To: "spark users" Sent: Sunday, September 21, 2014

SparkSQL: Freezing while running TPC-H query 5

2014-09-23 Thread Samay
Hi, I am trying to run TPC-H queries with SparkSQL 1.1.0 CLI with 1 r3.4xlarge master + 20 r3.4xlarge slave machines on EC2 (each machine has 16vCPUs, 122GB memory). The TPC-H scale factor I am using is 1000 (i.e. 1000GB of total data). When I try to run TPC-H query 5, the query hangs for a long

Multiple Kafka Receivers and Union

2014-09-23 Thread Matt Narrell
Hey, Spark 1.1.0 Kafka 0.8.1.1 Hadoop (YARN/HDFS) 2.5.1 I have a five partition Kafka topic. I can create a single Kafka receiver via KafkaUtils.createStream with five threads in the topic map and consume messages fine. Sifting through the user list and Google, I see that it's possible to spl

Re: java.lang.ClassNotFoundException on driver class in executor

2014-09-23 Thread Barrington Henry
Hi Andrew, Thanks for the prompt response. I tried command line and it works fine. But, I want to try from IDE for easier debugging and transparency into code execution. I would try and see if there is any way to get the jar over to the executor from within the IDE. - Barrington > On Sep 21,

Re: Setup an huge Unserializable Object in a mapper

2014-09-23 Thread matthes
Thank you for the answer, and sorry for the double question, but now it works! I have one additional question: is it possible to use a broadcast variable in this object? At the moment I try it in the way below, but the broadcast object is still null. object lookupObject { private var treeFile : org

access javaobject in rdd map

2014-09-23 Thread jamborta
Hi all, I have a java object that contains a ML model which I would like to use for prediction (in python). I just want to iterate the data through a mapper and predict for each value. Unfortunately, this fails when it tries to serialise the object to send it to the nodes. Is there a trick aroun

MLlib, what online(streaming) algorithms are available?

2014-09-23 Thread aka.fe2s
Hi, I'm looking for available online ML algorithms (that improve model with new streaming data). The only one I found is linear regression. Is there anything else implemented as part of MLlib? Thanks, Oleksiy.

recommended values for spark driver memory?

2014-09-23 Thread Greg Hill
I know the recommendation is "it depends", but can people share what sort of memory allocations they're using for their driver processes? I'd like to get an idea of what the range looks like so we can provide sensible defaults without necessarily knowing what the jobs will look like. The custo

Spark 1.1.0 on EC2

2014-09-23 Thread Gilberto Lira
Hi, What is the best way to use version 1.1.0 of Spark on EC2? Att, Giba

Re: clarification for some spark on yarn configuration options

2014-09-23 Thread Greg Hill
Thanks for looking into it. I'm trying to avoid making the user pass in any parameters by configuring it to use the right values for the cluster size by default, hence my reliance on the configuration. I'd rather just use spark-defaults.conf than the environment variables, and looking at the c

Re: Distributed dictionary building

2014-09-23 Thread Nan Zhu
shall we document this in the API doc? Best, -- Nan Zhu On Sunday, September 21, 2014 at 12:18 PM, Debasish Das wrote: > zipWithUniqueId is also affected... > > I had to persist the dictionaries to make use of the indices lower down in > the flow... > > On Sun, Sep 21, 2014 at 1:15 AM, S
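
A minimal sketch of the pattern discussed in this thread (the input terms are made up): build the id dictionary with zipWithUniqueId and persist it immediately, since the assigned ids are not guaranteed to be reproduced if the RDD is recomputed:

    import org.apache.spark.{SparkConf, SparkContext}

    object DictionaryBuild {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("dictionary-build"))
        val terms = sc.parallelize(Seq("apple", "banana", "cherry", "apple")).distinct()

        // Persist right away so downstream uses all see the same term -> id mapping
        val dictionary = terms.zipWithUniqueId().cache()

        dictionary.collect().foreach { case (term, id) => println(s"$term -> $id") }
        sc.stop()
      }
    }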

Re: Distributed dictionary building

2014-09-23 Thread Sean Owen
Yes, Matei made a JIRA last week and I just suggested a PR: https://github.com/apache/spark/pull/2508 On Sep 23, 2014 2:55 PM, "Nan Zhu" wrote: > shall we document this in the API doc? > > Best, > > -- > Nan Zhu > > On Sunday, September 21, 2014 at 12:18 PM, Debasish Das wrote: > > zipWithUnique

Re: Distributed dictionary building

2014-09-23 Thread Nan Zhu
great, thanks -- Nan Zhu On Tuesday, September 23, 2014 at 9:58 AM, Sean Owen wrote: > Yes, Matei made a JIRA last week and I just suggested a PR: > https://github.com/apache/spark/pull/2508 > On Sep 23, 2014 2:55 PM, "Nan Zhu" (mailto:zhunanmcg...@gmail.com)> wrote: > > shall we document t

TorrentBroadcast causes java.io.IOException: unexpected exception type

2014-09-23 Thread Arun Ahuja
Since upgrading to Spark 1.1 we have been seeing the following error in the logs: 14/09/23 02:14:42 ERROR executor.Executor: Exception in task 1087.0 in stage 0.0 (TID 607) java.io.IOException: unexpected exception type at java.io.ObjectStreamClass.throwMiscException(ObjectStreamClass.java

RE: spark.local.dir and spark.worker.dir not used

2014-09-23 Thread Shao, Saisai
This folder will be created when you start your Spark application under your spark.local.dir, with the name “spark-local-xxx” as a prefix. It’s quite strange that you don’t see this folder; maybe you are missing something. Besides, if Spark cannot create this folder on start, persisting RDDs to disk will fail

Error launching spark application from Windows to Linux YARN Cluster - Could not find or load main class org.apache.spark.deploy.yarn.ExecutorLauncher

2014-09-23 Thread dxrodri
I am trying to submit a simple SparkPi application from a windows machine which has spark 1.0.2 to a hadoop 2.3.0 cluster running on Linux. SparkPi application can be launched and executed successfully when running on the Linux machine, however, I get the following error when I launch from Window

Re: spark.local.dir and spark.worker.dir not used

2014-09-23 Thread Chitturi Padma
I couldn't even see the spark- folder in the default /tmp directory of local.dir. On Tue, Sep 23, 2014 at 6:01 PM, Priya Ch wrote: > Is it possible to view the persisted RDD blocks ? > > If I use YARN, RDD blocks would be persisted to hdfs then will i be able > to read the hdfs blocks as

Re: spark.local.dir and spark.worker.dir not used

2014-09-23 Thread Chitturi Padma
Is it possible to view the persisted RDD blocks? If I use YARN, RDD blocks would be persisted to hdfs; then will I be able to read the hdfs blocks as I could do in hadoop? On Tue, Sep 23, 2014 at 5:56 PM, Shao, Saisai [via Apache Spark User List] < ml-node+s1001560n14885...@n3.nabble.com> wrote:

RE: spark.local.dir and spark.worker.dir not used

2014-09-23 Thread Shao, Saisai
Hi, Spark.local.dir is the one used to write map output data and persistent RDD blocks, but the path of the file has been hashed, so you cannot directly find the persistent rdd block files; but they will definitely be in these folders on your worker node. Thanks Jerry From: Priya Ch [mailto:learnin

Re: broadcast variable get cleaned by ContextCleaner unexpectedly ?

2014-09-23 Thread RodrigoB
Could you by any chance be using getOrCreate for the StreamingContext creation? I've seen this happen when I tried to first create the Spark context, then create the broadcast variables, and then recreate the StreamingContext from the checkpoint directory. So the worker process cannot find the

Re: Recommended ways to pass functions

2014-09-23 Thread Yanbo Liang
Both of these kinds of functions are OK, but you need to make your class extend Serializable. However, passing functions this way does not avoid sending the enclosing object's data. If you define a function which does not use member parameters of a class or object, you can use a val-like definition instead. For exa
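
A short sketch of the two options (class and method names are made up): either the enclosing class is Serializable so it can be shipped with the closure, or the function lives in a standalone object with no instance state:

    import org.apache.spark.rdd.RDD

    // Option 1: the class extends Serializable, so closures may reference its members
    class Doubler extends Serializable {
      val factor = 2
      def apply(rdd: RDD[Int]): RDD[Int] = rdd.map(_ * factor)
    }

    // Option 2: keep the function in an object (no instance state gets serialized),
    // or copy the needed member into a local val before building the closure
    object Transforms {
      def triple(rdd: RDD[Int]): RDD[Int] = rdd.map(_ * 3)
    }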

Re: NullPointerException on reading checkpoint files

2014-09-23 Thread RodrigoB
Hi TD, This is actually an important requirement (recovery of shared variables) for us as we need to spread some referential data across the Spark nodes on application startup. I just bumped into this issue on Spark version 1.0.1. I assume the latest one also doesn't include this capability. Are t
