Spark Worker node accessing Hive metastore

2014-10-24 Thread ken
Does a Spark worker node need access to Hive's metastore if part of a job contains Hive queries? Thanks, Ken -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Worker-node-accessing-Hive-metastore-tp17255.html Sent from the Apache Spark User List mailing

How to set persistence level of graph in GraphX in spark 1.0.0

2014-10-24 Thread Arpit Kumar
Hi all, I am using the GraphLoader class to load graphs from edge list files. But then I need to change the storage level of the graph to something other than MEMORY_ONLY. val graph = GraphLoader.edgeListFile(sc, fname, minEdgePartitions = numEPart).persist(StorageLevel.MEMORY_AND_DISK_
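A minimal sketch of requesting the level at load time in Spark 1.1.0, where GraphLoader.edgeListFile gained storage-level parameters (the parameter names are an assumption about the 1.1.0 API; in 1.0.0 the levels are fixed at MEMORY_ONLY, hence this thread):

    import org.apache.spark.graphx.GraphLoader
    import org.apache.spark.storage.StorageLevel

    // Spark 1.1.0+ (assumed API): keep edges and vertices at MEMORY_AND_DISK from the start;
    // sc, fname and numEPart are the values from the message above
    val graph = GraphLoader.edgeListFile(sc, fname,
      minEdgePartitions = numEPart,
      edgeStorageLevel = StorageLevel.MEMORY_AND_DISK,
      vertexStorageLevel = StorageLevel.MEMORY_AND_DISK)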

Re: Workaround for SPARK-1931 not compiling

2014-10-24 Thread Arpit Kumar
Thanks a lot. Now it is working properly. On Sat, Oct 25, 2014 at 2:13 AM, Ankur Dave wrote: > At 2014-10-23 09:48:55 +0530, Arpit Kumar wrote: > > error: value partitionBy is not a member of > > org.apache.spark.rdd.RDD[(org.apache.spark.graphx.PartitionID, > > org.apache.spark.graphx.Edge[ED]

Re: Spark: Order by Failed, java.lang.NullPointerException

2014-10-24 Thread arthur.hk.c...@gmail.com
Hi, Added “l_linestatus” and it works, THANK YOU!! sqlContext.sql("select l_linestatus, l_orderkey, l_linenumber, l_partkey, l_quantity, l_shipdate, L_RETURNFLAG, L_LINESTATUS from lineitem order by L_LINESTATUS limit 10").collect().foreach(println); 14/10/25 07:03:24 INFO DAGScheduler: Stage 12 (

Re: Spark 1.1.0 and Hive 0.12.0 Compatibility Issue

2014-10-24 Thread arthur.hk.c...@gmail.com
Hi, My Steps: ### HIVE CREATE TABLE CUSTOMER ( C_CUSTKEY BIGINT, C_NAME VARCHAR(25), C_ADDRESS VARCHAR(40), C_NATIONKEY BIGINT, C_PHONE VARCHAR(15), C_ACCTBAL DECIMAL, C_MKTSEGMENT VARCHAR(10), C_COMMENT VARCHAR(117) ) row format serde 'com.bizo.hive.serde.csv.CSVSerde'; L

Re: Spark LIBLINEAR

2014-10-24 Thread DB Tsai
yeah, column normalization. for some of the datasets, without doing this, it will not converge. Sincerely, DB Tsai --- My Blog: https://www.dbtsai.com LinkedIn: https://www.linkedin.com/in/dbtsai On Fri, Oct 24, 2014 at 3:46 PM, Debasish D
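A minimal sketch of the column standardization being discussed, using MLlib's StandardScaler (a sketch only; `features` is a hypothetical RDD[Vector] of training features):

    import org.apache.spark.mllib.feature.StandardScaler
    import org.apache.spark.mllib.linalg.Vector
    import org.apache.spark.rdd.RDD

    // scale each column to zero mean and unit variance before handing the data to LBFGS;
    // note that withMean = true requires dense vectors
    def standardize(features: RDD[Vector]): RDD[Vector] = {
      val scaler = new StandardScaler(withMean = true, withStd = true).fit(features)
      features.map(v => scaler.transform(v))
    }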

Re: Spark LIBLINEAR

2014-10-24 Thread Debasish Das
You mean row/column normalization of the data? How much performance gain did you see from that? On Fri, Oct 24, 2014 at 3:14 PM, DB Tsai wrote: > oh, we just train the model in the standardized space which will help > the convergence of LBFGS. Then we convert the weights to original > space so the w

Re: Spark: Order by Failed, java.lang.NullPointerException

2014-10-24 Thread Michael Armbrust
Usually when the SparkContext throws an NPE it means that it has been shut down due to some earlier failure. On Wed, Oct 22, 2014 at 5:29 PM, arthur.hk.c...@gmail.com < arthur.hk.c...@gmail.com> wrote: > Hi, > > I got java.lang.NullPointerException. Please help! > > > sqlContext.sql("select l_ord

Re: Spark LIBLINEAR

2014-10-24 Thread DB Tsai
oh, we just train the model in the standardized space which will help the convergence of LBFGS. Then we convert the weights to original space so the whole thing is transparent to users. Sincerely, DB Tsai --- My Blog: https://www.dbtsai.com Link

Re: Saving very large data sets as Parquet on S3

2014-10-24 Thread Haoyuan Li
Daniel, Currently, having Tachyon will at least help on the input part in this case. Haoyuan On Fri, Oct 24, 2014 at 2:01 PM, Daniel Mahler wrote: > I am trying to convert some json logs to Parquet and save them on S3. > In principle this is just > > import org.apache.spark._ > val sqlContext

Re: Spark LIBLINEAR

2014-10-24 Thread Debasish Das
@dbtsai for the condition number, what did you use? Diagonal preconditioning of the inverse of the B matrix? But then the B matrix keeps on changing... did you condition it after every few iterations? Will it be possible to put that code in Breeze since it will be very useful to condition other solvers as well

Re: Spark LIBLINEAR

2014-10-24 Thread DB Tsai
We don't have SVMWithLBFGS, but you can check out how we implement LogisticRegressionWithLBFGS, and we also deal with some condition number improving stuff in LogisticRegressionWithLBFGS which improves the performance dramatically. Sincerely, DB Tsai --
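A minimal sketch of the approach being pointed to, training with LogisticRegressionWithLBFGS (assuming `training` is an RDD[LabeledPoint]; the optimizer settings are illustrative values, not recommendations):

    import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS
    import org.apache.spark.mllib.regression.LabeledPoint
    import org.apache.spark.rdd.RDD

    def trainModel(training: RDD[LabeledPoint]) = {
      val lr = new LogisticRegressionWithLBFGS()
      lr.optimizer.setRegParam(0.01).setConvergenceTol(1e-4)  // illustrative LBFGS settings
      lr.run(training)                                        // returns a LogisticRegressionModel
    }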

Re: docker spark 1.1.0 cluster

2014-10-24 Thread Nicholas Chammas
Oh snap--first I've heard of this repo. Marek, We are having a discussion related to this on SPARK-3821 you may be interested in. Nick On Fri, Oct 24, 2014 at 5:50 PM, Marek Wiewiorka wrote: > Hi, > here you can find some info regarding 1.0:

Re: docker spark 1.1.0 cluster

2014-10-24 Thread Marek Wiewiorka
Hi, here you can find some info regarding 1.0: https://github.com/amplab/docker-scripts Marek 2014-10-24 23:38 GMT+02:00 Josh J : > Hi, > > Is there a dockerfiles available which allow to setup a docker spark 1.1.0 > cluster? > > Thanks, > Josh >

docker spark 1.1.0 cluster

2014-10-24 Thread Josh J
Hi, Is there a dockerfile available which allows setting up a Docker Spark 1.1.0 cluster? Thanks, Josh

Re: Spark LIBLINEAR

2014-10-24 Thread k.tham
Oh, I've only seen SVMWithSGD, hadn't realized LBFGS was implemented. I'll try it out when I have time. Thanks! -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-LIBLINEAR-tp5546p17240.html Sent from the Apache Spark User List mailing list archive at Nab

Re: Spark using non-HDFS data on a distributed file system cluster

2014-10-24 Thread matan
Thanks Marcelo, Let me spin this towards a parallel trajectory then, as the title change implies. I think I will further read some of the articles at https://spark.apache.org/research.html but basically, I understand Spark keeps the data in-memory, and only pulls from hdfs, or at most writes the f

Re: Is SparkSQL + JDBC server a good approach for caching?

2014-10-24 Thread Michael Armbrust
This is very experimental and mostly unsupported, but you can start the JDBC server from within your own programs by passing it the HiveContext. On
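A hedged sketch of the embedded-server approach Michael describes; HiveThriftServer2.startWithContext is my assumption about the experimental entry point in the 1.1 thriftserver module, so verify it exists in your build:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.hive.HiveContext
    import org.apache.spark.sql.hive.thriftserver.HiveThriftServer2

    val sc = new SparkContext(new SparkConf().setAppName("embedded-jdbc"))
    val hiveContext = new HiveContext(sc)
    // register and cache tables on hiveContext here so JDBC clients can query them
    HiveThriftServer2.startWithContext(hiveContext)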

Re: Spark LIBLINEAR

2014-10-24 Thread Debasish Das
If the SVM is not already migrated to BFGS, that's the first thing you should try... Basically, following the LBFGS Logistic Regression, come up with an LBFGS-based linear SVM... About integrating TRON in mllib, David already has a version of TRON in breeze but someone needs to validate it for linear SVM an

Re: Spark LIBLINEAR

2014-10-24 Thread k.tham
Just wondering, any update on this? Is there a plan to integrate CJ's work with mllib? I'm asking since SVM impl in MLLib did not give us good results and we have to resort to training our svm classifier in a serial manner on the driver node with liblinear. Also, it looks like CJ Lin is coming to

Re: Is SparkSQL + JDBC server a good approach for caching?

2014-10-24 Thread ankits
Thanks for your response Michael. I'm still not clear on all the details - in particular, how do I take a temp table created from a SchemaRDD and allow it to be queried using the Thrift JDBC server? From the Hive guides, it looks like it only supports loading data from files, but I want to query t

Fwd: Saving very large data sets as Parquet on S3

2014-10-24 Thread Daniel Mahler
I am trying to convert some json logs to Parquet and save them on S3. In principle this is just import org.apache.spark._ val sqlContext = new sql.SQLContext(sc) val data = sqlContext.jsonFile("s3n://source/path/*/*", 10e-8) data.registerAsTable("data") data.saveAsParquetFile("s3n://target/path") Th

Re: spark is running extremely slow with larger data set, like 2G

2014-10-24 Thread Davies Liu
On Fri, Oct 24, 2014 at 1:37 PM, xuhongnever wrote: > Thank you very much. > Changing to groupByKey works, it runs much more faster. > > By the way, could you give me some explanation of the following > configurations, after reading the official explanation, i'm still confused, > what's the relati

Re: Workaround for SPARK-1931 not compiling

2014-10-24 Thread Ankur Dave
At 2014-10-23 09:48:55 +0530, Arpit Kumar wrote: > error: value partitionBy is not a member of > org.apache.spark.rdd.RDD[(org.apache.spark.graphx.PartitionID, > org.apache.spark.graphx.Edge[ED])] Since partitionBy is a member of PairRDDFunctions, it sounds like the implicit conversion from RDD
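A minimal sketch of the fix Ankur describes: bring the implicit RDD-to-PairRDDFunctions conversion into scope so partitionBy resolves (the helper function below is hypothetical):

    import scala.reflect.ClassTag
    import org.apache.spark.HashPartitioner
    import org.apache.spark.SparkContext._   // implicit conversion to PairRDDFunctions
    import org.apache.spark.graphx.{Edge, PartitionID}
    import org.apache.spark.rdd.RDD

    // with the implicits in scope, partitionBy is available on RDDs of (key, value) pairs
    def repartitionEdges[ED: ClassTag](edges: RDD[(PartitionID, Edge[ED])], numPartitions: Int) =
      edges.partitionBy(new HashPartitioner(numPartitions))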

Re: spark is running extremely slow with larger data set, like 2G

2014-10-24 Thread xuhongnever
Thank you very much. Changing to groupByKey works; it runs much faster. By the way, could you give me some explanation of the following configurations? After reading the official explanation, I'm still confused: what's the relationship between them? Is there any memory overlap between them?

Re: Function returning multiple Values - problem with using "if-else"

2014-10-24 Thread HARIPRIYA AYYALASOMAYAJULA
Thanks Sean! On Fri, Oct 24, 2014 at 3:04 PM, Sean Owen wrote: > This is just a Scala question really. Use ++ > > def inc(x:Int, y:Int) = { > if (condition) { > for(i <- 0 to 7) yield(x, y+i) > } else { > (for(k <- 0 to 24-y) yield(x, y+k)) ++ (for(j<- 0 to y-16) > yield(x+1,j)) >

Re: Spark MLLIB Decision Tree - ArrayIndexOutOfBounds Exception

2014-10-24 Thread Joseph Bradley
Hi Lokesh, Glad the update fixed the bug. maxBins is a parameter you can tune based on your data. Essentially, larger maxBins is potentially more accurate, but will run more slowly and use more memory. maxBins must be <= training set size; I would say try some small values (4, 8, 16). If there
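A minimal sketch of sweeping maxBins as Joseph suggests (assuming `data` is an RDD[LabeledPoint] for binary classification with no categorical features; the other tree parameters are placeholders):

    import org.apache.spark.mllib.tree.DecisionTree
    import org.apache.spark.mllib.regression.LabeledPoint
    import org.apache.spark.rdd.RDD

    // train one tree per candidate maxBins; larger values are potentially more
    // accurate but run more slowly and use more memory
    def sweepMaxBins(data: RDD[LabeledPoint]) =
      Seq(4, 8, 16).map { bins =>
        bins -> DecisionTree.trainClassifier(data, 2, Map[Int, Int](), "gini", 5, bins)
      }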

Re: Function returning multiple Values - problem with using "if-else"

2014-10-24 Thread Sean Owen
This is just a Scala question really. Use ++ def inc(x:Int, y:Int) = { if (condition) { for(i <- 0 to 7) yield(x, y+i) } else { (for(k <- 0 to 24-y) yield(x, y+k)) ++ (for(j<- 0 to y-16) yield(x+1,j)) } } On Fri, Oct 24, 2014 at 8:52 PM, HARIPRIYA AYYALASOMAYAJULA wrote: > Hello, >

Function returning multiple Values - problem with using "if-else"

2014-10-24 Thread HARIPRIYA AYYALASOMAYAJULA
Hello, My map function will call the following function (inc) which should yield multiple values: def inc(x:Int, y:Int) ={ if(condition) { for(i <- 0 to 7) yield(x, y+i) } else { for(k <- 0 to 24-y) yield(x, y+k) for(j<- 0 to y-16) yield(x+1,j) } } The "if" part work

Re: How to use FlumeInputDStream in spark cluster?

2014-10-24 Thread BigDataUser
I am running FlumeEventCount program in CDH 5.0.1 which has Spark 0.9.0. The program runs fine in local process as well as standalone cluster mode. However, the program fails in YARN mode. I see the following error: INFO scheduler.DAGScheduler: Stage 2 (runJob at NetworkInputTracker.scala:182) fini

Re: Is SparkSQL + JDBC server a good approach for caching?

2014-10-24 Thread Sadhan Sood
That works perfect. Thanks again Michael On Fri, Oct 24, 2014 at 3:10 PM, Michael Armbrust wrote: > It won't be transparent, but you can do so something like: > > CACHE TABLE newData AS SELECT * FROM allData WHERE date > "..." > > and then query newData. > > On Fri, Oct 24, 2014 at 12:06 PM, Sad

Re: Is SparkSQL + JDBC server a good approach for caching?

2014-10-24 Thread Michael Armbrust
It won't be transparent, but you can do something like: CACHE TABLE newData AS SELECT * FROM allData WHERE date > "..." and then query newData. On Fri, Oct 24, 2014 at 12:06 PM, Sadhan Sood wrote: > Is there a way to cache certain (or most latest) partitions of certain > tables ? > > On Fri

Re: Is SparkSQL + JDBC server a good approach for caching?

2014-10-24 Thread Sadhan Sood
Is there a way to cache certain (or the most recent) partitions of certain tables? On Fri, Oct 24, 2014 at 2:35 PM, Michael Armbrust wrote: > It does have support for caching using either CACHE TABLE or > CACHE TABLE AS SELECT > > On Fri, Oct 24, 2014 at 1:05 AM, ankits wrote: > >> I want t

Under which user is the program run on slaves?

2014-10-24 Thread jan.zikes
Hi, I would like to ask which user the Spark program runs as on the slaves. My Spark is running on top of YARN. The reason I am asking is that I need to download data for the NLTK library; these data are downloaded for a specific Python user and I am currently struggling with this.

Re: Job cancelled because SparkContext was shut down - failures!

2014-10-24 Thread Sadhan Sood
These seem like s3 connection errors for the table data. Wondering, since we don't see that many failures on hive. I also set the spark.task.maxFailures = 15. On Fri, Oct 24, 2014 at 12:15 PM, Sadhan Sood wrote: > Hi, > > Trying to run a query on spark-sql but it keeps failing with this error on

Re: PySpark problem with textblob from NLTK used in map

2014-10-24 Thread jan.zikes
Maybe I'll add one more question. I think that the problem is with the user, so I would like to ask which user Spark jobs run as on the slaves. __ Hi, I am trying to implement a function for text preprocessing in PySpark. I have amazon E

Re: [Spark SQL] Setting variables

2014-10-24 Thread Michael Armbrust
You might be hitting: https://issues.apache.org/jira/browse/SPARK-4037 On Fri, Oct 24, 2014 at 11:32 AM, Yana Kadiyska wrote: > Hi all, > > I'm trying to set a pool for a JDBC session. I'm connecting to the thrift > server via JDBC client. > > My installation appears to be good(queries run fine)

Re: Is SparkSQL + JDBC server a good approach for caching?

2014-10-24 Thread Michael Armbrust
It does have support for caching using either CACHE TABLE or CACHE TABLE AS SELECT On Fri, Oct 24, 2014 at 1:05 AM, ankits wrote: > I want to set up spark SQL to allow ad hoc querying over the last X days of > processed data, where the data is processed through spark. This would also > ha

[Spark SQL] Setting variables

2014-10-24 Thread Yana Kadiyska
Hi all, I'm trying to set a pool for a JDBC session. I'm connecting to the thrift server via a JDBC client. My installation appears to be good (queries run fine), I can see the pools in the UI, but any attempt to set a variable (I tried spark.sql.shuffle.partitions and spark.sql.thriftserver.schedul

Re: spark-submit memory too large

2014-10-24 Thread Sameer Farooqui
That does seem a bit odd. How many Executors are running under this Driver? Does the spark-submit process start out using ~60GB of memory right away or does it start out smaller and slowly build up to that high? If so, how long does it take to get that high? Also, which version of Spark are you u

Job cancelled because SparkContext was shut down - failures!

2014-10-24 Thread Sadhan Sood
Hi, Trying to run a query on spark-sql but it keeps failing with this error on the cli ( we are running spark-sql on a yarn cluster): org.apache.spark.SparkException: Job cancelled because SparkContext was shut down at org.apache.spark.scheduler.DAGScheduler$$anonfun$cleanUpAfterSchedulerStop$

Re: Measuring execution time

2014-10-24 Thread Reza Zadeh
The Spark UI has timing information. When running locally, it is at http://localhost:4040 Otherwise the URL to the UI is printed to the console when you start up the Spark shell or run a job. Reza On Fri, Oct 24, 2014 at 5:51 AM, shahab wrote: > Hi, > > I just wonder if there is any built-in f

Re: How can I set the IP a worker uses?

2014-10-24 Thread Theodore Si
I found this. So it seems that we should use -h or --host instead of -i and --ip. -i HOST, --ip IP Hostname to listen on (deprecated, please use --host or -h) -h HOST, --host HOST Hostname to listen on On 10/24/2014 3:35 PM, Akhil Das wrote: Try using the --ip parameter
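Putting the two messages together, a hedged example of binding the worker to the second address from the original question (the spark-1.1.0 path and the addresses are illustrative):

    # bind the worker to the Infiniband address and register with the master on the Ethernet address
    spark-1.1.0/bin/spark-class org.apache.spark.deploy.worker.Worker \
      --host 2.3.4.5 spark://1.2.3.4:7077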

PySpark problem with textblob from NLTK used in map

2014-10-24 Thread jan.zikes
Hi, I am trying to implement function for text preprocessing in PySpark. I have amazon EMR where I am installing Python dependencies from the bootstrap script. One of these dependencies is textblob "python -m textblob.download_corpora". Then I am trying to use it locally on all the machines wi

spark-submit memory too large

2014-10-24 Thread marylucy
I used standalone Spark and set spark.driver.memory=5g, but the spark-submit process uses 57g of memory. Is this normal? How can I decrease it?

Re: Problem packing spark-assembly jar

2014-10-24 Thread Yana Kadiyska
thanks -- that was it. I could swear this had worked for me before and indeed it's fixed this morning. On Fri, Oct 24, 2014 at 6:34 AM, Sean Owen wrote: > I imagine this is a side effect of the change that was just reverted, > related to publishing the effective pom? sounds related but I don't >

Re: Is SparkSQL + JDBC server a good approach for caching?

2014-10-24 Thread Aniket Bhatnagar
Just curious... Why would you not store the processed results in a regular relational database? Not sure what you meant by persisting the appropriate RDDs. Did you mean the output of your job will be RDDs? On 24 October 2014 13:35, ankits wrote: > I want to set up spark SQL to allow ad hoc querying over

Spark doesn't retry task while writing to HDFS

2014-10-24 Thread Aniket Bhatnagar
Hi all, I have written a job that reads data from HBASE and writes to HDFS (fairly simple). While running the job, I noticed that a few of the tasks failed with the following error. Quick googling on the error suggests that it's an unexplained and perhaps intermittent error. What I am curious to

Re: Memory requirement of using Spark

2014-10-24 Thread jian.t
Thanks Akhil. I searched DISK_AND_MEMORY_SER trying to figure out how it works, and I cannot find any documentation on that. Do you have a link for that? If what DISK_AND_MEMORY_SER does is reading and writing to the disk with some memory caching, does that mean the output will be written to disk

scala.collection.mutable.ArrayOps$ofRef$.length$extension since Spark 1.1.0

2014-10-24 Thread Marius Soutier
Hi, I’m running a job whose simple task is to find files that cannot be read (sometimes our gz files are corrupted). With 1.0.x, this worked perfectly. Since 1.1.0 however, I’m getting an exception: scala.collection.mutable.ArrayOps$ofRef$.length$extension(ArrayOps.scala:114) sc.wholeT

Measuring execution time

2014-10-24 Thread shahab
Hi, I just wonder if there is any built-in function to get the execution time for each of the jobs/tasks? In simple words, how can I find out how much time is spent on the loading/mapping/filtering/reducing part of a job? I can see printout in the logs but since there is no clear presentation of the
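Besides the Spark UI mentioned in the reply above, a minimal manual-timing sketch (the path and the operations inside the block are placeholders):

    // wall-clock timing helper; wrap the action (count/collect/save), since
    // transformations are lazy and only actions trigger execution
    def timed[T](label: String)(body: => T): T = {
      val start = System.nanoTime()
      val result = body
      println(s"$label took ${(System.nanoTime() - start) / 1e9} s")
      result
    }

    val n = timed("load + filter + count") {
      sc.textFile("hdfs:///some/placeholder/path").filter(_.nonEmpty).count()
    }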

Broadcast failure with variable size of ~ 500mb with "key already cancelled ?"

2014-10-24 Thread htailor
Hi All, I am relatively new to Spark and currently having trouble broadcasting large variables ~500mb in size. The broadcast fails with an error shown below and the memory usage on the hosts also blows up. Our hardware consists of 8 hosts (1 x 64gb (driver) and 7 x 32gb (workers)) and we a

Re: unable to make a custom class as a key in a pairrdd

2014-10-24 Thread Gerard Maas
There's an issue in the way case classes are handled in the REPL, and you won't be able to use a case class as a key. See: https://issues.apache.org/jira/browse/SPARK-2620 BTW, case classes already implement equals and hashCode, so there's no need to implement those again. Given that you already impl
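A minimal sketch of the usual workaround: compile the key class into an application jar (or a jar added to the shell) instead of declaring it inside the REPL session (the names below are hypothetical):

    import org.apache.spark.SparkContext._
    import org.apache.spark.rdd.RDD

    // defined in compiled code rather than in spark-shell, to avoid SPARK-2620;
    // equals and hashCode come from the case class itself
    case class UserKey(id: Long, region: String)

    def countByUser(events: RDD[(UserKey, Int)]): RDD[(UserKey, Int)] =
      events.reduceByKey(_ + _)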

Re: Problem packing spark-assembly jar

2014-10-24 Thread Sean Owen
I imagine this is a side effect of the change that was just reverted, related to publishing the effective pom? sounds related but I don't know. On Fri, Oct 24, 2014 at 2:22 AM, Yana Kadiyska wrote: > Hi folks, > > I'm trying to deploy the latest from master branch and having some trouble > with t

Re: unable to make a custom class as a key in a pairrdd

2014-10-24 Thread Jaonary Rabarisoa
In the documentation it's said that we need to override the hashCode and equals methods. Without overriding them it doesn't work either. I get this error in the REPL and in a standalone application. On Fri, Oct 24, 2014 at 3:29 AM, Prashant Sharma wrote: > Are you doing this in REPL ? Then there is a bug filed fo

Is SparkSQL + JDBC server a good approach for caching?

2014-10-24 Thread ankits
I want to set up spark SQL to allow ad hoc querying over the last X days of processed data, where the data is processed through spark. This would also have to cache data (in memory only), so the approach I was thinking of was to build a layer that persists the appropriate RDDs and stores them in me

Re: Spark MLLIB Decision Tree - ArrayIndexOutOfBounds Exception

2014-10-24 Thread lokeshkumar
Hi Joseph, Thanks for the help. I have tried this DecisionTree example with the latest spark code and it is working fine now. But how do we choose the maxBins for this model? Thanks Lokesh -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-MLLIB-Decisi

Re: How can I set the IP a worker uses?

2014-10-24 Thread Akhil Das
Try using the --ip parameter while starting the worker. like: spark-1.0.1/bin/spark-class org.apache.spark.deploy.worker.Worker --ip 1.2.3.4 spark://1.2.3.4:7077 Thanks Best Regards On Fri, Oct 24, 2014 at 12

Re: spark is running extremely slow with larger data set, like 2G

2014-10-24 Thread Davies Liu
On Thu, Oct 23, 2014 at 3:14 PM, xuhongnever wrote: > my code is here: > > from pyspark import SparkConf, SparkContext > > def Undirect(edge): > vector = edge.strip().split('\t') > if(vector[0].isdigit()): > return [(vector[0], vector[1])] > return [] > > > conf = SparkConf() >

Re: Memory requirement of using Spark

2014-10-24 Thread Akhil Das
You can use spark-sql to solve this usecase, and you don't need to have 800G of memory (but of course if you are caching the whole data into memory, then you would need it.). You can persist the data by setting DISK_AND_MEMORY_SER property if you don't want to bring whole data into memory, in this
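For reference, the constant in Spark's StorageLevel API is named MEMORY_AND_DISK_SER; a minimal persist sketch (`rows` is a hypothetical RDD or SchemaRDD):

    import org.apache.spark.storage.StorageLevel

    // keep partitions serialized in memory and spill whatever does not fit to disk
    rows.persist(StorageLevel.MEMORY_AND_DISK_SER)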

Re: spark is running extremely slow with larger data set, like 2G

2014-10-24 Thread Akhil Das
Try providing the level of parallelism parameter to your reduceByKey operation. Thanks Best Regards On Fri, Oct 24, 2014 at 3:44 AM, xuhongnever wrote: > my code is here: > > from pyspark import SparkConf, SparkContext > > def Undirect(edge): > vector = edge.strip().split('\t') > if(vec
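A minimal sketch of passing the parallelism explicitly (Scala shown; PySpark's reduceByKey takes a numPartitions argument in the same position; `pairs` and the value 200 are placeholders):

    import org.apache.spark.SparkContext._

    // the second argument sets the number of reduce-side partitions
    val counts = pairs.reduceByKey((a: Int, b: Int) => a + b, 200)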

Re: NoClassDefFoundError on ThreadFactoryBuilder in Intellij

2014-10-24 Thread Akhil Das
Make sure the guava jar is present in the classpath. Thanks Best Regards On Thu, Oct 23, 2014 at 2:13 PM, Stephen Boesch wrote: > After having checked out from master/head the following error occurs when > attempting to run any tes

How can I set the IP a worker uses?

2014-10-24 Thread Theodore Si
Hi all, I have two network interface cards on one node: one is an Ethernet card, the other an Infiniband HCA. The master has two IP addresses, let's say 1.2.3.4 (for the Ethernet card) and 2.3.4.5 (for the HCA). I can start the master by export SPARK_MASTER_IP='1.2.3.4';sbin/start-master.sh to let master

Re: how to run a dev spark project without fully rebuilding the fat jar ?

2014-10-24 Thread Akhil Das
You can use the --jars option to submit multiple jars with spark-submit, so you can simply rebuild only the jar that you have modified. Thanks Best Regards On Thu, Oct 23, 2014 at 11:16 AM, Mohit Jaggi wrote: > i think you can give a list of jars - not just one - to spark-submit, so > build only
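A hedged example of that submission form (the class name, master URL and jar paths are placeholders):

    # rebuild only the application jar; ship the unchanged dependency jars via --jars
    spark-submit --class com.example.Main \
      --master spark://master:7077 \
      --jars lib/dep-a.jar,lib/dep-b.jar \
      target/my-app_2.10-0.1.jar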

Re: hive timestamp column always returns null

2014-10-24 Thread Akhil Das
Try doing a *cat -v your_data | head -n3* and make sure you don't have any ^M at the end of the lines. Also, your 2nd and 3rd rows don't contain any spaces in the data. Thanks Best Regards On Thu, Oct 23, 2014 at 9:23 AM, tridib wrote: > Hello Experts, > I created a table using spark-sql CLI. No Hi

Re: Spark: Order by Failed, java.lang.NullPointerException

2014-10-24 Thread Akhil Das
Not sure if this would help, but make sure you have the column l_linestatus in the data. Thanks Best Regards On Thu, Oct 23, 2014 at 5:59 AM, arthur.hk.c...@gmail.com < arthur.hk.c...@gmail.com> wrote: > Hi, > > I got java.lang.NullPointerException. Please help! > > > sqlContext.sql("selec