Re: Convert RDD[Map[String, Any]] to SchemaRDD

2014-12-06 Thread Jianshi Huang
Hmm.. I've created a JIRA: https://issues.apache.org/jira/browse/SPARK-4782 Jianshi On Sun, Dec 7, 2014 at 2:32 PM, Jianshi Huang wrote: > Hi, > > What's the best way to convert RDD[Map[String, Any]] to a SchemaRDD? > > I'm currently converting each Map to a JSON String and do > JsonRDD.inferS

Re: run JavaAPISuite with maven

2014-12-06 Thread Michael Armbrust
Not sure about maven, but you can run that test with sbt: sbt/sbt "sql/test-only org.apache.spark.sql.api.java.JavaAPISuite" On Sat, Dec 6, 2014 at 9:59 PM, Ted Yu wrote: > I tried to run tests for core but there were failures. e.g.: > > ExternalAppendOnlyMapSuite: > - simple

Convert RDD[Map[String, Any]] to SchemaRDD

2014-12-06 Thread Jianshi Huang
Hi, What's the best way to convert RDD[Map[String, Any]] to a SchemaRDD? I'm currently converting each Map to a JSON String and do JsonRDD.inferSchema. How about adding inferSchema support to Map[String, Any] directly? It would be very useful. Thanks, -- Jianshi Huang LinkedIn: jianshi Twitte
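
A minimal sketch of the JSON detour described above, assuming flat maps of strings and numbers (the toJson helper and the sample data are made up for illustration; sqlContext.jsonRDD is the 1.1/1.2-era API for inferring a schema from an RDD of JSON strings):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.rdd.RDD
    import org.apache.spark.sql.SQLContext

    object MapToSchemaRDD {
      // Hypothetical helper: serialize a flat Map to a JSON string.
      // Only strings and numbers are handled, purely for illustration.
      def toJson(m: Map[String, Any]): String =
        m.map {
          case (k, v: String) => "\"" + k + "\":\"" + v + "\""
          case (k, v)         => "\"" + k + "\":" + v
        }.mkString("{", ",", "}")

      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("map-to-schemardd"))
        val sqlContext = new SQLContext(sc)

        val maps: RDD[Map[String, Any]] = sc.parallelize(Seq(
          Map("name" -> "alice", "age" -> 30),
          Map("name" -> "bob", "age" -> 25)))

        // The workaround described in the thread: go through JSON and let jsonRDD infer the schema.
        val schemaRDD = sqlContext.jsonRDD(maps.map(toJson))
        schemaRDD.printSchema()
        schemaRDD.registerTempTable("people")
      }
    }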

Re: run JavaAPISuite with maven

2014-12-06 Thread Ted Yu
I tried to run tests for core but there were failures. e.g.: ExternalAppendOnlyMapSuite: - simple insert - insert with collision - ordering - null keys and values - simple aggregator - simple cogroup Spark assembly has b

Re: run JavaAPISuite with maven

2014-12-06 Thread Koert Kuipers
Ted, I mean core/src/test/java/org/apache/spark/JavaAPISuite.java On Sat, Dec 6, 2014 at 9:27 PM, Ted Yu wrote: > Pardon me, the test is here: > > sql/core/src/test/java/org/apache/spark/sql/api/java/JavaAPISuite.java > > You can run 'mvn test' under sql/core > > Cheers > > On Sat, Dec 6, 2014 a

Re: Hive Problem in Pig generated Parquet file schema in CREATE EXTERNAL TABLE (e.g. bag::col1)

2014-12-06 Thread Jianshi Huang
Ok, found another possible bug in Hive. My current solution is to use ALTER TABLE CHANGE to rename the column names. The problem is that after renaming the columns, their values all became NULL. Before renaming: scala> sql("select `sorted::cre_ts` from pmt limit 1").collect res12: Ar
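
For reference, a hedged sketch of the rename workaround mentioned above, expressed through the Scala SQL API (table and column names are taken from the quoted query; an existing SparkContext sc is assumed). This is the step that, per the message, leaves the renamed column's values as NULL:

    import org.apache.spark.sql.hive.HiveContext

    val hiveContext = new HiveContext(sc)
    // Rename the Pig-generated column to drop the "sorted::" prefix.
    hiveContext.sql("ALTER TABLE pmt CHANGE `sorted::cre_ts` cre_ts string")
    // After the rename, the author reports this now returns only NULLs.
    hiveContext.sql("SELECT cre_ts FROM pmt LIMIT 1").collect()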

"vcores used" in cluster metrics(yarn resource manager ui) when running spark on yarn

2014-12-06 Thread yuemeng1
Hi all, when I run an app with this command: ./bin/spark-sql --master yarn-client --num-executors 2 --executor-cores 3, I noticed that the YARN resource manager UI shows `vcores used` in cluster metrics as 3. It seems `vcores used` shows the wrong number (should it be 7?). Or am I missing something? Tha
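
For reference, the expectation in the question works out as below; the extra core is the application master container in yarn-client mode (a sketch, numbers taken from the command line above):

    val numExecutors      = 2
    val coresPerExecutor  = 3
    val applicationMaster = 1   // the AM container in yarn-client mode
    val expectedVcores    = numExecutors * coresPerExecutor + applicationMaster  // = 7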

Recovered executor num in yarn-client mode

2014-12-06 Thread yuemeng1
Hi all, I have a (maybe clumsy) question about the recovered executor number in yarn-client mode. My situation is as follows: we have a 1 (resource manager) + 3 (node manager) cluster, an app is running with one driver on the resource manager and 12 executors on all the node managers, and there are

Re: run JavaAPISuite with maven

2014-12-06 Thread Ted Yu
Pardon me, the test is here: sql/core/src/test/java/org/apache/spark/sql/api/java/JavaAPISuite.java You can run 'mvn test' under sql/core Cheers On Sat, Dec 6, 2014 at 5:55 PM, Ted Yu wrote: > In master branch, I only found JavaAPISuite in comment: > > spark tyu$ find . -name '*.scala' -exec

Re: java.lang.ExceptionInInitializerError/Unable to load YARN support

2014-12-06 Thread maven
I noticed that when I unset HADOOP_CONF_DIR, I'm able to work in the local mode without any errors. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/java-lang-ExceptionInInitializerError-Unable-to-load-YARN-support-tp20560p20561.html Sent from the Apache Spa

Re: Including data nucleus tools

2014-12-06 Thread spark.dubovsky.jakub
Next try. I copied the whole dist directory created by the make-distribution script to the cluster, not just the assembly jar. Then I used ./bin/spark-submit --num-executors 200 --master yarn-cluster --class org.apache.spark.mllib.CreateGuidDomainDictionary ../spark/root-0.1.jar ${args} ...to run the app again. St

java.lang.ExceptionInInitializerError/Unable to load YARN support

2014-12-06 Thread maven
All, I just built Spark-1.2 on my enterprise server (which has Hadoop 2.3 with YARN). Here're the steps I followed for the build: $ mvn -Pyarn -Phadoop-2.3 -Dhadoop.version=2.3.0 -DskipTests clean package $ export SPARK_HOME=/path/to/spark/folder $ export HADOOP_CONF_DIR=/etc/hadoop/conf However

Re: run JavaAPISuite with maven

2014-12-06 Thread Ted Yu
In master branch, I only found JavaAPISuite in comment: spark tyu$ find . -name '*.scala' -exec grep JavaAPISuite {} \; -print * For usage example, see test case JavaAPISuite.testJavaJdbcRDD. * converted into a `Object` array. For usage example, see test case JavaAPISuite.testJavaJdbcRDD. ./

run JavaAPISuite with maven

2014-12-06 Thread Koert Kuipers
when i run "mvn test -pl core", i dont see JavaAPISuite being run. or if it is, its being very very quiet about it. is this by design?

Re: Spark on YARN memory utilization

2014-12-06 Thread Denny Lee
Got it - thanks! On Sat, Dec 6, 2014 at 14:56 Arun Ahuja wrote: > Hi Denny, > > This is due the spark.yarn.memoryOverhead parameter, depending on what > version of Spark you are on the default of this may differ, but it should > be the larger of 1024mb per executor or .07 * executorMemory. > > Wh

Re: Is there a way to force spark to use specific ips?

2014-12-06 Thread Matt Narrell
It's much easier if you access your nodes by name. If you’re using Vagrant, use the hosts provisioner: https://github.com/adrienthebo/vagrant-hosts mn > On Dec 6, 2014, at 8:37 AM, Ashic Mahtab wrote: > > Hi, > It appears that spark is always at

Re: Spark on YARN memory utilization

2014-12-06 Thread Arun Ahuja
Hi Denny, This is due to the spark.yarn.memoryOverhead parameter; depending on what version of Spark you are on the default of this may differ, but it should be the larger of 1024mb per executor or .07 * executorMemory. When you set executor memory, the yarn resource request is executorMemory + yarn
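
A back-of-the-envelope sketch of the sizing rule described here (the constants follow this message; the actual defaults vary across Spark versions, so treat them as assumptions):

    // Per-executor container size as YARN sees it.
    val executorMemoryMb = 4 * 1024                                         // e.g. --executor-memory 4g
    val overheadMb       = math.max(1024, (0.07 * executorMemoryMb).toInt)  // spark.yarn.memoryOverhead
    val containerMb      = executorMemoryMb + overheadMb                    // what YARN reserves per executor
    // YARN additionally rounds each container up to a multiple of
    // yarn.scheduler.minimum-allocation-mb, so the total shown in the RM UI
    // ends up above executor-memory x num-executors.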

Re: Modifying an RDD in forEach

2014-12-06 Thread Mohit Jaggi
“ideal for iterative workloads” is a comparison to Hadoop map-reduce. If you are happy with a single machine, by all means, do that. Scaling out may be useful when: 1) you want to finish the task faster by using more machines. This may not involve any additional cost if you are using utility com

RE: Modifying an RDD in forEach

2014-12-06 Thread Ron Ayoub
These are very interesting comments. The vast majority of cases I'm working on are going to be in the 3 million range and 100 million was thrown out as something to shoot for. I upped it to 500 million. But all things considered, I believe I may be able to directly translate what I have to Java

Spark on YARN memory utilization

2014-12-06 Thread Denny Lee
This is perhaps more of a YARN question than a Spark question, but I was just curious how memory is allocated in YARN via the various configurations. For example, if I spin up my cluster with 4GB with a different number of executors as noted below 4GB executor-memory x 10 executors = 46GB (4G

Re: Modifying an RDD in forEach

2014-12-06 Thread Mohit Jaggi
Ron, “appears to be working” might be true when there are no failures. On large datasets being processed on a large number of machines, failures of several types (server, network, disk etc) can happen. At that time, Spark will not “know” that you changed the RDD in-place and will use any version
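
A minimal sketch of the functional alternative being recommended: derive a new RDD each iteration instead of mutating records inside foreach, so a lost partition can always be recomputed from the lineage (Point, nearestCenter and assign are hypothetical names for illustration):

    import org.apache.spark.rdd.RDD

    case class Point(features: Array[Double], cluster: Int)

    def nearestCenter(p: Point, centers: Array[Array[Double]]): Int = {
      def sqDist(a: Array[Double], b: Array[Double]) =
        a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum
      centers.indices.minBy(i => sqDist(p.features, centers(i)))
    }

    // map produces a new, recomputable RDD; nothing is modified in place.
    def assign(points: RDD[Point], centers: Array[Array[Double]]): RDD[Point] =
      points.map(p => p.copy(cluster = nearestCenter(p, centers)))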

Re: Including data nucleus tools

2014-12-06 Thread Michael Armbrust
On Sat, Dec 6, 2014 at 5:53 AM, wrote: > > Bonus question: Should the class > org.datanucleus.api.jdo.JDOPersistenceManagerFactory be part of assembly? > Because it is not in jar now. > No, these jars cannot be put into the assembly because they have extra metadata files that live in the same loca

RE: Adding Spark Cassandra dependency breaks Spark Streaming?

2014-12-06 Thread Ashic Mahtab
Hi, just checked: cassandra connector 1.1.0-beta1 runs fine. The issue seems to be 1.1.0 for spark streaming and 1.1.0 cassandra connector (final). Regards, Ashic. Date: Sat, 6 Dec 2014 13:52:20 -0500 Subject: Re: Adding Spark Cassandra dependency breaks Spark Streaming? From: jayunit100.apa...@

Re: Where can you get nightly builds?

2014-12-06 Thread Nicholas Chammas
To expand on Ted's response, there are currently no nightly builds published for users to use. You can watch SPARK-1517 (which Ted linked to) to be updated when that happens. On Sat Dec 06 2014 at 10:19:10 AM Ted Yu wrote: > See https://amplab.cs.berkeley.edu/jenkins/view/Spark/ > > See also htt

RE: Adding Spark Cassandra dependency breaks Spark Streaming?

2014-12-06 Thread Ashic Mahtab
Update: It seems the following combo causes things in spark streaming to go missing: spark-core 1.1.0, spark-streaming 1.1.0, spark-cassandra-connector 1.1.0. The moment I add the three together, things like StreamingContext and Seconds are unavailable. sbt assembly fails saying those aren't there. Sbt
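
For reference, a hypothetical build.sbt fragment reproducing the combination listed above (the group and artifact ids are the usual ones for these libraries, stated here as an assumption rather than copied from the thread):

    libraryDependencies ++= Seq(
      "org.apache.spark"   %% "spark-core"                % "1.1.0",
      "org.apache.spark"   %% "spark-streaming"           % "1.1.0",
      "com.datastax.spark" %% "spark-cassandra-connector" % "1.1.0"
    )
    // Per the follow-up message, dropping the connector back to 1.1.0-beta1 avoids the problem.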

Re: Running two different Spark jobs vs multi-threading RDDs

2014-12-06 Thread Aaron Davidson
You can actually submit multiple jobs to a single SparkContext in different threads. In the case you mentioned with 2 stages having a common parent, both will wait for the parent stage to complete and then the two will execute in parallel, sharing the cluster resources. Solutions that submit multi
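
A small sketch of that pattern, assuming a shared, cached parent RDD and two actions submitted from separate threads via Futures (the input path is a placeholder):

    import org.apache.spark.{SparkConf, SparkContext}
    import scala.concurrent.{Await, Future}
    import scala.concurrent.ExecutionContext.Implicits.global
    import scala.concurrent.duration.Duration

    val sc = new SparkContext(new SparkConf().setAppName("concurrent-jobs"))
    val parent = sc.textFile("hdfs:///some/input").cache()  // common parent, computed once

    // Two jobs submitted concurrently against the same SparkContext share cluster resources.
    val jobA = Future { parent.filter(_.contains("ERROR")).count() }
    val jobB = Future { parent.map(_.length.toLong).reduce(_ + _) }

    val errorCount = Await.result(jobA, Duration.Inf)
    val totalChars = Await.result(jobB, Duration.Inf)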

Re: SQL query in scala API

2014-12-06 Thread Arun Luthra
Thanks, I will try this. On Fri, Dec 5, 2014 at 1:19 AM, Cheng Lian wrote: > Oh, sorry. So neither SQL nor Spark SQL is preferred. Then you may write > you own aggregation with aggregateByKey: > > users.aggregateByKey((0, Set.empty[String]))({ case ((count, seen), user) => > (count + 1, seen
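
The quoted snippet is cut off by the archive; one possible completion, stated as an assumption about the intent (a per-key count plus the distinct users seen for that key), is:

    import org.apache.spark.SparkContext._   // pair-RDD functions in Spark 1.x
    import org.apache.spark.rdd.RDD

    def countAndDistinct(users: RDD[(String, String)]): RDD[(String, (Int, Set[String]))] =
      users.aggregateByKey((0, Set.empty[String]))(
        { case ((count, seen), user) => (count + 1, seen + user) },  // fold one value into the accumulator
        { case ((c1, s1), (c2, s2)) => (c1 + c2, s1 ++ s2) }         // merge two partial accumulators
      )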

Re: Hive Problem in Pig generated Parquet file schema in CREATE EXTERNAL TABLE (e.g. bag::col1)

2014-12-06 Thread Jianshi Huang
Hmm... another issue I found doing this approach is that ANALYZE TABLE ... COMPUTE STATISTICS will fail to attach the metadata to the table, and later broadcast join and such will fail... Any idea how to fix this issue? Jianshi On Sat, Dec 6, 2014 at 9:10 PM, Jianshi Huang wrote: > Very intere

Re: Running two different Spark jobs vs multi-threading RDDs

2014-12-06 Thread Corey Nolet
Reading the documentation a little more closely, I'm using the wrong terminology. I'm using stages to refer to what spark is calling a job. I guess application (more than one spark context) is what I'm asking about On Dec 5, 2014 5:19 PM, "Corey Nolet" wrote: > I've read in the documentation that

Is there a way to force spark to use specific ips?

2014-12-06 Thread Ashic Mahtab
Hi, it appears that spark is always attempting to use the driver's hostname to connect / broadcast. This is usually fine, except when the cluster doesn't have DNS configured. For example, in a vagrant cluster with a private network. The workers and masters, and the host (where the driver runs fro
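
One commonly used way to pin Spark to explicit addresses when DNS is unavailable (not quoted from this thread's replies, so treat it as an assumption) is to export SPARK_LOCAL_IP on each machine and fix the driver's advertised address, roughly:

    import org.apache.spark.{SparkConf, SparkContext}

    // On each worker/master: export SPARK_LOCAL_IP=<that machine's private IP> before starting it.
    // In the driver, advertise an IP the workers can reach (addresses below are placeholders):
    val conf = new SparkConf()
      .setAppName("fixed-ip-app")
      .setMaster("spark://192.168.50.10:7077")
      .set("spark.driver.host", "192.168.50.1")
    val sc = new SparkContext(conf)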

Re: Where can you get nightly builds?

2014-12-06 Thread Ted Yu
See https://amplab.cs.berkeley.edu/jenkins/view/Spark/ See also https://issues.apache.org/jira/browse/SPARK-1517 Cheers On Sat, Dec 6, 2014 at 6:41 AM, Simone Franzini wrote: > I recently read in the mailing list that there are now nightly builds > available. However, I can't find them anywher

Where can you get nightly builds?

2014-12-06 Thread Simone Franzini
I recently read in the mailing list that there are now nightly builds available. However, I can't find them anywhere. Is this really done? If so, where can I get them? Thanks, Simone Franzini, PhD http://www.linkedin.com/in/simonefranzini

Re: cartesian on pyspark not paralleised

2014-12-06 Thread Akhil Das
You could try increasing the level of parallelism (spark.default.parallelism) while creating the SparkContext Thanks Best Regards On Fri, Dec 5, 2014 at 6:37 PM, Antony Mayi wrote: > Hi, > > using pyspark 1.1.0 on YARN 2.5.0. all operations run nicely in parallel - > I can see multiple python
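
The thread is about PySpark, but the property itself is the same in either API; a minimal Scala sketch of setting it when the context is created (the value 200 is an arbitrary placeholder):

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("cartesian-parallelism")
      .set("spark.default.parallelism", "200")  // raise the default partition count
    val sc = new SparkContext(conf)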

PySpark Loading Json Following by groupByKey seems broken in spark 1.1.1

2014-12-06 Thread Brad Willard
When I run a groupByKey it seems to create a single task after the groupByKey that never stops executing. I'm loading a smallish json dataset that is 4 million. This is the code I'm running. rdd = sql_context.jsonFile(uri) rdd = rdd.cache() grouped = rdd.map(lambda row: (row.id, row)).groupByKey

Re: Including data nucleus tools

2014-12-06 Thread spark.dubovsky.jakub
Hi again, I have tried to recompile and run this again with new assembly created by ./make-distribution.sh -Phadoop-2.3 -Dhadoop.version=2.3.0-cdh5.1.3 -Pyarn -Phive -DskipTests It results in exactly the same error. Any other hints? Bonus question: Should the class org.datanucleus.api.jdo.JDOP

Re: Hive Problem in Pig generated Parquet file schema in CREATE EXTERNAL TABLE (e.g. bag::col1)

2014-12-06 Thread Jianshi Huang
Very interesting, the line doing drop table will throw an exception. After removing it all works. Jianshi On Sat, Dec 6, 2014 at 9:11 AM, Jianshi Huang wrote: > Here's the solution I got after talking with Liancheng: > > 1) using backquote `..` to wrap up all illegal characters > > val rdd

Re: Modifying an RDD in forEach

2014-12-06 Thread Mayur Rustagi
You'll benefit by viewing Matei's talk at Yahoo on Spark internals and how it optimizes execution of iterative jobs. The simple answer is 1. Spark doesn't materialize the RDD when you do an iteration but lazily captures the transformation functions in the RDD (only function and closure, no data operation actu

RE: Java RDD Union

2014-12-06 Thread Ron Ayoub
Hierarchical K-means requires a massive number of iterations whereas flat K-means does not, but I've found flat to be generally useless since in most UIs it is nice to be able to drill down into more and more specific clusters. If you have 100 million documents and your branching factor is 8 (8-sec

Modifying an RDD in forEach

2014-12-06 Thread Ron Ayoub
This is from a separate thread with a differently named title. Why can't you modify the actual contents of an RDD using forEach? It appears to be working for me. What I'm doing is changing cluster assignments and distances per data item for each iteration of the clustering algorithm. The cluster

Re: Java RDD Union

2014-12-06 Thread Sean Owen
I guess a major problem with this is that you lose fault tolerance. You have no way of recreating the local state of the mutable RDD if a partition is lost. Why would you need thousands of RDDs for kmeans? It's a few per iteration. An RDD is more bookkeeping than data structure, itself. They don'

RE: Java RDD Union

2014-12-06 Thread Ron Ayoub
With that said, and the nature of iterative algorithms that Spark is advertised for, isn't this a bit of an unnecessary restriction since I don't see where the problem is. For instance, it is clear that when aggregating you need operations to be associative because of the way they are divided an
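
A tiny illustration of why aggregation operations need to be associative: Spark reduces each partition independently and then combines the partial results, so the grouping of operations is not fixed (sketch, assuming an existing SparkContext sc):

    val nums = sc.parallelize(1 to 10, numSlices = 3)
    val ok  = nums.reduce(_ + _)  // addition is associative: same answer regardless of partitioning
    val bad = nums.reduce(_ - _)  // subtraction is not: the result depends on how partitions combine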

Re: spark-submit on YARN is slow

2014-12-06 Thread Sandy Ryza
Great to hear! -Sandy On Fri, Dec 5, 2014 at 11:17 PM, Denny Lee wrote: > Okay, my bad for not testing out the documented arguments - once I use the > correct ones, the query completes in ~55s (I can probably make it > faster). Thanks for the help, eh?! > > > > On Fri Dec 05 2014 at 1