Re: Spark SQL 1.0.1 error on reading fixed length byte array

2014-08-03 Thread Pei-Lun Lee
Hi, We have a PR to support fixed length byte array in parquet file. https://github.com/apache/spark/pull/1737 Can someone help verify it? Thanks. 2014-07-15 19:23 GMT+08:00 Pei-Lun Lee : > Sorry, should be SPARK-2489 > > > 2014-07-15 19:22 GMT+08:00 Pei-Lun Lee : > > Filed SPARK-2446 >> >

Re: Timing the codes in GraphX

2014-08-03 Thread Larry Xiao
Hi Deep, I think you can refer to GraphLoader.scala. Use Logging: val startTime = System.currentTimeMillis logInfo("It took %d ms to load the edges".format(System.currentTimeMillis - startTime)) Larry On 8/4/14, 12:37 PM, Deep Pradhan wrote: Is there any way to time the execution of GraphX
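Larry's snippet is Scala inside a class that mixes in Spark's Logging trait. The same wall-clock bracketing pattern in plain stdlib Python, with a placeholder workload standing in for the GraphX job (no Spark code from the thread is reproduced here), would be:

```python
import time

def load_edges():
    # Placeholder workload; in a real job this would be the GraphX call
    # being timed (e.g. loading an edge list), which is not shown here.
    return sum(range(1_000_000))

start = time.time()
result = load_edges()
elapsed_ms = int((time.time() - start) * 1000)
print("It took %d ms to load the edges" % elapsed_ms)
```

The same bracketing works around any driver-side action; for per-stage timings the web UI is usually the more precise source.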

Re: SQLCtx cacheTable

2014-08-03 Thread Gurvinder Singh
On 08/03/2014 02:33 AM, Michael Armbrust wrote: > I am not a mesos expert... but it sounds like there is some mismatch > between the size that mesos is giving you and the maximum heap size of > the executors (-Xmx). > It seems that mesos is giving the correct size to java process. It has exact siz

NoClassDefFoundError: org/codehaus/jackson/annotate/JsonClass with spark-submit

2014-08-03 Thread Ryan Braley
Hi Folks,   I have an assembly jar that I am submitting using spark-submit script on a cluster I created with the spark-ec2 script. I keep running into the java.lang.NoClassDefFoundError: org/codehaus/jackson/annotate/JsonClass error on my workers even though jar tf clearly shows that class being

Re: MLLib: implementing ALS with distributed matrix

2014-08-03 Thread Xiangrui Meng
To be precise, the optimization is not `get all products that are related to this user` but `get all products that are related to users inside this block`. So a product factor won't be sent to the same block more than once. We considered using GraphX to implement ALS, which is much easier to unders

Timing the codes in GraphX

2014-08-03 Thread Deep Pradhan
Is there any way to time the execution of GraphX codes? Thank You

Re: Low Level Kafka Consumer for Spark

2014-08-03 Thread Patrick Wendell
I'll let TD chime in on this one, but I'm guessing this would be a welcome addition. It's great to see community effort on adding new streams/receivers, adding a Java API for receivers was something we did specifically to allow this :) - Patrick On Sat, Aug 2, 2014 at 10:09 AM, Dibyendu Bhattach

MLLib: implementing ALS with distributed matrix

2014-08-03 Thread Wei Tan
Hi, I wrote my centralized ALS implementation, and read the distributed implementation in MLlib. It uses InLink and OutLink to implement functions like "get all products which are related to this user", and ultimately achieves model distribution. If we have a distributed matrix lib, the c
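For context, the objective both the centralized and the MLlib implementation minimize is the usual regularized least-squares one (standard ALS notation, not taken from the thread): with ratings $r_{ui}$ observed on a set $\Omega$, user factors $x_u$, and product factors $y_i$,

```latex
\min_{X,Y} \sum_{(u,i)\in\Omega} \bigl(r_{ui} - x_u^{\top} y_i\bigr)^2
  + \lambda \Bigl(\sum_u \lVert x_u \rVert^2 + \sum_i \lVert y_i \rVert^2\Bigr)
```

ALS alternates closed-form solves: with $Y$ fixed, each $x_u$ depends only on the product factors for items that user rated, which is exactly why the InLink/OutLink routing mentioned in the thread only needs to ship each product factor to a user block once.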

Re: Tasks fail when ran in cluster but they work fine when submited using local local

2014-08-03 Thread salemi
Let me share the solution to this problem. I had to set spark.httpBroadcast.uri to the FQDN of the driver. Ali -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Tasks-fail-when-ran-in-cluster-but-they-work-fine-when-submited-using-local-local-tp11167p1
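A sketch of how that setting might be pinned in conf/spark-defaults.conf. The host and port below are placeholders, not values from the thread, and the property can also be set programmatically on SparkConf:

```properties
# Hypothetical example: replace with the driver's real FQDN and the port
# the HTTP broadcast server binds to.
spark.httpBroadcast.uri    http://driver.example.com:33625
```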

Kafka and Spark application after polling twice.

2014-08-03 Thread salemi
Hi All, My application works when I use spark-submit with master=local[*]. But if I deploy the application to a standalone cluster master=spark://master:7077 then the application polls twice from the kafka topic and then it stops working. I don't get any error logs. I can see application c

Re: How to share a NonSerializable variable among tasks in the same worker node?

2014-08-03 Thread Ron's Yahoo!
I think you’re going to have to make it serializable by registering it with the Kryo registrator. I think multiple workers are running as separate VMs so it might need to be able to serialize and deserialize broadcasted variables to the different executors. Thanks, Ron On Aug 3, 2014, at 6:38

Re: How to share a NonSerializable variable among tasks in the same worker node?

2014-08-03 Thread Fengyun RAO
Could anybody help? I wonder if I asked a stupid question or I didn't make the question clear? 2014-07-31 21:47 GMT+08:00 Fengyun RAO : > As shown here: > 2 - Why Is My Spark Job so Slow and Only Using a Single Thread? >

Re: pyspark script fails on EMR with an ERROR in configuring object.

2014-08-03 Thread jay vyas
I think this looks like the typical LZO error that people get when they don't install it and try to use the codec. It happens because LZO isn't (can't be) bundled, so you won't have it by default in any canned Hadoop installation. On Sun, Aug 3, 2014 at 8:29 PM, Eric Friedman wrote: > I am clos

Re: pyspark script fails on EMR with an ERROR in configuring object.

2014-08-03 Thread Eric Friedman
I am close to giving up on PySpark on YARN. It simply doesn't work for straightforward operations and it's quite difficult to understand why. I would love to be proven wrong, by the way. Eric Friedman > On Aug 3, 2014, at 7:03 AM, Rahul Bhojwani > wrote: > > The logs provided in the i

Cached RDD Block Size - Uneven Distribution

2014-08-03 Thread iramaraju
I am running spark 1.0.0, Tachyon 0.5 and Hadoop 1.0.4. I am selecting a subset of a large dataset and trying to run queries on the cached schema RDD. Strangely, in the web UI, I see the following. 150 Partitions Block Name Storage Level Size in Memory Size on Disk Executors rdd_

Re: Compiling Spark master (284771ef) with sbt/sbt assembly fails on EC2

2014-08-03 Thread Patrick Wendell
Here is the exact sequence of commands you can use for the workaround: === $ cd ~/.ivy2/cache/org.scala-lang/ $ mkdir -p scala-library && cd scala-library $ wget https://raw.githubusercontent.com/peterklipfel/scala_koans/master/ivyrepo/cache/org.scala-lang/scala-library/ivy-2.10.2.xml $ wget https

Re: disable log4j for spark-shell

2014-08-03 Thread Patrick Wendell
If you want to customize the logging behavior, the simplest way is to copy conf/log4j.properties.template to conf/log4j.properties. Then you can go and modify the log level in there. The Spark shells should pick this up. On Sun, Aug 3, 2014 at 6:16 AM, Sean Owen wrote: > That's just a templat
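After copying the template as Patrick suggests, lowering the root level might look like this (WARN is only an example level; the appender lines mirror the template's defaults and may differ slightly across Spark versions):

```properties
# conf/log4j.properties -- copied from conf/log4j.properties.template
log4j.rootCategory=WARN, console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n
```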

Writing to RabbitMQ

2014-08-03 Thread jschindler
I have been trying to write to RabbitMQ in my Spark Streaming app and I receive the below exception: java.io.NotSerializableException: com.rabbitmq.client.impl.ChannelN Does anyone have experience sending their data to rabbit? I am using the basicPublish call like so -> SQLChannel.basicPublish(""

Re: GraphX runs without Spark?

2014-08-03 Thread Deep Pradhan
We need to pass the URL only when we are using the interactive shell right? Now, I am not using the interactive shell, I am just doing ./bin/run-example.. when I am in the Spark directory. >>If not, Spark may be ignoring your single-node cluster and defaulting to local mode. What does this

Re: Low Level Kafka Consumer for Spark

2014-08-03 Thread hodgesz
Very nice! I also was wondering about the offset autocommit in KafkaUtils. Since incoming streamed Kafka data is replicated across Spark nodes in memory it seems it is possible to have up to a batch of data loss if tasks hang or crash. It seems you have avoided this case by using the Kafka simpl

Re: pyspark script fails on EMR with an ERROR in configuring object.

2014-08-03 Thread Rahul Bhojwani
The logs provided in the image may not be enough for help. Here I have copied the whole logs: WARNING: Running python applications through ./bin/pyspark is deprecated as of Spark 1.0. Use ./bin/spark-submit 14/08/03 11:10:57 INFO SparkConf: Using Spark's default log4j profile: org/apache/spark/l

Re: disable log4j for spark-shell

2014-08-03 Thread Sean Owen
That's just a template. Nothing consults that file by default. It's looking inside the Spark .jar. If you edit core/src/main/resources/org/apache/spark/log4j-defaults.properties and rebuild Spark, it will pick up those changes. I think you could also use the JVM argument "-Dlog4j.configuration=co

disable log4j for spark-shell

2014-08-03 Thread Gil Vernik
Hi, I would like to run spark-shell without any INFO messages printed. To achieve this I edited /conf/log4j.properties and added the line log4j.rootLogger=OFF, which is supposed to disable all logging. However, when I run ./spark-shell I see the message 14/08/03 16:02:15 INFO SecurityManager: Using Spar

pyspark script fails on EMR with an ERROR in configuring object.

2014-08-03 Thread Rahul Bhojwani
Hi, I used to run spark scripts on my local machine. Now I am porting my code to EMR and I am facing lots of problems. The main one now is that the spark script which runs properly on my local machine is giving an error when run on an Amazon EMR cluster. Here is the error: [image: Inline image 1]

Re: error while running kafka-spark-example

2014-08-03 Thread Sean Owen
You have marked Spark dependencies as 'provided', but are evidently not 'providing' them at runtime. You haven't said how you are running them. Running with spark-submit should set up the classpath correctly. On Sun, Aug 3, 2014 at 12:47 PM, Mahebub Sayyed wrote: > Hello, > > I am getting followi
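The 'provided' scope Sean refers to looks like this in a Maven pom (artifact and version are illustrative for a Spark 1.x / Scala 2.10 build): the classes are visible at compile time but deliberately left out of the assembly, so something, normally spark-submit, must supply them at runtime.

```xml
<!-- Compiled against, but not bundled; spark-submit puts Spark on the
     runtime classpath. Launching with a bare `java -jar` will then fail
     with NoClassDefFoundError, as in this thread. -->
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-core_2.10</artifactId>
  <version>1.0.1</version>
  <scope>provided</scope>
</dependency>
```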

Re: error while running kafka-spark-example

2014-08-03 Thread Sameer Sayyed
I have a jar file "kafka-spark-example.jar". What should be the location of the jar file while running kafka-spark-example using *cloudera-quickstart-vm-5.0.0-0-vmware*? On Sun, Aug 3, 2014 at 2:47 PM, Mahebub Sayyed wrote: > Hello, > > I am getting following error while running kafka-spark-example: >

error while running kafka-spark-example

2014-08-03 Thread Mahebub Sayyed
Hello, I am getting following error while running kafka-spark-example: Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/spark/api/java/function/Function at java.lang.Class.getDeclaredMethods0(Native Method) at java.lang.Class.privateGetDeclaredMethods(Class.java:2531) at java

Re: Starting with spark

2014-08-03 Thread Sean Owen
If the question is likely about the Quickstart VM, it's better to ask in the VM forum: https://community.cloudera.com/t5/Apache-Hadoop-Concepts-and/bd-p/ApacheHadoopConcepts Please give more detail though; it's not clear what you mean is not working. On Sun, Aug 3, 2014 at 10:09 AM, Mahebub Sayye

Re: Starting with spark

2014-08-03 Thread Mahebub Sayyed
Hello, I have enabled Spark in the Quickstart VM and Running SparkPi in Standalone Mode reference: *http://www.cloudera.com/content/cloudera-content/cloudera-docs/CDH5/latest/CDH5-Installation-Guide/cdh5ig_running_spark_apps.html

Re: GraphX runs without Spark?

2014-08-03 Thread Ankur Dave
At 2014-08-03 13:14:52 +0530, Deep Pradhan wrote: > I have a single node cluster on which I have Spark running. I ran some > graphx code on some data set. Now when I stop all the workers in the > cluster (sbin/stop-all.sh), the code still runs and gives the answers. Why > is it so? I mean does gr

GraphX runs without Spark?

2014-08-03 Thread Deep Pradhan
I have a single node cluster on which I have Spark running. I ran some graphx code on some data set. Now when I stop all the workers in the cluster (sbin/stop-all.sh), the code still runs and gives the answers. Why is it so? I mean does graphx run even without Spark coming up? Same thing even whil