Re: Unable to find org.apache.spark.sql.catalyst.ScalaReflection class

2015-03-13 Thread Eric Charles
I have the same issue running Spark SQL code from the Eclipse workspace. If you run your code from the command line (with a packaged jar) or from IntelliJ, I bet it should work. IMHO this is somehow related to the Eclipse env, but would love to know how to fix it (whether via Eclipse conf, or via a patch

RE: spark sql performance

2015-03-13 Thread Udbhav Agarwal
Sounds great! So can I expect response time in milliseconds from the join query over this much data ( 0.5 million in each table) ? Thanks, Udbhav Agarwal From: Akhil Das [mailto:ak...@sigmoidanalytics.com] Sent: 13 March, 2015 12:27 PM To: Udbhav Agarwal Cc: user@spark.apache.org Subject: Re: s

Re: spark sql performance

2015-03-13 Thread Akhil Das
Can't say that unless you try it. Thanks Best Regards On Fri, Mar 13, 2015 at 12:32 PM, Udbhav Agarwal wrote: > Sounds great! > > So can I expect response time in milliseconds from the join query over > this much data ( 0.5 million in each table) ? > > > > > > *Thanks,* > > *Udbhav Agarwal* >

RE: spark sql performance

2015-03-13 Thread Udbhav Agarwal
Okay Akhil! Thanks for the information. Thanks, Udbhav Agarwal From: Akhil Das [mailto:ak...@sigmoidanalytics.com] Sent: 13 March, 2015 12:34 PM To: Udbhav Agarwal Cc: user@spark.apache.org Subject: Re: spark sql performance Can't say that unless you try it. Thanks Best Regards On Fri, Mar 13,

Re: Using Neo4j with Apache Spark

2015-03-13 Thread Gautam Bajaj
I have been trying to do the same, but where exactly do you suggest creating the static object? As creating it inside for each RDD will ultimately result in same problem and not creating it inside will result in serializability issue. On Fri, Mar 13, 2015 at 11:47 AM, Tathagata Das wrote: > Well

RE: spark sql performance

2015-03-13 Thread Udbhav Agarwal
Additionally I wanted to mention that presently I am running the query on one machine with 3GB RAM and the join query is taking around 6 seconds. Thanks, Udbhav Agarwal From: Udbhav Agarwal Sent: 13 March, 2015 12:45 PM To: 'Akhil Das' Cc: user@spark.apache.org Subject: RE: spark sql performance

Re: spark sql performance

2015-03-13 Thread Akhil Das
You can see where it is spending time and whether there is any GC time etc. from the web UI (running on port 4040). Also, how many cores do you have? Thanks Best Regards On Fri, Mar 13, 2015 at 1:05 PM, Udbhav Agarwal wrote: > Additionally I wanted to tell that presently I was running the query on > on

RE: spark sql performance

2015-03-13 Thread Udbhav Agarwal
Okay Akhil. I have a 4-core CPU (2.4 GHz). Thanks, Udbhav Agarwal From: Akhil Das [mailto:ak...@sigmoidanalytics.com] Sent: 13 March, 2015 1:07 PM To: Udbhav Agarwal Cc: user@spark.apache.org Subject: Re: spark sql performance You can see where it is spending time, whether there is any GC Tim

Explanation on the Hive in the Spark assembly

2015-03-13 Thread bit1...@163.com
Hi, sparkers, I am kind of confused about Hive in the Spark assembly. I think the Hive in the Spark assembly is not the same thing as Hive on Spark (Hive on Spark is meant to run Hive using the Spark execution engine). So, my question is: 1. What is the difference between Hive in the spark assembly an

No assemblies found in assembly/target/scala-2.10

2015-03-13 Thread Patcharee Thongtra
Hi, I am trying to build Spark 1.3 from source. After I executed mvn -DskipTests clean package I tried to use the shell but got this error: [root@sandbox spark]# ./bin/spark-shell Exception in thread "main" java.lang.IllegalStateException: No assemblies found in '/root/spark/assembly/target/scal

Re: Unable to find org.apache.spark.sql.catalyst.ScalaReflection class

2015-03-13 Thread Jianshi Huang
I guess it's a ClassLoader issue. But I have no idea how to debug it. Any hints? Jianshi On Fri, Mar 13, 2015 at 3:00 PM, Eric Charles wrote: > i have the same issue running spark sql code from eclipse workspace. If > you run your code from the command line (with a packaged jar) or from > Inte

RE: Explanation on the Hive in the Spark assembly

2015-03-13 Thread Wang, Daoyuan
Hi bit1129, 1. The Hive in the Spark assembly has most of Hive's dependencies on Hadoop removed, to avoid conflicts. 2. This Hive is used to run some native commands, which do not rely on Spark or MapReduce. Thanks, Daoyuan From: bit1...@163.com [mailto:bit1...@163.com] Sent: Friday, March 13, 2015 4:24 PM

Spark SQL. Cast to Bigint

2015-03-13 Thread Masf
Hi. I have a query in Spark SQL and I cannot convert a value to BIGINT: CAST(column AS BIGINT) or CAST(0 AS BIGINT) The output is: Exception in thread "main" java.lang.RuntimeException: [34.62] failure: ``DECIMAL'' expected but identifier BIGINT found Thanks!! Regards. Miguel Ángel

How to do sparse vector product in Spark?

2015-03-13 Thread Xi Shen
Hi, I have two RDD[Vector]; both Vectors are sparse and of the form: (id, value) "id" indicates the position of the value in the vector space. I want to apply a dot product on two such RDD[Vector] and get a scalar value. The non-existent values are treated as zero. Any convenient tool to do th
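
A minimal sketch under the assumption that each sparse vector is held as an RDD of (index, value) pairs, as described above, with a spark-shell sc in scope; positions missing from either side are zero, so a join over the index keeps exactly the positions present in both:

    import org.apache.spark.SparkContext._   // pair/double RDD implicits on older Spark versions

    // Hypothetical inputs: two sparse vectors as RDD[(Long, Double)] of (index, value).
    val vecA = sc.parallelize(Seq((0L, 1.0), (3L, 2.0)))
    val vecB = sc.parallelize(Seq((3L, 4.0), (7L, 5.0)))

    val dot: Double = vecA.join(vecB)                 // RDD[(Long, (Double, Double))]
      .map { case (_, (a, b)) => a * b }
      .sum()                                          // 2.0 * 4.0 = 8.0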

RDD to InputStream

2015-03-13 Thread Ayoub
Hello, I need to convert an RDD[String] to a java.io.InputStream but I didn't find an easy way to do it. Currently I am saving the RDD as a temporary file and then opening an InputStream on the file, but that is not really optimal. Does anybody know a better way to do that ? Thanks, Ayoub. -- V

Re: Unable to find org.apache.spark.sql.catalyst.ScalaReflection class

2015-03-13 Thread Jianshi Huang
I'm almost certain the problem is the ClassLoader. So adding fork := true solves problems for test and run. The problem is how can I fork a JVM for sbt console? fork in console := true seems not working... Jianshi On Fri, Mar 13, 2015 at 4:35 PM, Jianshi Huang wrote: > I guess it's a Cla
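
A hedged build.sbt sketch of the fork settings mentioned above: forking runs the program in a fresh JVM with a flat classpath, which sidesteps sbt's layered ClassLoader, but the sbt console always runs in-process, which matches the observation that "fork in console := true" has no effect.

    // build.sbt (sbt 0.13.x syntax) -- a sketch, not a verified fix
    fork := true            // fork the JVM for `run`
    fork in Test := true    // and for `test`
    // `console` is evaluated inside the sbt JVM and does not honor `fork`,
    // so the ClassLoader issue described above remains there.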

Re: RDD to InputStream

2015-03-13 Thread Sean Owen
These are quite different creatures. You have a distributed set of Strings, but want a local stream of bytes, which involves three conversions: - collect data to driver - concatenate strings in some way - encode strings as bytes according to an encoding Your approach is OK but might be faster to

Re: No assemblies found in assembly/target/scala-2.10

2015-03-13 Thread Sean Owen
FWIW I do not see any problem. Are you sure the build completed successfully? You should see an assembly then in assembly/target/scala-2.10, so should check that. On Fri, Mar 13, 2015 at 8:26 AM, Patcharee Thongtra wrote: > Hi, > > I am trying to build spark 1.3 from source. After I executed > >

Re: RDD to InputStream

2015-03-13 Thread Ayoub
Thanks Sean, I forgot to mention that the data is too big to be collected on the driver. So yes your proposition would work in theory but in my case I cannot hold all the data in the driver memory, therefore it wouldn't work. I guess the crucial point is to do the collect in a lazy way and in th

Visualizing the DAG of a Spark application

2015-03-13 Thread t1ny
Hi all, We are looking for a tool that would let us visualize the DAG generated by a Spark application as a simple graph. This graph would represent the Spark Job, its stages and the tasks inside the stages, with the dependencies between them (either narrow or shuffle dependencies). The Spark Re

Using rdd methods with Dstream

2015-03-13 Thread Laeeq Ahmed
Hi, I normally use dstream.transform whenever I need to use methods which are available in RDD API but not in streaming API. e.g. dstream.transform(x => x.sortByKey(true)) But there are other RDD methods which return types other than RDD. e.g. dstream.transform(x => x.top(5)) top here returns Ar

Re: Unable to find org.apache.spark.sql.catalyst.ScalaReflection class

2015-03-13 Thread Jianshi Huang
Liancheng also found out that the Spark jars are not included in the classpath of URLClassLoader. Hmm... we're very close to the truth now. Jianshi On Fri, Mar 13, 2015 at 6:03 PM, Jianshi Huang wrote: > I'm almost certain the problem is the ClassLoader. > > So adding > > fork := true > > so

Re: Visualizing the DAG of a Spark application

2015-03-13 Thread Todd Nist
There is the PR https://github.com/apache/spark/pull/2077 for doing this. On Fri, Mar 13, 2015 at 6:42 AM, t1ny wrote: > Hi all, > > We are looking for a tool that would let us visualize the DAG generated by > a > Spark application as a simple graph. > This graph would represent the Spark Job, i

Re: Using rdd methods with Dstream

2015-03-13 Thread Akhil Das
Like this? dstream.repartition(1).mapPartitions(it => it.take(5)) Thanks Best Regards On Fri, Mar 13, 2015 at 4:11 PM, Laeeq Ahmed wrote: > Hi, > > I normally use dstream.transform whenever I need to use methods which are > available in RDD API but not in streaming API. e.g. dstream.transform

Pyspark saveAsTextFile exceptions

2015-03-13 Thread Madabhattula Rajesh Kumar
Hi Team, I'm getting the below exception when saving the results into Hadoop. Code: rdd.saveAsTextFile("hdfs://localhost:9000/home/rajesh/data/result.rdd") Could you please help me resolve this issue? 15/03/13 17:19:31 INFO spark.SparkContext: Starting job: saveAsTextFile at NativeMethodA

Re: Which is more efficient : first join three RDDs and then do filtering or vice versa?

2015-03-13 Thread shahab
Thanks, it makes sense. On Thursday, March 12, 2015, Daniel Siegmann wrote: > Join causes a shuffle (sending data across the network). I expect it will > be better to filter before you join, so you reduce the amount of data which > is sent across the network. > > Note this would be true for *any
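
A minimal sketch of the advice quoted above, with hypothetical data and a spark-shell sc: filtering each side before the join shrinks what gets shuffled across the network.

    import org.apache.spark.SparkContext._   // pair RDD implicits on older Spark versions

    // Hypothetical keyed datasets for illustration.
    val rddA = sc.parallelize(Seq((1, "a"), (2, "b"), (3, "c")))
    val rddB = sc.parallelize(Seq((2, "x"), (3, "y"), (4, "z")))

    // Prune before joining so the shuffle moves less data.
    val filteredA = rddA.filter { case (k, _) => k % 2 == 0 }
    val filteredB = rddB.filter { case (k, _) => k % 2 == 0 }
    val joined = filteredA.join(filteredB)   // only the surviving keys are shuffled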

Re: Using rdd methods with Dstream

2015-03-13 Thread Laeeq Ahmed
Hi, repartition is expensive. Looking for an efficient way to do this. Regards, Laeeq On Friday, March 13, 2015 12:24 PM, Akhil Das wrote: Like this? dstream.repartition(1).mapPartitions(it => it.take(5)) Thanks Best Regards On Fri, Mar 13, 2015 at 4:11 PM, Laeeq Ahmed wrote: Hi, I

Re: Explanation on the Hive in the Spark assembly

2015-03-13 Thread bit1...@163.com
Can anyone have a look at this question? Thanks. bit1...@163.com From: bit1...@163.com Date: 2015-03-13 16:24 To: user Subject: Explanation on the Hive in the Spark assembly Hi, sparkers, I am kind of confused about hive in the spark assembly. I think hive in the spark assembly is not the s

Re: Using rdd methods with Dstream

2015-03-13 Thread Sean Owen
Hm, aren't you able to use the SparkContext here? DStream operations happen on the driver. So you can parallelize() the result? take() won't work as it's not the same as top() On Fri, Mar 13, 2015 at 11:23 AM, Akhil Das wrote: > Like this? > > dstream.repartition(1).mapPartitions(it => it.take(5)

Re: RDD to InputStream

2015-03-13 Thread Sean Owen
OK, then you do not want to collect() the RDD. You can get an iterator, yes. There is no such thing as making an Iterator into an InputStream. An Iterator is a sequence of arbitrary objects; an InputStream is a channel to a stream of bytes. I think you can employ similar Guava / Commons utilities t
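
A hedged sketch of that idea: pull records lazily with toLocalIterator (one partition in driver memory at a time), wrap each String as a ByteArrayInputStream, and splice the pieces together with SequenceInputStream. The UTF-8 encoding and the appended newline are assumptions about the desired byte format.

    import java.io.{ByteArrayInputStream, InputStream, SequenceInputStream}
    import java.nio.charset.StandardCharsets
    import scala.collection.JavaConverters._
    import org.apache.spark.rdd.RDD

    // Stream an RDD[String] as bytes without collect()ing it on the driver.
    def rddAsInputStream(rdd: RDD[String]): InputStream = {
      val chunks = rdd.toLocalIterator.map { s =>
        new ByteArrayInputStream((s + "\n").getBytes(StandardCharsets.UTF_8)): InputStream
      }
      new SequenceInputStream(chunks.asJavaEnumeration)
    }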

Re: Using rdd methods with Dstream

2015-03-13 Thread Laeeq Ahmed
Hi, Earlier my code was like the following, but slow due to repartition. I want the top K of each window in a stream. val counts = keyAndValues.map(x => math.round(x._3.toDouble)).countByValueAndWindow(Seconds(4), Seconds(4)); val topCounts = counts.repartition(1).map(_.swap).transform(rdd => rdd.sortByKey

Re: Solve least square problem of the form min norm(A x - b)^2^ + lambda * n * norm(x)^2 ?

2015-03-13 Thread Jaonary Rabarisoa
It runs faster but there are some drawbacks. It seems to consume more memory. I get a java.lang.OutOfMemoryError: Java heap space error if I don't have sufficient partitions for a fixed amount of memory. With the older (ampcamp) implementation for the same data size I didn't get it. On Thu, Mar 12,

Re: Spark SQL. Cast to Bigint

2015-03-13 Thread Yin Huai
Are you using SQLContext? Right now, the parser in the SQLContext is quite limited on the data type keywords that it handles (see here ) and unfortunately "bigint" is not hand
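
A hedged workaround sketch: the HiveContext parser (HiveQL) understands the BIGINT keyword, so the same cast typically parses there; the table and column names below are hypothetical.

    import org.apache.spark.sql.hive.HiveContext

    val hiveCtx = new HiveContext(sc)
    hiveCtx.sql("SELECT CAST(amount AS BIGINT) AS amount_l FROM my_table")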

[GRAPHX] could not process graph with 230M edges

2015-03-13 Thread Hlib Mykhailenko
Hello, I cannot process graph with 230M edges. I cloned apache.spark, build it and then tried it on cluster. I used Spark Standalone Cluster: -5 machines (each has 12 cores/32GB RAM) -'spark.executor.memory' == 25g -'spark.driver.memory' == 3g Graph has 231359027 edges. And its file weig

Re: Visualizing the DAG of a Spark application

2015-03-13 Thread t1ny
For anybody who's interested in this, here's a link to a PR that addresses this feature : https://github.com/apache/spark/pull/2077 (thanks to Todd Nist for sending it to me) -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Visualizing-the-DAG-of-a-Spark-app

Errors in SPARK

2015-03-13 Thread sandeep vura
Hi Sparkers, Can anyone please check the below error and give a solution for this? I am using hive version 0.13 and spark 1.2.1. Step 1: I have installed hive 0.13 with local metastore (MySQL database) Step 2: Hive is running without any errors and I am able to create tables and load data into hive t

RE: How to do sparse vector product in Spark?

2015-03-13 Thread Daniel, Ronald (ELS-SDG)
> Any convenient tool to do this [sparse vector product] in Spark? Unfortunately, it seems that there are very few operations defined for sparse vectors. I needed to add some, and ended up converting them to (dense) numpy vectors and doing the addition on those. Best regards, Ron From: Xi She

Re: How to do sparse vector product in Spark?

2015-03-13 Thread Sean Owen
In Java/Scala-land, the intent is to use Breeze for this. "Vector" in Spark is an opaque wrapper around the Breeze representation, which contains a bunch of methods like this. On Fri, Mar 13, 2015 at 3:28 PM, Daniel, Ronald (ELS-SDG) wrote: >> Any convenient tool to do this [sparse vector product

Lots of fetch failures on saveAsNewAPIHadoopDataset PairRDDFunctions

2015-03-13 Thread freedafeng
spark1.1.1 + Hbase (CDH5.3.1). 20 nodes each with 4 cores and 32G memory. 3 cores and 16G memory were assigned to spark in each worker node. Standalone mode. Data set is 3.8 T. wondering how to fix this. Thanks! org.apache.spark.rdd.PairRDDFunctions.saveAsNewAPIHadoopDataset(PairRDDFunctions.scala

Workflow layer for Spark

2015-03-13 Thread Karthikeyan Muthukumar
Hi, We are building a machine learning platform based on ML-Lib in Spark. We would be using Scala for the development. We need a thin workflow layer where we can easily configure the different actions to be done, configuration for the actions (like load-data, clean-data, split-data etc), and the or

Re: Workflow layer for Spark

2015-03-13 Thread Ted Yu
Have you seen this thread http://search-hadoop.com/m/JW1q50XEa82 ? On Fri, Mar 13, 2015 at 8:46 AM, Karthikeyan Muthukumar < mkarthiksw...@gmail.com> wrote: > Hi, > We are building a machine learning platform based on ML-Lib in Spark. We > would be using Scala for the development. > We need a thi

Re: Workflow layer for Spark

2015-03-13 Thread Peter Rudenko
Take a look at the new Spark ML API with Pipeline functionality, and also at spark-dataflow, a Google Cloud Dataflow API implementation on top of Spark. Thanks, Peter Rudenko On 2015-03-13 17:46, Ka

Re: Partitioning Dataset and Using Reduce in Apache Spark

2015-03-13 Thread Daniel Siegmann
On Thu, Mar 12, 2015 at 1:45 AM, wrote: > > In your response you say “When you call reduce and *similar *methods, > each partition can be reduced in parallel. Then the results of that can be > transferred across the network and reduced to the final result”. By similar > methods do you mean all ac

Hanging tasks in spark 1.2.1 while working with 1.1.1

2015-03-13 Thread Eugen Cepoi
Hi, I have a job that hangs after upgrading to spark 1.2.1 from 1.1.1. Strange thing, the exact same code does work (after upgrade) in the spark-shell. But this information might be misleading as it works with 1.1.1... *The job takes as input two data sets:* - rdd A of +170gb (with less it is h

[ANNOUNCE] Announcing Spark 1.3!

2015-03-13 Thread Patrick Wendell
Hi All, I'm happy to announce the availability of Spark 1.3.0! Spark 1.3.0 is the fourth release on the API-compatible 1.X line. It is Spark's largest release ever, with contributions from 172 developers and more than 1,000 commits! Visit the release notes [1] to read about the new features, or d

jar conflict with Spark default packaging

2015-03-13 Thread Shuai Zheng
Hi All, I am running Spark to deal with AWS, and the latest AWS SDK works with httpclient 3.4+. But the spark-assembly-*-.jar file has packaged an old httpclient version, which causes: ClassNotFoundException for org/apache/http/client/methods/HttpPatch Even when I put the rig

Re: Hanging tasks in spark 1.2.1 while working with 1.1.1

2015-03-13 Thread Ted Yu
Might be related: what's the value for spark.yarn.executor.memoryOverhead ? See SPARK-6085 Cheers On Fri, Mar 13, 2015 at 9:45 AM, Eugen Cepoi wrote: > Hi, > > I have a job that hangs after upgrading to spark 1.2.1 from 1.1.1. Strange > thing, the exact same code does work (after upgrade) in t
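
For reference, a minimal sketch of how that setting is supplied (value in MB); it can also be passed as --conf spark.yarn.executor.memoryOverhead=1024 to spark-submit.

    import org.apache.spark.SparkConf

    // Off-heap overhead per executor container on YARN, in MB (the knob mentioned above).
    val conf = new SparkConf().set("spark.yarn.executor.memoryOverhead", "1024")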

Re: Solve least square problem of the form min norm(A x - b)^2^ + lambda * n * norm(x)^2 ?

2015-03-13 Thread Shivaram Venkataraman
Do you have a small test case that can reproduce the out of memory error ? I have also seen some errors on large scale experiments but haven't managed to narrow it down. Thanks Shivaram On Fri, Mar 13, 2015 at 6:20 AM, Jaonary Rabarisoa wrote: > It runs faster but there is some drawbacks. It se

Re: [ANNOUNCE] Announcing Spark 1.3!

2015-03-13 Thread Kushal Datta
Kudos to the whole team for such a significant achievement! On Fri, Mar 13, 2015 at 10:00 AM, Patrick Wendell wrote: > Hi All, > > I'm happy to announce the availability of Spark 1.3.0! Spark 1.3.0 is > the fourth release on the API-compatible 1.X line. It is Spark's > largest release ever, with

Re: [ANNOUNCE] Announcing Spark 1.3!

2015-03-13 Thread Sebastián Ramírez
Awesome! Thanks! *Sebastián Ramírez* Head of Software Development Tel: (+571) 795 7950 ext: 1012 Cel: (+57) 300 370 77 10 Calle 73 No 7 - 06 Piso 4 Linkedin: co.linkedin.com/in/tiangolo/ Twitter: @tiangolo Email: se

Re: Hanging tasks in spark 1.2.1 while working with 1.1.1

2015-03-13 Thread Eugen Cepoi
The one by default 0.07 of executor memory. I'll try increasing it and post back the result. Thanks 2015-03-13 18:09 GMT+01:00 Ted Yu : > Might be related: what's the value for spark.yarn.executor.memoryOverhead ? > > See SPARK-6085 > > Cheers > > On Fri, Mar 13, 2015 at 9:45 AM, Eugen Cepoi >

How do I alter the combination of keys that exit the Spark shell?

2015-03-13 Thread Adamantios Corais
Hi, I want to change the default combination of keys that exits the Spark shell (i.e. CTRL + C) to something else, such as CTRL + H. Thank you in advance. // Adamantios

Re: How do I alter the combination of keys that exit the Spark shell?

2015-03-13 Thread Marcelo Vanzin
You can type ":quit". On Fri, Mar 13, 2015 at 10:29 AM, Adamantios Corais wrote: > Hi, > > I want change the default combination of keys that exit the Spark shell > (i.e. CTRL + C) to something else, such as CTRL + H? > > Thank you in advance. > > // Adamantios > > > -- Marcelo -

Re: How do I alter the combination of keys that exit the Spark shell?

2015-03-13 Thread Adamantios Corais
This doesn't solve my problem... apparently, my problem is that from time to time I accidentally press CTRL + C (instead of CTRL + ALT + V for copying commands in the shell) and that results in closing my shell. In order to solve this I was wondering if I could just deactivate the CTRL + C combination

serialization stackoverflow error during reduce on nested objects

2015-03-13 Thread ilaxes
Hi, I'm working on an RDD of a tuple of objects which represent trees (a Node containing a hashmap of nodes). I'm trying to aggregate these trees over the RDD. Let's take an example, 2 graphs : C - D - B - A - D - B - E F - E - B - A - D - B - F I'm splitting each graph according to the vertex A re

Re: How do I alter the combination of keys that exit the Spark shell?

2015-03-13 Thread Sean Owen
This is more of a function of your shell, since ctrl-C will send SIGINT. I'm sure you can change that, depending on your shell or terminal, but it's not a Spark-specific answer. On Fri, Mar 13, 2015 at 5:44 PM, Adamantios Corais wrote: > this doesn't solve my problem... apparently, my problem is

Re: How do I alter the combination of keys that exit the Spark shell?

2015-03-13 Thread Marcelo Vanzin
BTW this might be useful: http://superuser.com/questions/160388/change-bash-shortcut-keys-such-as-ctrl-c On Fri, Mar 13, 2015 at 10:53 AM, Sean Owen wrote: > This is more of a function of your shell, since ctrl-C will send > SIGINT. I'm sure you can change that, depending on your shell or > termi

Re: How do I alter the combination of keys that exit the Spark shell?

2015-03-13 Thread Marcelo Vanzin
I'm not aware of a way to do that. CTRL-C is generally handled by the terminal emulator and just sends a SIGINT to the process; and the JVM doesn't allow you to ignore SIGINT. Setting a signal handler on the parent process doesn't work either. And there are other combinations you need to care abou

Re: Hanging tasks in spark 1.2.1 while working with 1.1.1

2015-03-13 Thread Eugen Cepoi
Hmm, increased it to 1024 but it doesn't help, I still have the same problem :( 2015-03-13 18:28 GMT+01:00 Eugen Cepoi : > The one by default 0.07 of executor memory. I'll try increasing it and > post back the result. > > Thanks > > 2015-03-13 18:09 GMT+01:00 Ted Yu : > >> Might be related: what's the v

RE: Handling worker batch processing during driver shutdown

2015-03-13 Thread Jose Fernandez
Thanks, I am using 1.2.0 so it looks like I am affected by the bug you described. It also appears that the shutdown hook doesn’t work correctly when the driver is running in YARN. According to the logs it looks like the SparkContext is closed and the code you suggested is never executed and fai

Date and decimal datatype not working

2015-03-13 Thread BASAK, ANANDA
Hi All, I am very new in Spark world. Just started some test coding from last week. I am using spark-1.2.1-bin-hadoop2.4 and scala coding. I am having issues while using Date and decimal data types. Following is my code that I am simply running on scala prompt. I am trying to define a table and

Any way to find out feature importance in Spark SVM?

2015-03-13 Thread Natalia Connolly
Hello, While running an SVMClassifier in spark, is there any way to print out the most important features the model selected? I see there's a way to print out the weights but I am not sure how to associate them with the input features, by name or in any other way. Any help would be appreciat

Re: Getting incorrect weights for LinearRegression

2015-03-13 Thread Burak Yavuz
Hi, I would suggest you use LBFGS, as I think the step size is hurting you. You can run the same thing in LBFGS as: ``` val algorithm = new LBFGS(new LeastSquaresGradient(), new SimpleUpdater()) val initialWeights = Vectors.dense(Array.fill(3)( scala.util.Random.nextDouble())) val weights = algo
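
Since the archived snippet above is truncated, here is a hedged reconstruction of the gist: run least squares through the LBFGS optimizer (no step size to tune). The trainingData below is a hypothetical RDD[(Double, Vector)] of (label, features) added only to make the sketch self-contained.

    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.optimization.{LBFGS, LeastSquaresGradient, SimpleUpdater}

    // Hypothetical training data: (label, features) with a leading 1.0 as an intercept column.
    val trainingData = sc.parallelize(Seq(
      (1.0, Vectors.dense(1.0, 0.5, 2.0)),
      (2.0, Vectors.dense(1.0, 1.5, 1.0)),
      (3.0, Vectors.dense(1.0, 2.5, 0.5))))

    val numFeatures = 3
    val initialWeights = Vectors.dense(Array.fill(numFeatures)(scala.util.Random.nextDouble()))
    val algorithm = new LBFGS(new LeastSquaresGradient(), new SimpleUpdater())
    val weights = algorithm.optimize(trainingData, initialWeights)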

how to print RDD by key into file with grouByKey

2015-03-13 Thread Adrian Mocanu
Hi I have an RDD: RDD[(String, scala.Iterable[(Long, Int)])] which I want to print into a file, a file for each key string. I tried to trigger a repartition of the RDD by doing group by on it. The grouping gives RDD[(String, scala.Iterable[Iterable[(Long, Int)]])] so I flattened that: Rdd.gro
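
One possible approach (a sketch, not the only one): the old Hadoop "mapred" API's MultipleTextOutputFormat can route records to a file named after the key, avoiding the groupBy entirely. The small rdd below is a hypothetical stand-in for the RDD[(String, Iterable[(Long, Int)])] described above.

    import org.apache.hadoop.io.NullWritable
    import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat
    import org.apache.spark.SparkContext._   // pair RDD implicits on older Spark versions

    // Route each record to an output file named after its key.
    class KeyBasedOutput extends MultipleTextOutputFormat[Any, Any] {
      override def generateFileNameForKeyValue(key: Any, value: Any, name: String): String =
        key.toString
      override def generateActualKey(key: Any, value: Any): Any = NullWritable.get()
    }

    // Hypothetical keyed RDD matching the shape described above.
    val rdd = sc.parallelize(Seq(
      ("keyA", Iterable((1L, 10), (2L, 20))),
      ("keyB", Iterable((3L, 30)))))

    rdd.flatMapValues(identity)   // one (String, (Long, Int)) pair per value
       .saveAsHadoopFile("/tmp/by-key", classOf[String], classOf[(Long, Int)], classOf[KeyBasedOutput])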

RE: Handling worker batch processing during driver shutdown

2015-03-13 Thread Tathagata Das
Are you running the code before or after closing the spark context? It must be after stopping streaming context (without spark context) and before stopping spark context. Cc'ing Sean. He may have more insights. On Mar 13, 2015 11:19 AM, "Jose Fernandez" wrote: > Thanks, I am using 1.2.0 so it l

Re: Getting incorrect weights for LinearRegression

2015-03-13 Thread EcoMotto Inc.
Thanks a lot Burak, that helped. On Fri, Mar 13, 2015 at 1:55 PM, Burak Yavuz wrote: > Hi, > > I would suggest you use LBFGS, as I think the step size is hurting you. > You can run the same thing in LBFGS as: > > ``` > val algorithm = new LBFGS(new LeastSquaresGradient(), new SimpleUpdater()) >

Re: spark sql writing in avro

2015-03-13 Thread M. Dale
I probably did not do a good enough job explaining the problem. If you used Maven with the default Maven repository you have an old version of spark-avro that does not contain AvroSaver and does not have the saveAsAvro method implemented: Assuming you use the default Maven repo location: cd

Re: spark sql writing in avro

2015-03-13 Thread Kevin Peng
Markus, Thanks. That makes sense. I was able to get this to work with spark-shell passing in the git built jar. I did notice that I couldn't get AvroSaver.save to work with SQLContext, but it works with HiveContext. Not sure if that is an issue, but for me, it is fine. Once again, thanks for

spark flume tryOrIOException NoSuchMethodError

2015-03-13 Thread jaredtims
I am trying to process events from a flume avro sink, but I keep getting this same error. I am just running it locally using flume's avro-client, with the following commands to start the job and client. It seems like it should be a configuration problem since it's a NoSuchMethodError, but everythi

Re: spark sql writing in avro

2015-03-13 Thread Michael Armbrust
BTW, I'll add that we are hoping to publish a new version of the Avro library for Spark 1.3 shortly. It should have improved support for writing data both programmatically and from SQL. On Fri, Mar 13, 2015 at 2:01 PM, Kevin Peng wrote: > Markus, > > Thanks. That makes sense. I was able to ge

Partitioning

2015-03-13 Thread Mohit Anchlia
I am trying to look for documentation on partitioning, which I can't seem to find. I am looking at Spark Streaming and was wondering how it partitions RDDs in a multi-node environment. Where are the keys defined that are used for partitioning? For instance in the below example keys seem to be impli

Re: Joining data using Latitude, Longitude

2015-03-13 Thread Ankur Srivastava
Hi Everyone, Thank you for your suggestions, based on that I was able to move forward. I am now generating a Geohash for all the lats and lons in our reference data and then creating a trie of all the Geohash's I am then broadcasting that trie and then using it to search the nearest Geohash for

LogisticRegressionWithLBFGS shows ERRORs

2015-03-13 Thread cjwang
I am running LogisticRegressionWithLBFGS. I got these lines on my console: 2015-03-12 17:38:03,897 ERROR breeze.optimize.StrongWolfeLineSearch | Encountered bad values in function evaluation. Decreasing step size to 0.5 2015-03-12 17:38:03,967 ERROR breeze.optimize.StrongWolfeLineSearch | Encount

Unable to connect

2015-03-13 Thread Mohit Anchlia
I am running spark streaming standalone in ec2 and I am trying to run wordcount example from my desktop. The program is unable to connect to the master, in the logs I see, which seems to be an issue with hostname. 15/03/13 17:37:44 ERROR EndpointWriter: dropping message [class akka.actor.ActorSele

Re: Unable to connect

2015-03-13 Thread Tathagata Das
Are you running the driver program (that is your application process) in your desktop and trying to run it on the cluster in EC2? It could very well be a hostname mismatch in some way due to the all the public hostname, private hostname, private ip, firewall craziness of ec2. You have to probably t

Re: Partitioning

2015-03-13 Thread Tathagata Das
If you want to access the keys in an RDD that is partitioned by key, then you can use RDD.mapPartition(), which gives you access to the whole partition as an iterator. You have the option of maintaining the partitioning information or not by setting the preservePartitioning flag in mapPartition (see do
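
A minimal sketch of that flag, with hypothetical data and a spark-shell sc: keep the hash partitioning while transforming values inside each partition.

    import org.apache.spark.HashPartitioner
    import org.apache.spark.SparkContext._   // pair RDD implicits on older Spark versions

    val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3))).partitionBy(new HashPartitioner(4))
    val doubled = pairs.mapPartitions(
      iter => iter.map { case (k, v) => (k, v * 2) },
      preservesPartitioning = true)   // keys are untouched, so the partitioner still holds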

Re: Using rdd methods with Dstream

2015-03-13 Thread Tathagata Das
Is the number of top K elements you want to keep small? That is, is K small? In which case, you can 1. either do it in the driver on the array DStream.foreachRDD ( rdd => { val topK = rdd.top(K) ; // use top K }) 2. Or, you can use the topK to create another RDD using sc.makeRDD DStream.tr
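
A sketch of option 1 above, assuming K is small and counts is the DStream of (value, count) pairs from the earlier message; top(K) runs distributed and only the K winners come back to the driver, so no repartition(1) is needed.

    val K = 5
    counts.foreachRDD { rdd =>
      val topK = rdd.top(K)(Ordering.by(_._2))   // largest counts in this window
      topK.foreach(println)                      // or save/publish the small array
    }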

Spark on HDFS vs. Lustre vs. other file systems - formal research and performance evaluation

2015-03-13 Thread Edmon Begoli
All, Does anyone have any reference to a publication or other, informal sources (blogs, notes), showing performance of Spark on HDFS vs. other shared (Lustre, etc.) or other file system (NFS). I need this for formal performance research. We are currently doing a research into this on a very spec

Re: Partitioning

2015-03-13 Thread Mohit Anchlia
I still don't follow how Spark is partitioning data in a multi-node environment. Is there a document on how Spark does partitioning of data? For example, in the word count example, how is Spark distributing words to multiple nodes? On Fri, Mar 13, 2015 at 3:01 PM, Tathagata Das wrote: > If you want to access the k

org.apache.spark.SparkException Error sending message

2015-03-13 Thread Chen Song
When I ran Spark SQL query (a simple group by query) via hive support, I have seen lots of failures in map phase. I am not sure if that is specific to Spark SQL or general. Any one has seen such errors before? java.io.IOException: org.apache.spark.SparkException: Error sending message [message =

Spark will process _temporary folder on S3 is very slow and always cause failure

2015-03-13 Thread Shuai Zheng
Hi All, I try to run a sorting on a r3.2xlarge instance on AWS. I just try to run it as a single node cluster for test. The data I use to sort is around 4GB and sit on S3, output will also on S3. I just connect spark-shell to the local cluster and run the code in the script (because I just

Re: Partitioning

2015-03-13 Thread Gerard Maas
In spark-streaming, the consumers will fetch data and put it into 'blocks'. Each block becomes a partition of the rdd generated during that batch interval. The size of each block is controlled by the conf: 'spark.streaming.blockInterval'. That is, the amount of data the consumer can collect in that
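
A minimal sketch of that knob (the value is in milliseconds on Spark 1.x); roughly, partitions per batch per receiver is the batch interval divided by the block interval.

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .setAppName("streaming-partitioning")
      .set("spark.streaming.blockInterval", "100")   // default 200 ms; smaller => more blocks/partitions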

RE: Spark will process _temporary folder on S3 is very slow and always cause failure

2015-03-13 Thread Shuai Zheng
And one thing I forgot to mention: even though I have this exception and the result is not well formatted in my target folder (part of it is there, the rest is under a different folder structure of the _temporary folder), in the webUI of spark-shell it is still marked as a successful step. I think this is a bug?

Need Advice about reading lots of text files

2015-03-13 Thread Pat Ferrel
We have many text files that we need to read in parallel. We can create a comma delimited list of files to pass in to sparkContext.textFile(fileList). The list can get very large (maybe 1) and is all on hdfs. The question is: what is the most performant way to read them? Should they be bro
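
For reference, a small sketch of the approach described above (paths are hypothetical): sc.textFile accepts a comma-separated list of paths, and each file contributes one or more partitions.

    val files = Seq("hdfs:///data/a.txt", "hdfs:///data/b.txt", "hdfs:///data/c.txt")
    val lines = sc.textFile(files.mkString(","))   // one RDD over all listed files
    println(lines.partitions.length)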

Spark SQL 1.3 max operation giving wrong results

2015-03-13 Thread gtinside
Hi , I am playing around with Spark SQL 1.3 and noticed that "max" function does not give the correct result i.e doesn't give the maximum value. The same query works fine in Spark SQL 1.2 . Is any one aware of this issue ? Regards, Gaurav -- View this message in context: http://apache-spark-

Loading in json with spark sql

2015-03-13 Thread kpeng1
Hi All, I was noodling around with loading in a json file into spark sql's hive context and I noticed that I get the following message after loading in the json file: PhysicalRDD [_corrupt_record#0], MappedRDD[5] at map at JsonRDD.scala:47 I am using the HiveContext to load in the json file using

building all modules in spark by mvn

2015-03-13 Thread sequoiadb
guys, is there any easier way to build all modules by mvn ? right now if I run “mvn package” in spark root directory I got: [INFO] Reactor Summary: [INFO] [INFO] Spark Project Parent POM ... SUCCESS [ 8.327 s] [INFO] Spark Project Networking ... SKI

Re: building all modules in spark by mvn

2015-03-13 Thread Ted Yu
Please try the following command: mvn -DskipTests -Phadoop-2.4 -Pyarn -Phive clean package Cheers On Fri, Mar 13, 2015 at 5:57 PM, sequoiadb wrote: > guys, is there any easier way to build all modules by mvn ? > right now if I run “mvn package” in spark root directory I got: > [INFO] Reactor

Re: Partitioning

2015-03-13 Thread Tathagata Das
If you want to learn about how Spark partitions the data based on keys, here is a recent talk about that http://www.slideshare.net/databricks/strata-sj-everyday-im-shuffling-tips-for-writing-better-spark-programs?related=1 Of course you can read the original Spark paper https://www.cs.berkeley.edu

Upgrade from Spark 1.1.0 to 1.1.1+ Issues

2015-03-13 Thread EH
Hi all, I've been using Spark 1.1.0 for a while, and now would like to upgrade to Spark 1.1.1 or above. However, it throws the following errors: 18:05:31.522 [sparkDriver-akka.actor.default-dispatcher-3hread] ERROR TaskSchedulerImpl - Lost executor 37 on hcompute001: remote Akka client disassoci

Re: LogisticRegressionWithLBFGS shows ERRORs

2015-03-13 Thread Zhan Zhang
It is during function evaluation in the line search; the value is either infinite or NaN, which may be caused by too large a step size. In the code, the step is reduced to half. Thanks. Zhan Zhang On Mar 13, 2015, at 2:41 PM, cjwang wrote: > I am running LogisticRegressionWithLBFGS. I got these

spark there is no space on the disk

2015-03-13 Thread Peng Xia
Hi I was running a logistic regression algorithm on an 8-node spark cluster; each node has 8 cores and 56 GB RAM (each node is running a Windows system), and the spark installation drive has 1.9 TB capacity. The dataset I was training on has around 40 million records with around 6600 feature

How does Spark honor data locality when allocating computing resources for an application

2015-03-13 Thread bit1...@163.com
Hi, sparkers, When I read the code about computing resource allocation for the newly submitted application in the Master#schedule method, I got a question about data locality: // Pack each app into as few nodes as possible until we've assigned all its cores for (worker <- workers if worker.c

Re: RE: Explanation on the Hive in the Spark assembly

2015-03-13 Thread bit1...@163.com
Thanks Daoyuan. What do you mean by running some native commands? I never thought that Hive would run without a computing engine like Hadoop MR or Spark. Thanks. bit1...@163.com From: Wang, Daoyuan Date: 2015-03-13 16:39 To: bit1...@163.com; user Subject: RE: Explanation on the Hive in the Sp

Re: Problem connecting to HBase

2015-03-13 Thread fightf...@163.com
Hi, there You may want to check your HBase config, e.g. the zookeeper.znode.parent property may need to be changed from /hbase to /hbase-unsecure fightf...@163.com From: HARIPRIYA AYYALASOMAYAJULA Date: 2015-03-14 10:47 To: user Subject: Problem connecting to HBase Hello,

Re: Problem connecting to HBase

2015-03-13 Thread Ted Yu
In HBaseTest.scala: val conf = HBaseConfiguration.create() You can add some log (for zookeeper.znode.parent, e.g.) to see if the values from hbase-site.xml are picked up correctly. Please use pastebin next time you want to post errors. Which Spark release are you using ? I assume it contains
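
A minimal sketch of the check suggested above: print the effective values to confirm that hbase-site.xml is on the classpath and actually picked up.

    import org.apache.hadoop.hbase.HBaseConfiguration

    val hbaseConf = HBaseConfiguration.create()
    println("zookeeper.znode.parent = " + hbaseConf.get("zookeeper.znode.parent"))
    println("hbase.zookeeper.quorum = " + hbaseConf.get("hbase.zookeeper.quorum"))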

Re: Loading in json with spark sql

2015-03-13 Thread Yin Huai
Seems you want to use array for the field of "providers", like "providers":[{"id": ...}, {"id":...}] instead of "providers":{{"id": ...}, {"id":...}} On Fri, Mar 13, 2015 at 7:45 PM, kpeng1 wrote: > Hi All, > > I was noodling around with loading in a json file into spark sql's hive > context and

Re: Loading in json with spark sql

2015-03-13 Thread Kevin Peng
Yin, Yup thanks. I fixed that shortly after I posted and it worked. Thanks, Kevin On Fri, Mar 13, 2015 at 8:28 PM, Yin Huai wrote: > Seems you want to use array for the field of "providers", like > "providers":[{"id": > ...}, {"id":...}] instead of "providers":{{"id": ...}, {"id":...}} > >

Aggregation of distributed datasets

2015-03-13 Thread raggy
I am a PhD student trying to understand the internals of Spark, so that I can make some modifications to it. I am trying to understand how the aggregation of the distributed datasets(through the network) onto the driver node works. I would very much appreciate it if someone could point me towards t

Re: serialization stackoverflow error during reduce on nested objects

2015-03-13 Thread Ted Yu
Have you registered your class with kryo ? See core/src/main/scala/org/apache/spark/serializer/KryoSerializer.scala and core/src/test/scala/org/apache/spark/SparkConfSuite.scala On Fri, Mar 13, 2015 at 10:52 AM, ilaxes wrote: > Hi, > > I'm working on a RDD of a tuple of objects which represent
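
A minimal sketch of Kryo registration as suggested above; Node here is a hypothetical stand-in for the user's tree class. Note that a deeply recursive structure can still overflow the serialization stack, so a larger thread stack size (e.g. -Xss) may be needed as well.

    import org.apache.spark.SparkConf

    // Hypothetical stand-in for the tree type: a node holding a map of child nodes.
    class Node(val children: java.util.HashMap[String, Node]) extends Serializable

    val conf = new SparkConf()
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .registerKryoClasses(Array(classOf[Node], classOf[java.util.HashMap[String, Node]]))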
