Re: Linkage error - duplicate class definition

2015-01-20 Thread Hafiz Mujadid
Have you solved this problem? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Linkage-error-duplicate-class-definition-tp9482p21260.html Sent from the Apache Spark User List mailing list archive at Nabble.com. --

IF statement doesn't work in Spark-SQL?

2015-01-20 Thread Xuelin Cao
Hi, I'm trying to migrate some Hive scripts to Spark-SQL. However, I found that some statements are incompatible with Spark-SQL. Here is my SQL, and the same SQL works fine in the Hive environment. SELECT if(ad_user_id>1000, 1000, ad_user_id) as user_id FROM ad_search_keywor

RE: IF statement doesn't work in Spark-SQL?

2015-01-20 Thread Wang, Daoyuan
Hi Xuelin, What version of Spark are you using? Thanks, Daoyuan From: Xuelin Cao [mailto:xuelincao2...@gmail.com] Sent: Tuesday, January 20, 2015 5:22 PM To: User Subject: IF statement doesn't work in Spark-SQL? Hi, I'm trying to migrate some hive scripts to Spark-SQL. However, I found

Re: Does Spark automatically run different stages concurrently when possible?

2015-01-20 Thread Sean Owen
You can persist the RDD in (2) right after it is created. It will not cause it to be persisted immediately, but rather the first time it is materialized. If you persist after (3) is calculated, then it will be re-calculated (and persisted) after (4) is calculated. On Tue, Jan 20, 2015 at 3:38 AM,
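A minimal spark-shell sketch of the ordering described above (the input path and parsing lambda are made up for illustration):

    val base = sc.textFile("hdfs:///tmp/input")                     // (1) source RDD
    val pairs = base.map(l => (l.split(",")(0), 1)).persist()       // (2) mark for caching right away
    val countsA = pairs.reduceByKey(_ + _).count()                  // (3) first action materializes and caches `pairs`
    val countsB = pairs.mapValues(_ * 2).count()                    // (4) reuses the cached copy, no recomputation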

Re: IF statement doesn't work in Spark-SQL?

2015-01-20 Thread Xuelin Cao
Hi, I'm using Spark 1.2 On Tue, Jan 20, 2015 at 5:59 PM, Wang, Daoyuan wrote: > Hi Xuelin, > > > > What version of Spark are you using? > > > > Thanks, > > Daoyuan > > > > *From:* Xuelin Cao [mailto:xuelincao2...@gmail.com] > *Sent:* Tuesday, January 20, 2015 5:22 PM > *To:* User > *Subject:*

Re: Does Spark automatically run different stages concurrently when possible?

2015-01-20 Thread Ashish
Thanks Sean ! On Tue, Jan 20, 2015 at 3:32 PM, Sean Owen wrote: > You can persist the RDD in (2) right after it is created. It will not > cause it to be persisted immediately, but rather the first time it is > materialized. If you persist after (3) is calculated, then it will be > re-calculated (

Spark on YARN: java.lang.ClassCastException SerializedLambda to org.apache.spark.api.java.function.Function in instance of org.apache.spark.api.java.JavaPairRDD$$anonfun$toScalaFunction$1

2015-01-20 Thread thanhtien522
Hi, I try to run Spark on a YARN cluster by setting the master as yarn-client in Java code. It works fine with the count task but not with other commands. It threw a ClassCastException: java.lang.ClassCastException: cannot assign instance of java.lang.invoke.SerializedLambda to field org.apache.spark.api.j

spark streaming kinesis issue

2015-01-20 Thread Hafiz Mujadid
Hi experts! I am using spark streaming with kinesis and getting this exception while running program java.lang.LinkageError: loader (instance of org/apache/spark/executor/ChildExecutorURLClassLoader$userClassLoader$): attempted duplicate class definition for name: "com/amazonaws/services/kine

Re: IF statement doesn't work in Spark-SQL?

2015-01-20 Thread DEVAN M.S.
Which context are you using HiveContext or SQLContext ? Can you try with HiveContext ?? Devan M.S. | Research Associate | Cyber Security | AMRITA VISHWA VIDYAPEETHAM | Amritapuri | Cell +919946535290 | On Tue, Jan 20, 2015 at 3:49 PM, Xuelin Cao wrote: > > Hi, I'm using Spark 1.2 > > > On Tue

spark streaming with checkpoint

2015-01-20 Thread balu.naren
I am a beginner to spark streaming, so I have a basic doubt regarding checkpoints. My use case is to calculate the number of unique users per day. I am using reduce by key and window for this, where my window duration is 24 hours and slide duration is 5 mins. I am writing the processed records to mongodb.

Re: IF statement doesn't work in Spark-SQL?

2015-01-20 Thread DEVAN M.S.
Add one more library: libraryDependencies += "org.apache.spark" % "spark-hive_2.10" % "1.2.0" val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc) Replace sqlContext with hiveContext. It's working for me when using HiveContext. Devan M.S. | Research Associate | Cyber Security | AMR
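A hedged sketch of this suggestion for Spark 1.2; the table and column names below are placeholders, not the ones from the original query:

    import org.apache.spark.sql.hive.HiveContext

    val hiveContext = new HiveContext(sc)
    // Hive generic UDFs, including if(), resolve under HiveContext
    val result = hiveContext.sql(
      "SELECT if(user_id > 1000, 1000, user_id) AS capped_user_id FROM some_hive_table")
    result.take(10).foreach(println)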

Re: IF statement doesn't work in Spark-SQL?

2015-01-20 Thread Xuelin Cao
Hi, Yes, this is what I'm doing. I'm using hiveContext.hql() to run my query. But, the problem still happens. On Tue, Jan 20, 2015 at 7:24 PM, DEVAN M.S. wrote: > Add one more library > > libraryDependencies += "org.apache.spark" % "spark-hive_2.10" % "1.2.0" > > > val hiveContext

Scala Spark SQL row object Ordinal Method Call Aliasing

2015-01-20 Thread Night Wolf
In Spark SQL we have Row objects which contain a list of fields that make up a row. A Row has ordinal accessors such as .getInt(0) or getString(2). Say ordinal 0 = ID and ordinal 1 = Name. It becomes hard to remember what ordinal is what, making the code confusing. Say for example I have the follo

Can multiple streaming apps use the same checkpoint directory?

2015-01-20 Thread Ashic Mahtab
Hi, For client mode spark submits of applications, we can do the following: def createStreamingContext() = { ... val sc = new SparkContext(conf) // Create a StreamingContext with a 1 second batch size val ssc = new StreamingContext(sc, Seconds(1)) } ... val ssc = StreamingContext.getOrCreate(ch
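For reference, a hedged sketch of the full getOrCreate pattern; the checkpoint directory is a placeholder, and in general each streaming application should get its own directory:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val checkpointDir = "hdfs:///tmp/streaming-checkpoint/app-1"    // one directory per application

    def createStreamingContext(): StreamingContext = {
      val conf = new SparkConf().setAppName("checkpointed-app")
      val ssc = new StreamingContext(conf, Seconds(1))
      ssc.checkpoint(checkpointDir)
      ssc                                                           // define DStreams here before returning
    }

    val ssc = StreamingContext.getOrCreate(checkpointDir, createStreamingContext _)
    ssc.start()
    ssc.awaitTermination()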

Re: IF statement doesn't work in Spark-SQL?

2015-01-20 Thread DEVAN M.S.
Can you share your code ? Devan M.S. | Research Associate | Cyber Security | AMRITA VISHWA VIDYAPEETHAM | Amritapuri | Cell +919946535290 | On Tue, Jan 20, 2015 at 5:03 PM, Xuelin Cao wrote: > > Hi, > > Yes, this is what I'm doing. I'm using hiveContext.hql() to run my > query. > >

RE: Does Spark automatically run different stages concurrently when possible?

2015-01-20 Thread Bob Tiernay
I found the following to be a good discussion of the same topic: http://apache-spark-user-list.1001560.n3.nabble.com/The-concurrent-model-of-spark-job-stage-task-td13083.html > From: so...@cloudera.com > Date: Tue, 20 Jan 2015 10:02:20 + > Subject: Re: Does Spark automatically run different

Re: Issues with constants in Spark HiveQL queries

2015-01-20 Thread yana
I run Spark 1.2 and do not have this issue. I don't believe the Hive version would matter (I run Spark 1.2 with the Hive 12 profile) but that would be a good test. The last version I tried for you was a CDH 4.2 Spark 1.2 prebuilt without pointing to an external Hive install (in fact I tried it on a machine

Saving a mllib model in Spark SQL

2015-01-20 Thread Divyansh Jain
Hey people, I have run into some issues regarding saving the k-means mllib model in Spark SQL by converting to a schema RDD. This is what I am doing: case class Model(id: String, model: org.apache.spark.mllib.clustering.KMeansModel) import sqlContext.createSchemaRDD val rowRdd = sc.makeRD

Re: Spark Streaming checkpoint recovery causes IO re-execution

2015-01-20 Thread RodrigoB
Hi Hannes, Good to know I'm not alone on the boat. Sorry about not posting back, I haven't gone in a while onto the user list. It's on my agenda to get over this issue. Will be very important for our recovery implementation. I have done an internal proof of concept but without any conclusions so

RE: spark streaming with checkpoint

2015-01-20 Thread Shao, Saisai
Hi, Since you have such a large window (24 hours), the phenomenon of increasing memory is to be expected, because the window operation will cache the RDDs within this window in memory. So for your requirement, memory should be enough to hold the data of 24 hours. I don't think checkpoint in Spark
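A hedged sketch of the kind of job under discussion (ssc and the socket source are stand-ins); with a 24-hour window and a 5-minute slide, the windowed RDDs have to stay cached, which is where the memory goes:

    import org.apache.spark.streaming.Minutes

    // `ssc` is an existing StreamingContext; the socket source is only for illustration
    val users = ssc.socketTextStream("localhost", 9999).map(userId => (userId, 1))

    val countsPerUser = users.reduceByKeyAndWindow(
      (a: Int, b: Int) => a + b,    // combine counts entering the window
      Minutes(24 * 60),             // window duration: 24 hours
      Minutes(5))                   // slide duration: 5 minutes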

getting started writing unit tests for my app

2015-01-20 Thread Matthew Cornell
Hi Folks, I'm writing a GraphX app and I need to do some test-driven development. I've got Spark running on our little cluster and have built and run some hello world apps, so that's all good. I've looked through the test source and found lots of helpful examples that use SharedSparkContext, and

spark-submit --py-files remote: "Only local additional python files are supported"

2015-01-20 Thread Vladimir Grigor
Hi all! I found this problem when I tried running python application on Amazon's EMR yarn cluster. It is possible to run bundled example applications on EMR but I cannot figure out how to run a little bit more complex python application which depends on some other python scripts. I tried adding t

Re: Error for first run from iPython Notebook

2015-01-20 Thread Dave
Not sure if anyone who can help has seen this. Any suggestions would be appreciated, thanks! On Mon Jan 19 2015 at 1:50:43 PM Dave wrote: > Hi, > > I've setup my first spark cluster (1 master, 2 workers) and an iPython > notebook server that I'm trying to setup to access the cluster. I'm running

Re: Scala Spark SQL row object Ordinal Method Call Aliasing

2015-01-20 Thread Sunita Arvind
The below is not exactly a solution to your question but this is what we are doing. For the first time we do end up doing row.getstring() and we immediately parse it through a map function which aligns it to either a case class or a structType. Then we register it as a table and use just column nam
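A hedged Spark 1.2-style sketch of that approach, with made-up table and column names:

    case class Person(id: Int, name: String)

    val sqlContext = new org.apache.spark.sql.SQLContext(sc)
    import sqlContext.createSchemaRDD

    // parse the ordinals exactly once, right after the initial query
    val people = sqlContext.sql("SELECT id, name FROM raw_people")
      .map(row => Person(row.getInt(0), row.getString(1)))

    people.registerTempTable("people")
    sqlContext.sql("SELECT name FROM people WHERE id > 10")         // column names only from here on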

Re: Why custom parquet format hive table execute "ParquetTableScan" physical plan, not "HiveTableScan"?

2015-01-20 Thread Yana Kadiyska
Hm, you might want to ask on the dev list if you don't get a good answer here. I'm also trying to decipher this part of the code as I'm having issues with predicate pushes. I can see (in master branch) that the SQL codepath (which is taken if you don't convert the metastore) C:\spark-master\sql\co

RE: Can I save RDD to local file system and then read it back on spark cluster with multiple nodes?

2015-01-20 Thread Wang, Ningjun (LNG-NPV)
Can anybody answer this? Do I have to have hdfs to achieve this? Regards, Ningjun Wang Consulting Software Engineer LexisNexis 121 Chanlon Road New Providence, NJ 07974-1541 From: Wang, Ningjun (LNG-NPV) [mailto:ningjun.w...@lexisnexis.com] Sent: Friday, January 16, 2015 1:15 PM To: Imran Rashid

Apply function to all elements along each key

2015-01-20 Thread Luis Guerra
Hi all, I would like to apply a function over all elements for each key (assuming key-value RDD). For instance, imagine I have: import numpy as np a = np.array([[1, 'hola', 'adios'],[2, 'hi', 'bye'],[2, 'hello', 'goodbye']]) a = sc.parallelize(a) Then I want to create a key-value RDD, using the
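The original question is in PySpark; here is a hedged Scala sketch of the same idea (group the values per key, then apply a function to each group):

    val data = sc.parallelize(Seq(
      (1, ("hola", "adios")),
      (2, ("hi", "bye")),
      (2, ("hello", "goodbye"))))

    // apply a function over all elements belonging to each key
    val perKey = data.groupByKey().mapValues(greetings => greetings.map(_._1).toList)
    perKey.collect().foreach(println)    // e.g. (1,List(hola)), (2,List(hi, hello))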

NPE in Parquet

2015-01-20 Thread Alessandro Baretta
All, I strongly suspect this might be caused by a glitch in the communication with Google Cloud Storage where my job is writing to, as this NPE exception shows up fairly randomly. Any ideas? Exception in thread "Thread-126" java.lang.NullPointerException at scala.collection.mutable.ArrayO

PySpark Client

2015-01-20 Thread Chris Beavers
Hey all, Is there any notion of a lightweight python client for submitting jobs to a Spark cluster remotely? If I essentially install Spark on the client machine, and that machine has the same OS, same version of Python, etc., then I'm able to communicate with the cluster just fine. But if Python

Re: NPE in Parquet

2015-01-20 Thread Iulian Dragoș
It’s an array.length, where the array is null. Looking through the code, it looks like the type converter assumes that FileSystem.globStatus never returns null, but that is not the case. Digging through the Hadoop codebase, inside Globber.glob, here’s what I found: /* * When the input pat

Aggregate order semantics when spilling

2015-01-20 Thread Justin Uang
Hi, I am trying to aggregate a key based on some timestamp, and I believe that spilling to disk is changing the order of the data fed into the combiner. I have some timeseries data that is of the form: ("key", "date", "other data") Partition 1 ("A", 2, ...) ("B", 4, ...) ("A", 1,

Re: How to output to S3 and keep the order

2015-01-20 Thread Anny Chen
Thanks Aniket! It is working now. Anny On Mon, Jan 19, 2015 at 5:56 PM, Aniket Bhatnagar < aniket.bhatna...@gmail.com> wrote: > When you repartiton, ordering can get lost. You would need to sort after > repartitioning. > > Aniket > > On Tue, Jan 20, 2015, 7:08 AM anny9699 wrote: > >> Hi, >> >>

Re: PySpark Client

2015-01-20 Thread Andrew Or
Hi Chris, Short answer is no, not yet. Longer answer is that PySpark only supports client mode, which means your driver runs on the same machine as your submission client. By corollary this means your submission client must currently depend on all of Spark and its dependencies. There is a patch t

Re: spark-submit --py-files remote: "Only local additional python files are supported"

2015-01-20 Thread Andrew Or
Hi Vladimir, Yes, as the error messages suggests, PySpark currently only supports local files. This does not mean it only runs in local mode, however; you can still run PySpark on any cluster manager (though only in client mode). All this means is that your python files must be on your local file

Re: Does Spark automatically run different stages concurrently when possible?

2015-01-20 Thread Kane Kim
Related question - is execution of different stages optimized? I.e., will a map followed by a filter require 2 loops, or will they be combined into a single one? On Tue, Jan 20, 2015 at 4:33 AM, Bob Tiernay wrote: > I found the following to be a good discussion of the same topic: > > http://apache-spar

Re: Aggregate order semantics when spilling

2015-01-20 Thread Andrew Or
Hi Justin, I believe the intended semantics of groupByKey or cogroup is that the ordering *within a key *is not preserved if you spill. In fact, the test cases for the ExternalAppendOnlyMap only assert that the Set representation of the results is as expected (see this line

dynamically change receiver for a spark stream

2015-01-20 Thread jamborta
Hi all, we have been trying to set up a stream using a custom receiver that would pick up data from SQL databases. We'd like to keep that stream context running and dynamically change the streams on demand, adding and removing streams based on demand. Alternatively, if a stream is fixed, is it possi

Re: Is there any way to support multiple users executing SQL on thrift server?

2015-01-20 Thread Cheng Lian
Hey Yi, I'm quite unfamiliar with Hadoop/HDFS auth mechanisms for now, but would like to investigate this issue later. Would you please open a JIRA for it? Thanks! Cheng On 1/19/15 1:00 AM, Yi Tian wrote: Is there any way to support multiple users executing SQL on one thrift server? I

Re: Why custom parquet format hive table execute "ParquetTableScan" physical plan, not "HiveTableScan"?

2015-01-20 Thread Cheng Lian
spark.sql.parquet.filterPushdown defaults to false because there's a bug in Parquet which may cause NPE, please refer to http://spark.apache.org/docs/latest/sql-programming-guide.html#configuration This bug hasn't been fixed in Parquet master. We'll turn this on once the bug is fixed. Ch
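For reference, a hedged sketch of how the flag can be toggled once the underlying Parquet bug is fixed (it defaults to false in 1.2):

    // programmatically, on a SQLContext or HiveContext
    sqlContext.setConf("spark.sql.parquet.filterPushdown", "true")

    // or when submitting the application:
    //   spark-submit --conf spark.sql.parquet.filterPushdown=true ...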

Re: Why custom parquet format hive table execute "ParquetTableScan" physical plan, not "HiveTableScan"?

2015-01-20 Thread Cheng Lian
In Spark SQL, Parquet filter pushdown doesn't cover HiveTableScan for now. May I ask why you prefer HiveTableScan rather than ParquetTableScan? Cheng On 1/19/15 5:02 PM, Xiaoyu Wang wrote: The spark.sql.parquet.filterPushdown=true has been turned on. But set spark.sql.hive.co

Re: dynamically change receiver for a spark stream

2015-01-20 Thread Akhil Das
Can you not do it with RDDs? Thanks Best Regards On Wed, Jan 21, 2015 at 12:38 AM, jamborta wrote: > Hi all, > > we have been trying to setup a stream using a custom receiver that would > pick up data from sql databases. we'd like to keep that stream context > running and dynamically change the

Re: Scala Spark SQL row object Ordinal Method Call Aliasing

2015-01-20 Thread Cheng Lian
I had once worked on a named row feature but haven't got time to finish it. It looks like this: sql("...").named.map { row: NamedRow => row[Int]('key) -> row[String]('value) } Basically the named method generates a field-name-to-ordinal map for each RDD partition. This map is then share

Re: Saving a mllib model in Spark SQL

2015-01-20 Thread Cheng Lian
This is because KMeansModel is neither a built-in type nor a user defined type recognized by Spark SQL. I think you can write your own UDT version of KMeansModel in this case. You may refer to o.a.s.mllib.linalg.Vector and o.a.s.mllib.linalg.VectorUDT as an example. Cheng On 1/20/15 5

Confused about shuffle read and shuffle write

2015-01-20 Thread Darin McBeath
I have the following code in a Spark Job. // Get the baseline input file(s) JavaPairRDD hsfBaselinePairRDDReadable = sc.hadoopFile(baselineInputBucketFile, SequenceFileInputFormat.class, Text.class, Text.class); JavaPairRDD hsfBaselinePairRDD = hsfBaselinePairRDDReadable.mapToPair(new ConvertFr

Re: IF statement doesn't work in Spark-SQL?

2015-01-20 Thread Cheng Lian
IF is implemented as a generic UDF in Hive (GenericUDFIf). It seems that this function can't be properly resolved. Could you provide a minimum code snippet that reproduces this issue? Cheng On 1/20/15 1:22 AM, Xuelin Cao wrote: Hi, I'm trying to migrate some hive scripts to Spark

Re: Scala Spark SQL row object Ordinal Method Call Aliasing

2015-01-20 Thread Michael Armbrust
I use extractors: sql("SELECT name, age FROM people").map { Row(name: String, age: Int) => ... } On Tue, Jan 20, 2015 at 6:48 AM, Sunita Arvind wrote: > The below is not exactly a solution to your question but this is what we > are doing. For the first time we do end up doing row.getstrin
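Spelled out a little more fully, a hedged sketch of the extractor style in Spark 1.2 (sqlContext and the people table are assumed to exist):

    import org.apache.spark.sql._    // brings Row into scope in Spark 1.2

    val namesAndAges = sqlContext.sql("SELECT name, age FROM people").map {
      case Row(name: String, age: Int) => (name, age)   // fields bound by position, named locally
    }
    namesAndAges.take(5).foreach(println)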

Re: Spark SQL: Assigning several aliases to the output (several return values) of an UDF

2015-01-20 Thread Cheng Lian
Guess this can be helpful: http://stackoverflow.com/questions/14252615/stack-function-in-hive-how-to-specify-multiple-aliases On 1/19/15 8:26 AM, mucks17 wrote: Hello I use Hive on Spark and have an issue with assigning several aliases to the output (several return values) of an UDF. I ran i

Re: MLib: How to set preferences for ALS implicit feedback in Collaborative Filtering?

2015-01-20 Thread Xiangrui Meng
The assumption of implicit feedback model is that the unobserved ratings are more likely to be negative. So you may want to add some negatives for evaluation. Otherwise, the input ratings are all 1 and the test ratings are all 1 as well. The baseline predictor, which uses the average rating (that i

Re: How to create distributed matrixes from hive tables.

2015-01-20 Thread Xiangrui Meng
You can get a SchemaRDD from the Hive table, map it into a RDD of Vectors, and then construct a RowMatrix. The transformations are lazy, so there is no external storage requirement for intermediate data. -Xiangrui On Sun, Jan 18, 2015 at 4:07 AM, guxiaobo1982 wrote: > Hi, > > We have large datase
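A hedged sketch of that pipeline, assuming a Hive table with only numeric columns (names are placeholders):

    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.linalg.distributed.RowMatrix

    val rows = hiveContext.sql("SELECT * FROM my_numeric_table")
      .map(row => Vectors.dense((0 until row.length).map(i => row.getDouble(i)).toArray))

    val mat = new RowMatrix(rows)                       // still lazy; nothing is materialized yet
    println(mat.numRows() + " x " + mat.numCols())      // actions trigger the computation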

Re: [SQL] Using HashPartitioner to distribute by column

2015-01-20 Thread Cheng Lian
First of all, even if the underlying dataset is partitioned as expected, a shuffle can't be avoided, because Spark SQL knows nothing about the underlying data distribution. However, this does reduce network IO. You can prepare your data like this (say CustomerCode is a string field with ordi

Re: Saving a mllib model in Spark SQL

2015-01-20 Thread Xiangrui Meng
You can save the cluster centers as a SchemaRDD of two columns (id: Int, center: Array[Double]). When you load it back, you can construct the k-means model from its cluster centers. -Xiangrui On Tue, Jan 20, 2015 at 11:55 AM, Cheng Lian wrote: > This is because KMeanModel is neither a built-in ty
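A hedged sketch of the save side of that suggestion (the data, k, and output path are made up); loading is the reverse mapping followed by new KMeansModel(centers):

    import org.apache.spark.mllib.clustering.KMeans
    import org.apache.spark.mllib.linalg.Vectors

    case class Center(id: Int, center: Array[Double])

    val points = sc.parallelize(Seq(Vectors.dense(0.0, 0.0), Vectors.dense(9.0, 9.0)))
    val model = KMeans.train(points, 2, 10)             // k = 2, maxIterations = 10

    import sqlContext.createSchemaRDD
    val centers = sc.parallelize(model.clusterCenters.zipWithIndex.map {
      case (v, i) => Center(i, v.toArray)
    })
    centers.saveAsParquetFile("/tmp/kmeans-centers.parquet")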

Re: SparkSQL schemaRDD & MapPartitions calls - performance issues - columnar formats?

2015-01-20 Thread Cheng Lian
On 1/15/15 11:26 PM, Nathan McCarthy wrote: Thanks Cheng! Is there any API I can get access to (e.g. ParquetTableScan) which would allow me to load up the low-level/base RDD of just RDD[Row] so I could avoid the defensive copy (maybe lose out on columnar storage etc.). We have parts of our

Re: Spark Sql reading whole table from cache instead of required coulmns

2015-01-20 Thread Cheng Lian
Hey Surbhit, In this case, the web UI stats is not accurate. Please refer to this thread for an explanation: https://www.mail-archive.com/user@spark.apache.org/msg18919.html Cheng On 1/13/15 1:46 AM, Surbhit wrote: Hi, I am using spark 1.1.0. I am using the spark-sql shell to run all the b

Re: Saving a mllib model in Spark SQL

2015-01-20 Thread Cheng Lian
Yeah, as Michael said, I forgot that UDT is not a public API. Xiangrui's suggestion makes more sense. Cheng On 1/20/15 12:49 PM, Xiangrui Meng wrote: You can save the cluster centers as a SchemaRDD of two columns (id: Int, center: Array[Double]). When you load it back, you can construct the k-

Re: Support for SQL on unions of tables (merge tables?)

2015-01-20 Thread Cheng Lian
I think you can resort to a Hive table partitioned by date https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-PartitionedTables On 1/11/15 9:51 PM, Paul Wais wrote: Dear List, What are common approaches for addressing over a union of tables / RDDs? E.g. su

Re: Aggregate order semantics when spilling

2015-01-20 Thread Justin Uang
Hi Andrew, Thanks for your response! For our use case, we aren't actually grouping, but rather updating running aggregates. I just picked grouping because it made the example easier to write out. However, when we merge combiners, the combiners have to have data that are adjacent to each other in t

Re: Error for first run from iPython Notebook

2015-01-20 Thread Felix C
+1. I can confirm this. It says collect fails in Py4J --- Original Message --- From: "Dave" Sent: January 20, 2015 6:49 AM To: user@spark.apache.org Subject: Re: Error for first run from iPython Notebook Not sure if anyone who can help has seen this. Any suggestions would be appreciated, thanks

RE: Can I save RDD to local file system and then read it back on spark cluster with multiple nodes?

2015-01-20 Thread Mohammed Guller
I don’t think it will work without HDFS. Mohammed From: Wang, Ningjun (LNG-NPV) [mailto:ningjun.w...@lexisnexis.com] Sent: Tuesday, January 20, 2015 7:55 AM To: Wang, Ningjun (LNG-NPV) Cc: user@spark.apache.org Subject: RE: Can I save RDD to local file system and then read it back on spark clust

Spark 1.2 - com/google/common/base/Preconditions java.lang.NoClassDefFoundErro

2015-01-20 Thread Shailesh Birari
Hello, I recently upgraded my setup from Spark 1.1 to Spark 1.2. My existing applications are working fine on ubuntu cluster. But, when I try to execute Spark MLlib application from Eclipse (Windows node) it gives java.lang.NoClassDefFoundError: com/google/common/base/Preconditions exception. Not

Re: Spark 1.2 - com/google/common/base/Preconditions java.lang.NoClassDefFoundErro

2015-01-20 Thread Sean Owen
Guava is shaded in Spark 1.2+. It looks like you are mixing versions of Spark then, with some that still refer to unshaded Guava. Make sure you are not packaging Spark with your app and that you don't have other versions lying around. On Tue, Jan 20, 2015 at 11:55 PM, Shailesh Birari wrote: > Hel

Re: Spark 1.2 - com/google/common/base/Preconditions java.lang.NoClassDefFoundErro

2015-01-20 Thread Ted Yu
Please also see this thread: http://search-hadoop.com/m/JW1q5De7pU1 On Tue, Jan 20, 2015 at 3:58 PM, Sean Owen wrote: > Guava is shaded in Spark 1.2+. It looks like you are mixing versions > of Spark then, with some that still refer to unshaded Guava. Make sure > you are not packaging Spark with

Re: Can I save RDD to local file system and then read it back on spark cluster with multiple nodes?

2015-01-20 Thread Davies Liu
If the dataset is not huge (a few GB), you can set up NFS instead of HDFS (which is much harder to set up): 1. export a directory on the master (or any node in the cluster) 2. mount it at the same path across all slaves 3. read/write from it via file:///path/to/mountpoint On Tue, Jan 20, 2015 at 7:
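A hedged sketch of step 3, assuming /mnt/shared is the same NFS mount point on every node:

    val rdd = sc.parallelize(1 to 100, 4)
    rdd.saveAsObjectFile("file:///mnt/shared/my-rdd")    // each executor writes its partitions to the shared mount

    val restored = sc.objectFile[Int]("file:///mnt/shared/my-rdd")
    println(restored.count())                            // 100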

Spark 1.1.0 - spark-submit failed

2015-01-20 Thread ey-chih chow
Hi, I issued the following command in a ec2 cluster launched using spark-ec2: ~/spark/bin/spark-submit --class com.crowdstar.cluster.etl.ParseAndClean --master spark://ec2-54-185-107-113.us-west-2.compute.amazonaws.com:7077 --deploy-mode cluster --total-executor-cores 4 file:///tmp/etl-admin/jar/

Re: Does Spark automatically run different stages concurrently when possible?

2015-01-20 Thread Mark Hamstra
A map followed by a filter will not be two stages, but rather one stage that pipelines the map and filter. > On Jan 20, 2015, at 10:26 AM, Kane Kim wrote: > > Related question - is execution of different stages optimized? I.e. > map followed by a filter will require 2 loops or they will be com

Re: Spark 1.1.0 - spark-submit failed

2015-01-20 Thread Ted Yu
Please check which netty jar(s) are on the classpath. NioWorkerPool(Executor workerExecutor, int workerCount) was added in netty 3.5.4 Cheers On Tue, Jan 20, 2015 at 4:15 PM, ey-chih chow wrote: > Hi, > > I issued the following command in a ec2 cluster launched using spark-ec2: > > ~/spark/bin

Python connector for spark-cassandra

2015-01-20 Thread Nishant Sinha
Hello everyone, Is there a Python connector for Spark and Cassandra, as there is for Java? I found a Java connector by DataStax on GitHub: https://github.com/datastax/spark-cassandra-connector I am looking for something similar in Python. Thanks

Re: Spark 1.2 - com/google/common/base/Preconditions java.lang.NoClassDefFoundErro

2015-01-20 Thread Shailesh Birari
Hello, I double checked the libraries. I am linking only with Spark 1.2. Along with Spark 1.2 jars I have Scala 2.10 jars and JRE 7 jars linked and nothing else. Thanks, Shailesh On Wed, Jan 21, 2015 at 12:58 PM, Sean Owen wrote: > Guava is shaded in Spark 1.2+. It looks like you are mixing

Re: S3 Bucket Access

2015-01-20 Thread bbailey
Hi sranga, Were you ever able to get authentication working with the temporary IAM credentials (id, secret, & token)? I am in the same situation and it would be great if we could document a solution so others can benefit from this Thanks! sranga wrote > Thanks Rishi. That is exactly what I am

Re: Spark 1.2 - com/google/common/base/Preconditions java.lang.NoClassDefFoundErro

2015-01-20 Thread Frank Austin Nothaft
Shailesh, To add, are you packaging Hadoop in your app? Hadoop will pull in Guava. Not sure if you are using Maven (or something else) to build, but if you can pull up your build's dependency tree, you will likely find com.google.guava being brought in by one of your dependencies. Regards, Frank Austin

What will happen if the Driver exits abnormally?

2015-01-20 Thread personal_email0
As title. Is there some mechanism to recover to make the job can be completed? Any comments will be very appreciated. Best Regards, Anzhsoft - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands,

MapType in spark-sql

2015-01-20 Thread Kevin Jung
Hi all, how can I add MapType and ArrayType to a schema when I create a StructType programmatically? val schema = StructType( schemaString.split(" ").map(fieldName => StructField(fieldName, StringType, true))) The above code from the Spark documentation works fine, but if I change StringType to MapType or Array

Re: [SparkSQL] Try2: Parquet predicate pushdown troubles

2015-01-20 Thread Cheng Lian
Hey Yana, Sorry for the late reply, missed this important thread somehow. And many thanks for reporting this. It turned out to be a bug — filter pushdown is only enabled when using client side metadata, which is not expected, because task side metadata code path is more performant. And I guess

Re: How to compute RDD[(String, Set[String])] that include large Set

2015-01-20 Thread jagaximo
Kevin (Sangwoo) Kim wrote > If keys are not too many, > You can do like this: > > val data = List( > ("A", Set(1,2,3)), > ("A", Set(1,2,4)), > ("B", Set(1,2,3)) > ) > val rdd = sc.parallelize(data) > rdd.persist() > > rdd.filter(_._1 == "A").flatMap(_._2).distinct.count > rdd.filter(_._1 =

Re: Spark 1.2 - com/google/common/base/Preconditions java.lang.NoClassDefFoundErro

2015-01-20 Thread Shailesh Birari
Hi Frank, It's a normal Eclipse project where I added the Scala and Spark libraries as user libraries. Though I am not attaching any Hadoop libraries, in my application code I have the following line: System.setProperty("hadoop.home.dir", "C:\\SB\\HadoopWin") This Hadoop home dir contains "winutils.ex

Re: Spark Streaming with Kafka

2015-01-20 Thread firemonk9
Hi, I am having similar issues. Have you found any resolution ? Thank you -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-with-Kafka-tp21222p21276.html Sent from the Apache Spark User List mailing list archive at Nabble.com. ---

Re: How to compute RDD[(String, Set[String])] that include large Set

2015-01-20 Thread Kevin (Sangwoo) Kim
Great to hear you got solution!! Cheers! Kevin On Wed Jan 21 2015 at 11:13:44 AM jagaximo wrote: > Kevin (Sangwoo) Kim wrote > > If keys are not too many, > > You can do like this: > > > > val data = List( > > ("A", Set(1,2,3)), > > ("A", Set(1,2,4)), > > ("B", Set(1,2,3)) > > ) > > val r

Re: MapType in spark-sql

2015-01-20 Thread Cheng Lian
You need to provide the key type and value type for a map type, the element type for an array type, and whether they contain null: StructType(Array( StructField("map_field", MapType(keyType = IntegerType, valueType = StringType, containsNull = true), nullable = true), StructField("array_field", ArrayType(elemen
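A fuller, hedged Spark 1.2-style sketch (field names are placeholders; in 1.2 the type classes come in with import org.apache.spark.sql._):

    import org.apache.spark.sql._

    val schema = StructType(Array(
      StructField("name", StringType, nullable = true),
      StructField("tags", ArrayType(StringType, containsNull = true), nullable = true),
      StructField("attrs", MapType(StringType, IntegerType, valueContainsNull = true), nullable = true)))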

Re: Spark 1.2 - com/google/common/base/Preconditions java.lang.NoClassDefFoundErro

2015-01-20 Thread Aaron Davidson
Spark's network-common package depends on guava as a "provided" dependency in order to avoid conflicting with other libraries (e.g., Hadoop) that depend on specific versions. com/google/common/base/Preconditions has been present in Guava since version 2, so this is likely a "dependency not found" r

Task result deserialization error (1.1.0)

2015-01-20 Thread Dmitriy Lyubimov
Hi, I am getting task result deserialization error (kryo is enabled). Is it some sort of `chill` registration issue at front end? This is application that lists spark as maven dependency (so it gets correct hadoop and chill dependencies in classpath, i checked). Thanks in advance. 15/01/20 18:2

RE: Spark 1.2 - com/google/common/base/Preconditions java.lang.NoClassDefFoundErro

2015-01-20 Thread Bob Tiernay
If using Maven, one can simply use whatever version they prefer and shade it into the artifact at build time using something like the maven-shade-plugin (org.apache.maven.plugins : maven-shade-plugin, bound to the package phase with the shade goal).

Re: Spark 1.2 - com/google/common/base/Preconditions java.lang.NoClassDefFoundErro

2015-01-20 Thread Shailesh Birari
Thanks Aaron. Adding Guava jar resolves the issue. Shailesh On Wed, Jan 21, 2015 at 3:26 PM, Aaron Davidson wrote: > Spark's network-common package depends on guava as a "provided" dependency > in order to avoid conflicting with other libraries (e.g., Hadoop) that > depend on specific versions

RE: dynamically change receiver for a spark stream

2015-01-20 Thread Shao, Saisai
Hi, I don't think current Spark Streaming support this feature, all the DStream lineage is fixed after the context is started. Also stopping a stream is not supported, instead currently we need to stop the whole streaming context to meet what you want. Thanks Saisai -Original Message-

Re: pyspark sc.textFile uses only 4 out of 32 threads per node

2015-01-20 Thread Nicholas Chammas
Are the gz files roughly equal in size? Do you know that your partitions are roughly balanced? Perhaps some cores get assigned tasks that end very quickly, while others get most of the work. On Sat Jan 17 2015 at 2:02:49 AM Gautham Anil wrote: > Hi, > > Thanks for getting back to me. Sorry for t

Spark 1.1 (slow, working), Spark 1.2 (fast, freezing)

2015-01-20 Thread TJ Klein
Hi, I just recently tried to migrate from Spark 1.1 to Spark 1.2 - using PySpark. Initially, I was super glad, noticing that Spark 1.2 is way faster than Spark 1.1. However, the initial joy faded quickly when I noticed that all my stuff didn't successfully terminate operations anymore. Using Spark

RangePartitioner

2015-01-20 Thread Rishi Yadav
I am joining two tables as below; the program stalls at the below log line and never proceeds. What might be the issue and a possible solution? >>> INFO SparkContext: Starting job: RangePartitioner at Exchange.scala:79 Table 1 has 450 columns. Table 2 has 100 columns. Both tables have a few million r

Fwd: [Spark Streaming] The FileInputDStream newFilesOnly=false does not work in 1.2 since

2015-01-20 Thread Terry Hole
Hi, I am trying to move from 1.1 to 1.2 and found that newFilesOnly=false (intended to include old files) does not work anymore. It works great in 1.1; this must have been introduced by the last change to this class. Did this flag's behavior change, or is it a regression? The issue should be caused by

KNN for large data set

2015-01-20 Thread DEVAN M.S.
Hi all, Please help me find the best way to do K-nearest neighbors using Spark for large data sets.

Re: How to share a NonSerializable variable among tasks in the same worker node?

2015-01-20 Thread Fengyun RAO
We are currently migrating from 1.1 to 1.2 and found our program 3x slower, maybe due to the singleton hack? Could you explain in detail why or how "The singleton hack works very different in spark 1.2.0"? Thanks! 2015-01-18 20:56 GMT+08:00 octavian.ganea : > The singleton hack works very different

Re: Spark 1.1 (slow, working), Spark 1.2 (fast, freezing)

2015-01-20 Thread Davies Liu
Could you provide a short script to reproduce this issue? On Tue, Jan 20, 2015 at 9:00 PM, TJ Klein wrote: > Hi, > > I just recently tried to migrate from Spark 1.1 to Spark 1.2 - using > PySpark. Initially, I was super glad, noticing that Spark 1.2 is way faster > than Spark 1.1. However, the in

Re: [Spark Streaming] The FileInputDStream newFilesOnly=false does not work in 1.2 since

2015-01-20 Thread Sean Owen
See also SPARK-3276 and SPARK-3553. Can you say more about the problem? what are the file timestamps, what happens when you run, what log messages if any are relevant. I do not expect there was any intended behavior change. On Wed, Jan 21, 2015 at 5:17 AM, Terry Hole wrote: > Hi, > > I am trying

spark 1.2 three times slower than spark 1.1, why?

2015-01-20 Thread Fengyun RAO
Currently we are migrating from Spark 1.1 to Spark 1.2, but found the program 3x slower, with nothing else changed. Note: our program on Spark 1.1 has successfully processed a whole year of data, quite stably. The main script is as below: sc.textFile(inputPath) .flatMap(line => LogParser.parseLine(li

Re: spark 1.2 three times slower than spark 1.1, why?

2015-01-20 Thread Davies Liu
Maybe some change related to serialize the closure cause LogParser is not a singleton any more, then it is initialized for every task. Could you change it to a Broadcast? On Tue, Jan 20, 2015 at 10:39 PM, Fengyun RAO wrote: > Currently we are migrating from spark 1.1 to spark 1.2, but found the

Re: Spark 1.1 (slow, working), Spark 1.2 (fast, freezing)

2015-01-20 Thread Tassilo Klein
Hi, It's a bit of a longer script that runs some deep learning training. Therefore it is a bit hard to wrap up easily. Essentially I am having a loop, in which a gradient is computed on each node and collected (this is where it freezes at some point). grads = zipped_trainData.map(distributed_gr

Re: spark 1.2 three times slower than spark 1.1, why?

2015-01-20 Thread Fengyun RAO
The LogParser instance is not serializable, and thus cannot be a broadcast. What's worse, it contains an LRU cache, which is essential to the performance, and we would like to share it among all the tasks on the same node. If that is the case, what's the recommended way to share a variable among all t
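A hedged sketch of the per-JVM singleton approach being referred to; LogParser stands in for the user's class and is assumed to have a no-arg constructor and a parseLine method, as in the posted script:

    // One instance per executor JVM, shared by all tasks running in that JVM,
    // so the LRU cache inside LogParser is shared on the node as well.
    object LogParserHolder {
      lazy val parser = new LogParser()
    }

    val parsed = sc.textFile(inputPath)
      .flatMap(line => LogParserHolder.parser.parseLine(line))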

Re: Spark 1.1 (slow, working), Spark 1.2 (fast, freezing)

2015-01-20 Thread Davies Liu
Could you try to disable the new feature of reused worker by: spark.python.worker.reuse = false On Tue, Jan 20, 2015 at 11:12 PM, Tassilo Klein wrote: > Hi, > > It's a bit of a longer script that runs some deep learning training. > Therefore it is a bit hard to wrap up easily. > > Essentially I a

Closing over a var with changing value in Streaming application

2015-01-20 Thread Tobias Pfeiffer
Hi, I am developing a Spark Streaming application where I want every item in my stream to be assigned a unique, strictly increasing Long. My input data already has RDD-local integers (from 0 to N-1) assigned, so I am doing the following: var totalNumberOfItems = 0L // update the keys of the s

Re: spark 1.2 three times slower than spark 1.1, why?

2015-01-20 Thread Sean Owen
I don't know of any reason to think the singleton pattern doesn't work or works differently. I wonder if, for example, task scheduling is different in 1.2 and you have more partitions across more workers and so are loading more copies more slowly into your singletons. On Jan 21, 2015 7:13 AM, "Feng

Re: Closing over a var with changing value in Streaming application

2015-01-20 Thread Akhil Das
How about using accumulators ? Thanks Best Regards On Wed, Jan 21, 2015 at 12:53 PM, Tobias Pfeiffer wrote: > Hi, > > I am developing a Spark Streaming application where I want every item in > my stream to be assigned a uni
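A minimal accumulator sketch of this suggestion (stream is a placeholder DStream); as the follow-up below notes, it covers the counting but not the per-item numbering being asked about:

    val itemCount = sc.accumulator(0L, "totalNumberOfItems")

    stream.foreachRDD { rdd =>
      rdd.foreach(_ => itemCount += 1L)                  // executors add to the accumulator
      println("items so far: " + itemCount.value)        // only the driver can read the value
    }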

Re: Closing over a var with changing value in Streaming application

2015-01-20 Thread Tobias Pfeiffer
Hi, On Wed, Jan 21, 2015 at 4:46 PM, Akhil Das wrote: > How about using accumulators > ? > As far as I understand, they solve the part of the problem that I am not worried about, namely increasing the counter. I was more wo
