How to deal with the spark streaming application while upgrade spark

2015-07-23 Thread JoneZhang
My Spark Streaming on Kafka application is running on Spark 1.3. I want to upgrade Spark to 1.4 now. How should I deal with the streaming application? Save the Kafka topic partition offsets, then kill the application, then upgrade, then run the streaming job again? Is there a more elegant way?
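Saving and restoring the offsets can be sketched roughly like this with the direct Kafka API. This is only a sketch: `saveOffsets`/`loadOffsets` are hypothetical helpers backed by ZooKeeper or a database, and `stream`, `ssc`, and `kafkaParams` are assumed to exist already.

```scala
import kafka.common.TopicAndPartition
import kafka.message.MessageAndMetadata
import kafka.serializer.StringDecoder
import org.apache.spark.streaming.kafka.{HasOffsetRanges, KafkaUtils}

// While the job runs: record the offsets of each batch in durable storage.
stream.foreachRDD { rdd =>
  val ranges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  saveOffsets(ranges)  // hypothetical: persist topic/partition/untilOffset
}

// After upgrading: resume from the last saved offsets instead of a checkpoint
// (streaming checkpoints are not portable across Spark versions).
val fromOffsets: Map[TopicAndPartition, Long] = loadOffsets()  // hypothetical
val resumed = KafkaUtils.createDirectStream[String, String, StringDecoder,
    StringDecoder, (String, String)](
  ssc, kafkaParams, fromOffsets,
  (mmd: MessageAndMetadata[String, String]) => (mmd.key, mmd.message))
```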

Re: Re: Need help in setting up spark cluster

2015-07-23 Thread Jeetendra Gangele
Thanks for the reply and your valuable suggestions. I have 10 GB of data generated every day, which I need to write to my database. This data is schema-based and the schema changes frequently, so consider it unstructured data. Sometimes I may have to serve 1 write/sec with 4 m1.xLarge machi

Re: Udf's in spark

2015-07-23 Thread Takeshi Yamamuro
Sure, and Spark SQL supports Hive UDFs. ISTM that the UDF 'DATE_FORMAT' is just not registered in your metastore? Did you run 'CREATE FUNCTION' in advance? Thanks, On Tue, Jul 14, 2015 at 6:30 PM, Ravisankar Mani wrote: > Hi Everyone, > > As mentioned in Spark sQL programming guide, Spark SQL sup

SQL Server to Spark

2015-07-23 Thread vinod kumar
Hi Everyone, I need to use a table from MS SQL Server in Spark. Could anyone share an optimized way to do that? Thanks in advance, Vinod
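One common approach is Spark's JDBC data source. A sketch, assuming Spark 1.4 and the Microsoft JDBC driver on the classpath; the connection details, table name, and partition bounds are placeholders:

```scala
// Assumes an existing sqlContext; all connection values are placeholders.
val df = sqlContext.read.format("jdbc").options(Map(
  "url"     -> "jdbc:sqlserver://myhost:1433;databaseName=mydb;user=me;password=secret",
  "driver"  -> "com.microsoft.sqlserver.jdbc.SQLServerDriver",
  "dbtable" -> "dbo.MyTable",
  // Optional: parallelize the read by range-partitioning a numeric column
  "partitionColumn" -> "id",
  "lowerBound"      -> "1",
  "upperBound"      -> "1000000",
  "numPartitions"   -> "8"
)).load()

df.registerTempTable("mytable")  // query it with sqlContext.sql(...)
```

Without the partitioning options the whole table is read through a single JDBC connection, which is usually the bottleneck.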

RE: Help accessing protected S3

2015-07-23 Thread Ewan Leith
I think the standard S3 driver used in Spark from the Hadoop project (S3n) doesn't support IAM role based authentication. However, S3a should support it. If you're running Hadoop 2.6 via the spark-ec2 scripts (I'm not sure what it launches with by default) try accessing your bucket via s3a:// U
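A minimal sketch of switching to s3a (assumes the Hadoop 2.6+ s3a jars are on the classpath — the CDH4-based AMIs that spark-ec2 launches by default do not ship them; bucket and path are placeholders):

```scala
// With no keys set, s3a falls back to the EC2 instance profile (IAM role).
sc.hadoopConfiguration.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
// Only needed when NOT relying on the instance's IAM role:
// sc.hadoopConfiguration.set("fs.s3a.access.key", "AKIA...")
// sc.hadoopConfiguration.set("fs.s3a.secret.key", "...")

val rdd = sc.textFile("s3a://my-bucket/path/to/data")
```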

Re: How to deal with the spark streaming application while upgrade spark

2015-07-23 Thread Tathagata Das
Currently that is the best way. On Thu, Jul 23, 2015 at 12:51 AM, JoneZhang wrote: > My spark streaming on kafka application is running in spark 1.3. > I want upgrade spark to 1.4 now. > > How to deal with the spark streaming application? > Save the kafka topic partition offset, then kill the ap

Re: SQL Server to Spark

2015-07-23 Thread Denny Lee
It sort of depends on what you mean by optimized. There is a good thread on the topic at http://search-hadoop.com/m/q3RTtJor7QBnWT42/Spark+and+SQL+server/v=threaded If you have an archival-type strategy, you could do daily BCP extracts to load the data into HDFS / S3 / etc. This would result in minimal impact

Facing problem in Oracle VM Virtual Box

2015-07-23 Thread Chintan Bhatt
Hi. I'm facing following error while running .ova file containing Hortonworks with Spark in Oracle VM Virtual Box: Failed to open a session for the virtual machine *Hortonworks Sandbox with HDP 2.2.4*. VT-x features locked or unavailable in MSR. (VERR_VMX_MSR_LOCKED_OR_DISABLED). Result Code:

Asked to remove non-existent executor exception

2015-07-23 Thread Pa Rö
hello spark community, i have built an application with geomesa, accumulo and spark. it works in spark local mode, but not on the spark cluster. in short it says: No space left on device. Asked to remove non-existent executor XY. I'm confused, because there were many GBs of free space
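"No space left on device" in cluster mode usually refers to the workers' local scratch space (spark.local.dir, default /tmp) filling up with shuffle spill, not HDFS. A sketch of pointing it at a larger volume; the path and app name are placeholders:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Placeholder path: a local disk on each worker with plenty of free space.
val conf = new SparkConf()
  .setAppName("geomesa-app")
  .set("spark.local.dir", "/data/spark-tmp")

val sc = new SparkContext(conf)
```

The same setting can also be passed as `--conf spark.local.dir=/data/spark-tmp` to spark-submit.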

Re: Need help in SparkSQL

2015-07-23 Thread ayan guha
Another typical solution is build a search using elasticsearch and use it as secondary index for hbase On 23 Jul 2015 15:50, "Jörn Franke" wrote: > I do not think you can put all your queries into the row key without > duplicating the data for each query. However, this would be more last > resor

Create table from local machine

2015-07-23 Thread vinod kumar
Hi, I need to create a table in Spark. For that I have uploaded a csv file to HDFS and created a table using the following query: CREATE EXTERNAL table IF NOT EXISTS " + tableName + " (teams string,runs int) " + "ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LOCATION '" + hdfspath + "'"; May I k
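For reference, a sketch of running that DDL through a HiveContext (table name and path are placeholders; note that LOCATION must be the directory containing the CSV, not the file itself):

```scala
// Assumes an existing hiveContext; tableName and hdfspath are placeholders.
val tableName = "teams_runs"
val hdfspath  = "/user/vinod/csvdata"  // a directory, not the csv file

hiveContext.sql(
  s"""CREATE EXTERNAL TABLE IF NOT EXISTS $tableName (teams STRING, runs INT)
     |ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
     |LOCATION '$hdfspath'""".stripMargin)
```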

[MLLIB] Anyone tried correlation with RDD[Vector] ?

2015-07-23 Thread Saif.A.Ellafi
I tried with an RDD[DenseVector], but RDDs are invariant in their type parameter, so RDD[DenseVector] is not a subtype of RDD[Vector], and I can't use the RDD input method of correlation. Thanks, Saif

Re: [MLLIB] Anyone tried correlation with RDD[Vector] ?

2015-07-23 Thread Rishi Yadav
Can you explain which transformation is failing? Here's a simple example. http://www.infoobjects.com/spark-calculating-correlation-using-rdd-of-vectors/ On Thu, Jul 23, 2015 at 5:37 AM, wrote: > I tried with a RDD[DenseVector] but RDDs are not transformable, so T+ > RDD[DenseVector] not >: RDD[

Schedule lunchtime today for a free webinar "IoT data ingestion in Spark Streaming using Kaa" 11 a.m. PDT (2 p.m. EDT)

2015-07-23 Thread Oleh Rozvadovskyy
Hi there! Only a couple of hours left until our first webinar on *IoT data ingestion in Spark Streaming using Kaa*. During the webinar we will build a solution that ingests real-time data from Intel Edison into Apache Spark for stream processing. This solution includes a client, middleware, and anal

RE: Issue with column named "count" in a DataFrame

2015-07-23 Thread Young, Matthew T
Thanks Michael, using backticks resolves the issue. Wouldn't this fix also be something that should go into Spark 1.4.2, or at least have the limitation noted in the documentation? From: Michael Armbrust [mich...@databricks.com] Sent: Wednesday, July 22, 2015 4:
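A minimal sketch of the backtick workaround (data and names are made up for illustration):

```scala
// Assumes sqlContext with implicits imported, so toDF is available.
import sqlContext.implicits._

val df = sc.parallelize(Seq(("a", 1), ("b", 2))).toDF("word", "count")

df.select("`count`").show()      // backticks disambiguate from DataFrame.count()
df.filter("`count` > 1").show()  // needed wherever the name is parsed as SQL
```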

Writing binary files in Spark

2015-07-23 Thread Oren Shpigel
Hi, I use Spark to read binary files using SparkContext.binaryFiles(), and then do some calculations, processing, and manipulations to get new objects (also binary). The next thing I want to do is write the results back to binary files on disk. Is there any equivalence like saveAsTextFile just fo

Re: [MLLIB] Anyone tried correlation with RDD[Vector] ?

2015-07-23 Thread Robin East
The OP’s problem is he gets this: :47: error: type mismatch; found : org.apache.spark.rdd.RDD[org.apache.spark.mllib.linalg.DenseVector] required: org.apache.spark.rdd.RDD[org.apache.spark.mllib.linalg.Vector] Note: org.apache.spark.mllib.linalg.DenseVector <: org.apache.spark.mllib.linalg.Ve
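Because RDDs are invariant, the usual fix is to widen each element to the Vector supertype explicitly before calling Statistics.corr. A sketch with toy data:

```scala
import org.apache.spark.mllib.linalg.{DenseVector, Vector}
import org.apache.spark.mllib.stat.Statistics
import org.apache.spark.rdd.RDD

val dense: RDD[DenseVector] = sc.parallelize(Seq(
  new DenseVector(Array(1.0, 2.0)),
  new DenseVector(Array(2.0, 4.0))))

// Upcast each element; RDD[DenseVector] itself is NOT an RDD[Vector].
val widened: RDD[Vector] = dense.map(v => v: Vector)

val corrMatrix = Statistics.corr(widened)
```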

ERROR TaskResultGetter: Exception while getting task result when reading avro files that contain arrays

2015-07-23 Thread Arbi Akhina
Hi, I'm trying to read an avro file into a spark RDD, but I'm having an Exception while getting task result. The avro schema file has the following content: { "type" : "record", "name" : "sample_schema", "namespace" : "com.adomik.avro", "fields" : [ { "name" : "username", "type" :

Class weights and prediction probabilities in random forest?

2015-07-23 Thread Patrick Crenshaw
I was just wondering if there were plans to implement class weights and prediction probabilities in random forest? Is anyone working on this?

Re: Writing binary files in Spark

2015-07-23 Thread Akhil Das
You can look into .saveAsObjectFiles Thanks Best Regards On Thu, Jul 23, 2015 at 8:44 PM, Oren Shpigel wrote: > Hi, > I use Spark to read binary files using SparkContext.binaryFiles(), and then > do some calculations, processing, and manipulations to get new objects > (also > binary). > The nex
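Two common sketches, assuming the results are held as an RDD[Array[Byte]] named `results` and the output paths are placeholders:

```scala
import org.apache.hadoop.io.{BytesWritable, NullWritable}
import org.apache.spark.SparkContext._  // implicits for saveAsSequenceFile

// 1) Java-serialized objects, re-readable with sc.objectFile:
results.saveAsObjectFile("hdfs:///out/objects")

// 2) Raw bytes, stored as a SequenceFile of BytesWritable:
results.map(bytes => (NullWritable.get, new BytesWritable(bytes)))
  .saveAsSequenceFile("hdfs:///out/binary")
```

Option 1 ties the files to Java serialization; option 2 keeps the bytes readable from any Hadoop tool.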

Re: writing/reading multiple Parquet files: Failed to merge incompatible data types StringType and StructType

2015-07-23 Thread Akhil Das
Currently, the only way for you would be to create a proper schema for the data. This is not a bug, but you could open a jira for the feature (since this would help others with similar use-cases), and it could be implemented and included in a future version. Thanks Best Regards On Tue, Jul 21, 2

Re: Help accessing protected S3

2015-07-23 Thread Steve Loughran
> On 23 Jul 2015, at 01:50, Ewan Leith wrote: > > I think the standard S3 driver used in Spark from the Hadoop project (S3n) > doesn't support IAM role based authentication. > > However, S3a should support it. If you're running Hadoop 2.6 via the > spark-ec2 scripts (I'm not sure what it laun

RE: Help accessing protected S3

2015-07-23 Thread Greg Anderson
So when I go to ~/ephemeral-hdfs/bin/hadoop and check its version, it says Hadoop 2.0.0-cdh4.2.0. If I run pyspark and use the s3a address, things should work, right? What am I missing? And thanks so much for the help so far! From: Steve Loughran [ste..

Re: 1.4.0 classpath issue with spark-submit

2015-07-23 Thread Akhil Das
You can try adding that jar in SPARK_CLASSPATH (its deprecated though) in spark-env.sh file. Thanks Best Regards On Tue, Jul 21, 2015 at 7:34 PM, Michal Haris wrote: > I have a spark program that uses dataframes to query hive and I run it > both as a spark-shell for exploration and I have a run

Re: Using Dataframe write with newHdoopApi

2015-07-23 Thread Akhil Das
Did you happen to look into esDF? You can open an issue over here if that doesn't solve your problem https://github.com/elastic/elasticsearch-hadoop/issues Thanks Best Regards On Tue, Jul 21, 2015 at 5:33 PM, ayan guha wrote: > Guys

Re: NullPointerException inside RDD when calling sc.textFile

2015-07-23 Thread Akhil Das
Did you try: val data = indexed_files.groupByKey; val modified_data = data.map { a => var name = a._2.mkString(","); (a._1, name) }; modified_data.foreach { a => var file = sc.textFile(a._2); println(file.count) } Thanks Best Regards On Wed, Jul 22, 2015 at 2:18 AM, MorEru wrote: >

Re: problems running Spark on a firewalled remote YARN cluster via SOCKS proxy

2015-07-23 Thread Akhil Das
It looks like it's picking up the wrong namenode URI from the HADOOP_CONF_DIR, make sure it is proper. Also, for submitting a spark job to a remote cluster, you might want to look at spark.driver.host and spark.driver.port Thanks Best Regards On Wed, Jul 22, 2015 at 8:56 PM, rok wrote: > I am

Re: spark thrift server supports timeout?

2015-07-23 Thread Akhil Das
Here are a few more configurations https://cwiki.apache.org/confluence/display/Hive/Setting+Up+HiveServer2#SettingUpHiveServer2-ConfigurationPropertiesinthehive-site.xmlFile though I can't find anything on the timeouts. Thanks Best Regards On Wed, Jul 22, 2015 at 1:01 AM, Judy Nash wrote: > Hello

Re: Using Dataframe write with newHdoopApi

2015-07-23 Thread ayan guha
Hi Akhil, Thanks. Will definitely take a look. Couple of questions: 1. Is it possible to use newHadoopAPI from dataframe.write or saveAs? 2. Is esDF usable from Python? On Fri, Jul 24, 2015 at 2:29 AM, Akhil Das wrote: > Did you happened to look into esDF >

Help with Dataframe syntax ( IN / COLLECT_SET)

2015-07-23 Thread Yana Kadiyska
Hi folks, having trouble expressing IN and COLLECT_SET on a dataframe. In other words, I'd like to figure out how to write the following query: "select collect_set(b),a from mytable where c in (1,2,3) group by a" I've started with someDF .where( -- not sure what to do for c here--- .groupBy($
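A sketch of both pieces, assuming a HiveContext (collect_set is a Hive UDAF and only became a built-in DataFrame function in later releases; also, on Spark 1.4 the Column method may be named `in` rather than `isin` depending on the exact version):

```scala
import hiveContext.implicits._

// DataFrame side: the IN clause on column c.
val filtered = someDF.where($"c".isin(1, 2, 3))

// collect_set via SQL, the most portable route on 1.4:
filtered.registerTempTable("mytable_filtered")
val result = hiveContext.sql(
  "SELECT a, collect_set(b) AS bs FROM mytable_filtered GROUP BY a")
```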

Re: spark-submit and spark-shell behaviors mismatch.

2015-07-23 Thread Dan Dong
The problem should be "toMap", as I tested that "val maps2=maps.collect" runs ok. When I run spark-shell, I run with "--master mesos://cluster-1:5050" parameter which is the same with "spark-submit". Confused here. 2015-07-22 20:01 GMT-05:00 Yana Kadiyska : > Is it complaining about "collect" o

Twitter4J streaming question

2015-07-23 Thread pjmccarthy
Hopefully this is an easy one. I am trying to filter a twitter dstream by user ScreenName - my code is as follows val stream = TwitterUtils.createStream(ssc, None) .filter(_.getUser.getScreenName.contains("markets")) however nothing gets returned and I can see that Bloomberg has tweeted.

RE: [MLLIB] Anyone tried correlation with RDD[Vector] ?

2015-07-23 Thread Saif.A.Ellafi
Thank you very much, working fine so far Saif From: Robin East [mailto:robin.e...@xense.co.uk] Sent: Thursday, July 23, 2015 12:26 PM To: Rishi Yadav Cc: Ellafi, Saif A.; user@spark.apache.org; Liu, Weicheng Subject: Re: [MLLIB] Anyone tried correlation with RDD[Vector] ? The OP’s problem is he

Re: Twitter4J streaming question

2015-07-23 Thread Enno Shioji
You are probably listening to the sample stream, and THEN filtering. This means you listen to 1% of the twitter stream, and then looking for the tweet by Bloomberg, so there is a very good chance you don't see the particular tweet. In order to get all Bloomberg related tweets, you must connect to
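To have Twitter do the filtering server-side, the track keywords can be passed to createStream itself instead of filtering the sampled stream afterwards, e.g.:

```scala
// Assumes ssc exists and Twitter credentials are set as system properties
// (twitter4j.oauth.*). The keyword is from the OP's example.
import org.apache.spark.streaming.twitter.TwitterUtils

val filters = Seq("markets")  // delivered tweets will match this keyword
val stream  = TwitterUtils.createStream(ssc, None, filters)
```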

Fail to load hive tables through Spark

2015-07-23 Thread Mithila Joshi
I am new to Spark and needed help in figuring out why my Hive databases are not accessible to perform a data load through Spark. Background: 1. I am running Hive, Spark, and my Java program on a single machine. It's a Cloudera QuickStart VM, CDH5.4x, on a VirtualBox. 2. I have do

Re: Twitter4J streaming question

2015-07-23 Thread Patrick McCarthy
How can I tell if it's the sample stream or full stream? Thanks Sent from my iPhone On Jul 23, 2015, at 4:17 PM, Enno Shioji wrote: You are probably listening to the sample stream, and THEN filtering. This means you listen to 1% of the twitter stream, and then looki

Re: Twitter4J streaming question

2015-07-23 Thread Enno Shioji
You need to pay a lot of money to get the full stream, so unless you are doing that, it's the sample stream! On Thu, Jul 23, 2015 at 9:26 PM, Patrick McCarthy wrote: > How can I tell if it's the sample stream or full stream ? > Thanks > > Sent from my iPhone > > On Jul 23, 2015, at 4:17 PM, Enn

Re: Twitter4J streaming question

2015-07-23 Thread Patrick McCarthy
Ahh Makes sense - thanks for the help Sent from my iPhone On Jul 23, 2015, at 4:29 PM, Enno Shioji wrote: You need to pay a lot of money to get the full stream, so unless you are doing that, it's the sample stream! On Thu, Jul 23, 2015 at 9:26 PM, Patrick McCarthy

java.lang.NoSuchMethodError for "list.toMap".

2015-07-23 Thread Dan Dong
Hi, When I ran the following simple Spark program with spark-submit: import org.apache.spark.SparkContext._ import org.apache.spark.SparkConf import org.apache.spark.rdd.RDD import org.apache.spark.SparkContext import org.apache.spark._ import SparkContext._ object TEST2{ def main(args:Array[
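A NoSuchMethodError on a core collection method like toMap usually means the application jar was compiled against a different Scala version than the one the cluster's Spark build uses. A build.sbt sketch; the versions here are assumptions and should be matched to the actual cluster:

```scala
// Stock Spark 1.x releases ship built against Scala 2.10.
scalaVersion := "2.10.4"

// "provided": don't bundle Spark classes that would shadow the cluster's.
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.4.0" % "provided"
```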

spark dataframe gc

2015-07-23 Thread Mohit Jaggi
Hi There, I am testing Spark DataFrame and haven't been able to get my code to finish due to what I suspect are GC issues. My guess is that GC interferes with heartbeating and executors are detected as failed. The data is ~50 numeric columns, ~100 million rows in a CSV file. We are doing a groupBy us

Re: Partition parquet data by ENUM column

2015-07-23 Thread Jerry Lam
Hi Cheng, I ran into issues related to ENUM when I tried to use Filter push down. I'm using Spark 1.5.0 (which contains fixes for parquet filter push down). The exception is the following: java.lang.IllegalArgumentException: FilterPredicate column: item's declared type (org.apache.parquet.io.api.

Re: Fail to load hive tables through Spark

2015-07-23 Thread ayan guha
Please check if your metastore service is running. You may need to enable automatic restart of the metastore service on restart of the VM. On 24 Jul 2015 06:20, "Mithila Joshi" wrote: > I am new to Spark and needed help in figuring out why my Hive databases > are not accessible to perform a data load thro

Zeppelin notebook question

2015-07-23 Thread Stefan Panayotov
Hi, When I create a DataFrame through Spark SQLContext and then register temp table I can use %sql Zeppelin interpreter to open a nice SQL paragraph. If on the other hand I do the same through HiveContext, I can't see those tables in the %sql show tables. Is there a way to query the HiveConte

[ Potential bug ] Spark terminal logs say that job has succeeded even though job has failed in Yarn cluster mode

2015-07-23 Thread Elkhan Dadashov
Hi all, While running the Spark word count Python example with an intentional mistake in *Yarn cluster mode*, the Spark terminal states the final status as SUCCEEDED, but the log files show the correct result, indicating that the job failed. Why do the terminal log output and the application log output contradict each other? If

Re: Help accessing protected S3

2015-07-23 Thread Steve Loughran
> On 23 Jul 2015, at 10:47, Greg Anderson > wrote: > > So when I go to ~/ephemeral-hdfs/bin/hadoop and check its version, it says > Hadoop 2.0.0-cdh4.2.0. If I run pyspark and use the s3a address, things > should work, right? What am I missing? And thanks so much for the help so > far! n

Re: Using Wurfl in Spark

2015-07-23 Thread Zhongxiao Ma
After several tests, it turns out that wurfl itself is not thread-safe. That causes the problem when more than one mapPartition is running: the wurfl engines conflict. I don't know if there is a better way than handling the wurfl lookup outside. Thanks, Zhongxiao From: "z...@4info.com

Enabling mapreduce.input.fileinputformat.list-status.num-threads in Spark?

2015-07-23 Thread Cheolsoo Park
Hi, I am wondering if anyone has successfully enabled "mapreduce.input.fileinputformat.list-status.num-threads" in Spark jobs. I usually set this property to 25 to speed up file listing in MR jobs (Hive and Pig). But for some reason, this property does not take effect in Spark HadoopRDD resulting
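For reference, the property would be set on the Hadoop configuration before any HadoopRDD is created; whether it actually takes effect depends on the Spark version and input format, which is the OP's point:

```scala
// Parallel file-status listing for FileInputFormat-based inputs.
sc.hadoopConfiguration.setInt(
  "mapreduce.input.fileinputformat.list-status.num-threads", 25)
```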

SparkR Supported Types - Please add "bigint"

2015-07-23 Thread Exie
Hi Folks, Using Spark to read in JSON files and detect the schema, it gives me a dataframe with a "bigint" field. R then fails to import the dataframe as it can't convert the type. > head(mydf) Error in as.data.frame.default(x[[i]], optional = TRUE) : cannot coerce class ""jobj"" to a data.fram

Re: Is IndexedRDD available in Spark 1.4.0?

2015-07-23 Thread Ruslan Dautkhanov
Or Spark on HBase ) http://blog.cloudera.com/blog/2014/12/new-in-cloudera-labs-sparkonhbase/ -- Ruslan Dautkhanov On Tue, Jul 14, 2015 at 7:07 PM, Ted Yu wrote: > bq. that is, key-value stores > > Please consider HBase for this purpose :-) > > On Tue, Jul 14, 2015 at 5:55 PM, Tathagata Das

Re: Spark on Tomcat has exception IncompatibleClassChangeError: Implementing class

2015-07-23 Thread Zoran Jeremic
Hi Yana, Sorry for the late response. I just saw your email. In the end I ended up with the following pom https://www.dropbox.com/s/19fldb9qnnfieck/pom.xml?dl=0 There were multiple problems I had to struggle with. One of them was that my application had REST implemented with JBoss Jersey, which got conf

Re: Twitter4J streaming question

2015-07-23 Thread Jörn Franke
He should still see something. I think you need to subscribe to the screen name first, not filter it out only in the filter method. I do not have the APIs at hand on mobile, but there should be a method. On Thu, Jul 23, 2015 at 10:30 PM, Enno Shioji wrote: > You need to pay a lot of money to

Spark - Eclipse IDE - Maven

2015-07-23 Thread Siva Reddy
Hi All, I am trying to set up Eclipse (Luna) with Maven so that I can create Maven projects for developing Spark programs. I am having some issues and I am not sure what the issue is. Can anyone share a nice step-by-step document for configuring Eclipse with Maven for Spark development? Tha

RE: SparkR Supported Types - Please add "bigint"

2015-07-23 Thread Sun, Rui
Exie, Reported your issue: https://issues.apache.org/jira/browse/SPARK-9302 SparkR has support for long(bigint) type in serde. This issue is related to support complex Scala types in serde. -Original Message- From: Exie [mailto:tfind...@prodevelop.com.au] Sent: Friday, July 24, 2015 10

Re: SparkR Supported Types - Please add "bigint"

2015-07-23 Thread Exie
Interestingly, after more digging, df.printSchema() in raw spark shows the columns as a long, not a bigint. root |-- localEventDtTm: timestamp (nullable = true) |-- asset: string (nullable = true) |-- assetCategory: string (nullable = true) |-- assetType: string (nullable = true) |-- event: s

Re: How to restart Twitter spark stream

2015-07-23 Thread Zoran Jeremic
Hi Akhil, Thank you for sending this code. My apologize if I will ask something that is obvious here, since I'm newbie in Scala, but I still don't see how I can use this code. Maybe my original question was not very clear. What I need is to get each Twitter Status that contains one of the hashtag

[POWERED BY] Please add our organization

2015-07-23 Thread Baxter, James
Hi there, Details below. Organisation: Woodside URL: http://www.woodside.com.au/ Components: Spark Core 1.3.1/1.4.1 and Spark SQL Use Case: Spark is being used for near real time predictive analysis over millions of equipment sensor readings and within our Data Integration processes for data quali

Re: Partition parquet data by ENUM column

2015-07-23 Thread Cheng Lian
Could you please provide the full stack trace of the exception? And what's the Git commit hash of the version you were using? Cheng On 7/24/15 6:37 AM, Jerry Lam wrote: Hi Cheng, I ran into issues related to ENUM when I tried to use Filter push down. I'm using Spark 1.5.0 (which contains fix

Re: Spark - Eclipse IDE - Maven

2015-07-23 Thread Anas Sherwani
Can you explain the issue? Also, in which language do you want to code? There are a number of blogs on creating a simple Maven project in Eclipse, and they are pretty simple and straightforward. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Eclipse-

RE: SparkR Supported Types - Please add "bigint"

2015-07-23 Thread Sun, Rui
printSchema calls StructField.buildFormattedString() to output schema information. buildFormattedString() uses DataType.typeName as the string representation of the data type. LongType.typeName = "long" LongType.simpleString = "bigint" I am not sure about the difference of these two type name rep

Re: Using Dataframe write with newHdoopApi

2015-07-23 Thread Akhil Das
1. Yes you can, have a look at the EsOutputFormat 2. I'm not quite sure about that, you could ask the ES people about it. Thanks Best Regards On Thu, Jul 23, 2015 at 11:5