Re: Write to S3 with server side encryption in KMS mode

2016-01-26 Thread Nisrina Luthfiyati
Ah, alright then, it looks like that's the case. Thank you for the info. I'm probably going to try to use the S3-managed encryption; from what I read this is supported by setting the fs.s3a.server-side-encryption-algorithm parameter. Thanks! Nisrina On Tue, Jan 26, 2016 at 11:55 PM, Ewan Leith wro
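
A minimal sketch of the S3-managed (SSE-S3) approach mentioned above, assuming an existing SparkContext `sc`, an existing RDD `rdd`, and a hadoop-aws build that honors this property; the bucket name is hypothetical:

    // Ask s3a to request AES256 (SSE-S3) server-side encryption on writes.
    sc.hadoopConfiguration.set("fs.s3a.server-side-encryption-algorithm", "AES256")
    rdd.saveAsTextFile("s3a://my-bucket/encrypted-output")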

Fwd: Issues with Long subtraction in an RDD when utilising tailrecursion

2016-01-26 Thread Nkechi Achara
Hi, Yep, strangely I get values where the successful auction has a smaller time than the other relevant auctions. I have also attempted to reverse the statement, and I still receive auctions that are greater than the successful auction, and of a greater value as well. On 26 January 2016 at

Re: Spark partition size tuning

2016-01-26 Thread Gene Pang
Hi Jia, If you want to change the Tachyon block size, you can set the tachyon.user.block.size.bytes.default parameter ( http://tachyon-project.org/documentation/Configuration-Settings.html). You can set it via extraJavaOptions per job, or by adding it to tachyon-site.properties. I hope that helps, G
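
A hedged sketch of the per-job route Gene describes, passing the parameter as a JVM system property to the executors; the 512MB value is purely illustrative:

    import org.apache.spark.SparkConf

    // Set the Tachyon block size for this job's executors only.
    val conf = new SparkConf()
      .set("spark.executor.extraJavaOptions",
           "-Dtachyon.user.block.size.bytes.default=536870912") // 512MB, illustrative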

Streaming: mapWithState "Error during Java deserialization."

2016-01-26 Thread Lin Zhao
I'm using mapWithState, and hit https://issues.apache.org/jira/browse/SPARK-12591. While 1.6.1 is not released, I tried the workaround in the comment. But I got this error on one of the nodes. While millions of events go through mapWithState, only 7 show up in the log. Is this related to

Re: Spark, Mesos, Docker and S3

2016-01-26 Thread Sathish Kumaran Vairavelu
Hi Mao, I want to check on accessing the S3 from Spark docker in Mesos. The EC2 instance that I am using has the AWS profile/IAM included. Should we build the docker image with any AWS profile settings or --net=host docker option takes care of it? Please help Thanks Sathish On Tue, Jan 26,

How to debug ClassCastException: java.lang.String cannot be cast to java.lang.Long in SparkSQL

2016-01-26 Thread Anfernee Xu
Hi, I'm using Spark 1.5.0. I wrote a custom Hadoop InputFormat to load data from a 3rd-party datasource; the data type mapping has been taken care of in my code, but when I issued the below query: SELECT * FROM ( SELECT count(*) as failures from test WHERE state != 'success' ) as tmp WHERE ( COALESCE(f

Re: Spark 2.0.0 release plan

2016-01-26 Thread Koert Kuipers
thanks thats all i needed On Tue, Jan 26, 2016 at 6:19 PM, Sean Owen wrote: > I think it will come significantly later -- or else we'd be at code > freeze for 2.x in a few days. I haven't heard anyone discuss this > officially but had batted around May or so instead informally in > conversation.

SQL

2016-01-26 Thread Madabhattula Rajesh Kumar
Hi, To read data from Oracle I am using sqlContext. Below is the method signature. Do the lowerBound and upperBound values have to be the actual lower and upper values of the column in the table, or can we give any numbers? Please clarify. sqlContext.read.format("jdbc").options( Map("url" -> "jdbcURL",
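
For context, a hedged sketch of the partitioned JDBC read being asked about, with hypothetical connection details. The lowerBound/upperBound values only steer how the partitionColumn range is split into numPartitions slices; they do not filter rows, so values far from the column's real min/max merely skew the partition sizes:

    // Assuming an existing sqlContext; all connection details are placeholders.
    val df = sqlContext.read.format("jdbc").options(Map(
      "url"             -> "jdbcURL",
      "dbtable"         -> "MY_TABLE",
      "partitionColumn" -> "ID",
      "lowerBound"      -> "1",
      "upperBound"      -> "1000000",
      "numPartitions"   -> "10"
    )).load()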

Re: Spark, Mesos, Docker and S3

2016-01-26 Thread Mao Geng
Thank you very much, Jerry! I changed to "--jars /opt/spark/lib/hadoop-aws-2.7.1.jar,/opt/spark/lib/aws-java-sdk-1.7.4.jar" then it worked like a charm! From Mesos task logs below, I saw the Mesos executor downloaded the jars from the driver, which is a bit unnecessary (as the docker image already h

Re: Spark, Mesos, Docker and S3

2016-01-26 Thread Jerry Lam
Hi Mao, Can you try --jars to include those jars? Best Regards, Jerry Sent from my iPhone > On 26 Jan, 2016, at 7:02 pm, Mao Geng wrote: > > Hi there, > > I am trying to run Spark on Mesos using a Docker image as executor, as > mentioned > http://spark.apache.org/docs/latest/running-on-m

Re: NA value handling in sparkR

2016-01-26 Thread Devesh Raj Singh
Hi, If we want to create dummy variables out of categorical columns for data manipulation purpose, how would we do it in sparkR? On Wednesday, January 27, 2016, Deborah Siegel wrote: > While fitting the currently available sparkR models, such as glm for > linear and logistic regression, columns

Re: MLlib OneVsRest causing intermittent exceptions

2016-01-26 Thread Ram Sriharsha
btw, OneVsRest is using the labels in the dataset that is fed to the fit method, in case the metadata is missing. So if the metadata contains a label, we expect that label to be present in the dataset passed to the fit method. If you want OneVsRest to compute the labels you can leave the label meta

Re: MLlib OneVsRest causing intermittent exceptions

2016-01-26 Thread Ram Sriharsha
Hey David, In your scenario, OneVsRest is training a classifier for 1 vs not 1... and the input dataset for fit (or train) has labeled data for label 1. But the underlying binary classifier (LogisticRegression) uses sampling to determine the subset of data to sample during each iteration, and it is

naive bayes results to not match published results

2016-01-26 Thread Andy Davidson
I have been getting strange results from Naïve Bayes. The javadoc included a link to a reference paper http://nlp.stanford.edu/IR-book/html/htmledition/naive-bayes-text-classification-1.html . The test data is trivial; you can easily do the computations by hand. To try and figure out what was goi

Re: MLlib OneVsRest causing intermittent exceptions

2016-01-26 Thread David Brooks
Hi Ram, Joseph, That's right, but I will clarify: (a) a random split can generate a training set that does not contain some rare class (b) when LogisticRegression is run over a dataframe where all instances have the same class label, it throws an ArrayIndexOutOfBoundsException. When (a) occurs,

Spark, Mesos, Docker and S3

2016-01-26 Thread Mao Geng
Hi there, I am trying to run Spark on Mesos using a Docker image as executor, as mentioned http://spark.apache.org/docs/latest/running-on-mesos.html#mesos-docker-support . I built a docker image using the following Dockerfile (which is based on https://github.com/apache/spark/blob/master/docker/s

Re: Spark LDA model reuse with new set of data

2016-01-26 Thread Joseph Bradley
Hi, This is more a question for the user list, not the dev list, so I'll CC user. If you're using mllib.clustering.LDAModel (RDD API), then can you make sure you're using a LocalLDAModel (or convert to it from DistributedLDAModel)? You can then call topicDistributions() on the new data. If you'r

Re: Spark 2.0.0 release plan

2016-01-26 Thread Sean Owen
I think it will come significantly later -- or else we'd be at code freeze for 2.x in a few days. I haven't heard anyone discuss this officially but had batted around May or so instead informally in conversation. Does anyone have a particularly strong opinion on that? That's basically an extra 3 mo

Re: MLlib OneVsRest causing intermittent exceptions

2016-01-26 Thread Ram Sriharsha
Hi David, If I am reading the email right, there are two problems here, right? (a) for rare classes the random split will likely miss the rare class; (b) if it misses the rare class, an exception is thrown. I thought the exception stems from (b), is that right?... I wouldn't expect an exception to be th

RE: withColumn

2016-01-26 Thread Mohammed Guller
Naga – I believe that the second argument to the withColumn method has to be a column calculated from the source DataFrame on which you call that method. The following will work: df2.withColumn("age2", $"age"+10) Mohammed Author: Big Data Analytics with Spark
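
A short sketch of the distinction Mohammed describes, assuming two existing DataFrames df1 and df2 with the columns shown in the original thread:

    import org.apache.spark.sql.functions._

    // Works: the new column is derived from df2 itself.
    val ok = df2.withColumn("age2", col("age") + 10)

    // Fails: a column belonging to a different DataFrame (df1) cannot be
    // appended to df2 directly; use a join instead.
    // val bad = df2.withColumn("name", df1("name1"))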

Re: MLlib OneVsRest causing intermittent exceptions

2016-01-26 Thread Ram Sriharsha
Hey David, Yeah, absolutely! Feel free to create a JIRA and attach your patch to it. We can help review it and pull in the fix... happy to accept contributions! CCing Joseph, who is one of the maintainers of MLlib as well... when creating the JIRA, can you attach a simple test case? On Tue, Jan 26, 2

Re: MLlib OneVsRest causing intermittent exceptions

2016-01-26 Thread David Brooks
Hi again Ram, Sorry, I was too hasty in my previous response. I've done a bit more digging through the code, and StringIndexer does indeed provide metadata, as a NominalAttribute with a known number of class labels. I don't think the issue is related to the use of metadata, however. It seems to

Re: newAPIHadoopFile uses AWS credentials from other threads

2016-01-26 Thread Wayne Song
Hmmm, I seem to be able to get around this by setting hadoopConf1.setBoolean("fs.s3n.impl.disable.cache", true) in my code. Is there anybody more familiar with Hadoop who can confirm that the filesystem cache would cause this issue?
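
A sketch of the workaround described above: per-thread credentials on a fresh Hadoop Configuration, with the shared FileSystem cache disabled so credentials are not reused across threads (the cache property name is taken from the message; the credential values are placeholders):

    import org.apache.hadoop.conf.Configuration

    val hadoopConf1 = new Configuration()
    hadoopConf1.set("fs.s3n.awsAccessKeyId", "AKIA...")   // placeholder
    hadoopConf1.set("fs.s3n.awsSecretAccessKey", "...")   // placeholder
    // Bypass the JVM-wide FileSystem cache, which is keyed by scheme and
    // authority rather than by credentials.
    hadoopConf1.setBoolean("fs.s3n.impl.disable.cache", true)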

Re: Spark SQL joins taking too long

2016-01-26 Thread Raghu Ganti
Yes, the SomeUDF is Contains, shape is a UDT that maps a custom geometry type to sql binary type. Custom geometry type is a Java class. Please let me know if you need further info. Regards Raghu > On Jan 26, 2016, at 17:13, Ted Yu wrote: > > What's the type of shape column ? > > Can you dis

Re: Spark SQL joins taking too long

2016-01-26 Thread Ted Yu
What's the type of shape column ? Can you disclose what SomeUDF does (by showing the code) ? Cheers On Tue, Jan 26, 2016 at 12:41 PM, raghukiran wrote: > Hi, > > I create two tables, one counties with just one row (it actually has 2k > rows, but I used only one) and another hospitals, which ha

Spark 2.0.0 release plan

2016-01-26 Thread Koert Kuipers
Is the idea that spark 2.0 comes out roughly 3 months after 1.6? So quarterly release as usual? Thanks

Re: Spark Pattern and Anti-Pattern

2016-01-26 Thread Jörn Franke
Spark has its best use cases in in-memory batch processing / machine learning. Connecting multiple different sources/destinations requires some thinking and probably more than Spark. Connecting Spark to a database makes sense only in very few cases. You will have huge performance issues due to th

Re: Issues with Long subtraction in an RDD when utilising tailrecursion

2016-01-26 Thread Ted Yu
bq. (successfulAuction.timestampNanos - auction.timestampNanos) < 1000L && Have you taken the above condition into consideration when inspecting the timestamps of the results? On Tue, Jan 26, 2016 at 1:10 PM, Nkechi Achara wrote: >

Re: withColumn

2016-01-26 Thread Ted Yu
A brief search among the Spark source code showed no support for referencing a column the way shown in your code. Are you trying to do a join? Cheers On Tue, Jan 26, 2016 at 1:04 PM, naga sharathrayapati < sharathrayap...@gmail.com> wrote: > I was trying to append a Column to a dataframe df2 by

Spark Pattern and Anti-Pattern

2016-01-26 Thread Daniel Schulz
Hi, We are currently working on a solution architecture to solve IoT workloads on Spark. Therefore, I am interested in getting to know whether it is considered an Anti-Pattern in Spark to get records from a database and make a ReST call to an external server with that data. This external server

Re: Generic Dataset Aggregator

2016-01-26 Thread Arkadiusz Bicz
Hi Deenar, You just need to encapsulate the Array in a case class (you cannot define the case class inside the Spark shell, as it cannot be an inner class). import com.hsbc.rsl.spark.aggregation.MinVectorAggFunction import org.apache.spark.sql.functions._ import org.apache.spark.sql.expressions.Aggregator import
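
A rough sketch of the pattern against the Spark 1.6 Aggregator API, with hypothetical names (the MinVectorAggFunction imported above is from the poster's own code base); the case class must be top-level, not defined inside the shell:

    import org.apache.spark.sql.expressions.Aggregator

    // Top-level wrapper for the array (hypothetical name).
    case class Vec(values: Array[Double])

    // Element-wise minimum; the vector length (3) is an assumption.
    object MinVec extends Aggregator[Vec, Vec, Vec] {
      def zero: Vec = Vec(Array.fill(3)(Double.MaxValue))
      def reduce(b: Vec, a: Vec): Vec =
        Vec(b.values.zip(a.values).map { case (x, y) => math.min(x, y) })
      def merge(b1: Vec, b2: Vec): Vec = reduce(b1, b2)
      def finish(r: Vec): Vec = r
    }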

Issues with Long subtraction in an RDD when utilising tailrecursion

2016-01-26 Thread Nkechi Achara
I am having an issue with the subtraction of a Long within an RDD, to filter out items in the RDD that are within a certain time range. So my code filters an RDD

withColumn

2016-01-26 Thread naga sharathrayapati
I was trying to append a Column to a dataframe df2 by using 'withColumn'(as shown below), can anyone help me understand what went wrong? scala> case class Sharath(name1: String, age1: Long) defined class Sharath scala> val df1 = Seq(Sharath("Sharath", 29)).toDF() df1: org.apache.spark.sql.Data

Spark SQL joins taking too long

2016-01-26 Thread raghukiran
Hi, I create two tables, one counties with just one row (it actually has 2k rows, but I used only one) and another hospitals, which has 6k rows. The join command I use is as follows, which takes way too long to run and has never finished successfully (even after nearly 10mins). The following is wh

Re: NoSuchMethod from transitive dependency jackson-databind in MaxMind GeoIP2

2016-01-26 Thread asdf zxcv
Thanks Ted, I think I was able to resolve this issue without modifying spark by setting a relocation for jackson in my shaded jar. On Tue, Jan 26, 2016 at 12:11 PM, Ted Yu wrote: > Then maybe changing the following in pom.xml to 2.7.0 and rebuild Spark ? > > 2.5.3 > > On Tue, Jan 26, 2016 at

Spark GraphX + TitanDB + Cassandra?

2016-01-26 Thread Joe Bako
I’ve found some references online to various implementations (such as Dendrite) leveraging HDFS via TitanDB + HBase for graph processing. GraphLab also uses HDFS/Hadoop. I am wondering if (and how) one might use TitanDB + Cassandra as the data source for Spark GraphX? The Gremlin language see

Re: Terminating Spark Steps in AWS

2016-01-26 Thread Daniel Imberman
Hi Jonathan, Thank you, that worked perfectly. (apologies for the noob question) On Tue, Jan 26, 2016 at 11:20 AM Jonathan Kelly wrote: > Daniel, > > The "hadoop job -list" command is a deprecated form of "mapred job -list", > which is only for Hadoop MapReduce jobs. For Spark jobs, which run o

Re: NoSuchMethod from transitive dependency jackson-databind in MaxMind GeoIP2

2016-01-26 Thread Ted Yu
Then maybe changing the following in pom.xml to 2.7.0 and rebuild Spark ? 2.5.3 On Tue, Jan 26, 2016 at 11:53 AM, asdf zxcv wrote: > Hmm, this did not seem to resolve the issue. I also tried adding a > relocation for jackson as well. > > On Tue, Jan 26, 2016 at 10:09 AM, Ted Yu wrote: > >>

Re: NA value handling in sparkR

2016-01-26 Thread Deborah Siegel
While fitting the currently available sparkR models, such as glm for linear and logistic regression, columns which contain strings are one-hot encoded behind the scenes, as part of the parsing of the RFormula. Does that help, or did you have something else in mind? > Thank you so much for your

Databricks Cloud vs AWS EMR

2016-01-26 Thread Alex Nastetsky
As a user of AWS EMR (running Spark and MapReduce), I am interested in potential benefits that I may gain from Databricks Cloud. I was wondering if anyone has used both and done comparison / contrast between the two services. In general, which resource manager(s) does Databricks Cloud use for Spar

Re: NoSuchMethod from transitive dependency jackson-databind in MaxMind GeoIP2

2016-01-26 Thread asdf zxcv
Hmm, this did not seem to resolve the issue. I also tried adding a relocation for jackson as well. On Tue, Jan 26, 2016 at 10:09 AM, Ted Yu wrote: > I wonder if the following change would solve the problem you described (by > shading jackson.core): > > diff --git a/pom.xml b/pom.xml > index fb77

Re: multi-threaded Spark jobs

2016-01-26 Thread Elango Cheran
I think I understand what you're saying, but I think whether you're "over-provisioning" or not depends on the nature of your workload, your system's resources, and how Spark determines how to spawn task threads inside executor processes. As I concluded in the post, if you're doing CPU-bound work,

Re: Terminating Spark Steps in AWS

2016-01-26 Thread Jonathan Kelly
Daniel, The "hadoop job -list" command is a deprecated form of "mapred job -list", which is only for Hadoop MapReduce jobs. For Spark jobs, which run on YARN, you instead want "yarn application -list". Hope this helps, Jonathan (from the EMR team) On Tue, Jan 26, 2016 at 10:05 AM Daniel Imberman

NoSuchMethod from transitive dependency jackson-databind in MaxMind GeoIP2

2016-01-26 Thread laxatives
I'm trying to run MaxMind GeoIP2 in a Spark task, but get a runtime error at init due to a NoSuchMethodError for ArrayNode from com.fasterxml.jackson.core:jackson-databind. This succeeds locally in unit tests, but fails in Spark tasks. I've excluded jackson-databind from all other dependencies, in

save rdd with gzip compresson but without .gz extension?

2016-01-26 Thread Alexander Pivovarov
Question #1: When Spark saves an RDD using the Gzip codec, it generates files with a .gz extension. Is it possible to ask Spark not to add the .gz extension to file names, and keep file names like part-x? I want to compress existing text files to gzip and want to keep the original file names (and context). Questi
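
One hedged way to drop the extension is a thin codec subclass whose default extension is empty (Hadoop's output formats append whatever getDefaultExtension returns); this assumes an existing RDD `rdd`, and the class name and output path are illustrative:

    import org.apache.hadoop.io.compress.GzipCodec

    // Gzip-compressed output, but without the ".gz" suffix on part files.
    class GzipNoExtensionCodec extends GzipCodec {
      override def getDefaultExtension: String = ""
    }

    rdd.saveAsTextFile("hdfs:///tmp/out", classOf[GzipNoExtensionCodec])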

Issue with spark-shell in yarn mode

2016-01-26 Thread ndjido
Hi folks, On Spark 1.6.0, I submitted 2 lines of code via spark-shell in Yarn-client mode: 1) sc.parallelize(Array(1,2,3,3,3,3,4)).collect() 2) sc.parallelize(Array(1,2,3,3,3,3,4)).map( x => (x, 1)).collect() 1) works well whereas 2) raises the following exception: Driver stacktrace:

Window range in Spark

2016-01-26 Thread Krishna
Hi, We receive bursts of data with sequential ids and I would like to find the range for each burst-window. What's the best way to find the "window" ranges in Spark?

Input
-----
1 2 3 4 6 7 8 100 101 102 500 700 701 702 703 704

Output (window start, window end)
---------------------------------
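
One hedged way to compute these is the classic gaps-and-islands trick: after sorting, ids in the same consecutive run share a constant (id - position) key. A sketch assuming an existing SparkContext `sc` and the sample ids above:

    // Consecutive ids share the same (id - position) key after sorting.
    val ids = sc.parallelize(Seq(1L, 2L, 3L, 4L, 6L, 7L, 8L,
                                 100L, 101L, 102L, 500L,
                                 700L, 701L, 702L, 703L, 704L))
    val windows = ids.sortBy(identity).zipWithIndex()
      .map { case (id, pos) => (id - pos, id) }
      .groupByKey()
      .map { case (_, run) => (run.min, run.max) }   // (window start, window end)

    windows.collect().sorted.foreach(println)
    // (1,4), (6,8), (100,102), (500,500), (700,704)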

RE: ctas fails with "No plan for CreateTableAsSelect"

2016-01-26 Thread Younes Naguib
It seems that for partitioned tables, you need to create the table first, and run an INSERT INTO the table to take advantage of the dynamic partition allocation. That worked for me. @Ted I just realized you were asking for a complete stack trace. 2016-01-26 15:36:04 ERROR SparkExecuteStatementOperation
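
A hedged sketch of that create-first-then-insert workaround, written against a HiveContext `sqlContext`; table and column names are hypothetical:

    sqlContext.sql("SET hive.exec.dynamic.partition.mode=nonstrict")
    sqlContext.sql(
      """CREATE TABLE tab1 (col1 TIMESTAMP, col2 STRING)
        |PARTITIONED BY (year INT, month INT, day INT)
        |STORED AS PARQUET""".stripMargin)
    // Dynamic partition insert: partition columns come last in the SELECT.
    sqlContext.sql(
      """INSERT INTO TABLE tab1 PARTITION (year, month, day)
        |SELECT col1, col2, year, month, day FROM tab2""".stripMargin)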

Re: FAIR scheduler in Spark Streaming

2016-01-26 Thread Sebastian Piu
Thanks Shixiong, I'll give it a try and report back Cheers On 26 Jan 2016 6:10 p.m., "Shixiong(Ryan) Zhu" wrote: > The number of concurrent Streaming job is controlled by > "spark.streaming.concurrentJobs". It's 1 by default. However, you need to > keep in mind that setting it to a bigger number

Re: Need a sample code to load XML files into cassandra database using spark streaming

2016-01-26 Thread Shixiong(Ryan) Zhu
You can use spark-xml to read the xml files. https://github.com/databricks/spark-xml has some examples. To save your results to cassandra, you can use spark-cassandra-connector: https://github.com/datastax/spark-cassandra-connector On Tue, Jan 26, 2016 at 10:10 AM, Sree Eedupuganti wrote: > Hel
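
A rough sketch wiring the two libraries together, assuming an existing SparkContext `sc`; the input path, rowTag, keyspace, table, and column names are all hypothetical:

    import org.apache.spark.sql.SQLContext
    import com.datastax.spark.connector._

    val sqlContext = new SQLContext(sc)

    // Parse XML files into a DataFrame, one row per assumed <record> element.
    val df = sqlContext.read
      .format("com.databricks.spark.xml")
      .option("rowTag", "record")
      .load("s3n://my-bucket/input/*.xml")

    // Write two assumed string columns into an assumed Cassandra table.
    df.rdd.map(r => (r.getString(0), r.getString(1)))
      .saveToCassandra("my_keyspace", "my_table", SomeColumns("col1", "col2"))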

Re: FAIR scheduler in Spark Streaming

2016-01-26 Thread Shixiong(Ryan) Zhu
The number of concurrent Streaming jobs is controlled by "spark.streaming.concurrentJobs". It's 1 by default. However, you need to keep in mind that setting it to a bigger number will allow jobs of several batches to run at the same time. It's hard to predict the behavior and sometimes it will surpr
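
For reference, a minimal sketch of the setting (use with care, per the caveat above):

    import org.apache.spark.SparkConf

    // Allow jobs from two streaming batches to run at the same time.
    val conf = new SparkConf().set("spark.streaming.concurrentJobs", "2")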

Re: NoSuchMethod from transitive dependency jackson-databind in MaxMind GeoIP2

2016-01-26 Thread Ted Yu
I wonder if the following change would solve the problem you described (by shading jackson.core):

    diff --git a/pom.xml b/pom.xml
    index fb77506..32a3237 100644
    --- a/pom.xml
    +++ b/pom.xml
    @@ -2177,6 +2177,7 @@
           org.eclipse.jetty:jetty-util
           org.eclipse.jetty:jetty-server

Need a sample code to load XML files into cassandra database using spark streaming

2016-01-26 Thread Sree Eedupuganti
Hello everyone, new to Spark Streaming; I need a sample code to load XML files from an AWS S3 server into a Cassandra database. Any suggestions please? Thanks in advance. -- Best Regards, Sreeharsha Eedupuganti Data Engineer innData Analytics Private Limited

Re: NPE from sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply?

2016-01-26 Thread Michael Armbrust
That is a bug in generated code. It would be great if you could post a reproduction. On Tue, Jan 26, 2016 at 9:15 AM, Jacek Laskowski wrote: > Hi, > > Does this say anything to anyone? :) It's with Spark 2.0.0-SNAPSHOT > built today. Is this something I could fix myself in my code or is > this

Terminating Spark Steps in AWS

2016-01-26 Thread Daniel Imberman
Hi all, I want to set up a series of spark steps on an EMR spark cluster, and terminate the current step if it's taking too long. However, when I ssh into the master node and run hadoop jobs -list, the master node seems to believe that there are no jobs running. I don't want to terminate the cluste

FAIR scheduler in Spark Streaming

2016-01-26 Thread Sebastian Piu
Hi, I'm trying to get FAIR scheduling to work in a spark streaming app (1.6.0). I've found a previous mailing list post where it is indicated to do:

    dstream.foreachRDD { rdd =>
      rdd.sparkContext.setLocalProperty("spark.scheduler.pool", "pool1") // set the pool
      rdd.count() // or whatever job
    }

This
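
Besides the per-job pool above, the scheduler mode itself has to be FAIR. A sketch of the context-level configuration, assuming a hypothetical pool definition file path:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf = new SparkConf()
      .set("spark.scheduler.mode", "FAIR")
      .set("spark.scheduler.allocation.file", "/path/to/fairscheduler.xml") // optional pool definitions
    val ssc = new StreamingContext(conf, Seconds(10))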

RE: how to correctly run scala script using spark-shell through stdin (spark v1.0.0)

2016-01-26 Thread Andres.Fernandez
True, thank you. Is there a way of keeping the shell open (i.e., how to avoid the :quit statement)? Thank you both. Andres From: Ewan Leith [mailto:ewan.le...@realitymine.com] Sent: Tuesday, January 26, 2016 1:50 PM To: Iulian Dragoș; Fernandez, Andres Cc: user Subject: RE: how to correctly run sc

RE: a question about web ui log

2016-01-26 Thread Mohammed Guller
If the application history is turned on, it should work, even through ssh tunnel. Can you elaborate on what you mean by “it does not work?” Also, are you able to see the application web UI while an application is executing a job? Mohammed Author: Big Data Analytics with Spark

Scala closure exceeds ByteArrayOutputStream limit (~2gb)

2016-01-26 Thread Joel Keller
Hello, I am running RandomForest from mllib on a data-set which has very-high dimensional data (~50k dimensions). I get the following stack trace: 16/01/22 21:52:48 ERROR ApplicationMaster: User class threw exception: java.lang.OutOfMemoryError java.lang.OutOfMemoryError at java.io.ByteArrayOutp

NoSuchMethod from transitive dependency jackson-databind in MaxMind GeoIP2

2016-01-26 Thread asdf zxcv
Hi all, I'm trying to run MaxMind GeoIP2 in a Spark task, but get a runtime error at init due to a NoSuchMethodError for ArrayNode from com.fasterxml.jackson.core:jackson-databind. This succeeds locally in unit tests, but fails in Spark tasks. I've excluded jackson-databind from all other depende

NPE from sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply?

2016-01-26 Thread Jacek Laskowski
Hi, Does this say anything to anyone? :) It's with Spark 2.0.0-SNAPSHOT built today. Is this something I could fix myself in my code or is this Spark SQL? Caused by: java.lang.NullPointerException at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown

RE: Write to S3 with server side encryption in KMS mode

2016-01-26 Thread Ewan Leith
Hi Nisrina, I’m not aware of any support for KMS keys in s3n, s3a or the EMR specific EMRFS s3 driver. If you’re using EMRFS with Amazon’s EMR, you can use KMS keys with client-side encryption http://docs.aws.amazon.com/kms/latest/developerguide/services-emr.html#emrfs-encrypt If this has chan

RE: how to correctly run scala script using spark-shell through stdin (spark v1.0.0)

2016-01-26 Thread Ewan Leith
I’ve just tried running this using a normal stdin redirect: ~/spark/bin/spark-shell < simple.scala Which worked: it started spark-shell, executed the script, then stopped the shell. Thanks, Ewan From: Iulian Dragoș [mailto:iulian.dra...@typesafe.com] Sent: 26 January 2016 15:00 To: fernandrez19

RE: ctas fails with "No plan for CreateTableAsSelect"

2016-01-26 Thread Younes Naguib
The destination table is partitioned. If I don’t specify the columns I get : Error: org.apache.hadoop.hive.ql.metadata.HiveException: org.apache.hadoop.hive.ql.metadata.HiveException: Partition column name year conflicts with table columns. (state=,code=0) younes From: Tejas Pat

Stage shows incorrect output size

2016-01-26 Thread Noorul Islam K M
Hi all, I am trying to copy data from one cassandra cluster to another using spark + cassandra connector. At the source I have around 200 GB of data. But while running, the spark stage shows the output as 406 GB, and the data is still getting copied. I wonder why it is showing such a high number. Envir

Re: ctas fails with "No plan for CreateTableAsSelect"

2016-01-26 Thread Tejas Patil
In CTAS, you should not specify the column information as it is derived from the result of SELECT statement. See https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-CreateTableAsSelect(CTAS) ~tejasp On Tue, Jan 26, 2016 at 9:48 PM, Younes Naguib < younes.nag...@t

Re: org.netezza.error.NzSQLException: ERROR: Invalid datatype - TEXT

2016-01-26 Thread Sri
Thanks Ted, the trick worked; this feature should be committed in the next Spark release. Thanks Sri Sent from my iPhone > On 26 Jan 2016, at 15:49, Ted Yu wrote: > > Please take a look at getJDBCType() in: > sql/core/src/main/scala/org/apache/spark/sql/jdbc/PostgresDialect.scala > > You can register diale

Re: org.netezza.error.NzSQLException: ERROR: Invalid datatype - TEXT

2016-01-26 Thread kali.tumm...@gmail.com
Fixed by creating a new Netezza dialect and registering it in JdbcDialects using the JdbcDialects.registerDialect(NetezzaDialect) method (spark/sql/core/src/main/scala/org/apache/spark/sql/jdbc/JdbcDialects.scala) package com.citi.ocean.spark.elt /** * Created by st84879 on 26/01/2016. */ import ja
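
A condensed sketch of such a dialect against the Spark 1.x JdbcDialect API; the VARCHAR length is an assumption:

    import java.sql.Types
    import org.apache.spark.sql.jdbc.{JdbcDialect, JdbcDialects, JdbcType}
    import org.apache.spark.sql.types._

    object NetezzaDialect extends JdbcDialect {
      override def canHandle(url: String): Boolean = url.startsWith("jdbc:netezza")
      // Map Spark's StringType to VARCHAR instead of the unsupported TEXT.
      override def getJDBCType(dt: DataType): Option[JdbcType] = dt match {
        case StringType => Some(JdbcType("VARCHAR(4096)", Types.VARCHAR))
        case _          => None // defer to Spark's defaults
      }
    }

    JdbcDialects.registerDialect(NetezzaDialect)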

Re: Off-heap memory usage of Spark Executors keeps increasing

2016-01-26 Thread nir
Are you having this issue with spark 1.5 as well? We had a similar OOM issue and were told by databricks to upgrade to 1.5 to resolve it. I guess they are trying to sell Tachyon :)

Re: Regarding Off-heap memory

2016-01-26 Thread Nirav Patel
From my experience with spark 1.3.1 you can also set spark.executor.memoryOverhead to about 7-10% of your spark.executor.memory. The total of the two will be requested for a Yarn container. On Tue, Jan 26, 2016 at 4:20 AM, Xiaoyu Ma wrote: > Hi all, > I saw spark 1.6 has new off heap settings: spark.
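
A sketch of that sizing rule on YARN; note the yarn-prefixed property name, which is the Spark 1.x spelling, and the illustrative figures:

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .set("spark.executor.memory", "10g")                 // executor heap
      .set("spark.yarn.executor.memoryOverhead", "1024")   // MB, roughly 10% of the heap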

Re: Spark ODBC Driver Windows Desktop problem

2016-01-26 Thread Я
In case anyone needs it: on Tableau Desktop 8.3 everything is OK. >Tuesday, 26 January 2016, 18:04 +03:00 from Я: > > >Hi, i'm trying to connect tableau desktop 9.2 to spark sql. >i'm using this guide http://samx18.io/blog/2015/09/05/tableau-spark-hive.html >but on the last step, when I try to get the contents of

RE: ctas fails with "No plan for CreateTableAsSelect"

2016-01-26 Thread Younes Naguib
The CTAS works when not using partitions or not defining columns. Ex: Create table default.tab1 stored as parquet location 'hdfs://mtl2-alabs-dwh01.streamtheworld.net:9000/younes/geo_location_enrichment' as Select * from tab2 Works. From: Ted Yu [mailto:yuzhih...@gmail.com] Sent: January-26-16

Re: ctas fails with "No plan for CreateTableAsSelect"

2016-01-26 Thread Ted Yu
Maybe try enabling the following (false by default): "spark.sql.hive.convertCTAS" doc = "When true, a table created by a Hive CTAS statement (no USING clause) will be " + "converted to a data source table, using the data source set by spark.sql.sources.default.") FYI On Tue, Jan 26, 2016 at 8:06

RE: ctas fails with "No plan for CreateTableAsSelect"

2016-01-26 Thread Younes Naguib
SQL on beeline and connecting to the thriftserver. Younes From: Ted Yu [mailto:yuzhih...@gmail.com] Sent: January-26-16 11:05 AM To: Younes Naguib Cc: user@spark.apache.org Subject: Re: ctas fails with "No plan for CreateTableAsSelect" Were you using HiveContext or SQLContext ? Can you show the

Re: ctas fails with "No plan for CreateTableAsSelect"

2016-01-26 Thread Ted Yu
Were you using HiveContext or SQLContext ? Can you show the complete stack trace ? Thanks On Tue, Jan 26, 2016 at 8:00 AM, Younes Naguib < younes.nag...@tritondigital.com> wrote: > Hi, > > > > I’m running CTAS, and it fails with “Error: java.lang.AssertionError: > assertion failed: No plan for

ctas fails with "No plan for CreateTableAsSelect"

2016-01-26 Thread Younes Naguib
Hi, I'm running CTAS, and it fails with "Error: java.lang.AssertionError: assertion failed: No plan for CreateTableAsSelect HiveTable". Here is what my SQL looks like: Create tbl ( Col1 timestamp, Col2 string, Col3 int, ... ) partitioned by (year int, month int, day

Re: how to correctly run scala script using spark-shell through stdin (spark v1.0.0)

2016-01-26 Thread Iulian Dragoș
On Tue, Jan 26, 2016 at 4:08 PM, wrote: > Yes no option –i. Thanks Iulian, but do you know how can I send three > lines to be executed just after spark-shell has initiated. Please check > http://apache-spark-user-list.1001560.n3.nabble.com/how-to-correctly-run-scala-script-using-spark-shell-throu

Re: Spark task hangs infinitely when accessing S3 from AWS

2016-01-26 Thread Erisa Dervishi
Actually now that I was taking a close look at the thread dump, it looks like all the worker threads are in a "Waiting" condition:

    sun.misc.Unsafe.park(Native Method)
    java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
    java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionOb

Re: org.netezza.error.NzSQLException: ERROR: Invalid datatype - TEXT

2016-01-26 Thread Ted Yu
Please take a look at getJDBCType() in: sql/core/src/main/scala/org/apache/spark/sql/jdbc/PostgresDialect.scala You can register dialect for Netezza as shown in sql/core/src/main/scala/org/apache/spark/sql/jdbc/JdbcDialects.scala Cheers On Tue, Jan 26, 2016 at 7:26 AM, kali.tumm...@gmail.com < k

org.netezza.error.NzSQLException: ERROR: Invalid datatype - TEXT

2016-01-26 Thread kali.tumm...@gmail.com
Hi All, I am using a Spark JDBC DataFrame to store data into Netezza. I think Spark is trying to create the table using the data type TEXT for string columns, but Netezza doesn't support the TEXT data type. How do I override the Spark method to use VARCHAR instead? val sourcedfmode=sourcedf.persist(Stora

Re: Dataframe, Spark SQL - Drops First 8 Characters of String on Amazon EMR

2016-01-26 Thread Daniel Darabos
Have you tried setting spark.emr.dropCharacters to a lower value? (It defaults to 8.) :) Just joking, sorry! Fantastic bug. What data source do you have for this DataFrame? I could imagine for example that it's a Parquet file and on EMR you are running with a wrong version of the Parquet librar

Spark ODBC Driver Windows Desktop problem

2016-01-26 Thread Я
Hi, i'm trying to connect tableau desktop 9.2 to spark sql. i'm using this guide http://samx18.io/blog/2015/09/05/tableau-spark-hive.html but on the last step, when I try to get the contents of the table, I'm getting only the row count and an empty table. hadoop 2.6 hive 1.21 spark 1.4.1 tableau desktop

Re: how to correctly run scala script using spark-shell through stdin (spark v1.0.0)

2016-01-26 Thread Iulian Dragoș
I don’t see -i in the output of spark-shell --help. Moreover, in master I get an error: $ bin/spark-shell -i test.scala bad option: '-i' iulian On Tue, Jan 26, 2016 at 3:47 PM, fernandrez1987 < andres.fernan...@wellsfargo.com> wrote: > spark-shell -i file.scala is not working for me in Spark

RE: how to correctly run scala script using spark-shell through stdin (spark v1.0.0)

2016-01-26 Thread fernandrez1987
spark-shell -i file.scala is not working for me in Spark 1.6.0, was this removed or what do I have to take into account? The script does not get run at all. What can be happening?

How to migrate spark code to spark streaming ?

2016-01-26 Thread Sree Eedupuganti
Hello everyone, Loading XML files from S3 to a database [i.e. Cassandra]. Right now my code is in Spark Core. I want to migrate my code to Spark Streaming because every 15 minutes we have to load XML files into the database. So in this case I need to migrate my code to Spark Streaming. Any suggestions

py4j.protocol.Py4JJavaError when selecting nested column in dataframe using select statetment

2016-01-26 Thread Lior Baber
I'm trying to perform a simple task in a Spark dataframe (python), which is to create a new dataframe by selecting a specific column and nested columns from another dataframe. For example: df.printSchema() root |-- time_stamp: long (nullable = true) |-- country: struct (nullable = true) |    |-- code: st
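
For reference, a hedged sketch of selecting a nested field with the schema above (shown in Scala, although the question is from PySpark; the same dotted-path form works in both):

    // Select a top-level column and a nested struct field by dotted path.
    df.select("time_stamp", "country.code").show()

    // Equivalent, keeping an explicit alias for the nested field.
    df.select(df("time_stamp"), df("country").getField("code").alias("country_code"))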

Re: Worker's BlockManager Folder not getting cleared

2016-01-26 Thread Abhishek Anand
Hi Adrian, I am running spark in standalone mode. The spark version that I am using is 1.4.0 Thanks, Abhi On Tue, Jan 26, 2016 at 4:10 PM, Adrian Bridgett wrote: > Hi Abhi - are you running on Mesos perchance? > > If so then with spark <1.6 you will be hitting > https://issues.apache.org/jira

Re: MLlib OneVsRest causing intermittent exceptions

2016-01-26 Thread David Brooks
Hi Ram, I didn't include an explicit label column in my reproduction as I thought it superfluous. However, in my original use-case, I was using a StringIndexer, where the labels were indexed across the entire dataset (training+validation+test). The (indexed) label column was then explicitly prov

Re: Spark task hangs infinitely when accessing S3 from AWS

2016-01-26 Thread Gourav Sengupta
Hi, Are you creating RDD's using textfile option? Can you please let me know the following: 1. Number of partitions 2. Number of files 3. Time taken to create the RDD's Regards, Gourav Sengupta On Tue, Jan 26, 2016 at 1:12 PM, Gourav Sengupta wrote: > Hi, > > are you creating RDD's out of th

Re: Spark task hangs infinitely when accessing S3 from AWS

2016-01-26 Thread Gourav Sengupta
Hi, are you creating RDD's out of the data? Regards, Gourav On Tue, Jan 26, 2016 at 12:45 PM, aecc wrote: > Sorry, I have not been able to solve the issue. I used speculation mode as > workaround to this. > > > > -- > View this message in context: > http://apache-spark-user-list.1001560.n3.n

Re: cartesian in the loop, runtime grows

2016-01-26 Thread efa
Problem solved:

    for i in range(1, 6):
        L = L.cartesian(D)
        L.unpersist()
        L = L.reduceByKey(min).coalesce(6).map(lambda (l, n): l).cache()
        L.collect()

The number of partitions should be constant.

Re: a question about web ui log

2016-01-26 Thread Philip Lee
Yes, I tried it, but it simply does not work. So my concern is to use an ssh tunnel to forward a port of the cluster to a localhost port. But in the Spark UI there are two ports which I should forward using the ssh tunnel. Considering the default ports, 8080 is the web UI port to access the web UI, and 4040 is

Re: Spark task hangs infinitely when accessing S3 from AWS

2016-01-26 Thread aecc
Sorry, I have not been able to solve the issue. I used speculation mode as a workaround to this.

Re: Spark task hangs infinitely when accessing S3 from AWS

2016-01-26 Thread Erisa Dervishi
Hi, I am kind of in your situation now while trying to read from S3. Were you able to find a workaround in the end? Thnx, Erisa On Thu, Nov 12, 2015 at 12:00 PM, aecc wrote: > Some other stats: > > The number of files I have in the folder is 48. > The number of partitions used when reading data

Regarding Off-heap memory

2016-01-26 Thread Xiaoyu Ma
Hi all, I saw spark 1.6 has new off-heap settings: spark.memory.offHeap.size. The doc said we need to shrink the on-heap size accordingly. But on YARN, the on-heap size and the YARN limit are set together via spark.executor.memory (JVM opts for memory are not allowed according to the doc), so how can we set the executor JVM

Write to S3 with server side encryption in KMS mode

2016-01-26 Thread Nisrina Luthfiyati
Hi all, I'm trying to save a spark application output to a bucket in S3. The data is supposed to be encrypted with S3's server side encryption using KMS mode, which typically (using java api/cli) would require us to pass the sse-kms key when writing the data. I currently have not found a way to do

Re: Worker's BlockManager Folder not getting cleared

2016-01-26 Thread Adrian Bridgett
Hi Abhi - are you running on Mesos perchance? If so then with spark <1.6 you will be hitting https://issues.apache.org/jira/browse/SPARK-10975 With spark >= 1.6: https://issues.apache.org/jira/browse/SPARK-12430 and also be aware of: https://issues.apache.org/jira/browse/SPARK-12583 On 25/01/2

Re: streaming textFileStream problem - got only ONE line

2016-01-26 Thread Saisai Shao
Any possibility that this file is still being written by another application, so that what Spark Streaming processed is an incomplete file? On Tue, Jan 26, 2016 at 5:30 AM, Shixiong(Ryan) Zhu wrote: > Did you move the file into "hdfs://helmhdfs/user/patcharee/cerdata/", or > write into it directly? `textFile

Re: SparkR works from command line but not from rstudio

2016-01-26 Thread Sandeep Khurana
Resolved this issue after reinstalling R and RStudio. There were issues with the earlier installation. On Jan 22, 2016 6:48 PM, "Sandeep Khurana" wrote: > This problem is fixed by restarting R from R studio. Now see > > 16/01/22 08:08:38 INFO HiveMetaStore: No user is added in admin role, since > config is e
