Re: Error trying to connect to Hive from Spark (Yarn-Cluster Mode)

2016-09-16 Thread Deepak Sharma
Hi Anupama, To me it looks like an issue with the SPN with which you are trying to connect to hive2, i.e. hive@hostname. Are you able to connect to Hive from spark-shell? Try getting the ticket using any other user's keytab (not the hadoop services keytab) and then try running the spark-submit. Thanks

Re: Spark output data to S3 is very slow

2016-09-16 Thread Takeshi Yamamuro
Hi, Have you seen the previous thread? https://www.mail-archive.com/user@spark.apache.org/msg56791.html // maropu On Sat, Sep 17, 2016 at 11:34 AM, Qiang Li wrote: > Hi, > > > I ran some jobs with Spark 2.0 on Yarn, I found all tasks finished very > quickly, but the last step, spark spend lot

Re: JDBC Very Slow

2016-09-16 Thread Takeshi Yamamuro
Hi, It'd be better to set `predicates` in jdbc arguments for loading in parallel. See: https://github.com/apache/spark/blob/branch-1.6/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala#L200 // maropu On Sat, Sep 17, 2016 at 7:46 AM, Benjamin Kim wrote: > I am testing this in s
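A minimal sketch of the predicates variant (Spark 1.6 era sqlContext; the table, column and connection details below are made up for illustration):

  import java.util.Properties

  val props = new Properties()
  props.setProperty("user", "dbuser")        // hypothetical credentials
  props.setProperty("password", "secret")

  // Each predicate becomes the WHERE clause of one partition, so the table
  // is scanned by four concurrent tasks instead of a single one.
  val predicates = Array(
    "id >= 0       AND id < 1000000",
    "id >= 1000000 AND id < 2000000",
    "id >= 2000000 AND id < 3000000",
    "id >= 3000000")

  val df = sqlContext.read.jdbc(
    "jdbc:postgresql://dbhost:5432/mydb",    // hypothetical URL
    "my_table",
    predicates,
    props)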

Spark output data to S3 is very slow

2016-09-16 Thread Qiang Li
Hi, I ran some jobs with Spark 2.0 on Yarn. I found all tasks finished very quickly, but in the last step Spark spends lots of time renaming or moving data from the S3 temporary directory to the real directory, so I tried to set spark.hadoop.spark.sql.parquet.output.committer.class=org.apache.spark.sql.exec
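For reference, a hedged sketch of the commonly suggested setting for this (assumes Hadoop 2.7+; it shortens the rename phase, though S3 renames are still copies, so it does not remove the cost entirely):

  import org.apache.spark.SparkConf

  // Algorithm version 2 commits task output straight to the final location
  // instead of renaming everything again at job commit time.
  val conf = new SparkConf()
    .set("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2")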

feasibility of ignite and alluxio for interfacing MPI and Spark

2016-09-16 Thread AlexG
Do Ignite and Alluxio offer reasonable means of transferring data, in memory, from Spark to MPI? A straightforward way to transfer data is to use piping, but unless you have MPI processes running in a one-to-one mapping to the Spark partitions, this will require some complicated logic to get working (

Re: spark streaming kafka connector questions

2016-09-16 Thread 毅程
Thanks, that is what I was missing. I have added cache before the action, and that 2nd processing is avoided. 2016-09-10 5:10 GMT-07:00 Cody Koeninger: > Hard to say without seeing the code, but if you do multiple actions on an > Rdd without caching, the Rdd will be computed multiple times. > > On Se
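A minimal illustration of Cody's point, assuming a spark-shell sc:

  val parsed = sc.parallelize(1 to 1000000)
    .map(n => n * 2)            // stands in for an expensive transformation
  parsed.cache()

  val total  = parsed.count()   // first action: computes the map and fills the cache
  val sample = parsed.take(5)   // later actions are served from the cache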

Re: JDBC Very Slow

2016-09-16 Thread Benjamin Kim
I am testing this in spark-shell. I am following the Spark documentation by simply adding the PostgreSQL driver to the Spark Classpath. SPARK_CLASSPATH=/path/to/postgresql/driver spark-shell Then, I run the code below to connect to the PostgreSQL database to query. This is when I have problems.

Re: JDBC Very Slow

2016-09-16 Thread Nikolay Zhebet
Hi! Can you split the init code from the current command? I think it is the main problem in your code. On 16 Sep 2016 at 8:26 PM, "Benjamin Kim" wrote: > Has anyone using Spark 1.6.2 encountered very slow responses from pulling > data from PostgreSQL using JDBC? I can get to the table and see the

Re: Apache Spark 2.0.0 on Microsoft Windows Create Dataframe

2016-09-16 Thread Jacek Laskowski
Hi Advait, It's due to https://issues.apache.org/jira/browse/SPARK-15565. See http://stackoverflow.com/a/38945867/1305344 for a solution (it boils down to setting spark.sql.warehouse.dir). Upvote if it works for you. Thanks! Regards, Jacek Laskowski https://medium.com/@jaceklaskowski/ Mastering Apach
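A sketch of the workaround, assuming the Spark 2.0 SparkSession API (the path is only an example): point spark.sql.warehouse.dir at an explicit file: URI so Spark does not build an invalid relative URI on Windows.

  import org.apache.spark.sql.SparkSession

  val spark = SparkSession.builder()
    .master("local[*]")
    .config("spark.sql.warehouse.dir", "file:///C:/tmp/spark-warehouse")
    .getOrCreate()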

Apache Spark 2.0.0 on Microsoft Windows Create Dataframe

2016-09-16 Thread Advait Mohan Raut
Hi, I am trying to run Spark 2.0.0 in the Microsoft Windows environment without Hadoop or Hive. I am running it in local mode, i.e. cmd> spark-shell, and can run the shell. When I try to run the sample example here

Re: Error trying to connect to Hive from Spark (Yarn-Cluster Mode)

2016-09-16 Thread Mich Talebzadeh
Is your Hive Thrift Server (jdbc:hive2) up and running on port 10001? Do the following: netstat -alnp | grep 10001 and see whether it is actually running. HTH Dr Mich Talebzadeh LinkedIn: https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

Error trying to connect to Hive from Spark (Yarn-Cluster Mode)

2016-09-16 Thread anupama . gangadhar
Hi, I am trying to connect to Hive from a Spark application in a Kerberized cluster and get the following exception. Spark version is 1.4.1 and Hive is 1.2.1. Outside of Spark the connection goes through fine. Am I missing any configuration parameters? java.sql.SQLException: Could not open connecti
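For comparison, a hedged sketch of a plain Kerberized HiveServer2 JDBC connection (hostname and realm are placeholders); the principal in the URL is the HiveServer2 service principal, and a valid user ticket must already be in the cache:

  import java.sql.DriverManager

  Class.forName("org.apache.hive.jdbc.HiveDriver")
  // kinit with a user keytab first; the driver picks the ticket up from the cache.
  val conn = DriverManager.getConnection(
    "jdbc:hive2://hiveserver.example.com:10000/default;principal=hive/_HOST@EXAMPLE.COM")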

Re: How PolynomialExpansion works

2016-09-16 Thread Sean Owen
The result includes, essentially, all the terms in (x+y) and (x+y)^2, and so on up if you chose a higher power. It is not just the second-degree terms. On Fri, Sep 16, 2016 at 7:43 PM, Nirav Patel wrote: > Doc says: > > Take a 2-variable feature vector as an example: (x, y), if we want to expand
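A small worked example, assuming the Spark 2.x ml API in spark-shell: for (x, y) = (2, 3) and degree 2 the expansion is (x, x*x, y, x*y, y*y) = (2, 4, 3, 6, 9), i.e. every term up to degree 2, not only the squares.

  import org.apache.spark.ml.feature.PolynomialExpansion
  import org.apache.spark.ml.linalg.Vectors

  val df = spark.createDataFrame(Seq(Tuple1(Vectors.dense(2.0, 3.0)))).toDF("features")

  val poly = new PolynomialExpansion()
    .setInputCol("features")
    .setOutputCol("polyFeatures")
    .setDegree(2)

  // Expected row: [2.0, 4.0, 3.0, 6.0, 9.0]
  poly.transform(df).select("polyFeatures").show(false)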

How PolynomialExpansion works

2016-09-16 Thread Nirav Patel
Doc says: Take a 2-variable feature vector as an example: (x, y), if we want to expand it with degree 2, then we get (x, x * x, y, x * y, y * y). I know the polynomial expansion (x+y)^2 = x^2 + 2xy + y^2, but can't relate it to the above. Thanks

JDBC Very Slow

2016-09-16 Thread Benjamin Kim
Has anyone using Spark 1.6.2 encountered very slow responses from pulling data from PostgreSQL using JDBC? I can get to the table and see the schema, but when I do a show, it takes very long or keeps timing out. The code is simple. val jdbcDF = sqlContext.read.format("jdbc").options( Map("u

Re: Issues while running MLlib matrix factorization ALS algorithm

2016-09-16 Thread Roshani Nagmote
Hello, Thanks for your reply. Yes, it's the Netflix dataset. And when I get "no space left on device", my '/mnt' directory gets filled up; I checked. /usr/lib/spark/bin/spark-submit --deploy-mode cluster --master yarn --class org.apache.spark.examples.mllib.MovieLensALS --jars /usr/lib/spark/examples/ja

Re: App works, but executor state is "killed"

2016-09-16 Thread satishl
Any solutions for this? Spark version: 1.4.1, running in standalone mode. All my applications complete successfully, but the Spark master UI shows the executors in KILLED status. Is it just a UI bug or are my executors actually KILLED?

Re: Issues while running MLlib matrix factorization ALS algorithm

2016-09-16 Thread Sean Owen
Oh this is the netflix dataset right? I recognize it from the number of users/items. It's not fast on a laptop or anything, and takes plenty of memory, but succeeds. I haven't run this recently but it worked in Spark 1.x. On Fri, Sep 16, 2016 at 5:13 PM, Roshani Nagmote wrote: > I am also surpris

Re: Issues while running MLlib matrix factorization ALS algorithm

2016-09-16 Thread Roshani Nagmote
I am also surprised that I face these problems with a fairly small dataset on 14 m4.2xlarge machines. Could you please let me know on which dataset you can run 100 iterations of rank 30 on your laptop? I am currently just trying to run the default example code given with Spark to run ALS on movie len

Re: Spark metrics when running with YARN?

2016-09-16 Thread Vladimir Tretyakov
Hello. I found that there is also a Spark metrics sink, MetricsServlet, which is enabled by default: https://apache.googlesource.com/spark/+/refs/heads/master/core/src/main/scala/org/apache/spark/metrics/MetricsConfig.scala#40 Tried these URLs: On master: http://localhost:8080/metrics/master/json/ htt

Re: Re: it does not stop at breakpoints which is in an anonymous function

2016-09-16 Thread Dirceu Semighini Filho
No, that's not the right way of doing it. Remember that RDD operations are lazy, for performance reasons. Whenever you call one of those action methods (count, reduce, collect, ...), they will execute all the functions that you have applied to create that RDD. It would help if you could post your c
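A tiny sketch of the laziness being described (spark-shell sc assumed): the closure passed to map only runs when the action at the end is called, which is why a breakpoint inside it appears to be skipped.

  val doubled = sc.parallelize(1 to 10).map { n =>
    n * 2                     // a breakpoint here is hit only once an action runs
  }
  // Nothing has executed yet; this reduce is the action that triggers the map closure.
  val sum = doubled.reduce(_ + _)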

Re: Re: it does not stop at breakpoints which is in an anonymous function

2016-09-16 Thread Dirceu Semighini Filho
Sorry, it wasn't the count, it was the reduce method that retrieves information from the RDD. It has to go through all the RDD values to return the result. 2016-09-16 11:18 GMT-03:00 chen yong: > Dear Dirceu, > > > I am totally confused. In your reply you mentioned "...the count does > that, .

Re: it does not stop at breakpoints which is in an anonymous function

2016-09-16 Thread chen yong
Also, I wonder what is the right way to debug a Spark program. If I use ten anonymous functions in one Spark program, then for debugging each of them I have to place a COUNT action in advance and remove it after debugging. Is that the right way? From: Dirceu Semig

Re: it does not stop at breakpoints which is in an anonymous function

2016-09-16 Thread chen yong
Dear Dirceu, I am totally confused. In your reply you mentioned "...the count does that, ...". However, in the code snippet shown in the attachment file FelixProblem.png of your previous mail, I cannot find where any 'count' ACTION is called. Would you please clearly show me the line it is whi

Re: Missing output partition file in S3

2016-09-16 Thread Tracy Li
Sent from my iPhone > On Sep 15, 2016, at 1:37 PM, Chen, Kevin wrote: > > Hi, > > Has any one encountered an issue of missing output partition file in S3 ? My > spark job writes output to a S3 location. Occasionally, I noticed one > partition file is missing. As a result, one chunk of data

Spark can't connect to secure phoenix

2016-09-16 Thread Ashish Gupta
Hi All, I am running a Spark program on a secured cluster which creates a SQLContext for creating a dataframe over a Phoenix table. When I run my program in local mode with the --master option set to local[2], my program works completely fine; however, when I try to run the same program with the master option set t

Re: Re: Re: Re: Re: it does not stop at breakpoints which is in an anonymous function

2016-09-16 Thread Dirceu Semighini Filho
Hello Felix, No, this line isn't the one that triggers the execution of the function; the count does that, unless your count val is a lazy val. The count method is the one that retrieves the information of the RDD; it has to go through all of its data to determine how many records the RDD has

Hive api vs Dataset api

2016-09-16 Thread igor.berman
Hi, I wanted to understand if there is any other advantage besides API syntax when using the Hive/table API vs. the Dataset API in Spark SQL (v2.0)? Any additional optimizations, maybe? I'm most interested in Parquet partitioned tables stored on S3. Is there any difference if I'm comfortable with the Dataset api

Re: Missing output partition file in S3

2016-09-16 Thread Igor Berman
are you using speculation? On 15 September 2016 at 21:37, Chen, Kevin wrote: > Hi, > > Has any one encountered an issue of missing output partition file in S3 ? > My spark job writes output to a S3 location. Occasionally, I noticed one > partition file is missing. As a result, one chunk of data
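For context, a hedged note on why speculation matters here: speculative execution launches duplicate attempts of slow tasks, and combined with a non-atomic S3 commit that is a common cause of missing or clobbered part files, so it is usually switched off for S3 output jobs.

  import org.apache.spark.SparkConf

  val conf = new SparkConf()
    .set("spark.speculation", "false")   // avoid duplicate task attempts writing to S3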

Re: Impersonate users using the same SparkContext

2016-09-16 Thread Steve Loughran
> On 16 Sep 2016, at 04:43, gsvigruha wrote: > > Hi, > > is there a way to impersonate multiple users using the same SparkContext > (e.g. like this > https://hadoop.apache.org/docs/r2.7.2/hadoop-project-dist/hadoop-common/Superusers.html) > when going through the Spark API? > > What I'd lik
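A sketch of the Hadoop-level proxy-user pattern the linked page describes (the user name and path are made up; it assumes the login user is allowed to impersonate). Note this wraps individual filesystem calls; it does not re-authenticate an already running SparkContext.

  import java.security.PrivilegedExceptionAction
  import org.apache.hadoop.fs.{FileSystem, Path}
  import org.apache.hadoop.security.UserGroupInformation

  val ugi = UserGroupInformation.createProxyUser("alice", UserGroupInformation.getLoginUser)
  ugi.doAs(new PrivilegedExceptionAction[Unit] {
    override def run(): Unit = {
      // filesystem access inside run() is performed as "alice"
      val fs = FileSystem.get(sc.hadoopConfiguration)
      fs.listStatus(new Path("/user/alice")).foreach(s => println(s.getPath))
    }
  })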

Re: Spark Streaming-- for each new file in HDFS

2016-09-16 Thread Steve Loughran
On 16 Sep 2016, at 01:03, Peyman Mohajerian <mohaj...@gmail.com> wrote: You can listen to files in a specific directory. Take a look at: http://spark.apache.org/docs/latest/streaming-programming-guide.html (streamingContext.fileStream). Yes, this works; here's an example I'm using
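A minimal sketch of the directory-watching stream (the path and batch interval are illustrative):

  import org.apache.spark.SparkConf
  import org.apache.spark.streaming.{Seconds, StreamingContext}

  val ssc = new StreamingContext(
    new SparkConf().setMaster("local[2]").setAppName("dir-watch"), Seconds(30))

  // Every file that lands in the directory after the stream starts becomes part of a batch.
  val lines = ssc.textFileStream("hdfs:///incoming/dir")
  lines.foreachRDD(rdd => println(s"new lines this batch: ${rdd.count()}"))

  ssc.start()
  ssc.awaitTermination()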

Re: Missing output partition file in S3

2016-09-16 Thread Steve Loughran
On 15 Sep 2016, at 19:37, Chen, Kevin <kevin.c...@neustar.biz> wrote: Hi, Has any one encountered an issue of missing output partition file in S3 ? My spark job writes output to a S3 location. Occasionally, I noticed one partition file is missing. As a result, one chunk of data was los

Re: Spark SQL - Applying transformation on a struct inside an array

2016-09-16 Thread Olivier Girardot
Hi Michael, Well, for nested structs I saw in the tests the behaviour defined by SPARK-12512 for the "a.b.c" handling in withColumn, and even if it's not ideal for me, I managed to make it work anyway like that: > df.withColumn("a", struct(struct(myUDF(df("a.b.c." // I didn't put back the aliase

Re: very slow parquet file write

2016-09-16 Thread tosaigan...@gmail.com
Hi, try this conf val sc = new SparkContext(conf) sc.hadoopConfiguration.setBoolean("parquet.enable.summary-metadata", false) Regards, Sai Ganesh On Thu, Sep 15, 2016 at 11:34 PM, gaurav24 [via Apache Spark User List] < ml-node+s1001560n27738...@n3.nabble.com> wrote: > Hi Rok, > > facing sim

Re: Best way to present data collected by Flume through Spark

2016-09-16 Thread Mich Talebzadeh
Hi Sean, At the moment I am using Zeppelin with Spark SQL to get data from Hive. So any connection here for visualisation has to be through this sort of API. I know Tableau only uses SQL. Zeppelin can use Spark SQL directly or through the Spark Thrift Server. The question is a user may want to create

Re: Issues while running MLlib matrix factorization ALS algorithm

2016-09-16 Thread Sean Owen
You may have to decrease the checkpoint interval to say 5 if you're getting StackOverflowError. You may have a particularly deep lineage being created during iterations. No space left on device means you don't have enough local disk to accommodate the big shuffles in some stage. You can add more d
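A sketch of the two knobs being suggested, using the mllib ALS API (paths, file format, rank and iteration counts are illustrative); a checkpoint directory must be set for the interval to have any effect:

  import org.apache.spark.mllib.recommendation.{ALS, Rating}

  sc.setCheckpointDir("hdfs:///tmp/als-checkpoints")

  val ratings = sc.textFile("hdfs:///data/ratings.csv").map { line =>
    val Array(u, m, r) = line.split(",")   // hypothetical "user,movie,rating" format
    Rating(u.toInt, m.toInt, r.toDouble)
  }

  val model = new ALS()
    .setRank(30)
    .setIterations(100)
    .setCheckpointInterval(5)              // truncate the lineage every 5 iterations
    .run(ratings)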

Re: countApprox

2016-09-16 Thread Sean Owen
countApprox gives the best answer within some timeout. Is it possible that 1 ms is more than enough to count this exactly? Then the confidence wouldn't matter. Although that seems way too fast; you're counting ranges whose values don't actually matter, and maybe the Python side is smart enough to us
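For reference, a sketch of the Scala API being discussed (the thread's code was presumably PySpark, where countApprox hands back the number directly):

  val rdd = sc.parallelize(1L to 10000000L)

  // Returns a PartialResult; initialValue is whatever bound was reached when the
  // 1 ms timeout expired, as a BoundedDouble (mean, confidence, low, high).
  val approx = rdd.countApprox(1, 0.95)
  println(approx.initialValue)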

Re: Best way to present data collected by Flume through Spark

2016-09-16 Thread Sean Owen
Why Hive, and why precompute data at 15-minute latency? There are several ways to query the source data directly with no extra step or latency. Even Spark SQL is real-time-ish for queries on the source data, and Impala (or heck, Drill etc.) are. On Thu, Sep 15, 2016 at 10:56 PM, Mich Talebz

RE: Spark processing Multiple Streams from a single stream

2016-09-16 Thread ayan guha
In fact you can use an RDD as well, using queueStream, but it is intended for testing, as per the documentation. On 16 Sep 2016 17:44, "ayan guha" wrote: > RDD no. File yes, using fileStream. But fileStream does not support > replay, I think. You need to manage checkpointing yourself. > On 16 Sep 2016 16:56,
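A minimal sketch of the queueStream route mentioned above (the docs do position it as a testing utility, since it cannot be checkpointed):

  import scala.collection.mutable
  import org.apache.spark.SparkConf
  import org.apache.spark.rdd.RDD
  import org.apache.spark.streaming.{Seconds, StreamingContext}

  val ssc = new StreamingContext(
    new SparkConf().setMaster("local[2]").setAppName("queue-demo"), Seconds(1))

  val rddQueue = new mutable.Queue[RDD[Int]]()
  ssc.queueStream(rddQueue).map(_ * 2).print()

  ssc.start()
  rddQueue += ssc.sparkContext.makeRDD(1 to 100)   // push RDDs into the stream as they arrive
  Thread.sleep(3000)
  ssc.stop()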

RE: Spark processing Multiple Streams from a single stream

2016-09-16 Thread ayan guha
RDD no. File yes, using fileStream. But fileStream does not support replay, I think. You need to manage checkpointing yourself. On 16 Sep 2016 16:56, "Udbhav Agarwal" wrote: > That sounds great. Thanks. > > Can I assume that the source for a stream in Spark can only be some external > source like Kafka