how do I set TBLPROPERTIES in dataFrame.saveAsTable()?

2016-06-15 Thread Yang
I tried df.options(Map(prop_name -> prop_value)).saveAsTable(tb_name), but it doesn't seem to work. thanks a lot!
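
DataFrameWriter.options() passes data-source options rather than Hive table properties, so one common workaround is to set the properties with a SQL statement instead. A minimal sketch, assuming a Hive-enabled context (HiveContext in 1.x, SparkSession in 2.x); "note.owner" is a made-up property name used only for illustration:

    // set properties after writing the table
    df.write.saveAsTable("tb_name")
    sqlContext.sql("ALTER TABLE tb_name SET TBLPROPERTIES ('note.owner' = 'yang')")

    // or create the table up front with the properties and insert into it
    // (column list is a placeholder and must match df's schema)
    sqlContext.sql("""
      CREATE TABLE tb_name (id INT, name STRING)
      STORED AS PARQUET
      TBLPROPERTIES ('note.owner' = 'yang')
    """)
    df.write.insertInto("tb_name")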

how to start reading the spark source code?

2015-07-19 Thread Yang
ould read each of them in turn. thanks! yang

Re: how to start reading the spark source code?

2015-07-19 Thread Yang
hy you started with such an early commit. > > Spark project has evolved quite fast. > > I suggest you clone Spark project from github.com/apache/spark/ and start > with core/src/main/scala/org/apache/spark/rdd/RDD.scala > > Cheers > > On Sun, Jul 19, 2015 at 7:44 PM, Yang

Re: how to start reading the spark source code?

2015-07-20 Thread Yang
= logData.filter(line => line.contains("b")).count() println("Lines with a: %s, Lines with b: %s".format(numAs, numBs)) } } then I debug through this and it became fairly clear On Sun, Jul 19, 2015 at 10:13 PM, Yang wrote: > thanks, my point is that earlie
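
The snippet above is the standard quick-start word-filter app; a complete minimal version is sketched below (single-threaded local master, which makes it easy to step through in a debugger; any local text file works as input):

    import org.apache.spark.{SparkConf, SparkContext}

    object SimpleApp {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("simple").setMaster("local[1]"))
        val logData = sc.textFile("README.md").cache()
        val numAs = logData.filter(line => line.contains("a")).count()
        val numBs = logData.filter(line => line.contains("b")).count()
        println("Lines with a: %s, Lines with b: %s".format(numAs, numBs))
        sc.stop()
      }
    }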

Re: how to start reading the spark source code?

2015-07-20 Thread Yang
ation (Task[] ) through serialization. On Mon, Jul 20, 2015 at 12:38 AM, Yang wrote: > ok got some headstart: > > pull the git source to 14719b93ff4ea7c3234a9389621be3c97fa278b9 (first > release so that I could at least build it) > > then build it according to README.md, &

can mllib Logistic Regression package handle 10 million sparse features?

2016-10-05 Thread Yang
anybody had actual experience applying it to real problems of this scale? thanks
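
The mllib API itself accepts arbitrary-dimensional sparse vectors; whether it scales to 10 million features is exactly the question in this thread (a later reply notes dense intermediate arrays in the current implementation). For reference, a minimal sketch of how the API is typically driven, where rawData is a placeholder RDD of (label, indices, values) triples:

    import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS
    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.regression.LabeledPoint

    val numFeatures = 10000000
    val training = rawData.map { case (label, indices, values) =>
      LabeledPoint(label, Vectors.sparse(numFeatures, indices, values))
    }.cache()

    val model = new LogisticRegressionWithLBFGS()
      .setNumClasses(2)
      .run(training)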

question on the structured DataSet API join

2016-10-17 Thread Yang
I'm trying to use the joinWith() method instead of join() since the former provides type checked result while the latter is a straight DataFrame. the signature is DataSet[(T,U)] joinWith(other:DataSet[U], col:Column) here the second arg, col:Column is normally provided by other.col("col_name")
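
A minimal joinWith sketch, assuming Spark 2.x with spark.implicits._ in scope (Order/Payment are placeholder types). The result is typed, but as the question points out, the join condition is still an ordinary Column, so the column name is not checked at compile time:

    import org.apache.spark.sql.Dataset

    case class Order(id: Long, item: String)
    case class Payment(orderId: Long, amount: Double)

    val orders: Dataset[Order] = Seq(Order(1, "book"), Order(2, "pen")).toDS()
    val payments: Dataset[Payment] = Seq(Payment(1, 9.5)).toDS()

    // typed result: Dataset[(Order, Payment)]
    val joined: Dataset[(Order, Payment)] =
      orders.joinWith(payments, orders("id") === payments("orderId"))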

previous stage results are not saved?

2016-10-17 Thread Yang
experiment while making small changes to the code. any idea what part of the spark framework might have caused this ? thanks Yang

question about the new Dataset API

2016-10-18 Thread Yang
scala> val a = sc.parallelize(Array((1,2),(3,4))) a: org.apache.spark.rdd.RDD[(Int, Int)] = ParallelCollectionRDD[243] at parallelize at <console>:38 scala> val a_ds = hc.di.createDataFrame(a).as[(Long,Long)] a_ds: org.apache.spark.sql.Dataset[(Long, Long)] = [_1: int, _2: int] scala> a_ds.agg(typed.count
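
Note that the printed schema still shows int columns even after .as[(Long,Long)]. A sketch of one common way to use the typed aggregators that sidesteps that mismatch by building the Dataset with Long values directly (assuming Spark 2.x and spark.implicits._ in scope):

    import org.apache.spark.sql.expressions.scalalang.typed

    val a_ds = Seq((1L, 2L), (3L, 4L)).toDS()

    // select with TypedColumns keeps the result typed: Dataset[(Long, Long)]
    val counted = a_ds.select(
      typed.count[(Long, Long)](_._1),
      typed.sumLong[(Long, Long)](_._2))
    counted.show()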

Re: question about the new Dataset API

2016-10-18 Thread Yang
On Tue, Oct 18, 2016 at 11:30 PM, Yang wrote: > scala> val a = sc.parallelize(Array((1,2),(3,4))) > a: org.apache.spark.rdd.RDD[(Int, Int)] = ParallelCollectionRDD[243] at > parallelize at <console>:38 > > scala> val a_ds = hc.di.createDataFrame(a).as[(Long,Long)] > a_ds: org.

Re: can mllib Logistic Regression package handle 10 million sparse features?

2016-10-19 Thread Yang
ation time). >> >> > >> >> > Note that the current impl forces dense arrays for intermediate data >> >> > structures, increasing the communication cost significantly. See this >> PR for >> >> > info: https://github.com/apache/spark/pull/12761. Onc

RDD groupBy() then random sort each group ?

2016-10-20 Thread Yang
in my application, I group the training samples by their model_id's (the input table contains training samples for 100k different models); each group ends up having about 1 million training samples, which I then feed to a little Logistic Regression solver (SGD), but SGD r

Re: RDD groupBy() then random sort each group ?

2016-10-23 Thread Yang
uot;id" % 10 with the key > to group by, then you can get the RDD from shuffled and do the following > operations you want. > > Cheng > > > > On 10/20/16 10:53 AM, Yang wrote: > >> in my application, I group by same training samples by their model_id's

Re: RDD groupBy() then random sort each group ?

2016-10-23 Thread Yang
for large groups. > > The key is to never materialize the grouped and shuffled data. > > To see one approach to do this take a look at > https://github.com/tresata/spark-sorted > > It's basically a combination of smart partitioning and secondary sort. > > On Oct 20, 20
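
A sketch of the smart-partitioning / secondary-sort idea described in this reply: key by (modelId, random tie-breaker), partition on modelId only, and sort within partitions, so no group is ever materialized. Assumes Spark 1.6+/2.x; Sample and the example data are placeholders:

    import org.apache.spark.{HashPartitioner, Partitioner}
    import scala.util.Random

    case class Sample(modelId: Int, features: Array[Double])   // placeholder record type

    // partitioner that looks only at the modelId part of the composite key
    class ModelIdPartitioner(partitions: Int) extends Partitioner {
      private val delegate = new HashPartitioner(partitions)
      override def numPartitions: Int = partitions
      override def getPartition(key: Any): Int =
        delegate.getPartition(key.asInstanceOf[(Int, Double)]._1)
    }

    val samples = sc.parallelize(Seq(Sample(1, Array(0.1)), Sample(2, Array(0.2))))  // placeholder data

    // composite key: (modelId, random); partition by modelId, sort by the full key
    val keyed = samples.map(s => ((s.modelId, Random.nextDouble()), s))
    val shuffled = keyed.repartitionAndSortWithinPartitions(new ModelIdPartitioner(200))
    // within each partition, records for a given modelId are now contiguous and in
    // random order, so they can be streamed into a per-model SGD solver without
    // ever materializing a whole group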

task not serializable in case of groupByKey() + mapGroups + map?

2016-10-31 Thread Yang
with the following simple code val a = sc.createDataFrame(sc.parallelize(Seq((1,2),(3,4)))).as[(Int,Int)] val grouped = a.groupByKey({x:(Int,Int)=>x._1}) val mappedGroups = grouped.mapGroups((k,x)=>{(k,1)}) val yyy = sc.broadcast(1) val last = mappedGroups.rdd.map(xx=>{

type-safe join in the new DataSet API?

2016-11-10 Thread Yang
the new DataSet API is supposed to provide type safety and type checks at compile time https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#join-operations It does this indeed for a lot of places, but I found it still doesn't have a type safe join: val ds1 = hc.sql("se
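
One way to get a fully compile-checked join (at the cost of bypassing Catalyst's optimized join strategies such as broadcast or sort-merge joins) is to go through groupByKey/cogroup on the typed API. A minimal sketch assuming Spark 2.x with spark.implicits._ in scope; User and Purchase are placeholder types:

    import org.apache.spark.sql.Dataset

    case class User(id: Long, name: String)
    case class Purchase(userId: Long, amount: Double)

    val users: Dataset[User] = Seq(User(1, "a"), User(2, "b")).toDS()
    val purchases: Dataset[Purchase] = Seq(Purchase(1, 10.0)).toDS()

    // key functions and the output type are checked at compile time; no string column names
    val joined: Dataset[(User, Purchase)] =
      users.groupByKey(_.id).cogroup(purchases.groupByKey(_.userId)) { (_, us, ps) =>
        val right = ps.toSeq                       // iterators can only be traversed once
        us.flatMap(u => right.map(p => (u, p)))
      }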

spark-shell fails to redefine values

2016-12-21 Thread Yang
summary: Spark-shell fails to redefine values in some cases, this is at least found in a case where "implicit" is involved, but not limited to such cases run the following in spark-shell, u can see that the last redefinition does not take effect. the same code runs in plain scala REPL without prob

L1 regularized Logistic regression ?

2017-01-04 Thread Yang
does mllib support this? I do see Lasso impl here https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/regression/Lasso.scala if it supports LR , could you please show me a link? what algorithm does it use? thanks

Re: L1 regularized Logistic regression ?

2017-01-04 Thread Yang
logistic-regression > > You'd set elasticnetparam = 1 for Lasso > > On Wed, Jan 4, 2017 at 7:13 PM, Yang wrote: > >> does mllib support this? >> >> I do see Lasso impl here https://github.com/apache >> /spark/blob/master/mllib/src/main/scala/org/apache/spark/
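
A minimal sketch of the suggestion above: spark.ml LogisticRegression with elasticNetParam = 1.0 gives pure L1 (Lasso-style) regularization. Assumes Spark 1.6+/2.x and a placeholder DataFrame `training` with "label" (Double) and "features" (Vector) columns:

    import org.apache.spark.ml.classification.LogisticRegression

    val lr = new LogisticRegression()
      .setElasticNetParam(1.0)   // 1.0 = pure L1 (Lasso); 0.0 would be pure L2 (ridge)
      .setRegParam(0.01)         // overall regularization strength
      .setMaxIter(100)

    val model = lr.fit(training)
    println(s"non-zero coefficients: ${model.coefficients.numNonzeros}")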

how to mark a (bean) class with schema for catalyst ?

2017-05-09 Thread Yang
I'm trying to use Encoders.bean() to create an encoder for my custom class, but it fails, complaining that it can't find the schema: class Person4 { @scala.beans.BeanProperty def setX(x:Int): Unit = {} @scala.beans.BeanProperty def getX():Int = {1} } val personEncoder = Encoders.bean[ Person4](clas

Re: how to mark a (bean) class with schema for catalyst ?

2017-05-09 Thread Yang
uples. This way when I encode the wrapper, the bean encoder simply encodes the getContent() output, I think. encoding a list of tuples is very fast. Yang On Tue, May 9, 2017 at 11:19 AM, Michael Armbrust wrote: > I think you are supposed to set BeanProperty on a var as they do here

Re: how to mark a (bean) class with schema for catalyst ?

2017-05-09 Thread Yang
thub.com/apache/spark/blob/f830bb9170f6b853565d9dd30ca7418b93a54fe3/mllib/src/main/scala/org/apache/spark/mllib/tree/configuration/Strategy.scala#L71-L83>. > If you are using scala though I'd consider using the case class encoders. > > On Tue, May 9, 2017 at 12:21 AM, Yang wrote: > >
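
A minimal sketch of the two routes suggested in this thread: @BeanProperty on a var for the bean encoder (per the later replies this works in Spark 2.1), or a plain case class, whose schema is derived automatically. Assumes spark.implicits._ is in scope for the case class example:

    // bean route: BeanProperty on a var generates the getter/setter pair the bean encoder looks for
    class Person4 {
      @scala.beans.BeanProperty var x: Int = 0
    }
    val personEncoder = org.apache.spark.sql.Encoders.bean(classOf[Person4])

    // case class route: schema is derived via the built-in product encoder
    case class Person5(x: Int)
    val ds = Seq(Person5(1), Person5(2)).toDS()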

Re: how to mark a (bean) class with schema for catalyst ?

2017-05-09 Thread Yang
2.0.2 with scala 2.11 On Tue, May 9, 2017 at 11:30 AM, Michael Armbrust wrote: > Which version of Spark? > > On Tue, May 9, 2017 at 11:28 AM, Yang wrote: > >> actually with var it's the same: >> >> >> scala> class Person4 { >> |

Re: how to mark a (bean) class with schema for catalyst ?

2017-05-09 Thread Yang
he/spark/mllib/tree/configuration/Strategy.scala#L71-L83>. > If you are using scala though I'd consider using the case class encoders. > > On Tue, May 9, 2017 at 12:21 AM, Yang wrote: > >> I'm trying to use Encoders.bean() to create an encoder for my custom >> cla

Re: how to mark a (bean) class with schema for catalyst ?

2017-05-09 Thread Yang
4027ec902e239c93eaaa8714f173bcfc/1023043053387187/908554720841389/2840265927289860/latest.html> > in > Spark 2.1. > > On Tue, May 9, 2017 at 12:10 PM, Yang wrote: > >> somehow the schema check is here >> >> https://github.com/apache/spark/blob/master/sql/catalyst

RowMatrix PCA out of heap space error

2014-10-13 Thread Yang
I got this error when trying to perform PCA on a sparse matrix, each row has a nominal length of 8000, and there are 36k rows. each row has on average 3 elements being non-zero. I guess the total size is not that big. Exception in thread "main" java.lang.OutOfMemoryError: Java heap space at java.
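
For reference, a minimal RowMatrix PCA sketch (assuming Spark 1.x mllib). Note that computePrincipalComponents builds an n x n dense Gramian for the SVD (8000 x 8000 doubles is roughly 0.5 GB here, regardless of how sparse the input rows are), so driver heap is usually what matters; the rows below are placeholders:

    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.linalg.distributed.RowMatrix

    // placeholder sparse rows: 8000-dimensional, ~3 non-zeros each
    val rows = sc.parallelize(Seq(
      Vectors.sparse(8000, Array(1, 100, 4000), Array(1.0, 2.0, 3.0)),
      Vectors.sparse(8000, Array(7, 2000, 7999), Array(1.0, 1.0, 1.0))
    ))

    val mat = new RowMatrix(rows)
    val pc = mat.computePrincipalComponents(10)   // 8000 x 10 dense result
    // if this OOMs, raising --driver-memory (spark.driver.memory) is usually the first thing to try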

buffer overflow when running Kmeans

2014-10-21 Thread Yang
this is the stack trace I got with yarn logs -applicationId really no idea where to dig further. thanks! yang 14/10/21 14:36:43 INFO ConnectionManager: Accepted connection from [ phxaishdc9dn1262.stratus.phx.ebay.com/10.115.58.21] 14/10/21 14:36:47 ERROR Executor: Exception in task ID 98

version mismatch issue with spark breeze vector

2014-10-22 Thread Yang
1.0.2 org.scala-lang scala-library 2.10.4 Thanks a lot Yang

how to run a dev spark project without fully rebuilding the fat jar ?

2014-10-22 Thread Yang
during tests, I often modify my code a little bit and want to see the result. but spark-submit requires the full fat-jar, which takes quite a lot of time to build. I just need to run in --master local mode. is there a way to run it without rebuilding the fat jar? thanks Yang
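
One common pattern for local iteration (a sketch, not the only option) is to skip spark-submit entirely and run the main class from sbt or an IDE, with Spark on the compile classpath and the master defaulted to local. DevMain is a hypothetical entry point used only for illustration:

    import org.apache.spark.{SparkConf, SparkContext}

    object DevMain {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf()
          .setAppName("dev-run")
          .setMaster("local[*]")           // hard-wired only for local testing
        val sc = new SparkContext(conf)
        println(sc.parallelize(1 to 100).sum())
        sc.stop()
      }
    }

    // run with: sbt "runMain DevMain"  -- no assembly / fat jar needed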

Anyone interested in Remote Shuffle Service

2020-10-21 Thread bo yang
Hi Spark Users, Uber open sourced Remote Shuffle Service ( https://github.com/uber/RemoteShuffleService ) recently. It works with open source Spark version without code change needed, and could store shuffle data on separate machines other than Spark executors. Anyone interested to try? Also we a

How to submit a job via REST API?

2020-11-23 Thread Zhou Yang
Dear experts, I found a convenient way to submit a job via the REST API at https://gist.github.com/arturmkrtchyan/5d8559b2911ac951d34a#file-submit_job-sh. But I do not know whether I can append a `--conf` parameter like I do in spark-submit. Can someone help me with this issue? Regards, Yang

Re: How to submit a job via REST API?

2020-11-25 Thread Zhou Yang
:55, vaquar khan <vaquar.k...@gmail.com> wrote: Hi Yang, Please find the following link https://stackoverflow.com/questions/63677736/spark-application-as-a-rest-service/63678337#63678337 Regards, Vaquar khan On Wed, Nov 25, 2020 at 12:40 AM Sonal Goyal <sonalgoy...@gmail.com> wrote: You s

RE: Spark UI Storage Memory

2020-12-04 Thread Jack Yang
unsubscribe

Re: Unsubscribe

2021-07-13 Thread Howard Yang
Unsubscribe Eric Wang wrote on Mon, Jul 12, 2021 at 7:31 AM: > Unsubscribe > > On Sun, Jul 11, 2021 at 9:59 PM Rishi Raj Tandon < > tandon.rishi...@gmail.com> wrote: > >> Unsubscribe >> >

Re: Unsubscribe

2021-08-03 Thread Howard Yang
Unsubscribe Edward Wu wrote on Tue, Aug 3, 2021 at 4:15 PM: > Unsubscribe >

[ANNOUNCE] Apache Kyuubi (Incubating) released 1.4.1-incubating

2022-01-30 Thread Vino Yang
Hi all, The Apache Kyuubi (Incubating) community is pleased to announce that Apache Kyuubi (Incubating) 1.4.1-incubating has been released! Apache Kyuubi (Incubating) is a distributed multi-tenant JDBC server for large-scale data processing and analytics, built on top of Apache Spark and designed

Re: [ANNOUNCE] Apache Kyuubi (Incubating) released 1.4.1-incubating

2022-01-30 Thread Vino Yang
apache.org/ Best, Vino Bitfox wrote on Mon, Jan 31, 2022 at 14:49: > > What’s the difference between Spark and Kyuubi? > > Thanks > > On Mon, Jan 31, 2022 at 2:45 PM Vino Yang wrote: >> >> Hi all, >> >> The Apache Kyuubi (Incubating) community is pleased to announce tha

One click to run Spark on Kubernetes

2022-02-22 Thread bo yang
Hi Spark Community, We built an open source tool to deploy and run Spark on Kubernetes with a one click command. For example, on AWS, it could automatically create an EKS cluster, node group, NGINX ingress, and Spark Operator. Then you will be able to use curl or a CLI tool to submit Spark applica

Re: One click to run Spark on Kubernetes

2022-02-22 Thread bo yang
ion of spark? or just the standalone node? > > Thanks > > On Wed, Feb 23, 2022 at 12:06 PM bo yang wrote: > >> Hi Spark Community, >> >> We built an open source tool to deploy and run Spark on Kubernetes with a >> one click command. For example, on AWS, it co

Re: One click to run Spark on Kubernetes

2022-02-22 Thread bo yang
r > about 1 hour. Do you have the SaaS solution for this? I can pay as I did. > > Thanks > > On Wed, Feb 23, 2022 at 12:21 PM bo yang wrote: > >> It is not a standalone spark cluster. In some details, it deploys a Spark >> Operator (https://github.com/GoogleCloudPlatfo

Re: One click to run Spark on Kubernetes

2022-02-22 Thread bo yang
you share link to the source? > > On Wed, Feb 23, 2022, at 6:52, bo yang wrote: > >> We do not have SaaS yet. Now it is an open source project we build in our >> part time, and we welcome more people working together on that. >> >> You could specify cluster s

Re: One click to run Spark on Kubernetes

2022-02-23 Thread bo yang
no case be liable for any monetary damages arising from > such loss, damage or destruction. > > > > > On Wed, 23 Feb 2022 at 04:06, bo yang wrote: > >> Hi Spark Community, >> >> We built an open source tool to deploy and run Spark on Kubernetes with a >>

Re: One click to run Spark on Kubernetes

2022-02-23 Thread bo yang
Guidance is appreciated. > > Sarath > > Sent from my iPhone > > On Feb 23, 2022, at 2:01 AM, bo yang wrote: > >  > > Right, normally people start with simple script, then add more stuff, like > permission and more components. After some time, people want to run the >

Re: One click to run Spark on Kubernetes

2022-02-23 Thread bo yang
Hi Sarath, let's follow up offline on this. On Wed, Feb 23, 2022 at 8:32 AM Sarath Annareddy wrote: > Hi bo > > How do we start? > > Is there a plan? Onboarding, Arch/design diagram, tasks lined up etc > > > Thanks > Sarath > > > Sent from my iPhone

Re: One click to run Spark on Kubernetes

2022-02-23 Thread bo yang
chart to > deploy Spark and some other stuff on K8S? > > On Wed, Feb 23, 2022 at 17:49, bo yang wrote: > >> Hi Sarath, let's follow up offline on this. >> >> On Wed, Feb 23, 2022 at 8:32 AM Sarath Annareddy < >> sarath.annare...@gmail.com> wrote: >&

Reverse proxy for Spark UI on Kubernetes

2022-05-16 Thread bo yang
Hi Spark Folks, I built a web reverse proxy to access Spark UI on Kubernetes (working together with https://github.com/GoogleCloudPlatform/spark-on-k8s-operator). Want to share here in case other people have similar need. The reverse proxy code is here: https://github.com/datapunchorg/spark-ui-re

Re: Reverse proxy for Spark UI on Kubernetes

2022-05-17 Thread bo yang
Thanks Holden :) On Mon, May 16, 2022 at 11:12 PM Holden Karau wrote: > Oh that’s rad 😊 > > On Tue, May 17, 2022 at 7:47 AM bo yang wrote: > >> Hi Spark Folks, >> >> I built a web reverse proxy to access Spark UI on Kubernetes (working >>

Re: Reverse proxy for Spark UI on Kubernetes

2022-05-17 Thread bo yang
to behave like that Web Application Proxy. It will simplify settings to access Spark UI on Kubernetes. On Mon, May 16, 2022 at 11:46 PM wilson wrote: > what's the advantage of using reverse proxy for spark UI? > > Thanks > > On Tue, May 17, 2022 at 1:47 PM bo yang wrote:

Re: Reverse proxy for Spark UI on Kubernetes

2022-05-17 Thread bo yang
Yes, it should be possible, any interest to work on this together? Need more hands to add more features here :) On Tue, May 17, 2022 at 2:06 PM Holden Karau wrote: > Could we make it do the same sort of history server fallback approach? > > On Tue, May 17, 2022 at 10:41 PM bo ya

Re: [EXTERNAL] Re: Re: Stage level scheduling - lower the number of executors when using GPUs

2022-11-03 Thread bo yang
Interesting discussion here; it looks like Spark does not support configuring a different number of executors in different stages. Would love to see the community come out with such a feature. On Thu, Nov 3, 2022 at 9:10 AM Shay Elbaz wrote: > Thanks again Artemis, I really appreciate it. I have watched t
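
For context, Spark 3.1+ added stage-level scheduling, which lets an RDD stage request different executor/task resources (though not directly a different executor count; with dynamic allocation, new executors matching the profile are requested). A minimal sketch assuming Spark 3.1+ on YARN/K8s with dynamic allocation; featuresRdd, runGpuInference and the discovery-script path are placeholders:

    import org.apache.spark.resource.{ExecutorResourceRequests, ResourceProfileBuilder, TaskResourceRequests}

    val execReqs = new ExecutorResourceRequests()
      .cores(4)
      .memory("8g")
      .resource("gpu", 1, "/opt/getGpus.sh")          // hypothetical discovery script
    val taskReqs = new TaskResourceRequests().cpus(1).resource("gpu", 1.0)

    val gpuProfile = new ResourceProfileBuilder()
      .require(execReqs)
      .require(taskReqs)
      .build

    // only the stages computing this RDD run with the GPU profile
    val scored = featuresRdd.withResources(gpuProfile).map(runGpuInference)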

Write Spark Connection client application in Go

2023-09-12 Thread bo yang
Hi Spark Friends, Anyone interested in using Golang to write Spark application? We created a Spark Connect Go Client library . Would love to hear feedback/thoughts from the community. Please see the quick start guide

Re: Write Spark Connection client application in Go

2023-09-14 Thread bo yang
at’s so cool! Great work y’all :) >> >> On Tue, Sep 12, 2023 at 8:14 PM bo yang wrote: >> >>> Hi Spark Friends, >>> >>> Anyone interested in using Golang to write Spark application? We created >>> a Spark Connect Go Client library >>>

Fwd: the life cycle shuffle Dependency

2023-12-27 Thread yang chen
hi, I'm learning spark and wondering when shuffle data gets deleted. I found the ContextCleaner class, which cleans the shuffle data when the shuffle dependency is GC-ed. Based on the source code, the shuffle dependency is GC-ed only when the active job finishes, but I'm not sure. Could you explain the life cycle of a

[ANNOUNCE] Apache Kyuubi released 1.9.0

2024-03-18 Thread Binjie Yang
Hi all, The Apache Kyuubi community is pleased to announce that Apache Kyuubi 1.9.0 has been released! Apache Kyuubi is a distributed and multi-tenant gateway to provide serverless SQL on data warehouses and lakehouses. Kyuubi provides a pure SQL gateway through Thrift JDBC/ODBC interface for en

[Spark on k8s] A issue of k8s resource creation order

2024-05-29 Thread Tao Yang
Hi, team! I have a spark on k8s issue which posts in https://stackoverflow.com/questions/78537132/spark-on-k8s-resource-creation-order Need help

RE: MLLIB - Storing the Trained Model

2015-06-23 Thread Yang, Yuhao
Hi Samsudhin, If possible, can you please provide a part of the code? Or perhaps try with the ut in RandomForestSuite to see if the issue repros. Regards, yuhao -Original Message- From: samsudhin [mailto:samsud...@pigstick.com] Sent: Tuesday, June 23, 2015 2:14 PM To: user@spark.apac
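
For the thread topic (persisting a trained mllib model), a minimal save/load sketch, assuming Spark 1.3+ mllib, a trained RandomForestModel named `model`, and a placeholder HDFS path:

    import org.apache.spark.mllib.tree.model.RandomForestModel

    // `model` was produced by RandomForest.trainClassifier / trainRegressor
    model.save(sc, "hdfs:///models/rf-v1")
    val restored = RandomForestModel.load(sc, "hdfs:///models/rf-v1")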

Re: Unit testing framework for Spark Jobs?

2016-03-02 Thread Yin Yang
Cycling prior bits: http://search-hadoop.com/m/q3RTto4sby1Cd2rt&subj=Re+Unit+test+with+sqlContext On Wed, Mar 2, 2016 at 9:54 AM, SRK wrote: > Hi, > > What is a good unit testing framework for Spark batch/streaming jobs? I > have > core spark, spark sql with dataframes and streaming api getting

Re: Ignore features in Random Forest

2016-06-01 Thread Yuhao Yang
Hi Neha, This looks like a feature engineering task. I think VectorSlicer can help with your case. Please refer to http://spark.apache.org/docs/latest/ml-features.html#vectorslicer . Regards, Yuhao 2016-06-01 21:18 GMT+08:00 Neha Mehta : > Hi, > > I am performing Regression using Random Forest.
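
A minimal VectorSlicer sketch (assuming Spark 1.6+/2.x ml and a placeholder DataFrame `dataset` with a "features" vector column); the indices to keep are placeholders:

    import org.apache.spark.ml.feature.VectorSlicer

    val slicer = new VectorSlicer()
      .setInputCol("features")
      .setOutputCol("selectedFeatures")
      .setIndices(Array(0, 2, 3))        // keep only these positions; the rest are effectively ignored

    val sliced = slicer.transform(dataset)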

MapType in Java unsupported in Spark 1.5

2016-06-07 Thread Baichuan YANG
scala or any other types supported in below link: http://spark.apache.org/docs/latest/sql-programming-guide.html#data-types Or there is no way to do so? Thanks Regards, BaiChuan Yang

Re: OutOfMemoryError - When saving Word2Vec

2016-06-13 Thread Yuhao Yang
Hi Sharad, what's your vocabulary size and vector length for Word2Vec? Regards, Yuhao 2016-06-13 20:04 GMT+08:00 sharad82 : > Is this the right forum to post Spark related issues ? I have tried this > forum along with StackOverflow but not seeing any response. > > > > -- > View this message in

get hdfs file path in spark

2016-07-25 Thread Yang Cao
Hi, being new here, I hope to get assistance from you guys. I wonder whether there is some elegant way to get some directory under some path. For example, I have a path like on hdfs /a/b/c/d/e/f, and I am given a/b/c; is there any straightforward way to get the path /a/b/c/d/e? I think I can do it
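
One straightforward way (a sketch, assuming sc is the SparkContext and the path exists) is to go through the Hadoop FileSystem API rather than Spark itself:

    import org.apache.hadoop.fs.{FileSystem, Path}

    val fs = FileSystem.get(sc.hadoopConfiguration)

    // immediate children of /a/b/c, e.g. /a/b/c/d; repeat per level to walk deeper
    val dirs = fs.listStatus(new Path("/a/b/c")).filter(_.isDirectory).map(_.getPath)
    dirs.foreach(println)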

create external table from partitioned avro file

2016-07-28 Thread Yang Cao
Hi, I am using spark 1.6 and I hope to create a hive external table based on one partitioned avro file. Currently, I don't find any built-in api to do this work. I tried write.format().saveAsTable with format com.databricks.spark.avro; it returned an error: can't find Hive serde for this. Als
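
One workaround sometimes used in this situation (a sketch only, not verified against this exact setup) is to skip saveAsTable and declare the table directly over the existing directory with Hive DDL through the HiveContext; column list, partition column and location are placeholders:

    // assumes a HiveContext named sqlContext and Hive 0.14+ (which understands STORED AS AVRO)
    sqlContext.sql("""
      CREATE EXTERNAL TABLE my_avro_table (id BIGINT, name STRING)
      PARTITIONED BY (dt STRING)
      STORED AS AVRO
      LOCATION '/data/events/avro'
    """)
    sqlContext.sql("MSCK REPAIR TABLE my_avro_table")   // pick up the existing partition directories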

Re: java.net.UnknownHostException

2016-08-02 Thread Yang Cao
actually, I just ran into the same problem. If you can share some code around the error, I can figure out whether I can help you. And is "s001.bigdata" the name of your name node? > On Aug 2, 2016, at 17:22, pseudo oduesp wrote: > > someone can help me please > > 2016-08-01 11:51

Re: Spark SQL . How to enlarge output rows ?

2016-01-27 Thread bo yang
Hi Eli, are you using Python? I see there is a method show(numRows) in Java, but not sure about Python. On Wed, Jan 27, 2016 at 2:39 AM, Akhil Das wrote: > Why would you want to print all rows? You can try the following: > > sqlContext.sql("select day_time from my_table limit > 10").collect().fo
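
In Scala the relevant call is show with an explicit row count (and optionally truncate = false to avoid cutting long values); a one-line sketch against a placeholder table:

    // print up to 100 rows without truncating long cell values
    sqlContext.sql("select day_time from my_table").show(100, false)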

Re: metrics not reported by spark-cassandra-connector

2016-02-23 Thread Yin Yang
Hi, Sa: Have you asked on spark-cassandra-connector mailing list ? Seems you would get better response there. Cheers

Re: Execution plan in spark

2016-02-24 Thread Yin Yang
Is the following what you were looking for ? sqlContext.sql(""" CREATE TEMPORARY TABLE partitionedParquet USING org.apache.spark.sql.parquet OPTIONS ( path '/tmp/partitioned' )""") table("partitionedParquet").explain(true) On Wed, Feb 24, 2016 at 1:16 AM, Ashok Kuma

Re: Filter on a column having multiple values

2016-02-24 Thread Yin Yang
However, when the number of choices gets big, the following notation becomes cumbersome. On Wed, Feb 24, 2016 at 3:41 PM, Mich Talebzadeh < mich.talebza...@cloudtechnologypartners.co.uk> wrote: > You can use operators here. > > t.filter($"column1" === 1 || $"column1" === 2) > > > > > > On 24/02/
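
When the list of values grows, Column.isin is the usual way to avoid chaining === with ||; a short sketch using the DataFrame `t` from the quoted reply and a placeholder value list:

    val allowed = Seq(1, 2, 5, 7, 11)
    t.filter($"column1".isin(allowed: _*))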

chang hadoop version when import spark

2016-02-24 Thread YouPeng Yang
Hi, I am developing an application based on spark-1.6. My lib dependencies are just libraryDependencies ++= Seq( "org.apache.spark" %% "spark-core" % "1.6.0" ). It uses hadoop 2.2.0 as the default hadoop version, which is not my preference. I want to change the hadoop version when importing spark. How
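
With sbt, the usual approach (a sketch; the hadoop version shown is just an example) is to exclude the transitive hadoop-client and pin the version you want explicitly:

    libraryDependencies ++= Seq(
      ("org.apache.spark" %% "spark-core" % "1.6.0")
        .exclude("org.apache.hadoop", "hadoop-client"),
      "org.apache.hadoop" % "hadoop-client" % "2.6.0"   // the version you actually want
    )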

Re: Error:java.lang.RuntimeException: java.lang.IllegalArgumentException: Size exceeds Integer.MAX_VALUE

2016-02-24 Thread Yin Yang
See slides starting with slide #25 of http://www.slideshare.net/cloudera/top-5-mistakes-to-avoid-when-writing-apache-spark-applications FYI On Wed, Feb 24, 2016 at 7:25 PM, xiazhuchang wrote: > When cache data to memory, the code DiskStore$getBytes will be called. If > there is a big data, the

Re: Running executors missing in sparkUI

2016-02-25 Thread Yin Yang
Which Spark / hadoop release are you running ? Thanks On Thu, Feb 25, 2016 at 4:28 AM, Jan Štěrba wrote: > Hello, > > I have quite a weird behaviour that I can't quite wrap my head around. > I am running Spark on a Hadoop YARN cluster. I have Spark configured > in such a way that it utilizes al

Re: Spark 1.6.0 running jobs in yarn shows negative no of tasks in executor

2016-02-25 Thread Yin Yang
Which release of hadoop are you using ? Can you share a bit about the logic of your job ? Pastebinning portion of relevant logs would give us more clue. Thanks On Thu, Feb 25, 2016 at 8:54 AM, unk1102 wrote: > Hi I have spark job which I run on yarn and sometimes it behaves in weird > manner

Re: DirectFileOutputCommiter

2016-02-25 Thread Yin Yang
The header of DirectOutputCommitter.scala says Databricks. Did you get it from Databricks ? On Thu, Feb 25, 2016 at 3:01 PM, Teng Qiu wrote: > interesting in this topic as well, why the DirectFileOutputCommitter not > included? > > we added it in our fork, under > core/src/main/scala/org/apache

Re: Spark SQL support for sub-queries

2016-02-26 Thread Yin Yang
Since collect is involved, the approach would be slower compared to the SQL Mich gave in his first email. On Fri, Feb 26, 2016 at 1:42 AM, Michał Zieliński < zielinski.mich...@gmail.com> wrote: > You need to collect the value. > > val m: Int = d.agg(max($"id")).collect.apply(0).getInt(0) > d.filt

Re: Spark 1.5 on Mesos

2016-02-26 Thread Yin Yang
Have you read this ? https://spark.apache.org/docs/latest/running-on-mesos.html On Fri, Feb 26, 2016 at 11:03 AM, Ashish Soni wrote: > Hi All , > > Is there any proper documentation as how to run spark on mesos , I am > trying from the last few days and not able to make it work. > > Please help

Re: Spark SQL support for sub-queries

2016-02-26 Thread Yin Yang
I tried the following: scala> Seq((2, "a", "test"), (2, "b", "foo")).toDF("id", "a", "b").registerTempTable("test") scala> val df = sql("SELECT maxRow.* FROM (SELECT max(struct(id, b, a)) as maxRow FROM test) a") df: org.apache.spark.sql.DataFrame = [id: int, b: string ... 1 more field] scala> d

Re: TaskCompletionListener and Exceptions

2016-02-26 Thread Yin Yang
Please see [SPARK-13465] Add a task failure listener to TaskContext On Sat, Dec 19, 2015 at 3:44 PM, Neelesh wrote: > Hi, > I'm trying to build automatic Kafka watermark handling in my stream apps > by overriding the KafkaRDDIterator, and adding a taskcompletionlistener and > updating watermar
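
The listeners are registered on the TaskContext from inside a task; a minimal sketch assuming Spark 2.0+ (which includes the task failure listener from SPARK-13465) and a placeholder RDD `rdd`:

    import org.apache.spark.TaskContext
    import org.apache.spark.util.{TaskCompletionListener, TaskFailureListener}

    rdd.foreachPartition { iter =>
      val ctx = TaskContext.get()
      ctx.addTaskCompletionListener(new TaskCompletionListener {
        override def onTaskCompletion(context: TaskContext): Unit =
          println(s"task ${context.partitionId()} completed")   // runs on success and on failure
      })
      ctx.addTaskFailureListener(new TaskFailureListener {
        override def onTaskFailure(context: TaskContext, error: Throwable): Unit =
          println(s"task ${context.partitionId()} failed: ${error.getMessage}")
      })
      iter.foreach(_ => ())                                      // placeholder per-record work
    }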

.cache() changes contents of RDD

2016-02-26 Thread Yan Yang
Hi I am pretty new to Spark, and after experimenting on our pipelines I ran into this weird issue. The Scala code is as below: val input = sc.newAPIHadoopRDD(...) val rdd = input.map(...) rdd.cache() rdd.saveAsTextFile(...) I found rdd to consist of 80+K identical rows. To be more precise, t
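
A common cause of exactly this symptom is that Hadoop RecordReaders reuse the same Writable instance for every record, so caching the raw Hadoop RDD output captures many references to one mutable object. The usual fix is to copy each record into an immutable value before caching; a sketch shown with newAPIHadoopFile and Text values for concreteness (the same copying map applies to newAPIHadoopRDD):

    import org.apache.hadoop.io.{LongWritable, Text}
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

    val input = sc.newAPIHadoopFile("/data/in", classOf[TextInputFormat],
      classOf[LongWritable], classOf[Text])

    // copy out of the reused Writables before any cache()/collect()
    val rdd = input.map { case (offset, line) => (offset.get(), line.toString) }
    rdd.cache()
    rdd.saveAsTextFile("/data/out")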

Re: Ordering two dimensional arrays of (String, Int) in the order of second element

2016-02-27 Thread Yin Yang
Is this what you look for ? scala> Seq((2, "a", "test"), (2, "b", "foo")).toDF("id", "a", "b").registerTempTable("test") scala> val df = sql("SELECT struct(id, b, a) from test") df: org.apache.spark.sql.DataFrame = [struct(id, b, a): struct] scala> df.show ++ |struct(id, b, a)| +

Re: Ordering two dimensional arrays of (String, Int) in the order of second element

2016-02-27 Thread Yin Yang
scala> Seq((1, "b", "test"), (2, "a", "foo")).toDF("id", "a", "b").registerTempTable("test") scala> val df = sql("SELECT struct(id, b, a) from test order by b") df: org.apache.spark.sql.DataFrame = [struct(id, b, a): struct] scala> df.show ++ |struct(id, b, a)| ++

Re: Ordering two dimensional arrays of (String, Int) in the order of second element

2016-02-27 Thread Yin Yang
Is there particular reason you cannot use temporary table ? Thanks On Sat, Feb 27, 2016 at 10:59 AM, Ashok Kumar wrote: > Thank you sir. > > Can one do this sorting without using temporary table if possible? > > Best > > > On Saturday, 27 February 2016, 18:50, Yin

Re: spark 1.6 new memory management - some issues with tasks not using all executors

2016-02-29 Thread Yin Yang
The default value for spark.shuffle.reduceLocality.enabled is true. To reduce surprise to users of 1.5 and earlier releases, should the default value be set to false ? On Mon, Feb 29, 2016 at 5:38 AM, Lior Chaga wrote: > Hi Koret, > Try spark.shuffle.reduceLocality.enabled=false > This is an un

Re: a basic question on first use of PySpark shell and example, which is failing

2016-02-29 Thread Yin Yang
RDDOperationScope is in spark-core_2.1x jar file. 7148 Mon Feb 29 09:21:32 PST 2016 org/apache/spark/rdd/RDDOperationScope.class Can you check whether the spark-core jar is in classpath ? FYI On Mon, Feb 29, 2016 at 1:40 PM, Taylor, Ronald C wrote: > Hi Jules, folks, > > > > I have tried “h

RE: No space left on device when running graphx job

2015-10-05 Thread Jack Yang
September 2015 12:27 AM To: Jack Yang Cc: Ted Yu; Andy Huang; user@spark.apache.org Subject: Re: No space left on device when running graphx job Would you mind sharing what your solution was? It would help those on the forum who might run into the same problem. Even it it’s a silly ‘gotcha’ it

error with saveAsTextFile in local directory

2015-11-03 Thread Jack Yang
Hi all, I am saving some hive- query results into the local directory: val hdfsFilePath = "hdfs://master:ip/ tempFile "; val localFilePath = "file:///home/hduser/tempFile"; hiveContext.sql(s"""my hql codes here""") res.printSchema() --working res.show() --working res.map{ x => tranRow2Str(x) }

RE: error with saveAsTextFile in local directory

2015-11-03 Thread Jack Yang
Yes. My one is 1.4.0. Then is this problem to do with the version? I doubt that. Any comments please? From: Ted Yu [mailto:yuzhih...@gmail.com] Sent: Wednesday, 4 November 2015 11:52 AM To: Jack Yang Cc: user@spark.apache.org Subject: Re: error with saveAsTextFile in local directory Looks

spark with breeze error of NoClassDefFoundError

2015-11-17 Thread Jack Yang
Hi all, I am using spark 1.4.0, and building my codes using maven. So in one of my scala, I used: import breeze.linalg._ val v1 = new breeze.linalg.SparseVector(commonVector.indices, commonVector.values, commonVector.size) val v2 = new breeze.linalg.SparseVector(commonVector2.indices, commonVect

RE: spark with breeze error of NoClassDefFoundError

2015-11-17 Thread Jack Yang
oader.java:425) at java.lang.ClassLoader.loadClass(ClassLoader.java:358) ... 10 more 15/11/18 17:15:15 INFO util.Utils: Shutdown hook called From: Ted Yu [mailto:yuzhih...@gmail.com] Sent: Wednesday, 18 November 2015 4:01 PM To: Jack Yang Cc: user@spark.apache.org Subject: Re:

RE: spark with breeze error of NoClassDefFoundError

2015-11-18 Thread Jack Yang
ook called Meanwhile, I will prefer to use maven to compile the jar file rather than sbt, although it is indeed another option. Best regards, Jack From: Fengdong Yu [mailto:fengdo...@everstring.com] Sent: Wednesday, 18 November 2015 7:30 PM To: Jack Yang Cc: Ted Yu; user@spark.apache.org S

RE: Do windowing functions require hive support?

2015-11-18 Thread Jack Yang
Which version of spark are you using? From: Stephen Boesch [mailto:java...@gmail.com] Sent: Thursday, 19 November 2015 2:12 PM To: user Subject: Do windowing functions require hive support? The following works against a hive table from spark sql hc.sql("select id,r from (select id, name, rank()

RE: Do windowing functions require hive support?

2015-11-18 Thread Jack Yang
SQLContext only implements a subset of the SQL functions and does not include window functions. In HiveContext it is fine though. From: Stephen Boesch [mailto:java...@gmail.com] Sent: Thursday, 19 November 2015 3:01 PM To: Michael Armbrust Cc: Jack Yang; user Subject: Re: Do windowing functions
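
A sketch of the same kind of query routed through HiveContext (Spark 1.x), which is where window functions are available; my_table is a placeholder Hive table:

    import org.apache.spark.sql.hive.HiveContext

    val hc = new HiveContext(sc)
    hc.sql("""
      select id, r from (
        select id, name, rank() over (partition by name order by id) as r
        from my_table
      ) t where r = 1
    """).show()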

RE: spark with breeze error of NoClassDefFoundError

2015-11-18 Thread Jack Yang
cannot find the Class, but with “compiled” the error is IncompatibleClassChangeError. Ok, so can someone tell me which version of breeze and breeze-math are used in spark 1.4? From: Zhiliang Zhu [mailto:zchl.j...@yahoo.com] Sent: Thursday, 19 November 2015 5:10 PM To: Ted Yu Cc: Jack Yang; Fengdong

RE: spark with breeze error of NoClassDefFoundError

2015-11-19 Thread Jack Yang
lly confused… what exactly can we use breeze library in spark, please? From: Ted Yu [mailto:yuzhih...@gmail.com] Sent: Friday, 20 November 2015 1:46 AM To: Jack Yang Cc: Zhiliang Zhu; Fengdong Yu; user@spark.apache.org Subject: Re: spark with breeze error of NoClassDefFoundError I don't have S

assertion failed error with GraphX

2015-07-19 Thread Jack Yang
Hi there, I got an error when running one simple graphX program. My setting is: spark 1.4.0, Hadoop yarn 2.5. scala 2.10. with four virtual machines. if I constructed one small graph (6 nodes, 4 edges), I run: println("triangleCount: %s ".format( hdfs_graph.triangleCount().vertices.count() ))
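
triangleCount in these GraphX releases assumes edges in canonical direction (srcId < dstId) and a graph that has been partitioned with partitionBy, which is a frequent source of assertion failures. A sketch of the usual preparation, assuming hdfs_graph is the graph from the question:

    import org.apache.spark.graphx.{Edge, Graph, PartitionStrategy}

    // canonicalize edge direction, drop self-loops and duplicates, then partition
    val canonicalEdges = hdfs_graph.edges
      .map(e => if (e.srcId < e.dstId) (e.srcId, e.dstId) else (e.dstId, e.srcId))
      .filter { case (s, d) => s != d }
      .distinct()
      .map { case (s, d) => Edge(s, d, 1) }

    val prepared = Graph.fromEdges(canonicalEdges, 1)
      .partitionBy(PartitionStrategy.RandomVertexCut)

    println("triangleCount: %s ".format(prepared.triangleCount().vertices.count()))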

standalone to connect mysql

2015-07-20 Thread Jack Yang
Hi there, I would like to use spark to access the data in mysql. So firstly I tried to run the program using: spark-submit --class "sparkwithscala.SqlApp" --driver-class-path /home/lib/mysql-connector-java-5.1.34.jar --master local[4] /home/myjar.jar that returns me the correct results. Then I
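
For reference, a minimal JDBC read sketch for Spark 1.4+ (URL, table and credentials are placeholders). Note that on a real cluster the MySQL connector jar usually has to be shipped with --jars, as the reply below suggests, so that the executors and not just the driver can load com.mysql.jdbc.Driver:

    val jdbcDF = sqlContext.read.format("jdbc").options(Map(
      "url"     -> "jdbc:mysql://dbhost:3306/mydb?user=u&password=p",
      "dbtable" -> "student",
      "driver"  -> "com.mysql.jdbc.Driver"
    )).load()
    jdbcDF.show()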

RE: standalone to connect mysql

2015-07-21 Thread Jack Yang
27;, aa, 1) but if I did: sqlContext.sql(s"insert into Table newStu select * from otherStu") that works. Is there any document addressing that? Best regards, Jack From: Terry Hole [mailto:hujie.ea...@gmail.com] Sent: Tuesday, 21 July 2015 4:17 PM To: Jack Yang; user@spark.apache.org

Re: standalone to connect mysql

2015-07-21 Thread Jack Yang
@gmail.com] Sent: Tuesday, 21 July 2015 4:17 PM To: Jack Yang; user@spark.apache.org Subject: Re: standalone to connect mysql Maybe you can try: spark-submit --class "sparkwithscala.SqlApp" --jars /home/lib/mysq

Re: standalone to connect mysql

2015-07-21 Thread Jack Yang
9:21 pm, "Jack Yang" mailto:j...@uow.edu.au>> wrote: No. I did not use hiveContext at this stage. I am talking the embedded SQL syntax for pure spark sql. Thanks, mate. On 21 Jul 2015, at 6:13 pm, "Terry Hole" mailto:hujie.ea...@gmail.com>> wrote: Jack, You

log file directory

2015-07-27 Thread Jack Yang
Hi all, I have questions with regarding to the log file directory. That say if I run "spark-submit --master local[4]", where is the log file? Then how about if I run standalone "spark-submit --master spark://mymaster:7077"? Best regards, Jack

Re: Accessing S3 files with s3n://

2015-08-09 Thread bo yang
Hi Akshat, I find some open source library which implements S3 InputFormat for Hadoop. Then I use Spark newAPIHadoopRDD to load data via that S3 InputFormat. The open source library is https://github.com/ATLANTBH/emr-s3-io. It is a little old. I look inside it and make some changes. Then it works

Re: How to create DataFrame from a binary file?

2015-08-09 Thread bo yang
through Spark SQL: https://www.linkedin.com/pulse/light-weight-self-service-data-query-through-spark-sql-bo-yang Take a look and feel free to let me know for any question. Best, Bo On Sat, Aug 8, 2015 at 1:42 PM, unk1102 wrote: > Hi how do we create DataFrame from a binary file stored in HDFS
