Re: Why spark-submit works with package not with jar

2024-05-05 Thread Jeff Zhang
> ...not convinced why using the package should make so much difference between a failure and a success. In other words, when to use a package rather than a jar?
>
> Any ideas will be appreciated. Thanks.
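For the archive reader, a minimal sketch of the distinction being asked about. The configuration keys spark.jars and spark.jars.packages are the programmatic counterparts of the --jars and --packages flags; the Kafka connector coordinate below is only an illustrative assumption, not from the thread:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("packages-vs-jars")
      // --jars equivalent: ships exactly this file to the cluster; its
      // transitive dependencies are NOT resolved, a common cause of
      // NoClassDefFoundError at runtime
      .config("spark.jars", "/path/to/spark-sql-kafka-0-10_2.12-3.5.0.jar")
      // --packages equivalent: resolves the Maven coordinate AND its
      // transitive dependencies, which is usually why --packages succeeds
      // where a lone jar fails (in practice you would set one or the other)
      .config("spark.jars.packages",
        "org.apache.spark:spark-sql-kafka-0-10_2.12:3.5.0")
      .getOrCreate()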

Re: Welcoming some new committers and PMC members

2019-09-09 Thread Jeff Zhang

Re: [ANNOUNCE] Announcing Apache Spark 2.2.3

2019-01-15 Thread Jeff Zhang
> ...members for contributing to this release. This release would not have been possible without you.
>
> Bests,
> Dongjoon.

Re: Kubernetes backend and docker images

2018-01-05 Thread Jeff Zhang
Awesome, less is better.

Mridul Muralidharan wrote on Sat, Jan 6, 2018 at 11:54 AM:
> We should definitely clean this up and make it the default, nicely done Marcelo!
>
> Thanks,
> Mridul
>
> On Fri, Jan 5, 2018 at 5:06 PM Marcelo Vanzin wrote:
>> Hey all, especially those working on the k8s stuff. ...

Re: Faster Spark on ORC with Apache ORC

2017-07-13 Thread Jeff Zhang
Awesome, Dong Joon, it's a great improvement. Looking forward to its merge.

Dong Joon Hyun wrote on Wed, Jul 12, 2017:
> Hi, All.
>
> Since the Apache Spark 2.2 vote passed successfully last week, I think it's a good time for me to ask your opinions again about the following PR: https:

Re: [Important for PySpark Devs]: Master now tests with Python 2.7 rather than 2.6 - please retest any Python PRs

2017-03-31 Thread Jeff Zhang
Thanks, retriggered several PySpark PRs.

Hyukjin Kwon wrote on Thu, Mar 30, 2017:
> Thank you for informing this.
>
> On 30 Mar 2017 3:52 a.m., "Holden Karau" wrote:
>> Hi PySpark Developers,
>>
>> In https://issues.apache.org/jira/browse/SPARK-19955 / https://github.com/apache/spark/pull/17355, as...

Re: welcoming Burak and Holden as committers

2017-01-24 Thread Jeff Zhang
Congratulations Burak and Holden!

Yanbo Liang wrote on Wed, Jan 25, 2017:
> Congratulations, Burak and Holden.
>
> On Tue, Jan 24, 2017 at 7:32 PM, Chester Chen wrote:
>> Congratulations to both. Holden, we need to catch up.
>> Chester Chen

Re: [VOTE] Release Apache Spark 1.6.3 (RC2)

2016-11-03 Thread Jeff Zhang
+1

Dongjoon Hyun wrote on Fri, Nov 4, 2016:
> +1 (non-binding)
>
> It's built and tested on CentOS 6.8 / OpenJDK 1.8.0_111, too.
>
> Cheers,
> Dongjoon.
>
> On 2016-11-03 14:30 (-0700), Davies Liu wrote:
>> +1
>>
>> On Wed, Nov 2, 2016 at 5:40 PM, Reynold Xin wrote:
>>> Please vote on releasing...

Re: [VOTE] Release Apache Spark 2.0.1 (RC4)

2016-09-29 Thread Jeff Zhang
> Q: What justifies a -1 vote for this release?
> A: This is a maintenance release in the 2.0.x series. Bugs already present in 2.0.0, missing features, or bugs related to new features will not necessarily block this release.
>
> Q: What fix version should I use for patches merging into branch-2.0 from now on?
> A: Please mark the fix version as 2.0.2, rather than 2.0.1. If a new RC (i.e. RC5) is cut, I will change the fix version of those patches to 2.0.1.

Re: [VOTE] Release Apache Spark 2.0.1 (RC3)

2016-09-25 Thread Jeff Zhang
> The release files, including signatures, digests, etc. can be found at:
> http://people.apache.org/~pwendell/spark-releases/spark-2.0.1-rc3-bin/
>
> Release artifacts are signed with the following key:
> https://people.apache.org/keys/committer/pwendell.asc
>
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1201/
>
> The documentation corresponding to this release can be found at:
> http://people.apache.org/~pwendell/spark-releases/spark-2.0.1-rc3-docs/
>
> Q: How can I help test this release?
> A: If you are a Spark user, you can help us test this release by taking an existing Spark workload and running on this release candidate, then reporting any regressions from 2.0.0.
>
> Q: What justifies a -1 vote for this release?
> A: This is a maintenance release in the 2.0.x series. Bugs already present in 2.0.0, missing features, or bugs related to new features will not necessarily block this release.
>
> Q: What fix version should I use for patches merging into branch-2.0 from now on?
> A: Please mark the fix version as 2.0.2, rather than 2.0.1. If a new RC (i.e. RC4) is cut, I will change the fix version of those patches to 2.0.1.

Re: Welcoming Felix Cheung as a committer

2016-08-08 Thread Jeff Zhang

Re: Possible contribution to MLlib

2016-06-21 Thread Jeff Zhang
> ...to MLlib. Therefore, we are wondering whether such an extension of the MLlib K-means algorithm would be appreciated by the community and would have a chance of being included in future Spark releases.
>
> Regards,
> Simon Nanty

Re: [vote] Apache Spark 2.0.0-preview release (rc1)

2016-05-19 Thread Jeff Zhang
> On 18 May 2016, at 16:28, Sean Owen wrote:
>
> I think it's a good idea. Although releases have been preceded before by release candidates for developers, it would be good to get a formal preview/beta release ratified for public consumption ahead of a new major release. Better to have a little more testing in the wild to identify problems before 2.0.0 is finalized.
>
> +1 to the release. License, sigs, etc. check out. On Ubuntu 16 + Java 8, compilation and tests succeed for "-Pyarn -Phive -Phive-thriftserver -Phadoop-2.6".
>
> On Wed, May 18, 2016 at 6:40 AM, Reynold Xin wrote:
>
>> Hi,
>>
>> In the past the Apache Spark community has created preview packages (not official releases) and used those as opportunities to ask community members to test the upcoming versions of Apache Spark. Several people in the Apache community have suggested we conduct votes for these preview packages and turn them into formal releases by the Apache foundation's standard. Preview releases are not meant to be functional, i.e. they can and highly likely will contain critical bugs or documentation errors, but we will be able to post them to the project's website to get wider feedback. They should satisfy the legal requirements of Apache's release policy (http://www.apache.org/dev/release.html) such as having proper licenses.
>>
>> Please vote on releasing the following candidate as Apache Spark version 2.0.0-preview. The vote is open until Friday, May 20, 2016 at 11:00 PM PDT and passes if a majority of at least 3 +1 PMC votes are cast.
>>
>> [ ] +1 Release this package as Apache Spark 2.0.0-preview
>> [ ] -1 Do not release this package because ...
>>
>> To learn more about Apache Spark, please see http://spark.apache.org/
>>
>> The tag to be voted on is 2.0.0-preview (8f5a04b6299e3a47aca13cbb40e72344c0114860)
>>
>> The release files, including signatures, digests, etc. can be found at:
>> http://home.apache.org/~pwendell/spark-releases/spark-2.0.0-preview-bin/
>>
>> Release artifacts are signed with the following key:
>> https://people.apache.org/keys/committer/pwendell.asc
>>
>> The documentation corresponding to this release can be found at:
>> http://home.apache.org/~pwendell/spark-releases/spark-2.0.0-preview-docs/
>>
>> The list of resolved issues is:
>> https://issues.apache.org/jira/browse/SPARK-15351?jql=project%20%3D%20SPARK%20AND%20fixVersion%20%3D%202.0.0
>>
>> If you are a Spark user, you can help us test this release by taking an existing Apache Spark workload and running on this candidate, then reporting any regressions.

Re: executor delay in Spark

2016-04-24 Thread Jeff Zhang
> Regards,
> Raghava.
>
> On Sun, Apr 17, 2016 at 10:54 PM, Anuj Kumar wrote:
>> If the data file is the same then it should have a similar distribution of keys. A few queries:
>>
>> 1. Did you compare the number of partitions in both cases?
>> 2. Did you compare the resource allocation for spark-shell vs the Scala program being submitted?
>>
>> Also, can you please share the details of the Spark Context, Environment and Executors when you run via the Scala program?
>>
>> On Mon, Apr 18, 2016 at 4:41 AM, Raghava Mutharaju wrote:
>>> Hello All,
>>>
>>> We are using HashPartitioner in the following way on a 3 node cluster (1 master and 2 worker nodes).
>>>
>>>   val u = sc.textFile("hdfs://x.x.x.x:8020/user/azureuser/s.txt")
>>>     .map[(Int, Int)](line => line.split("\\|") match {
>>>       case Array(x, y) => (y.toInt, x.toInt)
>>>     })
>>>     .partitionBy(new HashPartitioner(8)).setName("u").persist()
>>>
>>>   u.count()
>>>
>>> If we run this from the spark shell, the data (52 MB) is split across the two worker nodes. But if we put this in a Scala program and run it, then all the data goes to only one node. We have run it multiple times, but this behavior does not change. This seems strange.
>>>
>>> Is there some problem with the way we use HashPartitioner?
>>>
>>> Thanks in advance.
>>> Regards, Raghava.
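Not part of the thread, but a handy diagnostic for exactly this kind of question, assuming the RDD u from the quoted code: counting records per partition makes skew (or an idle executor) immediately visible:

    // prints (partitionIndex, recordCount) for every partition of u
    u.mapPartitionsWithIndex((i, it) => Iterator((i, it.size)))
      .collect()
      .foreach(println)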

Spark build with scala-2.10 fails ?

2016-03-19 Thread Jeff Zhang
[error]    ^
[error] four errors found
[error] Compile failed at Mar 17, 2016 2:45:22 PM [13.105s]

Re: What should be spark.local.dir in spark on yarn?

2016-03-01 Thread Jeff Zhang
> ...7197/_tmp_space.db
>
> 4. The executor uses hadoop.tmp.dir for S3 output:
>
> 16/03/01 08:50:01 INFO s3native.NativeS3FileSystem: OutputStream for key 'test/p10_16/_SUCCESS' writing to tempfile '/data01/tmp/hadoop-hadoop/s3/output-2541604454681305094.tmp...

Re: Is spark.driver.maxResultSize used correctly ?

2016-03-01 Thread Jeff Zhang
On Sunday, February 28, 2016, Jeff Zhang wrote:
> Data skew might be possible, but it is not the common case. I think we should design for the common case; for the skew case, we could add a fraction parameter to let users tune it.
>
> On Sat, Feb...

Re: Support virtualenv in PySpark

2016-03-01 Thread Jeff Zhang
> ...already do what you proposed by creating identical virtualenvs on all nodes at the same path and changing the Spark Python path to point to the virtualenv.
>
> Best Regards,
> Mohannad
>
> On Mar 1, 2016 06:07, "Jeff Zhang" wrote:
>> I have created a JIRA for this feature...
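A sketch of that workaround, assuming a virtualenv pre-installed at the same path on every node (the path and app name are illustrative). PYSPARK_PYTHON tells the executors which Python interpreter to launch their workers with:

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .setAppName("venv-workaround")
      // each executor launches its Python workers from the pre-installed venv
      .setExecutorEnv("PYSPARK_PYTHON", "/opt/venvs/myapp/bin/python")

The driver side is typically handled by exporting PYSPARK_PYTHON (or PYSPARK_DRIVER_PYTHON) in the environment before launching.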

Support virtualenv in PySpark

2016-02-29 Thread Jeff Zhang
- spark.pyspark.virtualenv.path (path to the executable for virtualenv/conda)
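A hedged sketch of how such settings would be passed, using the one key named above. The sibling keys below (enabled, requirements) are assumptions recalled from the proposal (SPARK-13587), and none of this was a released API at the time:

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      // named in this thread: where the virtualenv/conda executable lives
      .set("spark.pyspark.virtualenv.path", "/usr/bin/virtualenv")
      // assumed sibling keys from the proposal, not a shipped API:
      .set("spark.pyspark.virtualenv.enabled", "true")
      .set("spark.pyspark.virtualenv.requirements", "/path/to/requirements.txt")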

Re: What should be spark.local.dir in spark on yarn?

2016-02-29 Thread Jeff Zhang
> ...spark.local.dir=/data01/tmp,/data02/tmp
>
> But the Spark master also writes some files to spark.local.dir, and my master box has only one additional disk, /data01.
>
> So, what should I use for spark.local.dir:
> spark.local.dir=/data01/tmp
> or
> spark.local.dir=/data01/tmp,/data02/tmp ?
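For context, a small sketch of the setting in question (paths illustrative). One general caveat worth knowing: on YARN, executor scratch space comes from the NodeManager's configured local dirs, which take precedence over spark.local.dir:

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      // comma-separated list: shuffle and spill files are spread across both disks
      .set("spark.local.dir", "/data01/tmp,/data02/tmp")
    // On YARN, executors use yarn.nodemanager.local-dirs (exposed to the
    // container as LOCAL_DIRS) instead; spark.local.dir still applies to
    // processes that do not run inside a YARN container.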

Re: Control the stdout and stderr streams in a executor JVM

2016-02-28 Thread Jeff Zhang
> ...Is there a possibility to have more fine-grained control over these, like we do in a log4j appender, with a property file?
>
> Rgds,
> Niranda

Re: Is spark.driver.maxResultSize used correctly ?

2016-02-28 Thread Jeff Zhang
> ...the result data are in one or a few tasks though.
>
> On Friday, February 26, 2016, Jeff Zhang wrote:
>> My job gets this exception very easily even when I set a large value of spark.driver.maxResultSize. After checking the Spark code, I found...

Is spark.driver.maxResultSize used correctly ?

2016-02-26 Thread Jeff Zhang
...Total size of serialized results of ... tasks (1085.0 MB) is bigger than spark.driver.maxResultSize (1024.0 MB)
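A minimal sketch that reproduces this class of failure; the limit and data size are illustrative, not from the thread:

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("max-result-size-demo")
      // cap on the total serialized task results the driver will accept
      .set("spark.driver.maxResultSize", "1g")
    val sc = new SparkContext(conf)

    // collect() ships every partition's serialized result to the driver;
    // once their combined size passes 1g the job aborts with the quoted error
    val big = sc.parallelize(0 until (1 << 28), 200)
    big.collect()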

Re: ORC file writing hangs in pyspark

2016-02-23 Thread Jeff Zhang
> ...ed/raw_result is created with a _temporary folder, but the data is never written. The job hangs at this point, apparently indefinitely.
>
> Additionally, no logs are recorded or available for the jobs on the history server.
>
> What could be the problem?

Re: Are we running SparkR tests in Jenkins?

2016-01-15 Thread Jeff Zhang
Created https://issues.apache.org/jira/browse/SPARK-12846

On Fri, Jan 15, 2016 at 3:29 PM, Jeff Zhang wrote:
> Right, I forgot the documentation; I will create a follow-up JIRA.
>
> On Fri, Jan 15, 2016 at 3:23 PM, Shivaram Venkataraman wrote:...

Re: Are we running SparkR tests in Jenkins?

2016-01-15 Thread Jeff Zhang
> ...'sparkR' is not supported as of Spark 2.0. Use ./bin/spark-submit
>
> Are we still running R tests? Or just saying that this will be deprecated?
>
> Kind regards,
> Herman van Hövell tot Westerflier

Re: [discuss] dropping Python 2.6 support

2016-01-05 Thread Jeff Zhang
> On Tue, Jan 5, 2016 at 2:27 PM, Julio Antonio Soto de Vicente wrote:
>> Unfortunately, Koert is right. I've been in a couple of projects using Spark (banking industry) where CentOS + Python 2.6 is the toolbox available. That said, I believe it should not be a concern for Spark. Python 2.6 is old and busted, which is totally opposite to the Spark philosophy IMO.
>
> On Jan 5, 2016, at 20:07, Koert Kuipers wrote:
>> rhel/centos 6 ships with python 2.6, doesn't it? If so, I still know plenty of large companies where python 2.6 is the only option. Asking them for python 2.7 is not going to work, so I think it's a bad idea.
>
> On Tue, Jan 5, 2016 at 1:52 PM, Juliet Hougland wrote:
>> I don't see a reason Spark 2.0 would need to support Python 2.6. At this point, Python 3 should be the default that is encouraged. Most organizations acknowledge that 2.7 is common, but lagging behind the version they should theoretically use. Dropping Python 2.6 support sounds very reasonable to me.
>
> On Tue, Jan 5, 2016 at 5:45 AM, Nicholas Chammas wrote:
>> +1
>>
>> Red Hat supports Python 2.6 on RHEL 5 until 2020 (https://alexgaynor.net/2015/mar/30/red-hat-open-source-community/), but otherwise yes, Python 2.6 is ancient history and the core Python developers stopped supporting it in 2013. RHEL 5 is not a good enough reason to continue support for Python 2.6 IMO.
>>
>> We should aim to support Python 2.7 and Python 3.3+ (which I believe we currently do).
>>
>> Nick
>
> On Tue, Jan 5, 2016 at 8:01 AM, Allen Zhang wrote:
>> plus 1, we are currently using python 2.7.2 in production environment.
>
> On 2016-01-05 18:11:45, Meethu Mathew wrote:
>> +1. We use Python 2.7.
>>
>> Regards,
>> Meethu Mathew
>
> On Tue, Jan 5, 2016 at 12:47 PM, Reynold Xin wrote:
>> Does anybody here care about us dropping support for Python 2.6 in Spark 2.0?
>>
>> Python 2.6 is ancient, and is pretty slow in many aspects (e.g. json parsing) when compared with Python 2.7. Some libraries that Spark depends on stopped supporting 2.6. We can still convince the library maintainers to support 2.6, but it will be extra work. I'm curious if anybody still uses Python 2.6 to run Spark.
>>
>> Thanks.

Re: How to execute non-hadoop command ?

2016-01-04 Thread Jeff Zhang
Sorry, wrong list.

On Tue, Jan 5, 2016 at 12:36 PM, Jeff Zhang wrote:
> I want to create a service check for Spark, but Spark doesn't use the hadoop script as its launch script. I found that other components use ExecuteHadoop to launch a hadoop job to verify the service; I am wondering whether there...

How to execute non-hadoop command ?

2016-01-04 Thread Jeff Zhang
...but I don't find how it associates with hadoop.

Re: 答复: How can I get the column data based on specific column name and then stored these data in array or list ?

2015-12-25 Thread Jeff Zhang
me)")) df2.printSchema() df2.show() On Fri, Dec 25, 2015 at 3:44 PM, zml张明磊 wrote: > Thanks, Jeff. It’s not choose some columns of a Row. It’s just choose all > data in a column and convert it to an Array. Do you understand my mean ? > > > > In Chinese > > 我是想基于这个列名把这

Re: How can I get the column data based on specific column name and then stored these data in array or list ?

2015-12-24 Thread Jeff Zhang
> ...How can I achieve this function?
>
> Thanks,
> Minglei.

Re: [VOTE] Release Apache Spark 1.6.0 (RC4)

2015-12-22 Thread Jeff Zhang
> Changes of behavior:
>
> - spark.mllib.tree.GradientBoostedTrees validationTol has changed semantics in 1.6. Previously, it was a threshold for absolute change in error. Now, it resembles the behavior of GradientDescent convergenceTol: for large errors, it uses relative error (relative to the previous error); for small errors (< 0.01), it uses absolute error.
> - spark.ml.feature.RegexTokenizer: previously, it did not convert strings to lowercase before tokenizing. Now, it converts to lowercase by default, with an option not to. This matches the behavior of the simpler Tokenizer transformer.
> - Spark SQL's partition discovery has been changed to only discover partition directories that are children of the given path. (I.e. if path="/my/data/x=1" then x=1 will no longer be considered a partition but only children of x=1 will.) This behavior can be overridden by manually specifying the basePath that partitioning discovery should start with (SPARK-11678, https://issues.apache.org/jira/browse/SPARK-11678).
> - When casting a value of an integral type to timestamp (e.g. casting a long value to timestamp), the value is treated as being in seconds instead of milliseconds (SPARK-11724, https://issues.apache.org/jira/browse/SPARK-11724).
> - With the improved query planner for queries having distinct aggregations (SPARK-9241, https://issues.apache.org/jira/browse/SPARK-9241), the plan of a query having a single distinct aggregation has been changed to a more robust version. To switch back to the plan generated by Spark 1.5's planner, please set spark.sql.specializeSingleDistinctAggPlanning to true (SPARK-12077, https://issues.apache.org/jira/browse/SPARK-12077).
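A one-line illustration of the timestamp-cast change quoted above, runnable in a 1.6-era spark-shell (the sqlContext handle and the literal are illustrative):

    // Spark 1.6 interprets the integral value as SECONDS since the epoch
    sqlContext.sql("SELECT CAST(1450000000 AS TIMESTAMP)").show()
    // Spark 1.5.x treated the same literal as milliseconds, so the
    // resulting timestamps differ by a factor of 1000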

Re: [VOTE] Release Apache Spark 1.6.0 (RC3)

2015-12-19 Thread Jeff Zhang
> - spark.mllib.tree.GradientBoostedTrees validationTol has changed semantics in 1.6. Previously, it was a threshold for absolute change in error. Now, it resembles the behavior of GradientDescent convergenceTol: for large errors, it uses relative error (relative to the previous error); for small errors (< 0.01), it uses absolute error.
> - spark.ml.feature.RegexTokenizer: previously, it did not convert strings to lowercase before tokenizing. Now, it converts to lowercase by default, with an option not to. This matches the behavior of the simpler Tokenizer transformer.
> - Spark SQL's partition discovery has been changed to only discover partition directories that are children of the given path. (I.e. if path="/my/data/x=1" then x=1 will no longer be considered a partition but only children of x=1 will.) This behavior can be overridden by manually specifying the basePath that partitioning discovery should start with (SPARK-11678, https://issues.apache.org/jira/browse/SPARK-11678).
> - When casting a value of an integral type to timestamp (e.g. casting a long value to timestamp), the value is treated as being in seconds instead of milliseconds (SPARK-11724, https://issues.apache.org/jira/browse/SPARK-11724).
> - With the improved query planner for queries having distinct aggregations (SPARK-9241, https://issues.apache.org/jira/browse/SPARK-9241), the plan of a query having a single distinct aggregation has been changed to a more robust version. To switch back to the plan generated by Spark 1.5's planner, please set spark.sql.specializeSingleDistinctAggPlanning to true (SPARK-12077, https://issues.apache.org/jira/browse/SPARK-12077).

Re: [SparkR] Any reason why saveDF's mode is append by default ?

2015-12-14 Thread Jeff Zhang
> [1]) but the Python API seems to have been changed to match Scala / Java in https://issues.apache.org/jira/browse/SPARK-6366
>
> Feel free to open a JIRA / PR for this.
>
> Thanks,
> Shivaram
>
> [1] https://github.com/amplab-extras/SparkR-pkg/pull/199/files

[SparkR] Any reason why saveDF's mode is append by default ?

2015-12-13 Thread Jeff Zhang
It is inconsistent with the Scala API, whose default is error. Any reason for that? Thanks.
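For reference, a sketch of the Scala-side behavior being contrasted with; df and the output path are assumed for illustration:

    import org.apache.spark.sql.SaveMode

    // Scala/Java default, SaveMode.ErrorIfExists: writing to an existing
    // path throws instead of silently growing the data
    df.write.mode(SaveMode.ErrorIfExists).parquet("/out/table")

    // the default this thread says SparkR's saveDF used instead:
    df.write.mode(SaveMode.Append).parquet("/out/table")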

Re: Spark doesn't unset HADOOP_CONF_DIR when testing ?

2015-12-06 Thread Jeff Zhang
Thanks Josh, created https://issues.apache.org/jira/browse/SPARK-12166

On Mon, Dec 7, 2015 at 4:32 AM, Josh Rosen wrote:
> I agree that we should unset this in our tests. Want to file a JIRA and submit a PR to do this?
>
> On Thu, Dec 3, 2015 at 6:40 PM Jeff Zhang wrote:...

Spark doesn't unset HADOOP_CONF_DIR when testing ?

2015-12-03 Thread Jeff Zhang
[info]   at ...(SessionState.java:522)
[info]   at org.apache.spark.sql.hive.client.ClientWrapper.<init>(ClientWrapper.scala:171)
[info]   at org.apache.spark.sql.hive.HiveContext.executionHive$lzycompute(HiveContext.scala:162)
[info]   at org.apache.spark.sql.hive.HiveContext.executionHive(HiveContext.scala:160)

Re: Problem in running MLlib SVM

2015-11-28 Thread Jeff Zhang
> ...is positive, then I return 1, which is a correct classification, and I return zero otherwise. Do you have any idea how to classify a point as positive or negative using this score or another function?
>
> On Sat, Nov 28, 2015 at 5:14 AM, Jeff Zhang wrote:
>> if(...

Re: Problem in running MLlib SVM

2015-11-28 Thread Jeff Zhang
> ...
>       public Integer call(Integer arg0, Integer arg1) throws Exception {
>         return arg0 + arg1;
>       }});
>
>       // compute accuracy as the percentage of correctly classified examples
>       double accuracy = ((double) sum) / ((double) classification.count());
>       System.out.println("Accuracy = " + accuracy);
>     }
>   }
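Not from the thread, but a compact Scala sketch of the standard MLlib answer to this question (the file path is illustrative): SVMModel applies a 0.0 threshold to its raw margin by default, so predict() already returns the 0/1 class, and clearThreshold() switches it to raw scores:

    import org.apache.spark.mllib.classification.SVMWithSGD
    import org.apache.spark.mllib.util.MLUtils

    val data = MLUtils.loadLibSVMFile(sc, "data/svm_data.txt")
    val model = SVMWithSGD.train(data, 100)  // 100 iterations

    // default threshold 0.0: positive margin -> 1.0, otherwise 0.0
    val labelAndPred = data.map(p => (p.label, model.predict(p.features)))
    val accuracy =
      labelAndPred.filter { case (l, p) => l == p }.count.toDouble / data.count()

    // model.clearThreshold() would make predict() return the raw margin instead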

Re: FW: SequenceFile and object reuse

2015-11-18 Thread Jeff Zhang
> ...objects, you should first copy them using a map function.
>
> Is there anyone who can shed some light on this bizarre behavior and the decisions behind it? And I would also like to know if anyone has been able to read a binary file and not incur the additional map() suggested by the above. What format did you use?
>
> Thanks,
> Jeff
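The behavior being asked about is Hadoop's RecordReader reusing the same Writable instances across records; a minimal sketch of the copy-in-a-map pattern the quoted docs recommend (path and Writable types are illustrative):

    import org.apache.hadoop.io.Text

    // the reader hands back the SAME Text objects for every record, so
    // materialize immutable copies before caching, sorting, or collecting
    val rdd = sc.sequenceFile("/data/events.seq", classOf[Text], classOf[Text])
      .map { case (k, v) => (k.toString, v.toString) }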

Re: Does anyone meet the issue that jars under lib_managed is never downloaded ?

2015-11-17 Thread Jeff Zhang
Created https://issues.apache.org/jira/browse/SPARK-11798

On Wed, Nov 18, 2015 at 9:42 AM, Josh Rosen wrote:
> Can you file a JIRA issue to help me triage this further? Thanks!
>
> On Tue, Nov 17, 2015 at 4:08 PM, Jeff Zhang wrote:
>> Sure, hive profile is enabled. ...

Re: Does anyone meet the issue that jars under lib_managed is never downloaded ?

2015-11-17 Thread Jeff Zhang
Sure, hive profile is enabled.

On Wed, Nov 18, 2015 at 6:12 AM, Josh Rosen wrote:
> Is the Hive profile enabled? I think it may need to be turned on in order for those JARs to be deployed.
>
> On Tue, Nov 17, 2015 at 2:27 AM, Jeff Zhang wrote:
>> BTW, after I revert...

Re: Does anyone meet the issue that jars under lib_managed is never downloaded ?

2015-11-17 Thread Jeff Zhang
BTW, after I revert SPARK-7841, I can see all the jars under lib_managed/jars.

On Tue, Nov 17, 2015 at 2:46 PM, Jeff Zhang wrote:
> Hi Josh,
>
> I notice the comments in https://github.com/apache/spark/pull/9575 said that Datanucleus-related jars will still be copied to lib_...


Re: Does anyone meet the issue that jars under lib_managed is never downloaded ?

2015-11-16 Thread Jeff Zhang
..."+assembly)
println("jars:" + jars.map(_.getAbsolutePath()).mkString(","))

On Mon, Nov 16, 2015 at 4:51 PM, Jeff Zhang wrote:
> This is the exception I got:
>
> 15/11/16 16:50:48 WARN metastore.HiveMetaStore: Retrying creating default database afte...

Re: slightly more informative error message in MLUtils.loadLibSVMFile

2015-11-16 Thread Jeff Zhang
>> ...
>>   previous = current
>>   i += 1
>> }

Re: Does anyone meet the issue that jars under lib_managed is never downloaded ?

2015-11-16 Thread Jeff Zhang
at org.apache.hadoop.hive.metastore.MetaStoreUtils.newInstance(MetaStoreUtils.java:1521)
at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.<init>(RetryingMetaStoreClient.java:86)

On Mon, Nov 16, 2015 at 4:47 PM, Jeff Zhang wrote:
> It's about the datanucleus-related jars which are needed by Spark SQL. Without th...

Re: Does anyone meet the issue that jars under lib_managed is never downloaded ?

2015-11-16 Thread Jeff Zhang
> ...will no longer place every dependency JAR into lib_managed. Can you say more about how this affected spark-shell for you (maybe share a stacktrace)?
>
> On Mon, Nov 16, 2015 at 12:03 AM, Jeff Zhang wrote:
>> Sometimes, the jars under lib_managed are missing. And afte...

Does anyone meet the issue that jars under lib_managed is never downloaded ?

2015-11-16 Thread Jeff Zhang
Sometimes, the jars under lib_managed are missing, and after I rebuild Spark they are still not downloaded. This causes spark-shell to fail due to the missing jars. Has anyone hit this weird issue?

Re: Why there's no api for SparkContext#textFiles to support multiple inputs ?

2015-11-12 Thread Jeff Zhang
Didn't notice that I can pass comma-separated paths to the existing API (SparkContext#textFile), so no new API is necessary. Thanks all.

On Thu, Nov 12, 2015 at 10:24 AM, Jeff Zhang wrote:
> Hi Pradeep,
>
> >>> Looks like what I was suggesting doesn't work. :/
> I gu...
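A minimal sketch of the resolution reached above (file names are illustrative):

    // one RDD over several inputs: comma-separated paths (globs also work)
    val a = sc.textFile("/data/file1.txt,/data/file2.txt,/data/dir/*.log")

    // the equivalent workaround discussed earlier in the thread
    val b = sc.textFile("/data/file1.txt") union sc.textFile("/data/file2.txt")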

Re: Why there's no api for SparkContext#textFiles to support multiple inputs ?

2015-11-11 Thread Jeff Zhang
...already implemented a simple patch, and it works.

On Thu, Nov 12, 2015 at 10:17 AM, Pradeep Gollakota wrote:
> Looks like what I was suggesting doesn't work. :/
>
> On Wed, Nov 11, 2015 at 4:49 PM, Jeff Zhang wrote:
>> Yes, that's what I suggest. TextInputFormat suppor...

Re: Why there's no api for SparkContext#textFiles to support multiple inputs ?

2015-11-11 Thread Jeff Zhang
> ...tried this, but I think you should just be able to do sc.textFile("file1,file2,...")
>
> On Wed, Nov 11, 2015 at 4:30 PM, Jeff Zhang wrote:
>> I know these workarounds, but wouldn't it be more convenient and straightforward to use SparkContext#textFile...

Re: Why there's no api for SparkContext#textFiles to support multiple inputs ?

2015-11-11 Thread Jeff Zhang
>> ...multiple RDDs. For example:
>>
>>   val lines1 = sc.textFile("file1")
>>   val lines2 = sc.textFile("file2")
>>   val rdd = lines1 union lines2
>>
>> regards,
>> --Jakob
>>
>> On 11 November 2015 at 01:20, Jeff Zh...

Why there's no api for SparkContext#textFiles to support multiple inputs ?

2015-11-11 Thread Jeff Zhang
...or some other consideration that I don't know of.

Re: Why LibSVMRelation and CsvRelation don't extends HadoopFsRelation ?

2015-11-10 Thread Jeff Zhang
> ...for us to also modify spark-csv as you proposed in SPARK-11622?
>
> Regards,
> Kai
>
>> On Nov 5, 2015, at 11:30 AM, Jeff Zhang wrote:
>>
>> Not sure of the reason; it seems LibSVMRelation and CsvRelation can extend HadoopFsRelation and le...

Re: Why LibSVMRelation and CsvRelation don't extends HadoopFsRelation ?

2015-11-04 Thread Jeff Zhang
> ...probably not necessary for LibSVMRelation. But I think it will be easy to change to extend from HadoopFsRelation.
>
> Hao
>
> From: Jeff Zhang [mailto:zjf...@gmail.com]
> Sent: Thursday, November 5, 2015 10:31 AM
> To: dev@spark...

Why LibSVMRelation and CsvRelation don't extends HadoopFsRelation ?

2015-11-04 Thread Jeff Zhang
Not sure of the reason; it seems LibSVMRelation and CsvRelation could extend HadoopFsRelation and leverage its features. Is there any other consideration behind that?

Re: Master build fails ?

2015-11-03 Thread Jeff Zhang
> ...let me try again, just to be sure.
>
> Regards,
> JB
>
> On 11/03/2015 11:50 AM, Jeff Zhang wrote:
>> Looks like it's due to guava version conflicts; I see both guava 14.0.1 and 16.0.1 under lib_managed/bundles. Has anyone met this issue too? ...

Master build fails ?

2015-11-03 Thread Jeff Zhang
[error]   val cookie = HashCodes.fromBytes(secret).toString()
[error]                ^

Should enforce the uniqueness of field name in DataFrame ?

2015-10-14 Thread Jeff Zhang
// df has schema (name, age)
val df2 = df.join(df, "name")  // schema (name, age, age)
df2.select("age")              // ambiguous column reference
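A sketch of the usual way around the ambiguity, using the same df as above: alias each side of the self-join so every column gets an unambiguous qualifier:

    import org.apache.spark.sql.functions.col

    val joined = df.as("l").join(df.as("r"), col("l.name") === col("r.name"))
    joined.select(col("l.age"))   // qualified, no longer ambiguous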

Re: Is OutputCommitCoordinator necessary for all the stages ?

2015-08-11 Thread Jeff Zhang
> ...Stages and ResultStages there still should not be a performance penalty for this, because the extra rounds of RPCs should only be performed when necessary.
>
> On 8/11/15 2:25 AM, Jeff Zhang wrote:
>> As my understanding, OutputCommitCoordinator should only be necessary f...

Is OutputCommitCoordinator necessary for all the stages ?

2015-08-11 Thread Jeff Zhang
In my understanding, OutputCommitCoordinator should only be necessary for ResultStage (especially a ResultStage that writes to HDFS), but currently it is used for all stages. Is there any reason for that?