t convinced why using the package should make so much
> difference between a failure and success. In other words, when to use a
> package rather than a jar.
>
>
> Any ideas will be appreciated.
>
>
> Thanks
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
--
Best Regards
Jeff Zhang
>>
>>>>
>>>> --
>>>> Shane Knapp
>>>> UC Berkeley EECS Research / RISELab Staff Technical Lead
>>>> https://rise.cs.berkeley.edu
>>>>
>>>> -
>>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>>>
>>>>
>>>
>>> --
>>> John Zhuge
>>>
>>
>>
>> --
>> Name : Jungtaek Lim
>> Blog : http://medium.com/@heartsavior
>> Twitter : http://twitter.com/heartsavior
>> LinkedIn : http://www.linkedin.com/in/heartsavior
>>
>
--
Best Regards
Jeff Zhang
bers for contributing to
> this release. This release would not have been possible without you.
>
> Bests,
> Dongjoon.
>
--
Best Regards
Jeff Zhang
Awesome, less is better
On Sat, Jan 6, 2018 at 11:54 AM, Mridul Muralidharan wrote:
>
> We should definitely clean this up and make it the default, nicely done
> Marcelo !
>
> Thanks,
> Mridul
>
> On Fri, Jan 5, 2018 at 5:06 PM Marcelo Vanzin wrote:
>
>> Hey all, especially those working on the k8s stuff.
>>
>>
Awesome, Dong Joon, it's a great improvement. Looking forward to its merge.
On Wed, Jul 12, 2017 at 6:53 AM, Dong Joon Hyun wrote:
> Hi, All.
>
>
>
> Since Apache Spark 2.2 vote passed successfully last week,
>
> I think it’s a good time for me to ask your opinions again about the
> following PR.
>
>
>
> https:
Thanks, retriggered several PySpark PRs.
On Thu, Mar 30, 2017 at 7:42 AM, Hyukjin Kwon wrote:
> Thank you for informing this.
>
> On 30 Mar 2017 3:52 a.m., "Holden Karau" wrote:
>
> Hi PySpark Developers,
>
> In https://issues.apache.org/jira/browse/SPARK-19955 /
> https://github.com/apache/spark/pull/17355, as
Congratulations Burak and Holden!
On Wed, Jan 25, 2017 at 11:54 AM, Yanbo Liang wrote:
> Congratulations, Burak and Holden.
>
> On Tue, Jan 24, 2017 at 7:32 PM, Chester Chen
> wrote:
>
> Congratulation to both.
>
>
>
> Holden, we need catch up.
>
>
>
>
>
> *Chester Chen *
>
> ■ Senior Manager – Data Science &
+1
On Fri, Nov 4, 2016 at 9:44 AM, Dongjoon Hyun wrote:
> +1 (non-binding)
>
> It's built and tested on CentOS 6.8 / OpenJDK 1.8.0_111, too.
>
> Cheers,
> Dongjoon.
>
> On 2016-11-03 14:30 (-0700), Davies Liu wrote:
> > +1
> >
> > On Wed, Nov 2, 2016 at 5:40 PM, Reynold Xin wrote:
> > > Please vote on releasi
>>>>>
>>>>> Q: What justifies a -1 vote for this release?
>>>>> A: This is a maintenance release in the 2.0.x series. Bugs already
>>>>> present in 2.0.0, missing features, or bugs related to new features will
>>>>> not necessarily block this release.
>>>>>
>>>>> Q: What fix version should I use for patches merging into branch-2.0
>>>>> from now on?
>>>>> A: Please mark the fix version as 2.0.2, rather than 2.0.1. If a new
>>>>> RC (i.e. RC5) is cut, I will change the fix version of those patches to
>>>>> 2.0.1.
>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>> Luciano Resende
>>>> http://twitter.com/lresende1975
>>>> http://lresende.blogspot.com/
>>>>
>>>
>>>
>>
>>
>> --
>> Kyle Kelley (@rgbkrk <https://twitter.com/rgbkrk>; lambdaops.com)
>>
>
--
Best Regards
Jeff Zhang
>>>>>>>> 1-rc3-bin/
>>>>>>>>
>>>>>>>> Release artifacts are signed with the following key:
>>>>>>>> https://people.apache.org/keys/committer/pwendell.asc
>>>>>>>>
>>>>>>>> The staging repository for this release can be found at:
>>>>>>>> https://repository.apache.org/content/repositories/orgapache
>>>>>>>> spark-1201/
>>>>>>>>
>>>>>>>> The documentation corresponding to this release can be found at:
>>>>>>>> http://people.apache.org/~pwendell/spark-releases/spark-2.0.
>>>>>>>> 1-rc3-docs/
>>>>>>>>
>>>>>>>>
>>>>>>>> Q: How can I help test this release?
>>>>>>>> A: If you are a Spark user, you can help us test this release by
>>>>>>>> taking an existing Spark workload and running on this release
>>>>>>>> candidate,
>>>>>>>> then reporting any regressions from 2.0.0.
>>>>>>>>
>>>>>>>> Q: What justifies a -1 vote for this release?
>>>>>>>> A: This is a maintenance release in the 2.0.x series. Bugs already
>>>>>>>> present in 2.0.0, missing features, or bugs related to new features
>>>>>>>> will
>>>>>>>> not necessarily block this release.
>>>>>>>>
>>>>>>>> Q: What fix version should I use for patches merging into
>>>>>>>> branch-2.0 from now on?
>>>>>>>> A: Please mark the fix version as 2.0.2, rather than 2.0.1. If a
>>>>>>>> new RC (i.e. RC4) is cut, I will change the fix version of those
>>>>>>>> patches to
>>>>>>>> 2.0.1.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
--
Best Regards
Jeff Zhang
>> > ----
>> -
>> >>>> > To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>> >>>> >
>> >>>>
>> >>>>
>> -
>> >>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>> >>>>
>> >>
>> >>
>> >>
>> >> --
>> >> -Dhruve Ashar
>> >>
>> >
>>
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>
>>
>
--
Best Regards
Jeff Zhang
ed to MLlib. Therefore, we are wondering if such an
> extension of MLlib K-means algorithm would be appreciated by the community
> and would have chances to get included in future spark releases.
>
>
>
> Regards,
>
>
>
> Simon Nanty
>
>
>
--
Best Regards
Jeff Zhang
>>>>>>> [INFO]
>>>>>>>
>>>>>>>
>>>>>>> On 18 May 2016, at 16:28, Sean Owen wrote:
>>>>>>>
>>>>>>> I think it's a good idea. Although releases have been preceded before
>>>>>>> by release candidates for developers, it would be good to get a
>>>>>>> formal
>>>>>>> preview/beta release ratified for public consumption ahead of a new
>>>>>>> major release. Better to have a little more testing in the wild to
>>>>>>> identify problems before 2.0.0 is finalized.
>>>>>>>
>>>>>>> +1 to the release. License, sigs, etc check out. On Ubuntu 16 + Java
>>>>>>> 8, compilation and tests succeed for "-Pyarn -Phive
>>>>>>> -Phive-thriftserver -Phadoop-2.6".
>>>>>>>
>>>>>>> On Wed, May 18, 2016 at 6:40 AM, Reynold Xin
>>>>>>> wrote:
>>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> In the past the Apache Spark community have created preview packages
>>>>>>> (not
>>>>>>> official releases) and used those as opportunities to ask community
>>>>>>> members
>>>>>>> to test the upcoming versions of Apache Spark. Several people in the
>>>>>>> Apache
>>>>>>> community have suggested we conduct votes for these preview packages
>>>>>>> and
>>>>>>> turn them into formal releases by the Apache foundation's standard.
>>>>>>> Preview
>>>>>>> releases are not meant to be functional, i.e. they can and highly
>>>>>>> likely
>>>>>>> will contain critical bugs or documentation errors, but we will be
>>>>>>> able to
>>>>>>> post them to the project's website to get wider feedback. They should
>>>>>>> satisfy the legal requirements of Apache's release policy
>>>>>>> (http://www.apache.org/dev/release.html) such as having proper
>>>>>>> licenses.
>>>>>>>
>>>>>>>
>>>>>>> Please vote on releasing the following candidate as Apache Spark
>>>>>>> version
>>>>>>> 2.0.0-preview. The vote is open until Friday, May 20, 2016 at 11:00
>>>>>>> PM PDT
>>>>>>> and passes if a majority of at least 3 +1 PMC votes are cast.
>>>>>>>
>>>>>>> [ ] +1 Release this package as Apache Spark 2.0.0-preview
>>>>>>> [ ] -1 Do not release this package because ...
>>>>>>>
>>>>>>> To learn more about Apache Spark, please see
>>>>>>> http://spark.apache.org/
>>>>>>>
>>>>>>> The tag to be voted on is 2.0.0-preview
>>>>>>> (8f5a04b6299e3a47aca13cbb40e72344c0114860)
>>>>>>>
>>>>>>> The release files, including signatures, digests, etc. can be found
>>>>>>> at:
>>>>>>>
>>>>>>> http://home.apache.org/~pwendell/spark-releases/spark-2.0.0-preview-bin/
>>>>>>>
>>>>>>> Release artifacts are signed with the following key:
>>>>>>> https://people.apache.org/keys/committer/pwendell.asc
>>>>>>>
>>>>>>> The documentation corresponding to this release can be found at:
>>>>>>>
>>>>>>> http://home.apache.org/~pwendell/spark-releases/spark-2.0.0-preview-docs/
>>>>>>>
>>>>>>> The list of resolved issues is:
>>>>>>>
>>>>>>> https://issues.apache.org/jira/browse/SPARK-15351?jql=project%20%3D%20SPARK%20AND%20fixVersion%20%3D%202.0.0
>>>>>>>
>>>>>>>
>>>>>>> If you are a Spark user, you can help us test this release by taking
>>>>>>> an
>>>>>>> existing Apache Spark workload and running on this candidate, then
>>>>>>> reporting
>>>>>>> any regressions.
>>>>>>>
>>>>>>>
>>>>>>> -
>>>>>>> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>>>>>>> For additional commands, e-mail: dev-h...@spark.apache.org
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>
>
--
Best Regards
Jeff Zhang
>>>>> > Regards,
>>>>> > Raghava.
>>>>> >
>>>>> > On Sun, Apr 17, 2016 at 10:54 PM, Anuj Kumar
>>>>> wrote:
>>>>> >
>>>>> >> If the data file is same then it should have similar distribution of
>>>>> >> keys.
>>>>> >> Few queries-
>>>>> >>
>>>>> >> 1. Did you compare the number of partitions in both the cases?
>>>>> >> 2. Did you compare the resource allocation for Spark Shell vs Scala
>>>>> >> Program being submitted?
>>>>> >>
>>>>> >> Also, can you please share the details of Spark Context,
>>>>> Environment and
>>>>> >> Executors when you run via Scala program?
>>>>> >>
>>>>> >> On Mon, Apr 18, 2016 at 4:41 AM, Raghava Mutharaju <
>>>>> >> m.vijayaragh...@gmail.com> wrote:
>>>>> >>
>>>>> >>> Hello All,
>>>>> >>>
>>>>> >>> We are using HashPartitioner in the following way on a 3 node
>>>>> cluster (1
>>>>> >>> master and 2 worker nodes).
>>>>> >>>
>>>>> >>> val u = sc.textFile("hdfs://x.x.x.x:8020/user/azureuser/s.txt")
>>>>> >>>   .map[(Int, Int)](line => { line.split("\\|") match { case Array(x, y) => (y.toInt, x.toInt) } })
>>>>> >>>   .partitionBy(new HashPartitioner(8)).setName("u").persist()
>>>>> >>>
>>>>> >>> u.count()
>>>>> >>>
>>>>> >>> If we run this from the spark shell, the data (52 MB) is split
>>>>> across
>>>>> >>> the
>>>>> >>> two worker nodes. But if we put this in a scala program and run
>>>>> it, then
>>>>> >>> all the data goes to only one node. We have run it multiple times,
>>>>> but
>>>>> >>> this
>>>>> >>> behavior does not change. This seems strange.
>>>>> >>>
>>>>> >>> Is there some problem with the way we use HashPartitioner?
>>>>> >>>
>>>>> >>> Thanks in advance.
>>>>> >>>
>>>>> >>> Regards,
>>>>> >>> Raghava.
>>>>> >>>
>>>>> >>
>>>>> >>
>>>>> >
>>>>> >
>>>>> > --
>>>>> > Regards,
>>>>> > Raghava
>>>>> > http://raghavam.github.io
>>>>> >
>>>>>
>>>>>
>>>>> --
>>>>> Thanks,
>>>>> Mike
>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> Regards,
>>>> Raghava
>>>> http://raghavam.github.io
>>>>
>>>
>>
>>
>> --
>> Regards,
>> Raghava
>> http://raghavam.github.io
>>
>
--
Best Regards
Jeff Zhang
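A minimal Scala sketch related to the HashPartitioner thread above (the HDFS path and partition count are illustrative, not taken from the original cluster): it counts the records in each partition so any skew becomes visible.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.HashPartitioner

val sc = new SparkContext(new SparkConf().setAppName("partition-skew-check"))

// Build the pair RDD roughly as in the thread and partition it explicitly.
val u = sc.textFile("hdfs:///user/azureuser/s.txt")   // illustrative path
  .map { line => line.split("\\|") match { case Array(x, y) => (y.toInt, x.toInt) } }
  .partitionBy(new HashPartitioner(8))
  .persist()

// Report how many records landed in each partition; heavily unbalanced counts
// indicate key skew, balanced counts point at executor allocation/locality instead.
val sizes = u.mapPartitionsWithIndex { (idx, it) => Iterator((idx, it.size)) }.collect()
sizes.foreach { case (idx, n) => println(s"partition $idx -> $n records") }

If the counts are balanced but the data still lands on one node, the difference between spark-shell and spark-submit is more likely in executor allocation than in the partitioner.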
]^
[error] four errors found
[error] Compile failed at Mar 17, 2016 2:45:22 PM [13.105s]
--
Best Regards
Jeff Zhang
7197/_tmp_space.db
>
>
> 4. executor uses hadoop.tmp.dir for s3 output
>
> 16/03/01 08:50:01 INFO s3native.NativeS3FileSystem: OutputStream for key
> 'test/p10_16/_SUCCESS' writing to tempfile
> '/data01/tmp/hadoop-hadoop/s3/output-2541604454681305094.tmp
n Sunday, February 28, 2016, Jeff Zhang wrote:
>
>> Data skew might be possible, but it is not the common case. I think we should
>> design for the common case; for the skew case, we could add a fraction
>> parameter to allow users to tune it.
>>
>> On Sat, Feb
ady do what you proposed by creating
> identical virtualenvs on all nodes at the same path and changing the Spark
> Python path to point to the virtualenv.
>
> Best Regards,
> Mohannad
> On Mar 1, 2016 06:07, "Jeff Zhang" wrote:
>
>> I have created jira for this f
- spark.pyspark.virtualenv.path (path to the executable for
virtualenv/conda)
Best Regards
Jeff Zhang
a01/tmp,/data02/tmp
>
> But the Spark master also writes some files to spark.local.dir,
> and my master box has only one additional disk, /data01.
>
> So, what should I use for spark.local.dir:
> spark.local.dir=/data01/tmp
> or
> spark.local.dir=/data01/tmp,/data02/tmp
>
> ?
>
--
Best Regards
Jeff Zhang
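A minimal sketch for the spark.local.dir question above (paths are illustrative). spark.local.dir takes a comma-separated list of scratch directories; a common approach is to configure it per node, e.g. via SPARK_LOCAL_DIRS in conf/spark-env.sh, so each machine lists only the disks it actually has.

import org.apache.spark.{SparkConf, SparkContext}

// Setting it in code applies to this application; per-node SPARK_LOCAL_DIRS
// settings override spark.local.dir and can differ between master and workers.
val conf = new SparkConf()
  .setAppName("local-dir-demo")
  .set("spark.local.dir", "/data01/tmp,/data02/tmp")   // worker boxes with two disks
val sc = new SparkContext(conf)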
e a possibility to have more fine grained control over these,
> like we do in a log4j appender, with a property file?
>
> Rgds
> --
> Niranda
> @n1r44 <https://twitter.com/N1R44>
> +94-71-554-8430
> https://pythagoreanscript.wordpress.com/
>
--
Best Regards
Jeff Zhang
the result data are in
> one or a few tasks though.
>
>
> On Friday, February 26, 2016, Jeff Zhang wrote:
>
>>
>> My job gets this exception very easily even when I set a large value for
>> spark.driver.maxResultSize. After checking the Spark code, I found
>>
asks (1085.0 MB)
is bigger than spark.driver.maxResultSize (1024.0 MB)
--
Best Regards
Jeff Zhang
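A minimal sketch around spark.driver.maxResultSize (sizes and paths are illustrative): the limit can be raised, but the usual fix is to avoid pulling large results back to the driver at all.

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("max-result-size-demo")
  .set("spark.driver.maxResultSize", "2g")   // default is 1g; "0" removes the limit
val sc = new SparkContext(conf)

val rdd = sc.parallelize(1 to 1000000)

// collect() counts against maxResultSize; writing from the executors does not.
rdd.saveAsTextFile("hdfs:///tmp/large-output")   // hypothetical output path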
ed/raw_result, is created with a _temporary
> folder, but the data is never written. The job hangs at this point,
> apparently indefinitely.
>
> Additionally, no logs are recorded or available for the jobs on the
> history server.
>
> What could be the problem?
>
--
Best Regards
Jeff Zhang
Created https://issues.apache.org/jira/browse/SPARK-12846
On Fri, Jan 15, 2016 at 3:29 PM, Jeff Zhang wrote:
> Right, I forgot about the documentation; I will create a follow-up JIRA.
>
> On Fri, Jan 15, 2016 at 3:23 PM, Shivaram Venkataraman <
> shiva...@eecs.berkeley.edu> wrote
' is not supported as of Spark
> 2.0.
> >>> Use ./bin/spark-submit
> >>
> >>
> >> Are we still running R tests? Or just saying that this will be
> deprecated?
> >>
> >> Kind regards,
> >>
> >> Herman van Hövell tot Westerflier
> >>
> >
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>
>
--
Best Regards
Jeff Zhang
>>>>>>>>>>> On Tue, Jan 5, 2016 at 2:27 PM, Julio Antonio Soto de Vicente <
>>>>>>>>>>> ju...@esbet.es> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Unfortunately, Koert is right.
>>>>>>>>>>>>
>>>>>>>>>>>> I've been in a couple of projects using Spark (banking
>>>>>>>>>>>> industry) where CentOS + Python 2.6 is the toolbox available.
>>>>>>>>>>>>
>>>>>>>>>>>> That said, I believe it should not be a concern for Spark.
>>>>>>>>>>>> Python 2.6 is old and busted, which is totally opposite to the
>>>>>>>>>>>> Spark
>>>>>>>>>>>> philosophy IMO.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> El 5 ene 2016, a las 20:07, Koert Kuipers
>>>>>>>>>>>> escribió:
>>>>>>>>>>>>
>>>>>>>>>>>> rhel/centos 6 ships with python 2.6, doesnt it?
>>>>>>>>>>>>
>>>>>>>>>>>> if so, i still know plenty of large companies where python 2.6
>>>>>>>>>>>> is the only option. asking them for python 2.7 is not going to work
>>>>>>>>>>>>
>>>>>>>>>>>> so i think its a bad idea
>>>>>>>>>>>>
>>>>>>>>>>>> On Tue, Jan 5, 2016 at 1:52 PM, Juliet Hougland <
>>>>>>>>>>>> juliet.hougl...@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> I don't see a reason Spark 2.0 would need to support Python
>>>>>>>>>>>>> 2.6. At this point, Python 3 should be the default that is
>>>>>>>>>>>>> encouraged.
>>>>>>>>>>>>> Most organizations acknowledge that 2.7 is common, but lagging
>>>>>>>>>>>>> behind the version they should theoretically use. Dropping python
>>>>>>>>>>>>> 2.6
>>>>>>>>>>>>> support sounds very reasonable to me.
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Tue, Jan 5, 2016 at 5:45 AM, Nicholas Chammas <
>>>>>>>>>>>>> nicholas.cham...@gmail.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> +1
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Red Hat supports Python 2.6 on RHEL 5 until 2020
>>>>>>>>>>>>>> <https://alexgaynor.net/2015/mar/30/red-hat-open-source-community/>,
>>>>>>>>>>>>>> but otherwise yes, Python 2.6 is ancient history and the core
>>>>>>>>>>>>>> Python
>>>>>>>>>>>>>> developers stopped supporting it in 2013. RHEL 5 is not a good
>>>>>>>>>>>>>> enough
>>>>>>>>>>>>>> reason to continue support for Python 2.6 IMO.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> We should aim to support Python 2.7 and Python 3.3+ (which I
>>>>>>>>>>>>>> believe we currently do).
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Nick
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Tue, Jan 5, 2016 at 8:01 AM Allen Zhang <
>>>>>>>>>>>>>> allenzhang...@126.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> plus 1,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> we are currently using python 2.7.2 in production
>>>>>>>>>>>>>>> environment.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On 2016-01-05 18:11:45, "Meethu Mathew" <
>>>>>>>>>>>>>>> meethu.mat...@flytxt.com> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> +1
>>>>>>>>>>>>>>> We use Python 2.7
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Meethu Mathew
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Tue, Jan 5, 2016 at 12:47 PM, Reynold Xin <
>>>>>>>>>>>>>>> r...@databricks.com> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Does anybody here care about us dropping support for Python
>>>>>>>>>>>>>>>> 2.6 in Spark 2.0?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Python 2.6 is ancient, and is pretty slow in many aspects
>>>>>>>>>>>>>>>> (e.g. json parsing) when compared with Python 2.7. Some
>>>>>>>>>>>>>>>> libraries that
>>>>>>>>>>>>>>>> Spark depend on stopped supporting 2.6. We can still convince
>>>>>>>>>>>>>>>> the library
>>>>>>>>>>>>>>>> maintainers to support 2.6, but it will be extra work. I'm
>>>>>>>>>>>>>>>> curious if
>>>>>>>>>>>>>>>> anybody still uses Python 2.6 to run Spark.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Thanks.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>
>>>>
>>>
>>>
>>
>
--
Best Regards
Jeff Zhang
Sorry, wrong list
On Tue, Jan 5, 2016 at 12:36 PM, Jeff Zhang wrote:
> I want to create a service check for Spark, but Spark doesn't use the Hadoop
> script as its launch script. I found other components use ExecuteHadoop to
> launch a Hadoop job to verify the service; I am wondering is there
y
but don't find how it associates with hadoop
--
Best Regards
Jeff Zhang
me)"))
df2.printSchema()
df2.show()
On Fri, Dec 25, 2015 at 3:44 PM, zml张明磊 wrote:
> Thanks, Jeff. It's not about choosing some columns of a Row; it's about choosing all
> the data in a column and converting it to an Array. Do you see what I mean?
>
>
>
> In Chinese
>
> What I want is, based on this column name, to take the
How can I achieve this function
> ?
>
>
>
> Thanks,
>
> Minglei.
>
--
Best Regards
Jeff Zhang
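A minimal sketch for the question above about turning one column into an Array (the DataFrame and column name are illustrative): select the single column, drop to the underlying RDD of Rows, and collect the values.

import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)   // sc: an existing SparkContext
val df = sqlContext.read.json("people.json")   // hypothetical input with a "name" column

// One column of every row, gathered on the driver as an Array.
val names: Array[String] = df.select("name").rdd.map(_.getString(0)).collect()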
of behavior
>>
>>- spark.mllib.tree.GradientBoostedTrees validationTol has changed
>>semantics in 1.6. Previously, it was a threshold for absolute change in
>>error. Now, it resembles the behavior of GradientDescent convergenceTol:
>>For large errors, it uses relative error (relative to the previous error);
>>for small errors (< 0.01), it uses absolute error.
>>- spark.ml.feature.RegexTokenizer: Previously, it did not convert
>>strings to lowercase before tokenizing. Now, it converts to lowercase by
>>default, with an option not to. This matches the behavior of the simpler
>>Tokenizer transformer.
>>- Spark SQL's partition discovery has been changed to only discover
>>partition directories that are children of the given path. (i.e. if
>>path="/my/data/x=1" then x=1 will no longer be considered a partition
>>but only children of x=1.) This behavior can be overridden by
>>manually specifying the basePath that partitioning discovery should
>>start with (SPARK-11678
>><https://issues.apache.org/jira/browse/SPARK-11678>).
>>- When casting a value of an integral type to timestamp (e.g. casting
>>a long value to timestamp), the value is treated as being in seconds
>>instead of milliseconds (SPARK-11724
>><https://issues.apache.org/jira/browse/SPARK-11724>).
>>- With the improved query planner for queries having distinct
>>aggregations (SPARK-9241
>><https://issues.apache.org/jira/browse/SPARK-9241>), the plan of a
>>query having a single distinct aggregation has been changed to a more
>>robust version. To switch back to the plan generated by Spark 1.5's
>>planner, please set spark.sql.specializeSingleDistinctAggPlanning to
>>true (SPARK-12077 <https://issues.apache.org/jira/browse/SPARK-12077>
>>).
>>
>>
>
--
Best Regards
Jeff Zhang
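A minimal sketch of two of the 1.6 behavior changes listed above (paths and values are illustrative): the basePath option for partition discovery and the seconds-based integral-to-timestamp cast.

import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)   // sc: an existing SparkContext

// Partition discovery now only looks below the given path; passing basePath keeps
// x=1 as a partition column even when loading /my/data/x=1 directly (SPARK-11678).
val part = sqlContext.read
  .option("basePath", "/my/data")
  .parquet("/my/data/x=1")

// An integral value cast to timestamp is now interpreted as seconds, not
// milliseconds (SPARK-11724).
sqlContext.sql("SELECT CAST(1450000000 AS TIMESTAMP)").show()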
. Now, it resembles the behavior of GradientDescent convergenceTol:
>>For large errors, it uses relative error (relative to the previous error);
>>for small errors (< 0.01), it uses absolute error.
>>- spark.ml.feature.RegexTokenizer: Previously, it did not convert
>>strings to lowercase before tokenizing. Now, it converts to lowercase by
>>default, with an option not to. This matches the behavior of the simpler
>>Tokenizer transformer.
>>- Spark SQL's partition discovery has been changed to only discover
>>partition directories that are children of the given path. (i.e. if
>>path="/my/data/x=1" then x=1 will no longer be considered a partition
>>but only children of x=1.) This behavior can be overridden by
>>manually specifying the basePath that partitioning discovery should
>>start with (SPARK-11678
>><https://issues.apache.org/jira/browse/SPARK-11678>).
>>- When casting a value of an integral type to timestamp (e.g. casting
>>a long value to timestamp), the value is treated as being in seconds
>>instead of milliseconds (SPARK-11724
>><https://issues.apache.org/jira/browse/SPARK-11724>).
>>- With the improved query planner for queries having distinct
>>aggregations (SPARK-9241
>><https://issues.apache.org/jira/browse/SPARK-9241>), the plan of a
>>query having a single distinct aggregation has been changed to a more
>>robust version. To switch back to the plan generated by Spark 1.5's
>>planner, please set spark.sql.specializeSingleDistinctAggPlanning to
>>true (SPARK-12077 <https://issues.apache.org/jira/browse/SPARK-12077>
>>).
>>
>>
>
>
> --
> Luciano Resende
> http://people.apache.org/~lresende
> http://twitter.com/lresende1975
> http://lresende.blogspot.com/
>
--
Best Regards
Jeff Zhang
[1]) but the Python API seems to have been
> changed to match Scala / Java in
> https://issues.apache.org/jira/browse/SPARK-6366
>
> Feel free to open a JIRA / PR for this.
>
> Thanks
> Shivaram
>
> [1] https://github.com/amplab-extras/SparkR-pkg/pull/199/files
>
>
It is inconsistent with the Scala API, which errors by default. Any reason for
that? Thanks
--
Best Regards
Jeff Zhang
Thanks Josh, created https://issues.apache.org/jira/browse/SPARK-12166
On Mon, Dec 7, 2015 at 4:32 AM, Josh Rosen wrote:
> I agree that we should unset this in our tests. Want to file a JIRA and
> submit a PR to do this?
>
> On Thu, Dec 3, 2015 at 6:40 PM Jeff Zhang wrote:
>
nState.java:522)
[info] at
org.apache.spark.sql.hive.client.ClientWrapper.<init>(ClientWrapper.scala:171)
[info] at
org.apache.spark.sql.hive.HiveContext.executionHive$lzycompute(HiveContext.scala:162)
[info] at
org.apache.spark.sql.hive.HiveContext.executionHive(HiveContext.scala:160)
--
Bes
el is positive, then I return 1, which is a correct classification, and I
> return zero otherwise. Do you have any idea how to classify a point as
> positive or negative using this score or another function?
>
> On Sat, Nov 28, 2015 at 5:14 AM, Jeff Zhang wrote:
>
>> if(
r call(Integer arg0, Integer arg1)
> throws Exception {
> return arg0+arg1;
> }});
>
> //compute accuracy as the percentage of the correctly classified
> examples
> double accuracy=((double)sum)/((double)classification.count());
> System.out.println("Accuracy = " + accuracy);
>
> }
> }
> );
> }
> }
>
--
Best Regards
Jeff Zhang
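A minimal sketch related to the classification question above (model and data names are illustrative, not the poster's code): under the default threshold SVMModel.predict already returns 0.0/1.0, and if clearThreshold() was called it returns the raw margin, whose sign gives the class; either way, accuracy is the fraction of matching labels.

import org.apache.spark.mllib.classification.SVMModel
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD

def accuracy(model: SVMModel, data: RDD[LabeledPoint]): Double = {
  val correct = data.map { p =>
    val score = model.predict(p.features)
    val predicted = if (score > 0.0) 1.0 else 0.0   // positive margin (or 1.0) => class 1
    if (predicted == p.label) 1L else 0L
  }.reduce(_ + _)
  correct.toDouble / data.count()
}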
ects, you should first copy them using a map function.
>
> Is there anyone who can shed some light on this bizarre behavior and the
> decisions behind it?
> And I would also like to know if anyone has been able to read a binary file and
> not incur the additional map() as suggested by the above. What format
> did you use?
>
> thanks
> Jeff
>
--
Best Regards
Jeff Zhang
Created https://issues.apache.org/jira/browse/SPARK-11798
On Wed, Nov 18, 2015 at 9:42 AM, Josh Rosen
wrote:
> Can you file a JIRA issue to help me triage this further? Thanks!
>
> On Tue, Nov 17, 2015 at 4:08 PM Jeff Zhang wrote:
>
>> Sure, hive profile is enabled.
Sure, hive profile is enabled.
On Wed, Nov 18, 2015 at 6:12 AM, Josh Rosen
wrote:
> Is the Hive profile enabled? I think it may need to be turned on in order
> for those JARs to be deployed.
>
> On Tue, Nov 17, 2015 at 2:27 AM Jeff Zhang wrote:
>
>> BTW, After I revert
BTW, After I revert SPARK-7841, I can see all the jars under
lib_managed/jars
On Tue, Nov 17, 2015 at 2:46 PM, Jeff Zhang wrote:
> Hi Josh,
>
> I notice the comments in https://github.com/apache/spark/pull/9575 said
> that Datanucleus related jars will still be copied to lib_
BTW, After I revert SPARK-7841, I can see all the jars under
lib_managed/jars
On Tue, Nov 17, 2015 at 2:46 PM, Jeff Zhang wrote:
> Hi Josh,
>
> I notice the comments in https://github.com/apache/spark/pull/9575 said
> that Datanucleus related jars will still be copied to lib_managed
"+assembly)
println("jars:"+jars.map(_.getAbsolutePath()).mkString(","))
//
On Mon, Nov 16, 2015 at 4:51 PM, Jeff Zhang wrote:
> This is the exception I got
>
> 15/11/16 16:50:48 WARN metastore.HiveMetaStore: Retrying creating default
> database afte
>> )
>> previous = current
>> i += 1
>> }
>>
>> -
>> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>> For additional commands, e-mail: dev-h...@spark.apache.org
>>
>>
>
--
Best Regards
Jeff Zhang
org.apache.hadoop.hive.metastore.MetaStoreUtils.newInstance(MetaStoreUtils.java:1521)
at
org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.<init>(RetryingMetaStoreClient.java:86)
On Mon, Nov 16, 2015 at 4:47 PM, Jeff Zhang wrote:
> It's about the datanucleus related jars which is needed by spark sql.
> Without th
l no
> longer place every dependency JAR into lib_managed. Can you say more about
> how this affected spark-shell for you (maybe share a stacktrace)?
>
> On Mon, Nov 16, 2015 at 12:03 AM, Jeff Zhang wrote:
>
>>
>> Sometimes, the jars under lib_managed is missing. And afte
Sometimes the jars under lib_managed are missing, and after I rebuild
Spark they are still not downloaded. This causes spark-shell to fail due to the
missing jars. Has anyone hit this weird issue?
--
Best Regards
Jeff Zhang
Didn't notice that I can pass comma-separated paths to the existing API
(SparkContext#textFile), so there is no need for a new API. Thanks all.
On Thu, Nov 12, 2015 at 10:24 AM, Jeff Zhang wrote:
> Hi Pradeep
>
> >>> Looks like what I was suggesting doesn't work. :/
> I gu
ready implemented a simple patch and it works.
On Thu, Nov 12, 2015 at 10:17 AM, Pradeep Gollakota
wrote:
> Looks like what I was suggesting doesn't work. :/
>
> On Wed, Nov 11, 2015 at 4:49 PM, Jeff Zhang wrote:
>
>> Yes, that's what I suggest. TextInputFormat suppor
ried this, but I think you should just be able to do
> sc.textFile("file1,file2,...")
>
> On Wed, Nov 11, 2015 at 4:30 PM, Jeff Zhang wrote:
>
>> I know these workarounds, but wouldn't it be more convenient and
>> straightforward to use SparkContext#textFile
e
>> multiple rdds. For example:
>>
>> val lines1 = sc.textFile("file1")
>> val lines2 = sc.textFile("file2")
>>
>> val rdd = lines1 union lines2
>>
>> regards,
>> --Jakob
>>
>> On 11 November 2015 at 01:20, Jeff Zh
r consideration
that I don't know.
--
Best Regards
Jeff Zhang
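A minimal sketch of the two approaches discussed above (paths are illustrative): textFile already accepts a comma-separated list of paths or globs, so an explicit union is only needed when the RDDs are built separately.

// Comma-separated paths in one call:
val combined = sc.textFile("hdfs:///data/file1.txt,hdfs:///data/file2.txt")

// Equivalent union of separately created RDDs:
val viaUnion = sc.textFile("hdfs:///data/file1.txt") union sc.textFile("hdfs:///data/file2.txt")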
for us to modify also spark-csv as you proposed in
> SPARK-11622?
>
> Regards
>
> Kai
>
> > On Nov 5, 2015, at 11:30 AM, Jeff Zhang wrote:
> >
> >
> > Not sure of the reason; it seems LibSVMRelation and CsvRelation can
> extend HadoopFsRelation and le
> probably not necessary for LibSVMRelation.
>
>
>
> But I think it would be easy to change them to extend HadoopFsRelation.
>
>
>
> Hao
>
>
>
> *From:* Jeff Zhang [mailto:zjf...@gmail.com]
> *Sent:* Thursday, November 5, 2015 10:31 AM
> *To:* dev@spark
Not sure of the reason; it seems LibSVMRelation and CsvRelation could extend
HadoopFsRelation and leverage its features. Is there any other
consideration behind that?
--
Best Regards
Jeff Zhang
t me try again, just to be sure.
>
> Regards
> JB
>
>
> On 11/03/2015 11:50 AM, Jeff Zhang wrote:
>
>> Looks like it's due to guava version conflicts, I see both guava 14.0.1
>> and 16.0.1 under lib_managed/bundles. Anyone meet this issue too ?
>>
>>
ookie = HashCodes.fromBytes(secret).toString()
[error] ^
--
Best Regards
Jeff Zhang
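A minimal sbt sketch (hypothetical build snippet, not the project's actual fix) for the Guava conflict above: pin Guava to a single version so only one copy ends up on the compile classpath.

// In build.sbt; 14.0.1 is the version the failing code compiles against here.
dependencyOverrides += "com.google.guava" % "guava" % "14.0.1"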
hema (name, age)
val df2 = df.join(df, "name") // schema (name, age, age)
df2.select("age") // ambiguous column reference.
--
Best Regards
Jeff Zhang
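A minimal sketch for the ambiguous-column case above (df and its columns are illustrative): alias each side of the self-join so the duplicated age column can be referenced unambiguously.

import org.apache.spark.sql.functions.col

val joined = df.as("l").join(df.as("r"), col("l.name") === col("r.name"))
joined.select(col("l.age"), col("r.age")).show()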
Stages and ResultStages there
> still should not be a performance penalty for this because the extra rounds
> of RPCs should only be performed when necessary.
>
>
> On 8/11/15 2:25 AM, Jeff Zhang wrote:
>
>> As my understanding, OutputCommitCoordinator should only be necessary f
In my understanding, OutputCommitCoordinator should only be necessary for
ResultStage (especially a ResultStage with an HDFS write), but currently it
is used for all stages. Is there any reason for that?
--
Best Regards
Jeff Zhang