Re: [Dataset API] SPARK-27249

2020-01-26 Thread nick


> On Jan 22, 2020, at 8:35 AM, Nick Afshartous  wrote:
> 
> Hello,
> 
> I'm looking into starting work on this ticket
> 
>  https://issues.apache.org/jira/browse/SPARK-27249 
> <https://issues.apache.org/jira/browse/SPARK-27249>
> 
> which involves adding an API for transforming Datasets.  In the comments I 
> have a question about whether or not this ticket is still necessary.
> 
> Could someone please review and advise.


Checking again to see if this ticket is still a valid task.  
--
  Nick





Re: Welcoming six new Apache Spark committers

2021-03-29 Thread Nick Pentreath
Congratulations to all the new committers. Welcome!


On Fri, 26 Mar 2021 at 22:22, Matei Zaharia  wrote:

> Hi all,
>
> The Spark PMC recently voted to add several new committers. Please join me
> in welcoming them to their new role! Our new committers are:
>
> - Maciej Szymkiewicz (contributor to PySpark)
> - Max Gekk (contributor to Spark SQL)
> - Kent Yao (contributor to Spark SQL)
> - Attila Zsolt Piros (contributor to decommissioning and Spark on
> Kubernetes)
> - Yi Wu (contributor to Spark Core and SQL)
> - Gabor Somogyi (contributor to Streaming and security)
>
> All six of them contributed to Spark 3.1 and we’re very excited to have
> them join as committers.
>
> Matei and the Spark PMC
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: [Apache Spark Jenkins] build system shutting down Dec 23rd, 2021

2021-12-06 Thread Nick Pentreath
Wow! end of an era

Thanks so much to you Shane for all your work over 10 (!!) years. And to
Amplab also!

Farewell Spark Jenkins!

N

On Tue, Dec 7, 2021 at 6:49 AM Nicholas Chammas 
wrote:

> Farewell to Jenkins and its classic weather forecast build status icons:
>
> [image: health-80plus.png][image: health-60to79.png][image:
> health-40to59.png][image: health-20to39.png][image: health-00to19.png]
>
> And thank you Shane for all the help over these years.
>
> Will you be nuking all the Jenkins-related code in the repo after the 23rd?
>
> On Mon, Dec 6, 2021 at 3:02 PM shane knapp ☠  wrote:
>
>> hey everyone!
>>
>> after a marathon run of nearly a decade, we're finally going to be
>> shutting down {amp|rise}lab jenkins at the end of this month...
>>
>> the earliest snapshot i could find is from 2013 with builds for spark 0.7:
>>
>> https://web.archive.org/web/20130426155726/https://amplab.cs.berkeley.edu/jenkins/
>>
>> it's been a hell of a run, and i'm gonna miss randomly tweaking the build
>> system, but technology has moved on and running a dedicated set of servers
>> for just one open source project is just too expensive for us here at uc
>> berkeley.
>>
>> if there's interest, i'll fire up a zoom session and all y'alls can watch
>> me type the final command:
>>
>> systemctl stop jenkins
>>
>> feeling bittersweet,
>>
>> shane
>> --
>> Shane Knapp
>> Computer Guy / Voice of Reason
>> UC Berkeley EECS Research / RISELab Staff Technical Lead
>> https://rise.cs.berkeley.edu
>>
>


[Dataset API] SPARK-27249

2020-01-22 Thread Nick Afshartous



Hello,

I'm looking into starting work on this ticket

  https://issues.apache.org/jira/browse/SPARK-27249

which involves adding an API for transforming Datasets.  In the comments 
I have a question about whether or not this ticket is still necessary.


Could someone please review and advise.

Cheers,
--
   Nick

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



PySpark .collect() output to Scala Array[Row]

2020-05-25 Thread Nick Ruest
Hi,

I've hit a wall trying to implement a couple of Scala methods in a Python
version of our project.

My Python function looks like this:

def Write_Graphml(data, graphml_path, sc):
    return sc.getOrCreate()._jvm.io.archivesunleashed.app.WriteGraphML(
        data, graphml_path).apply


Where data is a DataFrame that has been collected; data.collect().

On the Scala side it is basically:

object WriteGraphML {
  def apply(data: Array[Row], graphmlPath: String): Boolean = {
    // ...
    // massages an Array[Row] into GraphML
    // ...
    true
  }
}

When I try to use it in PySpark, I end up getting this error message:

Py4JError: An error occurred while calling
None.io.archivesunleashed.app.WriteGraphML. Trace:
py4j.Py4JException: Constructor
io.archivesunleashed.app.WriteGraphML([class java.util.ArrayList, class
java.lang.String]) does not exist
at
py4j.reflection.ReflectionEngine.getConstructor(ReflectionEngine.java:179)
at
py4j.reflection.ReflectionEngine.getConstructor(ReflectionEngine.java:196)
at py4j.Gateway.invoke(Gateway.java:237)
at
py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:80)
at py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:69)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:748)



Based on my research, I'm fairly certain it is because of how Py4J is
passing off the Python List (data) to the JVM, and then passing it to
Scala. It's ending up as an ArrayList instead of an Array[Row].

Do I need to tweak data before it is passed to Write_Graphml? Or am I
doing something else wrong here?
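
One shape a fix could take (just a sketch on my side, assuming the Scala
object can be changed) is an extra apply() that accepts the JVM DataFrame
(which Python can hand over as df._jdf) and collects it on the Scala side,
so nothing has to round-trip through Py4J as a Python list:

import org.apache.spark.sql.{DataFrame, Row}

object WriteGraphML {
  def apply(data: Array[Row], graphmlPath: String): Boolean = {
    // ... existing logic that massages an Array[Row] into GraphML ...
    true
  }

  // Py4J-friendly entry point (hypothetical): take the JVM DataFrame handle
  // and collect to Array[Row] here instead of collecting in Python.
  def apply(data: DataFrame, graphmlPath: String): Boolean =
    apply(data.collect(), graphmlPath)
}

The Python wrapper would then pass df._jdf and the GraphML path instead of
the collected list.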

...and not 100% sure if this is a user or dev list question. Let me know
if I should move this over to user.

Thanks in advance for any help!

cheers!

-nruest





Re: Welcoming some new Apache Spark committers

2020-07-14 Thread Nick Pentreath
Congratulations and welcome as Apache Spark committers!

On Wed, 15 Jul 2020 at 06:59, Prashant Sharma  wrote:

> Congratulations all ! It's great to have such committed folks as
> committers. :)
>
> On Wed, Jul 15, 2020 at 9:24 AM Yi Wu  wrote:
>
>> Congrats!!
>>
>> On Wed, Jul 15, 2020 at 8:02 AM Hyukjin Kwon  wrote:
>>
>>> Congrats!
>>>
>>> On Wed, Jul 15, 2020 at 7:56 AM, Takeshi Yamamuro wrote:
>>>
 Congrats, all!

 On Wed, Jul 15, 2020 at 5:15 AM Takuya UESHIN 
 wrote:

> Congrats and welcome!
>
> On Tue, Jul 14, 2020 at 1:07 PM Bryan Cutler 
> wrote:
>
>> Congratulations and welcome!
>>
>> On Tue, Jul 14, 2020 at 12:36 PM Xingbo Jiang 
>> wrote:
>>
>>> Welcome, Huaxin, Jungtaek, and Dilip!
>>>
>>> Congratulations!
>>>
>>> On Tue, Jul 14, 2020 at 10:37 AM Matei Zaharia <
>>> matei.zaha...@gmail.com> wrote:
>>>
 Hi all,

 The Spark PMC recently voted to add several new committers. Please
 join me in welcoming them to their new roles! The new committers are:

 - Huaxin Gao
 - Jungtaek Lim
 - Dilip Biswal

 All three of them contributed to Spark 3.0 and we’re excited to
 have them join the project.

 Matei and the Spark PMC

 -
 To unsubscribe e-mail: dev-unsubscr...@spark.apache.org


>
> --
> Takuya UESHIN
>
>

 --
 ---
 Takeshi Yamamuro

>>>


Re: Welcoming Tejas Patil as a Spark committer

2017-09-30 Thread Nick Pentreath
Congratulations!



>>
>> Matei Zaharia wrote
>> > Hi all,
>> >
>> > The Spark PMC recently added Tejas Patil as a committer on the
>> > project. Tejas has been contributing across several areas of Spark for
>> > a while, focusing especially on scalability issues and SQL. Please
>> > join me in welcoming Tejas!
>> >
>> > Matei
>> >
>> > -
>> > To unsubscribe e-mail:
>>
>> > dev-unsubscribe@.apache
>>
>>
>>
>>
>>
>> -
>> Liang-Chi Hsieh | @viirya
>> Spark Technology Center
>> http://www.spark.tc/
>> --
>> Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/
>>
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>
>>
>


Re: Should Flume integration be behind a profile?

2017-10-02 Thread Nick Pentreath
I'd agree with #1 or #2. Deprecation now seems fine.

Perhaps this should be raised on the user list also?

And perhaps it makes sense to look at moving the Flume support into Apache
Bahir if there is interest (I've cc'ed Bahir dev list here)? That way the
current state of the connector could keep going for those users who may
need it.

As for examples, for the Kinesis connector the examples now live in the
subproject (see e.g. KinesisWordCountASL under external/kinesis-asl). So we
don't have to completely remove the examples, just move them (this may not
solve the doc issue but at least the examples are still there for anyone
who needs them).

On Mon, 2 Oct 2017 at 06:36 Mridul Muralidharan  wrote:

> I agree, proposal 1 sounds better among the options.
>
> Regards,
> Mridul
>
>
> On Sun, Oct 1, 2017 at 3:50 PM, Reynold Xin  wrote:
> > Probably should do 1, and then it is an easier transition in 3.0.
> >
> > On Sun, Oct 1, 2017 at 1:28 AM Sean Owen  wrote:
> >>
> >> I tried and failed to do this in
> >> https://issues.apache.org/jira/browse/SPARK-22142 because it became
> clear
> >> that the Flume examples would have to be removed to make this work, too.
> >> (Well, you can imagine other solutions with extra source dirs or
> modules for
> >> flume examples enabled by a profile, but that doesn't help the docs and
> is
> >> nontrivial complexity for little gain.)
> >>
> >> It kind of suggests Flume support should be deprecated if it's put
> behind
> >> a profile. Like with Kafka 0.8. (This is why I'm raising it again to the
> >> whole list.)
> >>
> >> Any preferences among:
> >> 1. Put Flume behind a profile, remove examples, deprecate
> >> 2. Put Flume behind a profile, remove examples, but don't deprecate
> >> 3. Punt until Spark 3.0, when this integration would probably be removed
> >> entirely (?)
> >>
> >> On Tue, Sep 26, 2017 at 10:36 AM Sean Owen  wrote:
> >>>
> >>> Not a big deal, but I'm wondering whether Flume integration should at
> >>> least be opt-in and behind a profile? it still sees some use (at least
> on
> >>> our end) but not applicable to the majority of users. Most other
> third-party
> >>> framework integrations are behind a profile, like YARN, Mesos, Kinesis,
> >>> Kafka 0.8, Docker. Just soliciting comments, not arguing for it.
> >>>
> >>> (Well, actually it annoys me that the Flume integration always fails to
> >>> compile in IntelliJ unless you generate the sources manually)
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Configuration docs pages are broken

2017-10-03 Thread Nick Dimiduk
Heya,

Looks like the Configuration sections of your docs, both latest [0], and
2.1 [1] are broken. The last couple sections are smashed into a single
unrendered paragraph of markdown at the bottom.

Thanks,
Nick

[0]: https://spark.apache.org/docs/latest/configuration.html
[1]: https://spark.apache.org/docs/2.1.0/configuration.html


Re: [VOTE] Spark 2.1.2 (RC4)

2017-10-03 Thread Nick Pentreath
Hmm I'm consistently getting this error in core tests:

- SPARK-3697: ignore directories that cannot be read. *** FAILED ***
  2 was not equal to 1 (FsHistoryProviderSuite.scala:146)


Anyone else? Any insight? Perhaps it's my set up.


>>
>> On Tue, Oct 3, 2017 at 7:24 AM Holden Karau  wrote:
>>
>>> Please vote on releasing the following candidate as Apache Spark
>>> version 2.1.2. The vote is open until Saturday October 7th at 9:00
>>> PST and passes if a majority of at least 3 +1 PMC votes are cast.
>>>
>>> [ ] +1 Release this package as Apache Spark 2.1.2
>>> [ ] -1 Do not release this package because ...
>>>
>>>
>>> To learn more about Apache Spark, please see https://spark.apache.org/
>>>
>>> The tag to be voted on is v2.1.2-rc4
>>>  (
>>> 2abaea9e40fce81cd4626498e0f5c28a70917499)
>>>
>>> List of JIRA tickets resolved in this release can be found with this
>>> filter.
>>> 
>>>
>>> The release files, including signatures, digests, etc. can be found at:
>>> https://home.apache.org/~holden/spark-2.1.2-rc4-bin/
>>>
>>> Release artifacts are signed with a key from:
>>> https://people.apache.org/~holden/holdens_keys.asc
>>>
>>> The staging repository for this release can be found at:
>>> https://repository.apache.org/content/repositories/orgapachespark-1252
>>>
>>> The documentation corresponding to this release can be found at:
>>> https://people.apache.org/~holden/spark-2.1.2-rc4-docs/
>>>
>>>
>>> *FAQ*
>>>
>>> *How can I help test this release?*
>>>
>>> If you are a Spark user, you can help us test this release by taking an
>>> existing Spark workload and running on this release candidate, then
>>> reporting any regressions.
>>>
>>> If you're working in PySpark you can set up a virtual env and install
>>> the current RC and see if anything important breaks, in the Java/Scala
>>> you can add the staging repository to your projects resolvers and test with
>>> the RC (make sure to clean up the artifact cache before/after so you
>>> don't end up building with a out of date RC going forward).
>>>
>>> *What should happen to JIRA tickets still targeting 2.1.2?*
>>>
>>> Committers should look at those and triage. Extremely important bug
>>> fixes, documentation, and API tweaks that impact compatibility should be
>>> worked on immediately. Everything else please retarget to 2.1.3.
>>>
>>> *But my bug isn't fixed!??!*
>>>
>>> In order to make timely releases, we will typically not hold the release
>>> unless the bug in question is a regression from 2.1.1. That being said
>>> if there is something which is a regression form 2.1.1 that has not
>>> been correctly targeted please ping a committer to help target the issue
>>> (you can see the open issues listed as impacting Spark 2.1.1 & 2.1.2
>>> 
>>> )
>>>
>>> *What are the unresolved* issues targeted for 2.1.2
>>> 
>>> ?
>>>
>>> At this time there are no open unresolved issues.
>>>
>>> *Is there anything different about this release?*
>>>
>>> This is the first release in awhile not built on the AMPLAB Jenkins.
>>> This is good because it means future releases can more easily be built and
>>> signed securely (and I've been updating the documentation in
>>> https://github.com/apache/spark-website/pull/66 as I progress), however
>>> the chances of a mistake are higher with any change like this. If there
>>> something you normally take for granted as correct when checking a release,
>>> please double check this time :)
>>>
>>> *Should I be committing code to branch-2.1?*
>>>
>>> Thanks for asking! Please treat this stage in the RC process as "code
>>> freeze" so bug fixes only. If you're uncertain if something should be back
>>> ported please reach out. If you do commit to branch-2.1 please tag your
>>> JIRA issue fix version for 2.1.3 and if we cut another RC I'll move the
>>> 2.1.3 fixed into 2.1.2 as appropriate.
>>>
>>> *What happened to RC3?*
>>>
>>> Some R+zinc interactions kept it from getting out the door.
>>> --
>>> Twitter: https://twitter.com/holdenkarau
>>>
>>
>


Re: [VOTE] Spark 2.1.2 (RC4)

2017-10-04 Thread Nick Pentreath
Ah right! Was using a new cloud instance and didn't realize I was logged in
as root! thanks

On Tue, 3 Oct 2017 at 21:13 Marcelo Vanzin  wrote:

> Maybe you're running as root (or the admin account on your OS)?
>
> On Tue, Oct 3, 2017 at 12:12 PM, Nick Pentreath
>  wrote:
> > Hmm I'm consistently getting this error in core tests:
> >
> > - SPARK-3697: ignore directories that cannot be read. *** FAILED ***
> >   2 was not equal to 1 (FsHistoryProviderSuite.scala:146)
> >
> >
> > Anyone else? Any insight? Perhaps it's my set up.
> >
> >>>
> >>>
> >>> On Tue, Oct 3, 2017 at 7:24 AM Holden Karau 
> wrote:
> >>>>
> >>>> Please vote on releasing the following candidate as Apache Spark
> version
> >>>> 2.1.2. The vote is open until Saturday October 7th at 9:00 PST and
> passes if
> >>>> a majority of at least 3 +1 PMC votes are cast.
> >>>>
> >>>> [ ] +1 Release this package as Apache Spark 2.1.2
> >>>> [ ] -1 Do not release this package because ...
> >>>>
> >>>>
> >>>> To learn more about Apache Spark, please see
> https://spark.apache.org/
> >>>>
> >>>> The tag to be voted on is v2.1.2-rc4
> >>>> (2abaea9e40fce81cd4626498e0f5c28a70917499)
> >>>>
> >>>> List of JIRA tickets resolved in this release can be found with this
> >>>> filter.
> >>>>
> >>>> The release files, including signatures, digests, etc. can be found
> at:
> >>>> https://home.apache.org/~holden/spark-2.1.2-rc4-bin/
> >>>>
> >>>> Release artifacts are signed with a key from:
> >>>> https://people.apache.org/~holden/holdens_keys.asc
> >>>>
> >>>> The staging repository for this release can be found at:
> >>>>
> https://repository.apache.org/content/repositories/orgapachespark-1252
> >>>>
> >>>> The documentation corresponding to this release can be found at:
> >>>> https://people.apache.org/~holden/spark-2.1.2-rc4-docs/
> >>>>
> >>>>
> >>>> FAQ
> >>>>
> >>>> How can I help test this release?
> >>>>
> >>>> If you are a Spark user, you can help us test this release by taking
> an
> >>>> existing Spark workload and running on this release candidate, then
> >>>> reporting any regressions.
> >>>>
> >>>> If you're working in PySpark you can set up a virtual env and install
> >>>> the current RC and see if anything important breaks, in the
> Java/Scala you
> >>>> can add the staging repository to your projects resolvers and test
> with the
> >>>> RC (make sure to clean up the artifact cache before/after so you
> don't end
> >>>> up building with a out of date RC going forward).
> >>>>
> >>>> What should happen to JIRA tickets still targeting 2.1.2?
> >>>>
> >>>> Committers should look at those and triage. Extremely important bug
> >>>> fixes, documentation, and API tweaks that impact compatibility should
> be
> >>>> worked on immediately. Everything else please retarget to 2.1.3.
> >>>>
> >>>> But my bug isn't fixed!??!
> >>>>
> >>>> In order to make timely releases, we will typically not hold the
> release
> >>>> unless the bug in question is a regression from 2.1.1. That being
> said if
> >>>> there is something which is a regression form 2.1.1 that has not been
> >>>> correctly targeted please ping a committer to help target the issue
> (you can
> >>>> see the open issues listed as impacting Spark 2.1.1 & 2.1.2)
> >>>>
> >>>> What are the unresolved issues targeted for 2.1.2?
> >>>>
> >>>> At this time there are no open unresolved issues.
> >>>>
> >>>> Is there anything different about this release?
> >>>>
> >>>> This is the first release in awhile not built on the AMPLAB Jenkins.
> >>>> This is good because it means future releases can more easily be
> built and
> >>>> signed securely (and I've been updating the documentation in
> >>>> https://github.com/apache/spark-website/pull/66 as I progress),
> however the
> >>>> chances of a mistake are higher with any change like this. If there
> >>>> something you normally take for granted as correct when checking a
> release,
> >>>> please double check this time :)
> >>>>
> >>>> Should I be committing code to branch-2.1?
> >>>>
> >>>> Thanks for asking! Please treat this stage in the RC process as "code
> >>>> freeze" so bug fixes only. If you're uncertain if something should be
> back
> >>>> ported please reach out. If you do commit to branch-2.1 please tag
> your JIRA
> >>>> issue fix version for 2.1.3 and if we cut another RC I'll move the
> 2.1.3
> >>>> fixed into 2.1.2 as appropriate.
> >>>>
> >>>> What happened to RC3?
> >>>>
> >>>> Some R+zinc interactions kept it from getting out the door.
> >>>> --
> >>>> Twitter: https://twitter.com/holdenkarau
> >>
> >>
> >
>
>
>
> --
> Marcelo
>


Re: [VOTE] Spark 2.1.2 (RC4)

2017-10-06 Thread Nick Pentreath
Checked sigs & hashes.

Tested on RHEL
build/mvn -Phadoop-2.7 -Phive -Pyarn test passed
Python tests passed

I ran R tests and am getting some failures:
https://gist.github.com/MLnick/ddf4d531d5125208771beee0cc9c697e (I seem to
recall similar issues on a previous release but I thought it was fixed).

I re-ran R tests on an Ubuntu box to double check and they passed there.

So I'd still +1 the release

Perhaps someone can take a look at the R failures on RHEL just in case
though.


On Fri, 6 Oct 2017 at 05:58 vaquar khan  wrote:

> +1 (non-binding), tested on Ubuntu, all test cases passed.
>
> Regards,
> Vaquar khan
>
> On Thu, Oct 5, 2017 at 10:46 PM, Hyukjin Kwon  wrote:
>
>> +1 too.
>>
>>
>> On 6 Oct 2017 10:49 am, "Reynold Xin"  wrote:
>>
>> +1
>>
>>
>> On Mon, Oct 2, 2017 at 11:24 PM, Holden Karau 
>> wrote:
>>
>>> Please vote on releasing the following candidate as Apache Spark
>>> version 2.1.2. The vote is open until Saturday October 7th at 9:00
>>> PST and passes if a majority of at least 3 +1 PMC votes are cast.
>>>
>>> [ ] +1 Release this package as Apache Spark 2.1.2
>>> [ ] -1 Do not release this package because ...
>>>
>>>
>>> To learn more about Apache Spark, please see https://spark.apache.org/
>>>
>>> The tag to be voted on is v2.1.2-rc4
>>>  (
>>> 2abaea9e40fce81cd4626498e0f5c28a70917499)
>>>
>>> List of JIRA tickets resolved in this release can be found with this
>>> filter.
>>> 
>>>
>>> The release files, including signatures, digests, etc. can be found at:
>>> https://home.apache.org/~holden/spark-2.1.2-rc4-bin/
>>>
>>> Release artifacts are signed with a key from:
>>> https://people.apache.org/~holden/holdens_keys.asc
>>>
>>> The staging repository for this release can be found at:
>>> https://repository.apache.org/content/repositories/orgapachespark-1252
>>>
>>> The documentation corresponding to this release can be found at:
>>> https://people.apache.org/~holden/spark-2.1.2-rc4-docs/
>>>
>>>
>>> *FAQ*
>>>
>>> *How can I help test this release?*
>>>
>>> If you are a Spark user, you can help us test this release by taking an
>>> existing Spark workload and running on this release candidate, then
>>> reporting any regressions.
>>>
>>> If you're working in PySpark you can set up a virtual env and install
>>> the current RC and see if anything important breaks, in the Java/Scala
>>> you can add the staging repository to your projects resolvers and test with
>>> the RC (make sure to clean up the artifact cache before/after so you
>>> don't end up building with a out of date RC going forward).
>>>
>>> *What should happen to JIRA tickets still targeting 2.1.2?*
>>>
>>> Committers should look at those and triage. Extremely important bug
>>> fixes, documentation, and API tweaks that impact compatibility should be
>>> worked on immediately. Everything else please retarget to 2.1.3.
>>>
>>> *But my bug isn't fixed!??!*
>>>
>>> In order to make timely releases, we will typically not hold the release
>>> unless the bug in question is a regression from 2.1.1. That being said
>>> if there is something which is a regression form 2.1.1 that has not
>>> been correctly targeted please ping a committer to help target the issue
>>> (you can see the open issues listed as impacting Spark 2.1.1 & 2.1.2
>>> 
>>> )
>>>
>>> *What are the unresolved* issues targeted for 2.1.2
>>> 
>>> ?
>>>
>>> At this time there are no open unresolved issues.
>>>
>>> *Is there anything different about this release?*
>>>
>>> This is the first release in awhile not built on the AMPLAB Jenkins.
>>> This is good because it means future releases can more easily be built and
>>> signed securely (and I've been updating the documentation in
>>> https://github.com/apache/spark-website/pull/66 as I progress), however
>>> the chances of a mistake are higher with any change like this. If there
>>> something you normally take for granted as correct when checking a release,
>>> please double check this time :)
>>>
>>> *Should I be committing code to branch-2.1?*
>>>
>>> Thanks for asking! Please treat this stage in the RC process as "code
>>> freeze" so bug fixes only. If you're uncertain if something should be back
>>> ported please reach out. If you do commit to branch-2.1 please tag your
>>> JIRA issue fix version for 2.1.3 and if we cut another RC I'll move the
>>> 2.1.3 fixed into 2.1.2 as appropriate.
>>>
>>> *What happened to RC3?*
>>>
>>> Some R+zinc interactions kept it from getting out the door.
>>> --
>>> Twitter

Re: [VOTE] Spark 2.1.2 (RC4)

2017-10-06 Thread Nick Pentreath
Ah yes - I recall that it was fixed. Forgot it was for 2.3.0

My +1 vote stands.

On Fri, 6 Oct 2017 at 15:15 Hyukjin Kwon  wrote:

> Hi Nick,
>
> I believe that R test failure is due to SPARK-21093, at least the error
> message looks the same, and that is fixed from 2.3.0. This was not
> backported because I and the reviewers were worried, as that fix touched a
> very core part of SparkR (it was even reverted once after a very close look
> by some reviewers).
>
> I asked Michael earlier to note this as a known issue in
> https://spark.apache.org/releases/spark-release-2-2-0.html#known-issues
> because of this.
> I believe it should be fine, and we should probably note it if possible. I
> believe this should not be a regression anyway as, if I understood
> correctly, it has been there from the very beginning.
>
> Thanks.
>
>
>
>
> 2017-10-06 21:20 GMT+09:00 Nick Pentreath :
>
>> Checked sigs & hashes.
>>
>> Tested on RHEL
>> build/mvn -Phadoop-2.7 -Phive -Pyarn test passed
>> Python tests passed
>>
>> I ran R tests and am getting some failures:
>> https://gist.github.com/MLnick/ddf4d531d5125208771beee0cc9c697e (I seem
>> to recall similar issues on a previous release but I thought it was fixed).
>>
>> I re-ran R tests on an Ubuntu box to double check and they passed there.
>>
>> So I'd still +1 the release
>>
>> Perhaps someone can take a look at the R failures on RHEL just in case
>> though.
>>
>>
>> On Fri, 6 Oct 2017 at 05:58 vaquar khan  wrote:
>>
>>> +1 (non-binding), tested on Ubuntu, all test cases passed.
>>>
>>> Regards,
>>> Vaquar khan
>>>
>>> On Thu, Oct 5, 2017 at 10:46 PM, Hyukjin Kwon 
>>> wrote:
>>>
>>>> +1 too.
>>>>
>>>>
>>>> On 6 Oct 2017 10:49 am, "Reynold Xin"  wrote:
>>>>
>>>> +1
>>>>
>>>>
>>>> On Mon, Oct 2, 2017 at 11:24 PM, Holden Karau 
>>>> wrote:
>>>>
>>>>> Please vote on releasing the following candidate as Apache Spark
>>>>> version 2.1.2. The vote is open until Saturday October 7th at 9:00
>>>>> PST and passes if a majority of at least 3 +1 PMC votes are cast.
>>>>>
>>>>> [ ] +1 Release this package as Apache Spark 2.1.2
>>>>> [ ] -1 Do not release this package because ...
>>>>>
>>>>>
>>>>> To learn more about Apache Spark, please see https://spark.apache.org/
>>>>>
>>>>> The tag to be voted on is v2.1.2-rc4
>>>>> <https://github.com/apache/spark/tree/v2.1.2-rc4> (
>>>>> 2abaea9e40fce81cd4626498e0f5c28a70917499)
>>>>>
>>>>> List of JIRA tickets resolved in this release can be found with this
>>>>> filter.
>>>>> <https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20fixVersion%20%3D%202.1.2>
>>>>>
>>>>> The release files, including signatures, digests, etc. can be found at:
>>>>> https://home.apache.org/~holden/spark-2.1.2-rc4-bin/
>>>>>
>>>>> Release artifacts are signed with a key from:
>>>>> https://people.apache.org/~holden/holdens_keys.asc
>>>>>
>>>>> The staging repository for this release can be found at:
>>>>> https://repository.apache.org/content/repositories/orgapachespark-1252
>>>>>
>>>>> The documentation corresponding to this release can be found at:
>>>>> https://people.apache.org/~holden/spark-2.1.2-rc4-docs/
>>>>>
>>>>>
>>>>> *FAQ*
>>>>>
>>>>> *How can I help test this release?*
>>>>>
>>>>> If you are a Spark user, you can help us test this release by taking
>>>>> an existing Spark workload and running on this release candidate, then
>>>>> reporting any regressions.
>>>>>
>>>>> If you're working in PySpark you can set up a virtual env and install
>>>>> the current RC and see if anything important breaks, in the
>>>>> Java/Scala you can add the staging repository to your projects resolvers
>>>>> and test with the RC (make sure to clean up the artifact cache
>>>>> before/after so you don't end up building with a out of date RC going
>>>>> forward).
>>>>>
>>>>> *What should happen to JIRA tickets still targeting 2.1.2?*
>>>>>
&

Re: Timeline for Spark 2.3

2017-11-09 Thread Nick Pentreath
+1 I think that’s practical

On Fri, 10 Nov 2017 at 03:13, Erik Erlandson  wrote:

> +1 on extending the deadline. It will significantly improve the logistics
> for upstreaming the Kubernetes back-end.  Also agreed, on the general
> realities of reduced bandwidth over the Nov-Dec holiday season.
> Erik
>
> On Thu, Nov 9, 2017 at 6:03 PM, Matei Zaharia 
> wrote:
>
>> I’m also +1 on extending this to get Kubernetes and other features in.
>>
>> Matei
>>
>> > On Nov 9, 2017, at 4:04 PM, Anirudh Ramanathan
>>  wrote:
>> >
>> > This would help the community on the Kubernetes effort quite a bit -
>> giving us additional time for reviews and testing for the 2.3 release.
>> >
>> > On Thu, Nov 9, 2017 at 3:56 PM, Justin Miller <
>> justin.mil...@protectwise.com> wrote:
>> > That sounds fine to me. I’m hoping that this ticket can make it into
>> Spark 2.3: https://issues.apache.org/jira/browse/SPARK-18016
>> >
>> > It’s causing some pretty considerable problems when we alter the
>> columns to be nullable, but we are OK for now without that.
>> >
>> > Best,
>> > Justin
>> >
>> >> On Nov 9, 2017, at 4:54 PM, Michael Armbrust 
>> wrote:
>> >>
>> >> According to the timeline posted on the website, we are nearing branch
>> cut for Spark 2.3.  I'd like to propose pushing this out towards mid to
>> late December for a couple of reasons and would like to hear what people
>> think.
>> >>
>> >> 1. I've done release management during the Thanksgiving / Christmas
>> time before and in my experience, we don't actually get a lot of testing
>> during this time due to vacations and other commitments. I think beginning
>> the RC process in early January would give us the best coverage in the
>> shortest amount of time.
>> >> 2. There are several large initiatives in progress that given a little
>> more time would leave us with a much more exciting 2.3 release.
>> Specifically, the work on the history server, Kubernetes and continuous
>> processing.
>> >> 3. Given the actual release date of Spark 2.2, I think we'll still get
>> Spark 2.3 out roughly 6 months after.
>> >>
>> >> Thoughts?
>> >>
>> >> Michael
>> >
>> >
>>
>>
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>
>>
>


Re: CrossValidation distribution - is it in the roadmap?

2017-11-29 Thread Nick Pentreath
Hi Tomasz

Parallel evaluation for CrossValidation and TrainValidationSplit was added
for Spark 2.3 in https://issues.apache.org/jira/browse/SPARK-19357
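
In 2.3 this is exposed as a parallelism param on CrossValidator and
TrainValidationSplit: it controls how many parameter combinations are fitted
and evaluated concurrently, while each individual fit still runs as a normal
distributed Spark job. A minimal sketch, reusing the pipeline and paramGrid
from your snippet (the evaluator here is just illustrative):

import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.tuning.CrossValidator

// Fit and evaluate up to 8 parameter combinations at a time.
val cv = new CrossValidator()
  .setEstimator(pipeline)
  .setEstimatorParamMaps(paramGrid)
  .setEvaluator(new BinaryClassificationEvaluator())
  .setNumFolds(3)
  .setParallelism(8)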


On Wed, 29 Nov 2017 at 16:31 Tomasz Dudek 
wrote:

> Hey,
>
> is there a way to make the following code:
>
> val paramGrid = new ParamGridBuilder()
>   // ... omitted for brevity - let's say we have hundreds of param
>   // combinations here
>
> val cv = new CrossValidator()
>   .setNumFolds(3)
>   .setEstimator(pipeline)
>   .setEstimatorParamMaps(paramGrid)
>
> automatically distribute itself over all the executors? What I mean is
> to compute a few (or hundreds of) ML models simultaneously, instead of
> using all the computation power on just one model at a time.
>
> If not, is such behavior in the Spark's road map?
>
> ...if not, do you think a person without prior Spark development
> experience(me) could do it? I'm using SparkML daily, since few months, at
> work. How much time would it take, approximately?
>
> Yours,
> Tomasz
>
>
>


Re: [VOTE] Spark 2.3.0 (RC2)

2018-01-25 Thread Nick Pentreath
I think this has come up before (and Sean mentions it above), but the
sub-items on:

SPARK-23105 Spark MLlib, GraphX 2.3 QA umbrella

are actually marked as Blockers, but are not targeted to 2.3.0. I think
they should be, and I'm not comfortable with those not being resolved
before voting positively on the release.

So I'm -1 too for that reason.

I think most of those review items are close to done, and there is also
https://issues.apache.org/jira/browse/SPARK-22799 that I think should be in
for 2.3 (to avoid a behavior change later between 2.3.0 and 2.3.1,
especially since we'll have another RC now it seems).


On Thu, 25 Jan 2018 at 19:28 Marcelo Vanzin  wrote:

> Sorry, have to change my vote again. Hive guys ran into SPARK-23209
> and that's a regression we need to fix. I'll post a patch soon. So -1
> (although others have already -1'ed).
>
> On Wed, Jan 24, 2018 at 11:42 AM, Marcelo Vanzin 
> wrote:
> > Given that the bugs I was worried about have been dealt with, I'm
> > upgrading to +1.
> >
> > On Mon, Jan 22, 2018 at 5:09 PM, Marcelo Vanzin 
> wrote:
> >> +0
> >>
> >> Signatures check out. Code compiles, although I see the errors in [1]
> >> when untarring the source archive; perhaps we should add "use GNU tar"
> >> to the RM checklist?
> >>
> >> Also ran our internal tests and they seem happy.
> >>
> >> My concern is the list of open bugs targeted at 2.3.0 (ignoring the
> >> documentation ones). It is not long, but it seems some of those need
> >> to be looked at. It would be nice for the committers who are involved
> >> in those bugs to take a look.
> >>
> >> [1]
> https://superuser.com/questions/318809/linux-os-x-tar-incompatibility-tarballs-created-on-os-x-give-errors-when-unt
> >>
> >>
> >> On Mon, Jan 22, 2018 at 1:36 PM, Sameer Agarwal 
> wrote:
> >>> Please vote on releasing the following candidate as Apache Spark
> version
> >>> 2.3.0. The vote is open until Friday January 26, 2018 at 8:00:00 am
> UTC and
> >>> passes if a majority of at least 3 PMC +1 votes are cast.
> >>>
> >>>
> >>> [ ] +1 Release this package as Apache Spark 2.3.0
> >>>
> >>> [ ] -1 Do not release this package because ...
> >>>
> >>>
> >>> To learn more about Apache Spark, please see https://spark.apache.org/
> >>>
> >>> The tag to be voted on is v2.3.0-rc2:
> >>> https://github.com/apache/spark/tree/v2.3.0-rc2
> >>> (489ecb0ef23e5d9b705e5e5bae4fa3d871bdac91)
> >>>
> >>> List of JIRA tickets resolved in this release can be found here:
> >>> https://issues.apache.org/jira/projects/SPARK/versions/12339551
> >>>
> >>> The release files, including signatures, digests, etc. can be found at:
> >>> https://dist.apache.org/repos/dist/dev/spark/v2.3.0-rc2-bin/
> >>>
> >>> Release artifacts are signed with the following key:
> >>> https://dist.apache.org/repos/dist/dev/spark/KEYS
> >>>
> >>> The staging repository for this release can be found at:
> >>>
> https://repository.apache.org/content/repositories/orgapachespark-1262/
> >>>
> >>> The documentation corresponding to this release can be found at:
> >>>
> https://dist.apache.org/repos/dist/dev/spark/v2.3.0-rc2-docs/_site/index.html
> >>>
> >>>
> >>> FAQ
> >>>
> >>> ===
> >>> What are the unresolved issues targeted for 2.3.0?
> >>> ===
> >>>
> >>> Please see https://s.apache.org/oXKi. At the time of writing, there
> are
> >>> currently no known release blockers.
> >>>
> >>> =
> >>> How can I help test this release?
> >>> =
> >>>
> >>> If you are a Spark user, you can help us test this release by taking an
> >>> existing Spark workload and running on this release candidate, then
> >>> reporting any regressions.
> >>>
> >>> If you're working in PySpark you can set up a virtual env and install
> the
> >>> current RC and see if anything important breaks, in the Java/Scala you
> can
> >>> add the staging repository to your projects resolvers and test with
> the RC
> >>> (make sure to clean up the artifact cache before/after so you don't
> end up
> >>> building with a out of date RC going forward).
> >>>
> >>> ===
> >>> What should happen to JIRA tickets still targeting 2.3.0?
> >>> ===
> >>>
> >>> Committers should look at those and triage. Extremely important bug
> fixes,
> >>> documentation, and API tweaks that impact compatibility should be
> worked on
> >>> immediately. Everything else please retarget to 2.3.1 or 2.3.0 as
> >>> appropriate.
> >>>
> >>> ===
> >>> Why is my bug not fixed?
> >>> ===
> >>>
> >>> In order to make timely releases, we will typically not hold the
> release
> >>> unless the bug in question is a regression from 2.2.0. That being
> said, if
> >>> there is something which is a regression from 2.2.0 and has not been
> >>> correctly targeted please ping me or a committer to help target the
> issue
> >>> (you can see the open iss

Re: [VOTE] Spark 2.3.0 (RC2)

2018-02-01 Thread Nick Pentreath
All MLlib QA JIRAs resolved. Looks like SparkR too, so from the ML side
that should be everything outstanding.

On Thu, 1 Feb 2018 at 06:21 Yin Huai  wrote:

> seems we are not running tests related to pandas in pyspark tests (see my
> email "python tests related to pandas are skipped in jenkins"). I think we
> should fix this test issue and make sure all tests are good before cutting
> RC3.
>
> On Wed, Jan 31, 2018 at 10:12 AM, Sameer Agarwal 
> wrote:
>
>> Just a quick status update on RC3 -- SPARK-23274
>>  was resolved
>> yesterday and tests have been quite healthy throughout this week and the
>> last. I'll cut the new RC as soon as the remaining blocker (SPARK-23202
>> ) is resolved.
>>
>>
>> On 30 January 2018 at 10:12, Andrew Ash  wrote:
>>
>>> I'd like to nominate SPARK-23274
>>>  as a potential
>>> blocker for the 2.3.0 release as well, due to being a regression from
>>> 2.2.0.  The ticket has a simple repro included, showing a query that works
>>> in prior releases but now fails with an exception in the catalyst optimizer.
>>>
>>> On Fri, Jan 26, 2018 at 10:41 AM, Sameer Agarwal 
>>> wrote:
>>>
 This vote has failed due to a number of aforementioned blockers. I'll
 follow up with RC3 as soon as the 2 remaining (non-QA) blockers are
 resolved: https://s.apache.org/oXKi


 On 25 January 2018 at 12:59, Sameer Agarwal 
 wrote:

>
> Most tests pass on RC2, except I'm still seeing the timeout caused by
>> https://issues.apache.org/jira/browse/SPARK-23055 ; the tests never
>> finish. I followed the thread a bit further and wasn't clear whether it 
>> was
>> subsequently re-fixed for 2.3.0 or not. It says it's resolved along with
>> https://issues.apache.org/jira/browse/SPARK-22908 for 2.3.0 though I
>> am still seeing these tests fail or hang:
>>
>> - subscribing topic by name from earliest offsets (failOnDataLoss:
>> false)
>> - subscribing topic by name from earliest offsets (failOnDataLoss:
>> true)
>>
>
> Sean, while some of these tests were timing out on RC1, we're not
> aware of any known issues in RC2. Both maven (
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-maven-hadoop-2.6/146/testReport/org.apache.spark.sql.kafka010/history/)
> and sbt (
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-sbt-hadoop-2.6/123/testReport/org.apache.spark.sql.kafka010/history/)
> historical builds on jenkins for org.apache.spark.sql.kafka010 look fairly
> healthy. If you're still seeing timeouts in RC2, can you create a JIRA 
> with
> any applicable build/env info?
>
>
>
>> On Tue, Jan 23, 2018 at 9:01 AM Sean Owen  wrote:
>>
>>> I'm not seeing that same problem on OS X and /usr/bin/tar. I tried
>>> unpacking it with 'xvzf' and also unzipping it first, and it untarred
>>> without warnings in either case.
>>>
>>> I am encountering errors while running the tests, different ones
>>> each time, so am still figuring out whether there is a real problem or 
>>> just
>>> flaky tests.
>>>
>>> These issues look like blockers, as they are inherently to be
>>> completed before the 2.3 release. They are mostly not done. I suppose 
>>> I'd
>>> -1 on behalf of those who say this needs to be done first, though, we 
>>> can
>>> keep testing.
>>>
>>> SPARK-23105 Spark MLlib, GraphX 2.3 QA umbrella
>>> SPARK-23114 Spark R 2.3 QA umbrella
>>>
>>> Here are the remaining items targeted for 2.3:
>>>
>>> SPARK-15689 Data source API v2
>>> SPARK-20928 SPIP: Continuous Processing Mode for Structured Streaming
>>> SPARK-21646 Add new type coercion rules to compatible with Hive
>>> SPARK-22386 Data Source V2 improvements
>>> SPARK-22731 Add a test for ROWID type to OracleIntegrationSuite
>>> SPARK-22735 Add VectorSizeHint to ML features documentation
>>> SPARK-22739 Additional Expression Support for Objects
>>> SPARK-22809 pyspark is sensitive to imports with dots
>>> SPARK-22820 Spark 2.3 SQL API audit
>>>
>>>
>>> On Mon, Jan 22, 2018 at 7:09 PM Marcelo Vanzin 
>>> wrote:
>>>
 +0

 Signatures check out. Code compiles, although I see the errors in
 [1]
 when untarring the source archive; perhaps we should add "use GNU
 tar"
 to the RM checklist?

 Also ran our internal tests and they seem happy.

 My concern is the list of open bugs targeted at 2.3.0 (ignoring the
 documentation ones). It is not long, but it seems some of those need
 to be looked at. It would be nice for the committers who are
>>>

Re: redundant decision tree model

2018-02-13 Thread Nick Pentreath
There is a long outstanding JIRA issue about it:
https://issues.apache.org/jira/browse/SPARK-3155.

It is probably still a useful feature to have for trees but the priority is
not that high since it may not be that useful for the tree ensemble models.
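
If you want to gauge how often this happens on real models before opening a
PR, a small sketch using the public ml.tree node API can detect collapsible
subtrees (actually rebuilding the tree would need the private[ml] node
constructors, so this only measures the redundancy):

import org.apache.spark.ml.tree.{InternalNode, LeafNode, Node}

// Returns Some(prediction) if every leaf under `node` predicts the same
// value, i.e. the whole subtree could be replaced by a single leaf.
def collapsiblePrediction(node: Node): Option[Double] = node match {
  case leaf: LeafNode => Some(leaf.prediction)
  case internal: InternalNode =>
    (collapsiblePrediction(internal.leftChild),
      collapsiblePrediction(internal.rightChild)) match {
      case (Some(l), Some(r)) if l == r => Some(l)
      case _ => None
    }
}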

On Tue, 13 Feb 2018 at 11:52 Alessandro Solimando <
alessandro.solima...@gmail.com> wrote:

> Hello community,
> I have recently manually inspected some decision trees computed with Spark
> (2.2.1, but the behavior is the same with the latest code on the repo).
>
> I have observed that the trees are always complete, even if an entire
> subtree leads to the same prediction in its different leaves.
>
> In such case, the root of the subtree, instead of being an InternalNode,
> could simply be a LeafNode with the (shared) prediction.
>
> I know that decision trees computed by scikit-learn share the same
> behavior; I understand that this is needed by construction, because you
> only realize the redundancy at the end.
>
> So my question is, why is this "post-pruning" missing?
>
> Three hypothesis:
>
> 1) It is not suitable (for a reason I fail to see)
> 2) Such an addition to the code is considered not worth it (in terms of
> code complexity, maybe)
> 3) It has been overlooked, but could be a favorable addition
>
> For clarity, I have managed to isolate a small case to reproduce this, in
> what follows.
>
> This is the dataset:
>
>> +-+-+
>> |label|features |
>> +-+-+
>> |1.0  |[1.0,0.0,1.0]|
>> |1.0  |[0.0,1.0,0.0]|
>> |1.0  |[1.0,1.0,0.0]|
>> |0.0  |[0.0,0.0,0.0]|
>> |1.0  |[1.0,1.0,0.0]|
>> |0.0  |[0.0,1.0,1.0]|
>> |1.0  |[0.0,0.0,0.0]|
>> |0.0  |[0.0,1.0,1.0]|
>> |1.0  |[0.0,1.0,1.0]|
>> |0.0  |[1.0,0.0,0.0]|
>> |0.0  |[1.0,0.0,1.0]|
>> |1.0  |[0.0,1.0,1.0]|
>> |0.0  |[0.0,0.0,1.0]|
>> |0.0  |[1.0,0.0,1.0]|
>> |0.0  |[0.0,0.0,1.0]|
>> |0.0  |[1.0,1.0,1.0]|
>> |0.0  |[1.0,1.0,0.0]|
>> |1.0  |[1.0,1.0,1.0]|
>> |0.0  |[1.0,0.0,1.0]|
>> +-+-+
>
>
> Which generates the following model:
>
> DecisionTreeClassificationModel (uid=dtc_e794a5a3aa9e) of depth 3 with 15
>> nodes
>>   If (feature 1 <= 0.5)
>>If (feature 2 <= 0.5)
>> If (feature 0 <= 0.5)
>>  Predict: 0.0
>> Else (feature 0 > 0.5)
>>  Predict: 0.0
>>Else (feature 2 > 0.5)
>> If (feature 0 <= 0.5)
>>  Predict: 0.0
>> Else (feature 0 > 0.5)
>>  Predict: 0.0
>>   Else (feature 1 > 0.5)
>>If (feature 2 <= 0.5)
>> If (feature 0 <= 0.5)
>>  Predict: 1.0
>> Else (feature 0 > 0.5)
>>  Predict: 1.0
>>Else (feature 2 > 0.5)
>> If (feature 0 <= 0.5)
>>  Predict: 0.0
>> Else (feature 0 > 0.5)
>>  Predict: 0.0
>
>
> As you can see, the following model would be equivalent, but smaller and
> faster to evaluate:
>
> DecisionTreeClassificationModel (uid=dtc_e794a5a3aa9e) of depth 3 with 15
>> nodes
>>   If (feature 1 <= 0.5)
>>Predict: 0.0
>>   Else (feature 1 > 0.5)
>>If (feature 2 <= 0.5)
>> Predict: 1.0
>>Else (feature 2 > 0.5)
>> Predict: 0.0
>
>
> This happens pretty often in real cases, and despite the small gain in the
> single model invocation for the "optimized" version, it can become non
> negligible when the number of calls is massive, as one can expect in a Big
> Data context.
>
> I would appreciate your opinion on this matter (if relevant for a PR or
> not, pros/cons etc).
>
> Best regards,
> Alessandro
>


Re: [VOTE] Spark 2.3.0 (RC3)

2018-02-14 Thread Nick Pentreath
-1 for me as we elevated https://issues.apache.org/jira/browse/SPARK-23377 to
a Blocker. It should be fixed before release.

On Thu, 15 Feb 2018 at 07:25 Holden Karau  wrote:

> If this is a blocker in your view then the vote thread is an important
> place to mention it. I'm not super sure all of the places these methods are
> used so I'll defer to srowen and folks, but for the ML related implications
> in the past we've allowed people to set the hashing function when we've
> introduced changes.
>
> On Feb 15, 2018 2:08 PM, "mrkm4ntr"  wrote:
>
>> I was advised to post here in the discussion at GitHub. I do not know what
>> to do about the problem of the discussion being dispersed across two places.
>>
>>
>>
>> --
>> Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/
>>
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>
>>


Re: [VOTE] Spark 2.3.0 (RC5)

2018-02-27 Thread Nick Pentreath
+1 (binding)

Built and ran Scala tests with "-Phadoop-2.6 -Pyarn -Phive", all passed.

Python tests passed (also including pyspark-streaming w/kafka-0.8 and flume
packages built)

On Tue, 27 Feb 2018 at 10:09 Felix Cheung  wrote:

> +1
>
> Tested R:
>
> install from package, CRAN tests, manual tests, help check, vignettes check
>
> Filed this https://issues.apache.org/jira/browse/SPARK-23461
> This is not a regression so not a blocker of the release.
>
> Tested this on win-builder and r-hub. On r-hub on multiple platforms
> everything passed. For win-builder tests failed on x86 but passed x64 -
> perhaps due to an intermittent download issue causing a gzip error,
> re-testing now but won’t hold the release on this.
>
> --
> *From:* Nan Zhu 
> *Sent:* Monday, February 26, 2018 4:03:22 PM
> *To:* Michael Armbrust
> *Cc:* dev
> *Subject:* Re: [VOTE] Spark 2.3.0 (RC5)
>
> +1  (non-binding), tested with internal workloads and benchmarks
>
> On Mon, Feb 26, 2018 at 12:09 PM, Michael Armbrust  > wrote:
>
>> +1 all our pipelines have been running the RC for several days now.
>>
>> On Mon, Feb 26, 2018 at 10:33 AM, Dongjoon Hyun 
>> wrote:
>>
>>> +1 (non-binding).
>>>
>>> Bests,
>>> Dongjoon.
>>>
>>>
>>>
>>> On Mon, Feb 26, 2018 at 9:14 AM, Ryan Blue 
>>> wrote:
>>>
 +1 (non-binding)

 On Sat, Feb 24, 2018 at 4:17 PM, Xiao Li  wrote:

> +1 (binding) in Spark SQL, Core and PySpark.
>
> Xiao
>
> 2018-02-24 14:49 GMT-08:00 Ricardo Almeida <
> ricardo.alme...@actnowib.com>:
>
>> +1 (non-binding)
>>
>> same as previous RC
>>
>> On 24 February 2018 at 11:10, Hyukjin Kwon 
>> wrote:
>>
>>> +1
>>>
>>> 2018-02-24 16:57 GMT+09:00 Bryan Cutler :
>>>
 +1
 Tests passed and additionally ran Arrow related tests and did some
 perf checks with python 2.7.14

 On Fri, Feb 23, 2018 at 6:18 PM, Holden Karau >>> > wrote:

> Note: given the state of Jenkins I'd love to see Bryan Cutler or
> someone with Arrow experience sign off on this release.
>
> On Fri, Feb 23, 2018 at 6:13 PM, Cheng Lian  > wrote:
>
>> +1 (binding)
>>
>> Passed all the tests, looks good.
>>
>> Cheng
>>
>> On 2/23/18 15:00, Holden Karau wrote:
>>
>> +1 (binding)
>> PySpark artifacts install in a fresh Py3 virtual env
>>
>> On Feb 23, 2018 7:55 AM, "Denny Lee" 
>> wrote:
>>
>>> +1 (non-binding)
>>>
>>> On Fri, Feb 23, 2018 at 07:08 Josh Goldsborough <
>>> joshgoldsboroughs...@gmail.com> wrote:
>>>
 New to testing out Spark RCs for the community but I was able
 to run some of the basic unit tests without error so for what it's 
 worth,
 I'm a +1.

 On Thu, Feb 22, 2018 at 4:23 PM, Sameer Agarwal <
 samee...@apache.org> wrote:

> Please vote on releasing the following candidate as Apache
> Spark version 2.3.0. The vote is open until Tuesday February 27, 
> 2018 at
> 8:00:00 am UTC and passes if a majority of at least 3 PMC +1 
> votes are cast.
>
>
> [ ] +1 Release this package as Apache Spark 2.3.0
>
> [ ] -1 Do not release this package because ...
>
>
> To learn more about Apache Spark, please see
> https://spark.apache.org/
>
> The tag to be voted on is v2.3.0-rc5:
> https://github.com/apache/spark/tree/v2.3.0-rc5
> (992447fb30ee9ebb3cf794f2d06f4d63a2d792db)
>
> List of JIRA tickets resolved in this release can be found
> here:
> https://issues.apache.org/jira/projects/SPARK/versions/12339551
>
> The release files, including signatures, digests, etc. can be
> found at:
> https://dist.apache.org/repos/dist/dev/spark/v2.3.0-rc5-bin/
>
> Release artifacts are signed with the following key:
> https://dist.apache.org/repos/dist/dev/spark/KEYS
>
> The staging repository for this release can be found at:
>
> https://repository.apache.org/content/repositories/orgapachespark-1266/
>
> The documentation corresponding to this release can be found
> at:
>
> https://dist.apache.org/repos/dist/dev/spark/v2.3.0-rc5-docs/_site/index.html
>
>
> FAQ
>
> ===
> What are the unresolved issues targeted for 2.3.0?
> ===

Re: Welcome Zhenhua Wang as a Spark committer

2018-04-02 Thread Nick Pentreath
Congratulations!

On Tue, 3 Apr 2018 at 05:34 wangzhenhua (G)  wrote:

>
>
> Thanks everyone! It’s my great pleasure to be part of such a professional
> and innovative community!
>
>
>
>
>
> best regards,
>
> -Zhenhua(Xander)
>
>
>


Re: Revisiting Online serving of Spark models?

2018-06-05 Thread Nick Pentreath
I will aim to join up at 4pm tomorrow (Wed) too. Look forward to it.

On Sun, 3 Jun 2018 at 00:24 Holden Karau  wrote:

> On Sat, Jun 2, 2018 at 8:39 PM, Maximiliano Felice <
> maximilianofel...@gmail.com> wrote:
>
>> Hi!
>>
>> We're already in San Francisco waiting for the summit. We even think that
>> we spotted @holdenk this afternoon.
>>
> Unless you happened to be walking by my garage probably not super likely,
> spent the day working on scooters/motorcycles (my style is a little less
> unique in SF :)). Also if you see me feel free to say hi unless I look like
> I haven't had my first coffee of the day, love chatting with folks IRL :)
>
>>
>> @chris, we're really interested in the Meetup you're hosting. My team
>> will probably join it from the beginning if you have room for us, and I'll
>> join later after discussing the topics on this thread. I'll send you an
>> email regarding this request.
>>
>> Thanks
>>
>> El vie., 1 de jun. de 2018 7:26 AM, Saikat Kanjilal 
>> escribió:
>>
>>> @Chris This sounds fantastic, please send summary notes for Seattle
>>> folks
>>>
>>> @Felix I work in downtown Seattle and am wondering if we should host a
>>> tech meetup around model serving in Spark at my work or somewhere else
>>> close by, thoughts?  I'm actually in the midst of building microservices
>>> to manage models, and when I say models I mean much more than machine
>>> learning models (think OR and process models as well).
>>>
>>> Regards
>>>
>>> Sent from my iPhone
>>>
>>> On May 31, 2018, at 10:32 PM, Chris Fregly  wrote:
>>>
>>> Hey everyone!
>>>
>>> @Felix:  thanks for putting this together.  i sent some of you a quick
>>> calendar event - mostly for me, so i don’t forget!  :)
>>>
>>> Coincidentally, this is the focus of June 6th's *Advanced Spark and
>>> TensorFlow Meetup*
>>> <https://www.meetup.com/Advanced-Spark-and-TensorFlow-Meetup/events/250924195/>
>>>  @5:30pm
>>> on June 6th (same night) here in SF!
>>>
>>> Everybody is welcome to come.  Here’s the link to the meetup that
>>> includes the signup link:
>>> *https://www.meetup.com/Advanced-Spark-and-TensorFlow-Meetup/events/250924195/*
>>> <https://www.meetup.com/Advanced-Spark-and-TensorFlow-Meetup/events/250924195/>
>>>
>>> We have an awesome lineup of speakers covered a lot of deep, technical
>>> ground.
>>>
>>> For those who can’t attend in person, we’ll be broadcasting live - and
>>> posting the recording afterward.
>>>
>>> All details are in the meetup link above…
>>>
>>> @holden/felix/nick/joseph/maximiliano/saikat/leif:  you’re more than
>>> welcome to give a talk. I can move things around to make room.
>>>
>>> @joseph:  I’d personally like an update on the direction of the
>>> Databricks proprietary ML Serving export format which is similar to PMML
>>> but not a standard in any way.
>>>
>>> Also, the Databricks ML Serving Runtime is only available to Databricks
>>> customers.  This seems in conflict with the community efforts described
>>> here.  Can you comment on behalf of Databricks?
>>>
>>> Look forward to your response, joseph.
>>>
>>> See you all soon!
>>>
>>> —
>>>
>>>
>>> *Chris Fregly *Founder @ *PipelineAI* <https://pipeline.ai/> (100,000
>>> Users)
>>> Organizer @ *Advanced Spark and TensorFlow Meetup*
>>> <https://www.meetup.com/Advanced-Spark-and-TensorFlow-Meetup/> (85,000
>>> Global Members)
>>>
>>>
>>>
>>> *San Francisco - Chicago - Austin -  Washington DC - London - Dusseldorf
>>> *
>>> *Try our PipelineAI Community Edition with GPUs and TPUs!!
>>> <http://community.pipeline.ai/>*
>>>
>>>
>>> On May 30, 2018, at 9:32 AM, Felix Cheung 
>>> wrote:
>>>
>>> Hi!
>>>
>>> Thank you! Let’s meet then
>>>
>>> June 6 4pm
>>>
>>> Moscone West Convention Center
>>> 800 Howard Street, San Francisco, CA 94103
>>>
>>> Ground floor (outside of conference area - should be available for all)
>>> - we will meet and decide where to go
>>>
>>> (Would not send invite because that woul

Re: [VOTE] Release Apache Spark 2.0.0 (RC1)

2016-07-04 Thread Nick Pentreath
Hey everyone,

Is there an updated timeline for cutting the next RC? Do we have a
clear picture of outstanding issues? I see 21 issues marked Blocker or
Critical targeted at 2.0.0.

The only blockers I see on JIRA are related to MLlib doc updates etc (I
will go through a few of these to clean them up and see where they stand).
If there are other blockers then we should mark them as such to help track
progress.


On Tue, 28 Jun 2016 at 11:28 Nick Pentreath 
wrote:

> I take it there will be another RC due to some blockers and as there were
> no +1 votes anyway.
>
> FWIW, I cannot run python tests using "./python/run-tests".
>
> I'd be -1 for this reason (see https://github.com/apache/spark/pull/13737 /
> http://issues.apache.org/jira/browse/SPARK-15954) - does anyone else
> encounter this?
>
> ./python/run-tests --python-executables=python2.7
> Running PySpark tests. Output is in
> /Users/nick/workspace/scala/spark-rcs/spark-2.0.0/python/unit-tests.log
> Will test against the following Python executables: ['python2.7']
> Will test the following Python modules: ['pyspark-core', 'pyspark-ml',
> 'pyspark-mllib', 'pyspark-sql', 'pyspark-streaming']
> Using Spark's default log4j profile:
> org/apache/spark/log4j-defaults.properties
> Setting default log level to "WARN".
> To adjust logging level use sc.setLogLevel(newLevel).
> ==
> ERROR: setUpClass (pyspark.sql.tests.HiveContextSQLTests)
> --
> Traceback (most recent call last):
>   File
> "/Users/nick/workspace/scala/spark-rcs/spark-2.0.0/python/pyspark/sql/tests.py",
> line 1620, in setUpClass
> cls.spark = HiveContext._createForTesting(cls.sc)
>   File
> "/Users/nick/workspace/scala/spark-rcs/spark-2.0.0/python/pyspark/sql/context.py",
> line 490, in _createForTesting
> jtestHive =
> sparkContext._jvm.org.apache.spark.sql.hive.test.TestHiveContext(jsc)
>   File
> "/Users/nick/workspace/scala/spark-rcs/spark-2.0.0/python/lib/py4j-0.10.1-src.zip/py4j/java_gateway.py",
> line 1183, in __call__
> answer, self._gateway_client, None, self._fqn)
>   File
> "/Users/nick/workspace/scala/spark-rcs/spark-2.0.0/python/lib/py4j-0.10.1-src.zip/py4j/protocol.py",
> line 312, in get_return_value
> format(target_id, ".", name), value)
> Py4JJavaError: An error occurred while calling
> None.org.apache.spark.sql.hive.test.TestHiveContext.
> : java.lang.NullPointerException
> at
> org.apache.spark.sql.hive.test.TestHiveSparkSession.getHiveFile(TestHive.scala:183)
> at
> org.apache.spark.sql.hive.test.TestHiveSparkSession.(TestHive.scala:214)
> at
> org.apache.spark.sql.hive.test.TestHiveSparkSession.(TestHive.scala:122)
> at org.apache.spark.sql.hive.test.TestHiveContext.(TestHive.scala:77)
> at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
> at
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
> at
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
> at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
> at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:240)
> at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
> at py4j.Gateway.invoke(Gateway.java:236)
> at
> py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:80)
> at py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:69)
> at py4j.GatewayConnection.run(GatewayConnection.java:211)
> at java.lang.Thread.run(Thread.java:745)
>
>
> ======
> ERROR: setUpClass (pyspark.sql.tests.SQLTests)
> --
> Traceback (most recent call last):
>   File
> "/Users/nick/workspace/scala/spark-rcs/spark-2.0.0/python/pyspark/sql/tests.py",
> line 189, in setUpClass
> ReusedPySparkTestCase.setUpClass()
>   File
> "/Users/nick/workspace/scala/spark-rcs/spark-2.0.0/python/pyspark/tests.py",
> line 344, in setUpClass
> cls.sc = SparkContext('local[4]', cls.__name__)
>   File
> "/Users/nick/workspace/scala/spark-rcs/spark-2.0.0/python/pyspark/context.py",
> line 112, in __init__
> SparkContext._ensure_initialized(self, gateway=gateway)
>   File
> "/Users/nick/workspace/scala/spark-rcs/spark-2.0.0/python/pyspark/context.py",
> line 261, in _ensure_initialized
> callsite.function, callsite.file, callsite.linenu

Re: Internal Deprecation warnings - worth fixing?

2016-07-27 Thread Nick Pentreath
+1 I don't believe there's any reason for the warnings to still be there
except for available dev time & focus :)

On Wed, 27 Jul 2016 at 21:35, Jacek Laskowski  wrote:

> Kill 'em all -- one by one slowly yet gradually! :)
>
> Pozdrawiam,
> Jacek Laskowski
> 
> https://medium.com/@jaceklaskowski/
> Mastering Apache Spark http://bit.ly/mastering-apache-spark
> Follow me at https://twitter.com/jaceklaskowski
>
>
> On Wed, Jul 27, 2016 at 9:11 PM, Holden Karau 
> wrote:
> > Now that the 2.0 release is out the door and I've got some cycles to do
> some
> > cleanups -  I'd like to know what other people think of the internal
> > deprecation warnings we've introduced in a lot of a places in our code.
> Once
> > before I did some minor refactoring so the Python code which had to use
> the
> > deprecated code to expose the deprecated API wouldn't gum up the build
> logs
> > - but is there interest in doing that or are we more interested in not
> > paying attention to the deprecation warnings for internal Spark
> components
> > (e.g. https://twitter.com/thepracticaldev/status/725769766603001856 )?
> >
> >
> > --
> > Cell : 425-233-8271
> > Twitter: https://twitter.com/holdenkarau
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: [MLlib] Term Frequency in TF-IDF seems incorrect

2016-08-02 Thread Nick Pentreath
Note that both HashingTF and CountVectorizer are usually used for creating
TF-IDF normalized vectors. The definition (
https://en.wikipedia.org/wiki/Tf%E2%80%93idf#Definition) of term frequency
in TF-IDF is actually the "number of times the term occurs in the document".

So it's perhaps a bit of a misnomer, but the implementation is correct.
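If you do want counts divided by the document length, a small UDF over the
HashingTF/CountVectorizer output is enough. A rough, untested sketch (the
"words" and "rawFeatures" column names are just placeholders for the tokenized
input and the HashingTF output):

import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.sql.functions.{col, size, udf}

// Divide each raw count by the document length to get a relative frequency.
val toRelativeFreq = udf { (v: Vector, docLen: Int) =>
  val sv = v.toSparse
  Vectors.sparse(sv.size, sv.indices, sv.values.map(_ / math.max(docLen, 1)))
}

val withTf = featurizedData
  .withColumn("tf", toRelativeFreq(col("rawFeatures"), size(col("words"))))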

On Tue, 2 Aug 2016 at 05:44 Yanbo Liang  wrote:

> Hi Hao,
>
> HashingTF directly apply a hash function (Murmurhash3) to the features to
> determine their column index. It excluded any thought about the term
> frequency or the length of the document. It does similar work compared with
> sklearn FeatureHasher. The result is increased speed and reduced memory
> usage, but it does not remember what the input features looked like and can
> not convert the output back to the original features. Actually we misnamed
> this transformer, it only does the work of feature hashing rather than
> computing hashing term frequency.
>
> CountVectorizer will select the top vocabSize words ordered by term
> frequency across the corpus to build the hash table of the features. So it
> will consume more memory than HashingTF. However, we can convert the output
> back to the original feature.
>
> Both of the transformers do not consider the length of each document. If
> you want to compute term frequency divided by the length of the document,
> you should write your own function based on transformers provided by MLlib.
>
> Thanks
> Yanbo
>
> 2016-08-01 15:29 GMT-07:00 Hao Ren :
>
>> When computing term frequency, we can use either HashTF or
>> CountVectorizer feature extractors.
>> However, both of them just use the number of times that a term appears in
>> a document.
>> It is not a true frequency. Acutally, it should be divided by the length
>> of the document.
>>
>> Is this a wanted feature ?
>>
>> --
>> Hao Ren
>>
>> Data Engineer @ leboncoin
>>
>> Paris, France
>>
>
>


Re: Serving Spark ML models via a regular Python web app

2016-08-11 Thread Nick Pentreath
Currently there is no direct way in Spark to serve models without bringing
in all of Spark as a dependency.

For Spark ML, there is actually no way to do it independently of DataFrames
either (which for single-instance prediction makes things sub-optimal).
That is covered here: https://issues.apache.org/jira/browse/SPARK-10413

So, your options are (in Scala) things like MLeap, PredictionIO, or "roll
your own". Or you can try to export to some other format such as PMML or
PFA. Some MLlib models support PMML export, but for ML it is still missing
(see https://issues.apache.org/jira/browse/SPARK-11171).
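For the MLlib models that do support it, the export is a one-liner - a quick,
illustrative sketch only (KMeans is just an example; the toy input data and
paths are made up):

import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

val vectorRdd = sc.parallelize(Seq(
  Vectors.dense(0.0, 0.0), Vectors.dense(1.0, 1.0), Vectors.dense(9.0, 8.0)))
val kmeansModel = KMeans.train(vectorRdd, 2, 20)
val pmmlXml = kmeansModel.toPMML()      // PMML document as a String
kmeansModel.toPMML("/tmp/kmeans.pmml")  // or write it straight to a local path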

There is an external project for PMML too (note licensing) -
https://github.com/jpmml/jpmml-sparkml - which is by now actually quite
comprehensive. It shows that PMML can represent a pretty large subset of
typical ML pipeline functionality.

On the Python side sadly there is even less - I would say your options are
pretty much "roll your own" currently, or export in PMML or PFA.

Finally, part of the "mllib-local" idea was around enabling this local
model-serving (for some initial discussion about the future see
https://issues.apache.org/jira/browse/SPARK-16365).

N

On Thu, 11 Aug 2016 at 06:28 Michael Allman  wrote:

> Nick,
>
> Check out MLeap: https://github.com/TrueCar/mleap. It's not python, but
> we use it in production to serve a random forest model trained by a Spark
> ML pipeline.
>
> Thanks,
>
> Michael
>
> On Aug 10, 2016, at 7:50 PM, Nicholas Chammas 
> wrote:
>
> Are there any existing JIRAs covering the possibility of serving up Spark
> ML models via, for example, a regular Python web app?
>
> The story goes like this: You train your model with Spark on several TB of
> data, and now you want to use it in a prediction service that you’re
> building, say with Flask <http://flask.pocoo.org/>. In principle, you
> don’t need Spark anymore since you’re just passing individual data points
> to your model and looking for it to spit some prediction back.
>
> I assume this is something people do today, right? I presume Spark needs
> to run in their web service to serve up the model. (Sorry, I’m new to the
> ML side of Spark. 😅)
>
> Are there any JIRAs discussing potential improvements to this story? I did
> a search, but I’m not sure what exactly to look for. SPARK-4587
> <https://issues.apache.org/jira/browse/SPARK-4587> (model import/export)
> looks relevant, but doesn’t address the story directly.
>
> Nick
> ​
>
>
>


Re: Java 8

2016-08-20 Thread Nick Pentreath
Spark already supports compiling with Java 8. What refactoring are you
referring to, and where do you expect to see performance gains?

On Sat, 20 Aug 2016 at 12:41, Timur Shenkao  wrote:

> Hello, guys!
>
> Are there any plans / tickets / branches in repository on Java 8?
>
> I ask because ML library will gain in performance. I'd like to take part
> in refactoring.
>


Re: Why can't a Transformer have multiple output columns?

2016-08-23 Thread Nick Pentreath
It's not that a Transformer can't output multiple columns - it's simply that
none of the current ones do. It's true that it might be a relatively less
common use case in general.

But take StringIndexer for example. It turns strings (categorical features)
into ints (0-based indexes). It could (should) accept multiple input
columns for efficiency (see
https://issues.apache.org/jira/browse/SPARK-11215). This is a case where
multiple output columns would be required.
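
Nothing in the Transformer API itself prevents this, by the way - transform()
can add as many columns as it likes. A toy, untested sketch with two outputs
(the "text" input column and the output names are made up):

import org.apache.spark.ml.Transformer
import org.apache.spark.ml.param.ParamMap
import org.apache.spark.ml.util.Identifiable
import org.apache.spark.sql.{DataFrame, Dataset}
import org.apache.spark.sql.functions.{col, length, lower}
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

class TwoColumnStats(override val uid: String) extends Transformer {
  def this() = this(Identifiable.randomUID("twoColStats"))

  // Adds two new columns in one pass over the input.
  override def transform(dataset: Dataset[_]): DataFrame = {
    dataset
      .withColumn("text_lower", lower(col("text")))
      .withColumn("text_length", length(col("text")))
  }

  override def transformSchema(schema: StructType): StructType = {
    StructType(schema.fields ++ Seq(
      StructField("text_lower", StringType, nullable = true),
      StructField("text_length", IntegerType, nullable = true)))
  }

  override def copy(extra: ParamMap): TwoColumnStats = defaultCopy(extra)
}

So the missing piece is only a HasOutputCols convenience trait, not the
ability to write such a transformer.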

N

On Tue, 23 Aug 2016 at 16:15 Nicholas Chammas 
wrote:

> If you create your own Spark 2.x ML Transformer, there are multiple
> mix-ins (is that the correct term?) that you can use to define its behavior
> which are in ml/param/shared.py
> <https://github.com/apache/spark/blob/master/python/pyspark/ml/param/shared.py>
> .
>
> Among them are the following mix-ins:
>
>- HasInputCol
>- HasInputCols
>- HasOutputCol
>
> What’s *not* available is a HasOutputCols mix-in, and I assume that is
> intentional.
>
> Is there a design reason why Transformers should not be able to define
> multiple output columns?
>
> I’m guessing if you are an ML beginner who thinks they need a Transformer
> with multiple output columns, you’ve misunderstood something. 😅
>
> Nick
> ​
>


Re: Organizing Spark ML example packages

2016-09-12 Thread Nick Pentreath
Never actually got around to doing this - do folks still think it
worthwhile?

On Thu, 21 Apr 2016 at 00:10 Joseph Bradley  wrote:

> Sounds good to me.  I'd request we be strict during this process about
> requiring *no* changes to the example itself, which will make review easier.
>
> On Tue, Apr 19, 2016 at 11:12 AM, Bryan Cutler  wrote:
>
>> +1, adding some organization would make it easier for people to find a
>> specific example
>>
>> On Mon, Apr 18, 2016 at 11:52 PM, Yanbo Liang  wrote:
>>
>>> This sounds good to me, and it will make ML examples more neatly.
>>>
>>> 2016-04-14 5:28 GMT-07:00 Nick Pentreath :
>>>
>>>> Hey Spark devs
>>>>
>>>> I noticed that we now have a large number of examples for ML & MLlib in
>>>> the examples project - 57 for ML and 67 for MLLIB to be precise. This is
>>>> bound to get larger as we add features (though I know there are some PRs to
>>>> clean up duplicated examples).
>>>>
>>>> What do you think about organizing them into packages to match the use
>>>> case and the structure of the code base? e.g.
>>>>
>>>> org.apache.spark.examples.ml.recommendation
>>>>
>>>> org.apache.spark.examples.ml.feature
>>>>
>>>> and so on...
>>>>
>>>> Is it worth doing? The doc pages with include_example would need
>>>> updating, and the run_example script input would just need to change the
>>>> package slightly. Did I miss any potential issue?
>>>>
>>>> N
>>>>
>>>
>>>
>>
>


Re: Is RankingMetrics' NDCG implementation correct?

2016-09-20 Thread Nick Pentreath
(cc'ing dev list also)

I think a more general version of ranking metrics that allows arbitrary
relevance scores could be useful. Ranking metrics are applicable to other
settings like search or other learning-to-rank use cases, so it should be a
little more generic than pure recommender settings.

The one issue with the proposed implementation is that it is not compatible
with the existing cross-validators within a pipeline.

As I've mentioned on the linked JIRAs & PRs, one option is to create a
special set of cross-validators for recommenders, that address the issues
of (a) dataset splitting specific to recommender settings (user-based
stratified sampling, time-based etc) and (b) ranking-based evaluation.

The other option is to have the ALSModel itself capable of generating the
"ground-truth" set within the same dataframe output from "transform" (ie
predict top k) that can be fed into the cross-validator (with
RankingEvaluator) directly. That's the approach I took so far in
https://github.com/apache/spark/pull/12574.

Both options are valid and have their positives & negatives - open to
comments / suggestions.
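
For reference, the graded form of NDCG discussed below - using 2^relevance - 1
as the gain - looks roughly like this (an illustrative snippet only, not what
RankingMetrics currently computes):

// rels: relevance of each recommended item, in predicted rank order
def dcgAt(rels: Seq[Double], k: Int): Double =
  rels.take(k).zipWithIndex.map { case (rel, i) =>
    (math.pow(2.0, rel) - 1.0) / (math.log(i + 2.0) / math.log(2.0))
  }.sum

// allTrueRels: relevance scores of the full ground-truth set for the user
def ndcgAt(predictedRels: Seq[Double], allTrueRels: Seq[Double], k: Int): Double = {
  val idealDcg = dcgAt(allTrueRels.sorted.reverse, k)
  if (idealDcg == 0.0) 0.0 else dcgAt(predictedRels, k) / idealDcg
}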

On Tue, 20 Sep 2016 at 06:08 Jong Wook Kim  wrote:

> Thanks for the clarification and the relevant links. I overlooked the
> comments explicitly saying that the relevance is binary.
>
> I understand that the label is not a relevance, but I have been, and I
> think many people are using the label as relevance in the implicit feedback
> context where the user-provided exact label is not defined anyway. I think
> that's why RiVal 's using the term
> "preference" for both the label for MAE and the relevance for NDCG.
>
> At the same time, I see why Spark decided to assume the relevance is
> binary, in part to conform to the class RankingMetrics's constructor. I
> think it would be nice if the upcoming DataFrame-based RankingEvaluator can
> be optionally set a "relevance column" that has non-binary relevance
> values, otherwise defaulting to either 1.0 or the label column.
>
> My extended version of RankingMetrics is here:
> https://github.com/jongwook/spark-ranking-metrics . It has a test case
> checking that the numbers are same as RiVal's.
>
> Jong Wook
>
>
>
> On 19 September 2016 at 03:13, Sean Owen  wrote:
>
>> Yes, relevance is always 1. The label is not a relevance score so
>> don't think it's valid to use it as such.
>>
>> On Mon, Sep 19, 2016 at 4:42 AM, Jong Wook Kim  wrote:
>> > Hi,
>> >
>> > I'm trying to evaluate a recommendation model, and found that Spark and
>> > Rival give different results, and it seems that Rival's one is what
>> Kaggle
>> > defines:
>> https://gist.github.com/jongwook/5d4e78290eaef22cb69abbf68b52e597
>> >
>> > Am I using RankingMetrics in a wrong way, or is Spark's implementation
>> > incorrect?
>> >
>> > To my knowledge, NDCG should be dependent on the relevance (or
>> preference)
>> > values, but Spark's implementation seems not; it uses 1.0 where it
>> should be
>> > 2^(relevance) - 1, probably assuming that relevance is all 1.0? I also
>> tried
>> > tweaking, but its method to obtain the ideal DCG also seems wrong.
>> >
>> > Any feedback from MLlib developers would be appreciated. I made a
>> > modified/extended version of RankingMetrics that produces the identical
>> > numbers to Kaggle and Rival's results, and I'm wondering if it is
>> something
>> > appropriate to be added back to MLlib.
>> >
>> > Jong Wook
>>
>
>


Re: Question about using collaborative filtering in MLlib

2016-11-02 Thread Nick Pentreath
I have a PR for it - https://github.com/apache/spark/pull/12574

Sadly I've been tied up and haven't had a chance to work further on it.

The main issue outstanding is deciding on the transform semantics as well
as performance testing.

Any comments / feedback welcome especially on transform semantics.

N


Re: Develop custom Estimator / Transformer for pipeline

2016-11-17 Thread Nick Pentreath
@Holden look forward to the blog post - I think a user guide PR based on it
would also be super useful :)
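
In the meantime, for anyone after the minimal skeleton, the Tokenizer pattern
Joseph points to below boils down to something like this - an illustrative,
untested sketch that lives entirely in your own package:

import org.apache.spark.ml.UnaryTransformer
import org.apache.spark.ml.param.ParamMap
import org.apache.spark.ml.util.Identifiable
import org.apache.spark.sql.types.{DataType, StringType}

class UpperCaser(override val uid: String)
    extends UnaryTransformer[String, String, UpperCaser] {

  def this() = this(Identifiable.randomUID("upperCaser"))

  // The element-wise function applied to the input column.
  override protected def createTransformFunc: String => String = _.toUpperCase

  override protected def validateInputType(inputType: DataType): Unit =
    require(inputType == StringType, s"Input type must be StringType but got $inputType")

  override protected def outputDataType: DataType = StringType

  override def copy(extra: ParamMap): UpperCaser = defaultCopy(extra)
}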

On Fri, 18 Nov 2016 at 05:29 Holden Karau  wrote:

> I've been working on a blog post around this and hope to have it published
> early next month 😀
>
> On Nov 17, 2016 10:16 PM, "Joseph Bradley"  wrote:
>
> Hi Georg,
>
> It's true we need better documentation for this.  I'd recommend checking
> out simple algorithms within Spark for examples:
> ml.feature.Tokenizer
> ml.regression.IsotonicRegression
>
> You should not need to put your library in Spark's namespace.  The shared
> Params in SPARK-7146 are not necessary to create a custom algorithm; they
> are just niceties.
>
> Though there aren't great docs yet, you should be able to follow existing
> examples.  And I'd like to add more docs in the future!
>
> Good luck,
> Joseph
>
> On Wed, Nov 16, 2016 at 6:29 AM, Georg Heiler 
> wrote:
>
> HI,
>
> I want to develop a library with custom Estimator / Transformers for
> spark. So far not a lot of documentation could be found but
> http://stackoverflow.com/questions/37270446/how-to-roll-a-custom-estimator-in-pyspark-mllib
>
>
> Suggest that:
> Generally speaking, there is no documentation because as for Spark 1.6 /
> 2.0 most of the related API is not intended to be public. It should change
> in Spark 2.1.0 (see SPARK-7146
> ).
>
> Where can I already find documentation today?
> Is it true that my library would require residing in Sparks`s namespace
> similar to https://github.com/collectivemedia/spark-ext to utilize all
> the handy functionality?
>
> Kind Regards,
> Georg
>
>
>
>


Re: Why don't we imp some adaptive learning rate methods, such as adadelat, adam?

2016-11-30 Thread Nick Pentreath
check out https://github.com/VinceShieh/Spark-AdaOptimizer

On Wed, 30 Nov 2016 at 10:52 WangJianfei 
wrote:

> Hi devs:
> Normally, adaptive learning rate methods can have faster
> convergence
> than standard SGD, so why don't we implement them?
> see the link for more details
> http://sebastianruder.com/optimizing-gradient-descent/index.html#adadelta
>
>
>
> --
> View this message in context:
> http://apache-spark-developers-list.1001551.n3.nabble.com/Why-don-t-we-imp-some-adaptive-learning-rate-methods-such-as-adadelat-adam-tp20057.html
> Sent from the Apache Spark Developers List mailing list archive at
> Nabble.com.
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: unhelpful exception thrown on predict() when ALS trained model doesn't contain user or product?

2016-12-06 Thread Nick Pentreath
Indeed, it's being tracked here:
https://issues.apache.org/jira/browse/SPARK-18230 though no PR has been
opened yet.
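
In the meantime, a defensive lookup on the caller's side at least avoids the
opaque "next on empty iterator" error - a rough sketch (illustrative only, not
the fix tracked in that JIRA):

import org.apache.spark.mllib.recommendation.MatrixFactorizationModel

def safePredict(model: MatrixFactorizationModel, user: Int, product: Int): Option[Double] = {
  // lookup returns an empty Seq (rather than throwing) when the id is unknown
  val userVec = model.userFeatures.lookup(user).headOption
  val productVec = model.productFeatures.lookup(product).headOption
  for (u <- userVec; p <- productVec) yield u.zip(p).map { case (a, b) => a * b }.sum
}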

On Tue, 6 Dec 2016 at 13:36 chris snow  wrote:

> I'm using the MatrixFactorizationModel.predict() method and encountered
> the following exception:
>
> Name: java.util.NoSuchElementException
> Message: next on empty iterator
> StackTrace: scala.collection.Iterator$$anon$2.next(Iterator.scala:39)
> scala.collection.Iterator$$anon$2.next(Iterator.scala:37)
> scala.collection.IndexedSeqLike$Elements.next(IndexedSeqLike.scala:64)
> scala.collection.IterableLike$class.head(IterableLike.scala:91)
>
> scala.collection.mutable.ArrayBuffer.scala$collection$IndexedSeqOptimized$$super$head(ArrayBuffer.scala:47)
>
> scala.collection.IndexedSeqOptimized$class.head(IndexedSeqOptimized.scala:120)
> scala.collection.mutable.ArrayBuffer.head(ArrayBuffer.scala:47)
>
> org.apache.spark.mllib.recommendation.MatrixFactorizationModel.predict(MatrixFactorizationModel.scala:81)
>
> $line78.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:74)
>
> $line78.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:79)
>
> $line78.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:81)
>
> $line78.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:83)
>
> $line78.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:85)
>
> $line78.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:87)
> $line78.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:89)
> $line78.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:91)
> $line78.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:93)
> $line78.$read$$iwC$$iwC$$iwC$$iwC$$iwC.(:95)
> $line78.$read$$iwC$$iwC$$iwC$$iwC.(:97)
> $line78.$read$$iwC$$iwC$$iwC.(:99)
> $line78.$read$$iwC$$iwC.(:101)
> $line78.$read$$iwC.(:103)
> $line78.$read.(:105)
> $line78.$read$.(:109)
> $line78.$read$.()
> $line78.$eval$.(:7)
> $line78.$eval$.()
> $line78.$eval.$print()
> sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:95)
>
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:55)
> java.lang.reflect.Method.invoke(Method.java:507)
> org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:1065)
> org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1346)
> org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:840)
> org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:871)
> org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:819)
>
> com.ibm.spark.interpreter.ScalaInterpreter$$anonfun$interpretAddTask$1$$anonfun$apply$3.apply(ScalaInterpreter.scala:296)
>
> com.ibm.spark.interpreter.ScalaInterpreter$$anonfun$interpretAddTask$1$$anonfun$apply$3.apply(ScalaInterpreter.scala:291)
> com.ibm.spark.global.StreamState$.withStreams(StreamState.scala:80)
>
> com.ibm.spark.interpreter.ScalaInterpreter$$anonfun$interpretAddTask$1.apply(ScalaInterpreter.scala:290)
>
> com.ibm.spark.interpreter.ScalaInterpreter$$anonfun$interpretAddTask$1.apply(ScalaInterpreter.scala:290)
>
> com.ibm.spark.utils.TaskManager$$anonfun$add$2$$anon$1.run(TaskManager.scala:123)
>
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1153)
>
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
> java.lang.Thread.run(Thread.java:785)
>
> This took some debugging to figure out why I received the Exception, but
> when looking at the predict() implementation, it seems to assume that there
> will always be features found for the provided user and product ids:
>
>
>   /** Predict the rating of one user for one product. */
>   @Since("0.8.0")
>   def predict(user: Int, product: Int): Double = {
> val userVector = userFeatures.lookup(user).head
> val productVector = productFeatures.lookup(product).head
> blas.ddot(rank, userVector, 1, productVector, 1)
>   }
>
> It would be helpful if a more useful exception was raised, e.g.
>
> MissingUserFeatureException : "User ID ${user} not found in model"
> MissingProductFeatureException : "Product ID ${product} not found in model"
>
> WDYT?
>
>
>
>


Re: 2.1.0-rc2 cut; committers please set fix version for branch-2.1 to 2.1.1 instead

2016-12-07 Thread Nick Pentreath
I went ahead and re-marked all the existing 2.1.1 fix version JIRAs (that
had gone into branch-2.1 since RC1 but before RC2) for Spark ML to 2.1.0

On Thu, 8 Dec 2016 at 09:20 Reynold Xin  wrote:

> Thanks.
>


Re: Issue in using DenseVector in RowMatrix, error could be due to ml and mllib package changes

2016-12-08 Thread Nick Pentreath
Yes, most likely it's because HashingTF returns ml vectors while you need
mllib vectors for RowMatrix.

I'd recommend using the vector conversion utils (I think in
mllib.linalg.Vectors, but I'm on mobile right now so can't recall exactly).
There are util methods for converting single vectors as well as vector
rows of a DataFrame. Check the MLlib user guide for 2.0 for details.
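
From memory it is roughly the following (untested; assumes the "features"
column produced by the ML pipeline in your snippet below):

import org.apache.spark.ml.linalg.{Vector => MLVector}
import org.apache.spark.mllib.linalg.{Vectors => OldVectors}
import org.apache.spark.mllib.linalg.distributed.RowMatrix

// Convert each ml.linalg.Vector row to an mllib.linalg.Vector for RowMatrix.
val rows = rescaledData.select("features").rdd
  .map(r => OldVectors.fromML(r.getAs[MLVector]("features")))
val mat = new RowMatrix(rows)

// There is also a DataFrame-level helper along the lines of
// MLUtils.convertVectorColumnsFromML(rescaledData, "features") in
// org.apache.spark.mllib.util.MLUtils.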
On Fri, 9 Dec 2016 at 04:42, satyajit vegesna 
wrote:

> Hi All,
>
> PFB code.
>
>
> import org.apache.spark.ml.feature.{HashingTF, IDF}
> import org.apache.spark.ml.linalg.SparseVector
> import org.apache.spark.mllib.linalg.distributed.RowMatrix
> import org.apache.spark.sql.SparkSession
> import org.apache.spark.{SparkConf, SparkContext}
>
> /**
>   * Created by satyajit on 12/7/16.
>   */
> object DIMSUMusingtf extends App {
>
>   val conf = new SparkConf()
> .setMaster("local[1]")
> .setAppName("testColsim")
>   val sc = new SparkContext(conf)
>   val spark = SparkSession
> .builder
> .appName("testColSim").getOrCreate()
>
>   import org.apache.spark.ml.feature.Tokenizer
>
>   val sentenceData = spark.createDataFrame(Seq(
> (0, "Hi I heard about Spark"),
> (0, "I wish Java could use case classes"),
> (1, "Logistic regression models are neat")
>   )).toDF("label", "sentence")
>
>   val tokenizer = new 
> Tokenizer().setInputCol("sentence").setOutputCol("words")
>
>   val wordsData = tokenizer.transform(sentenceData)
>
>
>   val hashingTF = new HashingTF()
> .setInputCol("words").setOutputCol("rawFeatures").setNumFeatures(20)
>
>   val featurizedData = hashingTF.transform(wordsData)
>
>
>   val idf = new IDF().setInputCol("rawFeatures").setOutputCol("features")
>   val idfModel = idf.fit(featurizedData)
>   val rescaledData = idfModel.transform(featurizedData)
>   rescaledData.show()
>   rescaledData.select("features", "label").take(3).foreach(println)
>   val check = rescaledData.select("features")
>
>   val row = check.rdd.map(row => row.getAs[SparseVector]("features"))
>
>   val mat = new RowMatrix(row) //i am basically trying to use Dense.vector as 
> a direct input to
>
> rowMatrix, but i get an error that RowMatrix Cannot resolve constructor
>
>   row.foreach(println)
> }
>
> Any help would be appreciated.
>
> Regards,
> Satyajit.
>
>
>
>


Re: Why are ml models repartition(1)'d in save methods?

2017-01-13 Thread Nick Pentreath
Yup - it's because almost all model data in Spark ML (model coefficients)
is "small" - i.e. non-distributed.

If you look at ALS you'll see there is no repartitioning, since the factor
DataFrames can be large.
On Fri, 13 Jan 2017 at 19:42, Sean Owen  wrote:

> You're referring to code that serializes models, which are quite small.
> For example a PCA model consists of a few principal component vector. It's
> a Dataset of just one element being saved here. It's re-using the code path
> normally used to save big data sets, to output 1 file with 1 thing as
> Parquet.
>
> On Fri, Jan 13, 2017 at 5:29 PM Asher Krim  wrote:
>
> But why is that beneficial? The data is supposedly quite large,
> distributing it across many partitions/files would seem to make sense.
>
> On Fri, Jan 13, 2017 at 12:25 PM, Sean Owen  wrote:
>
> That is usually so the result comes out in one file, not partitioned over
> n files.
>
> On Fri, Jan 13, 2017 at 5:23 PM Asher Krim  wrote:
>
> Hi,
>
> I'm curious why it's common for data to be repartitioned to 1 partition
> when saving ml models:
>
> sqlContext.createDataFrame(Seq(data)).repartition(1
> ).write.parquet(dataPath)
>
> This shows up in most ml models I've seen (Word2Vec
> ,
> PCA
> ,
> LDA
> ).
> Am I missing some benefit of repartitioning like this?
>
> Thanks,
> --
> Asher Krim
> Senior Software Engineer
>
>
>
>
> --
> Asher Krim
> Senior Software Engineer
>
>


Re: [SQL][ML] Pipeline performance regression between 1.6 and 2.x

2017-02-01 Thread Nick Pentreath
Hi Maciej

If you're seeing a regression from 1.6 -> 2.0 *both using DataFrames *then
that seems to point to some other underlying issue as the root cause.

Even though adding checkpointing should help, we should understand why it's
different between 1.6 and 2.0?


On Thu, 2 Feb 2017 at 08:22 Liang-Chi Hsieh  wrote:

>
> Hi Maciej,
>
> FYI, the PR is at https://github.com/apache/spark/pull/16775.
>
>
> Liang-Chi Hsieh wrote
> > Hi Maciej,
> >
> > Basically the fitting algorithm in Pipeline is an iterative operation.
> > Running iterative algorithm on Dataset would have RDD lineages and query
> > plans that grow fast. Without cache and checkpoint, it gets slower when
> > the iteration number increases.
> >
> > I think it is why when you run a Pipeline with long stages, it gets much
> > longer time to finish. As I think it is not uncommon to have long stages
> > in a Pipeline, we should improve this. I will submit a PR for this.
> > zero323 wrote
> >> Hi everyone,
> >>
> >> While experimenting with ML pipelines I experience a significant
> >> performance regression when switching from 1.6.x to 2.x.
> >>
> >> import org.apache.spark.ml.{Pipeline, PipelineStage}
> >> import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer,
> >> VectorAssembler}
> >>
> >> val df = (1 to 40).foldLeft(Seq((1, "foo"), (2, "bar"), (3,
> >> "baz")).toDF("id", "x0"))((df, i) => df.withColumn(s"x$i", $"x0"))
> >> val indexers = df.columns.tail.map(c => new StringIndexer()
> >>   .setInputCol(c)
> >>   .setOutputCol(s"${c}_indexed")
> >>   .setHandleInvalid("skip"))
> >>
> >> val encoders = indexers.map(indexer => new OneHotEncoder()
> >>   .setInputCol(indexer.getOutputCol)
> >>   .setOutputCol(s"${indexer.getOutputCol}_encoded")
> >>   .setDropLast(true))
> >>
> >> val assembler = new
> >> VectorAssembler().setInputCols(encoders.map(_.getOutputCol))
> >> val stages: Array[PipelineStage] = indexers ++ encoders :+ assembler
> >>
> >> new Pipeline().setStages(stages).fit(df).transform(df).show
> >>
> >> Task execution time is comparable and executors are most of the time
> >> idle so it looks like it is a problem with the optimizer. Is it a known
> >> issue? Are there any changes I've missed, that could lead to this
> >> behavior?
> >>
> >> --
> >> Best,
> >> Maciej
> >>
> >>
> >> -
> >> To unsubscribe e-mail:
>
> >> dev-unsubscribe@.apache
>
>
>
>
>
> -
> Liang-Chi Hsieh | @viirya
> Spark Technology Center
> http://www.spark.tc/
> --
> View this message in context:
> http://apache-spark-developers-list.1001551.n3.nabble.com/SQL-ML-Pipeline-performance-regression-between-1-6-and-2-x-tp20803p20822.html
> Sent from the Apache Spark Developers List mailing list archive at
> Nabble.com.
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: Google Summer of Code 2017 is coming

2017-02-05 Thread Nick Pentreath
I think Sean raises valid points - that the result is highly dependent on
the particular student, project and mentor involved, and that the actual
required time investment is very significant.

Having said that, it's not all bad certainly. Scikit-learn started as a
GSoC project 10 years ago!

Actually they have a pretty good model for accepting students - typically
the student must demonstrate significant prior knowledge and ability with
the project sufficient to complete the work.

The challenge I think Spark has is already folks are strapped for capacity
so finding mentors with time will be tricky. But if there are mentors and
the right project / student fit can be found, I think it's a good idea.


On Sat, 4 Feb 2017 at 01:22 Jacek Laskowski  wrote:

> Thanks Sean. You've again been very helpful to put the right tone to
> the matters. I stand corrected and have no interest in GSoC anymore.
>
> Pozdrawiam,
> Jacek Laskowski
> 
> https://medium.com/@jaceklaskowski/
> Mastering Apache Spark 2.0 https://bit.ly/mastering-apache-spark
> Follow me at https://twitter.com/jaceklaskowski
>
>
> On Fri, Feb 3, 2017 at 11:38 PM, Sean Owen  wrote:
> > I have a contrarian opinion on GSoC from experience many years ago in
> > Mahout. Of 3 students I interacted with, 2 didn't come close to
> completing
> > the work they signed up for. I think it's mostly that students are hungry
> > for the resumé line item, and don't understand the amount of work they're
> > proposing, and ultimately have little incentive to complete their
> proposal.
> > The stipend is small.
> >
> > I can appreciate the goal of GSoC but it makes more sense for projects
> that
> > don't get as much attention, and Spark gets plenty. I would not expect
> > students to be able to be net contributors to a project like Spark. The
> time
> > they consume in hand-holding will exceed the time it would take for
> someone
> > experienced to just do the work. I would caution anyone from agreeing to
> > this for Spark unless they are willing to devote 5-10 hours per week for
> the
> > summer to helping someone learn.
> >
> > My net experience with GSoC is negative, mostly on account of the
> > participants.
> >
> > On Fri, Feb 3, 2017 at 9:56 PM Holden Karau 
> wrote:
> >>
> >> As someone who did GSoC back in University I think this could be a good
> >> idea if there is enough interest from the PMC & I'd be willing the help
> >> mentor if that is a bottleneck.
> >>
> >> On Fri, Feb 3, 2017 at 12:42 PM, Jacek Laskowski 
> wrote:
> >>>
> >>> Hi,
> >>>
> >>> Is this something Spark considering? Would be nice to mark issues as
> >>> GSoC in JIRA and solicit feedback. What do you think?
> >>>
> >>> Pozdrawiam,
> >>> Jacek Laskowski
> >>> 
> >>> https://medium.com/@jaceklaskowski/
> >>> Mastering Apache Spark 2.0 https://bit.ly/mastering-apache-spark
> >>> Follow me at https://twitter.com/jaceklaskowski
> >>>
> >>>
> >>>
> >>> -- Forwarded message --
> >>> From: Ulrich Stärk 
> >>> Date: Fri, Feb 3, 2017 at 8:50 PM
> >>> Subject: Google Summer of Code 2017 is coming
> >>> To: ment...@community.apache.org
> >>>
> >>>
> >>> Hello PMCs (incubator Mentors, please forward this email to your
> >>> podlings),
> >>>
> >>> Google Summer of Code [1] is a program sponsored by Google allowing
> >>> students to spend their summer
> >>> working on open source software. Students will receive stipends for
> >>> developing open source software
> >>> full-time for three months. Projects will provide mentoring and
> >>> project ideas, and in return have
> >>> the chance to get new code developed and - most importantly - to
> >>> identify and bring in new committers.
> >>>
> >>> The ASF will apply as a participating organization meaning individual
> >>> projects don't have to apply
> >>> separately.
> >>>
> >>> If you want to participate with your project we ask you to do the
> >>> following things as soon as
> >>> possible but by no later than 2017-02-09:
> >>>
> >>> 1. understand what it means to be a mentor [2].
> >>>
> >>> 2. record your project ideas.
> >>>
> >>> Just create issues in JIRA, label them with gsoc2017, and they will
> >>> show up at [3]. Please be as
> >>> specific as possible when describing your idea. Include the
> >>> programming language, the tools and
> >>> skills required, but try not to scare potential students away. They
> >>> are supposed to learn what's
> >>> required before the program starts.
> >>>
> >>> Use labels, e.g. for the programming language (java, c, c++, erlang,
> >>> python, brainfuck, ...) or
> >>> technology area (cloud, xml, web, foo, bar, ...) and record them at
> [5].
> >>>
> >>> Please use the COMDEV JIRA project for recording your ideas if your
> >>> project doesn't use JIRA (e.g.
> >>> httpd, ooo). Contact d...@community.apache.org if you need assistance.
> >>>
> >>> [4] contains some additional information (will be updated for 2017
> >>> shortly).
> >>>
> >>> 3. subscribe to ment...@community.apache.org; restricted

Re: [Spark Namespace]: Expanding Spark ML under Different Namespace?

2017-02-23 Thread Nick Pentreath
Currently your only option is to write (or copy) your own implementations.

Logging is definitely intended to be internal use only, and it's best to
use your own logging lib - Typesafe scalalogging is a common option that
I've used.
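
With scala-logging, for example, it is just a mixin - a quick sketch (the
artifact is roughly "com.typesafe.scala-logging" %% "scala-logging"; pick
whatever current version suits your build):

import com.typesafe.scalalogging.LazyLogging

class MyPipelineHelper extends LazyLogging {
  // logger is provided by the LazyLogging mixin (backed by SLF4J)
  def doWork(): Unit = logger.info("running from our own namespace")
}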

As for the VectorUDT, for now that is private. There are no plans to open
it up as yet. It should not be too difficult to have your own UDT
implementation. What type of extensions are you trying to do with the UDT?

Likewise the shared params are for now private. It is a bit annoying to
have to re-create them, but most of them are pretty simple so it's not a
huge overhead.

Perhaps you can add your thoughts & comments to
https://issues.apache.org/jira/browse/SPARK-19498 in terms of extending
Spark ML. Ultimately I support making it easier to extend. But we do have
to balance that with exposing new public APIs and classes that impose
backward compat guarantees.

Perhaps now is a good time to think about some of the common shared params
for example.

Thanks
Nick


On Wed, 22 Feb 2017 at 22:51 Shouheng Yi 
wrote:

Hi Spark developers,



Currently my team at Microsoft is extending Spark’s machine learning
functionalities to include new learners and transformers. We would like
users to use these within spark pipelines so that they can mix and match
with existing Spark learners/transformers, and overall have a native spark
experience. We cannot accomplish this using a non-“org.apache” namespace
with the current implementation, and we don’t want to release code inside
the apache namespace because it’s confusing and there could be naming
rights issues.



We need to extend several classes from spark which happen to have
“private[spark].” For example, one of our class extends VectorUDT[0] which
has private[spark] class VectorUDT as its access modifier. This
unfortunately put us in a strange scenario that forces us to work under the
namespace org.apache.spark.



To be specific, currently the private classes/traits we need to use to
create new Spark learners & Transformers are HasInputCol, VectorUDT and
Logging. We will expand this list as we develop more.



Is there a way to avoid this namespace issue? What do other
people/companies do in this scenario? Thank you for your help!



[0]:
https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/linalg/VectorUDT.scala



Best,

Shouheng


Re: Implementation of RNN/LSTM in Spark

2017-02-23 Thread Nick Pentreath
The short answer is that there is none, and it's highly unlikely to be inside
Spark MLlib any time in the near future.

The best bets are to look at other DL libraries - for JVM there is
Deeplearning4J and BigDL (there are others but these seem to be the most
comprehensive I have come across) - that run on Spark. Also there are
various flavours of TensorFlow / Caffe on Spark. And of course the libs
such as Torch, Keras, Tensorflow, MXNet, Caffe etc. Some of them have Java
or Scala APIs and some form of Spark integration out there in the community
(in varying states of development).

Integrations with Spark are a bit patchy currently but include the
"XOnSpark" flavours mentioned above and TensorFrames (again, there may be
others).

On Thu, 23 Feb 2017 at 14:23 n1kt0  wrote:

> Hi,
> can anyone tell me what the current status about RNNs in Spark is?
>
>
>
> --
> View this message in context:
> http://apache-spark-developers-list.1001551.n3.nabble.com/Implementation-of-RNN-LSTM-in-Spark-tp14866p21060.html
> Sent from the Apache Spark Developers List mailing list archive at
> Nabble.com.
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: Feedback on MLlib roadmap process proposal

2017-02-23 Thread Nick Pentreath
Sorry for being late to the discussion. I think Joseph, Sean and others
have covered the issues well.

Overall I like the proposed cleaned up roadmap & process (thanks Joseph!).
As for the actual critical roadmap items mentioned on SPARK-18813, I think
it makes sense and will comment a bit further on that JIRA.

I would like to encourage votes & watching for issues to give a sense of
what the community wants (I guess Vote is more explicit yet passive, while
actually Watching an issue is more informative as it may indicate a real
use case dependent on the issue?!).

I think if used well this is valuable information for contributors. Of
course not everything on that list can get done. But if I look through the
top votes or watch list, while not all of those are likely to go in, a
great many of the issues are fairly non-contentious in terms of being good
additions to the project.

Things like these are good examples IMO (I just sample a few of them, not
exhaustive):
- sample weights for RF / DT
- multi-model and/or parallel model selection
- make sharedParams public?
- multi-column support for various transformers
- incremental model training
- tree algorithm enhancements

Now, whether these can be prioritised in terms of bandwidth available to
reviewers and committers is a totally different thing. But as Sean mentions
there is some process there for trying to find the balance of the issue
being a "good thing to add", a shepherd with bandwidth & interest in the
issue to review, and the maintenance burden imposed.

Let's take Deep Learning / NN for example. Here's a good example of
something that has a lot of votes/watchers and as Sean mentions it is
something that "everyone wants someone else to implement". In this case,
much of the interest may in fact be "stale" - 2 years ago it would have
been very interesting to have a strong DL impl in Spark. Now, because there
are a plethora of very good DL libraries out there, how many of those Votes
would be "deleted"? Granted few are well integrated with Spark but that can
and is changing (DL4J, BigDL, the "XonSpark" flavours etc).

So this is something that I dare say will not be in Spark any time in the
foreseeable future or perhaps ever given the current status. Perhaps it's
worth seriously thinking about just closing these kind of issues?



On Fri, 27 Jan 2017 at 05:53 Joseph Bradley  wrote:

> Sean has given a great explanation.  A few more comments:
>
> Roadmap: I have been creating roadmap JIRAs, but the goal really is to
> have all committers working on MLlib help to set that roadmap, based on
> either their knowledge of current maintenance/internal needs of the project
> or the feedback given from the rest of the community.
> @Committers - I see people actively shepherding PRs for MLlib, but I don't
> see many major initiatives linked to the roadmap.  If there are ones large
> enough to merit adding to the roadmap, please do.
>
> In general, there are many process improvements we could make.  A few in
> my mind are:
> * Visibility: Let the community know what committers are focusing on.
> This was the primary purpose of the "MLlib roadmap proposal."
> * Community initiatives: This is currently very organic.  Some of the
> organic process could be improved, such as encouraging Votes/Watchers
> (though I agree with Sean about these being one-sided metrics).  Cody's SIP
> work is a great step towards adding more clarity and structure for major
> initiatives.
> * JIRA hygiene: Always a challenge, and always requires some manual
> prodding.  But it's great to push for efforts on this.
>
>
> On Wed, Jan 25, 2017 at 3:59 AM, Sean Owen  wrote:
>
> On Wed, Jan 25, 2017 at 6:01 AM Ilya Matiach  wrote:
>
> My confusion was that the ML 2.2 roadmap critical features (
> https://issues.apache.org/jira/browse/SPARK-18813) did not line up with
> the top ML/MLLIB JIRAs by Votes
> or
> Watchers
> 
> .
>
> Your explanation that they do not have 

Re: Feedback on MLlib roadmap process proposal

2017-02-24 Thread Nick Pentreath
FYI I've started going through a few of the top Watched JIRAs and tried to
identify those that are obviously stale and can probably be closed, to try
to clean things up a bit.

On Thu, 23 Feb 2017 at 21:38 Tim Hunter  wrote:

> As Sean wrote very nicely above, the changes made to Spark are decided in
> an organic fashion based on the interests and motivations of the committers
> and contributors. The case of deep learning is a good example. There is a
> lot of interest, and the core algorithms could be implemented without too
> much problem in a few thousands of lines of scala code. However, the
> performance of such a simple implementation would be one to two order of
> magnitude slower than what would get from the popular frameworks out there.
>
> At this point, there are probably more man-hours invested in TensorFlow
> (as an example) than in MLlib, so I think we need to be realistic about
> what we can expect to achieve inside Spark. Unlike BLAS for linear algebra,
> there is no agreed-up interface for deep learning, and each of the XOnSpark
> flavors explores a slightly different design. It will be interesting to see
> what works well in practice. In the meantime, though, there are plenty of
> things that we could do to help developers of other libraries to have a
> great experience with Spark. Matei alluded to that in his Spark Summit
> keynote when he mentioned better integration with low-level libraries.
>
> Tim
>
>
> On Thu, Feb 23, 2017 at 5:32 AM, Nick Pentreath 
> wrote:
>
> Sorry for being late to the discussion. I think Joseph, Sean and others
> have covered the issues well.
>
> Overall I like the proposed cleaned up roadmap & process (thanks Joseph!).
> As for the actual critical roadmap items mentioned on SPARK-18813, I think
> it makes sense and will comment a bit further on that JIRA.
>
> I would like to encourage votes & watching for issues to give a sense of
> what the community wants (I guess Vote is more explicit yet passive, while
> actually Watching an issue is more informative as it may indicate a real
> use case dependent on the issue?!).
>
> I think if used well this is valuable information for contributors. Of
> course not everything on that list can get done. But if I look through the
> top votes or watch list, while not all of those are likely to go in, a
> great many of the issues are fairly non-contentious in terms of being good
> additions to the project.
>
> Things like these are good examples IMO (I just sample a few of them, not
> exhaustive):
> - sample weights for RF / DT
> - multi-model and/or parallel model selection
> - make sharedParams public?
> - multi-column support for various transformers
> - incremental model training
> - tree algorithm enhancements
>
> Now, whether these can be prioritised in terms of bandwidth available to
> reviewers and committers is a totally different thing. But as Sean mentions
> there is some process there for trying to find the balance of the issue
> being a "good thing to add", a shepherd with bandwidth & interest in the
> issue to review, and the maintenance burden imposed.
>
> Let's take Deep Learning / NN for example. Here's a good example of
> something that has a lot of votes/watchers and as Sean mentions it is
> something that "everyone wants someone else to implement". In this case,
> much of the interest may in fact be "stale" - 2 years ago it would have
> been very interesting to have a strong DL impl in Spark. Now, because there
> are a plethora of very good DL libraries out there, how many of those Votes
> would be "deleted"? Granted few are well integrated with Spark but that can
> and is changing (DL4J, BigDL, the "XonSpark" flavours etc).
>
> So this is something that I dare say will not be in Spark any time in the
> foreseeable future or perhaps ever given the current status. Perhaps it's
> worth seriously thinking about just closing these kind of issues?
>
>
>
> On Fri, 27 Jan 2017 at 05:53 Joseph Bradley  wrote:
>
> Sean has given a great explanation.  A few more comments:
>
> Roadmap: I have been creating roadmap JIRAs, but the goal really is to
> have all committers working on MLlib help to set that roadmap, based on
> either their knowledge of current maintenance/internal needs of the project
> or the feedback given from the rest of the community.
> @Committers - I see people actively shepherding PRs for MLlib, but I don't
> see many major initiatives linked to the roadmap.  If there are ones large
> enough to merit adding to the roadmap, please do.
>
> In general, there are many process improvements we could make.  A few in
> my mind are:
> * Visibility: Let the com

Re: [Spark Namespace]: Expanding Spark ML under Different Namespace?

2017-03-04 Thread Nick Pentreath
Also, note https://issues.apache.org/jira/browse/SPARK-7146 is linked from
SPARK-19498 specifically to discuss opening up sharedParams traits.


On Fri, 3 Mar 2017 at 23:17 Shouheng Yi 
wrote:

> Hi Spark dev list,
>
>
>
> Thank you guys so much for all your inputs. We really appreciated those
> suggestions. After some discussions in the team, we decided to stay under
> apache’s namespace for now, and attach some comments to explain what we did
> and why we did this.
>
>
>
> As the Spark dev list kindly pointed out, this is an existing issue that
> was documented in the JIRA ticket [Spark-19498] [0]. We can follow the JIRA
> ticket to see if there are any new suggested practices that should be
> adopted in the future and make corresponding fixes.
>
>
>
> Best,
>
> Shouheng
>
>
>
> [0] https://issues.apache.org/jira/browse/SPARK-19498
>
>
>
> *From:* Tim Hunter [mailto:timhun...@databricks.com
> ]
> *Sent:* Friday, February 24, 2017 9:08 AM
> *To:* Joseph Bradley 
> *Cc:* Steve Loughran ; Shouheng Yi <
> sho...@microsoft.com.invalid>; Apache Spark Dev ;
> Markus Weimer ; Rogan Carr ;
> Pei Jiang ; Miruna Oprescu 
> *Subject:* Re: [Spark Namespace]: Expanding Spark ML under Different
> Namespace?
>
>
>
> Regarding logging, Graphframes makes a simple wrapper this way:
>
>
>
>
> https://github.com/graphframes/graphframes/blob/master/src/main/scala/org/graphframes/Logging.scala
> 
>
>
>
> Regarding the UDTs, they have been hidden to be reworked for Datasets, the
> reasons being detailed here [1]. Can you describe your use case in more
> details? You may be better off copy/pasting the UDT code outside of Spark,
> depending on your use case.
>
>
>
> [1] https://issues.apache.org/jira/browse/SPARK-14155
> 
>
>
>
> On Thu, Feb 23, 2017 at 3:42 PM, Joseph Bradley 
> wrote:
>
> +1 for Nick's comment about discussing APIs which need to be made public
> in https://issues.apache.org/jira/browse/SPARK-19498
> 
> !
>
>
>
> On Thu, Feb 23, 2017 at 2:36 AM, Steve Loughran 
> wrote:
>
>
>
> On 22 Feb 2017, at 20:51, Shouheng Yi 
> wrote:
>
>
>
> Hi Spark developers,
>
>
>
> Currently my team at Microsoft is extending Spark’s machine learning
> functionalities to include new learners and transformers. We would like
> users to use these within spark pipelines so that they can mix and match
> with existing Spark learners/transformers, and overall have a native spark
> experience. We cannot accomplish this using a non-“org.apache” namespace
> with the current implementation, and we don’t want to release code inside
> the apache namespace because it’s confusing and there could be naming
> rights issues.
>
>
>
> This isn't actually something the ASF has a strong stance against; it's more left to
> projects themselves. After all: the source is licensed by the ASF, and the
> license doesn't say you can't.
>
>
>
> Indeed, there's a bit of org.apache.hive in the Spark codebase where the
> hive team kept stuff package private. Though that's really a sign that
> things could be improved there.
>
>
>
> Where is problematic is that stack traces end up blaming the wrong group;
> nobody likes getting a bug report which doesn't actually exist in your
> codebase., not least because you have to waste time to even work it out.
>
>
>
> You also have to expect absolutely no stability guarantees, so you'd
> better set your nightly build to work against trunk
>
>
>
> Apache Bahir does put some stuff into org.apache.spark.stream, but they've
> sort of inherited that right when they picked up the code from Spark. New
> stuff is going into org.apache.bahir
>
>
>
>
>
> We need to extend several classes from spark which happen to have
> “private[spark].” For example, one of our class extends VectorUDT[0] which
> has private[spark] class VectorUDT as its access modifier. This
> unfortunately put us in a strange scenario that forces us to work under the
> namespace org.apache.spark.
>
>
>
> To be specific, currently the private classes/traits we need to use to
> create new S

Re: Should we consider a Spark 2.1.1 release?

2017-03-16 Thread Nick Pentreath
Spark 1.5.1 had 87 issues with fix version 1.5.1, released 1 month after 1.5.0.

Spark 1.6.1 had 123 issues 2 months after 1.6.0

2.0.1 was larger (317 issues) at 3 months after 2.0.0 - makes sense due to
how large a release it was.

We are at 185 for 2.1.1 and 3 months after (and not released yet so it
could slip further) - so not totally unusual as the release interval has
certainly increased, but in fairness probably a bit later than usual. I'd
say definitely makes sense to cut the RC!



On Thu, 16 Mar 2017 at 02:06 Michael Armbrust 
wrote:

> Hey Holden,
>
> Thanks for bringing this up!  I think we usually cut patch releases when
> there are enough fixes to justify it.  Sometimes just a few weeks after the
> release.  I guess if we are at 3 months Spark 2.1.0 was a pretty good
> release :)
>
> That said, it is probably time. I was about to start thinking about 2.2 as
> well (we are a little past the posted code-freeze deadline), so I'm happy
> to push the buttons etc (this is a very good description
>  if you are curious). I
> would love help watching JIRA, posting the burn down on issues and
> shepherding in any critical patches.  Feel free to ping me off-line if you
> like to coordinate.
>
> Unless there are any objections, how about we aim for an RC of 2.1.1 on
> Monday and I'll also plan to cut branch-2.2 then?  (I'll send a separate
> email on this as well).
>
> Michael
>
> On Mon, Mar 13, 2017 at 1:40 PM, Holden Karau 
> wrote:
>
> I'd be happy to do the work of coordinating a 2.1.1 release if that's a
> thing a committer can do (I think the release coordinator for the most
> recent Arrow release was a committer and the final publish step took a PMC
> member to upload but other than that I don't remember any issues).
>
> On Mon, Mar 13, 2017 at 1:05 PM Sean Owen  wrote:
>
> It seems reasonable to me, in that other x.y.1 releases have followed ~2
> months after the x.y.0 release and it's been about 3 months since 2.1.0.
>
> Related: creating releases is tough work, so I feel kind of bad voting for
> someone else to do that much work. Would it make sense to deputize another
> release manager to help get out just the maintenance releases? this may in
> turn mean maintenance branches last longer. Experienced hands can continue
> to manage new minor and major releases as they require more coordination.
>
> I know most of the release process is written down; I know it's also still
> going to be work to make it 100% documented. Eventually it'll be necessary
> to make sure it's entirely codified anyway.
>
> Not pushing for it myself, just noting I had heard this brought up in side
> conversations before.
>
>
> On Mon, Mar 13, 2017 at 7:07 PM Holden Karau  wrote:
>
> Hi Spark Devs,
>
> Spark 2.1 has been out since end of December
> 
> and we've got quite a few fixes merged for 2.1.1
> 
> .
>
> On the Python side one of the things I'd like to see us get out into a
> patch release is a packaging fix (now merged) before we upload to PyPI &
> Conda, and we also have the normal batch of fixes like toLocalIterator for
> large DataFrames in PySpark.
>
> I've chatted with Felix & Shivaram who seem to think the R side is looking
> close to in good shape for a 2.1.1 release to submit to CRAN (if I've
> miss-spoken my apologies). The two outstanding issues that are being
> tracked for R are SPARK-18817, SPARK-19237.
>
> Looking at the other components quickly it seems like structured streaming
> could also benefit from a patch release.
>
> What do others think - are there any issues people are actively targeting
> for 2.1.1? Is this too early to be considering a patch release?
>
> Cheers,
>
> Holden
> --
> Cell : 425-233-8271 <(425)%20233-8271>
> Twitter: https://twitter.com/holdenkarau
>
> --
> Cell : 425-233-8271 <(425)%20233-8271>
> Twitter: https://twitter.com/holdenkarau
>
>
>


Re: Outstanding Spark 2.1.1 issues

2017-03-21 Thread Nick Pentreath
As for SPARK-19759 , I
don't think that needs to be targeted for 2.1.1 so we don't need to worry
about it

On Tue, 21 Mar 2017 at 13:49 Holden Karau  wrote:

> I agree with Michael, I think we've got some outstanding issues but none
> of them seem like regression from 2.1 so we should be good to start the RC
> process.
>
> On Tue, Mar 21, 2017 at 1:41 PM, Michael Armbrust 
> wrote:
>
> Please speak up if I'm wrong, but none of these seem like critical
> regressions from 2.1.  As such I'll start the RC process later today.
>
> On Mon, Mar 20, 2017 at 9:52 PM, Holden Karau 
> wrote:
>
> I'm not super sure it should be a blocker for 2.1.1 -- is it a regression?
> Maybe we can get TDs input on it?
>
> On Mon, Mar 20, 2017 at 8:48 PM Nan Zhu  wrote:
>
> I think https://issues.apache.org/jira/browse/SPARK-19280 should be a
> blocker
>
> Best,
>
> Nan
>
> On Mon, Mar 20, 2017 at 8:18 PM, Felix Cheung 
> wrote:
>
> I've been scrubbing R and think we are tracking 2 issues
>
> https://issues.apache.org/jira/browse/SPARK-19237
>
> https://issues.apache.org/jira/browse/SPARK-19925
>
>
>
>
> --
> *From:* holden.ka...@gmail.com  on behalf of
> Holden Karau 
> *Sent:* Monday, March 20, 2017 3:12:35 PM
> *To:* dev@spark.apache.org
> *Subject:* Outstanding Spark 2.1.1 issues
>
> Hi Spark Developers!
>
> As we start working on the Spark 2.1.1 release I've been looking at our
> outstanding issues still targeted for it. I've tried to break it down by
> component so that people in charge of each component can take a quick look
> and see if any of these things can/should be re-targeted to 2.2 or 2.1.2 &
> the overall list is pretty short (only 9 items - 5 if we only look at
> explicitly tagged) :)
>
> If your working on something for Spark 2.1.1 and it doesn't show up in
> this list please speak up now :) We have a lot of issues (including "in
> progress") that are listed as impacting 2.1.0, but they aren't targeted for
> 2.1.1 - if there is something you are working in their which should be
> targeted for 2.1.1 please let us know so it doesn't slip through the cracks.
>
> The query string I used for looking at the 2.1.1 open issues is:
>
> ((affectedVersion = 2.1.1 AND cf[12310320] is Empty) OR fixVersion = 2.1.1
> OR cf[12310320] = "2.1.1") AND project = spark AND resolution = Unresolved
> ORDER BY priority DESC
>
> None of the open issues appear to be a regression from 2.1.0, but those
> seem more likely to show up during the RC process (thanks in advance to
> everyone testing their workloads :)) & generally none of them seem to be
>
> (Note: the cfs are for Target Version/s field)
>
> Critical Issues:
>  SQL:
>   SPARK-19690  - Join
> a streaming DataFrame with a batch DataFrame may not work - PR
> https://github.com/apache/spark/pull/17052 (review in progress by
> zsxwing, currently failing Jenkins)*
>
> Major Issues:
>  SQL:
>   SPARK-19035  - rand()
> function in case when cause failed - no outstanding PR (consensus on JIRA
> seems to be leaning towards it being a real issue but not necessarily
> everyone agrees just yet - maybe we should slip this?)*
>  Deploy:
>   SPARK-19522  - 
> --executor-memory
> flag doesn't work in local-cluster mode -
> https://github.com/apache/spark/pull/16975 (review in progress by vanzin,
> but PR currently stalled waiting on response) *
>  Core:
>   SPARK-20025  - Driver
> fail over will not work, if SPARK_LOCAL* env is set. -
> https://github.com/apache/spark/pull/17357 (waiting on review) *
>  PySpark:
>  SPARK-19955  - Update
> run-tests to support conda [ Part of Dropping 2.6 support -- which we
> shouldn't do in a minor release -- but also fixes pip installability tests
> to run in Jenkins ]-  PR failing Jenkins (I need to poke this some more,
> but seems like 2.7 support works but some other issues. Maybe slip to 2.2?)
>
> Minor issues:
>  Tests:
>   SPARK-19612  - Tests
> failing with timeout - No PR per-se but it seems unrelated to the 2.1.1
> release. It's not targetted for 2.1.1 but listed as affecting 2.1.1 - I'd
> consider explicitly targeting this for 2.2?
>  PySpark:
>   SPARK-19570  - Allow
> to disable hive in pyspark shell -
> https://github.com/apache/spark/pull/16906 PR exists but its difficult to
> add automated tests for this (although if SPARK-19955
>  gets in would make
> testing this easier) - no reviewers yet. Possible re-target?*
>  Structured Streaming:
>   SPARK-19613  - Flaky
> test: StateStoreRDDSuite.versioning and immutability - I

Re: [VOTE] Apache Spark 2.2.0 (RC1)

2017-05-02 Thread Nick Pentreath
I won't +1 just given that it seems certain there will be another RC and
there are the outstanding ML QA blocker issues.

But clean build and test for JVM and Python tests LGTM on CentOS Linux
7.2.1511, OpenJDK 1.8.0_111
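
On the Avro / Parquet classpath issue discussed further down this thread: for
anyone who wants to try the shading route Ryan mentions (his option 2), the
rough shape of it with sbt-assembly is the untested sketch below - the package
pattern and shaded prefix are only illustrative, and as Ryan notes it only
helps if you don't need to share Avro classes between your code and Spark:

// build.sbt of the downstream project that builds the uberjar
assemblyShadeRules in assembly := Seq(
  ShadeRule.rename("org.apache.avro.**" -> "shadedavro.@1").inAll
)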

On Mon, 1 May 2017 at 22:42 Frank Austin Nothaft 
wrote:

> Hi Ryan,
>
> IMO, the problem is that the Spark Avro version conflicts with the Parquet
> Avro version. As discussed upthread, I don’t think there’s a way to
> *reliably *make sure that Avro 1.8 is on the classpath first while using
> spark-submit. Relocating avro in our project wouldn’t solve the problem,
> because the MethodNotFoundError is thrown from the internals of the
> ParquetAvroOutputFormat, not from code in our project.
>
> Regards,
>
> Frank Austin Nothaft
> fnoth...@berkeley.edu
> fnoth...@eecs.berkeley.edu
> 202-340-0466 <(202)%20340-0466>
>
> On May 1, 2017, at 12:33 PM, Ryan Blue  wrote:
>
> Michael, I think that the problem is with your classpath.
>
> Spark has a dependency to 1.7.7, which can't be changed. Your project is
> what pulls in parquet-avro and transitively Avro 1.8. Spark has no runtime
> dependency on Avro 1.8. It is understandably annoying that using the same
> version of Parquet for your parquet-avro dependency is what causes your
> project to depend on Avro 1.8, but Spark's dependencies aren't a problem
> because its Parquet dependency doesn't bring in Avro.
>
> There are a few ways around this:
> 1. Make sure Avro 1.8 is found in the classpath first
> 2. Shade Avro 1.8 in your project (assuming Avro classes aren't shared)
> 3. Use parquet-avro 1.8.1 in your project, which I think should work with
> 1.8.2 and avoid the Avro change
>
> The work-around in Spark is for tests, which do use parquet-avro. We can
> look at a Parquet 1.8.3 that avoids this issue, but I think this is
> reasonable for the 2.2.0 release.
>
> rb
>
> On Mon, May 1, 2017 at 12:08 PM, Michael Heuer  wrote:
>
>> Please excuse me if I'm misunderstanding -- the problem is not with our
>> library or our classpath.
>>
>> There is a conflict within Spark itself, in that Parquet 1.8.2 expects to
>> find Avro 1.8.0 on the runtime classpath and sees 1.7.7 instead.  Spark
>> already has to work around this for unit tests to pass.
>>
>>
>>
>> On Mon, May 1, 2017 at 2:00 PM, Ryan Blue  wrote:
>>
>>> Thanks for the extra context, Frank. I agree that it sounds like your
>>> problem comes from the conflict between your Jars and what comes with
>>> Spark. Its the same concern that makes everyone shudder when anything has a
>>> public dependency on Jackson. :)
>>>
>>> What we usually do to get around situations like this is to relocate the
>>> problem library inside the shaded Jar. That way, Spark uses its version of
>>> Avro and your classes use a different version of Avro. This works if you
>>> don't need to share classes between the two. Would that work for your
>>> situation?
>>>
>>> rb
>>>
>>> On Mon, May 1, 2017 at 11:55 AM, Koert Kuipers 
>>> wrote:
>>>
 sounds like you are running into the fact that you cannot really put
 your classes before spark's on classpath? spark's switches to support this
 never really worked for me either.

 inability to control the classpath + inconsistent jars => trouble ?

 On Mon, May 1, 2017 at 2:36 PM, Frank Austin Nothaft <
 fnoth...@berkeley.edu> wrote:

> Hi Ryan,
>
> We do set Avro to 1.8 in our downstream project. We also set Spark as
> a provided dependency, and build an überjar. We run via spark-submit, 
> which
> builds the classpath with our überjar and all of the Spark deps. This 
> leads
> to avro 1.7.1 getting picked off of the classpath at runtime, which causes
> the no such method exception to occur.
>
> Regards,
>
> Frank Austin Nothaft
> fnoth...@berkeley.edu
> fnoth...@eecs.berkeley.edu
> 202-340-0466 <(202)%20340-0466>
>
> On May 1, 2017, at 11:31 AM, Ryan Blue  wrote:
>
> Frank,
>
> The issue you're running into is caused by using parquet-avro with
> Avro 1.7. Can't your downstream project set the Avro dependency to 1.8?
> Spark can't update Avro because it is a breaking change that would force
> users to rebuilt specific Avro classes in some cases. But you should be
> free to use Avro 1.8 to avoid the problem.
>
> On Mon, May 1, 2017 at 11:08 AM, Frank Austin Nothaft <
> fnoth...@berkeley.edu> wrote:
>
>> Hi Ryan et al,
>>
>> The issue we’ve seen using a build of the Spark 2.2.0 branch from a
>> downstream project is that parquet-avro uses one of the new Avro 1.8.0
>> methods, and you get a NoSuchMethodError since Spark puts Avro 1.7.7 as a
>> dependency. My colleague Michael (who posted earlier on this thread)
>> documented this in Spark-19697
>> . I know that
>> Spark has unit tests that check this compatibility issue, but it looks 
>> like
>>

Re: [VOTE] Apache Spark 2.2.0 (RC2)

2017-05-19 Thread Nick Pentreath
All the outstanding ML QA doc and user guide items are done for 2.2 so from
that side we should be good to cut another RC :)

On Thu, 18 May 2017 at 00:18 Russell Spitzer 
wrote:

> Seeing an issue with the DataScanExec and some of our integration tests
> for the SCC. Running dataframe read and writes from the shell seems fine
> but the Redaction code seems to get a "None" when doing
> SparkSession.getActiveSession.get in our integration tests. I'm not sure
> why but i'll dig into this later if I get a chance.
>
> Example Failed Test
>
> https://github.com/datastax/spark-cassandra-connector/blob/v2.0.1/spark-cassandra-connector/src/it/scala/com/datastax/spark/connector/sql/CassandraSQLSpec.scala#L311
>
> ```[info]   org.apache.spark.SparkException: Job aborted due to stage
> failure: Task serialization failed: java.util.NoSuchElementException:
> None.get
> [info] java.util.NoSuchElementException: None.get
> [info] at scala.None$.get(Option.scala:347)
> [info] at scala.None$.get(Option.scala:345)
> [info] at org.apache.spark.sql.execution.DataSourceScanExec$class.org
> $apache$spark$sql$execution$DataSourceScanExec$$redact(DataSourceScanExec.scala:70)
> [info] at
> org.apache.spark.sql.execution.DataSourceScanExec$$anonfun$4.apply(DataSourceScanExec.scala:54)
> [info] at
> org.apache.spark.sql.execution.DataSourceScanExec$$anonfun$4.apply(DataSourceScanExec.scala:52)
> ```
>
> Again this only seems to repro in our IT suite so I'm not sure if this is a
> real issue.
>
>
> On Tue, May 16, 2017 at 1:40 PM Joseph Bradley 
> wrote:
>
>> All of the ML/Graph/SparkR QA blocker JIRAs have been resolved.  Thanks
>> everyone who helped out on those!
>>
>> We still have open ML/Graph/SparkR JIRAs targeted at 2.2, but they are
>> essentially all for documentation.
>>
>> Joseph
>>
>> On Thu, May 11, 2017 at 3:08 PM, Marcelo Vanzin 
>> wrote:
>>
>>> Since you'll be creating a new RC, I'd wait until SPARK-20666 is
>>> fixed, since the change that caused it is in branch-2.2. Probably a
>>> good idea to raise it to blocker at this point.
>>>
>>> On Thu, May 11, 2017 at 2:59 PM, Michael Armbrust
>>>  wrote:
>>> > I'm going to -1 given the outstanding issues and lack of +1s.  I'll
>>> create
>>> > another RC once ML has had time to take care of the more critical
>>> problems.
>>> > In the meantime please keep testing this release!
>>> >
>>> > On Tue, May 9, 2017 at 2:00 AM, Kazuaki Ishizaki 
>>> > wrote:
>>> >>
>>> >> +1 (non-binding)
>>> >>
>>> >> I tested it on Ubuntu 16.04 and OpenJDK8 on ppc64le. All of the tests
>>> for
>>> >> core have passed.
>>> >>
>>> >> $ java -version
>>> >> openjdk version "1.8.0_111"
>>> >> OpenJDK Runtime Environment (build
>>> >> 1.8.0_111-8u111-b14-2ubuntu0.16.04.2-b14)
>>> >> OpenJDK 64-Bit Server VM (build 25.111-b14, mixed mode)
>>> >> $ build/mvn -DskipTests -Phive -Phive-thriftserver -Pyarn -Phadoop-2.7
>>> >> package install
>>> >> $ build/mvn -Phive -Phive-thriftserver -Pyarn -Phadoop-2.7 test -pl
>>> core
>>> >> ...
>>> >> Run completed in 15 minutes, 12 seconds.
>>> >> Total number of tests run: 1940
>>> >> Suites: completed 206, aborted 0
>>> >> Tests: succeeded 1940, failed 0, canceled 4, ignored 8, pending 0
>>> >> All tests passed.
>>> >> [INFO]
>>> >>
>>> 
>>> >> [INFO] BUILD SUCCESS
>>> >> [INFO]
>>> >>
>>> 
>>> >> [INFO] Total time: 16:51 min
>>> >> [INFO] Finished at: 2017-05-09T17:51:04+09:00
>>> >> [INFO] Final Memory: 53M/514M
>>> >> [INFO]
>>> >>
>>> 
>>> >> [WARNING] The requested profile "hive" could not be activated because
>>> it
>>> >> does not exist.
>>> >>
>>> >>
>>> >> Kazuaki Ishizaki,
>>> >>
>>> >>
>>> >>
>>> >> From:Michael Armbrust 
>>> >> To:"dev@spark.apache.org" 
>>> >> Date:2017/05/05 02:08
>>> >> Subject:[VOTE] Apache Spark 2.2.0 (RC2)
>>> >> 
>>> >>
>>> >>
>>> >>
>>> >> Please vote on releasing the following candidate as Apache Spark
>>> version
>>> >> 2.2.0. The vote is open until Tues, May 9th, 2017 at 12:00 PST and
>>> passes if
>>> >> a majority of at least 3 +1 PMC votes are cast.
>>> >>
>>> >> [ ] +1 Release this package as Apache Spark 2.2.0
>>> >> [ ] -1 Do not release this package because ...
>>> >>
>>> >>
>>> >> To learn more about Apache Spark, please see http://spark.apache.org/
>>> >>
>>> >> The tag to be voted on is v2.2.0-rc2
>>> >> (1d4017b44d5e6ad156abeaae6371747f111dd1f9)
>>> >>
>>> >> List of JIRA tickets resolved can be found with this filter.
>>> >>
>>> >> The release files, including signatures, digests, etc. can be found
>>> at:
>>> >> http://home.apache.org/~pwendell/spark-releases/spark-2.2.0-rc2-bin/
>>> >>
>>> >> Release artifacts are signed with the following key:
>>> >> https://people.apache.org/keys/committer/pwendell.asc
>>> >>
>>> >> The staging repos

Re: RDD MLLib Deprecation Question

2017-05-30 Thread Nick Pentreath
The short answer is those distributed linalg parts will not go away.

In the medium term, it's much less likely that the distributed matrix
classes will be ported over to DataFrames (though the ideal would be to
have DataFrame-backed distributed matrix classes) - given the time and
effort it's taken just to port the various ML models and feature
transformers over to ML.

The current distributed matrices use the old mllib linear algebra
primitives for backing datastructures and ops, so those will have to be
ported at some point to the ml package vectors & matrices, though overall
functionality would remain the same initially I would expect.

There is https://issues.apache.org/jira/browse/SPARK-15882 that discusses
some of the ideas. The decision would still need to be made on the
higher-level API (whether it remains the same as current, or changes to be
DF-based, and/or changed in other ways, etc)
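
For reference, the RDD-based API John is asking about keeps working in the
meantime; a rough, untested sketch (values and block sizes below are made up):

import org.apache.spark.SparkContext
import org.apache.spark.mllib.linalg.distributed.{CoordinateMatrix, MatrixEntry}

def blockMultiplyExample(sc: SparkContext): Unit = {
  // two small distributed matrices built from coordinate entries
  val entriesA = sc.parallelize(Seq(MatrixEntry(0, 0, 1.0), MatrixEntry(1, 1, 2.0)))
  val entriesB = sc.parallelize(Seq(MatrixEntry(0, 1, 3.0), MatrixEntry(1, 0, 4.0)))

  val a = new CoordinateMatrix(entriesA).toBlockMatrix(2, 2).cache()
  val b = new CoordinateMatrix(entriesB).toBlockMatrix(2, 2).cache()

  // distributed block-wise multiply; result blocks are dense
  val product = a.multiply(b)
  println(product.toLocalMatrix())  // only sensible for small matrices
}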

On Tue, 30 May 2017 at 15:33 John Compitello 
wrote:

> Hey all,
>
> I see on the MLLib website that there are plans to deprecate the RDD based
> API for MLLib once the new ML API reaches feature parity with RDD based
> one. Are there currently plans to reimplement all the distributed linear
> algebra / matrices operations as part of this new API, or are these things
> just going away? Like, will there still be a BlockMatrix class for
> distributed multiplies?
>
> Best,
>
> John
>
>
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: [VOTE] Apache Spark 2.2.0 (RC4)

2017-06-06 Thread Nick Pentreath
The website updates for ML QA (SPARK-20507) are not *actually* critical as
the project website certainly can be updated separately from the source
code guide and is not part of the release to be voted on. In future that
particular work item for the QA process could be marked down in priority,
and is definitely not a release blocker.

In any event I just resolved SPARK-20507, as I don't believe any website
updates are required for this release anyway. That fully resolves the ML QA
umbrella (SPARK-20499).


On Tue, 6 Jun 2017 at 10:16 Sean Owen  wrote:

> On Tue, Jun 6, 2017 at 1:06 AM Michael Armbrust 
> wrote:
>
>> Regarding the readiness of this and previous RCs.  I did cut RC1 & RC2
>> knowing that they were unlikely to pass.  That said, I still think these
>> early RCs are valuable. I know several users that wanted to test new
>> features in 2.2 that have used them.  Now, if we would prefer to call them
>> preview or RC0 or something I'd be okay with that as well.
>>
>
> They are valuable, I only suggest it's better to note explicitly when
> there are blockers or must-do tasks that will fail the RC. It makes a big
> difference to whether one would like to +1.
>
> I meant more than just calling them something different. An early RC could
> be voted as a released 'preview' artifact, at the start of the notional QA
> period, with a lower bar to passing, and releasable with known issues. This
> encourages more testing. It also resolves the controversy about whether
> it's OK to include an RC in a product (separate thread).
>
>
> Regarding doc updates, I don't think it is a requirement that they be
>> voted on as part of the release.  Even if they are something version
>> specific.  I think we have regularly updated the website with documentation
>> that was merged after the release.
>>
>
> They're part of the source release too, as markdown, and should be voted
> on. I've never understood otherwise. Have we actually released docs and
> then later changed them, so that they don't match the release? I don't
> recall that, but I do recall updating the non-version-specific website.
>
> Aside from the oddity of having docs generated from x.y source not match
> docs published for x.y, you want the same protections for doc source that
> the project distributes as anything else. It's not just correctness, but
> liability. The hypothetical is always that someone included copyrighted
> text or something without permission and now the project can't rely on the
> argument that it made a good-faith effort to review what it released on the
> site. Someone becomes personally liable.
>
> These are pretty technical reasons though. More practically, what's the
> hurry to release if docs aren't done (_if_ they're not done)? It's being
> presented as normal practice, but seems quite exceptional.
>
>
>
>> I personally don't think the QA umbrella JIRAs are particularly
>> effective, but I also wouldn't ban their use if others think they are.
>> However, I do think that real QA needs an RC to test, so I think it is fine
>> that there is still outstanding QA to be done when an RC is cut.  For
>> example, I plan to run a bunch of streaming workloads on RC4 and will vote
>> accordingly.
>>
>
> QA on RCs is great (see above). The problem is, I can't distinguish
> between a JIRA that means "we must test in general", which sounds like
> something you too would ignore, and one that means "there is specific
> functionality we have to check before a release that we haven't looked at
> yet", which is a committer waving a flag that they implicitly do not want a
> release until resolved. I wouldn't +1 a release that had a Blocker software
> defect one of us reported.
>
> I know I'm harping on this, but this is the one mechanism we do use
> consistently (Blocker JIRAs) to clearly communicate about issues vital to a
> go / no-go release decision, and I think this interferes. The rest of JIRA
> noise doesn't matter much. You can see we're already resorting to secondary
> communications as a result ("anyone have any issues that need to be fixed
> before I cut another RC?" emails) because this is kind of ignored, and
> think we're swapping out a decent mechanism for worse one.
>
> I suspect, as you do, that there's no to-do here in which case they should
> be resolved and we're still on track for release. I'd wait on +1 until then.
>
>


Re: [VOTE] Apache Spark 2.2.0 (RC4)

2017-06-06 Thread Nick Pentreath
Now, on the subject of (ML) QA JIRAs.

From the ML side, I believe they are required (I think others such as
Joseph will agree and in fact have already said as much).

Most are marked as Blockers, though of those the Python API coverage is
strictly not a Blocker as we will never hold the release for API parity
issues (unless of course there is some critical bug or missing thing, but
that really falls under the standard RC bug triage process).

I believe they are Blockers, since they involve auditing binary compat and
new public APIs, visibility issues, Java compat etc. I think it's obvious
that an RC should not pass if these have not been checked.

I actually agree that docs and user guide are absolutely part of the
release, and in fact are one of the more important pieces of the release.
Apart from the issues Sean mentions, not treating these things are critical
issues or even blockers is what inevitably over time leads to the user
guide being out of date, missing important features, etc.

In practice for ML at least we definitely aim to have all the doc / guide
issues done before the final release.

Now in terms of process, none of these QA issues really require an RC, they
can all be carried out once the release branch is cut. Some of the issues
like binary compat are perhaps a bit more tricky but inevitably involve
manually checking through MiMa exclusions added, to verify they are ok, etc
- so again an actual RC is not required here.

So really the answer is to more aggressively burn down these QA issues the
moment the release branch has been cut. Again, I think this echoes what
Joseph has said in previous threads.



On Tue, 6 Jun 2017 at 10:16 Sean Owen  wrote:

> On Tue, Jun 6, 2017 at 1:06 AM Michael Armbrust 
> wrote:
>
>> Regarding the readiness of this and previous RCs.  I did cut RC1 & RC2
>> knowing that they were unlikely to pass.  That said, I still think these
>> early RCs are valuable. I know several users that wanted to test new
>> features in 2.2 that have used them.  Now, if we would prefer to call them
>> preview or RC0 or something I'd be okay with that as well.
>>
>
> They are valuable, I only suggest it's better to note explicitly when
> there are blockers or must-do tasks that will fail the RC. It makes a big
> difference to whether one would like to +1.
>
> I meant more than just calling them something different. An early RC could
> be voted as a released 'preview' artifact, at the start of the notional QA
> period, with a lower bar to passing, and releasable with known issues. This
> encourages more testing. It also resolves the controversy about whether
> it's OK to include an RC in a product (separate thread).
>
>
> Regarding doc updates, I don't think it is a requirement that they be
>> voted on as part of the release.  Even if they are something version
>> specific.  I think we have regularly updated the website with documentation
>> that was merged after the release.
>>
>
> They're part of the source release too, as markdown, and should be voted
> on. I've never understood otherwise. Have we actually released docs and
> then later changed them, so that they don't match the release? I don't
> recall that, but I do recall updating the non-version-specific website.
>
> Aside from the oddity of having docs generated from x.y source not match
> docs published for x.y, you want the same protections for doc source that
> the project distributes as anything else. It's not just correctness, but
> liability. The hypothetical is always that someone included copyrighted
> text or something without permission and now the project can't rely on the
> argument that it made a good-faith effort to review what it released on the
> site. Someone becomes personally liable.
>
> These are pretty technical reasons though. More practically, what's the
> hurry to release if docs aren't done (_if_ they're not done)? It's being
> presented as normal practice, but seems quite exceptional.
>
>
>
>> I personally don't think the QA umbrella JIRAs are particularly
>> effective, but I also wouldn't ban their use if others think they are.
>> However, I do think that real QA needs an RC to test, so I think it is fine
>> that there is still outstanding QA to be done when an RC is cut.  For
>> example, I plan to run a bunch of streaming workloads on RC4 and will vote
>> accordingly.
>>
>
> QA on RCs is great (see above). The problem is, I can't distinguish
> between a JIRA that means "we must test in general", which sounds like
> something you too would ignore, and one that means "there is specific
> functionality we have to check before a release that we haven't looked at
> yet", which is a committer waving a flag that they implicitly do not want a
> release until resolved. I wouldn't +1 a release that had a Blocker software
> defect one of us reported.
>
> I know I'm harping on this, but this is the one mechanism we do use
> consistently (Blocker JIRAs) to clearly communicate about issues vital to a
> go / n

Re: [VOTE] Apache Spark 2.2.0 (RC4)

2017-06-09 Thread Nick Pentreath
All Scala, Python tests pass. ML QA and doc issues are resolved (as well as
R it seems).

However, I'm seeing the following test failure on R consistently:
https://gist.github.com/MLnick/5f26152f97ae8473f807c6895817cf72


On Thu, 8 Jun 2017 at 08:48 Denny Lee  wrote:

> +1 non-binding
>
> Tested on macOS Sierra, Ubuntu 16.04
> test suite includes various test cases including Spark SQL, ML,
> GraphFrames, Structured Streaming
>
>
> On Wed, Jun 7, 2017 at 9:40 PM vaquar khan  wrote:
>
>> +1 non-binding
>>
>> Regards,
>> vaquar khan
>>
>> On Jun 7, 2017 4:32 PM, "Ricardo Almeida" 
>> wrote:
>>
>> +1 (non-binding)
>>
>> Built and tested with -Phadoop-2.7 -Dhadoop.version=2.7.3 -Pyarn -Phive
>> -Phive-thriftserver -Pscala-2.11 on
>>
>>- Ubuntu 17.04, Java 8 (OpenJDK 1.8.0_111)
>>- macOS 10.12.5 Java 8 (build 1.8.0_131)
>>
>>
>> On 5 June 2017 at 21:14, Michael Armbrust  wrote:
>>
>>> Please vote on releasing the following candidate as Apache Spark
>>> version 2.2.0. The vote is open until Thurs, June 8th, 2017 at 12:00
>>> PST and passes if a majority of at least 3 +1 PMC votes are cast.
>>>
>>> [ ] +1 Release this package as Apache Spark 2.2.0
>>> [ ] -1 Do not release this package because ...
>>>
>>>
>>> To learn more about Apache Spark, please see http://spark.apache.org/
>>>
>>> The tag to be voted on is v2.2.0-rc4
>>>  (
>>> 377cfa8ac7ff7a8a6a6d273182e18ea7dc25ce7e)
>>>
>>> List of JIRA tickets resolved can be found with this filter
>>> 
>>> .
>>>
>>> The release files, including signatures, digests, etc. can be found at:
>>> http://home.apache.org/~pwendell/spark-releases/spark-2.2.0-rc4-bin/
>>>
>>> Release artifacts are signed with the following key:
>>> https://people.apache.org/keys/committer/pwendell.asc
>>>
>>> The staging repository for this release can be found at:
>>> https://repository.apache.org/content/repositories/orgapachespark-1241/
>>>
>>> The documentation corresponding to this release can be found at:
>>> http://people.apache.org/~pwendell/spark-releases/spark-2.2.0-rc4-docs/
>>>
>>>
>>> *FAQ*
>>>
>>> *How can I help test this release?*
>>>
>>> If you are a Spark user, you can help us test this release by taking an
>>> existing Spark workload and running on this release candidate, then
>>> reporting any regressions.
>>>
>>> *What should happen to JIRA tickets still targeting 2.2.0?*
>>>
>>> Committers should look at those and triage. Extremely important bug
>>> fixes, documentation, and API tweaks that impact compatibility should be
>>> worked on immediately. Everything else please retarget to 2.3.0 or 2.2.1.
>>>
>>> *But my bug isn't fixed!??!*
>>>
>>> In order to make timely releases, we will typically not hold the release
>>> unless the bug in question is a regression from 2.1.1.
>>>
>>
>>
>>


Re: [VOTE] Apache Spark 2.2.0 (RC4)

2017-06-13 Thread Nick Pentreath
Hi yeah sorry for slow response - I was RHEL and OpenJDK but will have to
report back later with the versions as am AFK.

R version not totally sure but again will revert asap
On Wed, 14 Jun 2017 at 05:09, Felix Cheung 
wrote:

> Thanks
> This was with an external package and unrelated
>
>   >> macOS Sierra 10.12.3 / R 3.2.3 - passed with a warning (
> https://gist.github.com/HyukjinKwon/85cbcfb245825852df20ed6a9ecfd845)
>
> As for CentOS - would it be possible to test against R older than 3.4.0?
> This is the same error reported by Nick below.
>
> _
> From: Hyukjin Kwon 
> Sent: Tuesday, June 13, 2017 8:02 PM
>
> Subject: Re: [VOTE] Apache Spark 2.2.0 (RC4)
> To: dev 
> Cc: Sean Owen , Nick Pentreath <
> nick.pentre...@gmail.com>, Felix Cheung 
>
>
>
> For the test failure on R, I checked:
>
>
> Per https://github.com/apache/spark/tree/v2.2.0-rc4,
>
> 1. Windows Server 2012 R2 / R 3.3.1 - passed (
> https://ci.appveyor.com/project/spark-test/spark/build/755-r-test-v2.2.0-rc4
> )
> 2. macOS Sierra 10.12.3 / R 3.4.0 - passed
> 3. macOS Sierra 10.12.3 / R 3.2.3 - passed with a warning (
> https://gist.github.com/HyukjinKwon/85cbcfb245825852df20ed6a9ecfd845)
> 4. CentOS 7.2.1511 / R 3.4.0 - reproduced (
> https://gist.github.com/HyukjinKwon/2a736b9f80318618cc147ac2bb1a987d)
>
>
> Per https://github.com/apache/spark/tree/v2.1.1,
>
> 1. CentOS 7.2.1511 / R 3.4.0 - reproduced (
> https://gist.github.com/HyukjinKwon/6064b0d10bab8fc1dc6212452d83b301)
>
>
> This looks being failed only in CentOS 7.2.1511 / R 3.4.0 given my tests
> and observations.
>
> This is failed in Spark 2.1.1. So, it sounds not a regression although it
> is a bug that should be fixed (whether in Spark or R).
>
>
> 2017-06-14 8:28 GMT+09:00 Xiao Li :
>
>> -1
>>
>> Spark 2.2 is unable to read the partitioned table created by Spark 2.1 or
>> earlier.
>>
>> Opened a JIRA https://issues.apache.org/jira/browse/SPARK-21085
>>
>> Will fix it soon.
>>
>> Thanks,
>>
>> Xiao Li
>>
>>
>>
>> 2017-06-13 9:39 GMT-07:00 Joseph Bradley :
>>
>>> Re: the QA JIRAs:
>>> Thanks for discussing them.  I still feel they are very helpful; I
>>> particularly notice not having to spend a solid 2-3 weeks of time QAing
>>> (unlike in earlier Spark releases).  One other point not mentioned above: I
>>> think they serve as a very helpful reminder/training for the community for
>>> rigor in development.  Since we instituted QA JIRAs, contributors have been
>>> a lot better about adding in docs early, rather than waiting until the end
>>> of the cycle (though I know this is drawing conclusions from correlations).
>>>
>>> I would vote in favor of the RC...but I'll wait to see about the
>>> reported failures.
>>>
>>> On Fri, Jun 9, 2017 at 3:30 PM, Sean Owen  wrote:
>>>
>>>> Different errors as in
>>>> https://issues.apache.org/jira/browse/SPARK-20520 but that's also
>>>> reporting R test failures.
>>>>
>>>> I went back and tried to run the R tests and they passed, at least on
>>>> Ubuntu 17 / R 3.3.
>>>>
>>>>
>>>> On Fri, Jun 9, 2017 at 9:12 AM Nick Pentreath 
>>>> wrote:
>>>>
>>>>> All Scala, Python tests pass. ML QA and doc issues are resolved (as
>>>>> well as R it seems).
>>>>>
>>>>> However, I'm seeing the following test failure on R consistently:
>>>>> https://gist.github.com/MLnick/5f26152f97ae8473f807c6895817cf72
>>>>>
>>>>>
>>>>> On Thu, 8 Jun 2017 at 08:48 Denny Lee  wrote:
>>>>>
>>>>>> +1 non-binding
>>>>>>
>>>>>> Tested on macOS Sierra, Ubuntu 16.04
>>>>>> test suite includes various test cases including Spark SQL, ML,
>>>>>> GraphFrames, Structured Streaming
>>>>>>
>>>>>>
>>>>>> On Wed, Jun 7, 2017 at 9:40 PM vaquar khan 
>>>>>> wrote:
>>>>>>
>>>>>>> +1 non-binding
>>>>>>>
>>>>>>> Regards,
>>>>>>> vaquar khan
>>>>>>>
>>>>>>> On Jun 7, 2017 4:32 PM, "Ricardo Almeida" <
>>>>>>> ricardo.alme...@actnowib.com> wrote:
>>>>>>>
>>>>>>> +1 (non-binding)
>>>>>>>
>>>>>>

Re: [VOTE] Apache Spark 2.2.0 (RC4)

2017-06-21 Thread Nick Pentreath
Thanks, I added the details of my environment to the JIRA (for what it's
worth now, as the issue is identified)

On Wed, 14 Jun 2017 at 11:28 Hyukjin Kwon  wrote:

> Actually, I opened - https://issues.apache.org/jira/browse/SPARK-21093.
>
> 2017-06-14 17:08 GMT+09:00 Hyukjin Kwon :
>
>> For a shorter reproducer ...
>>
>>
>> df <- createDataFrame(list(list(1L, 1, "1", 0.1)), c("a", "b", "c", "d"))
>> collect(gapply(df, "a", function(key, x) { x }, schema(df)))
>>
>> And running the below multiple times (5~7):
>>
>> collect(gapply(df, "a", function(key, x) { x }, schema(df)))
>>
>> looks occasionally throwing an error.
>>
>>
>> I will leave here and probably explain more information if a JIRA is
>> open. This does not look a regression anyway.
>>
>>
>>
>> 2017-06-14 16:22 GMT+09:00 Hyukjin Kwon :
>>
>>>
>>> Per https://github.com/apache/spark/tree/v2.1.1,
>>>
>>> 1. CentOS 7.2.1511 / R 3.3.3 - this test hangs.
>>>
>>> I messed it up a bit while downgrading the R to 3.3.3 (It was an actual
>>> machine not a VM) so it took me a while to re-try this.
>>> I re-built this again and checked the R version is 3.3.3 at least. I
>>> hope this one could double checked.
>>>
>>> Here is the self-reproducer:
>>>
>>> irisDF <- suppressWarnings(createDataFrame (iris))
>>> schema <-  structType(structField("Sepal_Length", "double"),
>>> structField("Avg", "double"))
>>> df4 <- gapply(
>>>   cols = "Sepal_Length",
>>>   irisDF,
>>>   function(key, x) {
>>> y <- data.frame(key, mean(x$Sepal_Width), stringsAsFactors = FALSE)
>>>   },
>>>   schema)
>>> collect(df4)
>>>
>>>
>>>
>>> 2017-06-14 16:07 GMT+09:00 Felix Cheung :
>>>
>>>> Thanks! Will try to setup RHEL/CentOS to test it out
>>>>
>>>> _
>>>> From: Nick Pentreath 
>>>> Sent: Tuesday, June 13, 2017 11:38 PM
>>>> Subject: Re: [VOTE] Apache Spark 2.2.0 (RC4)
>>>> To: Felix Cheung , Hyukjin Kwon <
>>>> gurwls...@gmail.com>, dev 
>>>>
>>>> Cc: Sean Owen 
>>>>
>>>>
>>>> Hi yeah sorry for slow response - I was RHEL and OpenJDK but will have
>>>> to report back later with the versions as am AFK.
>>>>
>>>> R version not totally sure but again will revert asap
>>>> On Wed, 14 Jun 2017 at 05:09, Felix Cheung 
>>>> wrote:
>>>>
>>>>> Thanks
>>>>> This was with an external package and unrelated
>>>>>
>>>>>   >> macOS Sierra 10.12.3 / R 3.2.3 - passed with a warning (
>>>>> https://gist.github.com/HyukjinKwon/85cbcfb245825852df20ed6a9ecfd845)
>>>>>
>>>>> As for CentOS - would it be possible to test against R older than
>>>>> 3.4.0? This is the same error reported by Nick below.
>>>>>
>>>>> _
>>>>> From: Hyukjin Kwon 
>>>>> Sent: Tuesday, June 13, 2017 8:02 PM
>>>>>
>>>>> Subject: Re: [VOTE] Apache Spark 2.2.0 (RC4)
>>>>> To: dev 
>>>>> Cc: Sean Owen , Nick Pentreath <
>>>>> nick.pentre...@gmail.com>, Felix Cheung 
>>>>>
>>>>>
>>>>>
>>>>> For the test failure on R, I checked:
>>>>>
>>>>>
>>>>> Per https://github.com/apache/spark/tree/v2.2.0-rc4,
>>>>>
>>>>> 1. Windows Server 2012 R2 / R 3.3.1 - passed (
>>>>> https://ci.appveyor.com/project/spark-test/spark/build/755-r-test-v2.2.0-rc4
>>>>> )
>>>>> 2. macOS Sierra 10.12.3 / R 3.4.0 - passed
>>>>> 3. macOS Sierra 10.12.3 / R 3.2.3 - passed with a warning (
>>>>> https://gist.github.com/HyukjinKwon/85cbcfb245825852df20ed6a9ecfd845)
>>>>> 4. CentOS 7.2.1511 / R 3.4.0 - reproduced (
>>>>> https://gist.github.com/HyukjinKwon/2a736b9f80318618cc147ac2bb1a987d)
>>>>>
>>>>>
>>>>> Per https://github.com/apache/spark/tree/v2.1.1,
>>>>>
>>>>> 1. CentOS 7.2.1511 / R 3.4.0 - reproduced (
>>>>> https://gist.github.com/HyukjinKwon/6064b0d10bab8fc1dc6212452d83b30

Re: [VOTE] Apache Spark 2.2.0 (RC5)

2017-06-21 Thread Nick Pentreath
As before, release looks good, all Scala, Python tests pass. R tests fail
with same issue in SPARK-21093 but it's not a blocker.

+1 (binding)


On Wed, 21 Jun 2017 at 01:49 Michael Armbrust 
wrote:

> I will kick off the voting with a +1.
>
> On Tue, Jun 20, 2017 at 4:49 PM, Michael Armbrust 
> wrote:
>
>> Please vote on releasing the following candidate as Apache Spark version
>> 2.2.0. The vote is open until Friday, June 23rd, 2017 at 18:00 PST and
>> passes if a majority of at least 3 +1 PMC votes are cast.
>>
>> [ ] +1 Release this package as Apache Spark 2.2.0
>> [ ] -1 Do not release this package because ...
>>
>>
>> To learn more about Apache Spark, please see https://spark.apache.org/
>>
>> The tag to be voted on is v2.2.0-rc5
>>  (
>> 62e442e73a2fa663892d2edaff5f7d72d7f402ed)
>>
>> List of JIRA tickets resolved can be found with this filter
>> 
>> .
>>
>> The release files, including signatures, digests, etc. can be found at:
>> https://home.apache.org/~pwendell/spark-releases/spark-2.2.0-rc5-bin/
>>
>> Release artifacts are signed with the following key:
>> https://people.apache.org/keys/committer/pwendell.asc
>>
>> The staging repository for this release can be found at:
>> https://repository.apache.org/content/repositories/orgapachespark-1243/
>>
>> The documentation corresponding to this release can be found at:
>> https://people.apache.org/~pwendell/spark-releases/spark-2.2.0-rc5-docs/
>>
>>
>> *FAQ*
>>
>> *How can I help test this release?*
>>
>> If you are a Spark user, you can help us test this release by taking an
>> existing Spark workload and running on this release candidate, then
>> reporting any regressions.
>>
>> *What should happen to JIRA tickets still targeting 2.2.0?*
>>
>> Committers should look at those and triage. Extremely important bug
>> fixes, documentation, and API tweaks that impact compatibility should be
>> worked on immediately. Everything else please retarget to 2.3.0 or 2.2.1.
>>
>> *But my bug isn't fixed!??!*
>>
>> In order to make timely releases, we will typically not hold the release
>> unless the bug in question is a regression from 2.1.1.
>>
>
>


Re: [VOTE] Apache Spark 2.2.0 (RC6)

2017-07-03 Thread Nick Pentreath
+1 (binding)

On Mon, 3 Jul 2017 at 11:53 Yanbo Liang  wrote:

> +1
>
> On Mon, Jul 3, 2017 at 5:35 AM, Herman van Hövell tot Westerflier <
> hvanhov...@databricks.com> wrote:
>
>> +1
>>
>> On Sun, Jul 2, 2017 at 11:32 PM, Ricardo Almeida <
>> ricardo.alme...@actnowib.com> wrote:
>>
>>> +1 (non-binding)
>>>
>>> Built and tested with -Phadoop-2.7 -Dhadoop.version=2.7.3 -Pyarn -Phive
>>> -Phive-thriftserver -Pscala-2.11 on
>>>
>>>- macOS 10.12.5 Java 8 (build 1.8.0_131)
>>>- Ubuntu 17.04, Java 8 (OpenJDK 1.8.0_111)
>>>
>>>
>>>
>>>
>>>
>>> On 1 Jul 2017 02:45, "Michael Armbrust"  wrote:
>>>
>>> Please vote on releasing the following candidate as Apache Spark
>>> version 2.2.0. The vote is open until Friday, July 7th, 2017 at 18:00
>>> PST and passes if a majority of at least 3 +1 PMC votes are cast.
>>>
>>> [ ] +1 Release this package as Apache Spark 2.2.0
>>> [ ] -1 Do not release this package because ...
>>>
>>>
>>> To learn more about Apache Spark, please see https://spark.apache.org/
>>>
>>> The tag to be voted on is v2.2.0-rc6
>>>  (
>>> a2c7b2133cfee7fa9abfaa2bfbfb637155466783)
>>>
>>> List of JIRA tickets resolved can be found with this filter
>>> 
>>> .
>>>
>>> The release files, including signatures, digests, etc. can be found at:
>>> https://home.apache.org/~pwendell/spark-releases/spark-2.2.0-rc6-bin/
>>>
>>> Release artifacts are signed with the following key:
>>> https://people.apache.org/keys/committer/pwendell.asc
>>>
>>> The staging repository for this release can be found at:
>>> https://repository.apache.org/content/repositories/orgapachespark-1245/
>>>
>>> The documentation corresponding to this release can be found at:
>>> https://people.apache.org/~pwendell/spark-releases/spark-2.2.0-rc6-docs/
>>>
>>>
>>> *FAQ*
>>>
>>> *How can I help test this release?*
>>>
>>> If you are a Spark user, you can help us test this release by taking an
>>> existing Spark workload and running on this release candidate, then
>>> reporting any regressions.
>>>
>>> *What should happen to JIRA tickets still targeting 2.2.0?*
>>>
>>> Committers should look at those and triage. Extremely important bug
>>> fixes, documentation, and API tweaks that impact compatibility should be
>>> worked on immediately. Everything else please retarget to 2.3.0 or 2.2.1.
>>>
>>> *But my bug isn't fixed!??!*
>>>
>>> In order to make timely releases, we will typically not hold the release
>>> unless the bug in question is a regression from 2.1.1.
>>>
>>>
>>>
>>
>>
>


Re: Updates on migration guides

2017-08-30 Thread Nick Pentreath
MLlib has tried quite hard to ensure the migration guide is up to date for
each release. I think generally we catch all breaking and most major
behavior changes

On Wed, 30 Aug 2017 at 17:02, Dongjoon Hyun  wrote:

> +1
>
> On Wed, Aug 30, 2017 at 7:54 AM, Xiao Li  wrote:
>
>> Hi, Devs,
>>
>> Many questions from the open source community are actually caused by the
>> behavior changes we made in each release. So far, the migration guides
>> (e.g.,
>> https://spark.apache.org/docs/latest/sql-programming-guide.html#migration-guide)
>> were not being properly updated. In the last few releases, multiple
>> behavior changes are not documented in migration guides and even release
>> notes. I propose to do the document updates in the same PRs that introduce
>> the behavior changes. If the contributors can't make it, the committers who
>> merge the PRs need to do it instead. We also can create a dedicated page
>> for migration guides of all the components. Hopefully, this can assist the
>> migration efforts.
>>
>> Thanks,
>>
>> Xiao Li
>>
>
>


Re: [VOTE] Designating maintainers for some Spark components

2014-11-06 Thread Nick Pentreath
+1 (binding)

—
Sent from Mailbox

On Thu, Nov 6, 2014 at 6:52 PM, Debasish Das 
wrote:

> +1
> The app to track PRs based on component is a great idea...
> On Thu, Nov 6, 2014 at 8:47 AM, Sean McNamara 
> wrote:
>> +1
>>
>> Sean
>>
>> On Nov 5, 2014, at 6:32 PM, Matei Zaharia  wrote:
>>
>> > Hi all,
>> >
>> > I wanted to share a discussion we've been having on the PMC list, as
>> well as call for an official vote on it on a public list. Basically, as the
>> Spark project scales up, we need to define a model to make sure there is
>> still great oversight of key components (in particular internal
>> architecture and public APIs), and to this end I've proposed implementing a
>> maintainer model for some of these components, similar to other large
>> projects.
>> >
>> > As background on this, Spark has grown a lot since joining Apache. We've
>> had over 80 contributors/month for the past 3 months, which I believe makes
>> us the most active project in contributors/month at Apache, as well as over
>> 500 patches/month. The codebase has also grown significantly, with new
>> libraries for SQL, ML, graphs and more.
>> >
>> > In this kind of large project, one common way to scale development is to
>> assign "maintainers" to oversee key components, where each patch to that
>> component needs to get sign-off from at least one of its maintainers. Most
>> existing large projects do this -- at Apache, some large ones with this
>> model are CloudStack (the second-most active project overall), Subversion,
>> and Kafka, and other examples include Linux and Python. This is also
>> by-and-large how Spark operates today -- most components have a de-facto
>> maintainer.
>> >
>> > IMO, adopting this model would have two benefits:
>> >
>> > 1) Consistent oversight of design for that component, especially
>> regarding architecture and API. This process would ensure that the
>> component's maintainers see all proposed changes and consider them to fit
>> together in a good way.
>> >
>> > 2) More structure for new contributors and committers -- in particular,
>> it would be easy to look up who’s responsible for each module and ask them
>> for reviews, etc, rather than having patches slip between the cracks.
>> >
>> > We'd like to start with in a light-weight manner, where the model only
>> applies to certain key components (e.g. scheduler, shuffle) and user-facing
>> APIs (MLlib, GraphX, etc). Over time, as the project grows, we can expand
>> it if we deem it useful. The specific mechanics would be as follows:
>> >
>> > - Some components in Spark will have maintainers assigned to them, where
>> one of the maintainers needs to sign off on each patch to the component.
>> > - Each component with maintainers will have at least 2 maintainers.
>> > - Maintainers will be assigned from the most active and knowledgeable
>> committers on that component by the PMC. The PMC can vote to add / remove
>> maintainers, and maintained components, through consensus.
>> > - Maintainers are expected to be active in responding to patches for
>> their components, though they do not need to be the main reviewers for them
>> (e.g. they might just sign off on architecture / API). To prevent inactive
>> maintainers from blocking the project, if a maintainer isn't responding in
>> a reasonable time period (say 2 weeks), other committers can merge the
>> patch, and the PMC will want to discuss adding another maintainer.
>> >
>> > If you'd like to see examples for this model, check out the following
>> projects:
>> > - CloudStack:
>> https://cwiki.apache.org/confluence/display/CLOUDSTACK/CloudStack+Maintainers+Guide
>> <
>> https://cwiki.apache.org/confluence/display/CLOUDSTACK/CloudStack+Maintainers+Guide
>> >
>> > - Subversion:
>> https://subversion.apache.org/docs/community-guide/roles.html <
>> https://subversion.apache.org/docs/community-guide/roles.html>
>> >
>> > Finally, I wanted to list our current proposal for initial components
>> and maintainers. It would be good to get feedback on other components we
>> might add, but please note that personnel discussions (e.g. "I don't think
>> Matei should maintain *that* component") should only happen on the private
>> list. The initial components were chosen to include all public APIs and the
>> main core components, and the maintainers were chosen from the most active
>> contributors to those modules.
>> >
>> > - Spark core public API: Matei, Patrick, Reynold
>> > - Job scheduler: Matei, Kay, Patrick
>> > - Shuffle and network: Reynold, Aaron, Matei
>> > - Block manager: Reynold, Aaron
>> > - YARN: Tom, Andrew Or
>> > - Python: Josh, Matei
>> > - MLlib: Xiangrui, Matei
>> > - SQL: Michael, Reynold
>> > - Streaming: TD, Matei
>> > - GraphX: Ankur, Joey, Reynold
>> >
>> > I'd like to formally call a [VOTE] on this model, to last 72 hours. The
>> [VOTE] will end on Nov 8, 2014 at 6 PM PST.
>> >
>> > Matei
>>
>>
>> -
>> To unsubscribe, e-mail: dev

Re: [VOTE] Release Apache Spark 1.2.0 (RC2)

2014-12-13 Thread Nick Pentreath
+1

—
Sent from Mailbox

On Sat, Dec 13, 2014 at 3:12 PM, GuoQiang Li  wrote:

> +1 (non-binding).  Tested on CentOS 6.4
> -- Original --
> From:  "Patrick Wendell";;
> Date:  Thu, Dec 11, 2014 05:08 AM
> To:  "dev@spark.apache.org";
> Subject:  [VOTE] Release Apache Spark 1.2.0 (RC2)
> Please vote on releasing the following candidate as Apache Spark version 
> 1.2.0!
> The tag to be voted on is v1.2.0-rc2 (commit a428c446e2):
> https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=a428c446e23e628b746e0626cc02b7b3cadf588e
> The release files, including signatures, digests, etc. can be found at:
> http://people.apache.org/~pwendell/spark-1.2.0-rc2/
> Release artifacts are signed with the following key:
> https://people.apache.org/keys/committer/pwendell.asc
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1055/
> The documentation corresponding to this release can be found at:
> http://people.apache.org/~pwendell/spark-1.2.0-rc2-docs/
> Please vote on releasing this package as Apache Spark 1.2.0!
> The vote is open until Saturday, December 13, at 21:00 UTC and passes
> if a majority of at least 3 +1 PMC votes are cast.
> [ ] +1 Release this package as Apache Spark 1.2.0
> [ ] -1 Do not release this package because ...
> To learn more about Apache Spark, please see
> http://spark.apache.org/
> == What justifies a -1 vote for this release? ==
> This vote is happening relatively late into the QA period, so
> -1 votes should only occur for significant regressions from
> 1.0.2. Bugs already present in 1.1.X, minor
> regressions, or bugs related to new features will not block this
> release.
> == What default changes should I be aware of? ==
> 1. The default value of "spark.shuffle.blockTransferService" has been
> changed to "netty"
> --> Old behavior can be restored by switching to "nio"
> 2. The default value of "spark.shuffle.manager" has been changed to "sort".
> --> Old behavior can be restored by setting "spark.shuffle.manager" to "hash".
> == How does this differ from RC1 ==
> This has fixes for a handful of issues identified - some of the
> notable fixes are:
> [Core]
> SPARK-4498: Standalone Master can fail to recognize completed/failed
> applications
> [SQL]
> SPARK-4552: Query for empty parquet table in spark sql hive get
> IllegalArgumentException
> SPARK-4753: Parquet2 does not prune based on OR filters on partition columns
> SPARK-4761: With JDBC server, set Kryo as default serializer and
> disable reference tracking
> SPARK-4785: When called with arguments referring column fields, PMOD throws 
> NPE
> - Patrick
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org

Re: Highly interested in contributing to spark

2015-01-01 Thread Nick Pentreath
I'm sure Spark will sign up for GSoC again this year - and I'd be surprised if 
there was not some interest now for projects :)


If I have the time at that point in the year I'd be happy to mentor a project 
in MLlib but will have to see how my schedule is at that point!




Manoj perhaps some of the locality sensitive hashing stuff you did for 
scikit-learn could find its way to Spark or spark-projects.


—
Sent from Mailbox

On Fri, Jan 2, 2015 at 6:28 AM, Reynold Xin  wrote:

> Hi Manoj,
> Thanks for the email.
> Yes - you should start with the starter task before attempting larger ones.
> Last year I signed up as a mentor for GSoC, but no student signed up. I
> don't think I'd have time to be a mentor this year, but others might.
> On Thu, Jan 1, 2015 at 4:54 PM, Manoj Kumar 
> wrote:
>> Hello,
>>
>> I am Manoj (https://github.com/MechCoder), an undergraduate student highly
>> interested in Machine Learning. I have contributed to SymPy and
>> scikit-learn as part of Google Summer of Code projects and my bachelor's
>> thesis. I have a few quick (non-technical) questions before I dive into the
>> issue tracker.
>>
>> Are the ones marked trivial easy to fix ones, that I could try before
>> attempting slightly more ambitious ones? Also I would like to know if
>> Apache Spark takes part in Google Summer of Code projects under the Apache
>> Software Foundation. It would be really great if it does!
>>
>> Looking forward!
>>
>> --
>> Godspeed,
>> Manoj Kumar,
>> Mech Undergrad
>> http://manojbits.wordpress.com
>>

Re: Highly interested in contributing to spark

2015-01-01 Thread Nick Pentreath
Oh actually I was confused with another project, yours was not LSH sorry!






—
Sent from Mailbox

On Fri, Jan 2, 2015 at 8:19 AM, Nick Pentreath 
wrote:

> I'm sure Spark will sign up for GSoC again this year - and I'd be surprised if 
> there was not some interest now for projects :)
> If I have the time at that point in the year I'd be happy to mentor a project 
> in MLlib but will have to see how my schedule is at that point!
> Manoj perhaps some of the locality sensitive hashing stuff you did for 
> scikit-learn could find its way to Spark or spark-projects.
> —
> Sent from Mailbox
> On Fri, Jan 2, 2015 at 6:28 AM, Reynold Xin  wrote:
>> Hi Manoj,
>> Thanks for the email.
>> Yes - you should start with the starter task before attempting larger ones.
>> Last year I signed up as a mentor for GSoC, but no student signed up. I
>> don't think I'd have time to be a mentor this year, but others might.
>> On Thu, Jan 1, 2015 at 4:54 PM, Manoj Kumar 
>> wrote:
>>> Hello,
>>>
>>> I am Manoj (https://github.com/MechCoder), an undergraduate student highly
>>> interested in Machine Learning. I have contributed to SymPy and
>>> scikit-learn as part of Google Summer of Code projects and my bachelor's
>>> thesis. I have a few quick (non-technical) questions before I dive into the
>>> issue tracker.
>>>
>>> Are the ones marked trivial easy to fix ones, that I could try before
>>> attempting slightly more ambitious ones? Also I would like to know if
>>> Apache Spark takes part in Google Summer of Code projects under the Apache
>>> Software Foundation. It would be really great if it does!
>>>
>>> Looking forward!
>>>
>>> --
>>> Godspeed,
>>> Manoj Kumar,
>>> Mech Undergrad
>>> http://manojbits.wordpress.com
>>>

Re: python converter in HBaseConverter.scala(spark/examples)

2015-01-05 Thread Nick Pentreath
Hey 


These converters are actually just intended to be examples of how to set up a 
custom converter for a specific input format. The converter interface is there 
to provide flexibility where needed, although with the new SparkSQL data store 
interface the intention is that most common use cases can be handled using that 
approach rather than custom converters.




The intention is not to have specific converters living in Spark core, which is 
why these are in the examples project.




Having said that, if you wish to expand the example converter for others 
reference do feel free to submit a PR.




Ideally though, I would think that various custom converters would be part of 
external projects that can be listed with http://spark-packages.org/ I see your 
project is already listed there.
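
For concreteness, the kind of converter Gen is proposing would just be a small
class implementing the Converter trait used by the Python API. Something like
the untested sketch below (cell handling simplified, and not the actual code
from the linked repo) could then be passed by class name as the valueConverter
argument to newAPIHadoopRDD on the PySpark side:

import scala.collection.JavaConverters._
import org.apache.hadoop.hbase.CellUtil
import org.apache.hadoop.hbase.client.Result
import org.apache.hadoop.hbase.util.Bytes
import org.apache.spark.api.python.Converter

// emit every cell of an HBase Result, not just the first value
class HBaseResultToStringListConverter extends Converter[Any, Any] {
  override def convert(obj: Any): Any = {
    val result = obj.asInstanceOf[Result]
    result.rawCells().map { cell =>
      val family = Bytes.toStringBinary(CellUtil.cloneFamily(cell))
      val qualifier = Bytes.toStringBinary(CellUtil.cloneQualifier(cell))
      val value = Bytes.toStringBinary(CellUtil.cloneValue(cell))
      s"$family:$qualifier=$value@${cell.getTimestamp}"
    }.toList.asJava
  }
}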


—
Sent from Mailbox

On Mon, Jan 5, 2015 at 5:37 PM, Ted Yu  wrote:

> In my opinion this would be useful - there was another thread where returning
> only the value of first column in the result was mentioned.
> Please create a SPARK JIRA and a pull request.
> Cheers
> On Mon, Jan 5, 2015 at 6:42 AM, tgbaggio  wrote:
>> Hi,
>>
>> In  HBaseConverter.scala
>> <
>> https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/pythonconverters/HBaseConverters.scala
>> >
>> , the python converter HBaseResultToStringConverter return only the value
>> of
>> first column in the result. In my opinion, it limits the utility of this
>> converter, because it returns only one value per row and moreover it loses
>> the other information of record, such as column:cell, timestamp.
>>
>> Therefore, I would like to propose some modifications about
>> HBaseResultToStringConverter which will be able to return all records in
>> the
>> hbase with more complete information: I have already written some code in
>> pythonConverters.scala
>> <
>> https://github.com/GenTang/spark_hbase/blob/master/src/main/scala/examples/pythonConverters.scala
>> >
>> and it works
>>
>> Is it OK to modify the code in HBaseConverters.scala, please?
>> Thanks a lot in advance.
>>
>> Cheers
>> Gen
>>
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-developers-list.1001551.n3.nabble.com/python-converter-in-HBaseConverter-scala-spark-examples-tp10001.html
>> Sent from the Apache Spark Developers List mailing list archive at
>> Nabble.com.
>>
>> -
>> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>> For additional commands, e-mail: dev-h...@spark.apache.org
>>
>>

Re: python converter in HBaseConverter.scala(spark/examples)

2015-01-05 Thread Nick Pentreath
Absolutely; as I mentioned by all means submit a PR - I just wanted to point 
out that any specific converter is not "officially" supported, although the 
interface is of course.


I'm happy to review a PR - just ping me when ready.


—
Sent from Mailbox

On Mon, Jan 5, 2015 at 7:06 PM, Ted Yu  wrote:

> HBaseConverter is in Spark source tree. Therefore I think it makes sense
> for this improvement to be accepted so that the example is more useful.
> Cheers
> On Mon, Jan 5, 2015 at 7:54 AM, Nick Pentreath 
> wrote:
>> Hey
>>
>> These converters are actually just intended to be examples of how to set
>> up a custom converter for a specific input format. The converter interface
>> is there to provide flexibility where needed, although with the new
>> SparkSQL data store interface the intention is that most common use cases
>> can be handled using that approach rather than custom converters.
>>
>> The intention is not to have specific converters living in Spark core,
>> which is why these are in the examples project.
>>
>> Having said that, if you wish to expand the example converter for others
>> reference do feel free to submit a PR.
>>
>> Ideally though, I would think that various custom converters would be part
>> of external projects that can be listed with http://spark-packages.org/ I
>> see your project is already listed there.
>>
>> —
>> Sent from Mailbox <https://www.dropbox.com/mailbox>
>>
>>
>> On Mon, Jan 5, 2015 at 5:37 PM, Ted Yu  wrote:
>>
>>> In my opinion this would be useful - there was another thread where
>>> returning
>>> only the value of first column in the result was mentioned.
>>>
>>> Please create a SPARK JIRA and a pull request.
>>>
>>> Cheers
>>>
>>> On Mon, Jan 5, 2015 at 6:42 AM, tgbaggio  wrote:
>>>
>>> > Hi,
>>> >
>>> > In HBaseConverter.scala
>>> > <
>>> >
>>> https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/pythonconverters/HBaseConverters.scala
>>> > >
>>> > , the python converter HBaseResultToStringConverter return only the
>>> value
>>> > of
>>> > first column in the result. In my opinion, it limits the utility of
>>> this
>>> > converter, because it returns only one value per row and moreover it
>>> loses
>>> > the other information of record, such as column:cell, timestamp.
>>> >
>>> > Therefore, I would like to propose some modifications about
>>> > HBaseResultToStringConverter which will be able to return all records
>>> in
>>> > the
>>> > hbase with more complete information: I have already written some code
>>> in
>>> > pythonConverters.scala
>>> > <
>>> >
>>> https://github.com/GenTang/spark_hbase/blob/master/src/main/scala/examples/pythonConverters.scala
>>> > >
>>> > and it works
>>> >
>>> > Is it OK to modify the code in HBaseConverters.scala, please?
>>> > Thanks a lot in advance.
>>> >
>>> > Cheers
>>> > Gen
>>> >
>>> >
>>> >
>>> >
>>> > --
>>> > View this message in context:
>>> >
>>> http://apache-spark-developers-list.1001551.n3.nabble.com/python-converter-in-HBaseConverter-scala-spark-examples-tp10001.html
>>> > Sent from the Apache Spark Developers List mailing list archive at
>>> > Nabble.com.
>>> >
>>> > -
>>> > To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>>> > For additional commands, e-mail: dev-h...@spark.apache.org
>>> >
>>> >
>>>
>>
>>

Re: Welcoming three new committers

2015-02-04 Thread Nick Pentreath
Congrats and welcome Sean, Joseph and Cheng!


On Wed, Feb 4, 2015 at 2:10 PM, Sean Owen  wrote:

> Thanks all, I appreciate the vote of trust. I'll do my best to help
> keep JIRA and commits moving along, and am ramping up carefully this
> week. Now get back to work reviewing things!
>
> On Tue, Feb 3, 2015 at 4:34 PM, Matei Zaharia 
> wrote:
> > Hi all,
> >
> > The PMC recently voted to add three new committers: Cheng Lian, Joseph
> Bradley and Sean Owen. All three have been major contributors to Spark in
> the past year: Cheng on Spark SQL, Joseph on MLlib, and Sean on ML and many
> pieces throughout Spark Core. Join me in welcoming them as committers!
> >
> > Matei
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>
>


Re: Directly broadcasting (sort of) RDDs

2015-03-20 Thread Nick Pentreath
There is block matrix in Spark 1.3 - 
http://spark.apache.org/docs/latest/mllib-data-types.html#blockmatrix

However I believe it only supports dense matrix blocks.

Still, might be possible to use it or extend it.

JIRAs:

https://issues.apache.org/jira/plugins/servlet/mobile#issue/SPARK-3434

Was based on 

https://github.com/amplab/ml-matrix

Another lib:


https://github.com/PasaLab/marlin/blob/master/README.md
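
For the collect-and-broadcast pattern you describe, a stripped-down sketch
using the mllib local matrix types would look something like the following
(untested; the block indexing, multiply orientation and the combine-by-column
step are assumptions about your layout, not taken from your code):

import org.apache.spark.SparkContext
import org.apache.spark.mllib.linalg.{DenseMatrix, SparseMatrix}
import org.apache.spark.rdd.RDD

def broadcastMultiply(
    sc: SparkContext,
    denseBlocks: RDD[(Int, DenseMatrix)],
    sparseBlocks: RDD[((Int, Int), SparseMatrix)]): RDD[(Int, DenseMatrix)] = {

  // ship the (small enough) dense blocks to every executor once
  val dense = sc.broadcast(denseBlocks.collectAsMap())

  sparseBlocks
    .map { case ((i, j), sp) => (j, sp.multiply(dense.value(i))) }
    .reduceByKey { (x, y) =>
      // element-wise sum of two equally sized column-major blocks
      new DenseMatrix(x.numRows, x.numCols,
        x.values.zip(y.values).map { case (a, b) => a + b })
    }
}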







—
Sent from Mailbox

On Sat, Mar 21, 2015 at 12:24 AM, Guillaume Pitel
 wrote:

> Hi,
> I have an idea that I would like to discuss with the Spark devs. The 
> idea comes from a very real problem that I have struggled with since 
> almost a year. My problem is very simple, it's a dense matrix * sparse 
> matrix  operation. I have a dense matrix RDD[(Int,FloatMatrix)] which is 
> divided in X large blocks (one block per partition), and a sparse matrix 
> RDD[((Int,Int),Array[Array[(Int,Float)]]] , divided in X * Y blocks. The 
> most efficient way to perform the operation is to collectAsMap() the 
> dense matrix and broadcast it, then perform the block-local 
> mutliplications, and combine the results by column.
> This is quite fine, unless the matrix is too big to fit in memory 
> (especially since the multiplication is performed several times 
> iteratively, and the broadcasts are not always cleaned from memory as I 
> would naively expect).
> When the dense matrix is too big, a second solution is to split the big 
> sparse matrix in several RDD, and do several broadcasts. Doing this 
> creates quite a big overhead, but it mostly works, even though I often 
> face some problems with unaccessible broadcast files, for instance.
> Then there is the terrible but apparently very effective good old join. 
> Since X blocks of the sparse matrix use the same block from the dense 
> matrix, I suspect that the dense matrix is somehow replicated X times 
> (either on disk or in the network), which is the reason why the join 
> takes so much time.
> After this bit of a context, here is my idea : would it be possible to 
> somehow "broadcast" (or maybe more accurately, share or serve) a 
> persisted RDD which is distributed on all workers, in a way that would, 
> a bit like the IndexedRDD, allow a task to access a partition or an 
> element of a partition in the closure, with a worker-local memory cache 
> . i.e. the information about where each block resides would be 
> distributed on the workers, to allow them to access parts of the RDD 
> directly. I think that's already a bit how RDD are shuffled ?
> The RDD could stay distributed (no need to collect then broadcast), and 
> only necessary transfers would be required.
> Is this a bad idea, is it already implemented somewhere (I would love it 
> !) ?or is it something that could add efficiency not only for my use 
> case, but maybe for others ? Could someone give me some hint about how I 
> could add this possibility to Spark ? I would probably try to extend a 
> RDD into a specific SharedIndexedRDD with a special lookup that would be 
> allowed from tasks as a special case, and that would try to contact the 
> blockManager and reach the corresponding data from the right worker.
> Thanks in advance for your advices
> Guillaume
> -- 
> eXenSa
>   
> *Guillaume PITEL, Président*
> +33(0)626 222 431
> eXenSa S.A.S. 
> 41, rue Périer - 92120 Montrouge - FRANCE
> Tel +33(0)184 163 677 / Fax +33(0)972 283 705

Re: hadoop input/output format advanced control

2015-03-24 Thread Nick Pentreath
Imran, on your point about reading multiple files together in one partition, is
it not simpler to copy the Hadoop conf and set per-RDD settings for min split
size to control the input size per partition, together with something like
CombineFileInputFormat?
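
Roughly something like the following sketch: clone the context-level Hadoop
conf and pass it only to the one input (sc is assumed to be in scope; the path
and the 512MB minimum split size are placeholders):

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

// copy the context-level Hadoop conf and override the split size for this input only
val inputConf = new Configuration(sc.hadoopConfiguration)
// key as used in this thread; newer Hadoop also accepts
// mapreduce.input.fileinputformat.split.minsize
inputConf.set("mapred.min.split.size", (512L * 1024 * 1024).toString)

val lines = sc.newAPIHadoopFile(
  "/some/path",
  classOf[TextInputFormat],
  classOf[LongWritable],
  classOf[Text],
  inputConf
).map(_._2.toString)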

On Tue, Mar 24, 2015 at 5:28 PM, Imran Rashid  wrote:

> I think this would be a great addition, I totally agree that you need to be
> able to set these at a finer context than just the SparkContext.
>
> Just to play devil's advocate, though -- the alternative is for you just
> subclass HadoopRDD yourself, or make a totally new RDD, and then you could
> expose whatever you need.  Why is this solution better?  IMO the criteria
> are:
> (a) common operations
> (b) error-prone / difficult to implement
> (c) non-obvious, but important for performance
>
> I think this case fits (a) & (c), so I think its still worthwhile.  But its
> also worth asking whether or not its too difficult for a user to extend
> HadoopRDD right now.  There have been several cases in the past week where
> we've suggested that a user should read from hdfs themselves (eg., to read
> multiple files together in one partition) -- with*out* reusing the code in
> HadoopRDD, though they would lose things like the metric tracking &
> preferred locations you get from HadoopRDD.  Does HadoopRDD need to some
> refactoring to make that easier to do?  Or do we just need a good example?
>
> Imran
>
> (sorry for hijacking your thread, Koert)
>
>
>
> On Mon, Mar 23, 2015 at 3:52 PM, Koert Kuipers  wrote:
>
> > see email below. reynold suggested i send it to dev instead of user
> >
> > -- Forwarded message --
> > From: Koert Kuipers 
> > Date: Mon, Mar 23, 2015 at 4:36 PM
> > Subject: hadoop input/output format advanced control
> > To: "u...@spark.apache.org" 
> >
> >
> > currently its pretty hard to control the Hadoop Input/Output formats used
> > in Spark. The conventions seems to be to add extra parameters to all
> > methods and then somewhere deep inside the code (for example in
> > PairRDDFunctions.saveAsHadoopFile) all these parameters get translated
> into
> > settings on the Hadoop Configuration object.
> >
> > for example for compression i see "codec: Option[Class[_ <:
> > CompressionCodec]] = None" added to a bunch of methods.
> >
> > how scalable is this solution really?
> >
> > for example i need to read from a hadoop dataset and i dont want the
> input
> > (part) files to get split up. the way to do this is to set
> > "mapred.min.split.size". now i dont want to set this at the level of the
> > SparkContext (which can be done), since i dont want it to apply to input
> > formats in general. i want it to apply to just this one specific input
> > dataset i need to read. which leaves me with no options currently. i
> could
> > go add yet another input parameter to all the methods
> > (SparkContext.textFile, SparkContext.hadoopFile, SparkContext.objectFile,
> > etc.). but that seems ineffective.
> >
> > why can we not expose a Map[String, String] or some other generic way to
> > manipulate settings for hadoop input/output formats? it would require
> > adding one more parameter to all methods to deal with hadoop input/output
> > formats, but after that its done. one parameter to rule them all
> >
> > then i could do:
> > val x = sc.textFile("/some/path", formatSettings =
> > Map("mapred.min.split.size" -> "12345"))
> >
> > or
> > rdd.saveAsTextFile("/some/path, formatSettings =
> > Map(mapred.output.compress" -> "true", "mapred.output.compression.codec"
> ->
> > "somecodec"))
> >
>


Re: [discuss] ending support for Java 6?

2015-04-30 Thread Nick Pentreath
+1 for this think it's high time.




We should of course do it with enough warning for users. 1.4 May be too early 
(not for me though!). Perhaps we specify that 1.5 will officially move to JDK7?









—
Sent from Mailbox

On Fri, May 1, 2015 at 12:16 AM, Ram Sriharsha
 wrote:

> +1 for end of support for Java 6 
>  On Thursday, April 30, 2015 3:08 PM, Vinod Kumar Vavilapalli 
>  wrote:
>
>  FYI, after enough consideration, we the Hadoop community dropped support for 
> JDK 6 starting release Apache Hadoop 2.7.x.
> Thanks
> +Vinod
> On Apr 30, 2015, at 12:02 PM, Reynold Xin  wrote:
>> This has been discussed a few times in the past, but now Oracle has ended
>> support for Java 6 for over a year, I wonder if we should just drop Java 6
>> support.
>> 
>> There is one outstanding issue Tom has brought to my attention: PySpark on
>> YARN doesn't work well with Java 7/8, but we have an outstanding pull
>> request to fix that.
>> 
>> https://issues.apache.org/jira/browse/SPARK-6869
>> https://issues.apache.org/jira/browse/SPARK-1920
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>

Re: [sample code] deeplearning4j for Spark ML (@DeveloperAPI)

2015-06-10 Thread Nick Pentreath
Looks very interesting, thanks for sharing this.

I haven't had much chance to do more than a quick glance over the code.
Quick question - are the Word2Vec and GLOVE implementations fully parallel
on Spark?

On Mon, Jun 8, 2015 at 6:20 PM, Eron Wright  wrote:

>
> The deeplearning4j framework provides a variety of distributed, neural
> network-based learning algorithms, including convolutional nets, deep
> auto-encoders, deep-belief nets, and recurrent nets.  We’re working on
> integration with the Spark ML pipeline, leveraging the developer API.
> This announcement is to share some code and get feedback from the Spark
> community.
>
> The integration code is located in the dl4j-spark-ml module
> 
>  in
> the deeplearning4j repository.
>
> Major aspects of the integration work:
>
>1. *ML algorithms.*  To bind the dl4j algorithms to the ML pipeline,
>we developed a new classifier
>
> 
>  and
>a new unsupervised learning estimator
>
> .
>
>2. *ML attributes.* We strove to interoperate well with other pipeline
>components.   ML Attributes are column-level metadata enabling information
>sharing between pipeline components.See here
>
> 
>  how
>the classifier reads label metadata from a column provided by the new
>StringIndexer
>
> 
>.
>3. *Large binary data.*  It is challenging to work with large binary
>data in Spark.   An effective approach is to leverage PrunedScan and to
>carefully control partition sizes.  Here
>
> 
>  we
>explored this with a custom data source based on the new relation API.
>4. *Column-based record readers.*  Here
>
> 
>  we
>explored how to construct rows from a Hadoop input split by composing a
>number of column-level readers, with pruning support.
>5. *UDTs*.   With Spark SQL it is possible to introduce new data
>types.   We prototyped an experimental Tensor type, here
>
> 
>.
>6. *Spark Package.*   We developed a spark package to make it easy to
>use the dl4j framework in spark-shell and with spark-submit.  See the
>deeplearning4j/dl4j-spark-ml
> repository for
>useful snippets involving the sbt-spark-package plugin.
>7. *Example code.*   Examples demonstrate how the standardized ML API
>simplifies interoperability, such as with label preprocessing and feature
>scaling.   See the deeplearning4j/dl4j-spark-ml-examples
> repository
>for an expanding set of example pipelines.
>
> Hope this proves useful to the community as we transition to exciting new
> concepts in Spark SQL and Spark ML.   Meanwhile, we have Spark working
> with multiple GPUs on AWS  and
> we're looking forward to optimizations that will speed neural net training
> even more.
>
> Eron Wright
> Contributor | deeplearning4j.org
>
>


Re: Approximate rank-based statistics (median, 95-th percentile, etc.) for Spark

2015-06-18 Thread Nick Pentreath
If it's going into the DataFrame API (which it probably should rather than
in RDD itself) - then it could become a UDT (similar to HyperLogLogUDT)
which would mean it doesn't have to implement Serializable, as it appears
that serialization is taken care of in the UDT def (e.g.
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregates.scala#L254
)

If I understand UDT SerDe correctly?
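
For what it's worth, the core of the proposal quoted below can be sketched in
a few lines over an RDD[Double]. This is illustrative only: it assumes the
digest class is Serializable (which is what the t-digest PR mentioned below
addresses), and approxQuantile is just a placeholder name:

import com.tdunning.math.stats.TDigest
import org.apache.spark.rdd.RDD

// build one t-digest per partition, merge them, then query the merged digest
def approxQuantile(data: RDD[Double], q: Double, compression: Double = 100.0): Double = {
  val digest = data.mapPartitions { iter =>
    val d = TDigest.createDigest(compression)
    iter.foreach(x => d.add(x))
    Iterator(d)
  }.reduce { (a, b) => a.add(b); a }
  digest.quantile(q)
}

A cdf(x) variant would follow the same pattern, calling digest.cdf(x) at the end.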

On Thu, Jun 11, 2015 at 2:47 AM, Ray Ortigas 
wrote:

> Hi Grega and Reynold,
>
> Grega, if you still want to use t-digest, I filed this PR because I
> thought your t-digest suggestion was a good idea.
>
> https://github.com/tdunning/t-digest/pull/56
>
> If it is helpful feel free to do whatever with it.
>
> Regards,
> Ray
>
>
> On Wed, Jun 10, 2015 at 2:54 PM, Reynold Xin  wrote:
>
>> This email is good. Just one note -- a lot of people are swamped right
>> before Spark Summit, so you might not get prompt responses this week.
>>
>>
>> On Wed, Jun 10, 2015 at 2:53 PM, Grega Kešpret  wrote:
>>
>>> I have some time to work on it now. What's a good way to continue the
>>> discussions before coding it?
>>>
>>> This e-mail list, JIRA or something else?
>>>
>>> On Mon, Apr 6, 2015 at 12:59 AM, Reynold Xin 
>>> wrote:
>>>
 I think those are great to have. I would put them in the DataFrame API
 though, since this is applying to structured data. Many of the advanced
 functions on the PairRDDFunctions should really go into the DataFrame API
 now we have it.

 One thing that would be great to understand is what state-of-the-art
 alternatives are out there. I did a quick google scholar search using the
 keyword "approximate quantile" and found some older papers. Just the
 first few I found:

 http://www.softnet.tuc.gr/~minos/Papers/sigmod05.pdf  by bell labs


 http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.6.6513&rep=rep1&type=pdf
  by Bruce Lindsay, IBM

 http://infolab.stanford.edu/~datar/courses/cs361a/papers/quantiles.pdf





 On Mon, Apr 6, 2015 at 12:50 AM, Grega Kešpret 
 wrote:

> Hi!
>
> I'd like to get community's opinion on implementing a generic quantile
> approximation algorithm for Spark that is O(n) and requires limited 
> memory.
> I would find it useful and I haven't found any existing implementation. 
> The
> plan was basically to wrap t-digest
> , implement the
> serialization/deserialization boilerplate and provide
>
> def cdf(x: Double): Double
> def quantile(q: Double): Double
>
>
> on RDD[Double] and RDD[(K, Double)].
>
> Let me know what you think. Any other ideas/suggestions also welcome!
>
> Best,
> Grega
> --
> [image: Inline image 1]*Grega Kešpret*
> Senior Software Engineer, Analytics
>
> Skype: gregakespret
> celtra.com  | @celtramobile
> 
>
>

>>>
>>
>


HyperLogLogUDT

2015-06-23 Thread Nick Pentreath
Hey Spark devs

I've been looking at DF UDFs and UDAFs. The approx distinct is using
hyperloglog,
but there is only an option to return the count as a Long.

It can be useful to be able to return and store the actual data structure
(ie serialized HLL). This effectively allows one to do aggregation /
rollups over columns while still preserving the ability to get distinct
counts.

For example, one can store daily aggregates of events, grouped by various
columns, while storing for each grouping the HLL of say unique users. So
you can get the uniques per day directly but could also very easily do
arbitrary aggregates (say monthly, annually) and still be able to get a
unique count for that period by merging the daily HLLS.

I did this a while back as a Hive UDAF (https://github.com/MLnick/hive-udf)
which returns a Struct field containing a "cardinality" field and a
"binary" field containing the serialized HLL.

I was wondering if there would be interest in something like this? I am not
so clear on how UDTs work with regards to SerDe - so could one adapt the
HyperLogLogUDT to be a Struct with the serialized HLL as a field as well as
count as a field? Then I assume this would automatically play nicely with
DataFrame I/O etc. The gotcha is one needs to then call
"approx_count_field.count" (or is there a concept of a "default field" for
a Struct?).

Also, being able to provide the bitsize parameter may be useful...

The same thinking would apply potentially to other approximate (and
mergeable) data structures like T-Digest and maybe CMS.

Nick


Re: HyperLogLogUDT

2015-07-01 Thread Nick Pentreath
Any thoughts?



—
Sent from Mailbox

On Tue, Jun 23, 2015 at 11:19 AM, Nick Pentreath 
wrote:

> Hey Spark devs
> I've been looking at DF UDFs and UDAFs. The approx distinct is using
> hyperloglog,
> but there is only an option to return the count as a Long.
> It can be useful to be able to return and store the actual data structure
> (ie serialized HLL). This effectively allows one to do aggregation /
> rollups over columns while still preserving the ability to get distinct
> counts.
> For example, one can store daily aggregates of events, grouped by various
> columns, while storing for each grouping the HLL of say unique users. So
> you can get the uniques per day directly but could also very easily do
> arbitrary aggregates (say monthly, annually) and still be able to get a
> unique count for that period by merging the daily HLLS.
> I did this a while back as a Hive UDAF (https://github.com/MLnick/hive-udf)
> which returns a Struct field containing a "cardinality" field and a
> "binary" field containing the serialized HLL.
> I was wondering if there would be interest in something like this? I am not
> so clear on how UDTs work with regards to SerDe - so could one adapt the
> HyperLogLogUDT to be a Struct with the serialized HLL as a field as well as
> count as a field? Then I assume this would automatically play nicely with
> DataFrame I/O etc. The gotcha is one needs to then call
> "approx_count_field.count" (or is there a concept of a "default field" for
> a Struct?).
> Also, being able to provide the bitsize parameter may be useful...
> The same thinking would apply potentially to other approximate (and
> mergeable) data structures like T-Digest and maybe CMS.
> Nick

Re: HyperLogLogUDT

2015-07-01 Thread Nick Pentreath
Sure I can copy the code but my aim was more to understand:




(a) if this is broadly interesting enough to folks to think about updating /
extending the existing UDAF within Spark

(b) how to register one's own custom UDAF - in which case it could be a Spark
package for example




All examples deal with registering a UDF but nothing about UDAFs



—
Sent from Mailbox

On Wed, Jul 1, 2015 at 6:32 PM, Daniel Darabos
 wrote:

> It's already possible to just copy the code from countApproxDistinct
> <https://github.com/apache/spark/blob/v1.4.0/core/src/main/scala/org/apache/spark/rdd/RDD.scala#L1153>
> and
> access the HLL directly, or do anything you like.
> On Wed, Jul 1, 2015 at 5:26 PM, Nick Pentreath 
> wrote:
>> Any thoughts?
>>
>> —
>> Sent from Mailbox <https://www.dropbox.com/mailbox>
>>
>>
>> On Tue, Jun 23, 2015 at 11:19 AM, Nick Pentreath > > wrote:
>>
>>> Hey Spark devs
>>>
>>> I've been looking at DF UDFs and UDAFs. The approx distinct is using
>>> hyperloglog,
>>> but there is only an option to return the count as a Long.
>>>
>>> It can be useful to be able to return and store the actual data structure
>>> (ie serialized HLL). This effectively allows one to do aggregation /
>>> rollups over columns while still preserving the ability to get distinct
>>> counts.
>>>
>>> For example, one can store daily aggregates of events, grouped by various
>>> columns, while storing for each grouping the HLL of say unique users. So
>>> you can get the uniques per day directly but could also very easily do
>>> arbitrary aggregates (say monthly, annually) and still be able to get a
>>> unique count for that period by merging the daily HLLS.
>>>
>>> I did this a while back as a Hive UDAF (
>>> https://github.com/MLnick/hive-udf) which returns a Struct field
>>> containing a "cardinality" field and a "binary" field containing the
>>> serialized HLL.
>>>
>>> I was wondering if there would be interest in something like this? I am
>>> not so clear on how UDTs work with regards to SerDe - so could one adapt
>>> the HyperLogLogUDT to be a Struct with the serialized HLL as a field as
>>> well as count as a field? Then I assume this would automatically play
>>> nicely with DataFrame I/O etc. The gotcha is one needs to then call
>>> "approx_count_field.count" (or is there a concept of a "default field" for
>>> a Struct?).
>>>
>>> Also, being able to provide the bitsize parameter may be useful...
>>>
>>> The same thinking would apply potentially to other approximate (and
>>> mergeable) data structures like T-Digest and maybe CMS.
>>>
>>> Nick
>>>
>>
>>

Re: HyperLogLogUDT

2015-09-12 Thread Nick Pentreath
Inspired by this post:
http://eugenezhulenev.com/blog/2015/07/15/interactive-audience-analytics-with-spark-and-hyperloglog/,
I've started putting together something based on the Spark 1.5 UDAF
interface: https://gist.github.com/MLnick/eca566604f2e4e3c6141

Some questions -

1. How do I get the UDAF to accept input arguments of different type? We
can hash anything basically for HLL - Int, Long, String, Object, raw bytes
etc. Right now it seems we'd need to build a new UDAF for each input type,
which seems strange - I should be able to use one UDAF that can handle raw
input of different types, as well as handle existing HLLs that can be
merged/aggregated (e.g. for grouped data)
2. @Reynold, how would I ensure this works for Tungsten (ie against raw
bytes in memory)? Or does the new Aggregate2 stuff automatically do that?
Where should I look for examples on how this works internally?
3. I've based this on the Sum and Avg examples for the new UDAF interface - if
there are any suggestions or issues, please advise. Is the intermediate buffer
efficient?
4. The current HyperLogLogUDT is private - so I've had to make my own one
which is a bit pointless as it's copy-pasted. Any thoughts on exposing that
type? Or I need to make the package spark.sql ...
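
For concreteness, a minimal sketch of the kind of UDAF in question, written
against the public 1.5 UserDefinedAggregateFunction interface with StreamLib's
HyperLogLogPlus and a BinaryType buffer (the class name and precision are
illustrative, and it differs from the gist above in detail). Note that it
round-trips the sketch through bytes on every update, which is exactly the
per-row serialization overhead discussed elsewhere in this thread:

import com.clearspring.analytics.stream.cardinality.HyperLogLogPlus
import org.apache.spark.sql.Row
import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
import org.apache.spark.sql.types._

// Aggregates a String column into a serialized HLL++ sketch (BinaryType), so
// partial sketches can be stored and merged again later. p is the HLL precision.
class HLLBinaryAgg(p: Int = 12) extends UserDefinedAggregateFunction {

  private def emptySketch: Array[Byte] = new HyperLogLogPlus(p).getBytes

  override def inputSchema: StructType = StructType(StructField("value", StringType) :: Nil)
  override def bufferSchema: StructType = StructType(StructField("hll", BinaryType) :: Nil)
  override def dataType: DataType = BinaryType
  override def deterministic: Boolean = true

  override def initialize(buffer: MutableAggregationBuffer): Unit = buffer(0) = emptySketch

  override def update(buffer: MutableAggregationBuffer, input: Row): Unit = {
    if (!input.isNullAt(0)) {
      // deserializes and reserializes the sketch for every row
      val hll = HyperLogLogPlus.Builder.build(buffer.getAs[Array[Byte]](0))
      hll.offer(input.getString(0))
      buffer(0) = hll.getBytes
    }
  }

  override def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit = {
    val left = HyperLogLogPlus.Builder.build(buffer1.getAs[Array[Byte]](0))
    val right = HyperLogLogPlus.Builder.build(buffer2.getAs[Array[Byte]](0))
    left.addAll(right)
    buffer1(0) = left.getBytes
  }

  // returns the serialized sketch; the approximate count is recovered downstream
  // with HyperLogLogPlus.Builder.build(bytes).cardinality()
  override def evaluate(buffer: Row): Any = buffer.getAs[Array[Byte]](0)
}

As far as I can tell it can then be registered with
sqlContext.udf.register("hll_binary", new HLLBinaryAgg(12)), and the resulting
binary column merged later by a second UDAF or deserialized wherever it ends up.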

Nick

On Thu, Jul 2, 2015 at 8:06 AM, Reynold Xin  wrote:

> Yes - it's very interesting. However, ideally we should have a version of
> hyperloglog that can work directly against some raw bytes in memory (rather
> than java objects), in order for this to fit the Tungsten execution model
> where everything is operating directly against some memory address.
>
> On Wed, Jul 1, 2015 at 11:00 PM, Nick Pentreath 
> wrote:
>
>> Sure I can copy the code but my aim was more to understand:
>>
>> (A) if this is broadly interesting enough to folks to think about
>> updating / extending the existing UDAF within Spark
>> (b) how to register ones own custom UDAF - in which case it could be a
>> Spark package for example
>>
>> All examples deal with registering a UDF but nothing about UDAFs
>>
>> —
>> Sent from Mailbox <https://www.dropbox.com/mailbox>
>>
>>
>> On Wed, Jul 1, 2015 at 6:32 PM, Daniel Darabos <
>> daniel.dara...@lynxanalytics.com> wrote:
>>
>>> It's already possible to just copy the code from countApproxDistinct
>>> <https://github.com/apache/spark/blob/v1.4.0/core/src/main/scala/org/apache/spark/rdd/RDD.scala#L1153>
>>>  and
>>> access the HLL directly, or do anything you like.
>>>
>>> On Wed, Jul 1, 2015 at 5:26 PM, Nick Pentreath >> > wrote:
>>>
>>>> Any thoughts?
>>>>
>>>> —
>>>> Sent from Mailbox <https://www.dropbox.com/mailbox>
>>>>
>>>>
>>>> On Tue, Jun 23, 2015 at 11:19 AM, Nick Pentreath <
>>>> nick.pentre...@gmail.com> wrote:
>>>>
>>>>> Hey Spark devs
>>>>>
>>>>> I've been looking at DF UDFs and UDAFs. The approx distinct is using
>>>>> hyperloglog,
>>>>> but there is only an option to return the count as a Long.
>>>>>
>>>>> It can be useful to be able to return and store the actual data
>>>>> structure (ie serialized HLL). This effectively allows one to do
>>>>> aggregation / rollups over columns while still preserving the ability to
>>>>> get distinct counts.
>>>>>
>>>>> For example, one can store daily aggregates of events, grouped by
>>>>> various columns, while storing for each grouping the HLL of say unique
>>>>> users. So you can get the uniques per day directly but could also very
>>>>> easily do arbitrary aggregates (say monthly, annually) and still be able 
>>>>> to
>>>>> get a unique count for that period by merging the daily HLLS.
>>>>>
>>>>> I did this a while back as a Hive UDAF (
>>>>> https://github.com/MLnick/hive-udf) which returns a Struct field
>>>>> containing a "cardinality" field and a "binary" field containing the
>>>>> serialized HLL.
>>>>>
>>>>> I was wondering if there would be interest in something like this? I
>>>>> am not so clear on how UDTs work with regards to SerDe - so could one 
>>>>> adapt
>>>>> the HyperLogLogUDT to be a Struct with the serialized HLL as a field as
>>>>> well as count as a field? Then I assume this would automatically play
>>>>> nicely with DataFrame I/O etc. The gotcha is one needs to then call
>>>>> "approx_count_field.count" (or is there a concept of a "default field" for
>>>>> a Struct?).
>>>>>
>>>>> Also, being able to provide the bitsize parameter may be useful...
>>>>>
>>>>> The same thinking would apply potentially to other approximate (and
>>>>> mergeable) data structures like T-Digest and maybe CMS.
>>>>>
>>>>> Nick
>>>>>
>>>>
>>>>
>>>
>>
>


Re: HyperLogLogUDT

2015-09-12 Thread Nick Pentreath
Can I ask why you've done this as a custom implementation rather than using
StreamLib, which is already implemented and widely used? It seems more
portable to me to use a library - for example, I'd like to export the
grouped data with raw HLLs to say Elasticsearch, and then do further
on-demand aggregation in ES and visualization in Kibana etc.

Others may want to do something similar into Hive, Cassandra, HBase or
whatever they are using. In this case they'd need to use this particular
implementation from Spark which may be tricky to include in a dependency
etc.

If there are enhancements, does it not make sense to do a PR to StreamLib?
Or does this interact in some better way with Tungsten?

I am unclear on how the interop with Tungsten raw memory works - some
pointers on that and where to look in the Spark code would be helpful.

On Sat, Sep 12, 2015 at 10:45 AM, Herman van Hövell tot Westerflier <
hvanhov...@questtec.nl> wrote:

> Hello Nick,
>
> I have been working on a (UDT-less) implementation of HLL++. You can find
> the PR here: https://github.com/apache/spark/pull/8362. This current
> implements the dense version of HLL++, which is a further development of
> HLL. It returns a Long, but it shouldn't be to hard to return a Row
> containing the cardinality and/or the HLL registers (the binary data).
>
> I am curious what the stance is on using UDTs in the new UDAF interface.
> Is this still viable? This wouldn't work with UnsafeRow for instance. The
> OpenHashSetUDT for instance would be a nice building block for CollectSet
> and all Distinct Aggregate operators. Are there any opinions on this?
>
> Kind regards,
>
> Herman van Hövell tot Westerflier
>
> QuestTec B.V.
> Torenwacht 98
> 2353 DC Leiderdorp
> hvanhov...@questtec.nl
> +599 9 521 4402
>
>
> 2015-09-12 10:07 GMT+02:00 Nick Pentreath :
>
>> Inspired by this post:
>> http://eugenezhulenev.com/blog/2015/07/15/interactive-audience-analytics-with-spark-and-hyperloglog/,
>> I've started putting together something based on the Spark 1.5 UDAF
>> interface: https://gist.github.com/MLnick/eca566604f2e4e3c6141
>>
>> Some questions -
>>
>> 1. How do I get the UDAF to accept input arguments of different type? We
>> can hash anything basically for HLL - Int, Long, String, Object, raw bytes
>> etc. Right now it seems we'd need to build a new UDAF for each input type,
>> which seems strange - I should be able to use one UDAF that can handle raw
>> input of different types, as well as handle existing HLLs that can be
>> merged/aggregated (e.g. for grouped data)
>> 2. @Reynold, how would I ensure this works for Tungsten (ie against raw
>> bytes in memory)? Or does the new Aggregate2 stuff automatically do that?
>> Where should I look for examples on how this works internally?
>> 3. I've based this on the Sum and Avg examples for the new UDAF interface
>> - any suggestions or issue please advise. Is the intermediate buffer
>> efficient?
>> 4. The current HyperLogLogUDT is private - so I've had to make my own one
>> which is a bit pointless as it's copy-pasted. Any thoughts on exposing that
>> type? Or I need to make the package spark.sql ...
>>
>> Nick
>>
>> On Thu, Jul 2, 2015 at 8:06 AM, Reynold Xin  wrote:
>>
>>> Yes - it's very interesting. However, ideally we should have a version
>>> of hyperloglog that can work directly against some raw bytes in memory
>>> (rather than java objects), in order for this to fit the Tungsten execution
>>> model where everything is operating directly against some memory address.
>>>
>>> On Wed, Jul 1, 2015 at 11:00 PM, Nick Pentreath <
>>> nick.pentre...@gmail.com> wrote:
>>>
>>>> Sure I can copy the code but my aim was more to understand:
>>>>
>>>> (A) if this is broadly interesting enough to folks to think about
>>>> updating / extending the existing UDAF within Spark
>>>> (b) how to register ones own custom UDAF - in which case it could be a
>>>> Spark package for example
>>>>
>>>> All examples deal with registering a UDF but nothing about UDAFs
>>>>
>>>> —
>>>> Sent from Mailbox <https://www.dropbox.com/mailbox>
>>>>
>>>>
>>>> On Wed, Jul 1, 2015 at 6:32 PM, Daniel Darabos <
>>>> daniel.dara...@lynxanalytics.com> wrote:
>>>>
>>>>> It's already possible to just copy the code from countApproxDistinct
>>>>> <https://github.com/apache/spark/blob/v1.4.0/core/src/main/scala/org/apache/spark/rdd/RDD.scala#L1153>

Re: HyperLogLogUDT

2015-09-12 Thread Nick Pentreath
I should add that surely the idea behind UDT is exactly that it can (a) fit
automatically into DFs and Tungsten and (b) that it can be used efficiently
in writing ones own UDTs and UDAFs?


On Sat, Sep 12, 2015 at 11:05 AM, Nick Pentreath 
wrote:

> Can I ask why you've done this as a custom implementation rather than
> using StreamLib, which is already implemented and widely used? It seems
> more portable to me to use a library - for example, I'd like to export the
> grouped data with raw HLLs to say Elasticsearch, and then do further
> on-demand aggregation in ES and visualization in Kibana etc.
>
> Others may want to do something similar into Hive, Cassandra, HBase or
> whatever they are using. In this case they'd need to use this particular
> implementation from Spark which may be tricky to include in a dependency
> etc.
>
> If there are enhancements, does it not make sense to do a PR to StreamLib?
> Or does this interact in some better way with Tungsten?
>
> I am unclear on how the interop with Tungsten raw memory works - some
> pointers on that and where to look in the Spark code would be helpful.
>
> On Sat, Sep 12, 2015 at 10:45 AM, Herman van Hövell tot Westerflier <
> hvanhov...@questtec.nl> wrote:
>
>> Hello Nick,
>>
>> I have been working on a (UDT-less) implementation of HLL++. You can find
>> the PR here: https://github.com/apache/spark/pull/8362. This current
>> implements the dense version of HLL++, which is a further development of
>> HLL. It returns a Long, but it shouldn't be to hard to return a Row
>> containing the cardinality and/or the HLL registers (the binary data).
>>
>> I am curious what the stance is on using UDTs in the new UDAF interface.
>> Is this still viable? This wouldn't work with UnsafeRow for instance. The
>> OpenHashSetUDT for instance would be a nice building block for CollectSet
>> and all Distinct Aggregate operators. Are there any opinions on this?
>>
>> Kind regards,
>>
>> Herman van Hövell tot Westerflier
>>
>> QuestTec B.V.
>> Torenwacht 98
>> 2353 DC Leiderdorp
>> hvanhov...@questtec.nl
>> +599 9 521 4402
>>
>>
>> 2015-09-12 10:07 GMT+02:00 Nick Pentreath :
>>
>>> Inspired by this post:
>>> http://eugenezhulenev.com/blog/2015/07/15/interactive-audience-analytics-with-spark-and-hyperloglog/,
>>> I've started putting together something based on the Spark 1.5 UDAF
>>> interface: https://gist.github.com/MLnick/eca566604f2e4e3c6141
>>>
>>> Some questions -
>>>
>>> 1. How do I get the UDAF to accept input arguments of different type? We
>>> can hash anything basically for HLL - Int, Long, String, Object, raw bytes
>>> etc. Right now it seems we'd need to build a new UDAF for each input type,
>>> which seems strange - I should be able to use one UDAF that can handle raw
>>> input of different types, as well as handle existing HLLs that can be
>>> merged/aggregated (e.g. for grouped data)
>>> 2. @Reynold, how would I ensure this works for Tungsten (ie against raw
>>> bytes in memory)? Or does the new Aggregate2 stuff automatically do that?
>>> Where should I look for examples on how this works internally?
>>> 3. I've based this on the Sum and Avg examples for the new UDAF
>>> interface - any suggestions or issue please advise. Is the intermediate
>>> buffer efficient?
>>> 4. The current HyperLogLogUDT is private - so I've had to make my own
>>> one which is a bit pointless as it's copy-pasted. Any thoughts on exposing
>>> that type? Or I need to make the package spark.sql ...
>>>
>>> Nick
>>>
>>> On Thu, Jul 2, 2015 at 8:06 AM, Reynold Xin  wrote:
>>>
>>>> Yes - it's very interesting. However, ideally we should have a version
>>>> of hyperloglog that can work directly against some raw bytes in memory
>>>> (rather than java objects), in order for this to fit the Tungsten execution
>>>> model where everything is operating directly against some memory address.
>>>>
>>>> On Wed, Jul 1, 2015 at 11:00 PM, Nick Pentreath <
>>>> nick.pentre...@gmail.com> wrote:
>>>>
>>>>> Sure I can copy the code but my aim was more to understand:
>>>>>
>>>>> (A) if this is broadly interesting enough to folks to think about
>>>>> updating / extending the existing UDAF within Spark
>>>>> (b) how to register ones own custom UDAF - in which case it could be a
>>>>> Spark package for ex

Re: HyperLogLogUDT

2015-09-12 Thread Nick Pentreath
Ok, that makes sense. So this is (a) more efficient, since as far as I can
see it is updating the HLL registers directly in the buffer for each value,
and (b) would be "Tungsten-compatible" as it can work against UnsafeRow? Is
it currently possible to specify an UnsafeRow as a buffer in a UDAF?

So is extending AggregateFunction2 the preferred approach over the
UserDefinedAggregationFunction interface? Or it is that internal only?

I see one of the main use cases for things like HLL / CMS and other
approximate data structure being the fact that you can store them as
columns representing distinct counts in an aggregation. And then do further
arbitrary aggregations on that data as required. e.g. store hourly
aggregate data, and compute daily or monthly aggregates from that, while
still keeping the ability to have distinct counts on certain fields.

So exposing the serialized HLL as Array[Byte] say, so that it can be
further aggregated in a later DF operation, or saved to an external data
source, would be super useful.



On Sat, Sep 12, 2015 at 6:06 PM, Herman van Hövell tot Westerflier <
hvanhov...@questtec.nl> wrote:

> I am typically all for code re-use. The reason for writing this is to
> prevent the indirection of a UDT and work directly against memory. A UDT
> will work fine at the moment because we still use
> GenericMutableRow/SpecificMutableRow as aggregation buffers. However if you
> would use an UnsafeRow as an AggregationBuffer (which is attractive when
> you have a lot of groups during aggregation) the use of an UDT is either
> impossible or it would become very slow because it would require us to
> deserialize/serialize a UDT on every update.
>
> As for compatibility, the implementation produces exactly the same results
> as the ClearSpring implementation. You could easily export the HLL++
> register values to the current ClearSpring implementation and export those.
>
> Met vriendelijke groet/Kind regards,
>
> Herman van Hövell tot Westerflier
>
> QuestTec B.V.
> Torenwacht 98
> 2353 DC Leiderdorp
> hvanhov...@questtec.nl
> +599 9 521 4402
>
>
> 2015-09-12 11:06 GMT+02:00 Nick Pentreath :
>
>> I should add that surely the idea behind UDT is exactly that it can (a)
>> fit automatically into DFs and Tungsten and (b) that it can be used
>> efficiently in writing ones own UDTs and UDAFs?
>>
>>
>> On Sat, Sep 12, 2015 at 11:05 AM, Nick Pentreath <
>> nick.pentre...@gmail.com> wrote:
>>
>>> Can I ask why you've done this as a custom implementation rather than
>>> using StreamLib, which is already implemented and widely used? It seems
>>> more portable to me to use a library - for example, I'd like to export the
>>> grouped data with raw HLLs to say Elasticsearch, and then do further
>>> on-demand aggregation in ES and visualization in Kibana etc.
>>>
>>> Others may want to do something similar into Hive, Cassandra, HBase or
>>> whatever they are using. In this case they'd need to use this particular
>>> implementation from Spark which may be tricky to include in a dependency
>>> etc.
>>>
>>> If there are enhancements, does it not make sense to do a PR to
>>> StreamLib? Or does this interact in some better way with Tungsten?
>>>
>>> I am unclear on how the interop with Tungsten raw memory works - some
>>> pointers on that and where to look in the Spark code would be helpful.
>>>
>>> On Sat, Sep 12, 2015 at 10:45 AM, Herman van Hövell tot Westerflier <
>>> hvanhov...@questtec.nl> wrote:
>>>
>>>> Hello Nick,
>>>>
>>>> I have been working on a (UDT-less) implementation of HLL++. You can
>>>> find the PR here: https://github.com/apache/spark/pull/8362. This
>>>> current implements the dense version of HLL++, which is a further
>>>> development of HLL. It returns a Long, but it shouldn't be to hard to
>>>> return a Row containing the cardinality and/or the HLL registers (the
>>>> binary data).
>>>>
>>>> I am curious what the stance is on using UDTs in the new UDAF
>>>> interface. Is this still viable? This wouldn't work with UnsafeRow for
>>>> instance. The OpenHashSetUDT for instance would be a nice building block
>>>> for CollectSet and all Distinct Aggregate operators. Are there any opinions
>>>> on this?
>>>>
>>>> Kind regards,
>>>>
>>>> Herman van Hövell tot Westerflier
>>>>
>>>> QuestTec B.V.
>>>> Torenwacht 98
>>>> 2353 DC Leiderdorp
>>>> hvanhov...@questtec.nl
>>&g

Re: HyperLogLogUDT

2015-09-12 Thread Nick Pentreath
Thanks Yin




So how does one ensure a UDAF works with Tungsten and UnsafeRow buffers? Or is 
this something that will be included in the UDAF interface in future? 




Is there a performance difference between Extending UDAF vs Aggregate2?




It's also not clear to me how to handle inputs of different types. What if my
UDAF can handle String and Long, for example? Do I need to specify AnyType, or
is there a way to specify multiple possible types for a single input column?




If there is no performance difference and a UDAF can work with Tungsten, then
Herman, does it perhaps make sense to use a UDAF (but without a UDT, as you've
done for performance)? It would then be easy to extend that UDAF and adjust the
output types as needed. It also provides a really nice example of how to use
the interface for something advanced and high performance.



—
Sent from Mailbox

On Sun, Sep 13, 2015 at 12:09 AM, Yin Huai  wrote:

> Hi Nick,
> The buffer exposed to UDAF interface is just a view of underlying buffer
> (this underlying buffer is shared by different aggregate functions and
> every function takes one or multiple slots). If you need a UDAF, extending
> UserDefinedAggregationFunction is the preferred
> approach. AggregateFunction2 is used for built-in aggregate function.
> Thanks,
> Yin
> On Sat, Sep 12, 2015 at 10:40 AM, Nick Pentreath 
> wrote:
>> Ok, that makes sense. So this is (a) more efficient, since as far as I can
>> see it is updating the HLL registers directly in the buffer for each value,
>> and (b) would be "Tungsten-compatible" as it can work against UnsafeRow? Is
>> it currently possible to specify an UnsafeRow as a buffer in a UDAF?
>>
>> So is extending AggregateFunction2 the preferred approach over the
>> UserDefinedAggregationFunction interface? Or it is that internal only?
>>
>> I see one of the main use cases for things like HLL / CMS and other
>> approximate data structure being the fact that you can store them as
>> columns representing distinct counts in an aggregation. And then do further
>> arbitrary aggregations on that data as required. e.g. store hourly
>> aggregate data, and compute daily or monthly aggregates from that, while
>> still keeping the ability to have distinct counts on certain fields.
>>
>> So exposing the serialized HLL as Array[Byte] say, so that it can be
>> further aggregated in a later DF operation, or saved to an external data
>> source, would be super useful.
>>
>>
>>
>> On Sat, Sep 12, 2015 at 6:06 PM, Herman van Hövell tot Westerflier <
>> hvanhov...@questtec.nl> wrote:
>>
>>> I am typically all for code re-use. The reason for writing this is to
>>> prevent the indirection of a UDT and work directly against memory. A UDT
>>> will work fine at the moment because we still use
>>> GenericMutableRow/SpecificMutableRow as aggregation buffers. However if you
>>> would use an UnsafeRow as an AggregationBuffer (which is attractive when
>>> you have a lot of groups during aggregation) the use of an UDT is either
>>> impossible or it would become very slow because it would require us to
>>> deserialize/serialize a UDT on every update.
>>>
>>> As for compatibility, the implementation produces exactly the same
>>> results as the ClearSpring implementation. You could easily export the
>>> HLL++ register values to the current ClearSpring implementation and export
>>> those.
>>>
>>> Met vriendelijke groet/Kind regards,
>>>
>>> Herman van Hövell tot Westerflier
>>>
>>> QuestTec B.V.
>>> Torenwacht 98
>>> 2353 DC Leiderdorp
>>> hvanhov...@questtec.nl
>>> +599 9 521 4402
>>>
>>>
>>> 2015-09-12 11:06 GMT+02:00 Nick Pentreath :
>>>
>>>> I should add that surely the idea behind UDT is exactly that it can (a)
>>>> fit automatically into DFs and Tungsten and (b) that it can be used
>>>> efficiently in writing ones own UDTs and UDAFs?
>>>>
>>>>
>>>> On Sat, Sep 12, 2015 at 11:05 AM, Nick Pentreath <
>>>> nick.pentre...@gmail.com> wrote:
>>>>
>>>>> Can I ask why you've done this as a custom implementation rather than
>>>>> using StreamLib, which is already implemented and widely used? It seems
>>>>> more portable to me to use a library - for example, I'd like to export the
>>>>> grouped data with raw HLLs to say Elasticsearch, and then do further
>>>>> on-demand aggregation in ES and visualization in Kibana etc.
>>>>>
>>>>> Other

Re: ml.feature.Word2Vec.transform() very slow issue

2015-11-09 Thread Nick Pentreath
Seems a straightforward change that purely enhances efficiency, so yes,
please submit a JIRA and PR for this.
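
(For reference, the general pattern at issue in the change below, sketched with
illustrative names: resolve the broadcast once per partition rather than once
per record.)

// wordVectorMap: Map[String, Array[Float]] built on the driver; sentences: RDD[Seq[String]]
val bcVectors = sc.broadcast(wordVectorMap)

val transformed = sentences.mapPartitions { iter =>
  val vectors = bcVectors.value            // dereferenced once per partition
  iter.map(sentence => sentence.flatMap(vectors.get))
}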

On Tue, Nov 10, 2015 at 8:56 AM, Sean Owen  wrote:

> Since it's a fairly expensive operation to build the Map, I tend to agree
> it should not happen in the loop.
>
> On Tue, Nov 10, 2015 at 5:08 AM, Yuming Wang  wrote:
>
>> Hi
>>
>>
>>
>> I found org.apache.spark.ml.feature.Word2Vec.transform() very slow.
>>
>> I think we should not read the broadcast for every sentence, so I fixed it on my fork.
>>
>>
>>
>> https://github.com/979969786/spark/commit/a9f894df3671bb8df2f342de1820dab3185598f3
>>
>>
>>
>> I have used 2 number rows to test it. The original version took *5 minutes*,
>>
>> and my version takes just *22 seconds* on the same data.
>>
>>
>>
>>
>> If I'm right, I will open a pull request.
>>
>>
>>
>> Thanks
>>
>>
>


Re: Support for local disk columnar storage for DataFrames

2015-11-16 Thread Nick Pentreath
Cloudera's Kudu also looks interesting here (getkudu.io) - Hadoop
input/output format support:
https://github.com/cloudera/kudu/blob/master/java/kudu-mapreduce/src/main/java/org/kududb/mapreduce/KuduTableInputFormat.java

On Mon, Nov 16, 2015 at 7:52 AM, Reynold Xin  wrote:

> This (updates) is something we are going to think about in the next
> release or two.
>
> On Thu, Nov 12, 2015 at 8:57 AM, Cristian O <
> cristian.b.op...@googlemail.com> wrote:
>
>> Sorry, apparently only replied to Reynold, meant to copy the list as
>> well, so I'm self replying and taking the opportunity to illustrate with an
>> example.
>>
>> Basically I want to conceptually do this:
>>
>> val bigDf = sqlContext.sparkContext.parallelize((1 to 100)).map(i => (i, 1)).toDF("k", "v")
>> val deltaDf = sqlContext.sparkContext.parallelize(Array(1, 5)).map(i => (i, 1)).toDF("k", "v")
>>
>> bigDf.cache()
>>
>> bigDf.registerTempTable("big")
>> deltaDf.registerTempTable("delta")
>>
>> val newBigDf = sqlContext.sql("SELECT big.k, big.v + IF(delta.v is null, 0, delta.v) FROM big LEFT JOIN delta on big.k = delta.k")
>>
>> newBigDf.cache()
>> bigDf.unpersist()
>>
>>
>> This is essentially an update of keys "1" and "5" only, in a dataset
>> of 1 million keys.
>>
>> This can be achieved efficiently if the join would preserve the cached
>> blocks that have been unaffected, and only copy and mutate the 2 affected
>> blocks corresponding to the matching join keys.
>>
>> Statistics can determine which blocks actually need mutating. Note also
>> that shuffling is not required assuming both dataframes are pre-partitioned
>> by the same key K.
>>
>> In SQL this could actually be expressed as an UPDATE statement or for a
>> more generalized use as a MERGE UPDATE:
>> https://technet.microsoft.com/en-us/library/bb522522(v=sql.105).aspx
>>
>> While this may seem like a very special case optimization, it would
>> effectively implement UPDATE support for cached DataFrames, for both
>> optimal and non-optimal usage.
>>
>> I appreciate there's quite a lot here, so thank you for taking the time
>> to consider it.
>>
>> Cristian
>>
>>
>>
>> On 12 November 2015 at 15:49, Cristian O > > wrote:
>>
>>> Hi Reynold,
>>>
>>> Thanks for your reply.
>>>
>>> Parquet may very well be used as the underlying implementation, but this
>>> is more than about a particular storage representation.
>>>
>>> There are a few things here that are inter-related and open different
>>> possibilities, so it's hard to structure, but I'll give it a try:
>>>
>>> 1. Checkpointing DataFrames - while a DF can be saved locally as
>>> parquet, just using that as a checkpoint would currently require explicitly
>>> reading it back. A proper checkpoint implementation would just save
>>> (perhaps asynchronously) and prune the logical plan while allowing to
>>> continue using the same DF, now backed by the checkpoint.
>>>
>>> It's important to prune the logical plan to avoid all kinds of issues
>>> that may arise from unbounded expansion with iterative use-cases, like this
>>> one I encountered recently:
>>> https://issues.apache.org/jira/browse/SPARK-11596
>>>
>>> But really what I'm after here is:
>>>
>>> 2. Efficient updating of cached DataFrames - The main use case here is
>>> keeping a relatively large dataset cached and updating it iteratively from
>>> streaming. For example one would like to perform ad-hoc queries on an
>>> incrementally updated, cached DataFrame. I expect this is already becoming
>>> an increasingly common use case. Note that the dataset may require merging
>>> (like adding) or overrriding values by key, so simply appending is not
>>> sufficient.
>>>
>>> This is very similar in concept with updateStateByKey for regular RDDs,
>>> i.e. an efficient copy-on-write mechanism, albeit perhaps at CachedBatch
>>> level  (the row blocks for the columnar representation).
>>>
>>> This can be currently simulated with UNION or (OUTER) JOINs however is
>>> very inefficient as it requires copying and recaching the entire dataset,
>>> and unpersisting the original one. There are also the aforementioned
>>> problems with unbounded logical plans (physical plans are fine)
>>>
>>> These two together, checkpointing and updating cached DataFrames, would
>>> give fault-tolerant efficient updating of DataFrames, meaning streaming
>>> apps can take advantage of the compact columnar representation and Tungsten
>>> optimisations.
>>>
>>> I'm not quite sure if something like this can be achieved by other means
>>> or has been investigated before, hence why I'm looking for feedback here.
>>>
>>> While one could use external data stores, they would have the added IO
>>> penalty, plus most of what's available at the moment is either HDFS
>>> (extremely inefficient for updates) or key-value stores that have 5-10x
>>> space overhead over columnar formats.
>>>
>>> Thanks,
>>> Cristian
>>>
>>>
>>>
>>>
>>>
>>>
>>> On 12 November 2015 at 03:31, Reynold Xin  wrote:
>>>
 Thanks for the ema

Streaming Receiverless Kafka API + Offset Management

2015-11-16 Thread Nick Evans
I really like the Streaming receiverless API for Kafka streaming jobs, but
I'm finding the manual offset management adds a fair bit of complexity. I'm
sure that others feel the same way, so I'm proposing that we add the
ability to have consumer offsets managed via an easy-to-use API. This would
be done similarly to how it is done in the receiver API.
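
To illustrate the boilerplate involved today, a rough sketch against the
current direct API (loadOffsetsForGroup / saveOffsetsForGroup are hypothetical
application helpers, and the broker address and group name are placeholders):

import kafka.common.TopicAndPartition
import kafka.message.MessageAndMetadata
import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.{HasOffsetRanges, KafkaUtils}

val ssc = new StreamingContext(new SparkConf().setAppName("direct-kafka"), Seconds(10))
val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")

// the application has to load and track offsets itself
val fromOffsets: Map[TopicAndPartition, Long] = loadOffsetsForGroup("my-group")

val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder, (String, String)](
  ssc, kafkaParams, fromOffsets,
  (mmd: MessageAndMetadata[String, String]) => (mmd.key, mmd.message))

stream.foreachRDD { rdd =>
  val ranges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  // ... process rdd ...
  saveOffsetsForGroup("my-group", ranges)   // persist so the next run can resume
}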

I haven't written any code yet, but I've looked at the current version of
the codebase and have an idea of how it could be done.

To keep the size of the pull requests small, I propose that the following
distinct features are added in order:

   1. If a group ID is set in the Kafka params, and also if fromOffsets is
   not passed in to createDirectStream, then attempt to resume from the
   remembered offsets for that group ID.
   2. Add a method on KafkaRDDs that commits the offsets for that KafkaRDD
   to Zookeeper.
   3. Update the Python API with any necessary changes.

My goal is to not break the existing API while adding the new functionality.

One point that I'm not sure of is regarding the first point. I'm not sure
whether it's a better idea to set the group ID as mentioned through Kafka
params, or to define a new overload of createDirectStream that expects the
group ID in place of the fromOffsets param. I think the latter is a cleaner
interface, but I'm not sure whether adding a new param is a good idea.

If anyone has any feedback on this general approach, I'd be very grateful.
I'm going to open a JIRA in the next couple days and begin working on the
first point, but I think comments from the community would be very helpful
on building a good API here.


Re: Streaming Receiverless Kafka API + Offset Management

2015-11-16 Thread Nick Evans
The only dependency on Zookeeper I see is here:
https://github.com/apache/spark/blob/1c5475f1401d2233f4c61f213d1e2c2ee9673067/external/kafka/src/main/scala/org/apache/spark/streaming/kafka/ReliableKafkaReceiver.scala#L244-L247

If that's the only line that depends on Zookeeper, we could probably try to
implement an abstract offset manager that could be switched out in favour
of the new offset management system, yes? I know kafka.consumer.Consumer
currently depends on Zookeeper, but I'm guessing this library will
eventually be updated to use the new method.

On Mon, Nov 16, 2015 at 5:28 PM, Cody Koeninger  wrote:

> There are already private methods in the code for interacting with Kafka's
> offset management api.
>
> There's a jira for making those methods public, but TD has been reluctant
> to merge it
>
> https://issues.apache.org/jira/browse/SPARK-10963
>
> I think adding any ZK specific behavior to spark is a bad idea, since ZK
> may no longer be the preferred storage location for Kafka offsets within
> the next year.
>
>
>
> On Mon, Nov 16, 2015 at 9:53 AM, Nick Evans  wrote:
>
>> I really like the Streaming receiverless API for Kafka streaming jobs,
>> but I'm finding the manual offset management adds a fair bit of complexity.
>> I'm sure that others feel the same way, so I'm proposing that we add the
>> ability to have consumer offsets managed via an easy-to-use API. This would
>> be done similarly to how it is done in the receiver API.
>>
>> I haven't written any code yet, but I've looked at the current version of
>> the codebase and have an idea of how it could be done.
>>
>> To keep the size of the pull requests small, I propose that the following
>> distinct features are added in order:
>>
>>1. If a group ID is set in the Kafka params, and also if fromOffsets
>>is not passed in to createDirectStream, then attempt to resume from the
>>remembered offsets for that group ID.
>>2. Add a method on KafkaRDDs that commits the offsets for that
>>KafkaRDD to Zookeeper.
>>3. Update the Python API with any necessary changes.
>>
>> My goal is to not break the existing API while adding the new
>> functionality.
>>
>> One point that I'm not sure of is regarding the first point. I'm not sure
>> whether it's a better idea to set the group ID as mentioned through Kafka
>> params, or to define a new overload of createDirectStream that expects the
>> group ID in place of the fromOffsets param. I think the latter is a cleaner
>> interface, but I'm not sure whether adding a new param is a good idea.
>>
>> If anyone has any feedback on this general approach, I'd be very
>> grateful. I'm going to open a JIRA in the next couple days and begin
>> working on the first point, but I think comments from the community would
>> be very helpful on building a good API here.
>>
>>
>


-- 
*Nick Evans* 
P. (613) 793-5565
LinkedIn <http://linkd.in/nZpN6w> | Website <http://bit.ly/14XTBtj>


Spark Streaming Kinesis - DynamoDB Streams compatability

2015-12-10 Thread Nick Pentreath
Hi Spark users & devs

I was just wondering if anyone out there has interest in DynamoDB Streams (
http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/Streams.html)
as an input source for Spark Streaming Kinesis?

Because DynamoDB Streams provides an adaptor client that works with the
KCL, making this work is fairly straightforward, but would require a little
bit of work to add it to Spark Streaming Kinesis as an option. It also
requires updating the AWS SDK version.

For those using AWS heavily, there are other ways of achieving the same
outcome indirectly, the easiest of which I've found so far is using AWS
Lambdas to read from the DynamoDB Stream, (optionally) transform the
events, and write to a Kinesis stream, allowing one to just use the
existing Spark integration. Still, I'd like to know if there is sufficient
interest or demand for this among the user base to work on a PR adding
DynamoDB Streams support to Spark.

(At the same time, the implementation details happen to provide an
opportunity to address https://issues.apache.org/jira/browse/SPARK-10969,
though not sure how much need there is for that either?)

N


Re: Spark streaming with Kinesis broken?

2015-12-11 Thread Nick Pentreath
cc'ing dev list

Ok, looks like when the KCL version was updated in
https://github.com/apache/spark/pull/8957, the AWS SDK version was not,
probably leading to a dependency conflict, though as Burak mentions it's hard
to debug as no exceptions seem to get thrown... I've tested 1.5.2 locally
and on my 1.5.2 EC2 cluster, and no data is received, and nothing shows up
in driver or worker logs, so any exception is getting swallowed somewhere.

Run starting. Expected test count is: 4
KinesisStreamSuite:
Using endpoint URL https://kinesis.eu-west-1.amazonaws.com for creating
Kinesis streams for tests.
- KinesisUtils API
- RDD generation
- basic operation *** FAILED ***
  The code passed to eventually never returned normally. Attempted 13 times
over 2.04 minutes. Last failure message: Set() did not equal Set(5, 10,
1, 6, 9, 2, 7, 3, 8, 4)
  Data received does not match data sent. (KinesisStreamSuite.scala:188)
- failure recovery *** FAILED ***
  The code passed to eventually never returned normally. Attempted 63 times
over 2.02863831 minutes. Last failure message: isCheckpointPresent
was true, but 0 was not greater than 10. (KinesisStreamSuite.scala:228)
Run completed in 5 minutes, 0 seconds.
Total number of tests run: 4
Suites: completed 1, aborted 0
Tests: succeeded 2, failed 2, canceled 0, ignored 0, pending 0
*** 2 TESTS FAILED ***
[INFO]

[INFO] BUILD FAILURE
[INFO]



KCL 1.3.0 depends on *1.9.37* SDK (
https://github.com/awslabs/amazon-kinesis-client/blob/1.3.0/pom.xml#L26)
while the Spark Kinesis dependency was kept at *1.9.16.*

I've run the integration tests on branch-1.5 (1.5.3-SNAPSHOT) with AWS SDK
1.9.37 and everything works.

Run starting. Expected test count is: 28
KinesisBackedBlockRDDSuite:
Using endpoint URL https://kinesis.eu-west-1.amazonaws.com for creating
Kinesis streams for tests.
- Basic reading from Kinesis
- Read data available in both block manager and Kinesis
- Read data available only in block manager, not in Kinesis
- Read data available only in Kinesis, not in block manager
- Read data available partially in block manager, rest in Kinesis
- Test isBlockValid skips block fetching from block manager
- Test whether RDD is valid after removing blocks from block manager
KinesisStreamSuite:
- KinesisUtils API
- RDD generation
- basic operation
- failure recovery
KinesisReceiverSuite:
- check serializability of SerializableAWSCredentials
- process records including store and checkpoint
- shouldn't store and checkpoint when receiver is stopped
- shouldn't checkpoint when exception occurs during store
- should set checkpoint time to currentTime + checkpoint interval upon
instantiation
- should checkpoint if we have exceeded the checkpoint interval
- shouldn't checkpoint if we have not exceeded the checkpoint interval
- should add to time when advancing checkpoint
- shutdown should checkpoint if the reason is TERMINATE
- shutdown should not checkpoint if the reason is something other than
TERMINATE
- retry success on first attempt
- retry success on second attempt after a Kinesis throttling exception
- retry success on second attempt after a Kinesis dependency exception
- retry failed after a shutdown exception
- retry failed after an invalid state exception
- retry failed after unexpected exception
- retry failed after exhausting all retries
Run completed in 3 minutes, 28 seconds.
Total number of tests run: 28
Suites: completed 4, aborted 0
Tests: succeeded 28, failed 0, canceled 0, ignored 0, pending 0
All tests passed.

So this is a regression in Spark Streaming Kinesis 1.5.2 - @Brian can you
file a JIRA for this?

@dev-list, since KCL brings in AWS SDK dependencies itself, is it necessary
to declare an explicit dependency on aws-java-sdk in the Kinesis POM? Also,
from KCL 1.5.0+, only the relevant components used from the AWS SDKs are
brought in, making things a bit leaner (this can be upgraded in Spark
1.7/2.0 perhaps). All local tests (and integration tests) pass after removing
the explicit dependency and depending only on the KCL. Is aws-java-sdk used
anywhere else? (AFAIK it is not, but if I missed something, let me know of any
good reason to keep the explicit dependency.)
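
A rough sbt-style sketch of what I mean (illustrative only - the actual change
would be in the Kinesis module's POM):

// depend only on the KCL and let it pull in the AWS SDK components it needs
libraryDependencies += "com.amazonaws" % "amazon-kinesis-client" % "1.3.0"
// instead of additionally pinning the full SDK:
// libraryDependencies += "com.amazonaws" % "aws-java-sdk" % "1.9.16"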

N



On Fri, Dec 11, 2015 at 6:55 AM, Nick Pentreath 
wrote:

> Yeah also the integration tests need to be specifically run - I would have
> thought the contributor would have run those tests and also tested the
> change themselves using live Kinesis :(
>
> —
> Sent from Mailbox <https://www.dropbox.com/mailbox>
>
>
> On Fri, Dec 11, 2015 at 6:18 AM, Burak Yavuz  wrote:
>
>> I don't think the Kinesis tests specifically ran when that was merged
>> into 1.5.2 :(
>> https://github.com/apache/spark/pull/8957
>>
>> https://github.com/apache/spark/commit/883bd8fccf83aae7a2a84

Re: Spark streaming with Kinesis broken?

2015-12-11 Thread Nick Pentreath
Is that PR against master branch?




S3 reads come from Hadoop / JetS3t afaik



—
Sent from Mailbox

On Fri, Dec 11, 2015 at 5:38 PM, Brian London 
wrote:

> That's good news  I've got a PR in to up the SDK version to 1.10.40 and the
> KCL to 1.6.1 which I'm running tests on locally now.
> Is the AWS SDK not used for reading/writing from S3 or do we get that for
> free from the Hadoop dependencies?
> On Fri, Dec 11, 2015 at 5:07 AM Nick Pentreath 
> wrote:
>> cc'ing dev list
>>
>> Ok, looks like when the KCL version was updated in
>> https://github.com/apache/spark/pull/8957, the AWS SDK version was not,
>> probably leading to dependency conflict, though as Burak mentions its hard
>> to debug as no exceptions seem to get thrown... I've tested 1.5.2 locally
>> and on my 1.5.2 EC2 cluster, and no data is received, and nothing shows up
>> in driver or worker logs, so any exception is getting swallowed somewhere.
>>
>> Run starting. Expected test count is: 4
>> KinesisStreamSuite:
>> Using endpoint URL https://kinesis.eu-west-1.amazonaws.com for creating
>> Kinesis streams for tests.
>> - KinesisUtils API
>> - RDD generation
>> - basic operation *** FAILED ***
>>   The code passed to eventually never returned normally. Attempted 13
>> times over 2.04 minutes. Last failure message: Set() did not equal
>> Set(5, 10, 1, 6, 9, 2, 7, 3, 8, 4)
>>   Data received does not match data sent. (KinesisStreamSuite.scala:188)
>> - failure recovery *** FAILED ***
>>   The code passed to eventually never returned normally. Attempted 63
>> times over 2.02863831 minutes. Last failure message:
>> isCheckpointPresent was true, but 0 was not greater than 10.
>> (KinesisStreamSuite.scala:228)
>> Run completed in 5 minutes, 0 seconds.
>> Total number of tests run: 4
>> Suites: completed 1, aborted 0
>> Tests: succeeded 2, failed 2, canceled 0, ignored 0, pending 0
>> *** 2 TESTS FAILED ***
>> [INFO]
>> 
>> [INFO] BUILD FAILURE
>> [INFO]
>> 
>>
>>
>> KCL 1.3.0 depends on *1.9.37* SDK (
>> https://github.com/awslabs/amazon-kinesis-client/blob/1.3.0/pom.xml#L26)
>> while the Spark Kinesis dependency was kept at *1.9.16.*
>>
>> I've run the integration tests on branch-1.5 (1.5.3-SNAPSHOT) with AWS SDK
>> 1.9.37 and everything works.
>>
>> Run starting. Expected test count is: 28
>> KinesisBackedBlockRDDSuite:
>> Using endpoint URL https://kinesis.eu-west-1.amazonaws.com for creating
>> Kinesis streams for tests.
>> - Basic reading from Kinesis
>> - Read data available in both block manager and Kinesis
>> - Read data available only in block manager, not in Kinesis
>> - Read data available only in Kinesis, not in block manager
>> - Read data available partially in block manager, rest in Kinesis
>> - Test isBlockValid skips block fetching from block manager
>> - Test whether RDD is valid after removing blocks from block anager
>> KinesisStreamSuite:
>> - KinesisUtils API
>> - RDD generation
>> - basic operation
>> - failure recovery
>> KinesisReceiverSuite:
>> - check serializability of SerializableAWSCredentials
>> - process records including store and checkpoint
>> - shouldn't store and checkpoint when receiver is stopped
>> - shouldn't checkpoint when exception occurs during store
>> - should set checkpoint time to currentTime + checkpoint interval upon
>> instantiation
>> - should checkpoint if we have exceeded the checkpoint interval
>> - shouldn't checkpoint if we have not exceeded the checkpoint interval
>> - should add to time when advancing checkpoint
>> - shutdown should checkpoint if the reason is TERMINATE
>> - shutdown should not checkpoint if the reason is something other than
>> TERMINATE
>> - retry success on first attempt
>> - retry success on second attempt after a Kinesis throttling exception
>> - retry success on second attempt after a Kinesis dependency exception
>> - retry failed after a shutdown exception
>> - retry failed after an invalid state exception
>> - retry failed after unexpected exception
>> - retry failed after exhausing all retries
>> Run completed in 3 minutes, 28 seconds.
>> Total number of tests run: 28
>> Suites: completed 4, aborted 0
>> Tests: succeeded 28, failed 0, canceled 0, ignored 0, pending 0
>> All tests passed.
>>
>> So this is a regression in Spark Streaming K

Re: Write access to wiki

2016-01-12 Thread Nick Pentreath
I'd also like to get Wiki write access - at the least it allows a few of us
to amend the "Powered By" and similar pages when those requests come
through (Sean has been doing a lot of that recently :)

On Mon, Jan 11, 2016 at 11:01 PM, Sean Owen  wrote:

> ... I forget who can give access -- is it INFRA at Apache or one of us?
> I can apply any edit you need in the meantime.
>
> Shane may be able to fill you in on how the Jenkins build is set up.
>
> On Mon, Jan 11, 2016 at 8:56 PM, Mark Grover  wrote:
> > Hi all,
> > May I please get write access to the useful tools wiki page?
> >
> > I did some investigation related to docker integration tests and want to
> > list out the pre-requisites required on the machine for those tests to
> pass,
> > on that page.
> >
> > On a related note, I was trying to search for any puppet recipes we
> maintain
> > for setting up build slaves. If our Jenkins infra were wiped out, how do
> we
> > rebuild the slave?
> >
> > Thanks in advance!
> >
> > Mark
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>
>


Re: Elasticsearch sink for metrics

2016-01-15 Thread Nick Pentreath
I haven't come across anything, but could you provide more detail on what
issues you're encountering?



On Fri, Jan 15, 2016 at 11:09 AM, Pete Robbins  wrote:

> Has anyone tried pushing Spark metrics into elasticsearch? We have other
> metrics, eg some runtime information, going into ES and would like to be
> able to combine this with the Spark metrics for visualization with Kibana.
>
> I experimented with a new sink using ES's ElasticsearchReporter for the
> Coda Hale metrics but have a few issues with default mappings.
>
> Has anyone already implemented this before I start to dig deeper?
>
> Cheers,
>
>
>


Re: Proposal

2016-01-30 Thread Nick Pentreath
Hi there

Sounds like a fun project :)

I'd recommend getting familiar with the existing k-means implementation, as well
as bisecting k-means in Spark, and then implementing yours based on that. You
should focus on using the new ML pipelines API, and release it as a package on
spark-packages.org.

If it gets a lot of use from there, it could be considered for inclusion in
ML core in the future.
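
For reference, a very rough skeleton of a clusterer on the 1.6 pipeline API
(names are hypothetical and all params/validation are omitted - just to show
the shape):

import org.apache.spark.ml.{Estimator, Model}
import org.apache.spark.ml.param.ParamMap
import org.apache.spark.ml.util.Identifiable
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.types.StructType

// model returned by the estimator; would hold the learned cluster modes
class KModesModel(override val uid: String) extends Model[KModesModel] {
  override def copy(extra: ParamMap): KModesModel = defaultCopy(extra)
  override def transformSchema(schema: StructType): StructType = schema
  // would append a prediction column with the assigned cluster for each row
  override def transform(dataset: DataFrame): DataFrame = dataset
}

class KModes(override val uid: String) extends Estimator[KModesModel] {
  def this() = this(Identifiable.randomUID("kmodes"))
  override def copy(extra: ParamMap): KModes = defaultCopy(extra)
  override def transformSchema(schema: StructType): StructType = schema
  override def fit(dataset: DataFrame): KModesModel = {
    // the actual k-modes iterations over the input column would go here
    copyValues(new KModesModel(uid).setParent(this))
  }
}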

Good luck!

Sent from my iPhone

> On 31 Jan 2016, at 00:23, Acelot  wrote:
> 
> Hi All,
> As part of my final project at university I am going to try to build an alternative 
> version of the k-means algorithm, called k-modes, introduced in the paper "Improving 
> the Accuracy and Efficiency of the k-means Clustering Algorithm" (Link: 
> http://www.iaeng.org/publication/WCE2009/WCE2009_pp308-312.pdf). I would like 
> to know of any related work. If anyone is interested in working on this project, 
> please contact me.
> Kind regards,
> 
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
> 

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: ML ALS API

2016-03-08 Thread Nick Pentreath
Hi Maciej

Yes, that *train* method is intended to be public, but it is marked as
*DeveloperApi*, which means that backward compatibility is not necessarily
guaranteed, and that method may change. Having said that, even APIs marked
as DeveloperApi do tend to be relatively stable.

As the comment mentions:

 * :: DeveloperApi ::
 * An implementation of ALS that supports *generic ID types*, specialized
for Int and Long. This is
 * exposed as a developer API for users who do need other ID types. But it
is not recommended
 * because it increases the shuffle size and memory requirement during
training.

This *train* method is intended for the use case where user and item ids
are not the default Int (e.g. String). As you can see it returns the factor
RDDs directly, as opposed to an ALSModel instance, so overall it is a
little less user-friendly.
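
For example, a rough sketch with String ids (written from memory, so treat the
parameter names and values as approximate):

import org.apache.spark.ml.recommendation.ALS
import org.apache.spark.ml.recommendation.ALS.Rating

// ratings keyed by arbitrary (here String) ids; note the Float rating value
val ratings = sc.parallelize(Seq(
  Rating("alice", "item-1", 4.0f),
  Rating("bob", "item-2", 2.0f)))

// returns the (userFactors, itemFactors) RDDs directly rather than an ALSModel
val (userFactors, itemFactors) =
  ALS.train(ratings, rank = 10, maxIter = 5, regParam = 0.1)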

The *Float* ratings are to save space and make ALS more efficient overall.
That will not change in 2.0+ (especially since the precision of ratings is
not very important).

Hope that helps.

On Tue, 8 Mar 2016 at 08:20 Maciej Szymkiewicz 
wrote:

> Can I ask for a clarifications regarding ml.recommendation.ALS:
>
> - is train method
> (
> https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala#L598
> )
> intended to be be public?
> - Rating class
> (
> https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala#L436)is
> using float instead of double like its MLLib counterpart. Is it going to
> be a default encoding in 2.0+?
>
> --
> Best,
> Maciej Szymkiewicz
>
>
>


Re: Running ALS on comparitively large RDD

2016-03-10 Thread Nick Pentreath
Could you provide more details about:
1. Data set size (# ratings, # users and # products)
2. Spark cluster set up and version

Thanks

On Fri, 11 Mar 2016 at 05:53 Deepak Gopalakrishnan  wrote:

> Hello All,
>
> I've been running Spark's ALS on a dataset of users and rated items. I
> first encode my users to integers by using an auto increment function (
> just like zipWithIndex), I do the same for my items. I then create an RDD
> of the ratings and feed it to ALS.
>
> My issue is that the ALS algorithm never completes. Attached is a
> screenshot of the stages window.
>
> Any help will be greatly appreciated
>
> --
> Regards,
> *Deepak Gopalakrishnan*
> *Mobile*:+918891509774
> *Skype* : deepakgk87
> http://myexps.blogspot.com
>
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org


Re: Running ALS on comparitively large RDD

2016-03-11 Thread Nick Pentreath
Hmmm, something else is going on there. What data source are you reading
from? How much driver and executor memory have you provided to Spark?



On Fri, 11 Mar 2016 at 09:21 Deepak Gopalakrishnan  wrote:

> 1. I'm using about 1 million users against few thousand products. I
> basically have around a million ratings
> 2. Spark 1.6 on Amazon EMR
>
> On Fri, Mar 11, 2016 at 12:46 PM, Nick Pentreath  > wrote:
>
>> Could you provide more details about:
>> 1. Data set size (# ratings, # users and # products)
>> 2. Spark cluster set up and version
>>
>> Thanks
>>
>> On Fri, 11 Mar 2016 at 05:53 Deepak Gopalakrishnan 
>> wrote:
>>
>>> Hello All,
>>>
>>> I've been running Spark's ALS on a dataset of users and rated items. I
>>> first encode my users to integers by using an auto increment function (
>>> just like zipWithIndex), I do the same for my items. I then create an RDD
>>> of the ratings and feed it to ALS.
>>>
>>> My issue is that the ALS algorithm never completes. Attached is a
>>> screenshot of the stages window.
>>>
>>> Any help will be greatly appreciated
>>>
>>> --
>>> Regards,
>>> *Deepak Gopalakrishnan*
>>> *Mobile*:+918891509774
>>> *Skype* : deepakgk87
>>> http://myexps.blogspot.com
>>>
>>>
>>> -
>>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>>
>
>
> --
> Regards,
> *Deepak Gopalakrishnan*
> *Mobile*:+918891509774
> *Skype* : deepakgk87
> http://myexps.blogspot.com
>
>


Re: Spark ML - Scaling logistic regression for many features

2016-03-12 Thread Nick Pentreath
Also adding dev list in case anyone else has ideas / views.

On Sat, 12 Mar 2016 at 12:52, Nick Pentreath 
wrote:

> Thanks for the feedback.
>
> I think Spark can certainly meet your use case when your data size scales
> up, as the actual model dimension is very small - you will need to use
> those indexers or some other mapping mechanism.
>
> There is ongoing work for Spark 2.0 to make it easier to use models
> outside of Spark - also see PMML export (I think mllib logistic regression
> is supported but I have to check that). That will help use spark models in
> serving environments.
>
> Finally, I will add a JIRA to investigate sparse models for LR - maybe
> also a ticket for multivariate summariser (though I don't think in practice
> there will be much to gain).
>
>
> On Fri, 11 Mar 2016 at 21:35, Daniel Siegmann 
> wrote:
>
>> Thanks for the pointer to those indexers, those are some good examples. A
>> good way to go for the trainer and any scoring done in Spark. I will
>> definitely have to deal with scoring in non-Spark systems though.
>>
>> I think I will need to scale up beyond what single-node liblinear can
>> practically provide. The system will need to handle much larger sub-samples
>> of this data (and other projects might be larger still). Additionally, the
>> system needs to train many models in parallel (hyper-parameter optimization
>> with n-fold cross-validation, multiple algorithms, different sets of
>> features).
>>
>> Still, I suppose we'll have to consider whether Spark is the best system
>> for this. For now though, my job is to see what can be achieved with Spark.
>>
>>
>>
>> On Fri, Mar 11, 2016 at 12:45 PM, Nick Pentreath <
>> nick.pentre...@gmail.com> wrote:
>>
>>> Ok, I think I understand things better now.
>>>
>>> For Spark's current implementation, you would need to map those features
>>> as you mention. You could also use say StringIndexer -> OneHotEncoder or
>>> VectorIndexer. You could create a Pipeline to deal with the mapping and
>>> training (e.g.
>>> http://spark.apache.org/docs/latest/ml-guide.html#example-pipeline).
>>> Pipeline supports persistence.
>>>
>>> But it depends on your scoring use case too - a Spark pipeline can be
>>> saved and then reloaded, but you need all of Spark dependencies in your
>>> serving app which is often not ideal. If you're doing bulk scoring offline,
>>> then it may suit.
>>>
>>> Honestly though, for that data size I'd certainly go with something like
>>> Liblinear :) Spark will ultimately scale better with # training examples
>>> for very large scale problems. However there are definitely limitations on
>>> model dimension and sparse weight vectors currently. There are potential
>>> solutions to these but they haven't been implemented as yet.
>>>
>>> On Fri, 11 Mar 2016 at 18:35 Daniel Siegmann <
>>> daniel.siegm...@teamaol.com> wrote:
>>>
>>>> On Fri, Mar 11, 2016 at 5:29 AM, Nick Pentreath <
>>>> nick.pentre...@gmail.com> wrote:
>>>>
>>>>> Would you mind letting us know the # training examples in the
>>>>> datasets? Also, what do your features look like? Are they text, 
>>>>> categorical
>>>>> etc? You mention that most rows only have a few features, and all rows
>>>>> together have a few 10,000s features, yet your max feature value is 20
>>>>> million. How are your constructing your feature vectors to get a 20 
>>>>> million
>>>>> size? The only realistic way I can see this situation occurring in 
>>>>> practice
>>>>> is with feature hashing (HashingTF).
>>>>>
>>>>
>>>> The sub-sample I'm currently training on is about 50K rows, so ...
>>>> small.
>>>>
>>>> The features causing this issue are numeric (int) IDs for ... lets
>>>> call it "Thing". For each Thing in the record, we set the feature
>>>> Thing.id to a value of 1.0 in our vector (which is of course a
>>>> SparseVector). I'm not sure how IDs are generated for Things, but they
>>>> can be large numbers.
>>>>
>>>> The largest Thing ID is around 20 million, so that ends up being the
>>>> size of the vector. But in fact there are fewer than 10,000 unique Thing
>>>> IDs in this data. The mean number of features per record in what I'm
>>>> currently training against is

Re: Spark ML - Scaling logistic regression for many features

2016-03-19 Thread Nick Pentreath
No, I didn't yet - feel free to create a JIRA.



On Thu, 17 Mar 2016 at 22:55 Daniel Siegmann 
wrote:

> Hi Nick,
>
> Thanks again for your help with this. Did you create a ticket in JIRA for
> investigating sparse models in LR and / or multivariate summariser? If so,
> can you give me the issue key(s)? If not, would you like me to create these
> tickets?
>
> I'm going to look into this some more and see if I can figure out how to
> implement these fixes.
>
> ~Daniel Siegmann
>
> On Sat, Mar 12, 2016 at 5:53 AM, Nick Pentreath 
> wrote:
>
>> Also adding dev list in case anyone else has ideas / views.
>>
>> On Sat, 12 Mar 2016 at 12:52, Nick Pentreath 
>> wrote:
>>
>>> Thanks for the feedback.
>>>
>>> I think Spark can certainly meet your use case when your data size
>>> scales up, as the actual model dimension is very small - you will need to
>>> use those indexers or some other mapping mechanism.
>>>
>>> There is ongoing work for Spark 2.0 to make it easier to use models
>>> outside of Spark - also see PMML export (I think mllib logistic regression
>>> is supported but I have to check that). That will help use spark models in
>>> serving environments.
>>>
>>> Finally, I will add a JIRA to investigate sparse models for LR - maybe
>>> also a ticket for multivariate summariser (though I don't think in practice
>>> there will be much to gain).
>>>
>>>
>>> On Fri, 11 Mar 2016 at 21:35, Daniel Siegmann <
>>> daniel.siegm...@teamaol.com> wrote:
>>>
>>>> Thanks for the pointer to those indexers, those are some good examples.
>>>> A good way to go for the trainer and any scoring done in Spark. I will
>>>> definitely have to deal with scoring in non-Spark systems though.
>>>>
>>>> I think I will need to scale up beyond what single-node liblinear can
>>>> practically provide. The system will need to handle much larger sub-samples
>>>> of this data (and other projects might be larger still). Additionally, the
>>>> system needs to train many models in parallel (hyper-parameter optimization
>>>> with n-fold cross-validation, multiple algorithms, different sets of
>>>> features).
>>>>
>>>> Still, I suppose we'll have to consider whether Spark is the best
>>>> system for this. For now though, my job is to see what can be achieved with
>>>> Spark.
>>>>
>>>>
>>>>
>>>> On Fri, Mar 11, 2016 at 12:45 PM, Nick Pentreath <
>>>> nick.pentre...@gmail.com> wrote:
>>>>
>>>>> Ok, I think I understand things better now.
>>>>>
>>>>> For Spark's current implementation, you would need to map those
>>>>> features as you mention. You could also use say StringIndexer ->
>>>>> OneHotEncoder or VectorIndexer. You could create a Pipeline to deal with
>>>>> the mapping and training (e.g.
>>>>> http://spark.apache.org/docs/latest/ml-guide.html#example-pipeline).
>>>>> Pipeline supports persistence.
>>>>>
>>>>> But it depends on your scoring use case too - a Spark pipeline can be
>>>>> saved and then reloaded, but you need all of Spark dependencies in your
>>>>> serving app which is often not ideal. If you're doing bulk scoring 
>>>>> offline,
>>>>> then it may suit.
>>>>>
>>>>> Honestly though, for that data size I'd certainly go with something
>>>>> like Liblinear :) Spark will ultimately scale better with # training
>>>>> examples for very large scale problems. However there are definitely
>>>>> limitations on model dimension and sparse weight vectors currently. There
>>>>> are potential solutions to these but they haven't been implemented as yet.
>>>>>
>>>>> On Fri, 11 Mar 2016 at 18:35 Daniel Siegmann <
>>>>> daniel.siegm...@teamaol.com> wrote:
>>>>>
>>>>>> On Fri, Mar 11, 2016 at 5:29 AM, Nick Pentreath <
>>>>>> nick.pentre...@gmail.com> wrote:
>>>>>>
>>>>>>> Would you mind letting us know the # training examples in the
>>>>>>> datasets? Also, what do your features look like? Are they text, 
>>>>>>> categorical
>>>>>>> etc? You mention that most rows only have a few features, and all rows
>>>

Re: Switch RDD-based MLlib APIs to maintenance mode in Spark 2.0

2016-04-05 Thread Nick Pentreath
+1 for this proposal - as you mention I think it's the de facto current
situation anyway.

Note that from a developer's view it's just the user-facing API that will be
only "ml" - the majority of the actual algorithms still operate on RDDs
under the hood currently.
On Wed, 6 Apr 2016 at 05:03, Chris Fregly  wrote:

> perhaps renaming to Spark ML would actually clear up code and
> documentation confusion?
>
> +1 for rename
>
> On Apr 5, 2016, at 7:00 PM, Reynold Xin  wrote:
>
> +1
>
> This is a no brainer IMO.
>
>
> On Tue, Apr 5, 2016 at 7:32 PM, Joseph Bradley 
> wrote:
>
>> +1  By the way, the JIRA for tracking (Scala) API parity is:
>> https://issues.apache.org/jira/browse/SPARK-4591
>>
>> On Tue, Apr 5, 2016 at 4:58 PM, Matei Zaharia 
>> wrote:
>>
>>> This sounds good to me as well. The one thing we should pay attention to
>>> is how we update the docs so that people know to start with the spark.ml
>>> classes. Right now the docs list spark.mllib first and also seem more
>>> comprehensive in that area than in spark.ml, so maybe people naturally
>>> move towards that.
>>>
>>> Matei
>>>
>>> On Apr 5, 2016, at 4:44 PM, Xiangrui Meng  wrote:
>>>
>>> Yes, DB (cc'ed) is working on porting the local linear algebra library
>>> over (SPARK-13944). There are also frequent pattern mining algorithms we
>>> need to port over in order to reach feature parity. -Xiangrui
>>>
>>> On Tue, Apr 5, 2016 at 12:08 PM Shivaram Venkataraman <
>>> shiva...@eecs.berkeley.edu> wrote:
>>>
 Overall this sounds good to me. One question I have is that in
 addition to the ML algorithms we have a number of linear algebra
 (various distributed matrices) and statistical methods in the
 spark.mllib package. Is the plan to port or move these to the spark.ml
 namespace in the 2.x series ?

 Thanks
 Shivaram

 On Tue, Apr 5, 2016 at 11:48 AM, Sean Owen  wrote:
 > FWIW, all of that sounds like a good plan to me. Developing one API is
 > certainly better than two.
 >
 > On Tue, Apr 5, 2016 at 7:01 PM, Xiangrui Meng 
 wrote:
 >> Hi all,
 >>
 >> More than a year ago, in Spark 1.2 we introduced the ML pipeline API
 built
 >> on top of Spark SQL’s DataFrames. Since then the new DataFrame-based
 API has
 >> been developed under the spark.ml package, while the old RDD-based
 API has
 >> been developed in parallel under the spark.mllib package. While it
 was
 >> easier to implement and experiment with new APIs under a new
 package, it
 >> became harder and harder to maintain as both packages grew bigger and
 >> bigger. And new users are often confused by having two sets of APIs
 with
 >> overlapped functions.
 >>
 >> We started to recommend the DataFrame-based API over the RDD-based
 API in
 >> Spark 1.5 for its versatility and flexibility, and we saw the
 development
 >> and the usage gradually shifting to the DataFrame-based API. Just
 counting
 >> the lines of Scala code, from 1.5 to the current master we added
 ~1
 >> lines to the DataFrame-based API while ~700 to the RDD-based API.
 So, to
 >> gather more resources on the development of the DataFrame-based API
 and to
 >> help users migrate over sooner, I want to propose switching
 RDD-based MLlib
 >> APIs to maintenance mode in Spark 2.0. What does it mean exactly?
 >>
 >> * We do not accept new features in the RDD-based spark.mllib
 package, unless
 >> they block implementing new features in the DataFrame-based spark.ml
 >> package.
 >> * We still accept bug fixes in the RDD-based API.
 >> * We will add more features to the DataFrame-based API in the 2.x
 series to
 >> reach feature parity with the RDD-based API.
 >> * Once we reach feature parity (possibly in Spark 2.2), we will
 deprecate
 >> the RDD-based API.
 >> * We will remove the RDD-based API from the main Spark repo in Spark
 3.0.
 >>
 >> Though the RDD-based API is already in de facto maintenance mode,
 this
 >> announcement will make it clear and hence important to both MLlib
 developers
 >> and users. So we’d greatly appreciate your feedback!
 >>
 >> (As a side note, people sometimes use “Spark ML” to refer to the
 >> DataFrame-based API or even the entire MLlib component. This also
 causes
 >> confusion. To be clear, “Spark ML” is not an official name and there
 are no
 >> plans to rename MLlib to “Spark ML” at this time.)
 >>
 >> Best,
 >> Xiangrui
 >
 > -
 > To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 > For additional commands, e-mail: user-h...@spark.apache.org
 >

>>>
>>>
>>
>


ClassCastException when extracting and collecting DF array column type

2016-04-06 Thread Nick Pentreath
Hi there,

In writing some tests for a PR I'm working on, with a more complex array
type in a DF, I ran into this issue (running off latest master).

Any thoughts?

*// create DF with a column of Array[(Int, Double)]*
val df = sc.parallelize(Seq(
(0, Array((1, 6.0), (1, 4.0))),
(1, Array((1, 3.0), (2, 1.0))),
(2, Array((3, 3.0), (4, 6.0)))
)).toDF("id", "predictions")

*// extract the field from the Row, and use map to extract first element of
tuple*
*// the type of RDD appears correct*
scala> df.rdd.map { row => row.getSeq[(Int, Double)](1).map(_._1) }
res14: org.apache.spark.rdd.RDD[Seq[Int]] = MapPartitionsRDD[32] at map at
:27

*// however, calling collect on the same expression throws
ClassCastException*
scala> df.rdd.map { row => row.getSeq[(Int, Double)](1).map(_._1) }.collect
16/04/06 13:02:49 ERROR Executor: Exception in task 5.0 in stage 10.0 (TID
74)
java.lang.ClassCastException:
org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema cannot be
cast to scala.Tuple2
at
$line54.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$anonfun$1$$anonfun$apply$1.apply(:27)
at
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245)
at
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245)
at
scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:35)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:245)
at scala.collection.AbstractTraversable.map(Traversable.scala:104)
at
$line54.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$anonfun$1.apply(:27)
at
$line54.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$anonfun$1.apply(:27)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:370)
at scala.collection.Iterator$class.foreach(Iterator.scala:742)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1194)
at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48)
at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:308)
at scala.collection.AbstractIterator.to(Iterator.scala:1194)
at
scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:300)
at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1194)
at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:287)
at scala.collection.AbstractIterator.toArray(Iterator.scala:1194)
at
org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:880)
at
org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:880)

*// can collect the extracted field*
*// again, return type appears correct*
scala> df.rdd.map { row => row.getSeq[(Int, Double)](1) }.collect
res23: Array[Seq[(Int, Double)]] = Array(WrappedArray([1,6.0], [1,4.0]),
WrappedArray([1,3.0], [2,1.0]), WrappedArray([3,3.0], [4,6.0]))

*// trying to apply map to extract first element of tuple fails*
scala> df.rdd.map { row => row.getSeq[(Int, Double)](1)
}.collect.map(_.map(_._1))
java.lang.ClassCastException:
org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema cannot be
cast to scala.Tuple2
  at $anonfun$2$$anonfun$apply$1.apply(:27)
  at
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245)
  at
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245)
  at
scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
  at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:35)
  at scala.collection.TraversableLike$class.map(TraversableLike.scala:245)
  at scala.collection.AbstractTraversable.map(Traversable.scala:104)
  at $anonfun$2.apply(:27)
  at $anonfun$2.apply(:27)
  at
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245)
  at
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245)
  at
scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
  at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
  at scala.collection.TraversableLike$class.map(TraversableLike.scala:245)
  at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186)


Re: ClassCastException when extracting and collecting DF array column type

2016-04-06 Thread Nick Pentreath
Ah I got it - Seq[(Int, Double)] is actually represented as Seq[Row] (seq of
struct type) internally.

So a further extraction is required, e.g. row => row.getSeq[Row](1).map { r
=> r.getInt(0) }
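
With the example DF from the earlier message, a quick sketch of the full
workaround:

import org.apache.spark.sql.Row

df.rdd.map { row => row.getSeq[Row](1).map(_.getInt(0)) }.collect
// Array(Seq(1, 1), Seq(1, 2), Seq(3, 4))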

On Wed, 6 Apr 2016 at 13:35 Nick Pentreath  wrote:

> Hi there,
>
> In writing some tests for a PR I'm working on, with a more complex array
> type in a DF, I ran into this issue (running off latest master).
>
> Any thoughts?
>
> *// create DF with a column of Array[(Int, Double)]*
> val df = sc.parallelize(Seq(
> (0, Array((1, 6.0), (1, 4.0))),
> (1, Array((1, 3.0), (2, 1.0))),
> (2, Array((3, 3.0), (4, 6.0)))
> )).toDF("id", "predictions")
>
> *// extract the field from the Row, and use map to extract first element
> of tuple*
> *// the type of RDD appears correct*
> scala> df.rdd.map { row => row.getSeq[(Int, Double)](1).map(_._1) }
> res14: org.apache.spark.rdd.RDD[Seq[Int]] = MapPartitionsRDD[32] at map at
> :27
>
> *// however, calling collect on the same expression throws
> ClassCastException*
> scala> df.rdd.map { row => row.getSeq[(Int, Double)](1).map(_._1) }.collect
> 16/04/06 13:02:49 ERROR Executor: Exception in task 5.0 in stage 10.0 (TID
> 74)
> java.lang.ClassCastException:
> org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema cannot be
> cast to scala.Tuple2
> at
> $line54.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$anonfun$1$$anonfun$apply$1.apply(:27)
> at
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245)
> at
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245)
> at
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
> at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:35)
> at scala.collection.TraversableLike$class.map(TraversableLike.scala:245)
> at scala.collection.AbstractTraversable.map(Traversable.scala:104)
> at
> $line54.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$anonfun$1.apply(:27)
> at
> $line54.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$anonfun$1.apply(:27)
> at scala.collection.Iterator$$anon$11.next(Iterator.scala:370)
> at scala.collection.Iterator$class.foreach(Iterator.scala:742)
> at scala.collection.AbstractIterator.foreach(Iterator.scala:1194)
> at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
> at
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104)
> at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48)
> at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:308)
> at scala.collection.AbstractIterator.to(Iterator.scala:1194)
> at
> scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:300)
> at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1194)
> at
> scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:287)
> at scala.collection.AbstractIterator.toArray(Iterator.scala:1194)
> at
> org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:880)
> at
> org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:880)
>
> *// can collect the extracted field*
> *// again, return type appears correct*
> scala> df.rdd.map { row => row.getSeq[(Int, Double)](1) }.collect
> res23: Array[Seq[(Int, Double)]] = Array(WrappedArray([1,6.0], [1,4.0]),
> WrappedArray([1,3.0], [2,1.0]), WrappedArray([3,3.0], [4,6.0]))
>
> *// trying to apply map to extract first element of tuple fails*
> scala> df.rdd.map { row => row.getSeq[(Int, Double)](1)
> }.collect.map(_.map(_._1))
> java.lang.ClassCastException:
> org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema cannot be
> cast to scala.Tuple2
>   at $anonfun$2$$anonfun$apply$1.apply(:27)
>   at
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245)
>   at
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245)
>   at
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
>   at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:35)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:245)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:104)
>   at $anonfun$2.apply(:27)
>   at $anonfun$2.apply(:27)
>   at
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245)
>   at
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245)
>   at
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
>   at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:245)
>   at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186)
>


Organizing Spark ML example packages

2016-04-14 Thread Nick Pentreath
Hey Spark devs

I noticed that we now have a large number of examples for ML & MLlib in the
examples project - 57 for ML and 67 for MLlib, to be precise. This is bound
to get larger as we add features (though I know there are some PRs to clean
up duplicated examples).

What do you think about organizing them into packages to match the use case
and the structure of the code base? e.g.

org.apache.spark.examples.ml.recommendation

org.apache.spark.examples.ml.feature

and so on...

Is it worth doing? The doc pages with include_example would need updating,
and the run_example script input would just need to change the package
slightly. Did I miss any potential issue?

N


Re: Duplicated fit into TrainValidationSplit

2016-04-27 Thread Nick Pentreath
You should find that the first set of fits is run on the training set, and the
resulting models are evaluated on the validation set. The final best model is
then retrained on the entire dataset. This is standard practice - usually the
dataset passed to the train validation split is itself further split into a
training and test set, where the final best model is evaluated against the test
set.
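
A rough sketch of that pattern (the estimator, column names and `data` are
hypothetical - any DataFrame with label/features columns would do):

import org.apache.spark.ml.evaluation.RegressionEvaluator
import org.apache.spark.ml.regression.LinearRegression
import org.apache.spark.ml.tuning.{ParamGridBuilder, TrainValidationSplit}

// hold out a final test set first; TrainValidationSplit then splits `training`
// internally into train/validation portions according to trainRatio
val Array(training, test) = data.randomSplit(Array(0.8, 0.2), seed = 42)

val lr = new LinearRegression()
val paramGrid = new ParamGridBuilder()
  .addGrid(lr.regParam, Array(0.01, 0.1))
  .build()

val tvs = new TrainValidationSplit()
  .setEstimator(lr)
  .setEvaluator(new RegressionEvaluator())
  .setEstimatorParamMaps(paramGrid)
  .setTrainRatio(0.75)

// fit picks the best params on the validation portion, then refits on all of `training`
val model = tvs.fit(training)
// final evaluation happens on the held-out test set
val testMetric = new RegressionEvaluator().evaluate(model.transform(test))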
On Wed, 27 Apr 2016 at 14:30, Dirceu Semighini Filho <
dirceu.semigh...@gmail.com> wrote:

> Hi guys, I was testing a pipeline here and found a possibly duplicated
> call to the fit method in the
> org.apache.spark.ml.tuning.TrainValidationSplit
> 
> class.
> At line 110 there is a call to est.fit, which calls fit for all of the
> parameter combinations we have set up.
> Down at line 128, after discovering which is the best model, we call
> fit again using the bestIndex. Wouldn't it be better to just access the
> result of the already-called fit method stored in the models val?
>
> Kind regards,
> Dirceu
>


Re: Cross Validator to work with K-Fold value of 1?

2016-05-02 Thread Nick Pentreath
There is a JIRA and PR around for supporting polynomial expansion with
degree 1. Offhand I can't recall if it's been merged
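
For context, the kind of grid described below would look like this - it
currently fails because the degree param is validated as >= 2:

import org.apache.spark.ml.feature.PolynomialExpansion
import org.apache.spark.ml.tuning.ParamGridBuilder

val px = new PolynomialExpansion()
  .setInputCol("features")
  .setOutputCol("expanded")

// degree = 1 would mean "no expansion at all", which the validator rejects today
val grid = new ParamGridBuilder()
  .addGrid(px.degree, Array(1, 2, 3))
  .build()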
On Mon, 2 May 2016 at 17:45, Julio Antonio Soto de Vicente 
wrote:

> Hi,
>
> Same goes for the PolynomialExpansion in org.apache.spark.ml.feature. It
> would be nice to cross-validate with degree 1 polynomial expansion (that
> is, with no expansion at all) vs other polynomial expansion degrees.
> Unfortunately, degree is forced to be >= 2.
>
> --
> Julio
>
> > El 2 may 2016, a las 9:05, Rahul Tanwani 
> escribió:
> >
> > Hi,
> >
> > In certain cases (mostly due to time constraints), we need some model to
> > run without cross validation. In such a case, since the k-fold value for the
> > cross validator cannot be one, we have to maintain two different code paths
> > to support both scenarios (with and without cross validation).
> >
> > Would it be an okay idea to generalize the cross validator so it can work
> > with a k-fold value of 1? The only purpose of this is to avoid maintaining
> > two different code paths; functionally it should behave as if cross
> > validation were not present.
> >
> >
> >
> >
> >
> > --
> > View this message in context:
> http://apache-spark-developers-list.1001551.n3.nabble.com/Cross-Validator-to-work-with-K-Fold-value-of-1-tp17404.html
> > Sent from the Apache Spark Developers List mailing list archive at
> Nabble.com.
> >
> > -
> > To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> > For additional commands, e-mail: dev-h...@spark.apache.org
> >
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>
>


PR for In-App Scheduling

2016-05-18 Thread Nick White
Hi - I raised a PR here: https://github.com/apache/spark/pull/12951 which adds a
mechanism that prevents starvation when scheduling work within a single
application. Could a committer take a look? Thanks -

Nick






Re: [VOTE] Removing module maintainer process

2016-05-22 Thread Nick Pentreath
+1 (binding)
On Mon, 23 May 2016 at 04:19, Matei Zaharia  wrote:

> Correction, let's run this for 72 hours, so until 9 PM EST May 25th.
>
> > On May 22, 2016, at 8:34 PM, Matei Zaharia 
> wrote:
> >
> > It looks like the discussion thread on this has only had positive
> replies, so I'm going to call a VOTE. The proposal is to remove the
> maintainer process in
> https://cwiki.apache.org/confluence/display/SPARK/Committers#Committers-ReviewProcessandMaintainers
> <
> https://cwiki.apache.org/confluence/display/SPARK/Committers#Committers-ReviewProcessandMaintainers>
> given that it doesn't seem to have had a huge impact on the project, and it
> can unnecessarily create friction in contributing. We already have +1s from
> Mridul, Tom, Andrew Or and Imran on that thread.
> >
> > I'll leave the VOTE open for 48 hours, until 9 PM EST on May 24, 2016.
> >
> > Matei
>
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>
>

