Re: Thoughts on dataframe cogroup?

2019-04-08 Thread chris
Hi,

Just to say, I really do think this is useful and am currently working on a 
SPIP to formally propose this. One concern I do have, however, is that the 
current Arrow serialization code is tied to passing through a single dataframe 
as the UDF parameter, and so any modification to allow multiple dataframes may 
not be straightforward. If anyone has any ideas as to how this might be 
achieved in an elegant manner, I’d be happy to hear them!
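
For illustration only, here is a very rough sketch (in Python, using pyarrow) of
the kind of framing that could carry several dataframes: a count followed by each
dataframe as an Arrow IPC stream. This is just a sketch of the shape of the idea;
the per-frame length prefix is purely an assumption of this sketch, and the real
change would presumably live in Spark's Arrow serialization path and the Python
worker rather than in standalone helpers like these.

```
import struct

import pyarrow as pa


def serialize_frames(pdfs):
    """Frame N pandas DataFrames as [count][len][ArrowStream]...[len][ArrowStream]."""
    out = bytearray(struct.pack("!i", len(pdfs)))
    for pdf in pdfs:
        sink = pa.BufferOutputStream()
        table = pa.Table.from_pandas(pdf)
        # Write one self-contained Arrow IPC stream per dataframe.
        with pa.RecordBatchStreamWriter(sink, table.schema) as writer:
            writer.write_table(table)
        buf = sink.getvalue()
        out += struct.pack("!i", buf.size) + buf.to_pybytes()
    return bytes(out)


def deserialize_frames(data):
    """Inverse of serialize_frames: recover the list of pandas DataFrames."""
    count = struct.unpack_from("!i", data, 0)[0]
    offset, pdfs = 4, []
    for _ in range(count):
        length = struct.unpack_from("!i", data, offset)[0]
        offset += 4
        reader = pa.ipc.open_stream(data[offset:offset + length])
        pdfs.append(reader.read_all().to_pandas())
        offset += length
    return pdfs
```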

Thanks,

Chris 

> On 26 Feb 2019, at 14:55, Li Jin  wrote:
> 
> Thank you both for the reply. Chris and I have very similar use cases for 
> cogroup. 
> 
> One of the goals for groupby apply + pandas UDF was to avoid things like 
> collect list and reshaping data between Spark and Pandas. Cogroup feels very 
> similar and can be an extension to the groupby apply + pandas UDF 
> functionality.
> 
> I wonder if any PMC/committers have any thoughts/opinions on this?
> 
>> On Tue, Feb 26, 2019 at 2:17 AM  wrote:
>> Just to add to this I’ve also implemented my own cogroup previously and 
>> would welcome a cogroup for dataframes.
>> 
>> My specific use case was that I had a large amount of time series data. 
>> Spark has very limited support for time series (specifically as-of joins), 
>> but pandas has good support.
>> 
>> My solution was to take my two dataframes and perform a group by and collect 
>> list on each. The resulting arrays could be passed into a udf where they 
>> could be marshaled into a couple of pandas dataframes and processed using 
>> pandas excellent time series functionality.
>> 
>> If cogroup was available natively on dataframes this would have been a bit 
>> nicer. The ideal would have been some pandas udf version of cogroup that 
>> gave me a pandas dataframe for each spark dataframe in the cogroup!
>> 
>> Chris 
>> 
>>> On 26 Feb 2019, at 00:38, Jonathan Winandy  
>>> wrote:
>>> 
>>> For info, in our team we have defined our own cogroup on dataframes in the past 
>>> on different projects using different methods (rdd[row] based or union all 
>>> collect list based). 
>>> 
>>> I might be biased, but I find the approach very useful in projects to simplify 
>>> and speed up transformations, and remove a lot of intermediate stages 
>>> (distinct + join => just cogroup). 
>>> 
>>> Plus Spark 2.4 introduced a lot of new operators for nested data. That's a 
>>> win! 
>>> 
>>> 
>>>> On Thu, 21 Feb 2019, 17:38 Li Jin,  wrote:
>>>> I am wondering do other people have opinion/use case on cogroup?
>>>> 
>>>>> On Wed, Feb 20, 2019 at 5:03 PM Li Jin  wrote:
>>>>> Alessandro,
>>>>> 
>>>>> Thanks for the reply. I assume by "equi-join", you mean "equality full 
>>>>> outer join".
>>>>> 
>>>>> Two issues I see with equality outer join are:
>>>>> (1) equality outer join will give n * m rows for each key (n and m being 
>>>>> the corresponding number of rows in df1 and df2 for each key)
>>>>> (2) the user needs to do some extra processing to transform n * m back to 
>>>>> the desired shape (two sub-dataframes with n and m rows) 
>>>>> 
>>>>> I think full outer join is an inefficient way to implement cogroup. If 
>>>>> the end goal is to have two separate dataframes for each key, why join 
>>>>> them first and then un-join them?
>>>>> 
>>>>> 
>>>>> 
>>>>>> On Wed, Feb 20, 2019 at 5:52 AM Alessandro Solimando 
>>>>>>  wrote:
>>>>>> Hello,
>>>>>> I fail to see how an equi-join on the key columns is different than the 
>>>>>> cogroup you propose.
>>>>>> 
>>>>>> I think the accepted answer can shed some light:
>>>>>> https://stackoverflow.com/questions/43960583/whats-the-difference-between-join-and-cogroup-in-apache-spark
>>>>>> 
>>>>>> Now you apply a udf on each iterable, one per key value (obtained with 
>>>>>> cogroup).
>>>>>> 
>>>>>> You can achieve the same by: 
>>>>>> 1) join df1 and df2 on the key you want, 
>>>>>> 2) apply "groupby" on such key
>>>>>> 3) finally apply a udaf (you can have a look here if you are not 
>>>>>> familiar with them 
>>>>>> https://docs.databricks.com/spark/latest/spark-sql/udaf-scala.html), 
>>>>

Re: Thoughts on dataframe cogroup?

2019-04-15 Thread chris
Hi,

 As promised I’ve raised SPARK-27463 for this.

All feedback welcome!

Chris 

> On 9 Apr 2019, at 13:22, Chris Martin  wrote:
> 
> Thanks Bryan and Li, that is much appreciated.  Hopefully should have the 
> SPIP ready in the next couple of days.
> 
> thanks,
> 
> Chris
> 
> 
> 
> 
>> On Mon, Apr 8, 2019 at 7:18 PM Bryan Cutler  wrote:
>> Chris, an SPIP sounds good to me. I agree with Li that it wouldn't be too 
>> difficult to extend the current functionality to transfer multiple 
>> DataFrames. For the SPIP, I would keep it more high-level and I don't think 
>> it's necessary to include details of the Python worker, we can hash that out 
>> after the SPIP is approved.
>> 
>> Bryan
>> 
>>> On Mon, Apr 8, 2019 at 10:43 AM Li Jin  wrote:
>>> Thanks Chris, look forward to it.
>>> 
>>> I think sending multiple dataframes to the python worker requires some 
>>> changes but shouldn't be too difficult. We can probably do something like:
>>> 
>>> [numberOfDataFrames][FirstDataFrameInArrowFormat][SecondDataFrameInArrowFormat]
>>> 
>>> In: 
>>> https://github.com/apache/spark/blob/86d469aeaa492c0642db09b27bb0879ead5d7166/sql/core/src/main/scala/org/apache/spark/sql/execution/python/ArrowPythonRunner.scala#L70
>>> 
>>> And have ArrowPythonRunner take multiple input iterator/schema.
>>> 
>>> Li
>>> 
>>> 
>>>> On Mon, Apr 8, 2019 at 5:55 AM  wrote:
>>>> Hi,
>>>> 
>>>> Just to say, I really do think this is useful and am currently working on 
>>>> a SPIP to formally propose this. One concern I do have, however, is that 
>>>> the current arrow serialization code is tied to passing through a single 
>>>> dataframe as the udf parameter and so any modification to allow multiple 
>>>> dataframes may not be straightforward.  If anyone has any ideas as to how 
>>>> this might be achieved in an elegant manner I’d be happy to hear them!
>>>> 
>>>> Thanks,
>>>> 
>>>> Chris 
>>>> 
>>>>> On 26 Feb 2019, at 14:55, Li Jin  wrote:
>>>>> 
>>>>> Thank you both for the reply. Chris and I have very similar use cases for 
>>>>> cogroup. 
>>>>> 
>>>>> One of the goals for groupby apply + pandas UDF was to avoid things like 
>>>>> collect list and reshaping data between Spark and Pandas. Cogroup feels 
>>>>> very similar and can be an extension to the groupby apply + pandas UDF 
>>>>> functionality.
>>>>> 
>>>>> I wonder if any PMC/committers have any thoughts/opinions on this?
>>>>> 
>>>>>> On Tue, Feb 26, 2019 at 2:17 AM  wrote:
>>>>>> Just to add to this I’ve also implemented my own cogroup previously and 
>>>>>> would welcome a cogroup for dataframes.
>>>>>> 
>>>>>> My specific use case was that I had a large amount of time series data. 
>>>>>> Spark has very limited support for time series (specifically as-of 
>>>>>> joins), but pandas has good support.
>>>>>> 
>>>>>> My solution was to take my two dataframes and perform a group by and 
>>>>>> collect list on each. The resulting arrays could be passed into a udf 
>>>>>> where they could be marshaled into a couple of pandas dataframes and 
>>>>>> processed using pandas excellent time series functionality.
>>>>>> 
>>>>>> If cogroup was available natively on dataframes this would have been a 
>>>>>> bit nicer. The ideal would have been some pandas udf version of cogroup 
>>>>>> that gave me a pandas dataframe for each spark dataframe in the cogroup!
>>>>>> 
>>>>>> Chris 
>>>>>> 
>>>>>>> On 26 Feb 2019, at 00:38, Jonathan Winandy  
>>>>>>> wrote:
>>>>>>> 
>>>>>>> For info, in our team have defined our own cogroup on dataframe in the 
>>>>>>> past on different projects using different methods (rdd[row] based or 
>>>>>>> union all collect list based). 
>>>>>>> 
>>>>>>> I might be biased, but find the approach very useful in project to 
>>>>>>> simplify and speed up transformations, and remove a lot of intermediate 
>>>>>>> stages (distinct + join => j

Re: PyCharm 2020 :: pyspark installation issue

2020-05-16 Thread chris
Hi,

Try installing pypandoc first (“pip install pypandoc” or “pip3 install 
pypandoc”, depending on how your Python is set up). The traceback below also 
shows that pypandoc needs wheel, so “pip install wheel” may be required as 
well. Then install pyspark.

Chris 

> On 16 May 2020, at 07:25, kanchan pradhan  wrote:
> 
> 
> Hi, 
> 
> Please help me to resolve the below issue, which comes up while installing 
> pyspark in PyCharm 2020.
> 
> Collecting pyspark
>   Using cached 
> https://files.pythonhosted.org/packages/9a/5a/271c416c1c2185b6cb0151b29a91fff6fcaed80173c8584ff6d20e46b465/pyspark-2.4.5.tar.gz
> Complete output from command python setup.py egg_info:
> Could not import pypandoc - required to package PySpark
> 
> C:\Users\Home\AppData\Local\Programs\Python\Python36-32\lib\distutils\dist.py:261:
>  UserWarning: Unknown distribution option: 'long_description_content_type'
>   warnings.warn(msg)
> warning: build_py: byte-compiling is disabled, skipping.
> 
> warning: install_lib: byte-compiling is disabled, skipping.
> 
> zip_safe flag not set; analyzing archive contents...
> 
> Installed 
> c:\users\home\appdata\local\temp\pycharm-packaging\pyspark\.eggs\pypandoc-1.5-py3.6.egg
> Searching for wheel>=0.25.0
> Reading https://pypi.python.org/simple/wheel/
> Downloading 
> https://files.pythonhosted.org/packages/75/28/521c6dc7fef23a68368efefdcd682f5b3d1d58c2b90b06dc1d0b805b51ae/wheel-0.34.2.tar.gz#sha256=8788e9155fe14f54164c1b9eb0a319d98ef02c160725587ad60f14ddc57b6f96
> Best match: wheel 0.34.2
> Processing wheel-0.34.2.tar.gz
> Writing 
> C:\Users\Home\AppData\Local\Temp\easy_install-ic8mrpy4\wheel-0.34.2\setup.cfg
> Running wheel-0.34.2\setup.py -q bdist_egg --dist-dir 
> C:\Users\Home\AppData\Local\Temp\easy_install-ic8mrpy4\wheel-0.34.2\egg-dist-tmp-h_qsuflf
> warning: no files found matching '*.dynlib' under directory 'tests'
> no previously-included directories found matching 'tests\testdata\*\build'
> no previously-included directories found matching 'tests\testdata\*\dist'
> no previously-included directories found matching 
> 'tests\testdata\*\*.egg-info'
> warning: install_lib: 'build\lib' does not exist -- no Python modules to 
> install
> 
> zip_safe flag not set; analyzing archive contents...
> Copying unknown-0.0.0-py3.6.egg to 
> c:\users\home\appdata\local\temp\pycharm-packaging\pyspark\.eggs
> 
> Installed 
> c:\users\home\appdata\local\temp\pycharm-packaging\pyspark\.eggs\unknown-0.0.0-py3.6.egg
> Traceback (most recent call last):
>   File "", line 1, in 
>   File 
> "C:\Users\Home\AppData\Local\Temp\pycharm-packaging\pyspark\setup.py", line 
> 224, in 
> 'Programming Language :: Python :: Implementation :: PyPy']
>   File 
> "C:\Users\Home\AppData\Local\Programs\Python\Python36-32\lib\distutils\core.py",
>  line 108, in setup
> _setup_distribution = dist = klass(attrs)
>   File 
> "C:\Users\Home\AppData\Local\Programs\Python\Python36-32\lib\site-packages\setuptools\dist.py",
>  line 315, in __init__
> self.fetch_build_eggs(attrs['setup_requires'])
>   File 
> "C:\Users\Home\AppData\Local\Programs\Python\Python36-32\lib\site-packages\setuptools\dist.py",
>  line 361, in fetch_build_eggs
> replace_conflicting=True,
>   File 
> "C:\Users\Home\AppData\Local\Programs\Python\Python36-32\lib\site-packages\pkg_resources\__init__.py",
>  line 853, in resolve
> raise DistributionNotFound(req, requirers)
> pkg_resources.DistributionNotFound: The 'wheel>=0.25.0' distribution was 
> not found and is required by pypandoc
> 
> 
> 
> Command "python setup.py egg_info" failed with error code 1 in 
> C:\Users\Home\AppData\Local\Temp\pycharm-packaging\pyspark\
> You are using pip version 9.0.1, however version 20.1 is available.
> You should consider upgrading via the 'python -m pip install --upgrade pip' 
> command.


Re: Thoughts on dataframe cogroup?

2019-02-25 Thread chris
Just to add to this, I’ve also implemented my own cogroup previously and would 
welcome a cogroup for dataframes.

My specific use case was that I had a large amount of time series data. Spark 
has very limited support for time series (specifically as-of joins), but pandas 
has good support.

My solution was to take my two dataframes and perform a group by and collect 
list on each. The resulting arrays could be passed into a udf where they could 
be marshaled into a couple of pandas dataframes and processed using pandas' 
excellent time series functionality.
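
Roughly, the workaround looked something like the sketch below (df1/df2, the
column names, and the final aggregation are illustrative placeholders rather
than my actual code):

```
import pandas as pd
from pyspark.sql import functions as F

# Collect each side's rows into one array per key, then join the arrays by key.
# df1/df2 and the "id", "ts", "price", "qty" columns are made-up placeholders.
quotes = df1.groupBy("id").agg(F.collect_list(F.struct("ts", "price")).alias("quotes"))
trades = df2.groupBy("id").agg(F.collect_list(F.struct("ts", "qty")).alias("trades"))
joined = quotes.join(trades, "id")


@F.udf("double")
def asof_mean_price(quote_rows, trade_rows):
    # Marshal the collected arrays back into pandas DataFrames and use
    # pandas' as-of join; return one summary value per key. Edge cases
    # (empty groups, nulls) are ignored here to keep the sketch short.
    q = pd.DataFrame([r.asDict() for r in quote_rows]).sort_values("ts")
    t = pd.DataFrame([r.asDict() for r in trade_rows]).sort_values("ts")
    merged = pd.merge_asof(t, q, on="ts")
    return float(merged["price"].mean())


result = joined.withColumn("mean_price", asof_mean_price("quotes", "trades"))
```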

If cogroup was available natively on dataframes this would have been a bit 
nicer. The ideal would have been some pandas udf version of cogroup that gave 
me a pandas dataframe for each spark dataframe in the cogroup!

Chris 

> On 26 Feb 2019, at 00:38, Jonathan Winandy  wrote:
> 
> For info, in our team we have defined our own cogroup on dataframes in the past 
> on different projects using different methods (rdd[row] based or union all 
> collect list based). 
> 
> I might be biased, but I find the approach very useful in projects to simplify 
> and speed up transformations, and remove a lot of intermediate stages 
> (distinct + join => just cogroup). 
> 
> Plus Spark 2.4 introduced a lot of new operators for nested data. That's a 
> win! 
> 
> 
>> On Thu, 21 Feb 2019, 17:38 Li Jin,  wrote:
>> I am wondering do other people have opinion/use case on cogroup?
>> 
>>> On Wed, Feb 20, 2019 at 5:03 PM Li Jin  wrote:
>>> Alessandro,
>>> 
>>> Thanks for the reply. I assume by "equi-join", you mean "equality full 
>>> outer join".
>>> 
>>> Two issues I see with equality outer join are:
>>> (1) equality outer join will give n * m rows for each key (n and m being the 
>>> corresponding number of rows in df1 and df2 for each key)
>>> (2) the user needs to do some extra processing to transform n * m back to the 
>>> desired shape (two sub-dataframes with n and m rows) 
>>> 
>>> I think full outer join is an inefficient way to implement cogroup. If the 
>>> end goal is to have two separate dataframes for each key, why join them 
>>> first and then un-join them?
>>> 
>>> 
>>> 
>>>> On Wed, Feb 20, 2019 at 5:52 AM Alessandro Solimando 
>>>>  wrote:
>>>> Hello,
>>>> I fail to see how an equi-join on the key columns is different than the 
>>>> cogroup you propose.
>>>> 
>>>> I think the accepted answer can shed some light:
>>>> https://stackoverflow.com/questions/43960583/whats-the-difference-between-join-and-cogroup-in-apache-spark
>>>> 
>>>> Now you apply a udf on each iterable, one per key value (obtained with 
>>>> cogroup).
>>>> 
>>>> You can achieve the same by: 
>>>> 1) join df1 and df2 on the key you want, 
>>>> 2) apply "groupby" on such key
>>>> 3) finally apply a udaf (you can have a look here if you are not familiar 
>>>> with them 
>>>> https://docs.databricks.com/spark/latest/spark-sql/udaf-scala.html), that 
>>>> will process each group "in isolation".
>>>> 
>>>> HTH,
>>>> Alessandro
>>>> 
>>>>> On Tue, 19 Feb 2019 at 23:30, Li Jin  wrote:
>>>>> Hi,
>>>>> 
>>>>> We have been using Pyspark's groupby().apply() quite a bit and it has 
>>>>> been very helpful in integrating Spark with our existing pandas-heavy 
>>>>> libraries.
>>>>> 
>>>>> Recently, we have found more and more cases where groupby().apply() is 
>>>>> not sufficient - In some cases, we want to group two dataframes by the 
>>>>> same key, and apply a function which takes two pd.DataFrame (also returns 
>>>>> a pd.DataFrame) for each key. This feels very much like the "cogroup" 
>>>>> operation in the RDD API.
>>>>> 
>>>>> It would be great to be able to do something like this (not actual API, 
>>>>> just to explain the use case):
>>>>> 
>>>>> @pandas_udf(return_schema, ...)
>>>>> def my_udf(pdf1, pdf2):
>>>>>   # pdf1 and pdf2 are the subsets of the original dataframes that are 
>>>>> associated with a particular key
>>>>>   result = ... # some code that uses pdf1 and pdf2
>>>>>   return result
>>>>> 
>>>>> df3 = cogroup(df1, df2, key='some_key').apply(my_udf)
>>>>> 
>>>>> I have searched around the problem and some people have suggested to join 
>>>>> the tables first. However, it's often not the same pattern and hard to 
>>>>> get it to work by using joins.
>>>>> 
>>>>> I wonder what are people's thought on this? 
>>>>> 
>>>>> Li
>>>>> 


CVE-2021-38296: Apache Spark Key Negotiation Vulnerability - 2.4 Backport?

2022-04-14 Thread Chris Nauroth
A fix for CVE-2021-38296 was committed and released in Apache Spark 3.1.3.
I'm curious, is the issue relevant to the 2.4 version line, and if so, are
there any plans for a backport?

https://lists.apache.org/thread/70x8fw2gx3g9ty7yk0f2f1dlpqml2smd

Chris Nauroth


Re: CVE-2021-38296: Apache Spark Key Negotiation Vulnerability - 2.4 Backport?

2022-04-14 Thread Chris Nauroth
Thanks for the quick reply, Sean!

Chris Nauroth


On Thu, Apr 14, 2022 at 10:15 AM Sean Owen  wrote:

> It does affect 2.4.x, yes. 2.4.x was EOL a while ago, so there wouldn't be
> a new release of 2.4.x in any event. It's recommended to update instead, at
> least to 3.1.3.
>
> On Thu, Apr 14, 2022 at 12:07 PM Chris Nauroth 
> wrote:
>
>> A fix for CVE-2021-38296 was committed and released in Apache Spark
>> 3.1.3. I'm curious, is the issue relevant to the 2.4 version line, and if
>> so, are there any plans for a backport?
>>
>> https://lists.apache.org/thread/70x8fw2gx3g9ty7yk0f2f1dlpqml2smd
>>
>> Chris Nauroth
>>
>


Re: [VOTE] Release Spark 3.3.0 (RC3)

2022-05-26 Thread Chris Nauroth
+1 (non-binding)

* Verified all checksums.
* Verified all signatures.
* Built from source, with multiple profiles, to full success, for Java 11
and Scala 2.13:
* build/mvn -Phadoop-3 -Phadoop-cloud -Phive-thriftserver -Pkubernetes
-Pscala-2.13 -Psparkr -Pyarn -DskipTests clean package
* Almost all unit tests passed. (Some tests related to LevelDB and RocksDB
failed in JNI initialization. If others aren't seeing this, then I probably
just need to work out an environment issue.)
* Ran several examples successfully:
* bin/spark-submit --class org.apache.spark.examples.SparkPi
examples/jars/spark-examples_2.12-3.3.0.jar
* bin/spark-submit --class
org.apache.spark.examples.sql.hive.SparkHiveExample
examples/jars/spark-examples_2.12-3.3.0.jar
* bin/spark-submit
examples/src/main/python/streaming/network_wordcount.py localhost 
* Tested some of the prior issues that blocked RC2:
* bin/spark-sql -e 'SELECT (SELECT IF(x, 1, 0)) AS a FROM (SELECT true)
t(x) UNION SELECT 1 AS a;'
* bin/spark-sql -e "select date '2018-11-17' > 1"

Chris Nauroth


On Wed, May 25, 2022 at 8:00 AM Sean Owen  wrote:

> +1 works for me as usual, with Java 8 + Scala 2.12, Java 11 + Scala 2.13.
>
> On Tue, May 24, 2022 at 12:14 PM Maxim Gekk
>  wrote:
>
>> Please vote on releasing the following candidate as
>> Apache Spark version 3.3.0.
>>
>> The vote is open until 11:59pm Pacific time May 27th and passes if a
>> majority +1 PMC votes are cast, with a minimum of 3 +1 votes.
>>
>> [ ] +1 Release this package as Apache Spark 3.3.0
>> [ ] -1 Do not release this package because ...
>>
>> To learn more about Apache Spark, please see http://spark.apache.org/
>>
>> The tag to be voted on is v3.3.0-rc3 (commit
>> a7259279d07b302a51456adb13dc1e41a6fd06ed):
>> https://github.com/apache/spark/tree/v3.3.0-rc3
>>
>> The release files, including signatures, digests, etc. can be found at:
>> https://dist.apache.org/repos/dist/dev/spark/v3.3.0-rc3-bin/
>>
>> Signatures used for Spark RCs can be found in this file:
>> https://dist.apache.org/repos/dist/dev/spark/KEYS
>>
>> The staging repository for this release can be found at:
>> https://repository.apache.org/content/repositories/orgapachespark-1404
>>
>> The documentation corresponding to this release can be found at:
>> https://dist.apache.org/repos/dist/dev/spark/v3.3.0-rc3-docs/
>>
>> The list of bug fixes going into 3.3.0 can be found at the following URL:
>> https://issues.apache.org/jira/projects/SPARK/versions/12350369
>>
>> This release is using the release script of the tag v3.3.0-rc3.
>>
>>
>> FAQ
>>
>> =
>> How can I help test this release?
>> =
>> If you are a Spark user, you can help us test this release by taking
>> an existing Spark workload and running on this release candidate, then
>> reporting any regressions.
>>
>> If you're working in PySpark you can set up a virtual env and install
>> the current RC and see if anything important breaks; in Java/Scala
>> you can add the staging repository to your project's resolvers and test
>> with the RC (make sure to clean up the artifact cache before/after so
>> you don't end up building with an out-of-date RC going forward).
>>
>> ===
>> What should happen to JIRA tickets still targeting 3.3.0?
>> ===
>> The current list of open tickets targeted at 3.3.0 can be found at:
>> https://issues.apache.org/jira/projects/SPARK and search for "Target
>> Version/s" = 3.3.0
>>
>> Committers should look at those and triage. Extremely important bug
>> fixes, documentation, and API tweaks that impact compatibility should
>> be worked on immediately. Everything else please retarget to an
>> appropriate release.
>>
>> ==
>> But my bug isn't fixed?
>> ==
>> In order to make timely releases, we will typically not hold the
>> release unless the bug in question is a regression from the previous
>> release. That being said, if there is something which is a regression
>> that has not been correctly targeted please ping me or a committer to
>> help target the issue.
>>
>> Maxim Gekk
>>
>> Software Engineer
>>
>> Databricks, Inc.
>>
>


Re: [VOTE] Release Spark 3.3.0 (RC5)

2022-06-07 Thread Chris Nauroth
+1 (non-binding)

* Verified all checksums.
* Verified all signatures.
* Built from source, with multiple profiles, to full success, for Java 11
and Scala 2.13:
* build/mvn -Phadoop-3 -Phadoop-cloud -Phive-thriftserver -Pkubernetes
-Pscala-2.13 -Psparkr -Pyarn -DskipTests clean package
* Tests passed.
* Ran several examples successfully:
* bin/spark-submit --class org.apache.spark.examples.SparkPi
examples/jars/spark-examples_2.12-3.3.0.jar
* bin/spark-submit --class
org.apache.spark.examples.sql.hive.SparkHiveExample
examples/jars/spark-examples_2.12-3.3.0.jar
* bin/spark-submit
examples/src/main/python/streaming/network_wordcount.py localhost 
* Tested some of the issues that blocked prior release candidates:
* bin/spark-sql -e 'SELECT (SELECT IF(x, 1, 0)) AS a FROM (SELECT true)
t(x) UNION SELECT 1 AS a;'
* bin/spark-sql -e "select date '2018-11-17' > 1"
* SPARK-39293 ArrayAggregate fix

Chris Nauroth


On Tue, Jun 7, 2022 at 1:30 PM Cheng Su  wrote:

> +1 (non-binding). Built and ran some internal test for Spark SQL.
>
>
>
> Thanks,
>
> Cheng Su
>
>
>
> *From: *L. C. Hsieh 
> *Date: *Tuesday, June 7, 2022 at 1:23 PM
> *To: *dev 
> *Subject: *Re: [VOTE] Release Spark 3.3.0 (RC5)
>
> +1
>
> Liang-Chi
>
> On Tue, Jun 7, 2022 at 1:03 PM Gengliang Wang  wrote:
> >
> > +1 (non-binding)
> >
> > Gengliang
> >
> > On Tue, Jun 7, 2022 at 12:24 PM Thomas Graves 
> wrote:
> >>
> >> +1
> >>
> >> Tom Graves
> >>
> >> On Sat, Jun 4, 2022 at 9:50 AM Maxim Gekk
> >>  wrote:
> >> >
> >> > Please vote on releasing the following candidate as Apache Spark
> version 3.3.0.
> >> >
> >> > The vote is open until 11:59pm Pacific time June 8th and passes if a
> majority +1 PMC votes are cast, with a minimum of 3 +1 votes.
> >> >
> >> > [ ] +1 Release this package as Apache Spark 3.3.0
> >> > [ ] -1 Do not release this package because ...
> >> >
> >> > To learn more about Apache Spark, please see http://spark.apache.org/
> >> >
> >> > The tag to be voted on is v3.3.0-rc5 (commit
> 7cf29705272ab8e8c70e8885a3664ad8ae3cd5e9):
> >> > https://github.com/apache/spark/tree/v3.3.0-rc5
> >> >
> >> > The release files, including signatures, digests, etc. can be found
> at:
> >> > https://dist.apache.org/repos/dist/dev/spark/v3.3.0-rc5-bin/
> >> >
> >> > Signatures used for Spark RCs can be found in this file:
> >> > https://dist.apache.org/repos/dist/dev/spark/KEYS
> >> >
> >> > The staging repository for this release can be found at:
> >> >
> https://repository.apache.org/content/repositories/orgapachespark-1406
> >> >
> >> > The documentation corresponding to this release can be found at:
> >> > https://dist.apache.org/repos/dist/dev/spark/v3.3.0-rc5-docs/
> >> >
> >> > The list of bug fixes going into 3.3.0 can be found at the following
> URL:
> >> > https://issues.apache.org/jira/projects/SPARK/versions/12350369
> >> >
> >> > This release is using the release script of the tag v3.3.0-rc5.
> >> >
> >> >
> >> > FAQ
> >> >
> >> > =
> >> > How can I help test this release?
> >> > =
> >> > If you are a Spark user, you can help us test this release by taking
> >> > an existing Spark workload and running on this release candidate, then
> >> > reporting any regressions.
> >> >
> >> > If you're working in PySpark you can set up a virtual env and install
> >> > the current RC and see if anything important breaks; in Java/Scala
> >> > you can add the staging repository to your project's resolvers and test
> >> > with the RC (make sure to clean up the artifact cache before/after so
> >> > you don't end up building with an out-of-date RC going forward).
> >> >
> >> > ===
> >> > What should happen to JIRA tickets still targeting 3.3.0?
> >> > ===
> >> > The current list of open tickets targeted at 3.3.0 can be found at:
> >> > https://issues.apache.org/jira/projects/SPARK  and search for
> "Target Version/s" = 3.3.0
> >> >
> >> > Committers should look at those and triage. Extremely important bug
> >> > fixes, documentatio

Re: [VOTE] Release Spark 3.3.0 (RC6)

2022-06-13 Thread Chris Nauroth
+1 (non-binding)

I repeated all checks I described for RC5:

https://lists.apache.org/thread/ksoxmozgz7q728mnxl6c2z7ncmo87vls

Maxim, thank you for your dedication on these release candidates.

Chris Nauroth


On Mon, Jun 13, 2022 at 3:21 PM Mridul Muralidharan 
wrote:

>
> +1
>
> Signatures, digests, etc check out fine.
> Checked out tag and build/tested with -Pyarn -Pmesos -Pkubernetes
>
> The test "SPARK-33084: Add jar support Ivy URI in SQL" in
> sql.SQLQuerySuite fails; but other than that, rest looks good.
>
> Regards,
> Mridul
>
>
>
> On Mon, Jun 13, 2022 at 4:25 PM Tom Graves 
> wrote:
>
>> +1
>>
>> Tom
>>
>> On Thursday, June 9, 2022, 11:27:50 PM CDT, Maxim Gekk
>>  wrote:
>>
>>
>> Please vote on releasing the following candidate as
>> Apache Spark version 3.3.0.
>>
>> The vote is open until 11:59pm Pacific time June 14th and passes if a
>> majority +1 PMC votes are cast, with a minimum of 3 +1 votes.
>>
>> [ ] +1 Release this package as Apache Spark 3.3.0
>> [ ] -1 Do not release this package because ...
>>
>> To learn more about Apache Spark, please see http://spark.apache.org/
>>
>> The tag to be voted on is v3.3.0-rc6 (commit
>> f74867bddfbcdd4d08076db36851e88b15e66556):
>> https://github.com/apache/spark/tree/v3.3.0-rc6
>>
>> The release files, including signatures, digests, etc. can be found at:
>> https://dist.apache.org/repos/dist/dev/spark/v3.3.0-rc6-bin/
>>
>> Signatures used for Spark RCs can be found in this file:
>> https://dist.apache.org/repos/dist/dev/spark/KEYS
>>
>> The staging repository for this release can be found at:
>> https://repository.apache.org/content/repositories/orgapachespark-1407
>>
>> The documentation corresponding to this release can be found at:
>> https://dist.apache.org/repos/dist/dev/spark/v3.3.0-rc6-docs/
>>
>> The list of bug fixes going into 3.3.0 can be found at the following URL:
>> https://issues.apache.org/jira/projects/SPARK/versions/12350369
>>
>> This release is using the release script of the tag v3.3.0-rc6.
>>
>>
>> FAQ
>>
>> =
>> How can I help test this release?
>> =
>> If you are a Spark user, you can help us test this release by taking
>> an existing Spark workload and running on this release candidate, then
>> reporting any regressions.
>>
>> If you're working in PySpark you can set up a virtual env and install
>> the current RC and see if anything important breaks; in Java/Scala
>> you can add the staging repository to your project's resolvers and test
>> with the RC (make sure to clean up the artifact cache before/after so
>> you don't end up building with an out-of-date RC going forward).
>>
>> ===
>> What should happen to JIRA tickets still targeting 3.3.0?
>> ===
>> The current list of open tickets targeted at 3.3.0 can be found at:
>> https://issues.apache.org/jira/projects/SPARK and search for "Target
>> Version/s" = 3.3.0
>>
>> Committers should look at those and triage. Extremely important bug
>> fixes, documentation, and API tweaks that impact compatibility should
>> be worked on immediately. Everything else please retarget to an
>> appropriate release.
>>
>> ==
>> But my bug isn't fixed?
>> ==
>> In order to make timely releases, we will typically not hold the
>> release unless the bug in question is a regression from the previous
>> release. That being said, if there is something which is a regression
>> that has not been correctly targeted please ping me or a committer to
>> help target the issue.
>>
>> Maxim Gekk
>>
>> Software Engineer
>>
>> Databricks, Inc.
>>
>


Re: Missing data in spark output

2022-10-21 Thread Chris Nauroth
Some users have observed issues like what you're describing related to the
job commit algorithm, which is controlled by configuration
property spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version.
Hadoop's default value for this setting is 2. You can find a description of
the algorithms in Hadoop's configuration documentation:

https://hadoop.apache.org/docs/current/hadoop-mapreduce-client/hadoop-mapreduce-client-core/mapred-default.xml

Algorithm version 2 is faster, because the final task output file renames
can be issued in parallel by individual tasks. Unfortunately, there have
been reports of it causing side effects like what you described, especially
if there are a lot of task attempt retries or speculative execution
(configuration property spark.speculation set to true instead of the
default false). You could try switching to algorithm version 1. The
drawback is that it's slower, because the final output renames are executed
single-threaded at the end of the job. The performance impact is more
noticeable for jobs with many tasks, and the effect is amplified when using
cloud storage as opposed to HDFS running in the same network.

If you are using speculative execution, then you could also potentially try
turning that off.
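
For reference, a minimal sketch of the two configuration changes described
above, applied through a SparkSession builder (the same settings can equally be
passed as --conf options at submit time; the app name is illustrative):

```
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("commit-algorithm-v1-example")  # illustrative app name
    # Use the v1 commit algorithm: slower, single-threaded final renames at the
    # end of the job, but avoids the side effects described above.
    .config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "1")
    # Make sure speculative execution is disabled (false is also the default).
    .config("spark.speculation", "false")
    .getOrCreate()
)
```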

Chris Nauroth


On Wed, Oct 19, 2022 at 8:18 AM Martin Andersson 
wrote:

> Is your spark job batch or streaming?
> --
> *From:* Sandeep Vinayak 
> *Sent:* Tuesday, October 18, 2022 19:48
> *To:* dev@spark.apache.org 
> *Subject:* Missing data in spark output
>
>
> EXTERNAL SENDER. Do not click links or open attachments unless you
> recognize the sender and know the content is safe. DO NOT provide your
> username or password.
>
> Hello Everyone,
>
> We have recently been observing intermittent data loss in Spark when writing
> output to GCS (Google Cloud Storage). When there are missing rows, they are
> accompanied by duplicate rows. A re-run of the job doesn't have any
> duplicate or missing rows. Since it's hard to debug, we are first trying to
> understand the potential theoretical root cause of this issue: can this be
> a GCS-specific issue where GCS might not be handling consistency
> well? Any tips will be super helpful.
>
> Thanks,
>
>


Re: [VOTE] Release Spark 3.2.3 (RC1)

2022-11-16 Thread Chris Nauroth
+1 (non-binding)

* Verified all checksums.
* Verified all signatures.
* Built from source, with multiple profiles, to full success, for Java 11
and Scala 2.12:
* build/mvn -Phadoop-3.2 -Phadoop-cloud -Phive-2.3 -Phive-thriftserver
-Pkubernetes -Pscala-2.12 -Psparkr -Pyarn -DskipTests clean package
* Tests passed.
* Ran several examples successfully:
* bin/spark-submit --class org.apache.spark.examples.SparkPi
examples/jars/spark-examples_2.12-3.2.3.jar
* bin/spark-submit --class
org.apache.spark.examples.sql.hive.SparkHiveExample
examples/jars/spark-examples_2.12-3.2.3.jar
* bin/spark-submit
examples/src/main/python/streaming/network_wordcount.py localhost 

Chao, thank you for preparing the release.

Chris Nauroth


On Wed, Nov 16, 2022 at 5:22 AM Yuming Wang  wrote:

> +1
>
> On Wed, Nov 16, 2022 at 2:28 PM Yang,Jie(INF)  wrote:
>
>> I switched from Scala 2.13 to Scala 2.12 today. The test is still in progress
>> and it has not hung.
>>
>>
>>
>> Yang Jie
>>
>>
>>
>> *From:* Dongjoon Hyun 
>> *Date:* Wednesday, November 16, 2022, 01:17
>> *To:* "Yang,Jie(INF)" 
>> *Cc:* huaxin gao , "L. C. Hsieh" <
>> vii...@gmail.com>, Chao Sun , dev <
>> dev@spark.apache.org>
>> *Subject:* Re: [VOTE] Release Spark 3.2.3 (RC1)
>>
>>
>>
>> Did you hit that in Scala 2.12, too?
>>
>>
>>
>> Dongjoon.
>>
>>
>>
>> On Tue, Nov 15, 2022 at 4:36 AM Yang,Jie(INF) 
>> wrote:
>>
>> Hi, all
>>
>>
>>
>> I test v3.2.3 with following command:
>>
>>
>>
>> ```
>>
>> dev/change-scala-version.sh 2.13
>>
>> build/mvn clean install -Phadoop-3 -Phadoop-cloud -Pmesos -Pyarn
>> -Pkinesis-asl -Phive-thriftserver -Pspark-ganglia-lgpl -Pkubernetes -Phive
>> -Pscala-2.13 -fn
>>
>> ```
>>
>>
>>
>> The testing environment is:
>>
>>
>>
>> OS: CentOS 6u3 Final
>>
>> Java: zulu 11.0.17
>>
>> Python: 3.9.7
>>
>> Scala: 2.13
>>
>>
>>
>> The above test command has been executed twice, and both times it hung in the
>> following stack:
>>
>>
>>
>> ```
>>
>> "ScalaTest-main-running-JoinSuite" #1 prio=5 os_prio=0 cpu=312870.06ms
>> elapsed=1552.65s tid=0x7f2ddc02d000 nid=0x7132 waiting on condition
>> [0x7f2de3929000]
>>
>>java.lang.Thread.State: WAITING (parking)
>>
>>at jdk.internal.misc.Unsafe.park(java.base@11.0.17/Native Method)
>>
>>- parking to wait for  <0x000790d00050> (a
>> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
>>
>>at java.util.concurrent.locks.LockSupport.park(java.base@11.0.17
>> /LockSupport.java:194)
>>
>>at
>> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(java.base@11.0.17
>> /AbstractQueuedSynchronizer.java:2081)
>>
>>at java.util.concurrent.LinkedBlockingQueue.take(java.base@11.0.17
>> /LinkedBlockingQueue.java:433)
>>
>>at
>> org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.$anonfun$getFinalPhysicalPlan$1(AdaptiveSparkPlanExec.scala:275)
>>
>>at
>> org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec$$Lambda$9429/0x000802269840.apply(Unknown
>> Source)
>>
>>at
>> org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775)
>>
>>at
>> org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.getFinalPhysicalPlan(AdaptiveSparkPlanExec.scala:228)
>>
>>- locked <0x000790d00208> (a java.lang.Object)
>>
>>at
>> org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.withFinalPlanUpdate(AdaptiveSparkPlanExec.scala:370)
>>
>>at
>> org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.doExecute(AdaptiveSparkPlanExec.scala:355)
>>
>>at
>> org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:185)
>>
>>at
>> org.apache.spark.sql.execution.SparkPlan$$Lambda$8573/0x000801f99c40.apply(Unknown
>> Source)
>>
>>at
>> org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:223)
>>
>>at
>> org.apache.spark.sql.execution.SparkPlan$$Lambda$8574/0x000801f9a040.apply(Unknown
>> Source)
>>
>>at
>> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>>
>>at
>> or

Re: [VOTE][SPIP] Better Spark UI scalability and Driver stability for large applications

2022-11-16 Thread Chris Nauroth
+1 (non-binding)

Gengliang, thank you for the SPIP.

Chris Nauroth


On Wed, Nov 16, 2022 at 4:27 AM Maciej  wrote:

> +1
>
> On 11/16/22 13:19, Yuming Wang wrote:
> > +1, non-binding
> >
> > On Wed, Nov 16, 2022 at 8:12 PM Yang,Jie(INF) wrote:
> >
> > +1, non-binding
> >
> > Yang Jie
> >
> > *From:* Mridul Muralidharan 
> > *Date:* Wednesday, November 16, 2022, 17:35
> > *To:* Kent Yao 
> > *Cc:* Gengliang Wang , dev 
> > *Subject:* Re: [VOTE][SPIP] Better Spark UI scalability and Driver
> > stability for large applications
> >
> > +1
> >
> > Would be great to see history server performance improvements and
> > lower resource utilization at the driver!
> >
> > Regards,
> >
> > Mridul
> >
> > On Wed, Nov 16, 2022 at 2:38 AM Kent Yao  wrote:
> >
> > +1, non-binding
> >
> > Gengliang Wang wrote on Wednesday, November 16, 2022 at 16:36:
> > >
> > > Hi all,
> > >
> > > I’d like to start a vote for SPIP: "Better Spark UI
> scalability and Driver stability for large applications"
> > >
> > > The goal of the SPIP is to improve the Driver's stability by
> supporting storing Spark's UI data on RocksDB. Furthermore, to speed up the
> read and write operations on RocksDB, it introduces a new Protobuf
> serializer.
> > >
> > > Please also refer to the following:
> > >
> > > Previous discussion in the dev mailing list: [DISCUSS] SPIP:
> Better Spark UI scalability and Driver stability for large applications
> > > Design Doc: Better Spark UI scalability and Driver stability
> for large applications
> > > JIRA: SPARK-41053
> > >
> > >
> > > Please vote on the SPIP for the next 72 hours:
> > >
> > > [ ] +1: Accept the proposal as an official SPIP
> > > [ ] +0
> > > [ ] -1: I don’t think this is a good idea because …
> > >
> > > Kind Regards,
> > > Gengliang
> >
> >
> > To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
> >
>
> --
> Best regards,
> Maciej Szymkiewicz
>
> Web: https://zero323.net
> PGP: A30CEF0C31A501EC
>
>


Re: [VOTE] Release Apache Spark 3.4.0 (RC7)

2023-04-12 Thread Chris Nauroth
+1 (non-binding)

* Verified all checksums.
* Verified all signatures.
* Built from source, with multiple profiles, to full success:
* build/mvn -Phadoop-cloud -Phive-thriftserver -Pkubernetes -Psparkr
-Pyarn -DskipTests clean package
* Tests passed.
* Ran several examples successfully:
* bin/spark-submit --class org.apache.spark.examples.SparkPi
examples/jars/spark-examples_2.13-3.4.0.jar
* bin/spark-submit --class
org.apache.spark.examples.sql.hive.SparkHiveExample
examples/jars/spark-examples_2.13-3.4.0.jar
* bin/spark-submit
examples/src/main/python/streaming/network_wordcount.py localhost 

Chris Nauroth


On Tue, Apr 11, 2023 at 10:36 PM beliefer  wrote:

> +1
>
>
> At 2023-04-08 07:29:46, "Xinrong Meng"  wrote:
>
> Please vote on releasing the following candidate(RC7) as Apache Spark
> version 3.4.0.
>
> The vote is open until 11:59pm Pacific time *April 12th* and passes if a
> majority +1 PMC votes are cast, with a minimum of 3 +1 votes.
>
> [ ] +1 Release this package as Apache Spark 3.4.0
> [ ] -1 Do not release this package because ...
>
> To learn more about Apache Spark, please see http://spark.apache.org/
>
> The tag to be voted on is v3.4.0-rc7 (commit
> 87a5442f7ed96b11051d8a9333476d080054e5a0):
> https://github.com/apache/spark/tree/v3.4.0-rc7
>
> The release files, including signatures, digests, etc. can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v3.4.0-rc7-bin/
>
> Signatures used for Spark RCs can be found in this file:
> https://dist.apache.org/repos/dist/dev/spark/KEYS
>
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1441
>
> The documentation corresponding to this release can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v3.4.0-rc7-docs/
>
> The list of bug fixes going into 3.4.0 can be found at the following URL:
> https://issues.apache.org/jira/projects/SPARK/versions/12351465
>
> This release is using the release script of the tag v3.4.0-rc7.
>
>
> FAQ
>
> =
> How can I help test this release?
> =
> If you are a Spark user, you can help us test this release by taking
> an existing Spark workload and running on this release candidate, then
> reporting any regressions.
>
> If you're working in PySpark you can set up a virtual env and install
> the current RC and see if anything important breaks; in Java/Scala
> you can add the staging repository to your project's resolvers and test
> with the RC (make sure to clean up the artifact cache before/after so
> you don't end up building with an out-of-date RC going forward).
>
> ===
> What should happen to JIRA tickets still targeting 3.4.0?
> ===
> The current list of open tickets targeted at 3.4.0 can be found at:
> https://issues.apache.org/jira/projects/SPARK and search for "Target
> Version/s" = 3.4.0
>
> Committers should look at those and triage. Extremely important bug
> fixes, documentation, and API tweaks that impact compatibility should
> be worked on immediately. Everything else please retarget to an
> appropriate release.
>
> ==
> But my bug isn't fixed?
> ==
> In order to make timely releases, we will typically not hold the
> release unless the bug in question is a regression from the previous
> release. That being said, if there is something which is a regression
> that has not been correctly targeted please ping me or a committer to
> help target the issue.
>
> Thanks,
> Xinrong Meng
>
>


Re: [VOTE] Release Apache Spark 3.2.4 (RC1)

2023-04-12 Thread Chris Nauroth
+1 (non-binding)

* Verified all checksums.
* Verified all signatures.
* Built from source, with multiple profiles, to full success:
* build/mvn -Phadoop-3.2 -Phadoop-cloud -Phive-2.3 -Phive-thriftserver
-Pkubernetes -Pscala-2.12 -Psparkr -Pyarn -DskipTests clean package
* Tests passed.
* Ran several examples successfully:
* bin/spark-submit --class org.apache.spark.examples.SparkPi
examples/jars/spark-examples_2.12-3.2.4.jar
* bin/spark-submit --class
org.apache.spark.examples.sql.hive.SparkHiveExample
examples/jars/spark-examples_2.12-3.2.4.jar
* bin/spark-submit
examples/src/main/python/streaming/network_wordcount.py localhost 

Thank you, Dongjoon!

Chris Nauroth


On Wed, Apr 12, 2023 at 3:49 AM Shaoyun Chen  wrote:

> +1 (non-binding)
>
> On 2023/04/12 04:36:59 Jungtaek Lim wrote:
> > +1 (non-binding)
> >
> > Thanks for driving the release!
> >
> > On Wed, Apr 12, 2023 at 3:41 AM Xinrong Meng 
> > wrote:
> >
> > > +1 non-binding
> > >
> > > Thank you Doogjoon!
> > >
> > >> Wenchen Fan wrote on Monday, April 10, 2023 at 11:32 PM:
> > >
> > >> +1
> > >>
> > >> On Tue, Apr 11, 2023 at 10:09 AM Hyukjin Kwon 
> > >> wrote:
> > >>
> > >>> +1
> > >>>
> > >>> On Tue, 11 Apr 2023 at 11:04, Ruifeng Zheng 
> > >>> wrote:
> > >>>
> > >>>> +1 (non-binding)
> > >>>>
> > >>>> Thank you for driving this release!
> > >>>>
> > >>>> --
> > >>>> Ruifeng  Zheng
> > >>>> ruife...@foxmail.com
> > >>>>
> > >>>>
> > >>>>
> > >>>>
> > >>>> -- Original --
> > >>>> *From:* "Yuming Wang" ;
> > >>>> *Date:* Tue, Apr 11, 2023 09:56 AM
> > >>>> *To:* "Mridul Muralidharan";
> > >>>> *Cc:* "huaxin gao";"Chao Sun"<
> > >>>> sunc...@apache.org>;"yangjie01";"Dongjoon
> Hyun"<
> > >>>> dongj...@apache.org>;"Sean Owen";"
> > >>>> dev@spark.apache.org";
> > >>>> *Subject:* Re: [VOTE] Release Apache Spark 3.2.4 (RC1)
> > >>>>
> > >>>> +1.
> > >>>>
> > >>>> On Tue, Apr 11, 2023 at 12:17 AM Mridul Muralidharan <
> mri...@gmail.com>
> > >>>> wrote:
> > >>>>
> > >>>>> +1
> > >>>>>
> > >>>>> Signatures, digests, etc check out fine.
> > >>>>> Checked out tag and build/tested with -Phive -Pyarn -Pmesos
> > >>>>> -Pkubernetes
> > >>>>>
> > >>>>> Regards,
> > >>>>> Mridul
> > >>>>>
> > >>>>>
> > >>>>> On Mon, Apr 10, 2023 at 10:34 AM huaxin gao <
> huaxin.ga...@gmail.com>
> > >>>>> wrote:
> > >>>>>
> > >>>>>> +1
> > >>>>>>
> > >>>>>> On Mon, Apr 10, 2023 at 8:17 AM Chao Sun 
> wrote:
> > >>>>>>
> > >>>>>>> +1 (non-binding)
> > >>>>>>>
> > >>>>>>> On Mon, Apr 10, 2023 at 7:07 AM yangjie01 
> > >>>>>>> wrote:
> > >>>>>>>
> > >>>>>>>> +1 (non-binding)
> > >>>>>>>>
> > >>>>>>>>
> > >>>>>>>>
> > >>>>>>>> *From:* Sean Owen 
> > >>>>>>>> *Date:* Monday, April 10, 2023, 21:19
> > >>>>>>>> *To:* Dongjoon Hyun 
> > >>>>>>>> *Cc:* "dev@spark.apache.org" 
> > >>>>>>>> *Subject:* Re: [VOTE] Release Apache Spark 3.2.4 (RC1)
> > >>>>>>>>
> > >>>>>>>>
> > >>>>>>&

Re: [DISCUSS] Enable blacklisting feature by default in 3.0

2019-04-01 Thread Chris Stevens
Hey Ankur,

I think the significant decrease in "spark.blacklist.timeout" (1 hr down to
5 minutes) in your updated suggestion is the key here.

Looking at a few *successful* runs of the application I was debugging, here
are the error rates when I did *not* have blacklisting enabled:

Run A: 8 executors with 36 total errors over the last 25 minutes of a 1
hour and 6 minute run.
Run B: 8 executors with 50 total errors over the last 30 minutes of a 1
hour run.

Increasing "spark.blacklist.application.maxFailedTasksPerExecutor" to 5
would have allowed run A (~3 failures/executor) to pass, but run B (~6
failures/executor) would not have without the change to
"spark.blacklist.timeout".

With such a small timeout of 5 minutes, the worst you get is executors
flipping between blacklisted and not blacklisted (e.g. fail 5 tasks quickly
due to disk failures, wait 5 minutes, fail 5 tasks quickly, wait 5
minutes). For catastrophic errors, this is probably OK. The executor will
fail fast each time it comes back online and will effectively be
blacklisted 90+% of the time. For transient errors, the executor will come
back online and probably be fine. The only trouble you get into is if you
run out of executors for a stage due to a high amount of transient errors,
but you're right, perhaps that many transient errors is something worth
failing for.

In the case I was debugging with fetch failures, only the 5 minute timeout
applies, but I don't think it would have mattered. Fetch task attempts were
"hanging" for 30+ minutes without failing (it took that long for the netty
channel to timeout). As such, there was no opportunity to blacklist. Even
reducing the number of fetch retry attempts didn't help, as the first
attempt occasionally stalled due to the underlying networking issues.

A few thoughts:
- Correct me if I'm wrong, but once a task fails on an executor, even if
maxTaskAttemptsPerExecutor > 1, that executor will get a failed task count
against it. It looks like "TaskSetBlacklist.updateBlacklistForFailedTask"
only adds to the executor failures. If the task recovers on the second
attempt on the same executor, there is no way to remove the failure. I'd
argue that if the task succeeds on a second attempt on the same executor,
then it is definitely transient and the first attempt's failure should not
count towards the executor's total stage/application failure count.
- Rather than a fixed timeout, could we do some sort of exponential
backoff? Start with a 10 or 20 second blacklist and increase from there?
The nodes with catastrophic errors should quickly hit long blacklist
intervals.
- W.r.t turning it on by default: Do we have a sense of how many teams are
using blacklisting today using the current default settings? It may be
worth changing the defaults for a release or two and gather feedback to
help make a call on turning it on by default. We could potentially get that
feedback now: two question survey "Have you enabled blacklisting?" and
"What settings did you use?"

-Chris

On Mon, Apr 1, 2019 at 9:05 AM Ankur Gupta  wrote:

> Hi Chris,
>
> Thanks for sending over the example. As far as I can understand, it seems
> that this would not have been a problem if
> "spark.blacklist.application.maxFailedTasksPerExecutor" was set to a higher
> threshold, as mentioned in my previous email.
>
> Though, with 8/7 executors and 2 failedTasksPerExecutor, if the
> application runs out of executors, that would imply at least 14 task
> failures in a short period of time. So, I am not sure if the application
> should still continue to run or fail. If this was not a transient issue,
> maybe failing was the correct outcome, as it saves a lot of unnecessary
> computation and also alerts admins to look for transient/permanent hardware
> failures.
>
> Please let me know if you think, we should enable blacklisting feature by
> default with the higher threshold.
>
> Thanks,
> Ankur
>
> On Fri, Mar 29, 2019 at 3:23 PM Chris Stevens <
> chris.stev...@databricks.com> wrote:
>
>> Hey All,
>>
>> My initial reply got lost, because I wasn't on the dev list. Hopefully
>> this goes through.
>>
>> Back story for my experiments: customer was hitting network errors due to
>> cloud infrastructure problems. Basically, executor X couldn't fetch from Y.
>> The NIC backing the VM for executor Y was swallowing packets. I wanted to
>> blacklist node Y.
>>
>> What I learned:
>>
>> 1. `spark.blacklist.application.fetchFailure.enabled` requires
>> `spark.blacklist.enabled` to also be enabled (BlacklistTracker isn't
>> created
>> <https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/TaskSchedulerImpl.scala#L948>
>

Re: Thoughts on dataframe cogroup?

2019-04-09 Thread Chris Martin
Thanks Bryan and Li, that is much appreciated.  Hopefully should have the
SPIP ready in the next couple of days.

thanks,

Chris




On Mon, Apr 8, 2019 at 7:18 PM Bryan Cutler  wrote:

> Chris, an SPIP sounds good to me. I agree with Li that it wouldn't be too
> difficult to extend the current functionality to transfer multiple
> DataFrames. For the SPIP, I would keep it more high-level and I don't
> think it's necessary to include details of the Python worker, we can hash
> that out after the SPIP is approved.
>
> Bryan
>
> On Mon, Apr 8, 2019 at 10:43 AM Li Jin  wrote:
>
>> Thanks Chris, look forward to it.
>>
>> I think sending multiple dataframes to the python worker requires some
>> changes but shouldn't be too difficult. We can probably do something like:
>>
>>
>> [numberOfDataFrames][FirstDataFrameInArrowFormat][SecondDataFrameInArrowFormat]
>>
>> In:
>> https://github.com/apache/spark/blob/86d469aeaa492c0642db09b27bb0879ead5d7166/sql/core/src/main/scala/org/apache/spark/sql/execution/python/ArrowPythonRunner.scala#L70
>>
>> And have ArrowPythonRunner take multiple input iterator/schema.
>>
>> Li
>>
>>
>> On Mon, Apr 8, 2019 at 5:55 AM  wrote:
>>
>>> Hi,
>>>
>>> Just to say, I really do think this is useful and am currently working
>>> on a SPIP to formally propose this. One concern I do have, however, is that
>>> the current arrow serialization code is tied to passing through a single
>>> dataframe as the udf parameter and so any modification to allow multiple
>>> dataframes may not be straightforward.  If anyone has any ideas as to how
>>> this might be achieved in an elegant manner I’d be happy to hear them!
>>>
>>> Thanks,
>>>
>>> Chris
>>>
>>> On 26 Feb 2019, at 14:55, Li Jin  wrote:
>>>
>>> Thank you both for the reply. Chris and I have very similar use cases
>>> for cogroup.
>>>
>>> One of the goals for groupby apply + pandas UDF was to avoid things like
>>> collect list and reshaping data between Spark and Pandas. Cogroup feels
>>> very similar and can be an extension to the groupby apply + pandas UDF
>>> functionality.
>>>
>>> I wonder if any PMC/committers have any thoughts/opinions on this?
>>>
>>> On Tue, Feb 26, 2019 at 2:17 AM  wrote:
>>>
>>>> Just to add to this I’ve also implemented my own cogroup previously and
>>>> would welcome a cogroup for dataframes.
>>>>
>>>> My specific use case was that I had a large amount of time series data.
>>>> Spark has very limited support for time series (specifically as-of joins),
>>>> but pandas has good support.
>>>>
>>>> My solution was to take my two dataframes and perform a group by and
>>>> collect list on each. The resulting arrays could be passed into a udf where
>>>> they could be marshaled into a couple of pandas dataframes and processed
>>>> using pandas excellent time series functionality.
>>>>
>>>> If cogroup was available natively on dataframes this would have been a
>>>> bit nicer. The ideal would have been some pandas udf version of cogroup
>>>> that gave me a pandas dataframe for each spark dataframe in the cogroup!
>>>>
>>>> Chris
>>>>
>>>> On 26 Feb 2019, at 00:38, Jonathan Winandy 
>>>> wrote:
>>>>
>>>> For info, in our team have defined our own cogroup on dataframe in the
>>>> past on different projects using different methods (rdd[row] based or union
>>>> all collect list based).
>>>>
>>>> I might be biased, but find the approach very useful in project to
>>>> simplify and speed up transformations, and remove a lot of intermediate
>>>> stages (distinct + join => just cogroup).
>>>>
>>>> Plus spark 2.4 introduced a lot of new operator for nested data. That's
>>>> a win!
>>>>
>>>>
>>>> On Thu, 21 Feb 2019, 17:38 Li Jin,  wrote:
>>>>
>>>>> I am wondering do other people have opinion/use case on cogroup?
>>>>>
>>>>> On Wed, Feb 20, 2019 at 5:03 PM Li Jin  wrote:
>>>>>
>>>>>> Alessandro,
>>>>>>
>>>>>> Thanks for the reply. I assume by "equi-join", you mean "equality
>>>>>> full outer join" .
>>>>>>
>>>>>> Two issues I see with equality outer 

Re: Thoughts on dataframe cogroup?

2019-04-15 Thread Chris Martin
I've updated the JIRA so that the main body is now inside a Google doc.
Anyone should be able to comment; if you want or need write access, please drop
me a mail and I can add you.

Ryan, regarding your specific point about why I'm not proposing to add
this to the Scala API, I think the main point is that Scala users can
already use cogroup on Datasets. For Scala this is probably a better
solution as (as far as I know) there is no Scala DataFrame library that
could be used in place of pandas for manipulating local DataFrames. As a
result you'd probably be left dealing with Iterators of Row objects,
which almost certainly isn't what you'd want. This is similar to the
existing grouped map pandas UDFs, for which there is no equivalent Scala API.

I do think there might be a place for allowing a (Scala) Dataset cogroup to
take some sort of grouping expression as the grouping key (this would mean
that you wouldn't have to marshal the key into a JVM object and could
possibly lend itself to some Catalyst optimisations), but I don't think
this should be done as part of this SPIP.

thanks,

Chris

On Mon, Apr 15, 2019 at 6:27 PM Ryan Blue  wrote:

> I agree, it would be great to have a document to comment on.
>
> The main thing that stands out right now is that this is only for PySpark
> and states that it will not be added to the Scala API. Why not make this
> available since most of the work would be done?
>
> On Mon, Apr 15, 2019 at 7:50 AM Li Jin  wrote:
>
>> Thank you Chris, this looks great.
>>
>> Would you mind share a google doc version of the proposal? I believe
>> that's the preferred way of discussing proposals (Other people please
>> correct me if I am wrong).
>>
>> Li
>>
>> On Mon, Apr 15, 2019 at 8:20 AM  wrote:
>>
>>> Hi,
>>>
>>>  As promised I’ve raised SPARK-27463 for this.
>>>
>>> All feedback welcome!
>>>
>>> Chris
>>>
>>> On 9 Apr 2019, at 13:22, Chris Martin  wrote:
>>>
>>> Thanks Bryan and Li, that is much appreciated.  Hopefully should have
>>> the SPIP ready in the next couple of days.
>>>
>>> thanks,
>>>
>>> Chris
>>>
>>>
>>>
>>>
>>> On Mon, Apr 8, 2019 at 7:18 PM Bryan Cutler  wrote:
>>>
>>>> Chris, an SPIP sounds good to me. I agree with Li that it wouldn't be
>>>> too difficult to extend the current functionality to transfer multiple
>>>> DataFrames.  For the SPIP, I would keep it more high-level and I don't
>>>> think it's necessary to include details of the Python worker, we can hash
>>>> that out after the SPIP is approved.
>>>>
>>>> Bryan
>>>>
>>>> On Mon, Apr 8, 2019 at 10:43 AM Li Jin  wrote:
>>>>
>>>>> Thanks Chris, look forward to it.
>>>>>
>>>>> I think sending multiple dataframes to the python worker requires some
>>>>> changes but shouldn't be too difficult. We can probably do something like:
>>>>>
>>>>>
>>>>> [numberOfDataFrames][FirstDataFrameInArrowFormat][SecondDataFrameInArrowFormat]
>>>>>
>>>>> In:
>>>>> https://github.com/apache/spark/blob/86d469aeaa492c0642db09b27bb0879ead5d7166/sql/core/src/main/scala/org/apache/spark/sql/execution/python/ArrowPythonRunner.scala#L70
>>>>>
>>>>> And have ArrowPythonRunner take multiple input iterator/schema.
>>>>>
>>>>> Li
>>>>>
>>>>>
>>>>> On Mon, Apr 8, 2019 at 5:55 AM  wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> Just to say, I really do think this is useful and am currently
>>>>>> working on a SPIP to formally propose this. One concern I do have, 
>>>>>> however,
>>>>>> is that the current arrow serialization code is tied to passing through a
>>>>>> single dataframe as the udf parameter and so any modification to allow
>>>>>> multiple dataframes may not be straightforward.  If anyone has any ideas 
>>>>>> as
>>>>>> to how this might be achieved in an elegant manner I’d be happy to hear
>>>>>> them!
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> Chris
>>>>>>
>>>>>> On 26 Feb 2019, at 14:55, Li Jin  wrote:
>>>>>>
>>>>>> Thank you both for the reply. Chris and I have very similar use cases
>

Re: Thoughts on dataframe cogroup?

2019-04-15 Thread Chris Martin
Ah sorry- I've updated the link which should give you access.  Can you try
again now?

thanks,

Chris



On Mon, Apr 15, 2019 at 9:49 PM Li Jin  wrote:

> Hi Chris,
>
> Thanks! The permission to the google doc is maybe not set up properly. I
> cannot view the doc by default.
>
> Li
>
> On Mon, Apr 15, 2019 at 3:58 PM Chris Martin 
> wrote:
>
>> I've updated the jira so that the main body is now inside a google doc.
>> Anyone should be able to comment- if you want/need write access please drop
>> me a mail and I can add you.
>>
>> Ryan- regarding your specific point regarding why I'm not proposing to
>> add this to the Scala API, I think the main point is that Scala users can
>> already use Cogroup for Datasets.  For Scala this is probably a better
>> solution as (as far as I know) there is no Scala DataFrame library that
>> could be used in place of Pandas for manipulating  local DataFrames. As a
>> result you'd probably be left with dealing with Iterators of Row objects,
>> which almost certainly isn't what you'd want. This is similar to the
>> existing grouped map Pandas Udfs for which there is no equivalent Scala Api.
>>
>> I do think there might be a place for allowing a (Scala) DataSet Cogroup
>> to take some sort of grouping expression as the grouping key  (this would
>> mean that you wouldn't have to marshal the key into a JVM object and could
>> possible lend itself to some catalyst optimisations) but I don't think that
>> this should be done as part of this SPIP.
>>
>> thanks,
>>
>> Chris
>>
>> On Mon, Apr 15, 2019 at 6:27 PM Ryan Blue  wrote:
>>
>>> I agree, it would be great to have a document to comment on.
>>>
>>> The main thing that stands out right now is that this is only for
>>> PySpark and states that it will not be added to the Scala API. Why not make
>>> this available since most of the work would be done?
>>>
>>> On Mon, Apr 15, 2019 at 7:50 AM Li Jin  wrote:
>>>
>>>> Thank you Chris, this looks great.
>>>>
>>>> Would you mind share a google doc version of the proposal? I believe
>>>> that's the preferred way of discussing proposals (Other people please
>>>> correct me if I am wrong).
>>>>
>>>> Li
>>>>
>>>> On Mon, Apr 15, 2019 at 8:20 AM  wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>>  As promised I’ve raised SPARK-27463 for this.
>>>>>
>>>>> All feedback welcome!
>>>>>
>>>>> Chris
>>>>>
>>>>> On 9 Apr 2019, at 13:22, Chris Martin  wrote:
>>>>>
>>>>> Thanks Bryan and Li, that is much appreciated.  Hopefully should have
>>>>> the SPIP ready in the next couple of days.
>>>>>
>>>>> thanks,
>>>>>
>>>>> Chris
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Mon, Apr 8, 2019 at 7:18 PM Bryan Cutler  wrote:
>>>>>
>>>>>> Chirs, an SPIP sounds good to me. I agree with Li that it wouldn't be
>>>>>> too difficult to extend the currently functionality to transfer multiple
>>>>>> DataFrames.  For the SPIP, I would keep it more high-level and I don't
>>>>>> think it's necessary to include details of the Python worker, we can hash
>>>>>> that out after the SPIP is approved.
>>>>>>
>>>>>> Bryan
>>>>>>
>>>>>> On Mon, Apr 8, 2019 at 10:43 AM Li Jin  wrote:
>>>>>>
>>>>>>> Thanks Chris, look forward to it.
>>>>>>>
>>>>>>> I think sending multiple dataframes to the python worker requires
>>>>>>> some changes but shouldn't be too difficult. We can probably sth like:
>>>>>>>
>>>>>>>
>>>>>>> [numberOfDataFrames][FirstDataFrameInArrowFormat][SecondDataFrameInArrowFormat]
>>>>>>>
>>>>>>> In:
>>>>>>> https://github.com/apache/spark/blob/86d469aeaa492c0642db09b27bb0879ead5d7166/sql/core/src/main/scala/org/apache/spark/sql/execution/python/ArrowPythonRunner.scala#L70
>>>>>>>
>>>>>>> And have ArrowPythonRunner take multiple input iterator/schema.
>>>>>>>
>>>>>>> Li
>>>>>>>
>>>>>>>

Re: Thoughts on dataframe cogroup?

2019-04-18 Thread Chris Martin
Yes, totally agreed with Li here.

For clarity, I'm happy to do the work to implement this, but it would be
good to get feedback from the community in general and some of the Spark
committers in particular.

thanks,

Chris

On Wed, Apr 17, 2019 at 9:17 PM Li Jin  wrote:

> I have left some comments. This looks a good proposal to me.
>
> As a heavy pyspark user, this is a pattern that we see over and over again
> and I think could be pretty high value to other pyspark users as well. The
> fact that Chris and I come to same ideas sort of verifies my intuition.
> Also, this isn't really something new, RDD has cogroup function from very
> early on.
>
> With that being said, I'd like to call out again for community's feedback
> on the proposal.
>
> On Mon, Apr 15, 2019 at 4:57 PM Chris Martin 
> wrote:
>
>> Ah sorry- I've updated the link which should give you access.  Can you
>> try again now?
>>
>> thanks,
>>
>> Chris
>>
>>
>>
>> On Mon, Apr 15, 2019 at 9:49 PM Li Jin  wrote:
>>
>>> Hi Chris,
>>>
>>> Thanks! The permission to the google doc is maybe not set up properly. I
>>> cannot view the doc by default.
>>>
>>> Li
>>>
>>> On Mon, Apr 15, 2019 at 3:58 PM Chris Martin 
>>> wrote:
>>>
>>>> I've updated the jira so that the main body is now inside a google
>>>> doc.  Anyone should be able to comment- if you want/need write access
>>>> please drop me a mail and I can add you.
>>>>
>>>> Ryan- regarding your specific point regarding why I'm not proposing to
>>>> add this to the Scala API, I think the main point is that Scala users can
>>>> already use Cogroup for Datasets.  For Scala this is probably a better
>>>> solution as (as far as I know) there is no Scala DataFrame library that
>>>> could be used in place of Pandas for manipulating  local DataFrames. As a
>>>> result you'd probably be left with dealing with Iterators of Row objects,
>>>> which almost certainly isn't what you'd want. This is similar to the
>>>> existing grouped map Pandas Udfs for which there is no equivalent Scala 
>>>> Api.
>>>>
>>>> I do think there might be a place for allowing a (Scala) DataSet
>>>> Cogroup to take some sort of grouping expression as the grouping key  (this
>>>> would mean that you wouldn't have to marshal the key into a JVM object and
>>>> could possible lend itself to some catalyst optimisations) but I don't
>>>> think that this should be done as part of this SPIP.
>>>>
>>>> thanks,
>>>>
>>>> Chris
>>>>
>>>> On Mon, Apr 15, 2019 at 6:27 PM Ryan Blue  wrote:
>>>>
>>>>> I agree, it would be great to have a document to comment on.
>>>>>
>>>>> The main thing that stands out right now is that this is only for
>>>>> PySpark and states that it will not be added to the Scala API. Why not 
>>>>> make
>>>>> this available since most of the work would be done?
>>>>>
>>>>> On Mon, Apr 15, 2019 at 7:50 AM Li Jin  wrote:
>>>>>
>>>>>> Thank you Chris, this looks great.
>>>>>>
>>>>>> Would you mind share a google doc version of the proposal? I believe
>>>>>> that's the preferred way of discussing proposals (Other people please
>>>>>> correct me if I am wrong).
>>>>>>
>>>>>> Li
>>>>>>
>>>>>> On Mon, Apr 15, 2019 at 8:20 AM  wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>>  As promised I’ve raised SPARK-27463 for this.
>>>>>>>
>>>>>>> All feedback welcome!
>>>>>>>
>>>>>>> Chris
>>>>>>>
>>>>>>> On 9 Apr 2019, at 13:22, Chris Martin  wrote:
>>>>>>>
>>>>>>> Thanks Bryan and Li, that is much appreciated.  Hopefully should
>>>>>>> have the SPIP ready in the next couple of days.
>>>>>>>
>>>>>>> thanks,
>>>>>>>
>>>>>>> Chris
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Mon, Apr 8, 2019 at 7:18 PM Bryan Cutler 
>>>>>>> 

Fwd: Sample date_trunc error for webpage (https://spark.apache.org/docs/2.3.0/api/sql/#date_trunc )

2019-07-07 Thread Chris Lambertus
Spark,

We received this message. I have not ACKd it.

-Chris
INFRA


> Begin forwarded message:
> 
> From: "binggan1989" 
> Subject: Sample date_trunc error for webpage 
> (https://spark.apache.org/docs/2.3.0/api/sql/#date_trunc )
> Date: July 5, 2019 at 2:54:54 AM PDT
> To: "webmaster" 
> Reply-To: "binggan1989" 
> 
> 
> 
> I found an example of the function usage given on the website is incorrect 
> and needs to be fixed.
> 
> https://spark.apache.org/docs/2.3.0/api/sql/#date_trunc 
> <https://spark.apache.org/docs/2.3.0/api/sql/#date_trunc>
> 



[no subject]

2021-01-08 Thread Chris Brown
Unsubscribe


Using Spark 2.2.0 SparkSession extensions to optimize file filtering

2017-10-24 Thread Chris Luby
I have an external catalog that has additional information on my Parquet files 
that I want to match up with the parsed filters from the plan to prune the list 
of files included in the scan.  I'm looking at doing this using the Spark 2.2.0 
SparkSession extensions, similar to the built-in partition pruning:

https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PruneFileSourcePartitions.scala

and this other project that is along the lines of what I want:

https://github.com/lightcopy/parquet-index/blob/master/src/main/scala/org/apache/spark/sql/execution/datasources/IndexSourceStrategy.scala

but isn’t caught up to 2.2.0, but I’m struggling to understand what type of 
extension I would use to do something like the above:

https://spark.apache.org/docs/2.2.0/api/scala/index.html#org.apache.spark.sql.SparkSessionExtensions

and if this is the appropriate strategy for this.

Are there any examples out there for using the new extension hooks to alter the 
files included in the plan?
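
In case it helps clarify what I'm after, here is the rough skeleton I've been
experimenting with (ExternalCatalogPruning is just a placeholder name and the
rule body is a no-op; the actual file-pruning rewrite is the part I haven't
figured out):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.catalyst.rules.Rule

// Placeholder rule: this is where I'd match the plan's filters against my
// external catalog and rewrite the relation's file listing.
case class ExternalCatalogPruning(spark: SparkSession) extends Rule[LogicalPlan] {
  override def apply(plan: LogicalPlan): LogicalPlan = plan  // no-op for now
}

val spark = SparkSession.builder()
  .appName("external-catalog-pruning")
  .withExtensions { extensions =>
    // register the rule so it runs as part of the optimizer
    extensions.injectOptimizerRule(ExternalCatalogPruning)
  }
  .getOrCreate()

My working assumption is that injectOptimizerRule is the right hook, since the
built-in PruneFileSourcePartitions is itself an optimizer rule, but that
assumption is exactly what I'd like to confirm.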

Thanks.


Re: Revisiting Online serving of Spark models?

2018-05-31 Thread Chris Fregly
Hey everyone!

@Felix:  thanks for putting this together.  i sent some of you a quick calendar 
event - mostly for me, so i don’t forget!  :)

Coincidentally, this is the focus of the June 6th Advanced Spark and TensorFlow 
Meetup @5:30pm that same night here in SF!

Everybody is welcome to come.  Here’s the link to the meetup, which includes the 
signup link:
https://www.meetup.com/Advanced-Spark-and-TensorFlow-Meetup/events/250924195/

We have an awesome lineup of speakers covering a lot of deep, technical ground.

For those who can’t attend in person, we’ll be broadcasting live - and posting 
the recording afterward.  

All details are in the meetup link above…

@holden/felix/nick/joseph/maximiliano/saikat/leif:  you’re more than welcome to 
give a talk. I can move things around to make room.

@joseph:  I’d personally like an update on the direction of the Databricks 
proprietary ML Serving export format which is similar to PMML but not a 
standard in any way.

Also, the Databricks ML Serving Runtime is only available to Databricks 
customers.  This seems in conflict with the community efforts described here.  
Can you comment on behalf of Databricks?

Look forward to your response, joseph.

See you all soon!

—

Chris Fregly
Founder @ PipelineAI <https://pipeline.ai/> (100,000 Users)
Organizer @ Advanced Spark and TensorFlow Meetup 
<https://www.meetup.com/Advanced-Spark-and-TensorFlow-Meetup/> (85,000 Global 
Members)

San Francisco - Chicago - Austin - 
Washington DC - London - Dusseldorf

Try our PipelineAI Community Edition with GPUs and TPUs!! 
<http://community.pipeline.ai/>


> On May 30, 2018, at 9:32 AM, Felix Cheung  wrote:
> 
> Hi!
> 
> Thank you! Let’s meet then
> 
> June 6 4pm
> 
> Moscone West Convention Center
> 800 Howard Street, San Francisco, CA 94103
> 
> Ground floor (outside of conference area - should be available for all) - we 
> will meet and decide where to go
> 
> (Would not send invite because that would be too much noise for dev@)
> 
> To paraphrase Joseph, we will use this to kick off the discusssion and post 
> notes after and follow up online. As for Seattle, I would be very interested 
> to meet in person lateen and discuss ;) 
> 
> 
> _
> From: Saikat Kanjilal 
> Sent: Tuesday, May 29, 2018 11:46 AM
> Subject: Re: Revisiting Online serving of Spark models?
> To: Maximiliano Felice 
> Cc: Felix Cheung , Holden Karau 
> , Joseph Bradley , Leif Walsh 
> , dev 
> 
> 
> Would love to join but am in Seattle, thoughts on how to make this work?
> 
> Regards
> 
> Sent from my iPhone
> 
> On May 29, 2018, at 10:35 AM, Maximiliano Felice  <mailto:maximilianofel...@gmail.com>> wrote:
> 
>> Big +1 to a meeting with fresh air.
>> 
>> Could anyone send the invites? I don't really know which is the place Holden 
>> is talking about.
>> 
>> 2018-05-29 14:27 GMT-03:00 Felix Cheung > <mailto:felixcheun...@hotmail.com>>:
>> You had me at blue bottle!
>> 
>> _
>> From: Holden Karau mailto:hol...@pigscanfly.ca>>
>> Sent: Tuesday, May 29, 2018 9:47 AM
>> Subject: Re: Revisiting Online serving of Spark models?
>> To: Felix Cheung > <mailto:felixcheun...@hotmail.com>>
>> Cc: Saikat Kanjilal mailto:sxk1...@hotmail.com>>, 
>> Maximiliano Felice > <mailto:maximilianofel...@gmail.com>>, Joseph Bradley > <mailto:jos...@databricks.com>>, Leif Walsh > <mailto:leif.wa...@gmail.com>>, dev > <mailto:dev@spark.apache.org>>
>> 
>> 
>> 
>> I'm down for that, we could all go for a walk maybe to the mint plazaa blue 
>> bottle and grab coffee (if the weather holds have our design meeting outside 
>> :p)?
>> 
>> On Tue, May 29, 2018 at 9:37 AM, Felix Cheung > <mailto:felixcheun...@hotmail.com>> wrote:
>> Bump.
>> 
>> From: Felix Cheung > <mailto:felixcheun...@hotmail.com>>
>> Sent: Saturday, May 26, 2018 1:05:29 PM
>> To: Saikat Kanjilal; Maximiliano Felice; Joseph Bradley
>> Cc: Leif Walsh; Holden Karau; dev
>> 
>> Subject: Re: Revisiting Online serving of Spark models?
>>  
>> Hi! How about we meet the community and discuss on June 6 4pm at (near) the 
>> Summit?
>> 
>> (I propose we meet at the venue entrance so we could accommodate people 
>> might not be in the conference)
>> 
>> From: Saikat Kanjilal mailto:sxk1...@hotmail.com>>
>> Sent: Tuesday, May 22, 2018 

Hive Bucketing Support

2018-06-06 Thread Chris Martin
Hi All,


first off apologies if this is not the correct place to ask this!

I've been following SPARK-19256
<https://issues.apache.org/jira/browse/SPARK-19256> (Hive Bucketing
Support) with interest for some time now as we do a relatively large amount
of our data processing in Spark but use Hive for business analytics.  As a
result we end up writing a non-trivial amount of data out twice: once in
Parquet optimized for Spark and once in ORC optimized for Hive!
The hope is that SPARK-19256 will put an end to this.

I've noticed that there's a PR (https://github.com/apache/spark/pull/19001)
that's been open for almost a year now, with the last comment being over a
month ago.  Does anyone know if I should remain hopeful that this support
will be added in the near future, or is it one of those things that's
realistically going to be some distance off?

thanks,

Chris


Re: Spark 2.0.1 / 2.1.0 on Maven

2016-08-09 Thread Chris Fregly
alrighty then!

bcc'ing user list.  cc'ing dev list.

@user list people:  do not read any further or you will be in violation of
ASF policies!

On Tue, Aug 9, 2016 at 11:50 AM, Mark Hamstra 
wrote:

> That's not going to happen on the user list, since that is against ASF
> policy (http://www.apache.org/dev/release.html):
>
> During the process of developing software and preparing a release, various
>> packages are made available to the developer community for testing
>> purposes. Do not include any links on the project website that might
>> encourage non-developers to download and use nightly builds, snapshots,
>> release candidates, or any other similar package. The only people who
>> are supposed to know about such packages are the people following the dev
>> list (or searching its archives) and thus aware of the conditions placed on
>> the package. If you find that the general public are downloading such test
>> packages, then remove them.
>>
>
> On Tue, Aug 9, 2016 at 11:32 AM, Chris Fregly  wrote:
>
>> this is a valid question.  there are many people building products and
>> tooling on top of spark and would like access to the latest snapshots and
>> such.  today's ink is yesterday's news to these people - including myself.
>>
>> what is the best way to get snapshot releases including nightly and
>> specially-blessed "preview" releases so that we, too, can say "try the
>> latest release in our product"?
>>
>> there was a lot of chatter during the 2.0.0/2.0.1 release that i largely
>> ignored because of conflicting/confusing/changing responses.  and i'd
>> rather not dig through jenkins builds to figure this out as i'll likely get
>> it wrong.
>>
>> please provide the relevant snapshot/preview/nightly/whatever repos (or
>> equivalent) that we need to include in our builds to have access to the
>> absolute latest build assets for every major and minor release.
>>
>> thanks!
>>
>> -chris
>>
>>
>> On Tue, Aug 9, 2016 at 10:00 AM, Mich Talebzadeh <
>> mich.talebza...@gmail.com> wrote:
>>
>>> LOL
>>>
>>> Ink has not dried on Spark 2 yet so to speak :)
>>>
>>> Dr Mich Talebzadeh
>>>
>>>
>>>
>>> LinkedIn * 
>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>>
>>>
>>>
>>> http://talebzadehmich.wordpress.com
>>>
>>>
>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>> any loss, damage or destruction of data or any other property which may
>>> arise from relying on this email's technical content is explicitly
>>> disclaimed. The author will in no case be liable for any monetary damages
>>> arising from such loss, damage or destruction.
>>>
>>>
>>>
>>> On 9 August 2016 at 17:56, Mark Hamstra  wrote:
>>>
>>>> What are you expecting to find?  There currently are no releases beyond
>>>> Spark 2.0.0.
>>>>
>>>> On Tue, Aug 9, 2016 at 9:55 AM, Jestin Ma 
>>>> wrote:
>>>>
>>>>> If we want to use versions of Spark beyond the official 2.0.0 release,
>>>>> specifically on Maven + Java, what steps should we take to upgrade? I 
>>>>> can't
>>>>> find the newer versions on Maven central.
>>>>>
>>>>> Thank you!
>>>>> Jestin
>>>>>
>>>>
>>>>
>>>
>>
>>
>> --
>> *Chris Fregly*
>> Research Scientist @ PipelineIO
>> San Francisco, CA
>> pipeline.io
>> advancedspark.com
>>
>>
>


-- 
*Chris Fregly*
Research Scientist @ PipelineIO
San Francisco, CA
pipeline.io
advancedspark.com


Re: Serving Spark ML models via a regular Python web app

2016-08-11 Thread Chris Fregly
this is exactly what my http://pipeline.io project is addressing.  check it out 
and send me feedback or create issues at that github location.

> On Aug 11, 2016, at 7:42 AM, Nicholas Chammas  
> wrote:
> 
> Thanks Michael for the reference, and thanks Nick for the comprehensive 
> overview of existing JIRA discussions about this. I've added myself as a 
> watcher on the various tasks.
> 
>> On Thu, Aug 11, 2016 at 3:02 AM Nick Pentreath  
>> wrote:
>> Currently there is no direct way in Spark to serve models without bringing 
>> in all of Spark as a dependency.
>> 
>> For Spark ML, there is actually no way to do it independently of DataFrames 
>> either (which for single-instance prediction makes things sub-optimal). That 
>> is covered here: https://issues.apache.org/jira/browse/SPARK-10413
>> 
>> So, your options are (in Scala) things like MLeap, PredictionIO, or "roll 
>> your own". Or you can try to export to some other format such as PMML or 
>> PFA. Some MLlib models support PMML export, but for ML it is still missing 
>> (see https://issues.apache.org/jira/browse/SPARK-11171).
>> 
>> There is an external project for PMML too (note licensing) - 
>> https://github.com/jpmml/jpmml-sparkml - which is by now actually quite 
>> comprehensive. It shows that PMML can represent a pretty large subset of 
>> typical ML pipeline functionality.
>> 
>> On the Python side sadly there is even less - I would say your options are 
>> pretty much "roll your own" currently, or export in PMML or PFA.
>> 
>> Finally, part of the "mllib-local" idea was around enabling this local 
>> model-serving (for some initial discussion about the future see 
>> https://issues.apache.org/jira/browse/SPARK-16365).
>> 
>> N
>> 
>> 
>>> On Thu, 11 Aug 2016 at 06:28 Michael Allman  wrote:
>>> Nick,
>>> 
>>> Check out MLeap: https://github.com/TrueCar/mleap. It's not python, but we 
>>> use it in production to serve a random forest model trained by a Spark ML 
>>> pipeline.
>>> 
>>> Thanks,
>>> 
>>> Michael
>>> 
 On Aug 10, 2016, at 7:50 PM, Nicholas Chammas  
 wrote:
 
 Are there any existing JIRAs covering the possibility of serving up Spark 
 ML models via, for example, a regular Python web app?
 
 The story goes like this: You train your model with Spark on several TB of 
 data, and now you want to use it in a prediction service that you’re 
 building, say with Flask. In principle, you don’t need Spark anymore since 
 you’re just passing individual data points to your model and looking for 
 it to spit some prediction back.
 
 I assume this is something people do today, right? I presume Spark needs 
 to run in their web service to serve up the model. (Sorry, I’m new to the 
 ML side of Spark. 😅)
 
 Are there any JIRAs discussing potential improvements to this story? I did 
 a search, but I’m not sure what exactly to look for. SPARK-4587 (model 
 import/export) looks relevant, but doesn’t address the story directly.
 
 Nick


Re: Serving Spark ML models via a regular Python web app

2016-08-11 Thread Chris Fregly
And here's a recent slide deck on the pipeline.io that summarizes what we're 
working on (all open source):  

https://www.slideshare.net/mobile/cfregly/advanced-spark-and-tensorflow-meetup-08042016-one-click-spark-ml-pipeline-deploy-to-production

mleap is heading the wrong direction and reinventing the wheel.  not quite sure 
where that project will go.  doesn't seem like it will have a long shelf-life 
in my opinion.

check out pipeline.io.  some cool stuff in there.

> On Aug 11, 2016, at 9:35 AM, Chris Fregly  wrote:
> 
> this is exactly what my http://pipeline.io project is addressing.  check it 
> out and send me feedback or create issues at that github location.
> 
>> On Aug 11, 2016, at 7:42 AM, Nicholas Chammas  
>> wrote:
>> 
>> Thanks Michael for the reference, and thanks Nick for the comprehensive 
>> overview of existing JIRA discussions about this. I've added myself as a 
>> watcher on the various tasks.
>> 
>>> On Thu, Aug 11, 2016 at 3:02 AM Nick Pentreath  
>>> wrote:
>>> Currently there is no direct way in Spark to serve models without bringing 
>>> in all of Spark as a dependency.
>>> 
>>> For Spark ML, there is actually no way to do it independently of DataFrames 
>>> either (which for single-instance prediction makes things sub-optimal). 
>>> That is covered here: https://issues.apache.org/jira/browse/SPARK-10413
>>> 
>>> So, your options are (in Scala) things like MLeap, PredictionIO, or "roll 
>>> your own". Or you can try to export to some other format such as PMML or 
>>> PFA. Some MLlib models support PMML export, but for ML it is still missing 
>>> (see https://issues.apache.org/jira/browse/SPARK-11171).
>>> 
>>> There is an external project for PMML too (note licensing) - 
>>> https://github.com/jpmml/jpmml-sparkml - which is by now actually quite 
>>> comprehensive. It shows that PMML can represent a pretty large subset of 
>>> typical ML pipeline functionality.
>>> 
>>> On the Python side sadly there is even less - I would say your options are 
>>> pretty much "roll your own" currently, or export in PMML or PFA.
>>> 
>>> Finally, part of the "mllib-local" idea was around enabling this local 
>>> model-serving (for some initial discussion about the future see 
>>> https://issues.apache.org/jira/browse/SPARK-16365).
>>> 
>>> N
>>> 
>>> 
>>>> On Thu, 11 Aug 2016 at 06:28 Michael Allman  wrote:
>>>> Nick,
>>>> 
>>>> Check out MLeap: https://github.com/TrueCar/mleap. It's not python, but we 
>>>> use it in production to serve a random forest model trained by a Spark ML 
>>>> pipeline.
>>>> 
>>>> Thanks,
>>>> 
>>>> Michael
>>>> 
>>>>> On Aug 10, 2016, at 7:50 PM, Nicholas Chammas 
>>>>>  wrote:
>>>>> 
>>>>> Are there any existing JIRAs covering the possibility of serving up Spark 
>>>>> ML models via, for example, a regular Python web app?
>>>>> 
>>>>> The story goes like this: You train your model with Spark on several TB 
>>>>> of data, and now you want to use it in a prediction service that you’re 
>>>>> building, say with Flask. In principle, you don’t need Spark anymore 
>>>>> since you’re just passing individual data points to your model and 
>>>>> looking for it to spit some prediction back.
>>>>> 
>>>>> I assume this is something people do today, right? I presume Spark needs 
>>>>> to run in their web service to serve up the model. (Sorry, I’m new to the 
>>>>> ML side of Spark. 😅)
>>>>> 
>>>>> Are there any JIRAs discussing potential improvements to this story? I 
>>>>> did a search, but I’m not sure what exactly to look for. SPARK-4587 
>>>>> (model import/export) looks relevant, but doesn’t address the story 
>>>>> directly.
>>>>> 
>>>>> Nick


Re: Straw poll: dropping support for things like Scala 2.10

2016-10-28 Thread Chris Fregly
i seem to remember a large spark user (tencent, i believe) chiming in late 
during these discussions 6-12 months ago and squashing any sort of deprecation 
given the massive effort that would be required to upgrade their environment.

i just want to make sure these convos take into consideration large spark users 
- and reflect the real world versus ideal world.

otherwise, this is all for naught like last time.

> On Oct 28, 2016, at 10:43 AM, Sean Owen  wrote:
> 
> If the subtext is vendors, then I'd have a look at what recent distros look 
> like. I'll write about CDH as a representative example, but I think other 
> distros are naturally similar.
> 
> CDH has been on Java 8, Hadoop 2.6, Python 2.7 for almost two years (CDH 5.3 
> / Dec 2014). Granted, this depends on installing on an OS with that Java / 
> Python version. But Java 8 / Python 2.7 is available for all of the supported 
> OSes. The population that isn't on CDH 4, because that supported was dropped 
> a long time ago in Spark, and who is on a version released 2-2.5 years ago, 
> and won't update, is a couple percent of the installed base. They do not in 
> general want anything to change at all.
> 
> I assure everyone that vendors too are aligned in wanting to cater to the 
> crowd that wants the most recent version of everything. For example, CDH 
> offers both Spark 2.0.1 and 1.6 at the same time.
> 
> I wouldn't dismiss support for these supporting components as a relevant 
> proxy for whether they are worth supporting in Spark. Java 7 is long since 
> EOL (no, I don't count paying Oracle for support). No vendor is supporting 
> Hadoop < 2.6. Scala 2.10 was EOL at the end of 2014. Is there a criteria here 
> that reaches a different conclusion about these things just for Spark? This 
> was roughly the same conversation that happened 6 months ago.
> 
> I imagine we're going to find that in about 6 months it'll make more sense 
> all around to remove these. If we can just give a heads up with deprecation 
> and then kick the can down the road a bit more, that sounds like enough for 
> now.
> 
>> On Fri, Oct 28, 2016 at 8:58 AM Matei Zaharia  
>> wrote:
>> Deprecating them is fine (and I know they're already deprecated), the 
>> question is just whether to remove them. For example, what exactly is the 
>> downside of having Python 2.6 or Java 7 right now? If it's high, then we can 
>> remove them, but I just haven't seen a ton of details. It also sounded like 
>> fairly recent versions of CDH, HDP, RHEL, etc still have old versions of 
>> these.
>> 
>> Just talking with users, I've seen many of people who say "we have a Hadoop 
>> cluster from $VENDOR, but we just download Spark from Apache and run newer 
>> versions of that". That's great for Spark IMO, and we need to stay 
>> compatible even with somewhat older Hadoop installs because they are 
>> time-consuming to update. Having the whole community on a small set of 
>> versions leads to a better experience for everyone and also to more of a 
>> "network effect": more people can battle-test new versions, answer questions 
>> about them online, write libraries that easily reach the majority of Spark 
>> users, etc.


Evolutionary algorithm (EA) in Spark

2016-11-02 Thread Chris Lin
Hi All,

I would like to know if there is any plan to implement evolutionary
algorithm in Spark ML, such as particle swarm optimization, genetic
algorithm, ant colony optimization, etc.
Therefore, if someone is working on this in Spark or has already done, I
would like to contribute to it and get some guidance on how to go about it.

Regards,
Chris Lin



--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/Evolutionary-algorithm-EA-in-Spark-tp19715.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



unhelpful exception thrown on predict() when ALS trained model doesn't contain user or product?

2016-12-06 Thread chris snow
I'm using the MatrixFactorizationModel.predict() method and encountered the
following exception:

Name: java.util.NoSuchElementException
Message: next on empty iterator
StackTrace: scala.collection.Iterator$$anon$2.next(Iterator.scala:39)
scala.collection.Iterator$$anon$2.next(Iterator.scala:37)
scala.collection.IndexedSeqLike$Elements.next(IndexedSeqLike.scala:64)
scala.collection.IterableLike$class.head(IterableLike.scala:91)
scala.collection.mutable.ArrayBuffer.scala$collection$IndexedSeqOptimized$$super$head(ArrayBuffer.scala:47)
scala.collection.IndexedSeqOptimized$class.head(IndexedSeqOptimized.scala:120)
scala.collection.mutable.ArrayBuffer.head(ArrayBuffer.scala:47)
org.apache.spark.mllib.recommendation.MatrixFactorizationModel.predict(MatrixFactorizationModel.scala:81)
$line78.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:74)
$line78.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:79)
$line78.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:81)
$line78.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:83)
$line78.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:85)
$line78.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:87)
$line78.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:89)
$line78.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:91)
$line78.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:93)
$line78.$read$$iwC$$iwC$$iwC$$iwC$$iwC.(:95)
$line78.$read$$iwC$$iwC$$iwC$$iwC.(:97)
$line78.$read$$iwC$$iwC$$iwC.(:99)
$line78.$read$$iwC$$iwC.(:101)
$line78.$read$$iwC.(:103)
$line78.$read.(:105)
$line78.$read$.(:109)
$line78.$read$.()
$line78.$eval$.(:7)
$line78.$eval$.()
$line78.$eval.$print()
sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:95)
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:55)
java.lang.reflect.Method.invoke(Method.java:507)
org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:1065)
org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1346)
org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:840)
org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:871)
org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:819)
com.ibm.spark.interpreter.ScalaInterpreter$$anonfun$interpretAddTask$1$$anonfun$apply$3.apply(ScalaInterpreter.scala:296)
com.ibm.spark.interpreter.ScalaInterpreter$$anonfun$interpretAddTask$1$$anonfun$apply$3.apply(ScalaInterpreter.scala:291)
com.ibm.spark.global.StreamState$.withStreams(StreamState.scala:80)
com.ibm.spark.interpreter.ScalaInterpreter$$anonfun$interpretAddTask$1.apply(ScalaInterpreter.scala:290)
com.ibm.spark.interpreter.ScalaInterpreter$$anonfun$interpretAddTask$1.apply(ScalaInterpreter.scala:290)
com.ibm.spark.utils.TaskManager$$anonfun$add$2$$anon$1.run(TaskManager.scala:123)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1153)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
java.lang.Thread.run(Thread.java:785)

It took some debugging to figure out why I received this exception, but
looking at the predict() implementation, it seems to assume that there
will always be features found for the provided user and product IDs:


  /** Predict the rating of one user for one product. */
  @Since("0.8.0")
  def predict(user: Int, product: Int): Double = {
val userVector = userFeatures.lookup(user).head
val productVector = productFeatures.lookup(product).head
blas.ddot(rank, userVector, 1, productVector, 1)
  }

It would be helpful if a more useful exception was raised, e.g.

MissingUserFeatureException : "User ID ${user} not found in model"
MissingProductFeatureException : "Product ID ${product} not found in model"
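
For illustration, something along the lines of the following wrapper (the name
and exception type here are just placeholders) is the behaviour I'd expect -
the same computation, but with an explicit error when an id is missing:

import org.apache.spark.mllib.recommendation.MatrixFactorizationModel

// Hypothetical wrapper: same computation as predict(), but failing with a
// descriptive message instead of a NoSuchElementException when an id has no
// feature vector in the model.
def predictOrExplain(model: MatrixFactorizationModel, user: Int, product: Int): Double = {
  val userVector = model.userFeatures.lookup(user).headOption.getOrElse(
    throw new IllegalArgumentException(s"User ID $user not found in model"))
  val productVector = model.productFeatures.lookup(product).headOption.getOrElse(
    throw new IllegalArgumentException(s"Product ID $product not found in model"))
  // plain dot product of the two factor vectors (predict() uses BLAS ddot)
  userVector.zip(productVector).map { case (u, p) => u * p }.sum
}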

WDYT?


Re: unhelpful exception thrown on predict() when ALS trained model doesn't contain user or product?

2016-12-06 Thread chris snow
Ah cool, thanks for the link!

On 6 December 2016 at 12:25, Nick Pentreath 
wrote:

> Indeed, it's being tracked here: https://issues.apache.
> org/jira/browse/SPARK-18230 though no Pr has been opened yet.
>
>
> On Tue, 6 Dec 2016 at 13:36 chris snow  wrote:
>
>> I'm using the MatrixFactorizationModel.predict() method and encountered
>> the following exception:
>>
>> Name: java.util.NoSuchElementException
>> Message: next on empty iterator
>> StackTrace: scala.collection.Iterator$$anon$2.next(Iterator.scala:39)
>> scala.collection.Iterator$$anon$2.next(Iterator.scala:37)
>> scala.collection.IndexedSeqLike$Elements.next(IndexedSeqLike.scala:64)
>> scala.collection.IterableLike$class.head(IterableLike.scala:91)
>> scala.collection.mutable.ArrayBuffer.scala$collection$
>> IndexedSeqOptimized$$super$head(ArrayBuffer.scala:47)
>> scala.collection.IndexedSeqOptimized$class.head(IndexedSeqOptimized.
>> scala:120)
>> scala.collection.mutable.ArrayBuffer.head(ArrayBuffer.scala:47)
>> org.apache.spark.mllib.recommendation.MatrixFactorizationModel.predict(
>> MatrixFactorizationModel.scala:81)
>> $line78.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$
>> iwC$$iwC$$iwC$$iwC$$iwC.(:74)
>> $line78.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$
>> iwC$$iwC$$iwC$$iwC.(:79)
>> $line78.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$
>> iwC$$iwC$$iwC.(:81)
>> $line78.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$
>> iwC$$iwC.(:83)
>> $line78.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$
>> iwC.(:85)
>> $line78.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<
>> init>(:87)
>> $line78.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.
>> (:89)
>> $line78.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:91)
>> $line78.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:93)
>> $line78.$read$$iwC$$iwC$$iwC$$iwC$$iwC.(:95)
>> $line78.$read$$iwC$$iwC$$iwC$$iwC.(:97)
>> $line78.$read$$iwC$$iwC$$iwC.(:99)
>> $line78.$read$$iwC$$iwC.(:101)
>> $line78.$read$$iwC.(:103)
>> $line78.$read.(:105)
>> $line78.$read$.(:109)
>> $line78.$read$.()
>> $line78.$eval$.(:7)
>> $line78.$eval$.()
>> $line78.$eval.$print()
>> sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>> sun.reflect.NativeMethodAccessorImpl.invoke(
>> NativeMethodAccessorImpl.java:95)
>> sun.reflect.DelegatingMethodAccessorImpl.invoke(
>> DelegatingMethodAccessorImpl.java:55)
>> java.lang.reflect.Method.invoke(Method.java:507)
>> org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(
>> SparkIMain.scala:1065)
>> org.apache.spark.repl.SparkIMain$Request.loadAndRun(
>> SparkIMain.scala:1346)
>> org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:840)
>> org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:871)
>> org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:819)
>> com.ibm.spark.interpreter.ScalaInterpreter$$anonfun$
>> interpretAddTask$1$$anonfun$apply$3.apply(ScalaInterpreter.scala:296)
>> com.ibm.spark.interpreter.ScalaInterpreter$$anonfun$
>> interpretAddTask$1$$anonfun$apply$3.apply(ScalaInterpreter.scala:291)
>> com.ibm.spark.global.StreamState$.withStreams(StreamState.scala:80)
>> com.ibm.spark.interpreter.ScalaInterpreter$$anonfun$
>> interpretAddTask$1.apply(ScalaInterpreter.scala:290)
>> com.ibm.spark.interpreter.ScalaInterpreter$$anonfun$
>> interpretAddTask$1.apply(ScalaInterpreter.scala:290)
>> com.ibm.spark.utils.TaskManager$$anonfun$add$2$$
>> anon$1.run(TaskManager.scala:123)
>> java.util.concurrent.ThreadPoolExecutor.runWorker(
>> ThreadPoolExecutor.java:1153)
>> java.util.concurrent.ThreadPoolExecutor$Worker.run(
>> ThreadPoolExecutor.java:628)
>> java.lang.Thread.run(Thread.java:785)
>>
>> This took some debugging to figure out why I received the Exception, but
>> when looking at the predict() implementation, I seems to assume that there
>> will always be features found for the provided user and product ids:
>>
>>
>>   /** Predict the rating of one user for one product. */
>>   @Since("0.8.0")
>>   def predict(user: Int, product: Int): Double = {
>> val userVector = userFeatures.lookup(user).head
>> val productVector = productFeatures.lookup(product).head
>> blas.ddot(rank, userVector, 1, productVector, 1)
>>   }
>>
>> It would be helpful if a more useful exception was raised, e.g.
>>
>> MissingUserFeatureException : "User ID ${user} not found in model"
>> MissingProductFeatureException : "Product ID ${product} not found in
>> model"
>>
>> WDYT?
>>
>>
>>
>>


Re: Structured Streaming Spark Summit Demo - Databricks people

2017-02-15 Thread Chris Fregly
Just be warned:  the last time I asked a question about a non-working 
Databricks Keynote Demo from Spark Summit on the forum mentioned here, they 
deleted my question!  And i’m a major contributor to those forums!!

Often times, those on-stage demos don’t actually work until many months after 
they’re presented on stage - especially the proprietary demos involving 
dbutils() and display().

Chris Fregly
Research Scientist @ PipelineIO
Founder @ Advanced Spark and TensorFlow Meetup
San Francisco - Chicago - Washington DC - London

On Feb 15, 2017, 12:14 PM -0800, Nicholas Chammas , 
wrote:
> I don't think this is the right place for questions about Databricks. I'm 
> pretty sure they have their own website with a forum for questions about 
> their product.
>
> Maybe this? https://forums.databricks.com/
>
> > On Wed, Feb 15, 2017 at 2:34 PM Sam Elamin  wrote:
> > > Hey folks
> > >
> > > This one is mainly aimed at the databricks folks, I have been trying to 
> > > replicate the cloudtrail demo Micheal did at Spark Summit. The code for 
> > > it can be found here
> > >
> > > My question is how did you get the results to be displayed and updated 
> > > continusly in real time
> > >
> > > I am also using databricks to duplicate it but I noticed the code link 
> > > mentions
> > >
> > >  "If you count the number of rows in the table, you should find the value 
> > > increasing over time. Run the following every few minutes."
> > > This leads me to believe that the version of Databricks that Micheal was 
> > > using for the demo is still not released, or at-least the functionality 
> > > to display those changes in real time aren't
> > >
> > > Is this the case? or am I completely wrong?
> > >
> > > Can I display the results of a structured streaming query in realtime 
> > > using the databricks "display" function?
> > >
> > >
> > > Regards
> > > Sam


Re: SPIP: Spark on Kubernetes

2017-08-17 Thread Chris Fregly
@reynold:

Databricks runs their proprietary product on Kubernetes.  how about 
contributing some of that work back to the Open Source Community?

—

Chris Fregly
Founder and Research Engineer @ PipelineAI <http://pipeline.io/>
Founder @ Advanced Spark and TensorFlow Meetup 
<http://www.meetup.com/Advanced-Spark-and-TensorFlow-Meetup>
San Francisco - Chicago - Washington DC - London

> On Aug 17, 2017, at 10:55 AM, Reynold Xin  wrote:
> 
> +1 on adding Kubernetes support in Spark (as a separate module similar to how 
> YARN is done)
> 
> I talk with a lot of developers and teams that operate cloud services, and 
> k8s in the last year has definitely become one of the key projects, if not 
> the one with the strongest momentum in this space. I'm not 100% sure we can 
> make it into 2.3 but IMO based on the activities in the forked repo and 
> claims that certain deployments are already running in production, this could 
> already be a solid project and will have everlasting positive impact.
> 
> 
> 
> On Wed, Aug 16, 2017 at 10:24 AM, Alexander Bezzubov  <mailto:b...@apache.org>> wrote:
> +1 (non-binding)
> 
> 
> Looking forward using it as part of Apache Spark release, instead of 
> Standalone cluster deployed on top of k8s.
> 
> 
> --
> Alex
> 
> On Wed, Aug 16, 2017 at 11:11 AM, Ismaël Mejía  <mailto:ieme...@gmail.com>> wrote:
> +1 (non-binding)
> 
> This is something really great to have. More schedulers and runtime
> environments are a HUGE win for the Spark ecosystem.
> Amazing work, Big kudos for the guys who created and continue working on this.
> 
> On Wed, Aug 16, 2017 at 2:07 AM, lucas.g...@gmail.com 
> <mailto:lucas.g...@gmail.com>
> mailto:lucas.g...@gmail.com>> wrote:
> > From our perspective, we have invested heavily in Kubernetes as our cluster
> > manager of choice.
> >
> > We also make quite heavy use of spark.  We've been experimenting with using
> > these builds (2.1 with pyspark enabled) quite heavily.  Given that we've
> > already 'paid the price' to operate Kubernetes in AWS it seems rational to
> > move our jobs over to spark on k8s.  Having this project merged into the
> > master will significantly ease keeping our Data Munging toolchain primarily
> > on Spark.
> >
> >
> > Gary Lucas
> > Data Ops Team Lead
> > Unbounce
> >
> > On 15 August 2017 at 15:52, Andrew Ash  > <mailto:and...@andrewash.com>> wrote:
> >>
> >> +1 (non-binding)
> >>
> >> We're moving large amounts of infrastructure from a combination of open
> >> source and homegrown cluster management systems to unify on Kubernetes and
> >> want to bring Spark workloads along with us.
> >>
> >> On Tue, Aug 15, 2017 at 2:29 PM, liyinan926  >> <mailto:liyinan...@gmail.com>> wrote:
> >>>
> >>> +1 (non-binding)
> >>>
> >>>
> >>>
> >>> --
> >>> View this message in context:
> >>> http://apache-spark-developers-list.1001551.n3.nabble.com/SPIP-Spark-on-Kubernetes-tp22147p22164.html
> >>>  
> >>> <http://apache-spark-developers-list.1001551.n3.nabble.com/SPIP-Spark-on-Kubernetes-tp22147p22164.html>
> >>> Sent from the Apache Spark Developers List mailing list archive at
> >>> Nabble.com.
> >>>
> >>> -
> >>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org 
> >>> <mailto:dev-unsubscr...@spark.apache.org>
> >>>
> >>
> >
> 
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org 
> <mailto:dev-unsubscr...@spark.apache.org>
> 
> 
> 



Re: Spark + Kinesis

2015-05-09 Thread Chris Fregly
hey vadim-

sorry for the delay.

if you're interested in trying to get Kinesis working one-on-one, shoot me
a direct email and we'll get it going off-list.

we can circle back and summarize our findings here.

lots of people are using Spark Streaming+Kinesis successfully.

would love to help you through this - albeit a month later!  the goal is to
have this working out of the box, so i'd like to implement anything i can
do to make that happen.

lemme know.

btw, Spark 1.4 will have some improvements to the Kinesis Spark Streaming integration.

TD and I have been working together on this.

thanks!

-chris

On Tue, Apr 7, 2015 at 6:17 PM, Vadim Bichutskiy  wrote:

> Hey y'all,
>
> While I haven't been able to get Spark + Kinesis integration working, I
> pivoted to plan B: I now push data to S3 where I set up a DStream to
> monitor an S3 bucket with textFileStream, and that works great.
>
> I <3 Spark!
>
> Best,
> Vadim
>
>
> ᐧ
>
> On Mon, Apr 6, 2015 at 12:23 PM, Vadim Bichutskiy <
> vadim.bichuts...@gmail.com> wrote:
>
>> Hi all,
>>
>> I am wondering, has anyone on this list been able to successfully
>> implement Spark on top of Kinesis?
>>
>> Best,
>> Vadim
>>
>> On Sun, Apr 5, 2015 at 1:50 PM, Vadim Bichutskiy <
>> vadim.bichuts...@gmail.com> wrote:
>>
>>> Hi all,
>>>
>>> Below is the output that I am getting. My Kinesis stream has 1 shard,
>>> and my Spark cluster on EC2 has 2 slaves (I think that's fine?).
>>> I should mention that my Kinesis producer is written in Python where I
>>> followed the example
>>> http://blogs.aws.amazon.com/bigdata/post/Tx2Z24D4T99AN35/Snakes-in-the-Stream-Feeding-and-Eating-Amazon-Kinesis-Streams-with-Python
>>>
>>> I also wrote a Python consumer, again using the example at the above
>>> link, that works fine. But I am unable to display output from my Spark
>>> consumer.
>>>
>>> I'd appreciate any help.
>>>
>>> Thanks,
>>> Vadim
>>>
>>> ---
>>>
>>> Time: 142825409 ms
>>>
>>> ---
>>>
>>>
>>> 15/04/05 17:14:50 INFO scheduler.JobScheduler: Finished job streaming
>>> job 142825409 ms.0 from job set of time 142825409 ms
>>>
>>> 15/04/05 17:14:50 INFO scheduler.JobScheduler: Total delay: 0.099 s for
>>> time 142825409 ms (execution: 0.090 s)
>>>
>>> 15/04/05 17:14:50 INFO rdd.ShuffledRDD: Removing RDD 63 from persistence
>>> list
>>>
>>> 15/04/05 17:14:50 INFO storage.BlockManager: Removing RDD 63
>>>
>>> 15/04/05 17:14:50 INFO rdd.MapPartitionsRDD: Removing RDD 62 from
>>> persistence list
>>>
>>> 15/04/05 17:14:50 INFO storage.BlockManager: Removing RDD 62
>>>
>>> 15/04/05 17:14:50 INFO rdd.MapPartitionsRDD: Removing RDD 61 from
>>> persistence list
>>>
>>> 15/04/05 17:14:50 INFO storage.BlockManager: Removing RDD 61
>>>
>>> 15/04/05 17:14:50 INFO rdd.UnionRDD: Removing RDD 60 from persistence
>>> list
>>>
>>> 15/04/05 17:14:50 INFO storage.BlockManager: Removing RDD 60
>>>
>>> 15/04/05 17:14:50 INFO rdd.BlockRDD: Removing RDD 59 from persistence
>>> list
>>>
>>> 15/04/05 17:14:50 INFO storage.BlockManager: Removing RDD 59
>>>
>>> 15/04/05 17:14:50 INFO dstream.PluggableInputDStream: Removing blocks of
>>> RDD BlockRDD[59] at createStream at MyConsumer.scala:56 of time
>>> 142825409 ms
>>>
>>> ***
>>>
>>> 15/04/05 17:14:50 INFO scheduler.ReceivedBlockTracker: Deleting batches
>>> ArrayBuffer(142825407 ms)
>>> On Sat, Apr 4, 2015 at 3:13 PM, Vadim Bichutskiy <
>>> vadim.bichuts...@gmail.com> wrote:
>>>
>>>> Hi all,
>>>>
>>>> More good news! I was able to utilize mergeStrategy to assembly my
>>>> Kinesis consumer into an "uber jar"
>>>>
>>>> Here's what I added to* build.sbt:*
>>>>
>>>> *mergeStrategy in assembly <<= (mergeStrategy in assembly) { (old) =>*
>>>> *  {*
>>>> *  case PathList("com", "esotericsoftware", "minlog", xs @ _*) =>
>>>> MergeStrategy.first*
>>&

Spark Packages: using sbt-spark-package tool with R

2015-06-04 Thread Chris Freeman
Hey everyone,

I’m looking to develop a package for use with SparkR. This package would 
include custom R and Scala code and I was wondering if anyone had any insight 
into how I might be able to use the sbt-spark-package tool to publish something 
that needs to include an R package as well as a JAR created via SBT assembly.  
I know there’s an existing option for including Python files but I haven’t been 
able to crack the code on how I might be able to include R files.

Any advice is appreciated!

-Chris Freeman



Re: Sidebar: issues targeted for 1.4.0

2015-06-17 Thread Heller, Chris
I appreciate targets having the strong meaning you suggest, as it's useful
to get a sense of what will realistically be included in a release.


Would it make sense (speaking as a relative outsider here) that we would
not enter the RC phase of a release until all JIRAs targeting that
release are complete?

If a JIRA targeting a release is blocking entry to the RC phase, and it's
determined that the JIRA should not hold up the release, then it should
get re-targeted to the next release.

-Chris

On 6/17/15, 3:55 PM, "Patrick Wendell"  wrote:

>Hey Sean,
>
>Thanks for bringing this up - I went through and fixed about 10 of
>them. Unfortunately there isn't a hard and fast way to resolve them. I
>found all of the following:
>
>- Features that missed the release and needed to be retargeted to 1.5.
>- Bugs that missed the release and needed to be retargeted to 1.4.1.
>- Issues that were not properly targeted (e.g. someone randomly set
>the target version) and should probably be untargeted.
>
>I'd like to encourage others to do this, especially the more active
>developers on different components (Streaming, ML, etc).
>
>One other question is what the semantics of target version are, which
>I don't think we've defined clearly. Is it the target of the person
>contributing the feature? Or in some sense the target of the
>committership? My preference would be that targeting a JIRA has some
>strong semantics - i.e. it means the commiter targeting has mentally
>allocated time to review a patch for that feature in the timeline of
>that release. I.e. prefer to have fewer targeted JIRA's for a release,
>and also expect to get most of the targeted features merged into a
>release. In the past I think targeting has meant different things to
>different people.
>
>- Patrick
>
>On Tue, Jun 16, 2015 at 8:09 AM, Josh Rosen  wrote:
>> Whatever you do, DO NOT use the built-in JIRA 'releases' feature to
>>migrate
>> issues from 1.4.0 to another version: the JIRA feature will have the
>> side-effect of automatically changing the target versions for issues
>>that
>> have been closed, which is going to be really confusing. I've made this
>> mistake once myself and it was a bit of a hassle to clean up.
>>
>> On Tue, Jun 16, 2015 at 5:24 AM, Sean Owen  wrote:
>>>
>>> Question: what would happen if I cleared Target Version for everything
>>> still marked Target Version = 1.4.0? There are 76 right now, and
>>> clearly that's not correct.
>>>
>>> 56 were opened by committers, including issues like "Do X for 1.4".
>>> I'd like to understand whether these are resolved but just weren't
>>> closed, or else why so many issues are being filed as a todo and not
>>> resolved? Slipping things here or there is OK, but these weren't even
>>> slipped, just forgotten.
>>>
>>> On Sat, May 30, 2015 at 3:55 PM, Sean Owen  wrote:
>>> > In an ideal world,  Target Version really is what's going to go in as
>>> > far as anyone knows and when new stuff comes up, we all have to
>>>figure
>>> > out what gets dropped to fit by the release date. Boring, standard
>>> > software project management practice. I don't know how realistic that
>>> > is, but, I'm wondering how people feel about this, who have filed
>>> > these JIRAs?
>>> >
>>> > Concretely, should non-Critical issues for 1.4.0 be un-Targeted?
>>> > should they all be un-Targeted after the release?
>>>
>>> -
>>> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>>> For additional commands, e-mail: dev-h...@spark.apache.org
>>>
>>
>
>-
>To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>For additional commands, e-mail: dev-h...@spark.apache.org
>




A proposal for Test matrix decompositions for speed/stability (SPARK-7210)

2015-07-02 Thread Chris Harvey
Hello,

I am new to the Apache Spark project but I would like to contribute to
issue SPARK-7210. There has been come conversation on that issue and I
would like to take a shot at it. Before doing so, I want to run my plan by
everyone.

From the description and the comments, the goal is to test other methods of
computing the MVN pdf. The stated concern is that the SVD used is slow
despite being numerically stable, and that speed and stability may
become problematic as the number of features grows.

In the comments, Feynman posted an R recipe for computing the pdf using a
Cholesky trick. I would like to compute the pdf by following that recipe
while using the Cholesky implementation found in Scalanlp Breeze. To test
speed I would estimate the pdf using the original method and the Cholesky
method across a range of simulated datasets with growing n and p. To test
stability I would estimate the pdf on simulated features with some
multicollinearity.
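
To make that concrete, here is a rough sketch of the Cholesky-based log-pdf I
would benchmark against the current SVD-based implementation (assuming Breeze
and a positive-definite covariance matrix; mvnLogPdf is just an illustrative
name):

import breeze.linalg.{DenseMatrix, DenseVector, cholesky, diag, sum}
import breeze.numerics.log

// log p(x) = -0.5 * (d*log(2*pi) + log|Sigma| + (x-mu)' Sigma^-1 (x-mu))
// with Sigma = L * L', so log|Sigma| = 2 * sum(log(diag(L))) and the
// quadratic form is z'z where z solves L * z = (x - mu).
def mvnLogPdf(x: DenseVector[Double],
              mu: DenseVector[Double],
              sigma: DenseMatrix[Double]): Double = {
  val d = x.length
  val L = cholesky(sigma)        // lower-triangular Cholesky factor
  val centered = x - mu
  val z = L \ centered           // solves L * z = (x - mu)
  val logDet = 2.0 * sum(log(diag(L)))
  -0.5 * (d * math.log(2.0 * math.Pi) + logDet + (z dot z))
}

(In a real implementation a dedicated triangular solve would replace the
generic \ operator; this sketch is only meant to pin down the formula being
tested.)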

Does this sound like a good starting point? Am I thinking of this correctly?

Given that this is my first attempt at contributing to an Apache project,
might it be a good idea to do this through the Mentor Programme?

Please let me know how this sounds, and I can provide some personal details
about my experience and motivations.

Thanks,

Chris


Maven issues with 1.5-RC

2015-08-26 Thread Chris Freeman
Currently trying to compile 1.5-RC2 (from 
https://github.com/apache/spark/commit/727771352855dbb780008c449a877f5aaa5fc27a)
 and running into issues with the new Maven requirement. I have 3.0.4 installed 
at the system level, 1.5 requires 3.3.3. As Patrick has pointed out in other 
places, this should be a non-issue since Spark can download and use its own 
version of Maven, and you can guarantee this happens by using the --force flag 
when calling build/mvn. However, this doesn’t appear to be working as intended 
(or I just have really bad luck).

When I run build/mvn --force -DskipTests -Psparkr package, the first thing I 
see is this:

Using `mvn` from path: /home/cloudera/spark/build/apache-maven-3.3.3/bin/mvn

Looks good. However, after initializing the build order and starting on Spark 
Project Parent POM, I still see this:

[INFO] --- maven-enforcer-plugin:1.4:enforce (enforce-versions) @ 
spark-parent_2.10 ---
[WARNING] Rule 0: org.apache.maven.plugins.enforcer.RequireMavenVersion failed 
with message:
Detected Maven Version: 3.0.4 is not in the allowed range 3.3.3.

And then the build fails. Has anyone else experienced this/anyone have any idea 
what I’m missing here? Running on CentOS with Java 7, for what it’s worth.

--
Chris Freeman
Senior Content Engineer - Alteryx
(657) 900 5462


Re: Please reply if you use Mesos fine grained mode

2015-11-04 Thread Heller, Chris
We’ve been making use of both. Fine-grain mode makes sense for more ad-hoc work 
loads, and coarse-grained for more job like loads on a common data set. My 
preference is the fine-grain mode in all cases, but the overhead associated 
with its startup and the possibility that an overloaded cluster would be 
starved for resources makes coarse grain mode a reality at the moment.

On Wednesday, 4 November 2015 5:24 AM, Reynold Xin <r...@databricks.com> wrote:


If you are using Spark with Mesos fine grained mode, can you please respond to 
this email explaining why you use it over the coarse grained mode?

Thanks.





Re: Please reply if you use Mesos fine grained mode

2015-11-04 Thread Heller, Chris
Correct. It's just that with coarse-grained mode we grab the resources up front, 
so they're either available or not. Using resources on demand, as with 
fine-grained mode, means the potential to starve out an individual job. There is 
also the sharing of RDDs that coarse-grained mode gives you, which would need 
something like Tachyon to achieve in fine-grained mode.


From: Timothy Chen <tnac...@gmail.com>
Date: Wednesday, November 4, 2015 at 11:05 AM
To: "Heller, Chris" <chel...@akamai.com>
Cc: Reynold Xin <r...@databricks.com>, "dev@spark.apache.org" <dev@spark.apache.org>
Subject: Re: Please reply if you use Mesos fine grained mode

Hi Chris,

How does coarse grain mode gives you less starvation in your overloaded 
cluster? Is it just because it allocates all resources at once (which I think 
in a overloaded cluster allows less things to run at once).

Tim


On Nov 4, 2015, at 4:21 AM, Heller, Chris <chel...@akamai.com> wrote:

We’ve been making use of both. Fine-grained mode makes sense for more ad-hoc 
workloads, and coarse-grained for more job-like loads on a common data set. My 
preference is fine-grained mode in all cases, but the overhead associated 
with its startup and the possibility that an overloaded cluster would be 
starved for resources make coarse-grained mode a reality at the moment.

On Wednesday, 4 November 2015 5:24 AM, Reynold Xin 
mailto:r...@databricks.com>> wrote:


If you are using Spark with Mesos fine grained mode, can you please respond to 
this email explaining why you use it over the coarse grained mode?

Thanks.





Re: Removing the Mesos fine-grained mode

2015-11-19 Thread Heller, Chris
I was one of those who argued for fine-grained mode, and there is something I 
still appreciate about how fine-grained mode operates in terms of the way one 
would define a Mesos framework. That said, with dynamic allocation and Mesos 
support for resource reservation, oversubscription, and revocation, I think the 
direction is clear that coarse-grained mode is the proper way forward, and 
having the two code paths is just noise.

-Chris

From: Iulian Dragoș 
mailto:iulian.dra...@typesafe.com>>
Date: Thursday, November 19, 2015 at 6:42 AM
To: "dev@spark.apache.org<mailto:dev@spark.apache.org>" 
mailto:dev@spark.apache.org>>
Subject: Removing the Mesos fine-grained mode

Hi all,

Mesos is the only cluster manager that has a fine-grained mode, but it's more 
often than not problematic, and it's a maintenance burden. I'd like to suggest 
removing it in the 2.0 release.

A few reasons:

- code/maintenance complexity. The two modes duplicate a lot of functionality 
(and sometimes code) that leads to subtle differences or bugs. See 
SPARK-10444 <https://issues.apache.org/jira/browse/SPARK-10444>, this 
thread 
<https://mail-archives.apache.org/mod_mbox/spark-user/201510.mbox/%3CCALxMP-A+aygNwSiyTM8ff20-MGWHykbhct94a2hwZTh1jWHp_g@mail.gmail.com%3E>, 
and MESOS-3202 <https://issues.apache.org/jira/browse/MESOS-3202>
- it's not widely used (Reynold's previous thread 
<http://apache-spark-developers-list.1001551.n3.nabble.com/Please-reply-if-you-use-Mesos-fine-grained-mode-td14930.html> 
got very few responses from people relying on it)
- similar functionality can be achieved with dynamic allocation + 
coarse-grained mode

I suggest that Spark 1.6 already issue a warning if it detects fine-grained 
use, with removal following in the 2.0 release.
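
For readers less familiar with what that replacement setup looks like, here is a 
minimal sketch in Scala (the Mesos master URL and application name are 
placeholders; spark.mesos.coarse, spark.dynamicAllocation.enabled, and 
spark.shuffle.service.enabled are the documented configuration keys) of 
coarse-grained mode combined with dynamic allocation:

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setMaster("mesos://zk://zk1:2181,zk2:2181/mesos")  // placeholder Mesos master URL
  .setAppName("coarse-grained-with-dynamic-allocation")
  .set("spark.mesos.coarse", "true")                  // request coarse-grained mode explicitly
  .set("spark.dynamicAllocation.enabled", "true")     // grow/shrink executors with the workload
  .set("spark.shuffle.service.enabled", "true")       // external shuffle service, needed by dynamic allocation
val sc = new SparkContext(conf)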

Thoughts?

iulian



Specifying Scala types when calling methods from SparkR

2015-12-09 Thread Chris Freeman
Hey everyone,

I’m currently looking at ways to save out SparkML model objects from SparkR and 
I’ve had some luck putting the model into an RDD and then saving the RDD as an 
Object File. Once it’s saved, I’m able to load it back in with something like:

sc.objectFile[LinearRegressionModel]("path/to/model")

I’d like to try and replicate this same process from SparkR using the JVM 
backend APIs (e.g. “callJMethod”), but so far I haven’t been able to replicate 
my success and I’m guessing that it’s (at least in part) due to the necessity 
of specifying the type when calling the objectFile method.

Does anyone know if this is actually possible? For example, here’s what I’ve 
come up with so far:

loadModel <- function(sc, modelPath) {
  modelRDD <- SparkR:::callJMethod(sc,
                                   "objectFile[PipelineModel]",
                                   modelPath,
                                   SparkR:::getMinPartitions(sc, NULL))
  return(modelRDD)
}

Any help is appreciated!

--
Chris Freeman



RE: Specifying Scala types when calling methods from SparkR

2015-12-10 Thread Chris Freeman
Hi Sun Rui,

I’ve had some luck simply using “objectFile” when saving from SparkR directly. 
The problem is that if you do it that way, the model object will only work if 
you continue to use the current Spark Context, and I think model persistence 
should really enable you to use the model at a later time. That’s where I found 
that I could drop down to the JVM level and interact with the Scala object 
directly, but that seems to only work if you specify the type.



On December 9, 2015 at 7:59:43 PM, Sun, Rui 
(rui@intel.com<mailto:rui@intel.com>) wrote:

Hi,

Just use "objectFile" instead of "objectFile[PipelineModel]" for callJMethod. 
You can take the objectFile() in context.R as an example.

Since the SparkContext created in SparkR is actually a JavaSparkContext, there 
is no need to pass the implicit ClassTag.
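
To illustrate both suggestions, here is a minimal Scala-side sketch (assuming a 
live SparkContext sc and a saved spark.mllib LinearRegressionModel; the paths 
are placeholders) of why a reflective caller cannot see "objectFile[PipelineModel]":

import scala.reflect.ClassTag
import org.apache.spark.api.java.JavaSparkContext
import org.apache.spark.mllib.regression.LinearRegressionModel

// Scala API: the compiler fills in the ClassTag implied by the [T] type parameter.
val fromScala = sc.objectFile[LinearRegressionModel]("path/to/model")

// In byte code there is no method named "objectFile[PipelineModel]", only
// objectFile(path, minPartitions)(classTag), so a reflective caller has to
// supply the tag as an ordinary argument.
val tag: ClassTag[LinearRegressionModel] = ClassTag(classOf[LinearRegressionModel])
val fromReflection = sc.objectFile("path/to/model", sc.defaultMinPartitions)(tag)

// Java-friendly API: JavaSparkContext supplies the tag itself, which is why
// calling plain "objectFile" on it works from SparkR.
val jsc = new JavaSparkContext(sc)
val fromJava = jsc.objectFile[LinearRegressionModel]("path/to/model")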

-Original Message-
From: Shivaram Venkataraman [mailto:shiva...@eecs.berkeley.edu]
Sent: Thursday, December 10, 2015 8:21 AM
To: Chris Freeman
Cc: dev@spark.apache.org
Subject: Re: Specifying Scala types when calling methods from SparkR

The SparkR callJMethod can only invoke methods as they show up in the Java byte 
code. So in this case you'll need to check the SparkContext byte code (with 
javap or something like that) to see how that method looks. My guess is the 
type is passed in as a class tag argument, so you'll need to do something like 
create a class tag for the LinearRegressionModel and pass that in as the first 
or last argument etc.

Thanks
Shivaram

On Wed, Dec 9, 2015 at 10:11 AM, Chris Freeman  wrote:
> Hey everyone,
>
> I’m currently looking at ways to save out SparkML model objects from
> SparkR and I’ve had some luck putting the model into an RDD and then
> saving the RDD as an Object File. Once it’s saved, I’m able to load it
> back in with something like:
>
> sc.objectFile[LinearRegressionModel]("path/to/model")
>
> I’d like to try and replicate this same process from SparkR using the
> JVM backend APIs (e.g. “callJMethod”), but so far I haven’t been able
> to replicate my success and I’m guessing that it’s (at least in part)
> due to the necessity of specifying the type when calling the objectFile 
> method.
>
> Does anyone know if this is actually possible? For example, here’s
> what I’ve come up with so far:
>
> loadModel <- function(sc, modelPath) {
>   modelRDD <- SparkR:::callJMethod(sc,
>                                    "objectFile[PipelineModel]",
>                                    modelPath,
>                                    SparkR:::getMinPartitions(sc, NULL))
>   return(modelRDD)
> }
>
> Any help is appreciated!
>
> --
> Chris Freeman
>

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org For additional 
commands, e-mail: dev-h...@spark.apache.org



Re: Switch RDD-based MLlib APIs to maintenance mode in Spark 2.0

2016-04-05 Thread Chris Fregly
perhaps renaming to Spark ML would actually clear up code and documentation 
confusion?

+1 for rename 

> On Apr 5, 2016, at 7:00 PM, Reynold Xin  wrote:
> 
> +1
> 
> This is a no brainer IMO.
> 
> 
>> On Tue, Apr 5, 2016 at 7:32 PM, Joseph Bradley  wrote:
>> +1  By the way, the JIRA for tracking (Scala) API parity is: 
>> https://issues.apache.org/jira/browse/SPARK-4591
>> 
>>> On Tue, Apr 5, 2016 at 4:58 PM, Matei Zaharia  
>>> wrote:
>>> This sounds good to me as well. The one thing we should pay attention to is 
>>> how we update the docs so that people know to start with the spark.ml 
>>> classes. Right now the docs list spark.mllib first and also seem more 
>>> comprehensive in that area than in spark.ml, so maybe people naturally move 
>>> towards that.
>>> 
>>> Matei
>>> 
 On Apr 5, 2016, at 4:44 PM, Xiangrui Meng  wrote:
 
 Yes, DB (cc'ed) is working on porting the local linear algebra library 
 over (SPARK-13944). There are also frequent pattern mining algorithms we 
 need to port over in order to reach feature parity. -Xiangrui
 
> On Tue, Apr 5, 2016 at 12:08 PM Shivaram Venkataraman 
>  wrote:
> Overall this sounds good to me. One question I have is that in
> addition to the ML algorithms we have a number of linear algebra
> (various distributed matrices) and statistical methods in the
> spark.mllib package. Is the plan to port or move these to the spark.ml
> namespace in the 2.x series ?
> 
> Thanks
> Shivaram
> 
> On Tue, Apr 5, 2016 at 11:48 AM, Sean Owen  wrote:
> > FWIW, all of that sounds like a good plan to me. Developing one API is
> > certainly better than two.
> >
> > On Tue, Apr 5, 2016 at 7:01 PM, Xiangrui Meng  wrote:
> >> Hi all,
> >>
> >> More than a year ago, in Spark 1.2 we introduced the ML pipeline API 
> >> built
> >> on top of Spark SQL’s DataFrames. Since then the new DataFrame-based 
> >> API has
> >> been developed under the spark.ml package, while the old RDD-based API 
> >> has
> >> been developed in parallel under the spark.mllib package. While it was
> >> easier to implement and experiment with new APIs under a new package, 
> >> it
> >> became harder and harder to maintain as both packages grew bigger and
> >> bigger. And new users are often confused by having two sets of APIs 
> >> with
> >> overlapped functions.
> >>
> >> We started to recommend the DataFrame-based API over the RDD-based API 
> >> in
> >> Spark 1.5 for its versatility and flexibility, and we saw the 
> >> development
> >> and the usage gradually shifting to the DataFrame-based API. Just 
> >> counting
> >> the lines of Scala code, from 1.5 to the current master we added ~1
> >> lines to the DataFrame-based API while ~700 to the RDD-based API. So, 
> >> to
> >> gather more resources on the development of the DataFrame-based API 
> >> and to
> >> help users migrate over sooner, I want to propose switching RDD-based 
> >> MLlib
> >> APIs to maintenance mode in Spark 2.0. What does it mean exactly?
> >>
> >> * We do not accept new features in the RDD-based spark.mllib package, 
> >> unless
> >> they block implementing new features in the DataFrame-based spark.ml
> >> package.
> >> * We still accept bug fixes in the RDD-based API.
> >> * We will add more features to the DataFrame-based API in the 2.x 
> >> series to
> >> reach feature parity with the RDD-based API.
> >> * Once we reach feature parity (possibly in Spark 2.2), we will 
> >> deprecate
> >> the RDD-based API.
> >> * We will remove the RDD-based API from the main Spark repo in Spark 
> >> 3.0.
> >>
> >> Though the RDD-based API is already in de facto maintenance mode, this
> >> announcement will make it clear and hence important to both MLlib 
> >> developers
> >> and users. So we’d greatly appreciate your feedback!
> >>
> >> (As a side note, people sometimes use “Spark ML” to refer to the
> >> DataFrame-based API or even the entire MLlib component. This also 
> >> causes
> >> confusion. To be clear, “Spark ML” is not an official name and there 
> >> are no
> >> plans to rename MLlib to “Spark ML” at this time.)
> >>
> >> Best,
> >> Xiangrui
> >
> > -
> > To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> > For additional commands, e-mail: user-h...@spark.apache.org
> >
> 


Re: Creating Spark Extras project, was Re: SPARK-13843 and future of streaming backends

2016-04-15 Thread Chris Fregly
> >> Depending on the interest here, we could follow the steps of (Apache
> >> Arrow) and start this directly as a TLP, or start as an incubator
> project. I
> >> would consider the first option first.
> >>
> >> Who would participate
> >>
> >> Have thought about this for a bit, and if we go to the direction of
> TLP, I
> >> would say Spark Committers and Apache Members can request to
> participate as
> >> PMC members, while other committers can request to become committers.
> Non
> >> committers would be added based on meritocracy after the start of the
> >> project.
> >>
> >> Project Name
> >>
> >> It would be ideal if we could have a project name that shows close ties
> to
> >> Spark (e.g. Spark Extras or Spark Connectors) but we will need
> permission
> >> and support from whoever is going to evaluate the project proposal (e.g.
> >> Apache Board)
> >>
> >>
> >> Thoughts ?
> >>
> >> Does anyone have any big disagreement or objection to moving into this
> >> direction ?
> >>
> >> Otherwise, who would be interested in joining the project, so I can
> start
> >> working on some concrete proposal ?
> >>
> >>
> >
> >
> >
> >
> > --
> > Luciano Resende
> > http://twitter.com/lresende1975
> > http://lresende.blogspot.com/
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>
>


-- 

*Chris Fregly*
Principal Data Solutions Engineer
IBM Spark Technology Center, San Francisco, CA
http://spark.tc | http://advancedspark.com


Re: Return binary mode in ThriftServer

2016-06-14 Thread Chris Fregly
+1 on bringing it back. It was causing all sorts of problems on my end that 
were not obvious without digging in.

I was having problems building Spark, as well, with the --hive-thriftserver 
flag, and also thought I was doing something wrong on my end.

> On Jun 13, 2016, at 9:11 PM, Reynold Xin  wrote:
> 
> Thanks for the email. Things like this (and bugs) are exactly the reason the 
> preview releases exist. It seems like enough people have run into problem 
> with this one that maybe we should just bring it back for backward 
> compatibility. 
> 
>> On Monday, June 13, 2016, Egor Pahomov  wrote:
>> In May due to the SPARK-15095 binary mode was "removed" (code is there, but 
>> you can not turn it on) from Spark-2.0. In 1.6.1 binary was default and in 
>> 2.0.0-preview it was removed. It's really annoying: 
>> I can not use Tableau+Spark anymore
>> I need to change connection URL in SQL client for every analyst in my 
>> organization. And with Squirrel I experiencing problems with that.
>> We have parts of infrastructure, which connected to data infrastructure 
>> though ThriftServer. And of course format was binary.
>> I've created a ticket to get binary 
>> back(https://issues.apache.org/jira/browse/SPARK-15934), but that's not the 
>> point. I've experienced this problem a month ago, but haven't done anything 
>> about it, because I believed, that I'm stupid and doing something wrong. But 
>> documentation was release recently and it contained no information about 
>> this new thing and it made me digging. 
>> 
>> Most of what I describe is just annoying, but Tableau+Spark new 
>> incompatibility I believe is big deal. Maybe I'm wrong and there are ways to 
>> make things work, it's just I wouldn't expect move to 2.0.0 to be so time 
>> consuming. 
>> 
>> My point: Do we have any guidelines regarding doing such radical things?
>> 
>> -- 
>> Sincerely yours
>> Egor Pakhomov
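
To make the binary vs. HTTP transport difference above concrete, here is a 
minimal sketch (assuming the standard Hive JDBC driver is on the classpath; the 
host name, user, ports, and httpPath are illustrative defaults and may differ 
per deployment) of the connection URL change that every SQL client has to absorb:

import java.sql.{Connection, DriverManager}

// Binary transport mode (the pre-2.0 default): a plain host:port URL.
val binaryConn: Connection = DriverManager.getConnection(
  "jdbc:hive2://thrift-host:10000/default", "analyst", "")

// HTTP transport mode: extra session parameters in the URL, which is the change
// every BI tool / SQL client connection string has to pick up.
val httpConn: Connection = DriverManager.getConnection(
  "jdbc:hive2://thrift-host:10001/default;transportMode=http;httpPath=cliservice",
  "analyst", "")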


Re: [VOTE] Release Apache Spark 2.0.0 (RC1)

2016-06-22 Thread Chris Fregly
+1 for 0.10 support.  this is huge.

On Wed, Jun 22, 2016 at 8:17 AM, Cody Koeninger  wrote:

> Luciano knows there are publicly available examples of how to use the
> 0.10 connector, including TLS support, because he asked me about it
> and I gave him a link
>
>
> https://github.com/koeninger/kafka-exactly-once/blob/kafka-0.9/src/main/scala/example/TlsStream.scala
>
> If any committer at any time had said "I'd accept this PR, if only it
> included X", I'd be happy to provide X.  Documentation updates and
> python support for the 0.8 direct stream connector were done after the
> original PR.
>
>
>
> On Wed, Jun 22, 2016 at 9:55 AM, Luciano Resende 
> wrote:
> >
> >
> > On Wed, Jun 22, 2016 at 7:46 AM, Cody Koeninger 
> wrote:
> >>
> >> As far as I know the only thing blocking it at this point is lack of
> >> committer review / approval.
> >>
> >> It's technically adding a new feature after spark code-freeze, but it
> >> doesn't change existing code, and the kafka project didn't release
> >> 0.10 until the end of may.
> >>
> >
> >
> > To be fair with the Kafka 0.10 PR assessment:
> >
> > I was expecting a somewhat easy transition for customers moving from the
> > 0.8 connector to the 0.10 connector, but the 0.10 one seems to have been
> > treated as a completely new extension; also, there is no Python support,
> > no samples on the PR demonstrating how to use security capabilities, and
> > no documentation updates.
> >
> > Thanks
> >
> > --
> > Luciano Resende
> > http://twitter.com/lresende1975
> > http://lresende.blogspot.com/
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>
>


-- 
*Chris Fregly*
Research Scientist @ PipelineIO
San Francisco, CA
pipeline.io
advancedspark.com


Re: Data Locality In Spark

2014-08-19 Thread Chris Fregly
and even the same process where the data might be cached.


these are the different locality levels:

PROCESS_LOCAL
NODE_LOCAL
RACK_LOCAL
ANY

relevant code:
https://github.com/apache/spark/blob/7712e724ad69dd0b83754e938e9799d13a4d43b9/core/src/test/scala/org/apache/spark/scheduler/TaskSetManagerSuite.scala#L150

https://github.com/apache/spark/blob/63bdb1f41b4895e3a9444f7938094438a94d3007/core/src/main/scala/org/apache/spark/scheduler/TaskSchedulerImpl.scala#L250

relevant docs:
see the spark.locality configuration attributes here:
https://spark.apache.org/docs/latest/configuration.html
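
As a concrete illustration, here is a minimal sketch in Scala (the keys are the 
spark.locality.* attributes from the configuration page above; the master URL 
and the values, in milliseconds, are purely illustrative) of tuning how long the 
scheduler waits at each locality level before falling back to the next:

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setMaster("local[4]")                       // placeholder master, just so the sketch runs
  .setAppName("locality-wait-tuning")
  .set("spark.locality.wait", "3000")          // base wait (ms) before dropping a locality level
  .set("spark.locality.wait.process", "3000")  // PROCESS_LOCAL -> NODE_LOCAL
  .set("spark.locality.wait.node", "3000")     // NODE_LOCAL -> RACK_LOCAL
  .set("spark.locality.wait.rack", "3000")     // RACK_LOCAL -> ANY
val sc = new SparkContext(conf)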


On Tue, Jul 8, 2014 at 1:13 PM, Sandy Ryza  wrote:

> Hi Anish,
>
> Spark, like MapReduce, makes an effort to schedule tasks on the same nodes
> and racks that the input blocks reside on.
>
> -Sandy
>
>
> On Tue, Jul 8, 2014 at 12:27 PM, anishs...@yahoo.co.in <
> anishs...@yahoo.co.in> wrote:
>
> > Hi All
> >
> > My apologies for very basic question, do we have full support of data
> > locality in Spark MapReduce.
> >
> > Please suggest.
> >
> > --
> > Anish Sneh
> > "Experience is the best teacher."
> > http://in.linkedin.com/in/anishsneh
> >
> >
>


Re: [VOTE] Release Spark 3.5.5 (RC1)

2025-02-26 Thread Chris Nauroth
+1 (non-binding)

* Verified all checksums.
* Verified all signatures.
* Built from source, with multiple profiles, to full success:
* build/mvn -T 8 -Phadoop-cloud -Phive-thriftserver -Pkubernetes -Pyarn
-DskipTests clean package
* Tests passed.
* Ran several examples successfully:
* bin/spark-submit --class org.apache.spark.examples.SparkPi
examples/jars/spark-examples_2.13-3.5.5.jar
* bin/spark-submit --class
org.apache.spark.examples.sql.hive.SparkHiveExample
examples/jars/spark-examples_2.13-3.5.5.jar
* bin/spark-submit
examples/src/main/python/streaming/network_wordcount.py localhost 

Chris Nauroth


On Wed, Feb 26, 2025 at 2:13 PM Sakthi  wrote:

> +1 (non-binding)
>
> On Wed, Feb 26, 2025 at 9:20 AM Dongjoon Hyun  wrote:
>
>> Thank you for the explicit casting, Vlad.
>>
>> BTW, the Apache Spark 3.5.5 RC1 vote will continue like all the previous
>> Apache Spark RC votes because there is no agreement on those test resources
>> (jar files) among Apache Spark PMC members. The following discussion thread
>> will lead us subsequent community decisions in 2025. I believe we need an
>> official and independent vote on them to finalize it and to pave our path
>> forward.
>>
>> https://lists.apache.org/thread/0ro5yn6lbbpmvmqp2px3s2pf7cwljlc4
>> ([DISCUSS] SPARK-51318: Remove `jar` files from Apache Spark repository
>> and disable affected tests)
>>
>> Dongjoon.
>>
>> On 2025/02/26 16:41:51 "Rozov, Vlad" wrote:
>> > -0 (non-binding).
>> >
>> > IMO, it will be good to address
>> https://issues.apache.org/jira/browse/SPARK-51318 to avoid legal issues
>> and meet ASF source release policy.
>> >
>> > Thank you,
>> >
>> > Vlad
>> >
>> > On Feb 25, 2025, at 2:51 AM, Kent Yao  wrote:
>> >
>> > +1
>> >
>> > Kent
>> >
>> > On 2025/02/25 10:26:38 Max Gekk wrote:
>> > +1, since SPARK-51281 is not a release blocker.
>> >
>> > On Mon, Feb 24, 2025 at 7:12 AM Yang Jie  wrote:
>> >
>> > +1
>> >
>> > On 2025/02/24 04:04:22 Dongjoon Hyun wrote:
>> > Thank you for your voting.
>> >
>> > I have been aware of SPARK-51281 since Wenchen pinged me three days ago.
>> >
>> > I just thought it was not going to be ready because the PR was idle for
>> > the
>> > last two days after my comment.
>> >
>> > Since SPARK-51281 is not a release blocker of Apache Spark 3.5.5 RC1
>> vote
>> > because
>> > - it's not a regression at 3.5.5 (as Wenchen mentioned) and
>> > - it didn't have a proper target version and priority field.
>> >
>> > I don't think we should stop this vote.
>> >
>> > SPARK-51281 can be a part of Apache Spark 3.5.6 release.
>> >
>> > I'll keep this vote open.
>> >
>> > Thanks,
>> > Dongjoon.
>> >
>> >
>> > On Sun, Feb 23, 2025 at 6:46 PM Wenchen Fan 
>> wrote:
>> >
>> > -0 as I just found a long-standing correctness bug:
>> > https://github.com/apache/spark/pull/50040
>> >
>> > It's not a regression in 3.5 so technically it's not a release blocker,
>> > but it's better to include it as we are just about to release 3.5.5.
>> >
>> > On Mon, Feb 24, 2025 at 9:11 AM Mich Talebzadeh <
>> > mich.talebza...@gmail.com>
>> > wrote:
>> >
>> > +1 on the basis of Dongjoon statement which I trust
>> >
>> > HTH
>> >
>> > Dr Mich Talebzadeh,
>> > Architect | Data Science | Financial Crime | Forensic Analysis | GDPR
>> >
>> >   view my Linkedin profile
>> > <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>> >
>> >
>> >
>> >
>> >
>> > On Mon, 24 Feb 2025 at 00:47, Dongjoon Hyun 
>> > wrote:
>> >
>> > I'll start with my +1.
>> >
>> > I have thoroughly verified all test results, signatures, checksums,
>> > and
>> > the recently deprecated configuration.
>> >
>> > Dongjoon.
>> >
>> > On 2025/02/24 00:37:57 Dongjoon Hyun wrote:
>> > Please vote on releasing the following candidate as Apache Spark
>> > version
>> > 3.5.5.
>> >
>> > The vote is open until February 27th 1AM (PST) and passes if a
>> > majority +1
>> > PMC votes are cast, with a minimum of 3 +1 votes.
>> >
>> > [ ] +1 Release this package as Apach

Re: [VOTE] Release Spark 4.0.0 (RC2)

2025-03-04 Thread Chris Nauroth
-1 (non-binding)

I think I found some missing license information in the binary
distribution. We may want to include this in the next RC:

https://github.com/apache/spark/pull/50158

Thank you for putting together this RC, Wenchen.

Chris Nauroth


On Mon, Mar 3, 2025 at 6:10 AM Wenchen Fan  wrote:

> Thanks for bringing up these blockers! I know RC2 isn’t fully ready yet,
> but with over 70 commits since RC1, it’s time to have a new RC so people
> can start testing the latest changes. Please continue testing and keep the
> feedback coming!
>
> On Mon, Mar 3, 2025 at 6:06 PM beliefer  wrote:
>
>> -1
>>
>> https://github.com/apache/spark/pull/50112 should be merged before
>> release.
>>
>>
>>
>> At 2025-03-01 15:25:06, "Wenchen Fan"  wrote:
>>
>> Please vote on releasing the following candidate as Apache Spark version
>> 4.0.0.
>>
>> The vote is open until March 5 (PST) and passes if a majority +1 PMC
>> votes are cast, with a minimum of 3 +1 votes.
>>
>> [ ] +1 Release this package as Apache Spark 4.0.0
>> [ ] -1 Do not release this package because ...
>>
>> To learn more about Apache Spark, please see https://spark.apache.org/
>>
>> The tag to be voted on is v4.0.0-rc2 (commit
>> 85188c07519ea809012db24421714bb75b45ab1b)
>> https://github.com/apache/spark/tree/v4.0.0-rc2
>>
>> The release files, including signatures, digests, etc. can be found at:
>> https://dist.apache.org/repos/dist/dev/spark/v4.0.0-rc2-bin/
>>
>> Signatures used for Spark RCs can be found in this file:
>> https://dist.apache.org/repos/dist/dev/spark/KEYS
>>
>> The staging repository for this release can be found at:
>> https://repository.apache.org/content/repositories/orgapachespark-1478/
>>
>> The documentation corresponding to this release can be found at:
>> https://dist.apache.org/repos/dist/dev/spark/v4.0.0-rc2-docs/
>>
>> The list of bug fixes going into 4.0.0 can be found at the following URL:
>> https://issues.apache.org/jira/projects/SPARK/versions/12353359
>>
>> This release is using the release script of the tag v4.0.0-rc2.
>>
>> FAQ
>>
>> =
>> How can I help test this release?
>> =
>>
>> If you are a Spark user, you can help us test this release by taking
>> an existing Spark workload and running on this release candidate, then
>> reporting any regressions.
>>
>> If you're working in PySpark you can set up a virtual env and install
>> the current RC and see if anything important breaks, in the Java/Scala
>> you can add the staging repository to your projects resolvers and test
>> with the RC (make sure to clean up the artifact cache before/after so
>> you don't end up building with a out of date RC going forward).
>>
>>


Re: [VOTE] Release Spark 4.0.0 (RC2)

2025-03-05 Thread Chris Nauroth
Here is one more problem I found during RC2 verification:

https://github.com/apache/spark/pull/50173

This one is just a test issue.

Chris Nauroth


On Tue, Mar 4, 2025 at 2:55 PM Jules Damji  wrote:

> - 1 (non-binding)
>
> I ran into a number of installation and launching problems. Maybe it’s my
> environment, even though I removed any old binaries and packages.
>
> 1. Pip installing pyspark 4.0.0 and pyspark-connect 4.0 from the .tz file
> worked; launching pyspark results in
>
> 25/03/04 14:00:26 ERROR SparkContext: Error initializing SparkContext.
>
> java.lang.ClassNotFoundException:
> org.apache.spark.sql.connect.SparkConnectPlugin
>
>
> 2. Similarly, installing the tarballs of either distribution and launching
> spark-shell goes into a loop and is terminated by the shutdown hook.
>
>
> Thank you, Wenchen, for leading these onerous release manager efforts, but
> over time we should be able to install and launch seamlessly.
>
>
> Keep up the good work & tireless effort for the Spark community!
>
>
> cheers
>
> Jules
>
>
> WARNING: Using incubator modules: jdk.incubator.vector
>
> 25/03/04 14:49:35 INFO BaseAllocator: Debug mode disabled. Enable with the
> VM option -Darrow.memory.debug.allocator=true.
>
> 25/03/04 14:49:35 INFO DefaultAllocationManagerOption: allocation manager
> type not specified, using netty as the default type
>
> 25/03/04 14:49:35 INFO CheckAllocator: Using DefaultAllocationManager at
> memory/netty/DefaultAllocationManagerFactory.class
>
> Using Spark's default log4j profile:
> org/apache/spark/log4j2-defaults.properties
>
> 25/03/04 14:49:35 WARN GrpcRetryHandler: Non-Fatal error during RPC
> execution: org.sparkproject.io.grpc.StatusRuntimeException: UNAVAILABLE: io
> exception, retrying (wait=50 ms, currentRetryNum=1, policy=DefaultPolicy).
>
> 25/03/04 14:49:35 WARN GrpcRetryHandler: Non-Fatal error during RPC
> execution: org.sparkproject.io.grpc.StatusRuntimeException: UNAVAILABLE: io
> exception, retrying (wait=200 ms, currentRetryNum=2, policy=DefaultPolicy).
>
> 25/03/04 14:49:35 WARN GrpcRetryHandler: Non-Fatal error during RPC
> execution: org.sparkproject.io.grpc.StatusRuntimeException: UNAVAILABLE: io
> exception, retrying (wait=800 ms, currentRetryNum=3, policy=DefaultPolicy).
>
> 25/03/04 14:49:36 WARN GrpcRetryHandler: Non-Fatal error during RPC
> execution: org.sparkproject.io.grpc.StatusRuntimeException: UNAVAILABLE: io
> exception, retrying (wait=3275 ms, currentRetryNum=4, policy=DefaultPolicy).
>
> 25/03/04 14:49:39 WARN GrpcRetryHandler: Non-Fatal error during RPC
> execution: org.sparkproject.io.grpc.StatusRuntimeException: UNAVAILABLE: io
> exception, retrying (wait=12995 ms, currentRetryNum=5,
> policy=DefaultPolicy).
>
> ^C25/03/04 14:49:40 INFO ShutdownHookManager: Shutdown hook called
>
>
>
> On Mar 4, 2025, at 2:24 PM, Chris Nauroth  wrote:
>
> -1 (non-binding)
>
> I think I found some missing license information in the binary
> distribution. We may want to include this in the next RC:
>
> https://github.com/apache/spark/pull/50158
>
> Thank you for putting together this RC, Wenchen.
>
> Chris Nauroth
>
>
> On Mon, Mar 3, 2025 at 6:10 AM Wenchen Fan  wrote:
>
>> Thanks for bringing up these blockers! I know RC2 isn’t fully ready yet,
>> but with over 70 commits since RC1, it’s time to have a new RC so people
>> can start testing the latest changes. Please continue testing and keep the
>> feedback coming!
>>
>> On Mon, Mar 3, 2025 at 6:06 PM beliefer  wrote:
>>
>>> -1
>>> https://github.com/apache/spark/pull/50112 should be merged before
>>> release.
>>>
>>>
>>> At 2025-03-01 15:25:06, "Wenchen Fan"  wrote:
>>>
>>> Please vote on releasing the following candidate as Apache Spark version
>>> 4.0.0.
>>>
>>> The vote is open until March 5 (PST) and passes if a majority +1 PMC
>>> votes are cast, with a minimum of 3 +1 votes.
>>>
>>> [ ] +1 Release this package as Apache Spark 4.0.0
>>> [ ] -1 Do not release this package because ...
>>>
>>> To learn more about Apache Spark, please see https://spark.apache.org/
>>>
>>> The tag to be voted on is v4.0.0-rc2 (commit
>>> 85188c07519ea809012db24421714bb75b45ab1b)
>>> https://github.com/apache/spark/tree/v4.0.0-rc2
>>>
>>> The release files, including signatures, digests, etc. can be found at:
>>> https://dist.apache.org/repos/dist/dev/spark/v4.0.0-rc2-bin/
>>>
>>> Signatures used for Spark RCs can be found in this file:
>>> https://dist.apache.org/repos/dist/dev/spark/KEYS

SciSpark: NASA AIST14 proposal

2015-01-14 Thread Mattmann, Chris A (3980)
Hi Spark Devs,

Just wanted to FYI that I was funded on a 2-year NASA proposal
to build out the concept of a scientific RDD (created by space/time
and other operations) for use in some neat climate-related NASA
use cases.

http://esto.nasa.gov/files/solicitations/AIST_14/ROSES2014_AIST_A41_awards.html


I will keep everyone posted and plan on interacting with the list
over here to get it done. I expect that we’ll start work in March.
In the meanwhile you guys can scope the abstract at the link provided.
Happy
to chat about it if you have any questions too.

Cheers!

Chris

++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattm...@nasa.gov
WWW:  http://sunset.usc.edu/~mattmann/
++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++






FW: Trouble posting to the list

2015-02-13 Thread Mattmann, Chris A (3980)
FYI

++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattm...@nasa.gov
WWW:  http://sunset.usc.edu/~mattmann/
++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++






-Original Message-
From: Dima Zhiyanov 
Date: Thursday, February 12, 2015 at 7:04 AM
To: "user-ow...@spark.apache.org" 
Subject: Trouble posting to the list

>Hello
>
>After numerous attempts I am still unable to post to the list. After I
>click Subscribe I do not get an e-mail which allows me to confirm my
>subscription. Could you please add me manually?
>
>Thanks a lot
>Dima
>
>Sent from my iPhone


-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Fwd: posts are not accepted

2015-07-23 Thread Mattmann, Chris A (3980)


Sent from my iPhone

Begin forwarded message:

From: Rob Sargent mailto:rob.sarg...@utah.edu>>
Date: July 23, 2015 at 1:14:04 PM PDT
To: mailto:user-ow...@spark.apache.org>>
Subject: posts are not accepted

Hello,

my user name is iceback and my email is 
rob.sarg...@utah.edu.

There seems to be a problem with my account as my posts are never accepted.

Any information would be appreciated,

rjs



Re: SPARK-13843 and future of streaming backends

2016-03-18 Thread Mattmann, Chris A (3980)
Hi Marcelo,

Thanks for your reply. As a committer on the project, you *can* VETO
code. For sure. Unfortunately you don’t have a binding vote on adding
new PMC members/committers, and/or on releasing the software, but do
have the ability to VETO.

That said, if that’s not your intent, sorry for misreading your intent.

Cheers,
Chris

++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattm...@nasa.gov
WWW:  http://sunset.usc.edu/~mattmann/
++
Director, Information Retrieval and Data Science Group (IRDS)
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
WWW: http://irds.usc.edu/
++





-Original Message-
From: Marcelo Vanzin 
Date: Friday, March 18, 2016 at 3:24 PM
To: jpluser 
Cc: "dev@spark.apache.org" 
Subject: Re: SPARK-13843 and future of streaming backends

>On Fri, Mar 18, 2016 at 2:12 PM, chrismattmann 
>wrote:
>> So, my comment here is that any code *cannot* be removed from an Apache
>> project if there is a VETO issued which so far I haven't seen, though
>>maybe
>> Marcelo can clarify that.
>
>No, my intention was not to veto the change. I'm actually for the
>removal of components if the community thinks they don't add much to
>the project. (I'm also not sure I can even veto things, not being a
>PMC member.)
>
>I mainly wanted to know what was the path forward for those components
>because, with Cloudera's hat on, we care about one of them (streaming
>integration with flume), and we'd prefer if that code remained under
>the ASF umbrella in some way.
>
>-- 
>Marcelo



Re: Creating Spark Extras project, was Re: SPARK-13843 and future of streaming backends

2016-04-15 Thread Mattmann, Chris A (3980)
Yeah, so it’s the *Apache Spark* project. Just to clarify.
Not once did you say Apache Spark below.






On 4/15/16, 9:50 AM, "Sean Owen"  wrote:

>On Fri, Apr 15, 2016 at 5:34 PM, Luciano Resende  wrote:
>> I know the name might be confusing, but I also think that the projects have
>> a very big synergy, more like sibling projects, where "Spark Extras" extends
>> the Spark community and develop/maintain components for, and pretty much
>> only for, Apache Spark.  Based on your comment above, if making the project
>> "Spark-Extras" a more acceptable name, I believe this is ok as well.
>
>This also grants special status to a third-party project. It's not
>clear this should be *the* official unofficial third-party Spark
>project over some other one. If something's to be blessed, it should
>be in the Spark project.
>
>And why isn't it in the Spark project? the argument was that these
>bits were not used and pretty de minimis as code. It's not up to me or
>anyone else to tell you code X isn't useful to you. But arguing X
>should be a TLP asserts it is substantial and of broad interest, since
>there's non-zero effort for volunteers to deal with it. I am not sure
>I've heard anyone argue that -- or did I miss it? because removing
>bits of unused code happens all the time and isn't a bad precedent or
>even unusual.
>
>It doesn't actually enable any more cooperation than is already
>possible with any other project (like Kafka, Mesos, etc). You can run
>the same governance model anywhere you like. I realize literally being
>operated under the ASF banner is something different.
>
>What I hear here is a proposal to make an unofficial official Spark
>project as a TLP, that begins with these fairly inconsequential
>extras. I question the value of that on its face. Example: what goes
>into this project? deleted Spark code only? or is this a glorified
>"contrib" folder with a lower and somehow different bar determined by
>different people?
>
>And at that stage... is it really helping to give that special status?

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: Creating Spark Extras project, was Re: SPARK-13843 and future of streaming backends

2016-04-15 Thread Mattmann, Chris A (3980)
Hey Reynold,

Thanks. Getting to the heart of this, I think that this project would
be successful if the Apache Spark PMC decided to participate and there
was some overlap. As much as I think it would be great to stand up another
project, the goal here from Luciano and crew (myself included) would be
to suggest it’s just as easy to start an Apache Incubator project to 
manage “extra” pieces of Apache Spark code outside of the release cycle
and the other reasons stated that it made sense to move this code out of
the code base. This isn’t a competing effort to some code on GitHub that
was moved out of Apache source control from Apache Spark - it’s meant to 
be an enabler to suggest that code could be managed here just as easily
(see the difference?)

Let me know what you think thanks Reynold.

Cheers,
Chris

++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattm...@nasa.gov
WWW:  http://sunset.usc.edu/~mattmann/
++
Director, Information Retrieval and Data Science Group (IRDS)
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
WWW: http://irds.usc.edu/
++









On 4/15/16, 9:47 AM, "Reynold Xin"  wrote:

>
>
>
>Anybody is free and welcomed to create another ASF project, but I don't think 
>"Spark extras" is a good name. It unnecessarily creates another tier of code 
>that ASF is "endorsing".
>On Friday, April 15, 2016, Mattmann, Chris A (3980) 
> wrote:
>
>Yeah in support of this statement I think that my primary interest in
>this Spark Extras and the good work by Luciano here is that anytime we
>take bits out of a code base and “move it to GitHub” I see a bad precedent
>being set.
>
>Creating this project at the ASF creates a synergy between *Apache Spark*
>which is *at the ASF*.
>
>We welcome comments and as Luciano said, this is meant to invite and be
>open to those in the Apache Spark PMC to join and help.
>
>Cheers,
>Chris
>
>++
>Chris Mattmann, Ph.D.
>Chief Architect
>Instrument Software and Science Data Systems Section (398)
>NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
>Office: 168-519, Mailstop: 168-527
>Email: 
>chris.a.mattm...@nasa.gov 
>WWW:  http://sunset.usc.edu/~mattmann/
>++
>Director, Information Retrieval and Data Science Group (IRDS)
>Adjunct Associate Professor, Computer Science Department
>University of Southern California, Los Angeles, CA 90089 USA
>WWW: http://irds.usc.edu/
>++
>
>
>
>
>
>
>
>
>
>
>On 4/15/16, 9:39 AM, "Luciano Resende" > 
>wrote:
>
>>
>>
>>On Fri, Apr 15, 2016 at 9:34 AM, Cody Koeninger
>>> wrote:
>>
>>Given that not all of the connectors were removed, I think this
>>creates a weird / confusing three tier system
>>
>>1. connectors in the official project's spark/extras or spark/external
>>2. connectors in "Spark Extras"
>>3. connectors in some random organization's github
>>
>>
>>
>>
>>
>>
>>
>>Agree Cody, and I think this is one of the goals of "Spark Extras", 
>>centralize the development of these connectors under one central place at 
>>Apache, and that's why one of our asks is to invite the Spark PMC to continue 
>>developing the remaining connectors
>> that stayed in Spark proper, in "Spark Extras". We will also discuss some 
>> process policies on enabling lowering the bar to allow proposal of these 
>> other github extensions to be part of "Spark Extras" while also considering 
>> a way to move code to a maintenance
>> mode location.
>>
>>
>>
>>
>>--
>>Luciano Resende
>>http://twitter.com/lresende1975
>>http://lresende.blogspot.com/
>>
>>
>>
>>
>
>
>


Re: Creating Spark Extras project, was Re: SPARK-13843 and future of streaming backends

2016-04-15 Thread Mattmann, Chris A (3980)
Yeah in support of this statement I think that my primary interest in
this Spark Extras and the good work by Luciano here is that anytime we
take bits out of a code base and “move it to GitHub” I see a bad precedent
being set.

Creating this project at the ASF creates a synergy between *Apache Spark*
which is *at the ASF*.

We welcome comments and as Luciano said, this is meant to invite and be
open to those in the Apache Spark PMC to join and help.

Cheers,
Chris

++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattm...@nasa.gov
WWW:  http://sunset.usc.edu/~mattmann/
++
Director, Information Retrieval and Data Science Group (IRDS)
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
WWW: http://irds.usc.edu/
++










On 4/15/16, 9:39 AM, "Luciano Resende"  wrote:

>
>
>On Fri, Apr 15, 2016 at 9:34 AM, Cody Koeninger 
> wrote:
>
>Given that not all of the connectors were removed, I think this
>creates a weird / confusing three tier system
>
>1. connectors in the official project's spark/extras or spark/external
>2. connectors in "Spark Extras"
>3. connectors in some random organization's github
>
>
>
>
>
>
>
>Agree Cody, and I think this is one of the goals of "Spark Extras", centralize 
>the development of these connectors under one central place at Apache, and 
>that's why one of our asks is to invite the Spark PMC to continue developing 
>the remaining connectors
> that stayed in Spark proper, in "Spark Extras". We will also discuss some 
> process policies on enabling lowering the bar to allow proposal of these 
> other github extensions to be part of "Spark Extras" while also considering a 
> way to move code to a maintenance
> mode location.
>
> 
>
>
>-- 
>Luciano Resende
>http://twitter.com/lresende1975
>http://lresende.blogspot.com/
>
>
>
>


Re: [jira] [Commented] (SPARK-732) Recomputation of RDDs may result in duplicated accumulator updates

2014-03-29 Thread Mattmann, Chris A (3980)
Guys I fixed this by adding j...@apache.org to the mailing list, no
more moderation required.

Cheers,
Chris





-Original Message-
From: "ASF GitHub Bot   (JIRA)" 
Reply-To: "dev@spark.apache.org" 
Date: Saturday, March 29, 2014 10:14 AM
To: "dev@spark.apache.org" 
Subject: [jira] [Commented] (SPARK-732) Recomputation of RDDs may result
in duplicated accumulator updates

>
>[ 
>https://issues.apache.org/jira/browse/SPARK-732?page=com.atlassian.jira.pl
>ugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13954345#comm
>ent-13954345 ] 
>
>ASF GitHub Bot commented on SPARK-732:
>--
>
>Github user AmplabJenkins commented on the pull request:
>
>https://github.com/apache/spark/pull/228#issuecomment-39002053
>  
>Merged build finished. Build is starting -or- tests failed to
>complete.
>
>
>> Recomputation of RDDs may result in duplicated accumulator updates
>> --
>>
>> Key: SPARK-732
>> URL: https://issues.apache.org/jira/browse/SPARK-732
>> Project: Apache Spark
>>  Issue Type: Bug
>>Affects Versions: 0.7.0, 0.6.2, 0.7.1, 0.8.0, 0.7.2, 0.7.3, 0.8.1,
>>0.9.0, 0.8.2
>>Reporter: Josh Rosen
>>Assignee: Nan Zhu
>> Fix For: 1.0.0
>>
>>
>> Currently, Spark doesn't guard against duplicated updates to the same
>>accumulator due to recomputations of an RDD.  For example:
>> {code}
>> val acc = sc.accumulator(0)
>> data.map { x => acc += 1; f(x) }
>> data.count()
>> // acc should equal data.count() here
>> data.foreach{...}
>> // Now, acc = 2 * data.count() because the map() was recomputed.
>> {code}
>> I think that this behavior is incorrect, especially because this
>>behavior allows the additon or removal of a cache() call to affect the
>>outcome of a computation.
>> There's an old TODO to fix this duplicate update issue in the
>>[DAGScheduler 
>>code|https://github.com/mesos/spark/blob/ec5e553b418be43aa3f0ccc24e0d5ca9
>>d63504b2/core/src/main/scala/spark/scheduler/DAGScheduler.scala#L494].
>> I haven't tested whether recomputation due to blocks being dropped from
>>the cache can trigger duplicate accumulator updates.
>> Hypothetically someone could be relying on the current behavior to
>>implement performance counters that track the actual number of
>>computations performed (including recomputations).  To be safe, we could
>>add an explicit warning in the release notes that documents the change
>>in behavior when we fix this.
>> Ignoring duplicate updates shouldn't be too hard, but there are a few
>>subtleties.  Currently, we allow accumulators to be used in multiple
>>transformations, so we'd need to detect duplicate updates at the
>>per-transformation level.  I haven't dug too deeply into the scheduler
>>internals, but we might also run into problems where pipelining causes
>>what is logically one set of accumulator updates to show up in two
>>different tasks (e.g. rdd.map(accum += x; ...) and rdd.map(accum += x;
>>...).count() may cause what's logically the same accumulator update to
>>be applied from two different contexts, complicating the detection of
>>duplicate updates).
>
>
>
>--
>This message was sent by Atlassian JIRA
>(v6.2#6252)



Re: Could you undo the JIRA dev list e-mails?

2014-03-29 Thread Mattmann, Chris A (3980)
Patrick,

No problem -- at the same time realize that I and the other
moderators were getting spammed by moderation emails from JIRA,

so you should take that into consideration as well.

Cheers,
Chris


-Original Message-
From: Patrick Wendell 
Date: Saturday, March 29, 2014 11:59 AM
To: Chris Mattmann 
Cc: "d...@spark.incubator.apache.org" 
Subject: Re: Could you undo the JIRA dev list e-mails?

>Okay I think I managed to revert this by just removing jira@a.o from our
>dev list.
>
>
>On Sat, Mar 29, 2014 at 11:37 AM, Patrick Wendell
> wrote:
>
>Hey Chris,
>
>
>I don't think our JIRA has been fully migrated to Apache infra, so it's
>really confusing to send people e-mails referring to the new JIRA since
>we haven't announced it yet. There is some content there because we've
>been trying to do the migration, but
> I'm not sure it's entirely finished.
>
>
>Also, right now our github comments go to a commits@ list. I'm actually
>-1 copying all of these to JIRA because we do a bunch of review level
>comments that are going to pollute the JIRA a bunch.
>
>
>In any case, can you revert the change whatever it was that sent these to
>the dev list? We should have a coordinated plan about this transition and
>the e-mail changes we plan to make.
>
>
>- Patrick
>
>
>
>
>
>



Re: Could you undo the JIRA dev list e-mails?

2014-03-29 Thread Mattmann, Chris A (3980)
I "reverted" this Patrick, per your request:

[hermes] 8:21pm spark.apache.org > ezmlm-list dev | grep jira
j...@apache.org
[hermes] 8:21pm spark.apache.org > ezmlm-unsub dev j...@apache.org
[hermes] 8:21pm spark.apache.org > ezmlm-list dev | grep jira
[hermes] 8:21pm spark.apache.org >

Note that I and the other moderators will now receive moderation
emails until the infra ticket is fixed, but others will not.
I'll set up a mail filter.

Chris


-Original Message-
From: , Chris Mattmann 
Date: Saturday, March 29, 2014 1:11 PM
To: Patrick Wendell , Chris Mattmann

Cc: "dev@spark.apache.org" 
Subject: Re: Could you undo the JIRA dev list e-mails?

>Patrick,
>
>No problem -- at the same time realize that I and the other
>moderators were getting spammed by moderation emails from JIRA,
>
>so you should take that into consideration as well.
>
>Cheers,
>Chris
>
>
>-Original Message-----
>From: Patrick Wendell 
>Date: Saturday, March 29, 2014 11:59 AM
>To: Chris Mattmann 
>Cc: "d...@spark.incubator.apache.org" 
>Subject: Re: Could you undo the JIRA dev list e-mails?
>
>>Okay I think I managed to revert this by just removing jira@a.o from our
>>dev list.
>>
>>
>>On Sat, Mar 29, 2014 at 11:37 AM, Patrick Wendell
>> wrote:
>>
>>Hey Chris,
>>
>>
>>I don't think our JIRA has been fully migrated to Apache infra, so it's
>>really confusing to send people e-mails referring to the new JIRA since
>>we haven't announced it yet. There is some content there because we've
>>been trying to do the migration, but
>> I'm not sure it's entirely finished.
>>
>>
>>Also, right now our github comments go to a commits@ list. I'm actually
>>-1 copying all of these to JIRA because we do a bunch of review level
>>comments that are going to pollute the JIRA a bunch.
>>
>>
>>In any case, can you revert the change whatever it was that sent these to
>>the dev list? We should have a coordinated plan about this transition and
>>the e-mail changes we plan to make.
>>
>>
>>- Patrick
>>
>>
>>
>>
>>
>>
>



Re: Could you undo the JIRA dev list e-mails?

2014-03-29 Thread Mattmann, Chris A (3980)
No worries, thanks Patrick, agreed.


-Original Message-
From: Patrick Wendell 
Date: Saturday, March 29, 2014 1:47 PM
To: Chris Mattmann 
Cc: Chris Mattmann , "dev@spark.apache.org"

Subject: Re: Could you undo the JIRA dev list e-mails?

>Okay cool - sorry about that. Infra should be able to migrate these over
>to an issues@ list shortly. I'd rather bother a few moderators than the
>entire dev list... but ya I realize it's annoying :/
>
>
>On Sat, Mar 29, 2014 at 1:22 PM, Mattmann, Chris A (3980)
> wrote:
>
>I "reverted" this Patrick, per your request:
>
>[hermes] 8:21pm spark.apache.org <http://spark.apache.org> > ezmlm-list
>dev | grep jira
>j...@apache.org
>[hermes] 8:21pm spark.apache.org <http://spark.apache.org> > ezmlm-unsub
>dev
>j...@apache.org
>[hermes] 8:21pm spark.apache.org <http://spark.apache.org> > ezmlm-list
>dev | grep jira
>[hermes] 8:21pm spark.apache.org <http://spark.apache.org> >
>
>Note that I and the other moderators will now receive moderation
>emails until the infra ticket is fixed, but others will not.
>I'll set up a mail filter.
>
>Chris
>
>
>-Original Message-
>From: , Chris Mattmann 
>Date: Saturday, March 29, 2014 1:11 PM
>To: Patrick Wendell , Chris Mattmann
>
>Cc: "dev@spark.apache.org" 
>Subject: Re: Could you undo the JIRA dev list e-mails?
>
>>Patrick,
>>
>>No problem -- at the same time realize that I and the other
>>moderators were getting spammed by moderation emails from JIRA,
>>
>>so you should take that into consideration as well.
>>
>>Cheers,
>>Chris
>>
>>
>>-Original Message-
>>From: Patrick Wendell 
>>Date: Saturday, March 29, 2014 11:59 AM
>>To: Chris Mattmann 
>>Cc: "d...@spark.incubator.apache.org" 
>>Subject: Re: Could you undo the JIRA dev list e-mails?
>>
>>>Okay I think I managed to revert this by just removing jira@a.o from our
>>>dev list.
>>>
>>>
>>>On Sat, Mar 29, 2014 at 11:37 AM, Patrick Wendell
>>> wrote:
>>>
>>>Hey Chris,
>>>
>>>
>>>I don't think our JIRA has been fully migrated to Apache infra, so it's
>>>really confusing to send people e-mails referring to the new JIRA since
>>>we haven't announced it yet. There is some content there because we've
>>>been trying to do the migration, but
>>> I'm not sure it's entirely finished.
>>>
>>>
>>>Also, right now our github comments go to a commits@ list. I'm actually
>>>-1 copying all of these to JIRA because we do a bunch of review level
>>>comments that are going to pollute the JIRA a bunch.
>>>
>>>
>>>In any case, can you revert the change whatever it was that sent these
>>>to
>>>the dev list? We should have a coordinated plan about this transition
>>>and
>>>the e-mail changes we plan to make.
>>>
>>>
>>>- Patrick
>>>
>>>
>>>
>>>
>>>
>>>
>>
>
>
>
>
>
>
>
>



2nd Workshop on Sustainable Software for Science: Practice and Experiences (WSSSPE2)

2014-07-04 Thread Mattmann, Chris A (3980)
ubmission at
https://www.easychair.org/conferences/?conf=wssspe2, and entering:

- author information for all authors
- title
- abstract (with the identifier as the first line of the abstract, for
example, http://dx.doi.org/10.6084/m9.figshare.791606 or
http://arxiv.org/abs/1404.7414 or alternative)
- at least three keywords
- tick the abstract only box
Do not submit the paper itself through EasyChair; the identifier in the
abstract that points to the paper is sufficient.

Deadline for Submission:

14 July 2014 (any time of day, no extensions)

Travel Support

Funds are available to support participation in WSSSPE2 by 1) US-based
students, early-career researchers, and members of underrepresented
groups; and 2) participants who would not otherwise attend the SC14
conference. Priority will be given to those who have submitted papers and
can make a compelling case for how their participation will strengthen the
overall workshop and/or positively impact their future research or
educational activities.

Submissions for travel support will be accepted from September 1st to
September 15th 2014 following instructions posted on the workshop web site.

Financial support to enable this has been generously provided by 1) the
National Science Foundation and 2) the Gordon and Betty Moore Foundation.

Important Dates:

July 14, 2014 Paper submission deadline
September 1, 2014 Author notification
September 15, 2014  Funding request submission deadline
September 22, 2014  Funding decision notification
November 16, 2014 WSSSPE2 Workshop

Organizers:

- Daniel S. Katz, d.k...@ieee.org, National Science Foundation, USA
- Gabrielle Allen, gdal...@illinois.edu, University of Illinois
Urbana-Champaign, USA
- Neil Chue Hong, n.chueh...@software.ac.uk, Software Sustainability
Institute, University of Edinburgh, UK
- Karen Cranston, karen.crans...@nescent.org, National Evolutionary
Synthesis Center (NESCent), USA
- Manish Parashar, paras...@rutgers.edu, Rutgers University, USA
- David Proctor, djproc...@gmail.com, National Science Foundation, USA
- Matthew Turk, matthewt...@gmail.com, Columbia University, USA
- Colin C. Venters, colin.vent...@googlemail.com, University of
Huddersfield, UK
- Nancy Wilkins-Diehr, wilki...@sdsc.edu, San Diego Supercomputer Center,
University of California, San Diego, USA

Program Committee:

- Aron Ahmadia, U.S. Army Engineer Research and Development Center, USA
- Liz Allen, Wellcome Trust, UK
- Lorena A. Barba, The George Washington University, USA
- C. Titus Brown, Michigan State University, USA
- Coral Calero, Universidad Castilla La Mancha, Spain
- Jeffrey Carver, University of Alabama, USA
- Ewa Deelman, University of Southern California, USA
- Gabriel A. Devenyi, McMaster University, Canada
- Charlie E. Dibsdale, O-Sys, Rolls Royce PLC, UK
- Alberto Di Meglio, CERN, Switzerland
- Anshu Dubey, Lawrence Berkeley National Laboratory, USA
- David Gavaghan, University of Oxford, UK
- Paul Ginsparg, Cornell University, USA
- Josh Greenberg, Alfred P. Sloan Foundation, USA
- Sarah Harris, University of Leeds, UK
- James Herbsleb, Carnegie Mellon University, USA
- James Howison, University of Texas at Austin, USA
- Caroline Jay, University of Manchester, UK
- Matthew B. Jones, National Center for Ecological Analysis and Synthesis
(NCEAS), University of California, Santa Barbara, USA
- Jong-Suk Ruth Lee, National Institute of Supercomputing and Networking,
KISTI (Korea Institute of Science and Technology Information), Korea
- James Lin, Shanghai Jiao Tong University, China
- Frank Löffler, Louisiana State University, USA
- Chris A. Mattmann, NASA JPL & University of Southern California, USA
- Robert H. McDonald, Indiana University, USA
- Lois Curfman McInnes, Argonne National Laboratory, USA
- Chris Mentzel, Gordon and Betty Moore Foundation, USA
- Kenneth M. Merz, Jr., Michigan State University, USA
- Marek T. Michalewicz, A*STAR Computational Resource Centre, Singapore
- Peter E. Murray, LYRASIS, USA
- Kenjo Nakajima, University of Tokyo, Japan
- Cameron Neylon, PLOS, UK
- Aleksandra Pawlik, Software Sustainability Institute, Manchester
University, UK
- Birgit Penzenstadler, University of California, Irvine, USA
- Marian Petre, The Open University, UK
- Mark D. Plumbley, Queen Mary University of London, UK
- Andreas Prlic, University of California, San Diego, USA
- Victoria Stodden, Columbia University, USA
- Kaitlin Thaney, Mozilla Science Lab, USA
- Greg Watson, IBM, USA
- Theresa Windus, Iowa State University and Ames Laboratory, USA





++++++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattm...@nasa.gov
WWW:  http://sunset.u

[jira] [Commented] (SPARK-732) Recomputation of RDDs may result in duplicated accumulator updates

2014-03-29 Thread Mattmann, Chris A (388J) (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13954349#comment-13954349
 ] 

Mattmann, Chris A (388J) commented on SPARK-732:


Guys I fixed this by adding j...@apache.org to the mailing list, no
more moderation required.

Cheers,
Chris







> Recomputation of RDDs may result in duplicated accumulator updates
> --
>
> Key: SPARK-732
> URL: https://issues.apache.org/jira/browse/SPARK-732
> Project: Apache Spark
>  Issue Type: Bug
>Affects Versions: 0.7.0, 0.6.2, 0.7.1, 0.8.0, 0.7.2, 0.7.3, 0.8.1, 0.9.0, 
> 0.8.2
>Reporter: Josh Rosen
>Assignee: Nan Zhu
> Fix For: 1.0.0
>
>
> Currently, Spark doesn't guard against duplicated updates to the same 
> accumulator due to recomputations of an RDD.  For example:
> {code}
> val acc = sc.accumulator(0)
> data.map { x => acc += 1; f(x) }
> data.count()
> // acc should equal data.count() here
> data.foreach{...}
> // Now, acc = 2 * data.count() because the map() was recomputed.
> {code}
> I think that this behavior is incorrect, especially because this behavior 
> allows the additon or removal of a cache() call to affect the outcome of a 
> computation.
> There's an old TODO to fix this duplicate update issue in the [DAGScheduler 
> code|https://github.com/mesos/spark/blob/ec5e553b418be43aa3f0ccc24e0d5ca9d63504b2/core/src/main/scala/spark/scheduler/DAGScheduler.scala#L494].
> I haven't tested whether recomputation due to blocks being dropped from the 
> cache can trigger duplicate accumulator updates.
> Hypothetically someone could be relying on the current behavior to implement 
> performance counters that track the actual number of computations performed 
> (including recomputations).  To be safe, we could add an explicit warning in 
> the release notes that documents the change in behavior when we fix this.
> Ignoring duplicate updates shouldn't be too hard, but there are a few 
> subtleties.  Currently, we allow accumulators to be used in multiple 
> transformations, so we'd need to detect duplicate updates at the 
> per-transformation level.  I haven't dug too deeply into the scheduler 
> internals, but we might also run into problems where pipelining causes what 
> is logically one set of accumulator updates to show up in two different tasks 
> (e.g. rdd.map(accum += x; ...) and rdd.map(accum += x; ...).count() may cause 
> what's logically the same accumulator update to be applied from two different 
> contexts, complicating the detection of duplicate updates).



--
This message was sent by Atlassian JIRA
(v6.2#6252)