Re: What's the meaning when the partitions is zero?

2016-09-16 Thread Sean Owen
There are almost no cases in which you'd want a zero-partition RDD.
The only one I can think of is an empty RDD, where the number of
partitions is irrelevant. Still, I would not be surprised if other
parts of the code assume at least 1 partition.

Maybe this check could be tightened. It would be interesting to see if
the tests catch any scenario where a 0-partition RDD is created, and
why.
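
(As a concrete illustration, here is a minimal spark-shell sketch of the
zero-partition case under discussion; it assumes a live SparkContext named
`sc` and is not taken from the thread itself.)

import org.apache.spark.HashPartitioner

val empty = sc.emptyRDD[Int]
println(empty.getNumPartitions)   // 0 -- an empty RDD reports zero partitions
println(empty.count())            // 0 -- actions still work on it

val p = new HashPartitioner(0)    // accepted by require(partitions >= 0); it
                                  // would only fail if asked to place a
                                  // non-null key, which never happens here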

On Fri, Sep 16, 2016 at 7:54 AM, WangJianfei wrote:
> class HashPartitioner(partitions: Int) extends Partitioner {
>   require(partitions >= 0, s"Number of partitions ($partitions) cannot be
> negative.")
>
> The source code has require(partitions >= 0), but I don't know why it makes
> sense when partitions is 0.
>
>
>
> --
> View this message in context: 
> http://apache-spark-developers-list.1001551.n3.nabble.com/What-s-the-meaning-when-the-partitions-is-zero-tp18957.html
> Sent from the Apache Spark Developers List mailing list archive at Nabble.com.
>




Re: Why we get 0 when the key is null?

2016-09-16 Thread Sean Owen
"null" is a valid value in an RDD, so it has to be partition-able.

On Fri, Sep 16, 2016 at 2:26 AM, WangJianfei wrote:
> When the key is not in the RDD, I can still get a value; it just feels a
> little strange.
>
>
>
> --
> View this message in context: 
> http://apache-spark-developers-list.1001551.n3.nabble.com/Why-we-get-0-when-the-key-is-null-tp18952p18955.html
> Sent from the Apache Spark Developers List mailing list archive at Nabble.com.
>




Re: What's the meaning when the partitions is zero?

2016-09-16 Thread Reynold Xin
They are valid, especially in partition pruning.
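
(A small sketch of that point, assuming a spark-shell session with a
SparkContext `sc`; PartitionPruningRDD is a DeveloperApi, and this example is
not from the thread. Pruning away every partition legitimately yields a
zero-partition RDD.)

import org.apache.spark.rdd.PartitionPruningRDD

val base   = sc.parallelize(1 to 100, 4)
val pruned = PartitionPruningRDD.create(base, _ => false)  // keep no partitions
println(pruned.getNumPartitions)  // 0
println(pruned.count())           // 0 -- still a valid, usable RDD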

On Friday, September 16, 2016, Sean Owen  wrote:

> There are almost no cases in which you'd want a zero-partition RDD.
> The only one I can think of is an empty RDD, where the number of
> partitions is irrelevant. Still, I would not be surprised if other
> parts of the code assume at least 1 partition.
>
> Maybe this check could be tightened. It would be interesting to see if
> the tests catch any scenario where a 0-partition RDD is created, and
> why.
>
> On Fri, Sep 16, 2016 at 7:54 AM, WangJianfei wrote:
> > class HashPartitioner(partitions: Int) extends Partitioner {
> >   require(partitions >= 0, s"Number of partitions ($partitions) cannot be
> > negative.")
> >
> > The source code has require(partitions >= 0), but I don't know why it makes
> > sense when partitions is 0.
> >
> >
> >
> > --
> > View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/What-s-the-meaning-when-the-partitions-is-zero-tp18957.html
> > Sent from the Apache Spark Developers List mailing list archive at Nabble.com.
> >


Spark 2.0.1 release?

2016-09-16 Thread Ewan Leith
Hi all,

Apologies if I've missed anything, but are we likely to see a 2.0.1 bug fix
release, or does a jump to 2.1.0 with additional features seem more probable?

The issues for 2.0.1 seem pretty much done here 
https://issues.apache.org/jira/browse/SPARK/fixforversion/12336857/?selectedTab=com.atlassian.jira.jira-projects-plugin:version-issues-panel

And there are a lot of good bug fixes in there that I'd love to be able to use
without deploying our own builds.

Thanks,
Ewan





Re: Spark 2.0.1 release?

2016-09-16 Thread Reynold Xin
2.0.1 is definitely coming soon. I was going to tag an RC yesterday but ran
into an issue; I will try to cut it early next week.


On Fri, Sep 16, 2016 at 11:16 AM, Ewan Leith wrote:

> Hi all,
>
> Apologies if I've missed anything, but is there likely to see a 2.0.1 bug
> fix release, or does a jump to 2.1.0 with additional features seem more
> probable?
>
> The issues for 2.0.1 seem pretty much done here https://issues.apache.org/jira/browse/SPARK/fixforversion/12336857/?selectedTab=com.atlassian.jira.jira-projects-plugin:version-issues-panel
>
> And there's a lot of good bugfixes in there that I'd love to be able to
> use without deploying our own builds.
>
> Thanks,
> Ewan
>
>


Re: Compatibility of 1.6 spark.eventLog with a 2.0 History Server

2016-09-16 Thread Parth Brahmbhatt
The problem is that we backported the SQL tab UI changes from 2.0 into our
1.6.1 build. Those changes renamed a parameter in SQLMetricInfo, so the stock
community version is compatible but ours is not.

> On Sep 15, 2016, at 11:08 AM, Mario Ds Briggs  wrote:
> 
> I checked in 1.6.2 and it doesn't; I didn't check lower versions. The
> history server logs do show a 'Replaying log path: file:xxx.inprogress' entry
> when the file changes, but a refresh of the UI doesn't show the new
> jobs/stages/tasks.
> 
> 
> 
> thanks
> Mario
> 
> 
> From: Ryan Williams 
> To: Reynold Xin , Mario Ds Briggs/India/IBM@IBMIN
> Cc: "dev@spark.apache.org" 
> Date: 15/09/2016 11:19 pm
> Subject: Re: Compatibility of 1.6 spark.eventLog with a 2.0 History Server
> 
> 
> 
> 
> What is meant by:
> 
> """
> (This is because clicking the refresh button in browser, updates the UI with 
> latest events, where-as in the 1.6 code base, this does not happen)
> """
> 
> Hasn't refreshing the page updated all the information in the UI through the 
> 1.x line?
> 
> 


Re: Spark 2.0.1 release?

2016-09-16 Thread Sean Owen
There are still open blockers for 2.0.1, but just two. For example,
https://issues.apache.org/jira/browse/SPARK-17418 must be resolved
before another release.

On Fri, Sep 16, 2016 at 7:23 PM, Reynold Xin  wrote:
> 2.0.1 is definitely coming soon.  Was going to tag a rc yesterday but ran
> into some issue. I will try to do it early next week for rc.
>
>
> On Fri, Sep 16, 2016 at 11:16 AM, Ewan Leith wrote:
>>
>> Hi all,
>>
>> Apologies if I've missed anything, but is there likely to see a 2.0.1 bug
>> fix release, or does a jump to 2.1.0 with additional features seem more
>> probable?
>>
>> The issues for 2.0.1 seem pretty much done here
>> https://issues.apache.org/jira/browse/SPARK/fixforversion/12336857/?selectedTab=com.atlassian.jira.jira-projects-plugin:version-issues-panel
>>
>> And there's a lot of good bugfixes in there that I'd love to be able to
>> use without deploying our own builds.
>>
>> Thanks,
>> Ewan
>>
>>
>>
>
>




Re: Spark 2.0.1 release?

2016-09-16 Thread Ewan Leith
That's great news. Since it's that close, I'll get started on building and
testing the branch myself.

Thanks,
Ewan

On 16 Sep 2016 19:23, Reynold Xin  wrote:
2.0.1 is definitely coming soon.  Was going to tag a rc yesterday but ran into 
some issue. I will try to do it early next week for rc.


On Fri, Sep 16, 2016 at 11:16 AM, Ewan Leith <ewan.le...@realitymine.com> wrote:

Hi all,

Apologies if I've missed anything, but is there likely to see a 2.0.1 bug fix 
release, or does a jump to 2.1.0 with additional features seem more probable?

The issues for 2.0.1 seem pretty much done here 
https://issues.apache.org/jira/browse/SPARK/fixforversion/12336857/?selectedTab=com.atlassian.jira.jira-projects-plugin:version-issues-panel

And there's a lot of good bugfixes in there that I'd love to be able to use 
without deploying our own builds.

Thanks,
Ewan










Re: Spark 1.x/2.x qualifiers in downstream artifact names

2016-09-16 Thread Michael Heuer
On Wed, Aug 24, 2016 at 12:12 PM, Sean Owen  wrote:

> If you're just varying versions (or things that can be controlled by a
> profile, which is most everything including dependencies), you don't
> need and probably don't want multiple POM files. Even that wouldn't
> mean you can't use classifiers.
>

It is worse (or better) than that: profiles didn't work for us in
combination with Scala 2.10/2.11, so we modify the POM in place as part of
CI and the release process.



> I have seen it used for HBase, core Hadoop. I am not sure I've seen it
> used for Spark 2 vs 1 but no reason it couldn't be. Frequently
> projects would instead declare that as of some version, Spark 2 is
> required, rather than support both. Or shim over an API difference
> with reflection if that's all there was to it. Spark does both of
> those sorts of things itself to avoid having to publish multiple
> variants at all. (Well, except for Scala 2.10 vs 2.11!)
>

We shim over Hadoop changes where necessary, but the Spark changes between
1.x and 2.x are too extensive to shim over.

We have since resolved to deploy separate Spark 1.x and 2.x artifactIds as
described below.  Relevant pull requests:

https://github.com/bigdatagenomics/adam/pull/1123
https://github.com/bigdatagenomics/utils/pull/78

Thanks!

   michael



>
> On Wed, Aug 24, 2016 at 6:02 PM, Michael Heuer  wrote:
> > Have you seen any successful applications of this for Spark 1.x/2.x?
> >
> > From the doc "The classifier allows to distinguish artifacts that were
> built
> > from the same POM but differ in their content."
> >
> > We'd be building from different POMs, since we'd be modifying the Spark
> > dependency version (and presumably any other dependencies that needed the
> > same Spark 1.x/2.x distinction).
> >
> >
> > On Wed, Aug 24, 2016 at 11:49 AM, Sean Owen  wrote:
> >>
> >> This is also what "classifiers" are for in Maven, to have variations
> >> on one artifact and version. https://maven.apache.org/pom.html
> >>
> >> It has been used to ship code for Hadoop 1 vs 2 APIs.
> >>
> >> In a way it's the same idea as Scala's "_2.xx" naming convention, with
> >> a less unfortunate implementation.
> >>
> >>
> >> On Wed, Aug 24, 2016 at 5:41 PM, Michael Heuer 
> wrote:
> >> > Hello,
> >> >
> >> > We're a project downstream of Spark and need to provide separate
> >> > artifacts
> >> > for Spark 1.x and Spark 2.x.  Has any convention been established or
> >> > even
> >> > proposed for artifact names and/or qualifiers?
> >> >
> >> > We are currently thinking
> >> >
> >> > org.bdgenomics.adam:adam-{core,apis,cli}_2.1[0,1]  for Spark 1.x and
> >> > Scala
> >> > 2.10 & 2.11
> >> >
> >> >   and
> >> >
> >> > org.bdgenomics.adam:adam-{core,apis,cli}-spark2_2.1[0,1]  for Spark 2.x
> >> > and Scala 2.10 & 2.11
> >> >
> >> > https://github.com/bigdatagenomics/adam/issues/1093
> >> >
> >> >
> >> > Thanks in advance,
> >> >
> >> >michael
> >
> >
>


Re: What's the meaning when the partitions is zero?

2016-09-16 Thread WangJianfei
If so, we will get an exception when numPartitions is 0:

  def getPartition(key: Any): Int = key match {
    case null => 0
    // case None => 0
    case _ => Utils.nonNegativeMod(key.hashCode, numPartitions)
  }



--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/What-s-the-meaning-when-the-partitions-is-zero-tp18957p18967.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.




Re: What's the meaning when the partitions is zero?

2016-09-16 Thread Mridul Muralidharan
When numPartitions is 0, there is no data in the RDD, so getPartition is
never invoked.

-  Mridul
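
(For reference, a standalone sketch of the modulo helper involved. It mirrors
org.apache.spark.util.Utils.nonNegativeMod, which is private[spark], so it is
reproduced here rather than imported.)

def nonNegativeMod(x: Int, mod: Int): Int = {
  val rawMod = x % mod
  rawMod + (if (rawMod < 0) mod else 0)  // shift negative remainders into [0, mod)
}

// With mod = 0 this would throw java.lang.ArithmeticException ("/ by zero"),
// but, as noted above, a zero-partition RDD has no records, so getPartition --
// and with it this helper -- is never reached with numPartitions = 0.
// nonNegativeMod("a".hashCode, 0)   // would throw if it were ever invoked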

On Friday, September 16, 2016, WangJianfei wrote:

> if so, we will get exception when the numPartitions is 0.
>  def getPartition(key: Any): Int = key match {
> case null => 0
> //case None => 0
> case _ => Utils.nonNegativeMod(key.hashCode, numPartitions)
>   }
>
>
>
> --
> View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/What-s-the-meaning-when-the-partitions-is-zero-tp18957p18967.html
> Sent from the Apache Spark Developers List mailing list archive at Nabble.com.
>


Doubt about ExternalSorter.spillMemoryIteratorToDisk

2016-09-16 Thread WangJianfei
We can see that when the number of written objects equals serializerBatchSize,
flush() is called. But what happens if the data written exceeds the default
buffer size before that point? In that situation, will flush() be called
automatically?

private[this] def spillMemoryIteratorToDisk(inMemoryIterator: WritablePartitionedIterator)
  : SpilledFile = {

// ignore some code here
try {
  while (inMemoryIterator.hasNext) {
val partitionId = inMemoryIterator.nextPartition()
require(partitionId >= 0 && partitionId < numPartitions,
  s"partition Id: ${partitionId} should be in the range [0, ${numPartitions})")
inMemoryIterator.writeNext(writer)
elementsPerPartition(partitionId) += 1
objectsWritten += 1
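// once serializerBatchSize objects have been written, flush() closes out the
// current serializer batch (its size is tracked via batchSizes) so the spill
// file can later be read back one batch at a time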
if (objectsWritten == serializerBatchSize) {
  flush()
}
  }
 
 // ignore some code here
SpilledFile(file, blockId, batchSizes.toArray, elementsPerPartition)
  }



--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/Doubt-about-ExternalSorter-spillMemoryIteratorToDisk-tp18969.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.
