Re: more uniform exception handling?

2016-04-19 Thread Sean Owen
We already have SparkException, indeed. The ID is an interesting idea;
simple to implement and might help disambiguate.

Does it solve many problems of this form? If something is
squelching Exception or SparkException, the result will be the same. #2
is something we can sniff out with static analysis pretty easily, but
not so much #1. Ideally we'd just fix blocks like this, but I bet there
are lots of them.

I like the idea, but for a different reason: it's probably best to
control the exceptions that propagate from the public API, since in
some cases they're a meaningful part of the API (see
https://issues.apache.org/jira/browse/SPARK-8393, which I'm hoping to
fix now).

And the catch there is -- throwing checked exceptions from Scala code
in a way that Java code can catch requires annotating lots of methods.
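
To illustrate with a minimal sketch (the method here is hypothetical, not
Spark's API): without the @throws annotation, javac never sees the checked
exception, so Java callers can't catch it as one.

    import java.io.IOException

    class Repository {
      // @throws makes the checked exception visible in the bytecode
      // signature; without it, a Java caller's "catch (IOException e)"
      // is rejected because javac thinks it can never be thrown.
      @throws[IOException]("if the underlying storage is unreachable")
      def load(path: String): String = {
        if (path.isEmpty) throw new IOException("no path given")
        s"contents of $path"
      }
    }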

On Mon, Apr 18, 2016 at 8:16 PM, Reynold Xin  wrote:
> Josh's pull request on rpc exception handling got me to think ...
>
> In my experience, a few things related to exceptions have created a lot
> of trouble for us in production debugging:
>
> 1. Some exception is thrown, but is caught by some try/catch that does not
> do any logging or rethrow.
> 2. Some exception is thrown, but is caught by some try/catch that does not
> do any logging, but does rethrow. However, the original exception is now masked.
> 3. Multiple exceptions are logged at different places close to each other,
> but we don't know whether they are caused by the same problem or not.
>
>
> To mitigate some of the above, here's an idea ...
>
> (1) Create a common root class for all the exceptions (e.g. call it
> SparkException) used in Spark. We should make sure every time we catch an
> exception from a 3rd party library, we rethrow them as SparkException (a lot
> of places already do that). In SparkException's constructor, log the
> exception and the stacktrace.
>
> (2) SparkException has a monotonically increasing ID, and this ID appears in
> the exception error message (say at the end).
>
>
> I think (1) will eliminate most of the cases that an exception gets
> swallowed. The main downside I can think of is we might log an exception
> multiple times. However, I'd argue exceptions should be rare, and it is not
> that big of a deal to log them twice or three times. The unique ID (2) can
> help us correlate exceptions if they appear multiple times.
>
> Thoughts?
>
>
>
>
>




Re: auto closing pull requests that have been inactive > 30 days?

2016-04-19 Thread Sean Owen
I support this. We used to do this, right? Anecdotally, from watching
the stream most days, most stale PRs are, in descending order of
frequency:

1. Probably not a good change, and not looked at (as a result)
2. Abandoned by submitter at some stage
3. Not an important change, not so bad, not really reviewed
4. A good change that needs review

Whether your PR is #1 or #4 is a matter of perspective. But, I
disagree with the tacit assumption that we're mostly talking about
good PRs being closed because nobody could be bothered; #4 is, I
think, well under 10%.

So generating reports, warnings, etc. doesn't seem to address that.
Closing merely means "at the moment there's not a reason to expect
this to proceed, but that could change". Unlike JIRA we don't have
more nuanced resolutions like "WontFix" vs "Later". We can welcome the
submitter to reopen in good faith if they really think it should be
kept alive.

As for always stating a close reason: well, a lot of PRs are simply
not very good code, or features that just don't look that useful
relative to their cost. Is it more polite to soft-close or honestly
say "I don't think this is worth adding"?

There is a carrying cost to not doing this. Right now being "Open" is
fairly meaningless, and I've long since stopped bothering to review
the backlog of open PRs, since it's noise; instead I sift for good new
ones to fast-track.


I agree with comments here suggesting that more can be pushed back onto
the contributors. We've started with a campaign to get people to read
https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark
first, which, if followed, would solve a lot of the "mystery pull request"
problems: no real description, no test, no real problem statement.

Put another way: any contribution that is clearly explained, cleanly
implemented, and makes a good case for why its pros outweigh its cons
is reviewed pretty consistently, and quickly. Pushing contributors to
do these things won't harm good contributions, which already do them;
it'll make it harder for bad contributions to distract from them.

And I think the effect of a change like this is, in the main, to push
back mostly on less good contributions.

On Mon, Apr 18, 2016 at 8:02 PM, Reynold Xin  wrote:
> We have hit a new high in open pull requests: 469 today. While we can
> certainly get more review bandwidth, many of these are old and still open
> for other reasons. Some are stale because the original authors have become
> busy and inactive, and some others are stale because the committers are not
> sure whether the patch would be useful, but have not rejected the patch
> explicitly. We can improve the signal-to-noise ratio by closing pull
> requests that have been inactive for more than 30 days, with a nice
> message. I just checked, and this would close roughly half of the open
> pull requests.
>
> For example:
>
> "Thank you for creating this pull request. Since this pull request has been
> inactive for 30 days, we are automatically closing it. Closing the pull
> request does not remove it from history and will retain all the diff and
> review comments. If you have the bandwidth and would like to continue
> pushing this forward, please reopen it. Thanks again!"
>
>




Re: YARN Shuffle service and its compatibility

2016-04-19 Thread Steve Loughran

> On 18 Apr 2016, at 23:05, Marcelo Vanzin  wrote:
> 
> On Mon, Apr 18, 2016 at 2:02 PM, Reynold Xin  wrote:
>> The bigger problem is that it is much easier to maintain backward
>> compatibility rather than dictating forward compatibility. For example, as
>> Marcin said, if we come up with a slightly different shuffle layout to
>> improve shuffle performance, we wouldn't be able to do that if we want to
>> allow Spark 1.6 shuffle service to read something generated by Spark 2.1.
> 
> And I think that's really what Mark is proposing. Basically, "don't
> intentionally break backwards compatibility unless it's really
> required" (e.g. SPARK-12130). That would allow option B to work.
> 
> If a new shuffle manager is created, then neither option A nor option
> B would really work. Moving all the shuffle-related classes to a
> different package, to support option A, would be really messy. At that
> point, you're better off maintaining the new shuffle service outside
> of YARN, which is rather messy too.
> 


There's a work in progress in YARN to move auxiliary NodeManager services into 
their own classpath, though that doesn't address shared native libs, such as 
the leveldb support that went into 1.6.


There's already been some fun with Spark's Jackson version versus Hadoop's 
(SPARK-12807); that's something per-service classpaths would fix.

Would having separate classpaths allow multiple Spark shuffle JARs to be 
loaded, as long as everything bonded to the right one?

Re: more uniform exception handling?

2016-04-19 Thread Steve Loughran

On 18 Apr 2016, at 20:16, Reynold Xin <r...@databricks.com> wrote:

Josh's pull request on rpc exception handling got me to think ...

In my experience, a few things related to exceptions have created a lot of 
trouble for us in production debugging:

1. Some exception is thrown, but is caught by some try/catch that does not do 
any logging or rethrow.
2. Some exception is thrown, but is caught by some try/catch that does not do 
any logging, but does rethrow. However, the original exception is now masked.
3. Multiple exceptions are logged at different places close to each other, but 
we don't know whether they are caused by the same problem or not.


To mitigate some of the above, here's an idea ...

(1) Create a common root class for all the exceptions (e.g. call it 
SparkException) used in Spark. We should make sure every time we catch an 
exception from a 3rd party library, we rethrow them as SparkException (a lot of 
places already do that). In SparkException's constructor, log the exception and 
the stacktrace.

(2) SparkException has a monotonically increasing ID, and this ID appears in 
the exception error message (say at the end).


I think (1) will eliminate most of the cases that an exception gets swallowed. 
The main downside I can think of is we might log an exception multiple times. 
However, I'd argue exceptions should be rare, and it is not that big of a deal 
to log them twice or three times. The unique ID (2) can help us correlate 
exceptions if they appear multiple times.

Thoughts?






1. Unique IDs are a nice touch.
2. There are some exceptions that code really needs to match on, usually in 
the network layer, and also InterruptedException. It's dangerous to swallow them.
3. I've done work on other projects (Slider, with YARN-679 to get them into 
Hadoop) where exceptions can also declare an exit code. This means system exits 
can have different exit codes for different problems, and the exception-raising 
code gets to choose it. For extra fun, the set of exit codes attempts to lift 
numbers from HTTP errors, so "41" is Unauthed, from HTTP 401: 
https://slider.incubator.apache.org/docs/exitcodes.html
4. Once you have different exit codes, you can start writing tests for the 
scripts designed to trigger failures, asserting about the exit code as a way 
to assess the outcome.
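
To make points 1 and 3 concrete, here's a minimal sketch of a SparkException 
carrying a unique ID and an exit code; the names, the default exit code, and 
the log-in-constructor behaviour are illustrative assumptions drawn from this 
thread, not Spark's actual code:

    import java.util.concurrent.atomic.AtomicLong
    import org.slf4j.LoggerFactory

    object SparkException {
      private val idCounter = new AtomicLong(0)
      private val log = LoggerFactory.getLogger(classOf[SparkException])
    }

    class SparkException(message: String,
                         cause: Throwable = null,
                         val exitCode: Int = 1)   // illustrative default
        extends Exception(message, cause) {

      // Monotonically increasing ID, appended to the message for correlation.
      val id: Long = SparkException.idCounter.incrementAndGet()

      override def getMessage: String = s"$message (error id: $id)"

      // Log at construction so the error is recorded even if a careless
      // try/catch later swallows the exception entirely.
      SparkException.log.error(getMessage, this)
    }

A launcher can then pick up exitCode at the top level and pass it to 
System.exit, which is what makes the script-level failure tests in point 4 
possible.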

Something else to consider is what can be added atop the classic runtime 
exceptions to make them useful. Hadoop's NetUtils.wrapException() does this: it 
catches things coming up from the network stack and rethrows an exception of 
the same type (where possible), but now with source/dest hostnames and ports. 
That is incredibly useful. The exceptions also tack in wiki references to what 
the exceptions mean, in a desperate attempt to reduce the # of JIRAs 
complaining about services refusing connections. It's hard to tell how often 
that works; some people do now just paste in the stack trace without reading 
the wiki link. At least now there's somewhere to point them at when the issue 
is closed as invalid. [See: 
http://steveloughran.blogspot.co.uk/2011/09/note-on-distributed-computing.html]
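
For reference, a rough sketch of that wrapping technique (this is not Hadoop's 
actual NetUtils code; the helper name is illustrative, though the wiki URL is 
the one Hadoop points people at):

    import java.io.IOException
    import java.net.ConnectException

    // Rethrow an exception of the same concrete type where possible, with
    // endpoint info and a pointer to further reading added to the message.
    def wrapNetworkException(destHost: String, destPort: Int,
                             e: IOException): IOException = e match {
      case _: ConnectException =>
        val wrapped = new ConnectException(
          s"Connection refused connecting to $destHost:$destPort; " +
          s"see https://wiki.apache.org/hadoop/ConnectionRefused : $e")
        wrapped.initCause(e)   // keep the original as the cause
        wrapped
      case _ =>
        new IOException(s"Error on connection to $destHost:$destPort: $e", e)
    }

Callers in the network layer can still match on ConnectException, since the 
concrete type is preserved; only the message gets richer.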

I'm now considering what could be done at the Kerberos layer too, though there 
the problem is that the JVM exception is invariably a meaningless "Failure 
Unspecified at GSS API Level" plus text which varies across JVM vendors and 
versions. Maybe the wiki URL should just point to a page saying "nobody 
understands Kerberos, sorry".


Re: [spark.ml] Why is private class ColumnPruner?

2016-04-19 Thread Jacek Laskowski
Hi Yanbo,

https://issues.apache.org/jira/browse/SPARK-14730

Thanks!

Pozdrawiam,
Jacek Laskowski

https://medium.com/@jaceklaskowski/
Mastering Apache Spark http://bit.ly/mastering-apache-spark
Follow me at https://twitter.com/jaceklaskowski


On Tue, Apr 19, 2016 at 8:55 AM, Yanbo Liang  wrote:
> Hi Jacek,
>
> This is because ColumnPruner is currently only used by RFormula; we did not
> expose it as a feature transformer.
> Please feel free to create a JIRA and work on it.
>
> Thanks
> Yanbo
>
> 2016-03-25 8:50 GMT-07:00 Jacek Laskowski :
>>
>> Hi,
>>
>> Came across `private class ColumnPruner` with "TODO(ekl) make this a
>> public transformer" in scaladoc, cf.
>>
>> https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/RFormula.scala#L317.
>>
>> Why is this private and is there a JIRA for the TODO(ekl)?
>>
>> Pozdrawiam,
>> Jacek Laskowski
>> 
>> https://medium.com/@jaceklaskowski/
>> Mastering Apache Spark http://bit.ly/mastering-apache-spark
>> Follow me at https://twitter.com/jaceklaskowski
>>
>>
>




Re: auto closing pull requests that have been inactive > 30 days?

2016-04-19 Thread Hyukjin Kwon
I agree with almost all of what you said, Sean.

If enough feedback is given, it might be okay to apply the rule; as
Reynold said, the two approaches are not mutually exclusive. (Although I
still think there should be more committers, or committers should be more
active, because the main cause here seems to be PRs that are not reviewed,
or are closed without enough feedback. There should be some comment even
when closing, such as "I don't think it is worth adding because...". This
could help develop a contributor's reasoning, which would eventually
reduce the help needed from committers.)

FWIW, I strongly feel that at least the ~10% of remaining "good" PRs
should be given enough feedback and comments, because, personally, I am
pretty sure the PRs closed by the rule would never be reopened, including
those 10%, and a copy-and-pasted message would do nothing but notify that
"this is closed as it has expired". We don't really care how nice the
messages left by Jenkins are.

Lastly, I think whether a PR is open or closed means something, and that
might be why some PRs are left open even though committers don't think
they are worth adding.
On 19 Apr 2016 1:14 p.m., "Hyukjin Kwon"  wrote:

> I don't think asking committers to be more active is impractical. I am
> not sure whether other projects apply the same rules here, but
>
> I think if a project is becoming more popular, it is appropriate that
> there be more committers, or that they be asked to be more active.
>
>
> In addition, I believe there are a lot of PRs waiting for committer's
> comments.
>
>
> If committers are too busy to review a PR, then I think they should ask
> the authors to provide the evidence needed to decide, maybe with a message
> such as
>
> "I am currently too busy to review or decide. Could you please add some
> evidence/benchmark/performance test or survey of demand?"
>
>
> If the evidence is not enough or not easy to see, then they can ask the
> author to simplify the evidence or draw a proper conclusion, maybe with a
> message such as
>
> "I think the evidence is not sufficient/trustworthy because ... Could you
> please simplify it or provide some more evidence?".
>
>
>
> Or, I think they can be manually closed with an explicit message such as
>
> "This is closed for now because we are not sure about this patch because ..."
>
>
> I think they can't be closed only because they have "expired", with a
> copy-and-pasted message.
>
>
>
> 2016-04-19 12:46 GMT+09:00 Nicholas Chammas :
>
>> Relevant: https://github.com/databricks/spark-pr-dashboard/issues/1
>>
>> A lot of this was discussed a while back when the PR Dashboard was first
>> introduced, and several times before and after that as well (e.g. August
>> 2014).
>>
>> If there is not enough momentum to build the tooling that people are
>> discussing here, then perhaps Reynold's suggestion is the most practical
>> one that is likely to see the light of day.
>>
>> I think asking committers to be more active in commenting on PRs is
>> theoretically the correct thing to do, but impractical. I'm not a
>> committer, but I would guess that most of them are already way
>> overcommitted (ha!) and asking them to do more just won't yield results.
>>
>> We've had several instances in the past where we all tried to rally
>> and be more proactive about giving feedback, closing PRs, and nudging
>> contributors who have gone silent. My observation is that the level of
>> energy required to "properly" curate PR activity in that way is simply not
>> sustainable. People can do it for a few weeks and then things revert to the
>> way they are now.
>>
>> Perhaps the missing link that would make this sustainable is better
>> tooling. If you think so and can sling some Javascript, you might want
>> to contribute to the PR Dashboard.
>>
>> Perhaps the missing link is something else: A different PR review
>> process; more committers; a higher barrier to contributing; a combination
>> thereof; etc...
>>
>> Also relevant: http://danluu.com/discourage-oss/
>>
>> By the way, some people noted that closing PRs may discourage
>> contributors. I think our open PR count alone is very discouraging. Under
>> what circumstances would you feel encouraged to open a PR against a project
>> that has hundreds of open PRs, some from many, many months ago?
>>
>> Nick
>>
>>
>> On Mon, 18 Apr 2016 at 10:30 PM, Ted Yu wrote:
>>
>>> During the months of November / December, the 30 day period should be
>>> relaxed.
>>>
>>> Some people (at least in the US) may take extended vacations during that time.
>>>
>>> For Chinese developers, the Spring Festival would present a similar circumstance.
>>>
>>> On

Introduction to Spark workshop, May 9, New York

2016-04-19 Thread Rich Bowen
Hi, folks, I received the following request:

---
The guy who was going to teach the Introduction to Spark workshop at Data 
Summit on May 9th has changed jobs and can no longer do the workshop. Know 
anybody in the New York area who could fill in? It's scheduled from 9 to 12 at 
the New York Hilton in Midtown.

Any suggestions welcome!
-

Thanks to anybody that can step up for this, or can suggest someone else. Feel 
free to contact me off-list - rbo...@apache.org - if you prefer.

Thanks!

--Rich




Question about storage memory in unified memory manager

2016-04-19 Thread Patrick Woody
Hey all,

I had a question about the MemoryStore for the BlockManager with the
unified memory manager vs. the legacy mode.

In the unified mode, I would expect the max size of the MemoryStore to be

  (usable heap) * spark.memory.fraction * spark.memory.storageFraction

in the same way that when using the StaticMemoryManager it is

  (usable heap) * spark.storage.memoryFraction * spark.storage.safetyFraction.

Instead it appears to be

  ((usable heap) * spark.memory.fraction) - onHeapExecutionMemoryPool.memoryUsed
I would expect onHeapExecutionMemoryPool.memoryUsed to be around 0 at
the time of initialization, so this may be overcommitting memory to the
MemoryStore. Does this line up with expectations?
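
A worked sketch of that gap, under assumed 1.6-era defaults
(spark.memory.fraction = 0.75, spark.memory.storageFraction = 0.5, 300 MB
reserved system memory); all numbers here are illustrative, not from the
report above:

    // 10 GiB heap, minus the reserved system memory
    val heap      = 10L * 1024 * 1024 * 1024
    val reserved  = 300L * 1024 * 1024
    val maxMemory = ((heap - reserved) * 0.75).toLong   // ~7.3 GiB unified region

    val expectedCap   = (maxMemory * 0.5).toLong        // ~3.6 GiB via storageFraction
    val executionUsed = 0L                              // ~0 right after startup
    val actualCap     = maxMemory - executionUsed       // ~7.3 GiB per the quoted expression

So at initialization the MemoryStore would be offered roughly the whole
unified region, about double what storageFraction alone suggests, matching
the overcommit described above.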

For context, I have a very large SQL query that does a significant number
of broadcast joins (~100), so I need my driver to be able to toss those into
the DiskStore without going into GC hell. Since my MemoryStore seems to not
be bounded by the storage pool, this fills up my heap and causes my
application to OOM. Simply reducing spark.memory.fraction alleviates the
problem, but I'd love to understand if that is actually the correct fix
here rather than simply lowering the storageFraction.

Thanks!
-Pat


Re: YARN Shuffle service and its compatibility

2016-04-19 Thread Tom Graves
It would be nice if we could keep this compatible between 1.6 and 2.0, so I'm 
more for Option B at this point, since the change made seems minor and we can 
change the shuffle service to handle it internally as Marcelo mentioned. Then 
let's try to stay compatible, but if there is a forcing function let's figure 
out a good way to run 2 at once.

Tom 

On Monday, April 18, 2016 5:23 PM, Marcelo Vanzin  
wrote:
 

 On Mon, Apr 18, 2016 at 3:09 PM, Reynold Xin  wrote:
> IIUC, the reason for that PR is that they found the string comparison to
> increase the size in large shuffles. Maybe we should add the ability to
> support the short name to Spark 1.6.2?

Is that something that really yields noticeable gains in performance?

If it is, it seems like it would be simple to allow executors to register
with the full class name, and map the long names to short names in the
shuffle service itself.

You could even get fancy and have different ExecutorShuffleInfo
implementations for each shuffle service, with an abstract
"getBlockData" method that gets called instead of the current if/else
in ExternalShuffleBlockResolver.java.
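
A minimal sketch of that mapping (the helper is illustrative; the two class
names are Spark 1.x's built-in shuffle managers):

    // Normalize whatever name the executor registered with to a short key,
    // so old and new executors look the same to the shuffle service.
    val shortNames = Map(
      "org.apache.spark.shuffle.sort.SortShuffleManager" -> "sort",
      "org.apache.spark.shuffle.hash.HashShuffleManager" -> "hash")

    def normalize(registered: String): String =
      shortNames.getOrElse(registered, registered)

The resolver's if/else would then branch on the short name only, regardless
of which form the executor sent when registering.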

-- 
Marcelo


Re: YARN Shuffle service and its compatibility

2016-04-19 Thread Mark Grover
On Tue, Apr 19, 2016 at 2:26 AM, Steve Loughran 
wrote:

>
> > On 18 Apr 2016, at 23:05, Marcelo Vanzin  wrote:
> >
> > On Mon, Apr 18, 2016 at 2:02 PM, Reynold Xin 
> wrote:
> >> The bigger problem is that it is much easier to maintain backward
> >> compatibility rather than dictating forward compatibility. For example,
> as
> >> Marcin said, if we come up with a slightly different shuffle layout to
> >> improve shuffle performance, we wouldn't be able to do that if we want
> to
> >> allow Spark 1.6 shuffle service to read something generated by Spark
> 2.1.
> >
> > And I think that's really what Mark is proposing. Basically, "don't
> > intentionally break backwards compatibility unless it's really
> > required" (e.g. SPARK-12130). That would allow option B to work.
> >
> > If a new shuffle manager is created, then neither option A nor option
> > B would really work. Moving all the shuffle-related classes to a
> > different package, to support option A, would be really messy. At that
> > point, you're better off maintaining the new shuffle service outside
> > of YARN, which is rather messy too.
> >
>
>
> There's a WiP in YARN to move Aux NM services into their own CP, though
> that doesn't address shared native libs, such as the leveldb support that
> went into 1.6
>
>
> There's already been some fun with Jackson versions and that of Hadoop —
> SPARK-12807; something that per-service classpaths would fix.
>
> would having separate CPs allow multiple spark shuffle JARs to be loaded,
> as long as everything bonded to the right one?

I just checked out https://issues.apache.org/jira/browse/YARN-1593. It's
hard to say whether it'd help or not; I wasn't able to find any design doc or
patch attached to that JIRA. If there were a way to specify different JAR
names/locations for starting the separate process, it would work; but if the
start happened by pointing to a full class name, that comes back to Option
A, and we'd have to do a good chunk of name/version-spacing in order to
isolate.


Re: YARN Shuffle service and its compatibility

2016-04-19 Thread Mark Grover
Thanks.

I'm more than happy to wait for more people to chime in here, but I do feel
that most of us are leaning towards Option B anyway. So, I created a JIRA
(SPARK-14731) for reverting SPARK-12130 in Spark 2.0 and will file a PR shortly.
Mark

On Tue, Apr 19, 2016 at 7:44 AM, Tom Graves 
wrote:

> It would be nice if we could keep this compatible between 1.6 and 2.0 so
> I'm more for Option B at this point since the change made seems minor and
> we can change to have shuffle service do internally like Marcelo mention.
> Then lets try to keep compatible, but if there is a forcing function lets
> figure out a good way to run 2 at once.
>
>
> Tom
>
>
> On Monday, April 18, 2016 5:23 PM, Marcelo Vanzin 
> wrote:
>
>
> On Mon, Apr 18, 2016 at 3:09 PM, Reynold Xin  wrote:
> > IIUC, the reason for that PR is that they found the string comparison to
> > increase the size in large shuffles. Maybe we should add the ability to
> > support the short name to Spark 1.6.2?
>
> Is that something that really yields noticeable gains in performance?
>
> If it is, it seems like it would be simple to allow executors register
> with the full class name, and map the long names to short names in the
> shuffle service itself.
>
> You could even get fancy and have different ExecutorShuffleInfo
> implementations for each shuffle service, with an abstract
> "getBlockData" method that gets called instead of the current if/else
> in ExternalShuffleBlockResolver.java.
>
>
> --
> Marcelo
>
>
>
>
>
>
>


RFC: Remote "HBaseTest" from examples?

2016-04-19 Thread Marcelo Vanzin
Hey all,

Two reasons why I think we should remove that from the examples:

- HBase now has Spark integration in its own repo, so that really
should be the template for how to use HBase from Spark, making that
example less useful, even misleading.

- It brings up a lot of extra dependencies that make the size of the
Spark distribution grow.

Any reason why we shouldn't drop that example?

-- 
Marcelo




Re: RFC: Remove "HBaseTest" from examples?

2016-04-19 Thread Ted Yu
Corrected typo in subject.

I want to note that the hbase-spark module in HBase is incomplete. Zhan has
several patches pending review.

The hbase-spark module is currently only in the master branch, which would be
released as 2.0. However, the release date for 2.0 is unclear; probably half
a year from now.

If we remove the example now, there would be no release from either
project showing users how to access HBase.

On Tue, Apr 19, 2016 at 10:15 AM, Marcelo Vanzin 
wrote:

> Hey all,
>
> Two reasons why I think we should remove that from the examples:
>
> - HBase now has Spark integration in its own repo, so that really
> should be the template for how to use HBase from Spark, making that
> example less useful, even misleading.
>
> - It brings up a lot of extra dependencies that make the size of the
> Spark distribution grow.
>
> Any reason why we shouldn't drop that example?
>
> --
> Marcelo
>
>
>


Re: RFC: Remove "HBaseTest" from examples?

2016-04-19 Thread Marcelo Vanzin
On Tue, Apr 19, 2016 at 10:20 AM, Ted Yu  wrote:
> I want to note that the hbase-spark module in HBase is incomplete. Zhan has
> several patches pending review.

I wouldn't call it "incomplete". Lots of functionality is there, which
doesn't mean new ones, or more efficient implementations of existing
ones, can't be added.

> hbase-spark module is currently only in master branch which would be
> released as 2.0

Just as a side note, it's part of CDH 5.7.0, not that it matters much
for upstream HBase.

-- 
Marcelo




Re: RFC: Remove "HBaseTest" from examples?

2016-04-19 Thread Ted Yu
bq. I wouldn't call it "incomplete".

I would call it incomplete.

Please see HBASE-15333, 'Enhance the filter to handle short, integer, long,
float and double', which is a bug fix.

Please exclude the presence of the related module in a vendor distro from
this discussion.

Thanks

On Tue, Apr 19, 2016 at 10:23 AM, Marcelo Vanzin 
wrote:

> On Tue, Apr 19, 2016 at 10:20 AM, Ted Yu  wrote:
> > I want to note that the hbase-spark module in HBase is incomplete. Zhan
> has
> > several patches pending review.
>
> I wouldn't call it "incomplete". Lots of functionality is there, which
> doesn't mean new ones, or more efficient implementations of existing
> ones, can't be added.
>
> > hbase-spark module is currently only in master branch which would be
> > released as 2.0
>
> Just as a side note, it's part of CDH 5.7.0, not that it matters much
> for upstream HBase.
>
> --
> Marcelo
>


Re: RFC: Remote "HBaseTest" from examples?

2016-04-19 Thread Reynold Xin
Yea, in general I feel examples that bring in a large number of dependencies
should be outside Spark.


On Tue, Apr 19, 2016 at 10:15 AM, Marcelo Vanzin 
wrote:

> Hey all,
>
> Two reasons why I think we should remove that from the examples:
>
> - HBase now has Spark integration in its own repo, so that really
> should be the template for how to use HBase from Spark, making that
> example less useful, even misleading.
>
> - It brings up a lot of extra dependencies that make the size of the
> Spark distribution grow.
>
> Any reason why we shouldn't drop that example?
>
> --
> Marcelo
>
>
>


Re: RFC: Remove "HBaseTest" from examples?

2016-04-19 Thread Marcelo Vanzin
On Tue, Apr 19, 2016 at 10:28 AM, Reynold Xin  wrote:
> Yea in general I feel examples that bring in a large amount of dependencies
> should be outside Spark.

Another option to avoid the dependency problem is to not ship examples
in the distribution, and maybe create a separate tarball for them;
removing HBaseTest only solves one of the dependency problems. Since
we have examples for flume and kafka, for example, the Spark
distribution ends up shipping flume and kafka jars (and a bunch of
other things).

> On Tue, Apr 19, 2016 at 10:15 AM, Marcelo Vanzin 
> wrote:
>>
>> Hey all,
>>
>> Two reasons why I think we should remove that from the examples:
>>
>> - HBase now has Spark integration in its own repo, so that really
>> should be the template for how to use HBase from Spark, making that
>> example less useful, even misleading.
>>
>> - It brings up a lot of extra dependencies that make the size of the
>> Spark distribution grow.
>>
>> Any reason why we shouldn't drop that example?



-- 
Marcelo




Re: RFC: Remove "HBaseTest" from examples?

2016-04-19 Thread Ted Yu
bq. create a separate tarball for them

Probably another thread can be started for the above.
I am fine with it.

On Tue, Apr 19, 2016 at 10:34 AM, Marcelo Vanzin 
wrote:

> On Tue, Apr 19, 2016 at 10:28 AM, Reynold Xin  wrote:
> > Yea in general I feel examples that bring in a large amount of
> dependencies
> > should be outside Spark.
>
> Another option to avoid the dependency problem is to not ship examples
> in the distribution, and maybe create a separate tarball for them;
> removing HBaseTest only solves one of the dependency problems. Since
> we have examples for flume and kafka, for example, the Spark
> distribution ends up shipping flume and kafka jars (and a bunch of
> other things).
>
> > On Tue, Apr 19, 2016 at 10:15 AM, Marcelo Vanzin 
> > wrote:
> >>
> >> Hey all,
> >>
> >> Two reasons why I think we should remove that from the examples:
> >>
> >> - HBase now has Spark integration in its own repo, so that really
> >> should be the template for how to use HBase from Spark, making that
> >> example less useful, even misleading.
> >>
> >> - It brings up a lot of extra dependencies that make the size of the
> >> Spark distribution grow.
> >>
> >> Any reason why we shouldn't drop that example?
>
>
>
> --
> Marcelo
>
>
>


Re: RFC: Remove "HBaseTest" from examples?

2016-04-19 Thread Marcelo Vanzin
Alright, if you prefer, I'll say "it's actually in use right now in
spite of not being in any upstream HBase release", and it's more
useful than a single example file in the Spark repo for those who
really want to integrate with HBase.

Spark's example is really very trivial (just uses one of HBase's input
formats), which makes it not very useful as a blueprint for developing
HBase apps with Spark.
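
For anyone who hasn't seen it, the example under discussion boils down to
roughly the following (paraphrased and trimmed from the HBaseTest example;
the table-creation boilerplate is omitted):

    import org.apache.hadoop.hbase.HBaseConfiguration
    import org.apache.hadoop.hbase.client.Result
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable
    import org.apache.hadoop.hbase.mapreduce.TableInputFormat
    import org.apache.spark.{SparkConf, SparkContext}

    object HBaseTest {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("HBaseTest"))
        val conf = HBaseConfiguration.create()
        conf.set(TableInputFormat.INPUT_TABLE, args(0))

        // The whole "integration": hand HBase's input format to Spark.
        val hBaseRDD = sc.newAPIHadoopRDD(conf, classOf[TableInputFormat],
          classOf[ImmutableBytesWritable], classOf[Result])
        println(hBaseRDD.count())   // the example just counts rows
        sc.stop()
      }
    }

That's essentially the entire surface area of the example, which is why it
says so little about building a real HBase application on Spark.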

On Tue, Apr 19, 2016 at 10:28 AM, Ted Yu  wrote:
> bq. I wouldn't call it "incomplete".
>
> I would call it incomplete.
>
> Please see HBASE-15333 'Enhance the filter to handle short, integer, long,
> float and double' which is a bug fix.
>
> Please exclude presence of related of module in vendor distro from this
> discussion.
>
> Thanks
>
> On Tue, Apr 19, 2016 at 10:23 AM, Marcelo Vanzin 
> wrote:
>>
>> On Tue, Apr 19, 2016 at 10:20 AM, Ted Yu  wrote:
>> > I want to note that the hbase-spark module in HBase is incomplete. Zhan
>> > has
>> > several patches pending review.
>>
>> I wouldn't call it "incomplete". Lots of functionality is there, which
>> doesn't mean new ones, or more efficient implementations of existing
>> ones, can't be added.
>>
>> > hbase-spark module is currently only in master branch which would be
>> > released as 2.0
>>
>> Just as a side note, it's part of CDH 5.7.0, not that it matters much
>> for upstream HBase.
>>
>> --
>> Marcelo
>
>



-- 
Marcelo




Re: RFC: Remove "HBaseTest" from examples?

2016-04-19 Thread Ted Yu
bq. it's actually in use right now in spite of not being in any upstream
HBase release

If it is not in upstream, then it is not relevant for discussion on an
Apache mailing list.

On Tue, Apr 19, 2016 at 10:38 AM, Marcelo Vanzin 
wrote:

> Alright, if you prefer, I'll say "it's actually in use right now in
> spite of not being in any upstream HBase release", and it's more
> useful than a single example file in the Spark repo for those who
> really want to integrate with HBase.
>
> Spark's example is really very trivial (just uses one of HBase's input
> formats), which makes it not very useful as a blueprint for developing
> HBase apps with Spark.
>
> On Tue, Apr 19, 2016 at 10:28 AM, Ted Yu  wrote:
> > bq. I wouldn't call it "incomplete".
> >
> > I would call it incomplete.
> >
> > Please see HBASE-15333 'Enhance the filter to handle short, integer,
> long,
> > float and double' which is a bug fix.
> >
> > Please exclude presence of related of module in vendor distro from this
> > discussion.
> >
> > Thanks
> >
> > On Tue, Apr 19, 2016 at 10:23 AM, Marcelo Vanzin 
> > wrote:
> >>
> >> On Tue, Apr 19, 2016 at 10:20 AM, Ted Yu  wrote:
> >> > I want to note that the hbase-spark module in HBase is incomplete.
> Zhan
> >> > has
> >> > several patches pending review.
> >>
> >> I wouldn't call it "incomplete". Lots of functionality is there, which
> >> doesn't mean new ones, or more efficient implementations of existing
> >> ones, can't be added.
> >>
> >> > hbase-spark module is currently only in master branch which would be
> >> > released as 2.0
> >>
> >> Just as a side note, it's part of CDH 5.7.0, not that it matters much
> >> for upstream HBase.
> >>
> >> --
> >> Marcelo
> >
> >
>
>
>
> --
> Marcelo
>


Re: RFC: Remote "HBaseTest" from examples?

2016-04-19 Thread Josh Rosen
+1; I think that it's preferable for code examples, especially third-party
integration examples, to live outside of Spark.

On Tue, Apr 19, 2016 at 10:29 AM Reynold Xin  wrote:

> Yea in general I feel examples that bring in a large amount of
> dependencies should be outside Spark.
>
>
> On Tue, Apr 19, 2016 at 10:15 AM, Marcelo Vanzin 
> wrote:
>
>> Hey all,
>>
>> Two reasons why I think we should remove that from the examples:
>>
>> - HBase now has Spark integration in its own repo, so that really
>> should be the template for how to use HBase from Spark, making that
>> example less useful, even misleading.
>>
>> - It brings up a lot of extra dependencies that make the size of the
>> Spark distribution grow.
>>
>> Any reason why we shouldn't drop that example?
>>
>> --
>> Marcelo
>>
>>
>>
>


Re: RFC: Remove "HBaseTest" from examples?

2016-04-19 Thread Reynold Xin
Ted - what's the "bq" thing? Are you using some 3rd-party (e.g. Atlassian)
syntax? It is not being rendered in email.


On Tue, Apr 19, 2016 at 10:41 AM, Ted Yu  wrote:

> bq. it's actually in use right now in spite of not being in any upstream
> HBase release
>
> If it is not in upstream, then it is not relevant for discussion on Apache
> mailing list.
>
> On Tue, Apr 19, 2016 at 10:38 AM, Marcelo Vanzin 
> wrote:
>
>> Alright, if you prefer, I'll say "it's actually in use right now in
>> spite of not being in any upstream HBase release", and it's more
>> useful than a single example file in the Spark repo for those who
>> really want to integrate with HBase.
>>
>> Spark's example is really very trivial (just uses one of HBase's input
>> formats), which makes it not very useful as a blueprint for developing
>> HBase apps with Spark.
>>
>> On Tue, Apr 19, 2016 at 10:28 AM, Ted Yu  wrote:
>> > bq. I wouldn't call it "incomplete".
>> >
>> > I would call it incomplete.
>> >
>> > Please see HBASE-15333 'Enhance the filter to handle short, integer,
>> long,
>> > float and double' which is a bug fix.
>> >
>> > Please exclude presence of related of module in vendor distro from this
>> > discussion.
>> >
>> > Thanks
>> >
>> > On Tue, Apr 19, 2016 at 10:23 AM, Marcelo Vanzin 
>> > wrote:
>> >>
>> >> On Tue, Apr 19, 2016 at 10:20 AM, Ted Yu  wrote:
>> >> > I want to note that the hbase-spark module in HBase is incomplete.
>> Zhan
>> >> > has
>> >> > several patches pending review.
>> >>
>> >> I wouldn't call it "incomplete". Lots of functionality is there, which
>> >> doesn't mean new ones, or more efficient implementations of existing
>> >> ones, can't be added.
>> >>
>> >> > hbase-spark module is currently only in master branch which would be
>> >> > released as 2.0
>> >>
>> >> Just as a side note, it's part of CDH 5.7.0, not that it matters much
>> >> for upstream HBase.
>> >>
>> >> --
>> >> Marcelo
>> >
>> >
>>
>>
>>
>> --
>> Marcelo
>>
>
>


Re: RFC: Remove "HBaseTest" from examples?

2016-04-19 Thread Marcelo Vanzin
You're entitled to your own opinions.

While you're at it, here's some much better documentation, from the
HBase project themselves, than what the Spark example provides:
http://hbase.apache.org/book.html#spark

On Tue, Apr 19, 2016 at 10:41 AM, Ted Yu  wrote:
> bq. it's actually in use right now in spite of not being in any upstream
> HBase release
>
> If it is not in upstream, then it is not relevant for discussion on Apache
> mailing list.
>
> On Tue, Apr 19, 2016 at 10:38 AM, Marcelo Vanzin 
> wrote:
>>
>> Alright, if you prefer, I'll say "it's actually in use right now in
>> spite of not being in any upstream HBase release", and it's more
>> useful than a single example file in the Spark repo for those who
>> really want to integrate with HBase.
>>
>> Spark's example is really very trivial (just uses one of HBase's input
>> formats), which makes it not very useful as a blueprint for developing
>> HBase apps with Spark.
>>
>> On Tue, Apr 19, 2016 at 10:28 AM, Ted Yu  wrote:
>> > bq. I wouldn't call it "incomplete".
>> >
>> > I would call it incomplete.
>> >
>> > Please see HBASE-15333 'Enhance the filter to handle short, integer,
>> > long,
>> > float and double' which is a bug fix.
>> >
>> > Please exclude presence of related of module in vendor distro from this
>> > discussion.
>> >
>> > Thanks
>> >
>> > On Tue, Apr 19, 2016 at 10:23 AM, Marcelo Vanzin 
>> > wrote:
>> >>
>> >> On Tue, Apr 19, 2016 at 10:20 AM, Ted Yu  wrote:
>> >> > I want to note that the hbase-spark module in HBase is incomplete.
>> >> > Zhan
>> >> > has
>> >> > several patches pending review.
>> >>
>> >> I wouldn't call it "incomplete". Lots of functionality is there, which
>> >> doesn't mean new ones, or more efficient implementations of existing
>> >> ones, can't be added.
>> >>
>> >> > hbase-spark module is currently only in master branch which would be
>> >> > released as 2.0
>> >>
>> >> Just as a side note, it's part of CDH 5.7.0, not that it matters much
>> >> for upstream HBase.
>> >>
>> >> --
>> >> Marcelo
>> >
>> >
>>
>>
>>
>> --
>> Marcelo
>
>



-- 
Marcelo




Re: RFC: Remove "HBaseTest" from examples?

2016-04-19 Thread Ted Yu
'bq.' is used in JIRA to quote what other people have said.

On Tue, Apr 19, 2016 at 10:42 AM, Reynold Xin  wrote:

> Ted - what's the "bq" thing? Are you using some 3rd party (e.g. Atlassian)
> syntax? They are not being rendered in email.
>
>
> On Tue, Apr 19, 2016 at 10:41 AM, Ted Yu  wrote:
>
>> bq. it's actually in use right now in spite of not being in any upstream
>> HBase release
>>
>> If it is not in upstream, then it is not relevant for discussion on
>> Apache mailing list.
>>
>> On Tue, Apr 19, 2016 at 10:38 AM, Marcelo Vanzin 
>> wrote:
>>
>>> Alright, if you prefer, I'll say "it's actually in use right now in
>>> spite of not being in any upstream HBase release", and it's more
>>> useful than a single example file in the Spark repo for those who
>>> really want to integrate with HBase.
>>>
>>> Spark's example is really very trivial (just uses one of HBase's input
>>> formats), which makes it not very useful as a blueprint for developing
>>> HBase apps with Spark.
>>>
>>> On Tue, Apr 19, 2016 at 10:28 AM, Ted Yu  wrote:
>>> > bq. I wouldn't call it "incomplete".
>>> >
>>> > I would call it incomplete.
>>> >
>>> > Please see HBASE-15333 'Enhance the filter to handle short, integer,
>>> long,
>>> > float and double' which is a bug fix.
>>> >
>>> > Please exclude presence of related of module in vendor distro from this
>>> > discussion.
>>> >
>>> > Thanks
>>> >
>>> > On Tue, Apr 19, 2016 at 10:23 AM, Marcelo Vanzin 
>>> > wrote:
>>> >>
>>> >> On Tue, Apr 19, 2016 at 10:20 AM, Ted Yu  wrote:
>>> >> > I want to note that the hbase-spark module in HBase is incomplete.
>>> Zhan
>>> >> > has
>>> >> > several patches pending review.
>>> >>
>>> >> I wouldn't call it "incomplete". Lots of functionality is there, which
>>> >> doesn't mean new ones, or more efficient implementations of existing
>>> >> ones, can't be added.
>>> >>
>>> >> > hbase-spark module is currently only in master branch which would be
>>> >> > released as 2.0
>>> >>
>>> >> Just as a side note, it's part of CDH 5.7.0, not that it matters much
>>> >> for upstream HBase.
>>> >>
>>> >> --
>>> >> Marcelo
>>> >
>>> >
>>>
>>>
>>>
>>> --
>>> Marcelo
>>>
>>
>>
>


Re: RFC: Remove "HBaseTest" from examples?

2016-04-19 Thread Ted Yu
There is an open JIRA for fixing the documentation: HBASE-15473.

I would say the refguide link you provided should not be considered complete.

Note it is marked as a Blocker by Sean B.

On Tue, Apr 19, 2016 at 10:43 AM, Marcelo Vanzin 
wrote:

> You're entitled to your own opinions.
>
> While you're at it, here's some much better documentation, from the
> HBase project themselves, than what the Spark example provides:
> http://hbase.apache.org/book.html#spark
>
> On Tue, Apr 19, 2016 at 10:41 AM, Ted Yu  wrote:
> > bq. it's actually in use right now in spite of not being in any upstream
> > HBase release
> >
> > If it is not in upstream, then it is not relevant for discussion on
> Apache
> > mailing list.
> >
> > On Tue, Apr 19, 2016 at 10:38 AM, Marcelo Vanzin 
> > wrote:
> >>
> >> Alright, if you prefer, I'll say "it's actually in use right now in
> >> spite of not being in any upstream HBase release", and it's more
> >> useful than a single example file in the Spark repo for those who
> >> really want to integrate with HBase.
> >>
> >> Spark's example is really very trivial (just uses one of HBase's input
> >> formats), which makes it not very useful as a blueprint for developing
> >> HBase apps with Spark.
> >>
> >> On Tue, Apr 19, 2016 at 10:28 AM, Ted Yu  wrote:
> >> > bq. I wouldn't call it "incomplete".
> >> >
> >> > I would call it incomplete.
> >> >
> >> > Please see HBASE-15333 'Enhance the filter to handle short, integer,
> >> > long,
> >> > float and double' which is a bug fix.
> >> >
> >> > Please exclude presence of related of module in vendor distro from
> this
> >> > discussion.
> >> >
> >> > Thanks
> >> >
> >> > On Tue, Apr 19, 2016 at 10:23 AM, Marcelo Vanzin  >
> >> > wrote:
> >> >>
> >> >> On Tue, Apr 19, 2016 at 10:20 AM, Ted Yu 
> wrote:
> >> >> > I want to note that the hbase-spark module in HBase is incomplete.
> >> >> > Zhan
> >> >> > has
> >> >> > several patches pending review.
> >> >>
> >> >> I wouldn't call it "incomplete". Lots of functionality is there,
> which
> >> >> doesn't mean new ones, or more efficient implementations of existing
> >> >> ones, can't be added.
> >> >>
> >> >> > hbase-spark module is currently only in master branch which would
> be
> >> >> > released as 2.0
> >> >>
> >> >> Just as a side note, it's part of CDH 5.7.0, not that it matters much
> >> >> for upstream HBase.
> >> >>
> >> >> --
> >> >> Marcelo
> >> >
> >> >
> >>
> >>
> >>
> >> --
> >> Marcelo
> >
> >
>
>
>
> --
> Marcelo
>


Re: RFC: Remove "HBaseTest" from examples?

2016-04-19 Thread Marcelo Vanzin
You're completely missing my point. I'm saying that HBase's current
support, even if there are bugs or things that still need to be done,
is much better than the Spark example, which is basically a call to
"SparkContext.hadoopRDD".

Spark's example is not helpful in learning how to build an HBase
application on Spark, and it clashes head-on with how the HBase
developers think it should be done. That, plus the fact that it brings in
too many dependencies for something that is not really useful, is why I'm
suggesting removing it.


On Tue, Apr 19, 2016 at 10:47 AM, Ted Yu  wrote:
> There is an Open JIRA for fixing the documentation: HBASE-15473
>
> I would say the refguide link you provided should not be considered as
> complete.
>
> Note it is marked as Blocker by Sean B.
>
> On Tue, Apr 19, 2016 at 10:43 AM, Marcelo Vanzin 
> wrote:
>>
>> You're entitled to your own opinions.
>>
>> While you're at it, here's some much better documentation, from the
>> HBase project themselves, than what the Spark example provides:
>> http://hbase.apache.org/book.html#spark
>>
>> On Tue, Apr 19, 2016 at 10:41 AM, Ted Yu  wrote:
>> > bq. it's actually in use right now in spite of not being in any upstream
>> > HBase release
>> >
>> > If it is not in upstream, then it is not relevant for discussion on
>> > Apache
>> > mailing list.
>> >
>> > On Tue, Apr 19, 2016 at 10:38 AM, Marcelo Vanzin 
>> > wrote:
>> >>
>> >> Alright, if you prefer, I'll say "it's actually in use right now in
>> >> spite of not being in any upstream HBase release", and it's more
>> >> useful than a single example file in the Spark repo for those who
>> >> really want to integrate with HBase.
>> >>
>> >> Spark's example is really very trivial (just uses one of HBase's input
>> >> formats), which makes it not very useful as a blueprint for developing
>> >> HBase apps with Spark.
>> >>
>> >> On Tue, Apr 19, 2016 at 10:28 AM, Ted Yu  wrote:
>> >> > bq. I wouldn't call it "incomplete".
>> >> >
>> >> > I would call it incomplete.
>> >> >
>> >> > Please see HBASE-15333 'Enhance the filter to handle short, integer,
>> >> > long,
>> >> > float and double' which is a bug fix.
>> >> >
>> >> > Please exclude presence of related of module in vendor distro from
>> >> > this
>> >> > discussion.
>> >> >
>> >> > Thanks
>> >> >
>> >> > On Tue, Apr 19, 2016 at 10:23 AM, Marcelo Vanzin
>> >> > 
>> >> > wrote:
>> >> >>
>> >> >> On Tue, Apr 19, 2016 at 10:20 AM, Ted Yu 
>> >> >> wrote:
>> >> >> > I want to note that the hbase-spark module in HBase is incomplete.
>> >> >> > Zhan
>> >> >> > has
>> >> >> > several patches pending review.
>> >> >>
>> >> >> I wouldn't call it "incomplete". Lots of functionality is there,
>> >> >> which
>> >> >> doesn't mean new ones, or more efficient implementations of existing
>> >> >> ones, can't be added.
>> >> >>
>> >> >> > hbase-spark module is currently only in master branch which would
>> >> >> > be
>> >> >> > released as 2.0
>> >> >>
>> >> >> Just as a side note, it's part of CDH 5.7.0, not that it matters
>> >> >> much
>> >> >> for upstream HBase.
>> >> >>
>> >> >> --
>> >> >> Marcelo
>> >> >
>> >> >
>> >>
>> >>
>> >>
>> >> --
>> >> Marcelo
>> >
>> >
>>
>>
>>
>> --
>> Marcelo
>
>



-- 
Marcelo




Re: RFC: Remove "HBaseTest" from examples?

2016-04-19 Thread Ted Yu
bq. HBase's current support, even if there are bugs or things that still
need to be done, is much better than the Spark example

In my opinion, a simple example that works is better than a buggy package.

I hope that before long the hbase-spark module in HBase can arrive at a state
that we can advertise as mature, but we're not there yet.

On Tue, Apr 19, 2016 at 10:50 AM, Marcelo Vanzin 
wrote:

> You're completely missing my point. I'm saying that HBase's current
> support, even if there are bugs or things that still need to be done,
> is much better than the Spark example, which is basically a call to
> "SparkContext.hadoopRDD".
>
> Spark's example is not helpful in learning how to build an HBase
> application on Spark, and clashes head on with how the HBase
> developers think it should be done. That, and because it brings too
> many dependencies for something that is not really useful, is why I'm
> suggesting removing it.
>
>
> On Tue, Apr 19, 2016 at 10:47 AM, Ted Yu  wrote:
> > There is an Open JIRA for fixing the documentation: HBASE-15473
> >
> > I would say the refguide link you provided should not be considered as
> > complete.
> >
> > Note it is marked as Blocker by Sean B.
> >
> > On Tue, Apr 19, 2016 at 10:43 AM, Marcelo Vanzin 
> > wrote:
> >>
> >> You're entitled to your own opinions.
> >>
> >> While you're at it, here's some much better documentation, from the
> >> HBase project themselves, than what the Spark example provides:
> >> http://hbase.apache.org/book.html#spark
> >>
> >> On Tue, Apr 19, 2016 at 10:41 AM, Ted Yu  wrote:
> >> > bq. it's actually in use right now in spite of not being in any
> upstream
> >> > HBase release
> >> >
> >> > If it is not in upstream, then it is not relevant for discussion on
> >> > Apache
> >> > mailing list.
> >> >
> >> > On Tue, Apr 19, 2016 at 10:38 AM, Marcelo Vanzin  >
> >> > wrote:
> >> >>
> >> >> Alright, if you prefer, I'll say "it's actually in use right now in
> >> >> spite of not being in any upstream HBase release", and it's more
> >> >> useful than a single example file in the Spark repo for those who
> >> >> really want to integrate with HBase.
> >> >>
> >> >> Spark's example is really very trivial (just uses one of HBase's
> input
> >> >> formats), which makes it not very useful as a blueprint for
> developing
> >> >> HBase apps with Spark.
> >> >>
> >> >> On Tue, Apr 19, 2016 at 10:28 AM, Ted Yu 
> wrote:
> >> >> > bq. I wouldn't call it "incomplete".
> >> >> >
> >> >> > I would call it incomplete.
> >> >> >
> >> >> > Please see HBASE-15333 'Enhance the filter to handle short,
> integer,
> >> >> > long,
> >> >> > float and double' which is a bug fix.
> >> >> >
> >> >> > Please exclude presence of related of module in vendor distro from
> >> >> > this
> >> >> > discussion.
> >> >> >
> >> >> > Thanks
> >> >> >
> >> >> > On Tue, Apr 19, 2016 at 10:23 AM, Marcelo Vanzin
> >> >> > 
> >> >> > wrote:
> >> >> >>
> >> >> >> On Tue, Apr 19, 2016 at 10:20 AM, Ted Yu 
> >> >> >> wrote:
> >> >> >> > I want to note that the hbase-spark module in HBase is
> incomplete.
> >> >> >> > Zhan
> >> >> >> > has
> >> >> >> > several patches pending review.
> >> >> >>
> >> >> >> I wouldn't call it "incomplete". Lots of functionality is there,
> >> >> >> which
> >> >> >> doesn't mean new ones, or more efficient implementations of
> existing
> >> >> >> ones, can't be added.
> >> >> >>
> >> >> >> > hbase-spark module is currently only in master branch which
> would
> >> >> >> > be
> >> >> >> > released as 2.0
> >> >> >>
> >> >> >> Just as a side note, it's part of CDH 5.7.0, not that it matters
> >> >> >> much
> >> >> >> for upstream HBase.
> >> >> >>
> >> >> >> --
> >> >> >> Marcelo
> >> >> >
> >> >> >
> >> >>
> >> >>
> >> >>
> >> >> --
> >> >> Marcelo
> >> >
> >> >
> >>
> >>
> >>
> >> --
> >> Marcelo
> >
> >
>
>
>
> --
> Marcelo
>


Re: RFC: Remove "HBaseTest" from examples?

2016-04-19 Thread Marcin Tustin
Let's posit that the Spark example is much better than what is available in
HBase. Why is that a reason to keep it within Spark?

On Tue, Apr 19, 2016 at 1:59 PM, Ted Yu  wrote:

> bq. HBase's current support, even if there are bugs or things that still
> need to be done, is much better than the Spark example
>
> In my opinion, a simple example that works is better than a buggy package.
>
> I hope before long the hbase-spark module in HBase can arrive at a state
> which we can advertise as mature - but we're not there yet.
>
> On Tue, Apr 19, 2016 at 10:50 AM, Marcelo Vanzin 
> wrote:
>
>> You're completely missing my point. I'm saying that HBase's current
>> support, even if there are bugs or things that still need to be done,
>> is much better than the Spark example, which is basically a call to
>> "SparkContext.hadoopRDD".
>>
>> Spark's example is not helpful in learning how to build an HBase
>> application on Spark, and clashes head on with how the HBase
>> developers think it should be done. That, and because it brings too
>> many dependencies for something that is not really useful, is why I'm
>> suggesting removing it.
>>
>>
>> On Tue, Apr 19, 2016 at 10:47 AM, Ted Yu  wrote:
>> > There is an Open JIRA for fixing the documentation: HBASE-15473
>> >
>> > I would say the refguide link you provided should not be considered as
>> > complete.
>> >
>> > Note it is marked as Blocker by Sean B.
>> >
>> > On Tue, Apr 19, 2016 at 10:43 AM, Marcelo Vanzin 
>> > wrote:
>> >>
>> >> You're entitled to your own opinions.
>> >>
>> >> While you're at it, here's some much better documentation, from the
>> >> HBase project themselves, than what the Spark example provides:
>> >> http://hbase.apache.org/book.html#spark
>> >>
>> >> On Tue, Apr 19, 2016 at 10:41 AM, Ted Yu  wrote:
>> >> > bq. it's actually in use right now in spite of not being in any
>> upstream
>> >> > HBase release
>> >> >
>> >> > If it is not in upstream, then it is not relevant for discussion on
>> >> > Apache
>> >> > mailing list.
>> >> >
>> >> > On Tue, Apr 19, 2016 at 10:38 AM, Marcelo Vanzin <
>> van...@cloudera.com>
>> >> > wrote:
>> >> >>
>> >> >> Alright, if you prefer, I'll say "it's actually in use right now in
>> >> >> spite of not being in any upstream HBase release", and it's more
>> >> >> useful than a single example file in the Spark repo for those who
>> >> >> really want to integrate with HBase.
>> >> >>
>> >> >> Spark's example is really very trivial (just uses one of HBase's
>> input
>> >> >> formats), which makes it not very useful as a blueprint for
>> developing
>> >> >> HBase apps with Spark.
>> >> >>
>> >> >> On Tue, Apr 19, 2016 at 10:28 AM, Ted Yu 
>> wrote:
>> >> >> > bq. I wouldn't call it "incomplete".
>> >> >> >
>> >> >> > I would call it incomplete.
>> >> >> >
>> >> >> > Please see HBASE-15333 'Enhance the filter to handle short,
>> integer,
>> >> >> > long,
>> >> >> > float and double' which is a bug fix.
>> >> >> >
>> >> >> > Please exclude presence of related of module in vendor distro from
>> >> >> > this
>> >> >> > discussion.
>> >> >> >
>> >> >> > Thanks
>> >> >> >
>> >> >> > On Tue, Apr 19, 2016 at 10:23 AM, Marcelo Vanzin
>> >> >> > 
>> >> >> > wrote:
>> >> >> >>
>> >> >> >> On Tue, Apr 19, 2016 at 10:20 AM, Ted Yu 
>> >> >> >> wrote:
>> >> >> >> > I want to note that the hbase-spark module in HBase is
>> incomplete.
>> >> >> >> > Zhan
>> >> >> >> > has
>> >> >> >> > several patches pending review.
>> >> >> >>
>> >> >> >> I wouldn't call it "incomplete". Lots of functionality is there,
>> >> >> >> which
>> >> >> >> doesn't mean new ones, or more efficient implementations of
>> existing
>> >> >> >> ones, can't be added.
>> >> >> >>
>> >> >> >> > hbase-spark module is currently only in master branch which
>> would
>> >> >> >> > be
>> >> >> >> > released as 2.0
>> >> >> >>
>> >> >> >> Just as a side note, it's part of CDH 5.7.0, not that it matters
>> >> >> >> much
>> >> >> >> for upstream HBase.
>> >> >> >>
>> >> >> >> --
>> >> >> >> Marcelo
>> >> >> >
>> >> >> >
>> >> >>
>> >> >>
>> >> >>
>> >> >> --
>> >> >> Marcelo
>> >> >
>> >> >
>> >>
>> >>
>> >>
>> >> --
>> >> Marcelo
>> >
>> >
>>
>>
>>
>> --
>> Marcelo
>>
>
>



Re: RFC: Remove "HBaseTest" from examples?

2016-04-19 Thread Ted Yu
The same question can be asked w.r.t. examples for other projects, such as
Flume and Kafka.

On Tue, Apr 19, 2016 at 11:01 AM, Marcin Tustin 
wrote:

> Let's posit that the spark example is much better than what is available
> in HBase. Why is that a reason to keep it within Spark?
>


Re: RFC: Remove "HBaseTest" from examples?

2016-04-19 Thread Marcelo Vanzin
On Tue, Apr 19, 2016 at 11:07 AM, Ted Yu  wrote:

> The same question can be asked w.r.t. examples for other projects, such as
> flume and kafka.
>

The main difference being that flume and kafka integration are part of
Spark itself. HBase integration is not.




Re: Organizing Spark ML example packages

2016-04-19 Thread Bryan Cutler
+1, adding some organization would make it easier for people to find a
specific example

On Mon, Apr 18, 2016 at 11:52 PM, Yanbo Liang  wrote:

> This sounds good to me, and it will make the ML examples much neater.
>
> 2016-04-14 5:28 GMT-07:00 Nick Pentreath :
>
>> Hey Spark devs
>>
>> I noticed that we now have a large number of examples for ML & MLlib in
>> the examples project - 57 for ML and 67 for MLlib, to be precise. This is
>> bound to get larger as we add features (though I know there are some PRs to
>> clean up duplicated examples).
>>
>> What do you think about organizing them into packages to match the use
>> case and the structure of the code base? e.g.
>>
>> org.apache.spark.examples.ml.recommendation
>>
>> org.apache.spark.examples.ml.feature
>>
>> and so on...
>>
>> Is it worth doing? The doc pages with include_example would need
>> updating, and the run_example script input would just need a slightly
>> different package name. Did I miss any potential issues?
>>
>> N
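
For concreteness, here is roughly what the move would look like for a single
example - a sketch only, assuming the existing ALSExample file; just the
package declaration (and the file's location) changes:

    // Sketch: moved from org.apache.spark.examples.ml into a subpackage
    // that mirrors the layout of the ML code base.
    package org.apache.spark.examples.ml.recommendation

    object ALSExample {
      def main(args: Array[String]): Unit = {
        // ... existing example body, unchanged ...
      }
    }

The run_example invocation would then gain one path segment, e.g.
"bin/run-example ml.recommendation.ALSExample" instead of
"bin/run-example ml.ALSExample", since class names are resolved relative
to org.apache.spark.examples.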


Re: RFC: Remove "HBaseTest" from examples?

2016-04-19 Thread Ted Yu
Clarification: in my previous email, I was not talking about the
spark-streaming-flume or spark-streaming-kafka artifacts.

I was talking about the examples for these projects, such as
examples//src/main/python/streaming/flume_wordcount.py

On Tue, Apr 19, 2016 at 11:10 AM, Marcelo Vanzin 
wrote:

> On Tue, Apr 19, 2016 at 11:07 AM, Ted Yu  wrote:
>
>> The same question can be asked w.r.t. examples for other projects, such
>> as flume and kafka.
>>
>
> The main difference being that flume and kafka integration are part of
> Spark itself. HBase integration is not.

Re: RFC: Remove "HBaseTest" from examples?

2016-04-19 Thread Marcelo Vanzin
On Tue, Apr 19, 2016 at 11:21 AM, Ted Yu  wrote:

> Clarification: in my previous email, I was not talking about the
> spark-streaming-flume or spark-streaming-kafka artifacts.
> I was talking about the examples for these projects, such as
> examples//src/main/python/streaming/flume_wordcount.py
>

I understand. And those examples show how to use code that is part
of Spark. HBaseTest just shows how to use a generic Spark API that can be
used to talk either to HBase or to anything else that has an InputFormat, so
it's much less useful as an example.

I'd put CassandraTest in that same category, although that particular
example at least shows more functionality than the HBase one.
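
For reference, the essence of the example under discussion is that single
generic call - roughly the sketch below, not a verbatim copy of
HBaseTest.scala:

    import org.apache.hadoop.hbase.HBaseConfiguration
    import org.apache.hadoop.hbase.client.Result
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable
    import org.apache.hadoop.hbase.mapreduce.TableInputFormat
    import org.apache.spark.{SparkConf, SparkContext}

    object HBaseTestSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("HBaseTest"))
        // Point HBase's TableInputFormat at the table to scan.
        val conf = HBaseConfiguration.create()
        conf.set(TableInputFormat.INPUT_TABLE, args(0))
        // The whole "integration" is one generic InputFormat call; nothing
        // here teaches you how to build a real HBase application.
        val hBaseRDD = sc.newAPIHadoopRDD(conf, classOf[TableInputFormat],
          classOf[ImmutableBytesWritable], classOf[Result])
        println(hBaseRDD.count())
        sc.stop()
      }
    }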




Re: Possible deadlock in registering applications in the recovery mode

2016-04-19 Thread Niranda Perera
Hi Reynold,

I have created a JIRA for this [1], along with a PR for the same
issue [2].

I would be very grateful if you could look into this, because it is a
blocker in our Spark deployment, which uses a number of custom Spark
extensions.

thanks
best

[1] https://issues.apache.org/jira/browse/SPARK-14736
[2] https://github.com/apache/spark/pull/12506

On Mon, Apr 18, 2016 at 9:02 AM, Reynold Xin  wrote:

> I haven't looked closely at this, but I think your proposal makes sense.
>
>
> On Sun, Apr 17, 2016 at 6:40 PM, Niranda Perera 
> wrote:
>
>> Hi guys,
>>
>> Any update on this?
>>
>> Best
>>
>> On Tue, Apr 12, 2016 at 12:46 PM, Niranda Perera <
>> niranda.per...@gmail.com> wrote:
>>
>>> Hi all,
>>>
>>> I have encountered a small issue in the standalone recovery mode.
>>>
>>> Let's say there was an application A running in the cluster. Due to some
>>> issue, the entire cluster, together with application A, goes down.
>>>
>>> Later on, the cluster comes back online, and the master goes into
>>> 'recovering' mode, because it sees from the persistence engine that some
>>> apps, workers and drivers were already in the cluster. During the recovery
>>> process, the application comes back online, but now it has a different ID,
>>> let's say B.
>>>
>>> But then, as per the master's application registration logic, this
>>> application B will NOT be added to 'waitingApps', with the message
>>> "Attempted to re-register application at same address". [1]
>>>
>>>   private def registerApplication(app: ApplicationInfo): Unit = {
>>>     val appAddress = app.driver.address
>>>     if (addressToApp.contains(appAddress)) {
>>>       logInfo("Attempted to re-register application at same address: " +
>>>         appAddress)
>>>       return
>>>     }
>>>
>>>
>>> The problem here is that the master is trying to recover application A,
>>> which is not there anymore, so after the recovery process app A will be
>>> dropped. However, app A's successor, app B, was also omitted from the
>>> 'waitingApps' list because it had the same address as app A previously.
>>>
>>> This creates a deadlock: neither app A nor app B is available in
>>> the cluster.
>>>
>>> When the master is in RECOVERING mode, shouldn't it add all the
>>> registering apps to a list first, and then, after the recovery is completed
>>> (once the unsuccessful recoveries are removed), deploy the apps that are
>>> new?
>>>
>>> This would sort out this deadlock, IMO.
>>>
>>> look forward to hearing from you.
>>>
>>> best
>>>
>>> [1]
>>> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/master/Master.scala#L834
>>>
>>> --
>>> Niranda
>>> @n1r44 
>>> +94-71-554-8430
>>> https://pythagoreanscript.wordpress.com/
>>>
>>
>>
>>
>> --
>> Niranda
>> @n1r44 
>> +94-71-554-8430
>> https://pythagoreanscript.wordpress.com/
>>
>
>


-- 
Niranda
@n1r44 
+94-71-554-8430
https://pythagoreanscript.wordpress.com/
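
To make the proposed buffering fix concrete, here is a minimal,
self-contained sketch (hypothetical names throughout - it only models the
relevant Master.scala logic, and is not the actual patch in the PR above):

    import scala.collection.mutable

    object RecoveryBufferSketch {
      case class AppInfo(id: String, driverAddress: String)

      object State extends Enumeration { val Recovering, Alive = Value }

      var state = State.Recovering
      val addressToApp = mutable.Map.empty[String, AppInfo]
      val pending = mutable.ArrayBuffer.empty[AppInfo]

      def registerApplication(app: AppInfo): Unit = {
        // While recovering, buffer registrations instead of checking the
        // address: a re-launched app (B) may reuse the address of an app
        // still being recovered (A) and would otherwise be rejected.
        if (state == State.Recovering) { pending += app; return }
        if (addressToApp.contains(app.driverAddress)) {
          println("Attempted to re-register application at same address: " +
            app.driverAddress)
          return
        }
        addressToApp(app.driverAddress) = app
      }

      def completeRecovery(unrecovered: Seq[AppInfo]): Unit = {
        // Drop the apps that never came back (app A frees its address) ...
        unrecovered.foreach(a => addressToApp.remove(a.driverAddress))
        state = State.Alive
        // ... and only then admit the buffered registrations (app B gets in).
        pending.foreach(registerApplication)
        pending.clear()
      }
    }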


Re: YARN Shuffle service and its compatibility

2016-04-19 Thread Reynold Xin
I talked to Lianhui offline and he said it is not that big of a deal to
revert the patch.


On Tue, Apr 19, 2016 at 9:52 AM, Mark Grover  wrote:

> Thanks.
>
> I'm more than happy to wait for more people to chime in here, but I do feel
> that most of us are leaning towards Option B anyway. So, I created a JIRA
> (SPARK-14731) for reverting SPARK-12130 in Spark 2.0 and will file a PR shortly.
> Mark
>
> On Tue, Apr 19, 2016 at 7:44 AM, Tom Graves 
> wrote:
>
>> It would be nice if we could keep this compatible between 1.6 and 2.0, so
>> I'm more for Option B at this point, since the change made seems minor
>> and we can have the shuffle service do it internally, as Marcelo
>> mentioned. Then let's try to stay compatible, but if there is a forcing
>> function, let's figure out a good way to run two at once.
>>
>>
>> Tom
>>
>>
>> On Monday, April 18, 2016 5:23 PM, Marcelo Vanzin 
>> wrote:
>>
>>
>> On Mon, Apr 18, 2016 at 3:09 PM, Reynold Xin  wrote:
>> > IIUC, the reason for that PR is that they found the string comparison
>> > increased the size in large shuffles. Maybe we should add the ability to
>> > support the short name to Spark 1.6.2?
>>
>> Is that something that really yields noticeable gains in performance?
>>
>> If it is, it seems like it would be simple to allow executors to register
>> with the full class name, and map the long names to short names in the
>> shuffle service itself.
>>
>> You could even get fancy and have different ExecutorShuffleInfo
>> implementations for each shuffle service, with an abstract
>> "getBlockData" method that gets called instead of the current if/else
>> in ExternalShuffleBlockResolver.java.
>>
>>
>> --
>> Marcelo
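
A sketch of what that mapping could look like inside the shuffle service
(hypothetical helper - the real resolver is Java, and this is not its code):

    object ShuffleNameCompat {
      // Normalize whatever name an executor registered with to the short
      // form, so executors that registered with either the long class name
      // or the short name can be served by one shuffle service.
      def normalize(registered: String): String = registered match {
        case "org.apache.spark.shuffle.sort.SortShuffleManager" => "sort"
        case "org.apache.spark.shuffle.hash.HashShuffleManager" => "hash"
        case short => short // assume anything else is already a short name
      }
    }

The getBlockData dispatch could then key off the normalized short name only,
regardless of which Spark version registered the executor.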


Re: YARN Shuffle service and its compatibility

2016-04-19 Thread Mark Grover
Great, thanks for confirming, Reynold. Appreciate it!

On Tue, Apr 19, 2016 at 4:20 PM, Reynold Xin  wrote:

> I talked to Lianhui offline and he said it is not that big of a deal to
> revert the patch.