Re: [MLlib] LogisticRegressionWithSGD and LogisticRegressionWithLBFGS converge with different weights.

2014-10-09 Thread DB Tsai
Nice to hear that your experiment is consistent with my assumption. The
current L1/L2 regularization penalizes the intercept as well, which is not
ideal. I'm working on GLMNET in Spark using OWLQN, and I can get exactly the
same solution as R, but with scalability in the number of rows and columns.
Stay tuned!
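
For anyone who wants to reproduce this kind of comparison in spark-shell, here
is a rough sketch against the Spark 1.1 MLlib API (the data path, split, and
iteration count are only illustrative, not a recommended setup):

import org.apache.spark.mllib.classification.{LogisticRegressionModel, LogisticRegressionWithLBFGS, LogisticRegressionWithSGD}
import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics
import org.apache.spark.mllib.util.MLUtils

// Load a libsvm-format binary classification file and split it.
val data = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_binary_classification_data.txt")
val Array(training, test) = data.randomSplit(Array(0.8, 0.2), seed = 42L)
training.cache()

val sgdModel   = LogisticRegressionWithSGD.train(training, 100) // numIterations is illustrative
val lbfgsModel = new LogisticRegressionWithLBFGS().run(training)

// clearThreshold() makes predict() return raw scores, which BinaryClassificationMetrics expects.
def auc(model: LogisticRegressionModel): Double = {
  model.clearThreshold()
  val scoreAndLabels = test.map(p => (model.predict(p.features), p.label))
  new BinaryClassificationMetrics(scoreAndLabels).areaUnderROC()
}

println(s"SGD AUC = ${auc(sgdModel)}, LBFGS AUC = ${auc(lbfgsModel)}")
// On separable data both AUCs can be ~1.0 while the weight vectors differ by roughly a scale factor.
println(sgdModel.weights)
println(lbfgsModel.weights)

Comparing the training losses, as suggested further down in the thread, is the
quickest way to tell an SGD convergence problem apart from genuine scale
ambiguity in the weights.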

Sincerely,

DB Tsai
---
My Blog: https://www.dbtsai.com
LinkedIn: https://www.linkedin.com/in/dbtsai


On Mon, Sep 29, 2014 at 11:45 AM, Yanbo Liang  wrote:
> Thank you for all your patient responses.
>
> I can conclude that if the data is totally separable or over-fitting occurs,
> the weights may be different.
> This is also consistent with my experiment.
>
> I have evaluated two different datasets, and the results are as follows:
> Loss function: LogisticGradient
> Regularizer: L2
> regParam: 1.0
> numIterations: 1 (SGD)
>
> Dataset 1: spark-1.1.0/data/mllib/sample_binary_classification_data.txt
> # of classes: 2
> # of samples: 100
> # of features: 692
> areaUnderROC of both SGD and LBFGS reaches nearly 1.0.
> The loss functions of both optimization methods converge to nearly
> 1.7147811767900675E-5 (very, very small).
> The weight vectors of the two optimization methods are different, but look
> roughly like scalar multiples of each other, just as DB Tsai mentioned above.
> It might be that this dataset is totally separable.
>
> Dataset 2:
> http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html#german.numer
> # of classes: 2
> # of samples: 1000
> # of features: 24
> areaUnderROC of both SGD and LBFGS is nearly 0.8.
> The loss functions of both optimization methods converge to nearly 0.5367041390107519.
> The weight vectors of the two optimization methods are essentially the same.
>
>
>
> 2014-09-29 16:05 GMT+08:00 DB Tsai :
>>
>> Can you check the loss of both the LBFGS and SGD implementations? One
>> reason may be that SGD doesn't converge well, and you can see that by
>> comparing the two log-likelihoods. Another potential reason may be that
>> the labels of your training data are totally separable, so you can always
>> increase the log-likelihood by multiplying the weights by a constant.
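>>
>> To spell out that standard argument: with labels y_i \in \{-1, +1\} and no
>> intercept or regularization, the negative log-likelihood is
>>
>>   L(w) = \sum_i \log( 1 + \exp( -y_i \, w^\top x_i ) ).
>>
>> If some w separates the data (y_i \, w^\top x_i > 0 for every i), then
>> L(c w) < L(w) for any c > 1 and L(c w) \to 0 as c \to \infty, so there is
>> no finite minimizer and different optimizers can legitimately stop at
>> differently scaled weight vectors.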
>>
>> Sincerely,
>>
>> DB Tsai
>> ---
>> My Blog: https://www.dbtsai.com
>> LinkedIn: https://www.linkedin.com/in/dbtsai
>>
>>
>> On Sun, Sep 28, 2014 at 11:48 AM, Yanbo Liang 
>> wrote:
>> > Hi
>> >
>> > We have used LogisticRegression with two different optimization methods,
>> > SGD and LBFGS, in MLlib.
>> > With the same dataset and the same training and test split, we get
>> > different weight vectors.
>> >
>> > For example, we use
>> > spark-1.1.0/data/mllib/sample_binary_classification_data.txt as our
>> > training
>> > and test dataset.
>> > With LogisticRegressionWithSGD and LogisticRegressionWithLBFGS as the
>> > training methods, and all other parameters the same.
>> >
>> > The precisions of these two methods are both nearly 100%, and the AUCs
>> > are also near 1.0.
>> > As far as I know, a convex optimization problem should converge to the
>> > global minimum. (We use SGD with a mini-batch fraction of 1.0.)
>> > But I got two different weight vectors. Is this expected, or does it make
>> > sense?
>
>

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Trouble running tests

2014-10-09 Thread Yana
Hi, apologies if I missed a FAQ somewhere.

I am trying to submit a bug fix for the very first time. Reading
instructions, I forked the git repo (at
c9ae79fba25cd49ca70ca398bc75434202d26a97) and am trying to run tests.

I run this: ./dev/run-tests  _SQL_TESTS_ONLY=true

and after a while get the following error: 

[info] ScalaTest
[info] Run completed in 3 minutes, 37 seconds.
[info] Total number of tests run: 224
[info] Suites: completed 19, aborted 0
[info] Tests: succeeded 224, failed 0, canceled 0, ignored 5, pending 0
[info] All tests passed.
[info] Passed: Total 224, Failed 0, Errors 0, Passed 224, Ignored 5
[success] Total time: 301 s, completed Oct 9, 2014 9:31:23 AM
[error] Expected ID character
[error] Not a valid command: hive-thriftserver
[error] Expected project ID
[error] Expected configuration
[error] Expected ':' (if selecting a configuration)
[error] Expected key
[error] Not a valid key: hive-thriftserver
[error] hive-thriftserver/test
[error]  ^


(I am running this without my changes)

I have 2 questions:
1. How do I fix this?
2. Is there a best practice on what to fork so you start off with a "good
state"? I'm wondering if I should sync the latest changes or go back to a
label?

thanks in advance




--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/Trouble-running-tests-tp8717.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Introduction to Spark Blog

2014-10-09 Thread devl.development
Hi Spark community

Having spent some time getting up to speed with the various Spark components
in the core package, I've written a blog to help other newcomers and
contributors.

By no means am I a Spark expert, so I would be grateful for any advice,
comments, or edit suggestions.

Thanks very much; here's the post.

http://batchinsights.wordpress.com/2014/10/09/a-short-dive-into-apache-spark/

Dev





--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/Introduction-to-Spark-Blog-tp8718.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: Trouble running tests

2014-10-09 Thread Nicholas Chammas
_RUN_SQL_TESTS needs to be true as well. Those two _... variables get set
correctly when tests are run on Jenkins. They’re not meant to be
manipulated directly by testers.

Did you want to run SQL tests only locally? You can try faking being
Jenkins by setting AMPLAB_JENKINS=true before calling run-tests. That
should be simpler than futzing with the _... variables.

Nick
​

On Thu, Oct 9, 2014 at 10:10 AM, Yana  wrote:

> Hi, apologies if I missed a FAQ somewhere.
>
> I am trying to submit a bug fix for the very first time. Reading
> instructions, I forked the git repo (at
> c9ae79fba25cd49ca70ca398bc75434202d26a97) and am trying to run tests.
>
> I run this: ./dev/run-tests  _SQL_TESTS_ONLY=true
>
> and after a while get the following error:
>
> [info] ScalaTest
> [info] Run completed in 3 minutes, 37 seconds.
> [info] Total number of tests run: 224
> [info] Suites: completed 19, aborted 0
> [info] Tests: succeeded 224, failed 0, canceled 0, ignored 5, pending 0
> [info] All tests passed.
> [info] Passed: Total 224, Failed 0, Errors 0, Passed 224, Ignored 5
> [success] Total time: 301 s, completed Oct 9, 2014 9:31:23 AM
> [error] Expected ID character
> [error] Not a valid command: hive-thriftserver
> [error] Expected project ID
> [error] Expected configuration
> [error] Expected ':' (if selecting a configuration)
> [error] Expected key
> [error] Not a valid key: hive-thriftserver
> [error] hive-thriftserver/test
> [error]  ^
>
>
> (I am running this without my changes)
>
> I have 2 questions:
> 1. How to fix this
> 2. Is there a best practice on what to fork so you start off with a "good
> state"? I'm wondering if I should sync the latest changes or go back to a
> label?
>
> thanks in advance
>
>
>
>
> --
> View this message in context:
> http://apache-spark-developers-list.1001551.n3.nabble.com/Trouble-running-tests-tp8717.html
> Sent from the Apache Spark Developers List mailing list archive at
> Nabble.com.
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>
>


Re: TorrentBroadcast slow performance

2014-10-09 Thread Guillaume Pitel

Hi,

Thanks to your answer, we've found the problem. It was reverse IP 
resolution on the drivers we used (a misconfiguration of the local 
bind9). Apparently, not being able to reverse-resolve the IP addresses of 
the nodes was the culprit of the 10s delay.


We've hit two other, secondary problems with TorrentBroadcast though, in 
case you're interested:


1 - Broadcasting a variable of about 2GB (1.8GB exactly) triggers a 
"java.lang.OutOfMemoryError: Requested array size exceeds VM limit", 
which is not the case with HttpBroadcast (I guess HttpBroadcast splits 
the serialized variable into small chunks).
2 - Memory use of Torrent seems to be higher than Http (i.e. switching 
from Http to Torrent triggers several OOMs).


Additionally, a question: while HttpBroadcast stores the broadcast 
pieces on disk (in spark.local.dir/spark-... ), TorrentBroadcast seems 
not to use a disk backend. Does that mean HttpBroadcast can handle 
broadcasts bigger than memory? If so, it's too bad that this design 
choice wasn't used for Torrent.


 That being said, hats off to the people in charge of the broadcast 
unloading w.r.t. the lineage; this stuff works great!


Guillaume
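
In case it is useful to others, the HttpBroadcast workaround mentioned below
can be selected per application through the spark.broadcast.factory setting
(Spark 1.x); a minimal sketch, with an illustrative app name and dataset:

import org.apache.spark.{SparkConf, SparkContext}

// Force HttpBroadcast instead of the 1.1.0 default TorrentBroadcast.
val conf = new SparkConf()
  .setAppName("http-broadcast-workaround")
  .set("spark.broadcast.factory", "org.apache.spark.broadcast.HttpBroadcastFactory")
val sc = new SparkContext(conf)

// Any sizable lookup structure; kept small here for illustration.
val lookup = (1 to 100000).map(i => (i, i.toString)).toMap
val bc = sc.broadcast(lookup)
val sample = sc.parallelize(1 to 10).map(i => bc.value.getOrElse(i, "miss")).collect()
println(sample.mkString(", "))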



Maybe there is a firewall issue that makes it slow for your nodes to connect through the IP 
addresses they're configured with. I see there's this 10 second pause between "Updated info of 
block broadcast_84_piece1" and "ensureFreeSpace(4194304) called" (where it actually 
receives the block). HTTP broadcast used only HTTP fetches from the executors to the driver, but 
TorrentBroadcast has connections between the executors themselves and between executors and the 
driver over a different port. Where are you running your driver app and nodes?

Matei

On Oct 7, 2014, at 11:42 AM, Davies Liu  wrote:


Could you create a JIRA for it? maybe it's a regression after
https://issues.apache.org/jira/browse/SPARK-3119.

We will appreciate that if you could tell how to reproduce it.

On Mon, Oct 6, 2014 at 1:27 AM, Guillaume Pitel
 wrote:

Hi,

I've had no answer to this on u...@spark.apache.org, so I post it on dev
before filing a JIRA (in case the problem or solution is already identified)

We've had some performance issues since switching to 1.1.0, and we finally
found the origin : TorrentBroadcast seems to be very slow in our setting
(and it became default with 1.1.0)

The logs of a 4MB variable with TorrentBroadcast : (15s)

14/10/01 15:47:13 INFO storage.MemoryStore: Block broadcast_84_piece1 stored
as bytes in memory (estimated size 171.6 KB, free 7.2 GB)
14/10/01 15:47:13 INFO storage.BlockManagerMaster: Updated info of block
broadcast_84_piece1
14/10/01 15:47:23 INFO storage.MemoryStore: ensureFreeSpace(4194304) called
with curMem=1401611984, maxMem=9168696115
14/10/01 15:47:23 INFO storage.MemoryStore: Block broadcast_84_piece0 stored
as bytes in memory (estimated size 4.0 MB, free 7.2 GB)
14/10/01 15:47:23 INFO storage.BlockManagerMaster: Updated info of block
broadcast_84_piece0
14/10/01 15:47:23 INFO broadcast.TorrentBroadcast: Reading broadcast
variable 84 took 15.202260006 s
14/10/01 15:47:23 INFO storage.MemoryStore: ensureFreeSpace(4371392) called
with curMem=1405806288, maxMem=9168696115
14/10/01 15:47:23 INFO storage.MemoryStore: Block broadcast_84 stored as
values in memory (estimated size 4.2 MB, free 7.2 GB)

(notice that a 10s lag happens after the "Updated info of block
broadcast_..." and before the MemoryStore log

And with HttpBroadcast (0.3s):

14/10/01 16:05:58 INFO broadcast.HttpBroadcast: Started reading broadcast
variable 147
14/10/01 16:05:58 INFO storage.MemoryStore: ensureFreeSpace(4369376) called
with curMem=1373493232, maxMem=9168696115
14/10/01 16:05:58 INFO storage.MemoryStore: Block broadcast_147 stored as
values in memory (estimated size 4.2 MB, free 7.3 GB)
14/10/01 16:05:58 INFO broadcast.HttpBroadcast: Reading broadcast variable
147 took 0.320907112 s 14/10/01 16:05:58 INFO storage.BlockManager: Found
block broadcast_147 locally

Since Torrent is supposed to perform much better than Http, we suspect a
configuration error from our side, but are unable to pin it down. Does
someone have any idea of the origin of the problem ?

For now we're sticking with the HttpBroadcast workaround.

Guillaume
--
Guillaume PITEL, Président
+33(0)626 222 431

eXenSa S.A.S.
41, rue Périer - 92120 Montrouge - FRANCE
Tel +33(0)184 163 677 / Fax +33(0)972 283 705

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org






Re: will/when Spark/SparkSQL will support ORCFile format

2014-10-09 Thread James Yu
For performance, will foreign data formats be supported the same as native ones?

Thanks,
James


On Wed, Oct 8, 2014 at 11:03 PM, Cheng Lian  wrote:

> The foreign data source API PR also matters here
> https://www.github.com/apache/spark/pull/2475
>
> Foreign data source like ORC can be added more easily and systematically
> after this PR is merged.
>
> On 10/9/14 8:22 AM, James Yu wrote:
>
>> Thanks Mark! I will keep eye on it.
>>
>> @Evan, I saw people use both format, so I really want to have Spark
>> support
>> ORCFile.
>>
>>
>> On Wed, Oct 8, 2014 at 11:12 AM, Mark Hamstra 
>> wrote:
>>
>>  https://github.com/apache/spark/pull/2576
>>>
>>>
>>>
>>> On Wed, Oct 8, 2014 at 11:01 AM, Evan Chan 
>>> wrote:
>>>
>>>  James,

 Michael at the meetup last night said there was some development
 activity around ORCFiles.

 I'm curious though, what are the pros and cons of ORCFiles vs Parquet?

 On Wed, Oct 8, 2014 at 10:03 AM, James Yu  wrote:

> Didn't see anyone asked the question before, but I was wondering if
>
 anyone

> knows if Spark/SparkSQL will support ORCFile format soon? ORCFile is
> getting more and more popular in the Hive world.
>
> Thanks,
> James
>
 -
 To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
 For additional commands, e-mail: dev-h...@spark.apache.org



>


Re: will/when Spark/SparkSQL will support ORCFile format

2014-10-09 Thread Michael Armbrust
Yes, the foreign sources work is only about exposing a stable set of APIs
for external libraries to link against (to avoid the spark assembly
becoming a dependency mess).  The code path these APIs use will be the same
as that for datasources included in the core spark sql library.

Michael

On Thu, Oct 9, 2014 at 2:18 PM, James Yu  wrote:

> For performance, will foreign data format support, same as native ones?
>
> Thanks,
> James
>
>
> On Wed, Oct 8, 2014 at 11:03 PM, Cheng Lian  wrote:
>
> > The foreign data source API PR also matters here
> > https://www.github.com/apache/spark/pull/2475
> >
> > Foreign data source like ORC can be added more easily and systematically
> > after this PR is merged.
> >
> > On 10/9/14 8:22 AM, James Yu wrote:
> >
> >> Thanks Mark! I will keep eye on it.
> >>
> >> @Evan, I saw people use both format, so I really want to have Spark
> >> support
> >> ORCFile.
> >>
> >>
> >> On Wed, Oct 8, 2014 at 11:12 AM, Mark Hamstra 
> >> wrote:
> >>
> >>  https://github.com/apache/spark/pull/2576
> >>>
> >>>
> >>>
> >>> On Wed, Oct 8, 2014 at 11:01 AM, Evan Chan 
> >>> wrote:
> >>>
> >>>  James,
> 
>  Michael at the meetup last night said there was some development
>  activity around ORCFiles.
> 
>  I'm curious though, what are the pros and cons of ORCFiles vs Parquet?
> 
>  On Wed, Oct 8, 2014 at 10:03 AM, James Yu  wrote:
> 
> > Didn't see anyone asked the question before, but I was wondering if
> >
>  anyone
> 
> > knows if Spark/SparkSQL will support ORCFile format soon? ORCFile is
> > getting more and more popular hi Hive world.
> >
> > Thanks,
> > James
> >
>  -
>  To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>  For additional commands, e-mail: dev-h...@spark.apache.org
> 
> 
> 
> >
>


Re: Fwd: Accumulator question

2014-10-09 Thread Josh Rosen
Hi Nathan,

You’re right, it looks like we don’t currently provide a method to unregister 
accumulators.  I’ve opened a JIRA to discuss a fix: 
https://issues.apache.org/jira/browse/SPARK-3885

In the meantime, here’s a workaround that might work:  Accumulators have a 
public setValue() method that can be called (only by the driver) to change an 
accumulator’s value.  You might be able to use this to reset accumulators’ 
values to smaller objects (e.g. the “zero” object of whatever your accumulator 
type is, or ‘null’ if you’re sure that the accumulator will never be accessed 
again).
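
For illustration, a minimal sketch of that pattern (the names and the query 
body are made up; setValue() can only be called on the driver):

import org.apache.spark.SparkContext._ // implicit AccumulatorParam[Long]

// Reuse one accumulator across queries instead of registering a new one per query.
val acc = sc.accumulator(0L)

def runQuery(): Long = {
  sc.parallelize(1 to 1000).foreach(n => acc += n.toLong) // accumulate on the executors
  val total = acc.value                                   // read the result on the driver
  acc.setValue(0L)                                        // reset to the "zero" value so the old result can be collected
  total
}

(1 to 3).foreach(q => println(s"query $q total = ${runQuery()}"))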

Hope this helps,
Josh

On October 8, 2014 at 2:54:33 PM, Nathan Kronenfeld 
(nkronenf...@oculusinfo.com) wrote:

I notice that accumulators register themselves with a private Accumulators  
object.  

I don't notice any way to unregister them when one is done.  

Am I missing something? If not, is there any plan for how to free up that  
memory?  

I've a case where we're gathering data from repeated queries using some  
relatively sizable accumulators; at the moment, we're creating one per  
query, and running out of memory after far too few queries.  

I've tried methods that don't involve accumulators; they involve a shuffle  
instead, and take 10x as long.  

Thanks,  
-Nathan  




--  
Nathan Kronenfeld  
Senior Visualization Developer  
Oculus Info Inc  
2 Berkeley Street, Suite 600,  
Toronto, Ontario M5A 4J5  
Phone: +1-416-203-3003 x 238  
Email: nkronenf...@oculusinfo.com  


Re: Trouble running tests

2014-10-09 Thread Michael Armbrust
Also, in general for SQL-only changes it is sufficient to run "sbt/sbt
catalyst/test sql/test hive/test".  The "hive/test" part takes the
longest, so I usually leave that out until just before submitting unless my
changes are Hive-specific.

On Thu, Oct 9, 2014 at 11:40 AM, Nicholas Chammas <
nicholas.cham...@gmail.com> wrote:

> _RUN_SQL_TESTS needs to be true as well. Those two _... variables set get
> correctly when tests are run on Jenkins. They’re not meant to be
> manipulated directly by testers.
>
> Did you want to run SQL tests only locally? You can try faking being
> Jenkins by setting AMPLAB_JENKINS=true before calling run-tests. That
> should be simpler than futzing with the _... variables.
>
> Nick
> ​
>
> On Thu, Oct 9, 2014 at 10:10 AM, Yana  wrote:
>
> > Hi, apologies if I missed a FAQ somewhere.
> >
> > I am trying to submit a bug fix for the very first time. Reading
> > instructions, I forked the git repo (at
> > c9ae79fba25cd49ca70ca398bc75434202d26a97) and am trying to run tests.
> >
> > I run this: ./dev/run-tests  _SQL_TESTS_ONLY=true
> >
> > and after a while get the following error:
> >
> > [info] ScalaTest
> > [info] Run completed in 3 minutes, 37 seconds.
> > [info] Total number of tests run: 224
> > [info] Suites: completed 19, aborted 0
> > [info] Tests: succeeded 224, failed 0, canceled 0, ignored 5, pending 0
> > [info] All tests passed.
> > [info] Passed: Total 224, Failed 0, Errors 0, Passed 224, Ignored 5
> > [success] Total time: 301 s, completed Oct 9, 2014 9:31:23 AM
> > [error] Expected ID character
> > [error] Not a valid command: hive-thriftserver
> > [error] Expected project ID
> > [error] Expected configuration
> > [error] Expected ':' (if selecting a configuration)
> > [error] Expected key
> > [error] Not a valid key: hive-thriftserver
> > [error] hive-thriftserver/test
> > [error]  ^
> >
> >
> > (I am running this without my changes)
> >
> > I have 2 questions:
> > 1. How to fix this
> > 2. Is there a best practice on what to fork so you start off with a "good
> > state"? I'm wondering if I should sync the latest changes or go back to a
> > label?
> >
> > thanks in advance
> >
> >
> >
> >
> > --
> > View this message in context:
> >
> http://apache-spark-developers-list.1001551.n3.nabble.com/Trouble-running-tests-tp8717.html
> > Sent from the Apache Spark Developers List mailing list archive at
> > Nabble.com.
> >
> > -
> > To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> > For additional commands, e-mail: dev-h...@spark.apache.org
> >
> >
>


Re: TorrentBroadcast slow performance

2014-10-09 Thread Matei Zaharia
Thanks for the feedback. For 1, there is an open patch: 
https://github.com/apache/spark/pull/2659. For 2, broadcast blocks actually use 
MEMORY_AND_DISK storage, so they will spill to disk if you have low memory, but 
they're faster to access otherwise.

Matei

On Oct 9, 2014, at 12:11 PM, Guillaume Pitel  wrote:

> Hi,
> 
> Thanks to your answer, we've found the problem. It was on reverse IP 
> resolution on the drivers we used (wrong configuration of the local bind9). 
> Apparently, not being able to reverse-resolve the IP address of the nodes was 
> the culprit of the 10s delay.
> 
> We've hit two other secondary problems with TorrentBroadcast though, in case 
> you're interested  :
> 
> 1 - Broadcasting a variable of about 2GB (1.8GB exactly) triggers a 
> "java.lang.OutOfMemoryError: Requested array size exceeds VM limit", which is 
> not the case with HttpBroadcast (I guess HttpBroadcast splits the serialized 
> variable in small chunks)
> 2 - Memory use of Torrent seems to be higher than Http (i.e. switching from 
> Http to Torrent triggers several OOM).
> 
> Additionally, a question : while HttpBroadcast stores the broadcast pieces on 
> disk (in spark.local.dir/spark-... ), TorrentBroadcast seems not to use disk 
> backend storage. Does it mean that HttpBroadcast can handle bigger broadcast 
> out of memory ? If so, it's too bad that this design choice wasn't used for 
> Torrent.
> 
> That being said, hats off to the people in charge of the broadcast unloading 
> wrt the lineage, this stuff works great !
> 
> Guillaume
> 
> 
>> Maybe there is a firewall issue that makes it slow for your nodes to connect 
>> through the IP addresses they're configured with. I see there's this 10 
>> second pause between "Updated info of block broadcast_84_piece1" and 
>> "ensureFreeSpace(4194304) called" (where it actually receives the block). 
>> HTTP broadcast used only HTTP fetches from the executors to the driver, but 
>> TorrentBroadcast has connections between the executors themselves and 
>> between executors and the driver over a different port. Where are you 
>> running your driver app and nodes?
>> 
>> Matei
>> 
>> On Oct 7, 2014, at 11:42 AM, Davies Liu  wrote:
>> 
>>> Could you create a JIRA for it? maybe it's a regression after
>>> https://issues.apache.org/jira/browse/SPARK-3119.
>>> 
>>> We will appreciate that if you could tell how to reproduce it.
>>> 
>>> On Mon, Oct 6, 2014 at 1:27 AM, Guillaume Pitel
>>>  wrote:
 Hi,
 
 I've had no answer to this on u...@spark.apache.org, so I post it on dev
 before filing a JIRA (in case the problem or solution is already 
 identified)
 
 We've had some performance issues since switching to 1.1.0, and we finally
 found the origin : TorrentBroadcast seems to be very slow in our setting
 (and it became default with 1.1.0)
 
 The logs of a 4MB variable with TorrentBroadcast : (15s)
 
 14/10/01 15:47:13 INFO storage.MemoryStore: Block broadcast_84_piece1 
 stored
 as bytes in memory (estimated size 171.6 KB, free 7.2 GB)
 14/10/01 15:47:13 INFO storage.BlockManagerMaster: Updated info of block
 broadcast_84_piece1
 14/10/01 15:47:23 INFO storage.MemoryStore: ensureFreeSpace(4194304) called
 with curMem=1401611984, maxMem=9168696115
 14/10/01 15:47:23 INFO storage.MemoryStore: Block broadcast_84_piece0 
 stored
 as bytes in memory (estimated size 4.0 MB, free 7.2 GB)
 14/10/01 15:47:23 INFO storage.BlockManagerMaster: Updated info of block
 broadcast_84_piece0
 14/10/01 15:47:23 INFO broadcast.TorrentBroadcast: Reading broadcast
 variable 84 took 15.202260006 s
 14/10/01 15:47:23 INFO storage.MemoryStore: ensureFreeSpace(4371392) called
 with curMem=1405806288, maxMem=9168696115
 14/10/01 15:47:23 INFO storage.MemoryStore: Block broadcast_84 stored as
 values in memory (estimated size 4.2 MB, free 7.2 GB)
 
 (notice that a 10s lag happens after the "Updated info of block
 broadcast_..." and before the MemoryStore log
 
 And with HttpBroadcast (0.3s):
 
 14/10/01 16:05:58 INFO broadcast.HttpBroadcast: Started reading broadcast
 variable 147
 14/10/01 16:05:58 INFO storage.MemoryStore: ensureFreeSpace(4369376) called
 with curMem=1373493232, maxMem=9168696115
 14/10/01 16:05:58 INFO storage.MemoryStore: Block broadcast_147 stored as
 values in memory (estimated size 4.2 MB, free 7.3 GB)
 14/10/01 16:05:58 INFO broadcast.HttpBroadcast: Reading broadcast variable
 147 took 0.320907112 s 14/10/01 16:05:58 INFO storage.BlockManager: Found
 block broadcast_147 locally
 
 Since Torrent is supposed to perform much better than Http, we suspect a
 configuration error from our side, but are unable to pin it down. Does
 someone have any idea of the origin of the problem ?
 
 For now we're sticking with the HttpBroadcast workaround.
 
 Guillaume
 --
 Guillaume PI

Re: TorrentBroadcast slow performance

2014-10-09 Thread Matei Zaharia
Oops, I forgot to add: for 2, maybe we can add a flag to use DISK_ONLY for 
TorrentBroadcast, or use it automatically if the broadcasts are bigger than some size.

Matei

On Oct 9, 2014, at 3:04 PM, Matei Zaharia  wrote:

> Thanks for the feedback. For 1, there is an open patch: 
> https://github.com/apache/spark/pull/2659. For 2, broadcast blocks actually 
> use MEMORY_AND_DISK storage, so they will spill to disk if you have low 
> memory, but they're faster to access otherwise.
> 
> Matei
> 
> On Oct 9, 2014, at 12:11 PM, Guillaume Pitel  
> wrote:
> 
>> Hi,
>> 
>> Thanks to your answer, we've found the problem. It was on reverse IP 
>> resolution on the drivers we used (wrong configuration of the local bind9). 
>> Apparently, not being able to reverse-resolve the IP address of the nodes 
>> was the culprit of the 10s delay.
>> 
>> We've hit two other secondary problems with TorrentBroadcast though, in case 
>> you're interested  :
>> 
>> 1 - Broadcasting a variable of about 2GB (1.8GB exactly) triggers a 
>> "java.lang.OutOfMemoryError: Requested array size exceeds VM limit", which 
>> is not the case with HttpBroadcast (I guess HttpBroadcast splits the 
>> serialized variable in small chunks)
>> 2 - Memory use of Torrent seems to be higher than Http (i.e. switching from 
>> Http to Torrent triggers several OOM).
>> 
>> Additionally, a question : while HttpBroadcast stores the broadcast pieces 
>> on disk (in spark.local.dir/spark-... ), TorrentBroadcast seems not to use 
>> disk backend storage. Does it mean that HttpBroadcast can handle bigger 
>> broadcast out of memory ? If so, it's too bad that this design choice wasn't 
>> used for Torrent.
>> 
>> That being said, hats off to the people in charge of the broadcast unloading 
>> wrt the lineage, this stuff works great !
>> 
>> Guillaume
>> 
>> 
>>> Maybe there is a firewall issue that makes it slow for your nodes to 
>>> connect through the IP addresses they're configured with. I see there's 
>>> this 10 second pause between "Updated info of block broadcast_84_piece1" 
>>> and "ensureFreeSpace(4194304) called" (where it actually receives the 
>>> block). HTTP broadcast used only HTTP fetches from the executors to the 
>>> driver, but TorrentBroadcast has connections between the executors 
>>> themselves and between executors and the driver over a different port. 
>>> Where are you running your driver app and nodes?
>>> 
>>> Matei
>>> 
>>> On Oct 7, 2014, at 11:42 AM, Davies Liu  wrote:
>>> 
 Could you create a JIRA for it? maybe it's a regression after
 https://issues.apache.org/jira/browse/SPARK-3119.
 
 We will appreciate that if you could tell how to reproduce it.
 
 On Mon, Oct 6, 2014 at 1:27 AM, Guillaume Pitel
  wrote:
> Hi,
> 
> I've had no answer to this on u...@spark.apache.org, so I post it on dev
> before filing a JIRA (in case the problem or solution is already 
> identified)
> 
> We've had some performance issues since switching to 1.1.0, and we finally
> found the origin : TorrentBroadcast seems to be very slow in our setting
> (and it became default with 1.1.0)
> 
> The logs of a 4MB variable with TorrentBroadcast : (15s)
> 
> 14/10/01 15:47:13 INFO storage.MemoryStore: Block broadcast_84_piece1 
> stored
> as bytes in memory (estimated size 171.6 KB, free 7.2 GB)
> 14/10/01 15:47:13 INFO storage.BlockManagerMaster: Updated info of block
> broadcast_84_piece1
> 14/10/01 15:47:23 INFO storage.MemoryStore: ensureFreeSpace(4194304) 
> called
> with curMem=1401611984, maxMem=9168696115
> 14/10/01 15:47:23 INFO storage.MemoryStore: Block broadcast_84_piece0 
> stored
> as bytes in memory (estimated size 4.0 MB, free 7.2 GB)
> 14/10/01 15:47:23 INFO storage.BlockManagerMaster: Updated info of block
> broadcast_84_piece0
> 14/10/01 15:47:23 INFO broadcast.TorrentBroadcast: Reading broadcast
> variable 84 took 15.202260006 s
> 14/10/01 15:47:23 INFO storage.MemoryStore: ensureFreeSpace(4371392) 
> called
> with curMem=1405806288, maxMem=9168696115
> 14/10/01 15:47:23 INFO storage.MemoryStore: Block broadcast_84 stored as
> values in memory (estimated size 4.2 MB, free 7.2 GB)
> 
> (notice that a 10s lag happens after the "Updated info of block
> broadcast_..." and before the MemoryStore log
> 
> And with HttpBroadcast (0.3s):
> 
> 14/10/01 16:05:58 INFO broadcast.HttpBroadcast: Started reading broadcast
> variable 147
> 14/10/01 16:05:58 INFO storage.MemoryStore: ensureFreeSpace(4369376) 
> called
> with curMem=1373493232, maxMem=9168696115
> 14/10/01 16:05:58 INFO storage.MemoryStore: Block broadcast_147 stored as
> values in memory (estimated size 4.2 MB, free 7.3 GB)
> 14/10/01 16:05:58 INFO broadcast.HttpBroadcast: Reading broadcast variable
> 147 took 0.320907112 s 14/10/01 16:05:58 INFO storage.BlockManager: Found
> block broadcast_147 local

Re: will/when Spark/SparkSQL will support ORCFile format

2014-10-09 Thread James Yu
Sounds great, thanks!



On Thu, Oct 9, 2014 at 2:22 PM, Michael Armbrust 
wrote:

> Yes, the foreign sources work is only about exposing a stable set of APIs
> for external libraries to link against (to avoid the spark assembly
> becoming a dependency mess).  The code path these APIs use will be the same
> as that for datasources included in the core spark sql library.
>
> Michael
>
> On Thu, Oct 9, 2014 at 2:18 PM, James Yu  wrote:
>
>> For performance, will foreign data format support, same as native ones?
>>
>> Thanks,
>> James
>>
>>
>> On Wed, Oct 8, 2014 at 11:03 PM, Cheng Lian 
>> wrote:
>>
>> > The foreign data source API PR also matters here
>> > https://www.github.com/apache/spark/pull/2475
>> >
>> > Foreign data source like ORC can be added more easily and systematically
>> > after this PR is merged.
>> >
>> > On 10/9/14 8:22 AM, James Yu wrote:
>> >
>> >> Thanks Mark! I will keep eye on it.
>> >>
>> >> @Evan, I saw people use both format, so I really want to have Spark
>> >> support
>> >> ORCFile.
>> >>
>> >>
>> >> On Wed, Oct 8, 2014 at 11:12 AM, Mark Hamstra > >
>> >> wrote:
>> >>
>> >>  https://github.com/apache/spark/pull/2576
>> >>>
>> >>>
>> >>>
>> >>> On Wed, Oct 8, 2014 at 11:01 AM, Evan Chan 
>> >>> wrote:
>> >>>
>> >>>  James,
>> 
>>  Michael at the meetup last night said there was some development
>>  activity around ORCFiles.
>> 
>>  I'm curious though, what are the pros and cons of ORCFiles vs
>> Parquet?
>> 
>>  On Wed, Oct 8, 2014 at 10:03 AM, James Yu  wrote:
>> 
>> > Didn't see anyone asked the question before, but I was wondering if
>> >
>>  anyone
>> 
>> > knows if Spark/SparkSQL will support ORCFile format soon? ORCFile is
>> > getting more and more popular hi Hive world.
>> >
>> > Thanks,
>> > James
>> >
>>  -
>>  To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>>  For additional commands, e-mail: dev-h...@spark.apache.org
>> 
>> 
>> 
>> >
>>
>
>


spark-prs and mesos/spark-ec2

2014-10-09 Thread Nicholas Chammas
Does it make sense to point the Spark PR review board to read from
mesos/spark-ec2 as well? PRs submitted against that repo may reference
Spark JIRAs and need review just like any other Spark PR.

Nick


[Spark SQL] Strange NPE in Spark SQL with Hive

2014-10-09 Thread Trident
Hi Community,

  I use Spark 1.0.2, and I'm using Spark SQL to run Hive SQL queries.

  When I run the following code in Spark Shell:

val file = sc.textFile("./README.md")
val count = file.flatMap(line => line.split(" ")).map(word => (word, 
1)).reduceByKey(_+_)
count.collect()
‍
  Correct and no error!

  When I run the following code:
val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
hiveContext.hql("SHOW TABLES").collect().foreach(println)‍

  Correct and no error!

  But when I run:
val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
hiveContext.hql("SELECT COUNT(*) from uservisits").collect().foreach(println)‍

  It comes with some error messages.


  What I found was the following error:  
14/10/09 19:47:34 ERROR Executor: Exception in task ID 4
java.lang.NullPointerException
    at org.apache.spark.rdd.RDD$$anonfun$15.apply(RDD.scala:594)
    at org.apache.spark.rdd.RDD$$anonfun$15.apply(RDD.scala:594)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:111)
    at org.apache.spark.scheduler.Task.run(Task.scala:51)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:183)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
14/10/09 19:47:34 INFO CoarseGrainedExecutorBackend: Got assigned task 5
14/10/09 19:47:34 INFO Executor: Running task ID 5
14/10/09 19:47:34 DEBUG BlockManager: Getting local block broadcast_1
14/10/09 19:47:34 DEBUG BlockManager: Level for block broadcast_1 is StorageLevel(true, true, false, true, 1)
14/10/09 19:47:34 DEBUG BlockManager: Getting block broadcast_1 from memory
14/10/09 19:47:34 INFO BlockManager: Found block broadcast_1 locally
14/10/09 19:47:34 INFO BlockFetcherIterator$BasicBlockFetcherIterator: maxBytesInFlight: 50331648, targetRequestSize: 10066329
14/10/09 19:47:34 INFO BlockFetcherIterator$BasicBlockFetcherIterator: Getting 2 non-empty blocks out of 2 blocks
14/10/09 19:47:34 DEBUG BlockFetcherIterator$BasicBlockFetcherIterator: Sending request for 2 blocks (2.5 KB) from node19:50868
14/10/09 19:47:34 DEBUG BlockMessageArray: Adding BlockMessage [type = 1, id = shuffle_0_0_1, level = null, data = null]
14/10/09 19:47:34 DEBUG BlockMessageArray: Added BufferMessage(id = 5, size = 34)
14/10/09 19:47:34 DEBUG BlockMessageArray: Adding BlockMessage [type = 1, id = shuffle_0_1_1, level = null, data = null]
14/10/09 19:47:34 DEBUG BlockMessageArray: Added BufferMessage(id = 6, size = 34)
14/10/09 19:47:34 DEBUG BlockMessageArray: Buffer list:
14/10/09 19:47:34 DEBUG BlockMessageArray: java.nio.HeapByteBuffer[pos=0 lim=4 cap=4]
14/10/09 19:47:34 DEBUG BlockMessageArray: java.nio.HeapByteBuffer[pos=0 lim=34 cap=34]
14/10/09 19:47:34 DEBUG BlockMessageArray: java.nio.HeapByteBuffer[pos=0 lim=4 cap=4]
14/10/09 19:47:34 DEBUG BlockMessageArray: java.nio.HeapByteBuffer[pos=0 lim=34 cap=34]
14/10/09 19:47:34 INFO BlockFetcherIterator$BasicBlockFetcherIterator: Started 1 remote fetches in 2 ms
14/10/09 19:47:34 DEBUG BlockFetcherIterator$BasicBlockFetcherIterator: Got local blocks in 0 ms
14/10/09 19:47:34 ERROR Executor: Exception in task ID 5
java.lang.NullPointerException
    at org.apache.spark.rdd.RDD$$anonfun$15.apply(RDD.scala:594)
    at org.apache.spark.rdd.RDD$$anonfun$15.apply(RDD.scala:594)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:111)
    at org.apache.spark.scheduler.Task.run(Task.scala:51)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:183)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
14/10/09 19:47:34 INFO CoarseGrainedExecutorBackend: Got assigned task 6
14/10/09 19:47:34 INFO Executor: Running task ID 6
14/10/09 19:47:34 DEBUG BlockManager: Getting local block broadcast_1
14/10/09 19:47:34 DEBUG BlockManager: Level for block broadcast_1 is StorageLevel(true, true, false, true, 1)
14/10/09 19:47:34 DEBUG BlockManager: Getting block broadcast_1 from memory
14/10/09 19:47:34 INFO BlockManager: Found block broadcast_1 locally
14/10/09 19:47:34 INFO BlockFetcherIterator$BasicBlockFetcherIterator: maxBytesInFlight: 50331648, targetRequestSize: 10066329
14/10/09 19:47:34 INFO BlockFetcherIterator$BasicBlockFetcherItera

[Spark SQL Continue] Sorry, it is not only limited in SQL, may due to network

2014-10-09 Thread Trident
Dear Community,

   Please ignore my last post about Spark SQL.

   When I run:
val file = sc.textFile("./README.md")
val count = file.flatMap(line => line.split(" ")).map(word => (word, 
1)).reduceByKey(_+_)
count.collect()

it happens too.

Is there any possible reason for that? We made some adjustments to the 
network last night.


 Chen Weikeng
14/10/09 20:45:23 ERROR Executor: Exception in task ID 1
java.lang.NullPointerException
    at org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:571)
    at org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:571)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
    at org.apache.spark.sql.SchemaRDD.compute(SchemaRDD.scala:116)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
    at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:111)
    at org.apache.spark.scheduler.Task.run(Task.scala:51)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:183)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
14/10/09 20:45:23 INFO CoarseGrainedExecutorBackend: Got assigned task 2
14/10/09 20:45:23 INFO Executor: Running task ID 2
14/10/09 20:45:23 DEBUG BlockManager: Getting local block broadcast_0
14/10/09 20:45:23 DEBUG BlockManager: Level for block broadcast_0 is StorageLevel(true, true, false, true, 1)
14/10/09 20:45:23 DEBUG BlockManager: Getting block broadcast_0 from memory
14/10/09 20:45:23 INFO BlockManager: Found block broadcast_0 locally
14/10/09 20:45:23 DEBUG Executor: Task 2's epoch is 0
14/10/09 20:45:23 INFO HadoopRDD: Input split: file:/public/rdma14/app/spark-rdma/examples/src/main/resources/people.txt:16+16
14/10/09 20:45:23 ERROR Executor: Exception in task ID 2
java.lang.NullPointerException
    at org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:571)
    at org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:571)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
    at org.apache.spark.sql.SchemaRDD.compute(SchemaRDD.scala:116)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
    at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:111)
    at org.apache.spark.scheduler.Task.run(Task.scala:51)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:183)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)

Re: Spark on Mesos 0.20

2014-10-09 Thread Fairiz Azizi
Hello,

Sorry for the late reply.

When I tried the LogQuery example this time, things now seem to be fine!

...

14/10/10 04:01:21 INFO scheduler.DAGScheduler: Stage 0 (collect at
LogQuery.scala:80) finished in 0.429 s

14/10/10 04:01:21 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 0.0,
whose tasks have all completed, from pool defa

14/10/10 04:01:21 INFO spark.SparkContext: Job finished: collect at
LogQuery.scala:80, took 12.802743914 s

(10.10.10.10,"FRED",GET http://images.com/2013/Generic.jpg HTTP/1.1)
bytes=621   n=2


Not sure if this is the correct response for that example.

Our mesos/spark builds have been updated since I last wrote.

Possibly, the JDK version was updated to 1.7.0_67

If you are using an older JDK, maybe try updating that?


- Fi



Fairiz "Fi" Azizi

On Wed, Oct 8, 2014 at 7:54 AM, RJ Nowling  wrote:

> Yep!  That's the example I was talking about.
>
> Is an error message printed when it hangs? I get :
>
> 14/09/30 13:23:14 ERROR BlockManagerMasterActor: Got two different block 
> manager registrations on 20140930-131734-1723727882-5050-1895-1
>
>
>
> On Tue, Oct 7, 2014 at 8:36 PM, Fairiz Azizi  wrote:
>
>> Sure, could you point me to the example?
>>
>> The only thing I could find was
>>
>> https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/LogQuery.scala
>>
>> So do you mean running it like:
>>MASTER="mesos://xxx:5050" ./run-example LogQuery
>>
>> I tried that and I can see the job run and the tasks complete on the
>> slave nodes, but the client process seems to hang forever, it's probably a
>> different problem. BTW, only a dozen or so tasks kick off.
>>
>> I actually haven't done much with Scala and Spark (it's been all python).
>>
>> Fi
>>
>>
>>
>> Fairiz "Fi" Azizi
>>
>> On Tue, Oct 7, 2014 at 6:29 AM, RJ Nowling  wrote:
>>
>>> I was able to reproduce it on a small 4 node cluster (1 mesos master and
>>> 3 mesos slaves) with relatively low-end specs.  As I said, I just ran the
>>> log query examples with the fine-grained mesos mode.
>>>
>>> Spark 1.1.0 and mesos 0.20.1.
>>>
>>> Fairiz, could you try running the logquery example included with Spark
>>> and see what you get?
>>>
>>> Thanks!
>>>
>>> On Mon, Oct 6, 2014 at 8:07 PM, Fairiz Azizi  wrote:
>>>
 That's what's great about Spark, the community is so active! :)

 I compiled Mesos 0.20.1 from the source tarball.

 Using the Mapr3 Spark 1.1.0 distribution from the Spark downloads page
  (spark-1.1.0-bin-mapr3.tgz).

 I see no problems for the workloads we are trying.

 However, the cluster is small (less than 100 cores across 3 nodes).

 The workloads reads in just a few gigabytes from HDFS, via an ipython
 notebook spark shell.

 thanks,
 Fi



 Fairiz "Fi" Azizi

 On Mon, Oct 6, 2014 at 9:20 AM, Timothy Chen  wrote:

> Ok I created SPARK-3817 to track this, will try to repro it as well.
>
> Tim
>
> On Mon, Oct 6, 2014 at 6:08 AM, RJ Nowling  wrote:
> > I've recently run into this issue as well. I get it from running
> Spark
> > examples such as log query.  Maybe that'll help reproduce the issue.
> >
> >
> > On Monday, October 6, 2014, Gurvinder Singh <
> gurvinder.si...@uninett.no>
> > wrote:
> >>
> >> The issue does not occur if the task at hand has small number of map
> >> tasks. I have a task which has 978 map tasks and I see this error as
> >>
> >> 14/10/06 09:34:40 ERROR BlockManagerMasterActor: Got two different
> block
> >> manager registrations on 20140711-081617-711206558-5050-2543-5
> >>
> >> Here is the log from the mesos-slave where this container was
> running.
> >>
> >> http://pastebin.com/Q1Cuzm6Q
> >>
> >> If you look for the code from where error produced by spark, you
> will
> >> see that it simply exit and saying in comments "this should never
> >> happen, lets just quit" :-)
> >>
> >> - Gurvinder
> >> On 10/06/2014 09:30 AM, Timothy Chen wrote:
> >> > (Hit enter too soon...)
> >> >
> >> > What is your setup and steps to repro this?
> >> >
> >> > Tim
> >> >
> >> > On Mon, Oct 6, 2014 at 12:30 AM, Timothy Chen 
> wrote:
> >> >> Hi Gurvinder,
> >> >>
> >> >> I tried fine grain mode before and didn't get into that problem.
> >> >>
> >> >>
> >> >> On Sun, Oct 5, 2014 at 11:44 PM, Gurvinder Singh
> >> >>  wrote:
> >> >>> On 10/06/2014 08:19 AM, Fairiz Azizi wrote:
> >>  The Spark online docs indicate that Spark is compatible with
> Mesos
> >>  0.18.1
> >> 
> >>  I've gotten it to work just fine on 0.18.1 and 0.18.2
> >> 
> >>  Has anyone tried Spark on a newer version of Mesos, i.e. Mesos
> >>  v0.20.0?
> >> 
> >>  -Fi
> >> 
> >> >>> Yeah we are using Spark 1.1.0 with Me

Re: Spark on Mesos 0.20

2014-10-09 Thread Gurvinder Singh
On 10/10/2014 06:11 AM, Fairiz Azizi wrote:
> Hello,
> 
> Sorry for the late reply.
> 
> When I tried the LogQuery example this time, things now seem to be fine!
> 
> ...
> 
> 14/10/10 04:01:21 INFO scheduler.DAGScheduler: Stage 0 (collect at
> LogQuery.scala:80) finished in 0.429 s
> 
> 14/10/10 04:01:21 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 0.0,
> whose tasks have all completed, from pool defa
> 
> 14/10/10 04:01:21 INFO spark.SparkContext: Job finished: collect at
> LogQuery.scala:80, took 12.802743914 s
> 
> (10.10.10.10,"FRED",GET http://images.com/2013/Generic.jpg HTTP/1.1)   
> bytes=621   n=2
> 
> 
> Not sure if this is the correct response for that example.
> 
> Our mesos/spark builds have since been updated since I last wrote.
> 
> Possibly, the JDK version was updated to 1.7.0_67
> 
> If you are using an older JDK, maybe try updating that?
I have tested on a current JDK 7 and am now running JDK 8; the problem
still exists. Can you run logquery on data of a size of, say, 100+ GB, so that
you have more map tasks? We start to see the issue on larger tasks.
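
If generating 100+ GB of real logs is awkward, a synthetic job with many map
tasks may also exercise the same condition; a rough sketch (the sizes and the
fake log line are illustrative, and this is not guaranteed to reproduce the
bug):

// Build an RDD with ~1000 partitions so the stage launches ~1000 map tasks.
val numTasks = 1000
val fakeLogs = sc.parallelize(1 to numTasks, numTasks).flatMap { i =>
  Iterator.fill(100000)(s"""10.10.10.$i "FRED" GET http://images.com/2013/Generic.jpg HTTP/1.1""")
}
fakeLogs.map(line => (line.split(" ")(0), 1)).reduceByKey(_ + _).count()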

- Gurvinder
> 
> 
> - Fi
> 
> 
> 
> Fairiz "Fi" Azizi
> 
> On Wed, Oct 8, 2014 at 7:54 AM, RJ Nowling  > wrote:
> 
> Yep!  That's the example I was talking about.
> 
> Is an error message printed when it hangs? I get :
> 
> 14/09/30 13:23:14 ERROR BlockManagerMasterActor: Got two different block 
> manager registrations on 20140930-131734-1723727882-5050-1895-1
> 
> 
> 
> On Tue, Oct 7, 2014 at 8:36 PM, Fairiz Azizi  > wrote:
> 
> Sure, could you point me to the example?
> 
> The only thing I could find was
> 
> https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/LogQuery.scala
> 
> So do you mean running it like:
>MASTER="mesos://xxx:5050" ./run-example LogQuery
> 
> I tried that and I can see the job run and the tasks complete on
> the slave nodes, but the client process seems to hang forever,
> it's probably a different problem. BTW, only a dozen or so tasks
> kick off.
> 
> I actually haven't done much with Scala and Spark (it's been all
> python).
> 
> Fi
> 
> 
> 
> Fairiz "Fi" Azizi
> 
> On Tue, Oct 7, 2014 at 6:29 AM, RJ Nowling  > wrote:
> 
> I was able to reproduce it on a small 4 node cluster (1
> mesos master and 3 mesos slaves) with relatively low-end
> specs.  As I said, I just ran the log query examples with
> the fine-grained mesos mode.
> 
> Spark 1.1.0 and mesos 0.20.1.
> 
> Fairiz, could you try running the logquery example included
> with Spark and see what you get?
> 
> Thanks!
> 
> On Mon, Oct 6, 2014 at 8:07 PM, Fairiz Azizi
> mailto:code...@gmail.com>> wrote:
> 
> That's what great about Spark, the community is so
> active! :)
> 
> I compiled Mesos 0.20.1 from the source tarball.
> 
> Using the Mapr3 Spark 1.1.0 distribution from the Spark
> downloads page  (spark-1.1.0-bin-mapr3.tgz).
> 
> I see no problems for the workloads we are trying. 
> 
> However, the cluster is small (less than 100 cores
> across 3 nodes).
> 
> The workloads reads in just a few gigabytes from HDFS,
> via an ipython notebook spark shell.
> 
> thanks,
> Fi
> 
> 
> 
> Fairiz "Fi" Azizi
> 
> On Mon, Oct 6, 2014 at 9:20 AM, Timothy Chen
> mailto:tnac...@gmail.com>> wrote:
> 
> Ok I created SPARK-3817 to track this, will try to
> repro it as well.
> 
> Tim
> 
> On Mon, Oct 6, 2014 at 6:08 AM, RJ Nowling
> mailto:rnowl...@gmail.com>> wrote:
> > I've recently run into this issue as well. I get
> it from running Spark
> > examples such as log query.  Maybe that'll help
> reproduce the issue.
> >
> >
> > On Monday, October 6, 2014, Gurvinder Singh
>  >
> > wrote:
> >>
> >> The issue does not occur if the task at hand has
> small number of map
> >> tasks. I have a task which has 978 map tasks and
> I see this error as
> >>
> >> 14/10/06 09:34:40 ERROR BlockManagerMasterActor:
> Got two different block
> >> manager registra