Re: Why KMeans with mllib is so slow ?

2016-03-12 Thread Xi Shen
Hi Chitturi,

Please checkout
https://spark.apache.org/docs/1.0.1/api/java/org/apache/spark/mllib/clustering/KMeans.html#setInitializationSteps(int).

I think it is caused by the initialization step. The "kmeans||" method does
not initialize the dataset in parallel; if your dataset is large, it takes a
long time to initialize. Just change the initialization mode to "random".
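
For reference, a minimal sketch of switching the initialization mode (assuming
"data" is an RDD[org.apache.spark.mllib.linalg.Vector] you already have; the k
and iteration values are placeholders):

import org.apache.spark.mllib.clustering.KMeans

// KMeans.RANDOM picks random initial centers instead of the default k-means|| init.
val model = new KMeans()
  .setK(1000)
  .setMaxIterations(20)
  .setInitializationMode(KMeans.RANDOM)
  .run(data)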

Hope it helps.


On Sun, Mar 13, 2016 at 2:58 PM Chitturi Padma 
wrote:

> Hi All,
>
>   I am facing the same issue. Taking k values from 60 to 120, incrementing
> by 10 each time (i.e. k takes values 60, 70, 80, ..., 120), the algorithm takes
> around 2.5h on an 800 MB data set with 38 dimensions.
> On Sun, Mar 29, 2015 at 9:34 AM, davidshen84 [via Apache Spark User List]
> wrote:
>
>> Hi Jao,
>>
>> Sorry to pop up this old thread. I have the same problem that you did.
>> I want to know if you have figured out how to improve k-means on Spark.
>>
>> I am using Spark 1.2.0. My data set is about 270k vectors, each has about
>> 350 dimensions. If I set k=500, the job takes about 3hrs on my cluster. The
>> cluster has 7 executors, each has 8 cores...
>>
>> If I set k=5000 which is the required value for my task, the job goes on
>> forever...
>>
>>
>> Thanks,
>> David
>>
>>
>
-- 

Regards,
David


How to implement a InputDStream like the twitter stream in Spark?

2016-08-17 Thread Xi Shen
Hi,

First, I am not sure if I should inherit from InputDStream or
ReceiverInputDStream. For ReceiverInputDStream, why would I want to run a
receiver on each worker node?

If I want to inherit InputDStream, what should I do in the compute() method?
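
For what it is worth, here is a hypothetical driver-side sketch (class and
field names are made up): compute() is called once per batch interval and
returns the RDD holding that batch's data, and no receiver runs on the workers.

import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{StreamingContext, Time}
import org.apache.spark.streaming.dstream.InputDStream

// Hypothetical example: a stream that emits the same small batch every interval.
class MyInputDStream(ssc: StreamingContext, batch: Seq[String])
    extends InputDStream[String](ssc) {

  override def start(): Unit = {}   // open connections / start polling here
  override def stop(): Unit = {}    // release resources here

  override def compute(validTime: Time): Option[RDD[String]] = {
    // Build the RDD for this batch on the driver.
    Some(ssc.sparkContext.parallelize(batch))
  }
}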

-- 


Thanks,
David S.


Hi,

2016-08-21 Thread Xi Shen
I found there are several .conf files in the conf directory, which one is
used as the default one when I click the "new" button on the notebook
homepage? I want to edit the default profile configuration so all my
notebooks are created with custom settings.

-- 


Thanks,
David S.


Question about the official binary Spark 2 package

2016-10-16 Thread Xi Shen
Hi,

I want to configure my Hive to use Spark 2 as its engine. According to
Hive's instructions, Spark should be built *without* Hadoop and without Hive. I
could build my own, but for various reasons I would prefer to use an official
binary build.

So I want to ask if the official Spark binary build labeled "with
user-provided Hadoop" also implies "user-provided Hive".

-- 


Thanks,
David S.


Re: Is spark a right tool for updating a dataframe repeatedly

2016-10-17 Thread Xi Shen
I think most of the "big data" tools, like Spark and Hive, are not designed
to edit data. They are only designed to query data. I wonder in what
scenario you need to update a large volume of data repeatedly.
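
That said, if you really need the update-until-stable pattern, a rough sketch
(assuming a SparkSession named "spark" and a hypothetical numeric column
"value") would loop over immutable DataFrames like this:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions._

// One "update" pass; every pass returns a brand new DataFrame.
def updateOnce(df: DataFrame): DataFrame =
  df.withColumn("value", when(col("value") < 0, 0).otherwise(col("value")))

var current: DataFrame = spark.table("input_data")   // hypothetical source
var changed = true
while (changed) {
  val next = updateOnce(current).cache()
  changed = next.except(current).count() > 0   // stop when a pass changes nothing
  current = next
}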


On Mon, Oct 17, 2016 at 2:00 PM Divya Gehlot 
wrote:

> If my understanding of your query is correct:
> in Spark, DataFrames are immutable; you can't update a DataFrame in place.
> You have to create a new DataFrame to "update" the current one.
>
>
> Thanks,
> Divya
>
>
> On 17 October 2016 at 09:50, Mungeol Heo  wrote:
>
> Hello, everyone.
>
> As I mentioned in the title, I wonder whether Spark is the right tool for
> updating a data frame repeatedly until there is no more data to
> update.
>
> For example.
>
> while (there was an update) {
>   update data frame A
> }
>
> If it is the right tool, then what is the best practice for this kind of
> work?
> Thank you.
>
>
>
> --


Thanks,
David S.


Re: Question about the official binary Spark 2 package

2016-10-17 Thread Xi Shen
Okay, thank you.

On Mon, Oct 17, 2016 at 5:53 PM Sean Owen  wrote:

> You can take the "with user-provided Hadoop" binary from the download
> page, and yes that should mean it does not drag in a Hive dependency of its
> own.
>
> On Mon, Oct 17, 2016 at 7:08 AM Xi Shen  wrote:
>
> Hi,
>
> I want to configure my Hive to use Spark 2 as its engine. According to
> Hive's instructions, Spark should be built *without* Hadoop and without Hive. I
> could build my own, but for various reasons I would prefer to use an official
> binary build.
>
> So I want to ask if the official Spark binary build labeled "with
> user-provided Hadoop" also implies "user-provided Hive".
>
> --
>
>
> Thanks,
> David S.
>
> --


Thanks,
David S.


Re: About Error while reading large JSON file in Spark

2016-10-18 Thread Xi Shen
It is a plain Java IO error: your line is too long. You should alter your
JSON schema so that each line is a small JSON object.

Please do not concatenate all the objects into one array and then write the
array on a single line. You will have difficulty handling such a huge JSON
array in Spark anyway.

Because the array is a single object, it cannot be split into multiple
partitions.
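
For reference, a sketch of the reading side once the data is rewritten as JSON
Lines (one object per line; the ".jsonl" path below is hypothetical):

// Each line is one small JSON object, so Hadoop can split the file into partitions.
val landingVisitor = sqlContext.read.json("s3n://hist-ngdp/lvisitor/lvisitor-01-aug.jsonl")
landingVisitor.printSchema()
println(landingVisitor.count())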


On Tue, Oct 18, 2016 at 3:44 PM Chetan Khatri 
wrote:

> Hello Community members,
>
> I am getting error while reading large JSON file in spark,
>
> *Code:*
>
> val landingVisitor =
> sqlContext.read.json("s3n://hist-ngdp/lvisitor/lvisitor-01-aug.json")
>
> *Error:*
>
> 16/10/18 07:30:30 ERROR Executor: Exception in task 8.0 in stage 0.0 (TID
> 8)
> java.io.IOException: Too many bytes before newline: 2147483648
> at org.apache.hadoop.util.LineReader.readDefaultLine(LineReader.java:249)
> at org.apache.hadoop.util.LineReader.readLine(LineReader.java:174)
> at
> org.apache.hadoop.mapred.LineRecordReader.<init>(LineRecordReader.java:135)
> at
> org.apache.hadoop.mapred.TextInputFormat.getRecordReader(TextInputFormat.java:67)
> at org.apache.spark.rdd.HadoopRDD$$anon$1.<init>(HadoopRDD.scala:237)
>
> What would be resolution for the same ?
>
> Thanks in Advance !
>
>
> --
> Yours Aye,
> Chetan Khatri.
>
> --


Thanks,
David S.


Re: Spark Random Forest training cost same time on yarn as on standalone

2016-10-20 Thread Xi Shen
If you are running locally, I do not see the point of starting with 32
executors with 2 cores each.

Also, you can check the Spark web console to find out where the time is spent.

Also, you may want to read
http://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-2/.
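
One quick sanity check (a sketch, assuming "sc" is the SparkContext) is to
confirm that the 32 executors actually registered before blaming the algorithm:

// getExecutorMemoryStatus has one entry per block manager, including the driver,
// so subtract one to get the number of registered executors.
val executorCount = sc.getExecutorMemoryStatus.size - 1
println(s"Registered executors: $executorCount")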


On Thu, Oct 20, 2016 at 6:21 PM 陈哲  wrote:

> I'm training random forest model using spark2.0 on yarn with cmd like:
> $SPARK_HOME/bin/spark-submit \
>   --class com.netease.risk.prediction.HelpMain --master yarn
> --deploy-mode client --driver-cores 1 --num-executors 32 --executor-cores 2 
> --driver-memory
> 10g --executor-memory 6g \
>   --conf spark.rpc.askTimeout=3000 --conf spark.rpc.lookupTimeout=3000
> --conf spark.rpc.message.maxSize=2000  --conf spark.driver.maxResultSize=0
> \
> 
>
> the training process cost almost 8 hours
>
> And I tried training model on local machine with master(local[4]) , the
> whole process still cost 8 - 9 hours.
>
> My question is why running on yarn doesn't save time ? is this suppose to
> be distributed, with 32 executors ? And am I missing anything or what I can
> do to improve this and save more time ?
>
> Thanks
>
> --


Thanks,
David S.


Re: spark pi example fail on yarn

2016-10-20 Thread Xi Shen
16/10/20 18:12:14 ERROR cluster.YarnClientSchedulerBackend: Yarn
application has already exited with state FINISHED!

From this, I think Spark has difficulty communicating with YARN. You
should check your Spark log.


On Fri, Oct 21, 2016 at 8:06 AM Li Li  wrote:

which log file should I

On Thu, Oct 20, 2016 at 10:02 PM, Saisai Shao 
wrote:
> Looks like ApplicationMaster is killed by SIGTERM.
>
> 16/10/20 18:12:04 ERROR yarn.ApplicationMaster: RECEIVED SIGNAL TERM
> 16/10/20 18:12:04 INFO yarn.ApplicationMaster: Final app status:
>
> This container may be killed by yarn NodeManager or other processes, you'd
> better check yarn log to dig out more details.
>
> Thanks
> Saisai
>
> On Thu, Oct 20, 2016 at 6:51 PM, Li Li  wrote:
>>
>> I am setting up a small yarn/spark cluster. hadoop/yarn version is
>> 2.7.3 and I can run wordcount map-reduce correctly in yarn.
>> And I am using  spark-2.0.1-bin-hadoop2.7 using command:
>> ~/spark-2.0.1-bin-hadoop2.7$ ./bin/spark-submit --class
>> org.apache.spark.examples.SparkPi --master yarn-client
>> examples/jars/spark-examples_2.11-2.0.1.jar 1
>> it fails and the first error is:
>> 16/10/20 18:12:03 INFO storage.BlockManagerMaster: Registered
>> BlockManager BlockManagerId(driver, 10.161.219.189, 39161)
>> 16/10/20 18:12:03 INFO handler.ContextHandler: Started
>> o.s.j.s.ServletContextHandler@76ad6715{/metrics/json,null,AVAILABLE}
>> 16/10/20 18:12:12 INFO
>> cluster.YarnSchedulerBackend$YarnSchedulerEndpoint: ApplicationMaster
>> registered as NettyRpcEndpointRef(null)
>> 16/10/20 18:12:12 INFO cluster.YarnClientSchedulerBackend: Add WebUI
>> Filter. org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter,
>> Map(PROXY_HOSTS -> ai-hz1-spark1, PROXY_URI_BASES ->
>> http://ai-hz1-spark1:8088/proxy/application_1476957324184_0002),
>> /proxy/application_1476957324184_0002
>> 16/10/20 18:12:12 INFO ui.JettyUtils: Adding filter:
>> org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter
>> 16/10/20 18:12:12 INFO cluster.YarnClientSchedulerBackend:
>> SchedulerBackend is ready for scheduling beginning after waiting
>> maxRegisteredResourcesWaitingTime: 3(ms)
>> 16/10/20 18:12:12 WARN spark.SparkContext: Use an existing
>> SparkContext, some configuration may not take effect.
>> 16/10/20 18:12:12 INFO handler.ContextHandler: Started
>> o.s.j.s.ServletContextHandler@489091bd{/SQL,null,AVAILABLE}
>> 16/10/20 18:12:12 INFO handler.ContextHandler: Started
>> o.s.j.s.ServletContextHandler@1de9b505{/SQL/json,null,AVAILABLE}
>> 16/10/20 18:12:12 INFO handler.ContextHandler: Started
>> o.s.j.s.ServletContextHandler@378f002a{/SQL/execution,null,AVAILABLE}
>> 16/10/20 18:12:12 INFO handler.ContextHandler: Started
>> o.s.j.s.ServletContextHandler@2cc75074
{/SQL/execution/json,null,AVAILABLE}
>> 16/10/20 18:12:12 INFO handler.ContextHandler: Started
>> o.s.j.s.ServletContextHandler@2d64160c{/static/sql,null,AVAILABLE}
>> 16/10/20 18:12:12 INFO internal.SharedState: Warehouse path is
>> '/home/hadoop/spark-2.0.1-bin-hadoop2.7/spark-warehouse'.
>> 16/10/20 18:12:13 INFO spark.SparkContext: Starting job: reduce at
>> SparkPi.scala:38
>> 16/10/20 18:12:13 INFO scheduler.DAGScheduler: Got job 0 (reduce at
>> SparkPi.scala:38) with 1 output partitions
>> 16/10/20 18:12:13 INFO scheduler.DAGScheduler: Final stage:
>> ResultStage 0 (reduce at SparkPi.scala:38)
>> 16/10/20 18:12:13 INFO scheduler.DAGScheduler: Parents of final stage:
>> List()
>> 16/10/20 18:12:13 INFO scheduler.DAGScheduler: Missing parents: List()
>> 16/10/20 18:12:13 INFO scheduler.DAGScheduler: Submitting ResultStage
>> 0 (MapPartitionsRDD[1] at map at SparkPi.scala:34), which has no
>> missing parents
>> 16/10/20 18:12:13 INFO memory.MemoryStore: Block broadcast_0 stored as
>> values in memory (estimated size 1832.0 B, free 366.3 MB)
>> 16/10/20 18:12:13 INFO memory.MemoryStore: Block broadcast_0_piece0
>> stored as bytes in memory (estimated size 1169.0 B, free 366.3 MB)
>> 16/10/20 18:12:13 INFO storage.BlockManagerInfo: Added
>> broadcast_0_piece0 in memory on 10.161.219.189:39161 (size: 1169.0 B,
>> free: 366.3 MB)
>> 16/10/20 18:12:13 INFO spark.SparkContext: Created broadcast 0 from
>> broadcast at DAGScheduler.scala:1012
>> 16/10/20 18:12:13 INFO scheduler.DAGScheduler: Submitting 1
>> missing tasks from ResultStage 0 (MapPartitionsRDD[1] at map at
>> SparkPi.scala:34)
>> 16/10/20 18:12:13 INFO cluster.YarnScheduler: Adding task set 0.0 with
>> 1 tasks
>> 16/10/20 18:12:14 ERROR cluster.YarnClientSchedulerBackend: Yarn
>> application has already exited with state FINISHED!
>> 16/10/20 18:12:14 INFO server.ServerConnector: Stopped
>> ServerConnector@389adf1d{HTTP/1.1}{0.0.0.0:4040}
>> 16/10/20 18:12:14 INFO handler.ContextHandler: Stopped
>> o.s.j.s.ServletContextHandler@841e575
{/stages/stage/kill,null,UNAVAILABLE}
>> 16/10/20 18:12:14 INFO handler.ContextHandler: Stopped
>> o.s.j.s.ServletContextHandler@66629f63{/api,null,UNAVAILABLE}
>> 16/10/20 18:12:14 INFO handler.ContextHandler: S

Re: spark pi example fail on yarn

2016-10-20 Thread Xi Shen
I see, I had this issue before. I think you are using Java 8, right? The
Java 8 JVM reserves more virtual memory at startup.

Turning off the memory check is an unsafe way to avoid this issue. I think
it is better to increase the memory ratio, like this:

<property>
  <name>yarn.nodemanager.vmem-pmem-ratio</name>
  <value>3.15</value>
</property>


On Fri, Oct 21, 2016 at 11:15 AM Li Li  wrote:

I modified yarn-site.xml yarn.nodemanager.vmem-check-enabled to false
and it works for yarn-client and spark-shell

On Fri, Oct 21, 2016 at 10:59 AM, Li Li  wrote:
> I found a warning in the nodemanager log. Is the virtual memory exceeded? How
> should I configure yarn to solve this problem?
>
> 2016-10-21 10:41:12,588 INFO
>
org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl:
> Memory usage of ProcessTree 20299 for container-id
> container_1477017445921_0001_02_01: 335.1 MB of 1 GB physical
> memory used; 2.2 GB of 2.1 GB virtual memory used
> 2016-10-21 10:41:12,589 WARN
>
org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl:
> Process tree for container: container_1477017445921_0001_02_01 has
> processes older than 1 iteration running over the configured limit.
> Limit=2254857728, current usage = 2338873344
>
> On Fri, Oct 21, 2016 at 8:49 AM, Saisai Shao 
wrote:
>> It is not that Spark has difficulty communicating with YARN; it simply means
>> the AM exited with the FINISHED state.
>>
>> I'm guessing it might be related to memory constraints for the container;
>> please check the yarn RM and NM logs to find out more details.
>>
>> Thanks
>> Saisai
>>
>> On Fri, Oct 21, 2016 at 8:14 AM, Xi Shen  wrote:
>>>
>>> 16/10/20 18:12:14 ERROR cluster.YarnClientSchedulerBackend: Yarn
>>> application has already exited with state FINISHED!
>>>
>>>  From this, I think it is spark has difficult communicating with YARN.
You
>>> should check your Spark log.
>>>
>>>
>>> On Fri, Oct 21, 2016 at 8:06 AM Li Li  wrote:
>>>>
>>>> which log file should I
>>>>
>>>> On Thu, Oct 20, 2016 at 10:02 PM, Saisai Shao 
>>>> wrote:
>>>> > Looks like ApplicationMaster is killed by SIGTERM.
>>>> >
>>>> > 16/10/20 18:12:04 ERROR yarn.ApplicationMaster: RECEIVED SIGNAL TERM
>>>> > 16/10/20 18:12:04 INFO yarn.ApplicationMaster: Final app status:
>>>> >
>>>> > This container may be killed by yarn NodeManager or other processes,
>>>> > you'd
>>>> > better check yarn log to dig out more details.
>>>> >
>>>> > Thanks
>>>> > Saisai
>>>> >
>>>> > On Thu, Oct 20, 2016 at 6:51 PM, Li Li  wrote:
>>>> >>
>>>> >> I am setting up a small yarn/spark cluster. hadoop/yarn version is
>>>> >> 2.7.3 and I can run wordcount map-reduce correctly in yarn.
>>>> >> And I am using  spark-2.0.1-bin-hadoop2.7 using command:
>>>> >> ~/spark-2.0.1-bin-hadoop2.7$ ./bin/spark-submit --class
>>>> >> org.apache.spark.examples.SparkPi --master yarn-client
>>>> >> examples/jars/spark-examples_2.11-2.0.1.jar 1
>>>> >> it fails and the first error is:
>>>> >> 16/10/20 18:12:03 INFO storage.BlockManagerMaster: Registered
>>>> >> BlockManager BlockManagerId(driver, 10.161.219.189, 39161)
>>>> >> 16/10/20 18:12:03 INFO handler.ContextHandler: Started
>>>> >> o.s.j.s.ServletContextHandler@76ad6715{/metrics/json,null,AVAILABLE}
>>>> >> 16/10/20 18:12:12 INFO
>>>> >> cluster.YarnSchedulerBackend$YarnSchedulerEndpoint:
ApplicationMaster
>>>> >> registered as NettyRpcEndpointRef(null)
>>>> >> 16/10/20 18:12:12 INFO cluster.YarnClientSchedulerBackend: Add WebUI
>>>> >> Filter. org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter,
>>>> >> Map(PROXY_HOSTS -> ai-hz1-spark1, PROXY_URI_BASES ->
>>>> >> http://ai-hz1-spark1:8088/proxy/application_1476957324184_0002),
>>>> >> /proxy/application_1476957324184_0002
>>>> >> 16/10/20 18:12:12 INFO ui.JettyUtils: Adding filter:
>>>> >> org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter
>>>> >> 16/10/20 18:12:12 INFO cluster.YarnClientSchedulerBackend:
>>>> >> SchedulerBackend is ready for scheduling beginning after waiting
>>>> >> maxRegisteredResourcesWaitingTime: 3(ms)
>>>> >> 16/10/20 18:12:12 WARN

Fwd: How to start spark-shell with YARN?

2015-02-23 Thread Xi Shen
Hi,

I followed this guide,
http://spark.apache.org/docs/1.2.1/running-on-yarn.html, and tried to start
spark-shell with yarn-client

./bin/spark-shell --master yarn-client


But I got

WARN ReliableDeliverySupervisor: Association with remote system
[akka.tcp://sparkYarnAM@10.0.2.15:38171] has failed, address is now
gated for [5000] ms. Reason is: [Disassociated].

In the spark-shell, and other exceptions in the yarn log. Please see
http://stackoverflow.com/questions/28671171/spark-shell-cannot-connect-to-yarn
for more detail.


However, submitting jobs to this cluster works. Also, spark-shell in
standalone mode works.


My system:

- ubuntu amd64
- spark 1.2.1
- yarn from hadoop 2.6 stable


Thanks,

Xi Shen
about.me/davidshen


Re: How to start spark-shell with YARN?

2015-02-24 Thread Xi Shen
Hi Arush,

I got the pre-built package from https://spark.apache.org/downloads.html. When I
start spark-shell, it prompts:

Spark assembly has been built with Hive, including Datanucleus jars on
classpath

So don't we have a pre-built package with YARN support? If so, how does
spark-submit work? I checked the YARN log, and the job is really submitted and
runs successfully.


Thanks,
David





On Tue Feb 24 2015 at 6:35:38 PM Arush Kharbanda 
wrote:

> Hi
>
> Are you sure that you built Spark for Yarn.If standalone works, not sure
> if its build for Yarn.
>
> Thanks
> Arush
> On Tue, Feb 24, 2015 at 12:06 PM, Xi Shen  wrote:
>
>> Hi,
>>
>> I followed this guide,
>> http://spark.apache.org/docs/1.2.1/running-on-yarn.html, and tried to
>> start spark-shell with yarn-client
>>
>> ./bin/spark-shell --master yarn-client
>>
>>
>> But I got
>>
>> WARN ReliableDeliverySupervisor: Association with remote system 
>> [akka.tcp://sparkYarnAM@10.0.2.15:38171] has failed, address is now gated 
>> for [5000] ms. Reason is: [Disassociated].
>>
>> In the spark-shell, and other exceptions in they yarn log. Please see
>> http://stackoverflow.com/questions/28671171/spark-shell-cannot-connect-to-yarn
>> for more detail.
>>
>>
>> However, submitting to the this cluster works. Also, spark-shell as
>> standalone works.
>>
>>
>> My system:
>>
>> - ubuntu amd64
>> - spark 1.2.1
>> - yarn from hadoop 2.6 stable
>>
>>
>> Thanks,
>>
>> Xi Shen
>> about.me/davidshen
>>
>>
> --
>
> [image: Sigmoid Analytics] <http://htmlsig.com/www.sigmoidanalytics.com>
>
> *Arush Kharbanda* || Technical Teamlead
>
> ar...@sigmoidanalytics.com || www.sigmoidanalytics.com
>


Re: How to start spark-shell with YARN?

2015-02-24 Thread Xi Shen
Hi Sean,

I launched the spark-shell on the same machine where I started the YARN
service. I don't think a port will be the issue.

I am new to Spark. I checked the HDFS web UI and the YARN web UI, but I
don't know how to check the AM. Can you help?


Thanks,
David


On Tue, Feb 24, 2015 at 8:37 PM Sean Owen  wrote:

> I don't think the build is at issue. The error suggests your App Master
> can't be contacted. Is there a network port issue? did the AM fail?
>
> On Tue, Feb 24, 2015 at 9:15 AM, Xi Shen  wrote:
>
>> Hi Arush,
>>
>> I got the pre-build from https://spark.apache.org/downloads.html. When I
>> start spark-shell, it prompts:
>>
>> Spark assembly has been built with Hive, including Datanucleus jars
>> on classpath
>>
>> So we don't have pre-build with YARN support? If so, how the spark-submit
>> work? I checked the YARN log, and job is really submitted and ran
>> successfully.
>>
>>
>> Thanks,
>> David
>>
>>
>>
>>
>>
>> On Tue Feb 24 2015 at 6:35:38 PM Arush Kharbanda <
>> ar...@sigmoidanalytics.com> wrote:
>>
>>> Hi
>>>
>>> Are you sure that you built Spark for Yarn.If standalone works, not sure
>>> if its build for Yarn.
>>>
>>> Thanks
>>> Arush
>>> On Tue, Feb 24, 2015 at 12:06 PM, Xi Shen  wrote:
>>>
>>>> Hi,
>>>>
>>>> I followed this guide,
>>>> http://spark.apache.org/docs/1.2.1/running-on-yarn.html, and tried to
>>>> start spark-shell with yarn-client
>>>>
>>>> ./bin/spark-shell --master yarn-client
>>>>
>>>>
>>>> But I got
>>>>
>>>> WARN ReliableDeliverySupervisor: Association with remote system 
>>>> [akka.tcp://sparkYarnAM@10.0.2.15:38171] has failed, address is now gated 
>>>> for [5000] ms. Reason is: [Disassociated].
>>>>
>>>> In the spark-shell, and other exceptions in they yarn log. Please see
>>>> http://stackoverflow.com/questions/28671171/spark-shell-cannot-connect-to-yarn
>>>> for more detail.
>>>>
>>>>
>>>> However, submitting to the this cluster works. Also, spark-shell as
>>>> standalone works.
>>>>
>>>>
>>>> My system:
>>>>
>>>> - ubuntu amd64
>>>> - spark 1.2.1
>>>> - yarn from hadoop 2.6 stable
>>>>
>>>>
>>>> Thanks,
>>>>
>>>> Xi Shen
>>>> about.me/davidshen
>>>>
>>>>
>>> --
>>>
>>> [image: Sigmoid Analytics] <http://htmlsig.com/www.sigmoidanalytics.com>
>>>
>>> *Arush Kharbanda* || Technical Teamlead
>>>
>>> ar...@sigmoidanalytics.com || www.sigmoidanalytics.com
>>>
>>
>


spark-shell --master yarn-client fail on Windows

2015-03-05 Thread Xi Shen
Hi,

My HDFS and YARN services are started, and my spark-shell works in local
mode.

But when I try spark-shell --master yarn-client, a job is created on the
YARN service, but it fails very soon. The Diagnostics are:

Application application_1425559747310_0002 failed 2 times due to AM
Container for appattempt_1425559747310_0002_02 exited with exitCode: 1
For more detailed output, check application tracking page:
http://Xi-Laptop:8088/proxy/application_1425559747310_0002/Then, click on
links to logs of each attempt.
Diagnostics: Exception from container-launch.
Container id: container_1425559747310_0002_02_01
Exit code: 1
Stack trace: ExitCodeException exitCode=1:
at org.apache.hadoop.util.Shell.runCommand(Shell.java:538)
at org.apache.hadoop.util.Shell.run(Shell.java:455)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:715)
at
org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:211)
at
org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
at
org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Shell output: 1 file(s) moved.
Container exited with a non-zero exit code 1
Failing this attempt. Failing the application.

And in the AM log, there is something like:

Could not find or load main class
'-Dspark.driver.appUIAddress=http:..:4040'

And it changes from time to time.

It feels like something is not right in YARN.


Thanks,
David


Spark code development practice

2015-03-05 Thread Xi Shen
Hi,

I am new to Spark. I see every Spark program has a main() function. I
wonder if I can run the Spark program directly, without using spark-submit.
I think it will be easier for early development and debugging.


Thanks,
David


Re: Spark code development practice

2015-03-05 Thread Xi Shen
Thanks guys, this is very useful :)

@Stephen, I know spark-shell will create a SC for me. But I don't
understand why we still need to do "new SparkContext(...)" in our code.
Shouldn't we get it from somewhere, e.g. "SparkContext.get"?

Another question: if I want my Spark code to run on YARN later, how should
I create the SparkContext? Or can I just specify "--master yarn" on the
command line?
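
For what it is worth, a minimal sketch of what I mean (the app and object
names are made up): a plain main() that falls back to a local master so it can
run straight from the IDE, while a later spark-submit --master yarn run would
set the master itself.

import org.apache.spark.{SparkConf, SparkContext}

object MyApp {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("my-app")
    // Only fall back to a local master when nothing set it already;
    // spark-submit --master yarn will have set "spark.master" for us.
    if (!conf.contains("spark.master")) conf.setMaster("local[*]")

    val sc = new SparkContext(conf)
    try {
      println(sc.parallelize(1 to 100).sum())
    } finally {
      sc.stop()
    }
  }
}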


Thanks,
David


On Fri, Mar 6, 2015 at 12:38 PM Koen Vantomme 
wrote:

> use the spark-shell command and the shell will open
> type :paste abd then paste your code, after control-d
>
> open spark-shell:
> sparks/bin
> ./spark-shell
>
> Verstuurd vanaf mijn iPhone
>
> Op 6-mrt.-2015 om 02:28 heeft "fightf...@163.com"  het
> volgende geschreven:
>
> Hi,
>
> You can first establish a scala ide to develop and debug your spark
> program, lets say, intellij idea or eclipse.
>
> Thanks,
> Sun.
>
> --
> fightf...@163.com
>
>
> *From:* Xi Shen 
> *Date:* 2015-03-06 09:19
> *To:* user@spark.apache.org
> *Subject:* Spark code development practice
> Hi,
>
> I am new to Spark. I see every spark program has a main() function. I
> wonder if I can run the spark program directly, without using spark-submit.
> I think it will be easier for early development and debug.
>
>
> Thanks,
> David
>
>


How to reuse a ML trained model?

2015-03-07 Thread Xi Shen
Hi,

I checked a few ML algorithms in MLLib.

https://spark.apache.org/docs/0.8.1/api/mllib/index.html#org.apache.spark.mllib.classification.LogisticRegressionModel

I could not find a way to save the trained model. Does this mean I have to
train my model every time? Is there a more economical way to do this?

I am thinking about something like:

model.run(...)
model.save("hdfs://path/to/hdfs")

Then, next I can do:

val model = Model.createFrom("hdfs://...")
model.predict(vector)

I am new to Spark; maybe there are other ways to persist the model?


Thanks,
David


Re: How to reuse a ML trained model?

2015-03-07 Thread Xi Shen
Ah~it is serializable. Thanks!


On Sat, Mar 7, 2015 at 10:59 PM Ekrem Aksoy  wrote:

> You can serialize your trained model to persist somewhere.
>
> Ekrem Aksoy
>
> On Sat, Mar 7, 2015 at 12:10 PM, Xi Shen  wrote:
>
>> Hi,
>>
>> I checked a few ML algorithms in MLLib.
>>
>>
>> https://spark.apache.org/docs/0.8.1/api/mllib/index.html#org.apache.spark.mllib.classification.LogisticRegressionModel
>>
>> I could not find a way to save the trained model. Does this means I have
>> to train my model every time? Is there a more economic way to do this?
>>
>> I am thinking about something like:
>>
>> model.run(...)
>> model.save("hdfs://path/to/hdfs")
>>
>> Then, next I can do:
>>
>> val model = Model.createFrom("hdfs://...")
>> model.predict(vector)
>>
>> I am new to spark, maybe there are other ways to persistent the model?
>>
>>
>> Thanks,
>> David
>>
>>
>


Re: How to reuse a ML trained model?

2015-03-07 Thread Xi Shen
Wait... it seems SparkContext does not provide a way to save/load object
files. It can only save/load RDDs. What did I miss here?


Thanks,
David


On Sat, Mar 7, 2015 at 11:05 PM Xi Shen  wrote:

> Ah~it is serializable. Thanks!
>
>
> On Sat, Mar 7, 2015 at 10:59 PM Ekrem Aksoy  wrote:
>
>> You can serialize your trained model to persist somewhere.
>>
>> Ekrem Aksoy
>>
>> On Sat, Mar 7, 2015 at 12:10 PM, Xi Shen  wrote:
>>
>>> Hi,
>>>
>>> I checked a few ML algorithms in MLLib.
>>>
>>> https://spark.apache.org/docs/0.8.1/api/mllib/index.html#
>>> org.apache.spark.mllib.classification.LogisticRegressionModel
>>>
>>> I could not find a way to save the trained model. Does this means I have
>>> to train my model every time? Is there a more economic way to do this?
>>>
>>> I am thinking about something like:
>>>
>>> model.run(...)
>>> model.save("hdfs://path/to/hdfs")
>>>
>>> Then, next I can do:
>>>
>>> val model = Model.createFrom("hdfs://...")
>>> model.predict(vector)
>>>
>>> I am new to spark, maybe there are other ways to persistent the model?
>>>
>>>
>>> Thanks,
>>> David
>>>
>>>
>>


Re: How to reuse a ML trained model?

2015-03-08 Thread Xi Shen
Errr... do you have any suggestions for me before the 1.3 release?

I can't believe there's no ML model serialization method in Spark. Training
the models is quite expensive, isn't it?
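
Before the 1.3 release, one workaround sketch (not an official API; the HDFS
path is hypothetical) is to persist the pieces the model is built from and
rebuild it later, e.g. for a LogisticRegressionModel:

import org.apache.spark.mllib.classification.LogisticRegressionModel
import org.apache.spark.mllib.linalg.Vector

// Save the weights and intercept with Java serialization via saveAsObjectFile.
sc.parallelize(Seq((model.weights, model.intercept)), 1)
  .saveAsObjectFile("hdfs:///models/lr")

// Later: read them back and reconstruct an equivalent model.
val (weights, intercept) = sc.objectFile[(Vector, Double)]("hdfs:///models/lr").first()
val restored = new LogisticRegressionModel(weights, intercept)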


Thanks,
David


On Sun, Mar 8, 2015 at 5:14 AM Burak Yavuz  wrote:

> Hi,
>
> There is model import/export for some of the ML algorithms on the current
> master (and they'll be shipped with the 1.3 release).
>
> Burak
> On Mar 7, 2015 4:17 AM, "Xi Shen"  wrote:
>
>> Wait...it seem SparkContext does not provide a way to save/load object
>> files. It can only save/load RDD. What do I missed here?
>>
>>
>> Thanks,
>> David
>>
>>
>> On Sat, Mar 7, 2015 at 11:05 PM Xi Shen  wrote:
>>
>>> Ah~it is serializable. Thanks!
>>>
>>>
>>> On Sat, Mar 7, 2015 at 10:59 PM Ekrem Aksoy 
>>> wrote:
>>>
>>>> You can serialize your trained model to persist somewhere.
>>>>
>>>> Ekrem Aksoy
>>>>
>>>> On Sat, Mar 7, 2015 at 12:10 PM, Xi Shen  wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> I checked a few ML algorithms in MLLib.
>>>>>
>>>>> https://spark.apache.org/docs/0.8.1/api/mllib/index.html#
>>>>> org.apache.spark.mllib.classification.LogisticRegressionModel
>>>>>
>>>>> I could not find a way to save the trained model. Does this means I
>>>>> have to train my model every time? Is there a more economic way to do 
>>>>> this?
>>>>>
>>>>> I am thinking about something like:
>>>>>
>>>>> model.run(...)
>>>>> model.save("hdfs://path/to/hdfs")
>>>>>
>>>>> Then, next I can do:
>>>>>
>>>>> val model = Model.createFrom("hdfs://...")
>>>>> model.predict(vector)
>>>>>
>>>>> I am new to spark, maybe there are other ways to persistent the model?
>>>>>
>>>>>
>>>>> Thanks,
>>>>> David
>>>>>
>>>>>
>>>>


How to use the TF-IDF model?

2015-03-08 Thread Xi Shen
Hi,

I read this page,
http://spark.apache.org/docs/1.2.0/mllib-feature-extraction.html. But I am
wondering, how to use this TF-IDF RDD? What is this TF-IDF vector looks
like?

Can someone provide me some guide?


Thanks,


Xi Shen
about.me/davidshen


How to load my ML model?

2015-03-08 Thread Xi Shen
Hi,

I used the method described on this page,
http://databricks.gitbooks.io/databricks-spark-reference-applications/content/twitter_classifier/train.html,
to save my k-means model.

But now, I have no idea how to load it back...I tried

sc.objectFile("/path/to/data/file/directory/")


But I got this error:

org.apache.spark.SparkDriverExecutionException: Execution error
at
org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:997)
at
org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1417)
at akka.actor.Actor$class.aroundReceive(Actor.scala:465)
at
org.apache.spark.scheduler.DAGSchedulerEventProcessActor.aroundReceive(DAGScheduler.scala:1375)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
at akka.actor.ActorCell.invoke(ActorCell.scala:487)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238)
at akka.dispatch.Mailbox.run(Mailbox.scala:220)
at
akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:393)
at
scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at
scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at
scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at
scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
Caused by: java.lang.ArrayStoreException: [Ljava.lang.Object;
at scala.runtime.ScalaRunTime$.array_update(ScalaRunTime.scala:88)
at
org.apache.spark.SparkContext$$anonfun$runJob$3.apply(SparkContext.scala:1339)
at
org.apache.spark.SparkContext$$anonfun$runJob$3.apply(SparkContext.scala:1339)
at
org.apache.spark.scheduler.JobWaiter.taskSucceeded(JobWaiter.scala:56)
at
org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:993)
... 12 more

Any suggestions?
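
In case it helps: if the directory holds the cluster centers written with
saveAsObjectFile (as in that reference app), a sketch of the load side would be:

import org.apache.spark.mllib.clustering.KMeansModel
import org.apache.spark.mllib.linalg.Vector

// Read the saved cluster centers back; the explicit [Vector] type parameter matters,
// an untyped sc.objectFile(...) call can end in exactly this kind of runtime error.
val centers = sc.objectFile[Vector]("/path/to/data/file/directory/").collect()
val model = new KMeansModel(centers)

// model.predict(v) can now assign a cluster to any mllib Vector "v".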


Thanks,

Xi Shen
about.me/davidshen


How to do sparse vector product in Spark?

2015-03-13 Thread Xi Shen
Hi,

I have two RDD[Vector]; both Vectors are sparse and of the form:

(id, value)

"id" indicates the position of the value in the vector space. I want to
apply a dot product to two such RDD[Vector] and get a scalar value. The
non-existent values are treated as zero.

Any convenient tool to do this in Spark?
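
In case there is no built-in helper for this representation, a minimal sketch
of doing it by hand (assuming each vector is an RDD[(Long, Double)] keyed by
coordinate):

import org.apache.spark.SparkContext._
import org.apache.spark.rdd.RDD

// Coordinates missing from either side are implicitly zero, so only the ids
// present in both RDDs contribute to the dot product.
def sparseDot(a: RDD[(Long, Double)], b: RDD[(Long, Double)]): Double =
  a.join(b)
   .map { case (_, (x, y)) => x * y }
   .sum()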


Thanks,
David


Please help me understand TF-IDF Vector structure

2015-03-14 Thread Xi Shen
Hi,

I read this document,
http://spark.apache.org/docs/1.2.1/mllib-feature-extraction.html, and tried
to build a TF-IDF model of my documents.

I have a list of documents; each word is represented as an Int, and each
document is listed on one line:

doc_name, int1, int2...
doc_name, int3, int4...

This is how I load my documents:
val documents: RDD[Seq[Int]] = sc.objectFile[(String,
Seq[Int])](s"$sparkStore/documents") map (_._2) cache()

Then I did:

val hashingTF = new HashingTF()
val tf: RDD[Vector] = hashingTF.transform(documents)
val idf = new IDF().fit(tf)
val tfidf = idf.transform(tf)

I write the tfidf model to a text file and try to understand the structure.
FileUtils.writeLines(new File("tfidf.out"),
tfidf.collect().toList.asJavaCollection)

What I get is something like:

(1048576,[0,4,7,8,10,13,17,21],[...some float numbers...])
...

I think it is a tuple with 3 elements.

   - I have no idea what the 1st element is...
   - I think the 2nd element is a list of the words
   - I think the 3rd element is a list of the tf-idf values of the words in the
   previous list

Please help me understand this structure.


Thanks,
David


Re: Please help me understand TF-IDF Vector structure

2015-03-14 Thread Xi Shen
Hey, I worked it out myself :)

The "Vector" is actually a "SparseVector", so when it is written into a
string, the format is

(size, [indices...], [values...])


Simple!
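
A quick way to inspect one of the vectors (a sketch, assuming the transform
produced sparse vectors):

import org.apache.spark.mllib.linalg.SparseVector

tfidf.take(1).foreach {
  case sv: SparseVector =>
    println(s"hash space size: ${sv.size}")                       // e.g. 1048576
    println(s"term indices:   ${sv.indices.take(10).mkString(", ")}")
    println(s"tf-idf values:  ${sv.values.take(10).mkString(", ")}")
  case other =>
    println(s"dense vector of size ${other.size}")
}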


On Sat, Mar 14, 2015 at 6:05 PM Xi Shen  wrote:

> Hi,
>
> I read this document,
> http://spark.apache.org/docs/1.2.1/mllib-feature-extraction.html, and
> tried to build a TF-IDF model of my documents.
>
> I have a list of documents, each word is represented as a Int, and each
> document is listed in one line.
>
> doc_name, int1, int2...
> doc_name, int3, int4...
>
> This is how I load my documents:
> val documents: RDD[Seq[Int]] = sc.objectFile[(String,
> Seq[Int])](s"$sparkStore/documents") map (_._2) cache()
>
> Then I did:
>
> val hashingTF = new HashingTF()
> val tf: RDD[Vector] = hashingTF.transform(documents)
> val idf = new IDF().fit(tf)
> val tfidf = idf.transform(tf)
>
> I write the tfidf model to a text file and try to understand the structure.
> FileUtils.writeLines(new File("tfidf.out"),
> tfidf.collect().toList.asJavaCollection)
>
> What I is something like:
>
> (1048576,[0,4,7,8,10,13,17,21],[...some float numbers...])
> ...
>
> I think it s a tuple with 3 element.
>
>- I have no idea what the 1st element is...
>- I think the 2nd element is a list of the word
>- I think the 3rd element is a list of tf-idf value of the words in
>the previous list
>
> Please help me understand this structure.
>
>
> Thanks,
> David
>
>
>
>


k-means hang without error/warning

2015-03-15 Thread Xi Shen
Hi,

I am running k-means using Spark in local mode. My data set is about 30k
records, and I set the k = 1000.

The algorithm started and finished 13 jobs according to the UI monitor, then
it stopped working.

The last log I saw was:

[Spark Context Cleaner] INFO org.apache.spark.ContextCleaner - Cleaned
broadcast *16*

There are many similar log lines repeated, but it seems it always stops at the 16th.

If I lower the *k* value, the algorithm terminates. So I
just want to know what's wrong with *k=1000*.


Thanks,
David


How to set Spark executor memory?

2015-03-16 Thread Xi Shen
Hi,

I have set spark.executor.memory to 2048m, and on the UI "Environment"
page I can see this value has been set correctly. But on the "Executors"
page I see there is only 1 executor and its memory is 265.4MB. Very strange
value. Why not 256MB, or just what I set?

What am I missing here?


Thanks,
David


Re: How to set Spark executor memory?

2015-03-16 Thread Xi Shen
I set it in code, not by configuration. I submit my jar file locally; I am
working in my development environment.

On Mon, 16 Mar 2015 18:28 Akhil Das  wrote:

> How are you setting it? and how are you submitting the job?
>
> Thanks
> Best Regards
>
> On Mon, Mar 16, 2015 at 12:52 PM, Xi Shen  wrote:
>
>> Hi,
>>
>> I have set spark.executor.memory to 2048m, and in the UI "Environment"
>> page, I can see this value has been set correctly. But in the "Executors"
>> page, I saw there's only 1 executor and its memory is 265.4MB. Very strange
>> value. why not 256MB, or just as what I set?
>>
>> What am I missing here?
>>
>>
>> Thanks,
>> David
>>
>>
>


Re: k-means hang without error/warning

2015-03-16 Thread Xi Shen
I used "local[*]". The CPU hits about 80% when there are active jobs, then
it drops to about 13% and hangs for a very long time.

Thanks,
David

On Mon, 16 Mar 2015 17:46 Akhil Das  wrote:

> How many threads are you allocating while creating the sparkContext? like
> local[4] will allocate 4 threads. You can try increasing it to a higher
> number also try setting level of parallelism to a higher number.
>
> Thanks
> Best Regards
>
> On Mon, Mar 16, 2015 at 9:55 AM, Xi Shen  wrote:
>
>> Hi,
>>
>> I am running k-means using Spark in local mode. My data set is about 30k
>> records, and I set the k = 1000.
>>
>> The algorithm starts and finished 13 jobs according to the UI monitor,
>> then it stopped working.
>>
>> The last log I saw was:
>>
>> [Spark Context Cleaner] INFO org.apache.spark.ContextCleaner - Cleaned
>> broadcast *16*
>>
>> There're many similar log repeated, but it seems it always stop at the
>> 16th.
>>
>> If I try to low down the *k* value, the algorithm will terminated. So I
>> just want to know what's wrong with *k=1000*.
>>
>>
>> Thanks,
>> David
>>
>>
>


Re: How to set Spark executor memory?

2015-03-16 Thread Xi Shen
Hi Akhil,

Yes, you are right. If I run the program from the IDE as a normal Java
program, the executor's memory is increased... but not to 2048m, it is set to
6.7GB... Looks like there's some formula to calculate this value.


Thanks,
David


On Mon, Mar 16, 2015 at 7:36 PM Akhil Das 
wrote:

> By default spark.executor.memory is set to 512m, I'm assuming since you
> are submiting the job using spark-submit and it is not able to override the
> value since you are running in local mode. Can you try it without using
> spark-submit as a standalone project?
>
> Thanks
> Best Regards
>
> On Mon, Mar 16, 2015 at 1:52 PM, Xi Shen  wrote:
>
>> I set it in code, not by configuration. I submit my jar file to local. I
>> am working in my developer environment.
>>
>> On Mon, 16 Mar 2015 18:28 Akhil Das  wrote:
>>
>>> How are you setting it? and how are you submitting the job?
>>>
>>> Thanks
>>> Best Regards
>>>
>>> On Mon, Mar 16, 2015 at 12:52 PM, Xi Shen  wrote:
>>>
>>>> Hi,
>>>>
>>>> I have set spark.executor.memory to 2048m, and in the UI "Environment"
>>>> page, I can see this value has been set correctly. But in the "Executors"
>>>> page, I saw there's only 1 executor and its memory is 265.4MB. Very strange
>>>> value. why not 256MB, or just as what I set?
>>>>
>>>> What am I missing here?
>>>>
>>>>
>>>> Thanks,
>>>> David
>>>>
>>>>
>>>
>


Re: How to set Spark executor memory?

2015-03-16 Thread Xi Shen
I set "spark.executor.memory" to "2048m". If the executor storage memory is
0.6 of executor memory, it should be 2g * 0.6 = 1.2g.

My machine has 56GB memory, and 0.6 of that should be 33.6G...I hate math xD


On Mon, Mar 16, 2015 at 7:59 PM Akhil Das 
wrote:

> How much memory are you having on your machine? I think default value is
> 0.6 of the spark.executor.memory as you can see from here
> <http://spark.apache.org/docs/1.2.1/configuration.html#execution-behavior>
> .
>
> Thanks
> Best Regards
>
> On Mon, Mar 16, 2015 at 2:26 PM, Xi Shen  wrote:
>
>> Hi Akhil,
>>
>> Yes, you are right. If I ran the program from IDE as a normal java
>> program, the executor's memory is increased...but not to 2048m, it is set
>> to 6.7GB...Looks like there's some formula to calculate this value.
>>
>>
>> Thanks,
>> David
>>
>>
>> On Mon, Mar 16, 2015 at 7:36 PM Akhil Das 
>> wrote:
>>
>>> By default spark.executor.memory is set to 512m, I'm assuming since you
>>> are submiting the job using spark-submit and it is not able to override the
>>> value since you are running in local mode. Can you try it without using
>>> spark-submit as a standalone project?
>>>
>>> Thanks
>>> Best Regards
>>>
>>> On Mon, Mar 16, 2015 at 1:52 PM, Xi Shen  wrote:
>>>
>>>> I set it in code, not by configuration. I submit my jar file to local.
>>>> I am working in my developer environment.
>>>>
>>>> On Mon, 16 Mar 2015 18:28 Akhil Das  wrote:
>>>>
>>>>> How are you setting it? and how are you submitting the job?
>>>>>
>>>>> Thanks
>>>>> Best Regards
>>>>>
>>>>> On Mon, Mar 16, 2015 at 12:52 PM, Xi Shen 
>>>>> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> I have set spark.executor.memory to 2048m, and in the UI
>>>>>> "Environment" page, I can see this value has been set correctly. But in 
>>>>>> the
>>>>>> "Executors" page, I saw there's only 1 executor and its memory is 
>>>>>> 265.4MB.
>>>>>> Very strange value. why not 256MB, or just as what I set?
>>>>>>
>>>>>> What am I missing here?
>>>>>>
>>>>>>
>>>>>> Thanks,
>>>>>> David
>>>>>>
>>>>>>
>>>>>
>>>
>


Re: k-means hang without error/warning

2015-03-16 Thread Xi Shen
Hi Sean,

My system is Windows 64-bit. I looked into the resource manager: Java is
the only process that uses about 13% of CPU resources; there is no disk
activity related to Java; and only about 6GB of memory is used out of 56GB in
total.

My system responds very well. I don't think it is a system issue.

Thanks,
David

On Mon, 16 Mar 2015 22:30 Sean Owen  wrote:

> I think you'd have to say more about "stopped working". Is the GC
> thrashing? does the UI respond? is the CPU busy or not?
>
> On Mon, Mar 16, 2015 at 4:25 AM, Xi Shen  wrote:
> > Hi,
> >
> > I am running k-means using Spark in local mode. My data set is about 30k
> > records, and I set the k = 1000.
> >
> > The algorithm starts and finished 13 jobs according to the UI monitor,
> then
> > it stopped working.
> >
> > The last log I saw was:
> >
> > [Spark Context Cleaner] INFO org.apache.spark.ContextCleaner - Cleaned
> > broadcast 16
> >
> > There're many similar log repeated, but it seems it always stop at the
> 16th.
> >
> > If I try to low down the k value, the algorithm will terminated. So I
> just
> > want to know what's wrong with k=1000.
> >
> >
> > Thanks,
> > David
> >
>


Can I start multiple executors in local mode?

2015-03-16 Thread Xi Shen
Hi,

In YARN mode you can specify the number of executors. I wonder if we can
also start multiple executors locally, just to make the test run faster.

Thanks,
David


Suggestion for user logging

2015-03-16 Thread Xi Shen
Hi,

When you submit a jar to the Spark cluster, it is very difficult to see the
logging. Is there any way to save the logging to a file? I mean only the
logging I created, not the Spark log information.
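
One option is a sketch like the following, using the log4j 1.x API bundled
with Spark (the path and logger name are made up): attach a file appender to
my own logger namespace only, so Spark's output stays where it is. Note this
only captures driver-side logging.

import org.apache.log4j.{FileAppender, Level, Logger, PatternLayout}

val appender = new FileAppender(
  new PatternLayout("%d %-5p %c - %m%n"),
  "/tmp/myapp-driver.log",   // hypothetical path on the driver machine
  true)                      // append instead of overwrite

val log = Logger.getLogger("com.example.myapp")   // hypothetical namespace
log.setLevel(Level.INFO)
log.addAppender(appender)

log.info("this line goes to /tmp/myapp-driver.log, not into the Spark logs")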


Thanks,
David


Re: Can I start multiple executors in local mode?

2015-03-21 Thread Xi Shen
No, I didn't mean local-cluster. I mean running locally, like in an IDE.
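
For reference, the two local options as I understand them (a sketch;
"local-cluster" needs a full Spark distribution on the machine and is mainly
used by Spark's own tests):

import org.apache.spark.{SparkConf, SparkContext}

// local[4]                -> one JVM, a single executor backend running 4 task threads
// local-cluster[2,2,1024] -> two separate executor JVMs with 2 cores and 1024 MB each
val conf = new SparkConf()
  .setAppName("local-test")
  .setMaster("local[4]")   // or "local-cluster[2,2,1024]"
val sc = new SparkContext(conf)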

On Mon, 16 Mar 2015 23:12 xu Peng  wrote:

> Hi David,
>
> You can try the local-cluster.
>
> the number in local-cluster[2,2,1024] represents that there are 2 worker,
> 2 cores and 1024M
>
> Best Regards
>
> Peng Xu
>
> 2015-03-16 19:46 GMT+08:00 Xi Shen :
>
>> Hi,
>>
>> In YARN mode you can specify the number of executors. I wonder if we can
>> also start multiple executors at local, just to make the test run faster.
>>
>> Thanks,
>> David
>>
>
>


Re: How to set Spark executor memory?

2015-03-21 Thread Xi Shen
Hi Sean,

It's getting strange now. If I run from the IDE, my executor memory is always
set to 6.7G, no matter what value I set in code. I have checked my
environment variables, and there's no value of 6.7 or 12.5.

Any idea?

Thanks,
David

On Tue, 17 Mar 2015 00:35 null  wrote:

>  Hi Xi Shen,
>
> You could set the spark.executor.memory in the code itself . new 
> SparkConf()..set("spark.executor.memory", "2g")
>
> Or you can try the -- spark.executor.memory 2g while submitting the jar.
>
>
>
> Regards
>
> Jishnu Prathap
>
>
>
> *From:* Akhil Das [mailto:ak...@sigmoidanalytics.com]
> *Sent:* Monday, March 16, 2015 2:06 PM
> *To:* Xi Shen
> *Cc:* user@spark.apache.org
> *Subject:* Re: How to set Spark executor memory?
>
>
>
> By default spark.executor.memory is set to 512m, I'm assuming since you
> are submiting the job using spark-submit and it is not able to override the
> value since you are running in local mode. Can you try it without using
> spark-submit as a standalone project?
>
>
>   Thanks
>
> Best Regards
>
>
>
> On Mon, Mar 16, 2015 at 1:52 PM, Xi Shen  wrote:
>
> I set it in code, not by configuration. I submit my jar file to local. I
> am working in my developer environment.
>
>
>
> On Mon, 16 Mar 2015 18:28 Akhil Das  wrote:
>
> How are you setting it? and how are you submitting the job?
>
>
>   Thanks
>
> Best Regards
>
>
>
> On Mon, Mar 16, 2015 at 12:52 PM, Xi Shen  wrote:
>
> Hi,
>
>
>
> I have set spark.executor.memory to 2048m, and in the UI "Environment"
> page, I can see this value has been set correctly. But in the "Executors"
> page, I saw there's only 1 executor and its memory is 265.4MB. Very strange
> value. why not 256MB, or just as what I set?
>
>
>
> What am I missing here?
>
>
>
>
>
> Thanks,
>
> David
>
>
>
>
>
>
>  The information contained in this electronic message and any attachments
> to this message are intended for the exclusive use of the addressee(s) and
> may contain proprietary, confidential or privileged information. If you are
> not the intended recipient, you should not disseminate, distribute or copy
> this e-mail. Please notify the sender immediately and destroy all copies of
> this message and any attachments. WARNING: Computer viruses can be
> transmitted via email. The recipient should check this email and any
> attachments for the presence of viruses. The company accepts no liability
> for any damage caused by any virus transmitted by this email.
> www.wipro.com
>


Re: How to set Spark executor memory?

2015-03-21 Thread Xi Shen
In the log, I saw

  MemoryStorage: MemoryStore started with capacity 6.7GB

But I still cannot find where to set this storage capacity.

On Sat, 21 Mar 2015 20:30 Xi Shen  wrote:

> Hi Sean,
>
> It's getting strange now. If I ran from IDE, my executor memory is always
> set to 6.7G, no matter what value I set in code. I have check my
> environment variable, and there's no value of 6.7, or 12.5
>
> Any idea?
>
> Thanks,
> David
>
> On Tue, 17 Mar 2015 00:35 null  wrote:
>
>>  Hi Xi Shen,
>>
>> You could set the spark.executor.memory in the code itself . new 
>> SparkConf()..set("spark.executor.memory", "2g")
>>
>> Or you can try the -- spark.executor.memory 2g while submitting the jar.
>>
>>
>>
>> Regards
>>
>> Jishnu Prathap
>>
>>
>>
>> *From:* Akhil Das [mailto:ak...@sigmoidanalytics.com]
>> *Sent:* Monday, March 16, 2015 2:06 PM
>> *To:* Xi Shen
>> *Cc:* user@spark.apache.org
>> *Subject:* Re: How to set Spark executor memory?
>>
>>
>>
>> By default spark.executor.memory is set to 512m, I'm assuming since you
>> are submiting the job using spark-submit and it is not able to override the
>> value since you are running in local mode. Can you try it without using
>> spark-submit as a standalone project?
>>
>>
>>   Thanks
>>
>> Best Regards
>>
>>
>>
>> On Mon, Mar 16, 2015 at 1:52 PM, Xi Shen  wrote:
>>
>> I set it in code, not by configuration. I submit my jar file to local. I
>> am working in my developer environment.
>>
>>
>>
>> On Mon, 16 Mar 2015 18:28 Akhil Das  wrote:
>>
>> How are you setting it? and how are you submitting the job?
>>
>>
>>   Thanks
>>
>> Best Regards
>>
>>
>>
>> On Mon, Mar 16, 2015 at 12:52 PM, Xi Shen  wrote:
>>
>> Hi,
>>
>>
>>
>> I have set spark.executor.memory to 2048m, and in the UI "Environment"
>> page, I can see this value has been set correctly. But in the "Executors"
>> page, I saw there's only 1 executor and its memory is 265.4MB. Very strange
>> value. why not 256MB, or just as what I set?
>>
>>
>>
>> What am I missing here?
>>
>>
>>
>>
>>
>> Thanks,
>>
>> David
>>
>>
>>
>>
>>
>>
>>  The information contained in this electronic message and any
>> attachments to this message are intended for the exclusive use of the
>> addressee(s) and may contain proprietary, confidential or privileged
>> information. If you are not the intended recipient, you should not
>> disseminate, distribute or copy this e-mail. Please notify the sender
>> immediately and destroy all copies of this message and any attachments.
>> WARNING: Computer viruses can be transmitted via email. The recipient
>> should check this email and any attachments for the presence of viruses.
>> The company accepts no liability for any damage caused by any virus
>> transmitted by this email. www.wipro.com
>>
>


Re: How to set Spark executor memory?

2015-03-21 Thread Xi Shen
Yeah, I think it is harder to troubleshoot the properties issues in an IDE.
But the reason I stick to the IDE is that if I use spark-submit, the native
BLAS cannot be loaded. Maybe I should open another thread to discuss
that.

Thanks,
David

On Sun, 22 Mar 2015 10:38 Xi Shen  wrote:

> In the log, I saw
>
>   MemoryStorage: MemoryStore started with capacity 6.7GB
>
> But I still can not find where to set this storage capacity.
>
> On Sat, 21 Mar 2015 20:30 Xi Shen  wrote:
>
>> Hi Sean,
>>
>> It's getting strange now. If I ran from IDE, my executor memory is always
>> set to 6.7G, no matter what value I set in code. I have check my
>> environment variable, and there's no value of 6.7, or 12.5
>>
>> Any idea?
>>
>> Thanks,
>> David
>>
>> On Tue, 17 Mar 2015 00:35 null  wrote:
>>
>>>  Hi Xi Shen,
>>>
>>> You could set the spark.executor.memory in the code itself . new 
>>> SparkConf()..set("spark.executor.memory", "2g")
>>>
>>> Or you can try the -- spark.executor.memory 2g while submitting the jar.
>>>
>>>
>>>
>>> Regards
>>>
>>> Jishnu Prathap
>>>
>>>
>>>
>>> *From:* Akhil Das [mailto:ak...@sigmoidanalytics.com]
>>> *Sent:* Monday, March 16, 2015 2:06 PM
>>> *To:* Xi Shen
>>> *Cc:* user@spark.apache.org
>>> *Subject:* Re: How to set Spark executor memory?
>>>
>>>
>>>
>>> By default spark.executor.memory is set to 512m, I'm assuming since you
>>> are submiting the job using spark-submit and it is not able to override the
>>> value since you are running in local mode. Can you try it without using
>>> spark-submit as a standalone project?
>>>
>>>
>>>   Thanks
>>>
>>> Best Regards
>>>
>>>
>>>
>>> On Mon, Mar 16, 2015 at 1:52 PM, Xi Shen  wrote:
>>>
>>> I set it in code, not by configuration. I submit my jar file to local. I
>>> am working in my developer environment.
>>>
>>>
>>>
>>> On Mon, 16 Mar 2015 18:28 Akhil Das  wrote:
>>>
>>> How are you setting it? and how are you submitting the job?
>>>
>>>
>>>   Thanks
>>>
>>> Best Regards
>>>
>>>
>>>
>>> On Mon, Mar 16, 2015 at 12:52 PM, Xi Shen  wrote:
>>>
>>> Hi,
>>>
>>>
>>>
>>> I have set spark.executor.memory to 2048m, and in the UI "Environment"
>>> page, I can see this value has been set correctly. But in the "Executors"
>>> page, I saw there's only 1 executor and its memory is 265.4MB. Very strange
>>> value. why not 256MB, or just as what I set?
>>>
>>>
>>>
>>> What am I missing here?
>>>
>>>
>>>
>>>
>>>
>>> Thanks,
>>>
>>> David
>>>
>>>
>>>
>>>
>>>
>>>
>>>  The information contained in this electronic message and any
>>> attachments to this message are intended for the exclusive use of the
>>> addressee(s) and may contain proprietary, confidential or privileged
>>> information. If you are not the intended recipient, you should not
>>> disseminate, distribute or copy this e-mail. Please notify the sender
>>> immediately and destroy all copies of this message and any attachments.
>>> WARNING: Computer viruses can be transmitted via email. The recipient
>>> should check this email and any attachments for the presence of viruses.
>>> The company accepts no liability for any damage caused by any virus
>>> transmitted by this email. www.wipro.com
>>>
>>


netlib-java cannot load native lib in Windows when using spark-submit

2015-03-21 Thread Xi Shen
Hi,

I use the *OpenBLAS* DLL, and have configured my application to work in the
IDE. When I start my Spark application from the IntelliJ IDE, I can see in the
log that the native lib is loaded successfully.

But if I use *spark-submit* to start my application, the native lib still
cannot be loaded. I saw the WARN messages saying it failed to load both the
native and native-ref libraries. I checked the *Environment* tab in the Spark
UI, and *java.library.path* is set correctly.


Thanks,

David


How to do nested foreach with RDD

2015-03-21 Thread Xi Shen
Hi,

I have two big RDDs, and I need to do some math against each pair of them.
Traditionally, it is like a nested for-loop. But for RDDs, it causes a nested
RDD, which is prohibited.

Currently, I am collecting one of them and then doing a nested for-loop to
avoid the nested RDD. But I would like to know if there is a Spark way to do
this.


Thanks,
David


Re: netlib-java cannot load native lib in Windows when using spark-submit

2015-03-22 Thread Xi Shen
Hi Ted,

I have tried to invoke the command from both the Cygwin and PowerShell
environments. I still get the messages:

15/03/22 21:56:00 WARN netlib.BLAS: Failed to load implementation from:
com.github.fommil.netlib.NativeSystemBLAS
15/03/22 21:56:00 WARN netlib.BLAS: Failed to load implementation from:
com.github.fommil.netlib.NativeRefBLAS

From the Spark UI, I can see:

  spark.driver.extraLibrary c:\openblas


Thanks,
David


On Sun, Mar 22, 2015 at 11:45 AM Ted Yu  wrote:

> Can you try the --driver-library-path option ?
>
> spark-submit --driver-library-path /opt/hadoop/lib/native ...
>
> Cheers
>
> On Sat, Mar 21, 2015 at 4:58 PM, Xi Shen  wrote:
>
>> Hi,
>>
>> I use the *OpenBLAS* DLL, and have configured my application to work in
>> IDE. When I start my Spark application from IntelliJ IDE, I can see in the
>> log that the native lib is loaded successfully.
>>
>> But if I use *spark-submit* to start my application, the native lib
>> still cannot be load. I saw the WARN message that it failed to load both
>> the native and native-ref library. I checked the *Environment* tab in
>> the Spark UI, and the *java.library.path* is set correctly.
>>
>>
>> Thanks,
>>
>> David
>>
>>
>>
>


Re: How to do nested foreach with RDD

2015-03-22 Thread Xi Shen
Hi Reza,

Yes, I just found RDD.cartesian(). Very useful.

Thanks,
David


On Sun, Mar 22, 2015 at 5:08 PM Reza Zadeh  wrote:

> You can do this with the 'cartesian' product method on RDD. For example:
>
> val rdd1 = ...
> val rdd2 = ...
>
> val combinations = rdd1.cartesian(rdd2).filter{ case (a,b) => a < b }
>
> Reza
>
> On Sat, Mar 21, 2015 at 10:37 PM, Xi Shen  wrote:
>
>> Hi,
>>
>> I have two big RDD, and I need to do some math against each pair of them.
>> Traditionally, it is like a nested for-loop. But for RDD, it cause a nested
>> RDD which is prohibited.
>>
>> Currently, I am collecting one of them, then do a nested for-loop, so to
>> avoid nested RDD. But would like to know if there's spark-way to do this.
>>
>>
>> Thanks,
>> David
>>
>>
>


Re: How to set Spark executor memory?

2015-03-22 Thread Xi Shen
OK, I actually got the answer days ago from StackOverflow, but I did not
check it :(

When running in "local" mode, to set the executor memory

- when using spark-submit, use "--driver-memory"
- when running as a Java application, like executing from IDE, set the
"-Xmx" vm option


Thanks,
David


On Sun, Mar 22, 2015 at 2:10 PM Ted Yu  wrote:

> bq. the BLAS native cannot be loaded
>
> Have you tried specifying --driver-library-path option ?
>
> Cheers
>
> On Sat, Mar 21, 2015 at 4:42 PM, Xi Shen  wrote:
>
>> Yeah, I think it is harder to troubleshot the properties issues in a IDE.
>> But the reason I stick to IDE is because if I use spark-submit, the BLAS
>> native cannot be loaded. May be I should open another thread to discuss
>> that.
>>
>> Thanks,
>> David
>>
>> On Sun, 22 Mar 2015 10:38 Xi Shen  wrote:
>>
>>> In the log, I saw
>>>
>>>   MemoryStorage: MemoryStore started with capacity 6.7GB
>>>
>>> But I still can not find where to set this storage capacity.
>>>
>>> On Sat, 21 Mar 2015 20:30 Xi Shen  wrote:
>>>
>>>> Hi Sean,
>>>>
>>>> It's getting strange now. If I ran from IDE, my executor memory is
>>>> always set to 6.7G, no matter what value I set in code. I have check my
>>>> environment variable, and there's no value of 6.7, or 12.5
>>>>
>>>> Any idea?
>>>>
>>>> Thanks,
>>>> David
>>>>
>>>> On Tue, 17 Mar 2015 00:35 null  wrote:
>>>>
>>>>>  Hi Xi Shen,
>>>>>
>>>>> You could set the spark.executor.memory in the code itself . new 
>>>>> SparkConf()..set("spark.executor.memory", "2g")
>>>>>
>>>>> Or you can try the -- spark.executor.memory 2g while submitting the
>>>>> jar.
>>>>>
>>>>>
>>>>>
>>>>> Regards
>>>>>
>>>>> Jishnu Prathap
>>>>>
>>>>>
>>>>>
>>>>> *From:* Akhil Das [mailto:ak...@sigmoidanalytics.com]
>>>>> *Sent:* Monday, March 16, 2015 2:06 PM
>>>>> *To:* Xi Shen
>>>>> *Cc:* user@spark.apache.org
>>>>> *Subject:* Re: How to set Spark executor memory?
>>>>>
>>>>>
>>>>>
>>>>> By default spark.executor.memory is set to 512m, I'm assuming since
>>>>> you are submiting the job using spark-submit and it is not able to 
>>>>> override
>>>>> the value since you are running in local mode. Can you try it without 
>>>>> using
>>>>> spark-submit as a standalone project?
>>>>>
>>>>>
>>>>>   Thanks
>>>>>
>>>>> Best Regards
>>>>>
>>>>>
>>>>>
>>>>> On Mon, Mar 16, 2015 at 1:52 PM, Xi Shen 
>>>>> wrote:
>>>>>
>>>>> I set it in code, not by configuration. I submit my jar file to local.
>>>>> I am working in my developer environment.
>>>>>
>>>>>
>>>>>
>>>>> On Mon, 16 Mar 2015 18:28 Akhil Das 
>>>>> wrote:
>>>>>
>>>>> How are you setting it? and how are you submitting the job?
>>>>>
>>>>>
>>>>>   Thanks
>>>>>
>>>>> Best Regards
>>>>>
>>>>>
>>>>>
>>>>> On Mon, Mar 16, 2015 at 12:52 PM, Xi Shen 
>>>>> wrote:
>>>>>
>>>>> Hi,
>>>>>
>>>>>
>>>>>
>>>>> I have set spark.executor.memory to 2048m, and in the UI "Environment"
>>>>> page, I can see this value has been set correctly. But in the "Executors"
>>>>> page, I saw there's only 1 executor and its memory is 265.4MB. Very 
>>>>> strange
>>>>> value. why not 256MB, or just as what I set?
>>>>>
>>>>>
>>>>>
>>>>> What am I missing here?
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> Thanks,
>>>>>
>>>>> David
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>


Re: netlib-java cannot load native lib in Windows when using spark-submit

2015-03-23 Thread Xi Shen
I did not build my own Spark. I got the binary version online. If it can
load the native libs from the IDE, it should also be able to load them when
running with "--master local".

On Mon, 23 Mar 2015 07:15 Burak Yavuz  wrote:

> Did you build Spark with: -Pnetlib-lgpl?
>
> Ref: https://spark.apache.org/docs/latest/mllib-guide.html
>
> Burak
>
> On Sun, Mar 22, 2015 at 7:37 AM, Ted Yu  wrote:
>
>> How about pointing LD_LIBRARY_PATH to native lib folder ?
>>
>> You need Spark 1.2.0 or higher for the above to work. See SPARK-1719
>>
>> Cheers
>>
>> On Sun, Mar 22, 2015 at 4:02 AM, Xi Shen  wrote:
>>
>>> Hi Ted,
>>>
>>> I have tried to invoke the command from both cygwin environment and
>>> powershell environment. I still get the messages:
>>>
>>> 15/03/22 21:56:00 WARN netlib.BLAS: Failed to load implementation from:
>>> com.github.fommil.netlib.NativeSystemBLAS
>>> 15/03/22 21:56:00 WARN netlib.BLAS: Failed to load implementation from:
>>> com.github.fommil.netlib.NativeRefBLAS
>>>
>>> From the Spark UI, I can see:
>>>
>>>   spark.driver.extraLibrary c:\openblas
>>>
>>>
>>> Thanks,
>>> David
>>>
>>>
>>> On Sun, Mar 22, 2015 at 11:45 AM Ted Yu  wrote:
>>>
>>>> Can you try the --driver-library-path option ?
>>>>
>>>> spark-submit --driver-library-path /opt/hadoop/lib/native ...
>>>>
>>>> Cheers
>>>>
>>>> On Sat, Mar 21, 2015 at 4:58 PM, Xi Shen  wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> I use the *OpenBLAS* DLL, and have configured my application to work
>>>>> in IDE. When I start my Spark application from IntelliJ IDE, I can see in
>>>>> the log that the native lib is loaded successfully.
>>>>>
>>>>> But if I use *spark-submit* to start my application, the native lib
>>>>> still cannot be load. I saw the WARN message that it failed to load both
>>>>> the native and native-ref library. I checked the *Environment* tab in
>>>>> the Spark UI, and the *java.library.path* is set correctly.
>>>>>
>>>>>
>>>>> Thanks,
>>>>>
>>>>> David
>>>>>
>>>>>
>>>>>
>>>>
>>
>


How to deploy binary dependencies to workers?

2015-03-24 Thread Xi Shen
Hi,

I am doing ML using Spark MLlib. However, I do not have full control of the
cluster; I am using Microsoft Azure HDInsight.

I want to deploy the BLAS or whatever required dependencies to accelerate
the computation. But I don't know how to deploy those DLLs when I submit my
JAR to the cluster.

I know how to pack those DLLs into a jar. The real challenge is how to let
the system find them...
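
For concreteness, this is the kind of submit command I have in mind (the
paths are examples, and spark.executor.extraLibraryPath is my assumption
about the executor-side counterpart of --driver-library-path, not something
I have verified on HDInsight):

spark-submit --master yarn-cluster --driver-library-path c:\openblas --conf spark.executor.extraLibraryPath=c:\openblas --files file:/c:/openblas/libblas3.dll,file:/c:/openblas/liblapack3.dll my-app.jar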


Thanks,
David


Re: issue while submitting Spark Job as --master yarn-cluster

2015-03-25 Thread Xi Shen
What is your environment? I remember I had a similar error when running
"spark-shell --master yarn-client" in a Windows environment.


On Wed, Mar 25, 2015 at 9:07 PM sachin Singh 
wrote:

> Hi ,
> when I am submitting spark job in cluster mode getting error as under in
> hadoop-yarn  log,
> someone has any idea,please suggest,
>
> 2015-03-25 23:35:22,467 INFO
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl:
> application_1427124496008_0028 State change from FINAL_SAVING to FAILED
> 2015-03-25 23:35:22,467 WARN
> org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=hdfs
> OPERATION=Application Finished - Failed TARGET=RMAppManager
>  RESULT=FAILURE
> DESCRIPTION=App failed with state: FAILED   PERMISSIONS=Application
> application_1427124496008_0028 failed 2 times due to AM Container for
> appattempt_1427124496008_0028_02 exited with  exitCode: 13 due to:
> Exception from container-launch.
> Container id: container_1427124496008_0028_02_01
> Exit code: 13
> Stack trace: ExitCodeException exitCode=13:
> at org.apache.hadoop.util.Shell.runCommand(Shell.java:538)
> at org.apache.hadoop.util.Shell.run(Shell.java:455)
> at
> org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:702)
> at
> org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.
> launchContainer(DefaultContainerExecutor.java:197)
> at
> org.apache.hadoop.yarn.server.nodemanager.containermanager.
> launcher.ContainerLaunch.call(ContainerLaunch.java:299)
> at
> org.apache.hadoop.yarn.server.nodemanager.containermanager.
> launcher.ContainerLaunch.call(ContainerLaunch.java:81)
> at java.util.concurrent.FutureTask.run(FutureTask.java:262)
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(
> ThreadPoolExecutor.java:1145)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(
> ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
>
>
> Container exited with a non-zero exit code 13
> .Failing this attempt.. Failing the application.
> APPID=application_1427124496008_0028
>
>
>
> --
> View this message in context: http://apache-spark-user-list.
> 1001560.n3.nabble.com/issue-while-submitting-Spark-Job-as-
> master-yarn-cluster-tp0.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: How to deploy binary dependencies to workers?

2015-03-25 Thread Xi Shen
I think you meant to use the "--files" option to deploy the DLLs. I gave it a
try, but it did not work.

From the Spark UI, Environment tab, I can see

spark.yarn.dist.files

file:/c:/openblas/libgcc_s_seh-1.dll,file:/c:/openblas/libblas3.dll,file:/c:/openblas/libgfortran-3.dll,file:/c:/openblas/liblapack3.dll,file:/c:/openblas/libquadmath-0.dll

I think my DLLs are all deployed. But I still got the warning message that
the native BLAS library cannot be loaded.

Any idea?


Thanks,
David


On Wed, Mar 25, 2015 at 5:40 AM DB Tsai  wrote:

> I would recommend to upload those jars to HDFS, and use add jars
> option in spark-submit with URI from HDFS instead of URI from local
> filesystem. Thus, it can avoid the problem of fetching jars from
> driver which can be a bottleneck.
>
> Sincerely,
>
> DB Tsai
> ---
> Blog: https://www.dbtsai.com
>
>
> On Tue, Mar 24, 2015 at 4:13 AM, Xi Shen  wrote:
> > Hi,
> >
> > I am doing ML using Spark mllib. However, I do not have full control to
> the
> > cluster. I am using Microsoft Azure HDInsight
> >
> > I want to deploy the BLAS or whatever required dependencies to accelerate
> > the computation. But I don't know how to deploy those DLLs when I submit
> my
> > JAR to the cluster.
> >
> > I know how to pack those DLLs into a jar. The real challenge is how to
> let
> > the system find them...
> >
> >
> > Thanks,
> > David
> >
>


Re: How to deploy binary dependencies to workers?

2015-03-25 Thread Xi Shen
Of course not... all machines in HDInsight are Windows 64-bit servers. And I
have made sure all my DLLs are for 64-bit machines. I have managed to get
those DLLs loaded on my local machine, which is also Windows 64-bit.




Xi Shen
about.me/davidshen

On Thu, Mar 26, 2015 at 11:11 AM, DB Tsai  wrote:

> Are you deploying the windows dll to linux machine?
>
> Sincerely,
>
> DB Tsai
> ---
> Blog: https://www.dbtsai.com
>
>
> On Wed, Mar 25, 2015 at 3:57 AM, Xi Shen  wrote:
> > I think you meant to use the "--files" to deploy the DLLs. I gave a try,
> but
> > it did not work.
> >
> > From the Spark UI, Environment tab, I can see
> >
> > spark.yarn.dist.files
> >
> >
> file:/c:/openblas/libgcc_s_seh-1.dll,file:/c:/openblas/libblas3.dll,file:/c:/openblas/libgfortran-3.dll,file:/c:/openblas/liblapack3.dll,file:/c:/openblas/libquadmath-0.dll
> >
> > I think my DLLs are all deployed. But I still got the warn message that
> > native BLAS library cannot be load.
> >
> > And idea?
> >
> >
> > Thanks,
> > David
> >
> >
> > On Wed, Mar 25, 2015 at 5:40 AM DB Tsai  wrote:
> >>
> >> I would recommend to upload those jars to HDFS, and use add jars
> >> option in spark-submit with URI from HDFS instead of URI from local
> >> filesystem. Thus, it can avoid the problem of fetching jars from
> >> driver which can be a bottleneck.
> >>
> >> Sincerely,
> >>
> >> DB Tsai
> >> ---
> >> Blog: https://www.dbtsai.com
> >>
> >>
> >> On Tue, Mar 24, 2015 at 4:13 AM, Xi Shen  wrote:
> >> > Hi,
> >> >
> >> > I am doing ML using Spark mllib. However, I do not have full control
> to
> >> > the
> >> > cluster. I am using Microsoft Azure HDInsight
> >> >
> >> > I want to deploy the BLAS or whatever required dependencies to
> >> > accelerate
> >> > the computation. But I don't know how to deploy those DLLs when I
> submit
> >> > my
> >> > JAR to the cluster.
> >> >
> >> > I know how to pack those DLLs into a jar. The real challenge is how to
> >> > let
> >> > the system find them...
> >> >
> >> >
> >> > Thanks,
> >> > David
> >> >
>


How to troubleshoot server.TransportChannelHandler Exception

2015-03-25 Thread Xi Shen
Hi,

My environment is Windows 64-bit, Spark + YARN. I had a job that took a long
time. It started well, but it ended with the exception below:

15/03/25 12:39:09 WARN server.TransportChannelHandler: Exception in
connection from
headnode0.xshe3539-hadoop-sydney.q10.internal.cloudapp.net/100.72.68.34:58507
java.io.IOException: An existing connection was forcibly closed by the
remote host
at sun.nio.ch.SocketDispatcher.read0(Native Method)
at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:43)
at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
at sun.nio.ch.IOUtil.read(IOUtil.java:192)
at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:379)
at
io.netty.buffer.PooledUnsafeDirectByteBuf.setBytes(PooledUnsafeDirectByteBuf.java:311)
at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:881)
at
io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:225)
at
io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:119)
at
io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)
at
io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
at
io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
at
io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:116)
at java.lang.Thread.run(Thread.java:745)
15/03/25 12:39:09 ERROR executor.CoarseGrainedExecutorBackend: Driver
Disassociated [akka.tcp://
sparkexecu...@workernode0.xshe3539-hadoop-sydney.q10.internal.cloudapp.net:65469]
-> [akka.tcp://
sparkdri...@headnode0.xshe3539-hadoop-sydney.q10.internal.cloudapp.net:58467]
disassociated! Shutting down.
15/03/25 12:39:09 WARN remote.ReliableDeliverySupervisor: Association with
remote system [akka.tcp://
sparkdri...@headnode0.xshe3539-hadoop-sydney.q10.internal.cloudapp.net:58467]
has failed, address is now gated for [5000] ms. Reason is: [Disassociated].

Interestingly, the job is shown as Succeeded in the RM. I checked the
application log; it is miles long, and this is the only exception I found.
And it is not very useful for helping me pinpoint the problem.

Any idea what would be the cause?


Thanks,


Xi Shen
about.me/davidshen


Re: How to troubleshoot server.TransportChannelHandler Exception

2015-03-26 Thread Xi Shen
ah~hell, I am using Spark 1.2.0, and my job was submitted to use 8
cores...the magic number in the bug.




Xi Shen
about.me/davidshen

On Thu, Mar 26, 2015 at 5:48 PM, Akhil Das 
wrote:

> Whats your spark version? Not quiet sure, but you could be hitting this
> issue
> https://issues.apache.org/jira/plugins/servlet/mobile#issue/SPARK-4516
> On 26 Mar 2015 11:01, "Xi Shen"  wrote:
>
>> Hi,
>>
>> My environment is Windows 64bit, Spark + YARN. I had a job that takes a
>> long time. It starts well, but it ended with below exception:
>>
>> 15/03/25 12:39:09 WARN server.TransportChannelHandler: Exception in
>> connection from
>> headnode0.xshe3539-hadoop-sydney.q10.internal.cloudapp.net/100.72.68.34:58507
>> java.io.IOException: An existing connection was forcibly closed by the
>> remote host
>> at sun.nio.ch.SocketDispatcher.read0(Native Method)
>> at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:43)
>> at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
>> at sun.nio.ch.IOUtil.read(IOUtil.java:192)
>> at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:379)
>> at
>> io.netty.buffer.PooledUnsafeDirectByteBuf.setBytes(PooledUnsafeDirectByteBuf.java:311)
>> at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:881)
>> at
>> io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:225)
>> at
>> io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:119)
>> at
>> io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)
>> at
>> io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
>> at
>> io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
>> at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
>> at
>> io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:116)
>> at java.lang.Thread.run(Thread.java:745)
>> 15/03/25 12:39:09 ERROR executor.CoarseGrainedExecutorBackend: Driver
>> Disassociated [akka.tcp://
>> sparkexecu...@workernode0.xshe3539-hadoop-sydney.q10.internal.cloudapp.net:65469]
>> -> [akka.tcp://
>> sparkdri...@headnode0.xshe3539-hadoop-sydney.q10.internal.cloudapp.net:58467]
>> disassociated! Shutting down.
>> 15/03/25 12:39:09 WARN remote.ReliableDeliverySupervisor: Association
>> with remote system [akka.tcp://
>> sparkdri...@headnode0.xshe3539-hadoop-sydney.q10.internal.cloudapp.net:58467]
>> has failed, address is now gated for [5000] ms. Reason is: [Disassociated].
>>
>> Interestingly, the job is shown as Succeeded in the RM. I checked the
>> application log, it is miles long, and this is the only exception I found.
>> And it is no very useful to help me pin point the problem.
>>
>> Any idea what would be the cause?
>>
>>
>> Thanks,
>>
>>
>> [image: --]
>> Xi Shen
>> [image: http://]about.me/davidshen
>> <http://about.me/davidshen?promo=email_sig>
>>   <http://about.me/davidshen>
>>
>


Re: How to deploy binary dependencies to workers?

2015-03-26 Thread Xi Shen
OK, after various tests, I found the native library can be loaded when
running in yarn-cluster mode. But I still cannot figure out why it won't load
when running in yarn-client mode...


Thanks,
David


On Thu, Mar 26, 2015 at 4:21 PM Xi Shen  wrote:

> Not of course...all machines in HDInsight are Windows 64bit server. And I
> have made sure all my DLLs are for 64bit machines. I have managed to get
> those DLLs loade on my local machine which is also Windows 64bit.
>
>
>
>
> [image: --]
> Xi Shen
> [image: http://]about.me/davidshen
> <http://about.me/davidshen?promo=email_sig>
>   <http://about.me/davidshen>
>
> On Thu, Mar 26, 2015 at 11:11 AM, DB Tsai  wrote:
>
>> Are you deploying the windows dll to linux machine?
>>
>> Sincerely,
>>
>> DB Tsai
>> ---
>> Blog: https://www.dbtsai.com
>>
>>
>> On Wed, Mar 25, 2015 at 3:57 AM, Xi Shen  wrote:
>> > I think you meant to use the "--files" to deploy the DLLs. I gave a
>> try, but
>> > it did not work.
>> >
>> > From the Spark UI, Environment tab, I can see
>> >
>> > spark.yarn.dist.files
>> >
>> >
>> file:/c:/openblas/libgcc_s_seh-1.dll,file:/c:/openblas/libblas3.dll,file:/c:/openblas/libgfortran-3.dll,file:/c:/openblas/liblapack3.dll,file:/c:/openblas/libquadmath-0.dll
>> >
>> > I think my DLLs are all deployed. But I still got the warn message that
>> > native BLAS library cannot be load.
>> >
>> > And idea?
>> >
>> >
>> > Thanks,
>> > David
>> >
>> >
>> > On Wed, Mar 25, 2015 at 5:40 AM DB Tsai  wrote:
>> >>
>> >> I would recommend to upload those jars to HDFS, and use add jars
>> >> option in spark-submit with URI from HDFS instead of URI from local
>> >> filesystem. Thus, it can avoid the problem of fetching jars from
>> >> driver which can be a bottleneck.
>> >>
>> >> Sincerely,
>> >>
>> >> DB Tsai
>> >> ---
>> >> Blog: https://www.dbtsai.com
>> >>
>> >>
>> >> On Tue, Mar 24, 2015 at 4:13 AM, Xi Shen 
>> wrote:
>> >> > Hi,
>> >> >
>> >> > I am doing ML using Spark mllib. However, I do not have full control
>> to
>> >> > the
>> >> > cluster. I am using Microsoft Azure HDInsight
>> >> >
>> >> > I want to deploy the BLAS or whatever required dependencies to
>> >> > accelerate
>> >> > the computation. But I don't know how to deploy those DLLs when I
>> submit
>> >> > my
>> >> > JAR to the cluster.
>> >> >
>> >> > I know how to pack those DLLs into a jar. The real challenge is how
>> to
>> >> > let
>> >> > the system find them...
>> >> >
>> >> >
>> >> > Thanks,
>> >> > David
>> >> >
>>
>
>


Why k-means cluster hang for a long time?

2015-03-26 Thread Xi Shen
Hi,

When I run k-means cluster with Spark, I got this in the last two lines in
the log:

15/03/26 11:42:42 INFO spark.ContextCleaner: Cleaned broadcast 26
15/03/26 11:42:42 INFO spark.ContextCleaner: Cleaned shuffle 5



Then it hangs for a long time. There's no active job. The driver machine is
idle. I cannot access the worker nodes, so I am not sure if they are busy.

I understand k-means may take a long time to finish. But why is there no
active job? No log?


Thanks,
David


Re: K Means cluster with spark

2015-03-26 Thread Xi Shen
Hi Sandeep,

I followed the DenseKMeans example that comes with the Spark package.

My total number of vectors is about 40k, and my k=500. All my code is written
in Scala.

Thanks,
David

On Fri, 27 Mar 2015 05:51 sandeep vura  wrote:

> Hi Shen,
>
> I am also working on k means clustering with spark. May i know which links
> you are following to get understand k means clustering with spark and also
> need sample k means program to process in spark. which is written in scala.
>
> Regards,
> Sandeep.v
>


Re: Why k-means cluster hang for a long time?

2015-03-26 Thread Xi Shen
Hi Burak,

My maxIterations is set to 500. But I think it should also stop once the
centroids converge, right?

My Spark is 1.2.0, running on Windows 64-bit. My data set is about 40k
vectors; each vector has about 300 features, all normalised. All worker nodes
have sufficient memory and disk space.

Thanks,
David

On Fri, 27 Mar 2015 02:48 Burak Yavuz  wrote:

> Hi David,
>
> When the number of runs are large and the data is not properly
> partitioned, it seems that K-Means is hanging according to my experience.
> Especially setting the number of runs to something high drastically
> increases the work in executors. If that's not the case, can you give more
> info on what Spark version you are using, your setup, and your dataset?
>
> Thanks,
> Burak
> On Mar 26, 2015 5:10 AM, "Xi Shen"  wrote:
>
>> Hi,
>>
>> When I run k-means cluster with Spark, I got this in the last two lines
>> in the log:
>>
>> 15/03/26 11:42:42 INFO spark.ContextCleaner: Cleaned broadcast 26
>> 15/03/26 11:42:42 INFO spark.ContextCleaner: Cleaned shuffle 5
>>
>>
>>
>> Then it hangs for a long time. There's no active job. The driver machine
>> is idle. I cannot access the work node, I am not sure if they are busy.
>>
>> I understand k-means may take a long time to finish. But why no active
>> job? no log?
>>
>>
>> Thanks,
>> David
>>
>>


Re: Why k-means cluster hang for a long time?

2015-03-26 Thread Xi Shen
OH, the job I talked about has run for more than 11 hrs without a result...
it doesn't make sense.


On Fri, Mar 27, 2015 at 9:48 AM Xi Shen  wrote:

> Hi Burak,
>
> My iterations is set to 500. But I think it should also stop of the
> centroid coverages, right?
>
> My spark is 1.2.0, working in windows 64 bit. My data set is about 40k
> vectors, each vector has about 300 features, all normalised. All work node
> have sufficient memory and disk space.
>
> Thanks,
> David
>
> On Fri, 27 Mar 2015 02:48 Burak Yavuz  wrote:
>
>> Hi David,
>>
>> When the number of runs are large and the data is not properly
>> partitioned, it seems that K-Means is hanging according to my experience.
>> Especially setting the number of runs to something high drastically
>> increases the work in executors. If that's not the case, can you give more
>> info on what Spark version you are using, your setup, and your dataset?
>>
>> Thanks,
>> Burak
>> On Mar 26, 2015 5:10 AM, "Xi Shen"  wrote:
>>
>>> Hi,
>>>
>>> When I run k-means cluster with Spark, I got this in the last two lines
>>> in the log:
>>>
>>> 15/03/26 11:42:42 INFO spark.ContextCleaner: Cleaned broadcast 26
>>> 15/03/26 11:42:42 INFO spark.ContextCleaner: Cleaned shuffle 5
>>>
>>>
>>>
>>> Then it hangs for a long time. There's no active job. The driver machine
>>> is idle. I cannot access the work node, I am not sure if they are busy.
>>>
>>> I understand k-means may take a long time to finish. But why no active
>>> job? no log?
>>>
>>>
>>> Thanks,
>>> David
>>>
>>>


Re: Why k-means cluster hang for a long time?

2015-03-26 Thread Xi Shen
The code is very simple.

val data = sc.textFile("very/large/text/file") map { l =>
  // turn each line into dense vector
  Vectors.dense(...)
}

// the resulting data set is about 40k vectors

KMeans.train(data, k=5000, maxIterations=500)

I just killed my application. In the log I found this:

15/03/26 *11:42:43* INFO storage.BlockManagerMaster: Updated info of block
broadcast_26_piece0
15/03/26 *23:02:57* WARN server.TransportChannelHandler: Exception in
connection from
workernode0.xshe3539-hadoop-sydney.q10.internal.cloudapp.net/100.72.84.107:56277
java.io.IOException: An existing connection was forcibly closed by the
remote host

Notice the time gap. I think it means the worker nodes did not generate any
log at all for about 12 hrs... does it mean they were not working at all?

But when testing with a very small data set, my application works and outputs
the expected data.


Thanks,
David


On Fri, Mar 27, 2015 at 10:04 AM Burak Yavuz  wrote:

> Can you share the code snippet of how you call k-means? Do you cache the
> data before k-means? Did you repartition the data?
> On Mar 26, 2015 4:02 PM, "Xi Shen"  wrote:
>
>> OH, the job I talked about has ran more than 11 hrs without a result...it
>> doesn't make sense.
>>
>>
>> On Fri, Mar 27, 2015 at 9:48 AM Xi Shen  wrote:
>>
>>> Hi Burak,
>>>
>>> My iterations is set to 500. But I think it should also stop of the
>>> centroid coverages, right?
>>>
>>> My spark is 1.2.0, working in windows 64 bit. My data set is about 40k
>>> vectors, each vector has about 300 features, all normalised. All work node
>>> have sufficient memory and disk space.
>>>
>>> Thanks,
>>> David
>>>
>>> On Fri, 27 Mar 2015 02:48 Burak Yavuz  wrote:
>>>
>>>> Hi David,
>>>>
>>>> When the number of runs are large and the data is not properly
>>>> partitioned, it seems that K-Means is hanging according to my experience.
>>>> Especially setting the number of runs to something high drastically
>>>> increases the work in executors. If that's not the case, can you give more
>>>> info on what Spark version you are using, your setup, and your dataset?
>>>>
>>>> Thanks,
>>>> Burak
>>>> On Mar 26, 2015 5:10 AM, "Xi Shen"  wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> When I run k-means cluster with Spark, I got this in the last two
>>>>> lines in the log:
>>>>>
>>>>> 15/03/26 11:42:42 INFO spark.ContextCleaner: Cleaned broadcast 26
>>>>> 15/03/26 11:42:42 INFO spark.ContextCleaner: Cleaned shuffle 5
>>>>>
>>>>>
>>>>>
>>>>> Then it hangs for a long time. There's no active job. The driver
>>>>> machine is idle. I cannot access the work node, I am not sure if they are
>>>>> busy.
>>>>>
>>>>> I understand k-means may take a long time to finish. But why no active
>>>>> job? no log?
>>>>>
>>>>>
>>>>> Thanks,
>>>>> David
>>>>>
>>>>>


Re: Building spark 1.2 from source requires more dependencies

2015-03-26 Thread Xi Shen
It is brought in by another dependency, so you do not need to specify it
explicitly... I think this is what Ted means.

On Fri, Mar 27, 2015 at 9:48 AM Pala M Muthaia 
wrote:

> +spark-dev
>
> Yes, the dependencies are there. I guess my question is how come the build
> is succeeding in the mainline then, without adding these dependencies?
>
> On Thu, Mar 26, 2015 at 3:44 PM, Ted Yu  wrote:
>
>> Looking at output from dependency:tree, servlet-api is brought in by the
>> following:
>>
>> [INFO] +- org.apache.cassandra:cassandra-all:jar:1.2.6:compile
>> [INFO] |  +- org.antlr:antlr:jar:3.2:compile
>> [INFO] |  +- com.googlecode.json-simple:json-simple:jar:1.1:compile
>> [INFO] |  +- org.yaml:snakeyaml:jar:1.6:compile
>> [INFO] |  +- edu.stanford.ppl:snaptree:jar:0.1:compile
>> [INFO] |  +- org.mindrot:jbcrypt:jar:0.3m:compile
>> [INFO] |  +- org.apache.thrift:libthrift:jar:0.7.0:compile
>> [INFO] |  |  \- javax.servlet:servlet-api:jar:2.5:compile
>>
>> FYI
>>
>> On Thu, Mar 26, 2015 at 3:36 PM, Pala M Muthaia <
>> mchett...@rocketfuelinc.com> wrote:
>>
>>> Hi,
>>>
>>> We are trying to build spark 1.2 from source (tip of the branch-1.2 at
>>> the moment). I tried to build spark using the following command:
>>>
>>> mvn -U -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -Phive
>>> -Phive-thriftserver -DskipTests clean package
>>>
>>> I encountered various missing class definition exceptions (e.g: class
>>> javax.servlet.ServletException not found).
>>>
>>> I eventually got the build to succeed after adding the following set of
>>> dependencies to the spark-core's pom.xml:
>>>
>>> <dependency>
>>>   <groupId>javax.servlet</groupId>
>>>   <artifactId>servlet-api</artifactId>
>>>   <version>3.0</version>
>>> </dependency>
>>>
>>> <dependency>
>>>   <groupId>org.eclipse.jetty</groupId>
>>>   <artifactId>jetty-io</artifactId>
>>> </dependency>
>>>
>>> <dependency>
>>>   <groupId>org.eclipse.jetty</groupId>
>>>   <artifactId>jetty-http</artifactId>
>>> </dependency>
>>>
>>> <dependency>
>>>   <groupId>org.eclipse.jetty</groupId>
>>>   <artifactId>jetty-servlet</artifactId>
>>> </dependency>
>>>
>>> Pretty much all of the missing class definition errors came up while
>>> building HttpServer.scala, and went away after the above dependencies were
>>> included.
>>>
>>> My guess is official build for spark 1.2 is working already. My question
>>> is what is wrong with my environment or setup, that requires me to add
>>> dependencies to pom.xml in this manner, to get this build to succeed.
>>>
>>> Also, i am not sure if this build would work at runtime for us, i am
>>> still testing this out.
>>>
>>>
>>> Thanks,
>>> pala
>>>
>>
>>
>


Re: Why k-means cluster hang for a long time?

2015-03-26 Thread Xi Shen
How do I get the number of cores that I specified on the command line? I
want to use it for "spark.default.parallelism". I have 4 executors, each with
8 cores. According to
https://spark.apache.org/docs/1.2.0/configuration.html#execution-behavior,
the "spark.default.parallelism" value will be 4 * 8 = 32... I think it is
too large, or inappropriate. Please give some suggestions.

I have already used cache(), and count() to pre-cache (see the sketch below).

I can try a smaller k for testing, but eventually I will have to use k =
5000 or even larger, because I estimate our data set has that many clusters.
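
For reference, a minimal sketch of the pre-caching pattern I am describing
(the input path and line parsing are placeholders, not my actual code):

import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

val data = sc.textFile("hdfs:///path/to/input")   // placeholder path
  .map(line => Vectors.dense(line.split(',').map(_.toDouble)))
  .repartition(sc.defaultParallelism)             // spread records across all executor cores
  .cache()

data.count()   // force the cached data to materialize before training

val model = KMeans.train(data, 500, 500)          // k = 500, maxIterations = 500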


Thanks,
David


On Fri, Mar 27, 2015 at 10:40 AM Burak Yavuz  wrote:

> Hi David,
> The number of centroids (k=5000) seems too large and is probably the cause
> of the code taking too long.
>
> Can you please try the following:
> 1) Repartition data to the number of available cores with
> .repartition(numCores)
> 2) cache data
> 3) call .count() on data right before k-means
> 4) try k=500 (even less if possible)
>
> Thanks,
> Burak
>
> On Mar 26, 2015 4:15 PM, "Xi Shen"  wrote:
> >
> > The code is very simple.
> >
> > val data = sc.textFile("very/large/text/file") map { l =>
> >   // turn each line into dense vector
> >   Vectors.dense(...)
> > }
> >
> > // the resulting data set is about 40k vectors
> >
> > KMeans.train(data, k=5000, maxIterations=500)
> >
> > I just kill my application. In the log I found this:
> >
> > 15/03/26 11:42:43 INFO storage.BlockManagerMaster: Updated info of block
> broadcast_26_piece0
> > 15/03/26 23:02:57 WARN server.TransportChannelHandler: Exception in
> connection from
> workernode0.xshe3539-hadoop-sydney.q10.internal.cloudapp.net/100.72.84.107:56277
> > java.io.IOException: An existing connection was forcibly closed by the
> remote host
> >
> > Notice the time gap. I think it means the work node did not generate any
> log at all for about 12hrs...does it mean they are not working at all?
> >
> > But when testing with very small data set, my application works and
> output expected data.
> >
> >
> > Thanks,
> > David
> >
> >
> > On Fri, Mar 27, 2015 at 10:04 AM Burak Yavuz  wrote:
> >>
> >> Can you share the code snippet of how you call k-means? Do you cache
> the data before k-means? Did you repartition the data?
> >>
> >> On Mar 26, 2015 4:02 PM, "Xi Shen"  wrote:
> >>>
> >>> OH, the job I talked about has ran more than 11 hrs without a
> result...it doesn't make sense.
> >>>
> >>>
> >>> On Fri, Mar 27, 2015 at 9:48 AM Xi Shen  wrote:
> >>>>
> >>>> Hi Burak,
> >>>>
> >>>> My iterations is set to 500. But I think it should also stop of the
> centroid coverages, right?
> >>>>
> >>>> My spark is 1.2.0, working in windows 64 bit. My data set is about
> 40k vectors, each vector has about 300 features, all normalised. All work
> node have sufficient memory and disk space.
> >>>>
> >>>> Thanks,
> >>>> David
> >>>>
> >>>>
> >>>> On Fri, 27 Mar 2015 02:48 Burak Yavuz  wrote:
> >>>>>
> >>>>> Hi David,
> >>>>>
> >>>>> When the number of runs are large and the data is not properly
> partitioned, it seems that K-Means is hanging according to my experience.
> Especially setting the number of runs to something high drastically
> increases the work in executors. If that's not the case, can you give more
> info on what Spark version you are using, your setup, and your dataset?
> >>>>>
> >>>>> Thanks,
> >>>>> Burak
> >>>>>
> >>>>> On Mar 26, 2015 5:10 AM, "Xi Shen"  wrote:
> >>>>>>
> >>>>>> Hi,
> >>>>>>
> >>>>>> When I run k-means cluster with Spark, I got this in the last two
> lines in the log:
> >>>>>>
> >>>>>> 15/03/26 11:42:42 INFO spark.ContextCleaner: Cleaned broadcast 26
> >>>>>> 15/03/26 11:42:42 INFO spark.ContextCleaner: Cleaned shuffle 5
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>> Then it hangs for a long time. There's no active job. The driver
> machine is idle. I cannot access the work node, I am not sure if they are
> busy.
> >>>>>>
> >>>>>> I understand k-means may take a long time to finish. But why no
> active job? no log?
> >>>>>>
> >>>>>>
> >>>>>> Thanks,
> >>>>>> David
> >>>>>>
>


Re: Why k-means cluster hang for a long time?

2015-03-26 Thread Xi Shen
Hi Burak,

After I added .repartition(sc.defaultParallelism), I can see from the log
the partition number is set to 32. But in the Spark UI, it seems all the
data are loaded onto one executor. Previously they were loaded onto 4
executors.

Any idea?


Thanks,
David


On Fri, Mar 27, 2015 at 11:01 AM Xi Shen  wrote:

> How do I get the number of cores that I specified at the command line? I
> want to use "spark.default.parallelism". I have 4 executors, each has 8
> cores. According to
> https://spark.apache.org/docs/1.2.0/configuration.html#execution-behavior,
> the "spark.default.parallelism" value will be 4 * 8 = 32...I think it is
> too large, or inappropriate. Please give some suggestion.
>
> I have already used cache, and count to pre-cache.
>
> I can try with smaller k for testing, but eventually I will have to use k
> = 5000 or even large. Because I estimate our data set would have that much
> of clusters.
>
>
> Thanks,
> David
>
>
> On Fri, Mar 27, 2015 at 10:40 AM Burak Yavuz  wrote:
>
>> Hi David,
>> The number of centroids (k=5000) seems too large and is probably the
>> cause of the code taking too long.
>>
>> Can you please try the following:
>> 1) Repartition data to the number of available cores with
>> .repartition(numCores)
>> 2) cache data
>> 3) call .count() on data right before k-means
>> 4) try k=500 (even less if possible)
>>
>> Thanks,
>> Burak
>>
>> On Mar 26, 2015 4:15 PM, "Xi Shen"  wrote:
>> >
>> > The code is very simple.
>> >
>> > val data = sc.textFile("very/large/text/file") map { l =>
>> >   // turn each line into dense vector
>> >   Vectors.dense(...)
>> > }
>> >
>> > // the resulting data set is about 40k vectors
>> >
>> > KMeans.train(data, k=5000, maxIterations=500)
>> >
>> > I just kill my application. In the log I found this:
>> >
>> > 15/03/26 11:42:43 INFO storage.BlockManagerMaster: Updated info of
>> block broadcast_26_piece0
>> > 15/03/26 23:02:57 WARN server.TransportChannelHandler: Exception in
>> connection from workernode0.xshe3539-hadoop-sydney.q10.internal.cloudapp.
>> net/100.72.84.107:56277
>> > java.io.IOException: An existing connection was forcibly closed by the
>> remote host
>> >
>> > Notice the time gap. I think it means the work node did not generate
>> any log at all for about 12hrs...does it mean they are not working at all?
>> >
>> > But when testing with very small data set, my application works and
>> output expected data.
>> >
>> >
>> > Thanks,
>> > David
>> >
>> >
>> > On Fri, Mar 27, 2015 at 10:04 AM Burak Yavuz  wrote:
>> >>
>> >> Can you share the code snippet of how you call k-means? Do you cache
>> the data before k-means? Did you repartition the data?
>> >>
>> >> On Mar 26, 2015 4:02 PM, "Xi Shen"  wrote:
>> >>>
>> >>> OH, the job I talked about has ran more than 11 hrs without a
>> result...it doesn't make sense.
>> >>>
>> >>>
>> >>> On Fri, Mar 27, 2015 at 9:48 AM Xi Shen 
>> wrote:
>> >>>>
>> >>>> Hi Burak,
>> >>>>
>> >>>> My iterations is set to 500. But I think it should also stop of the
>> centroid coverages, right?
>> >>>>
>> >>>> My spark is 1.2.0, working in windows 64 bit. My data set is about
>> 40k vectors, each vector has about 300 features, all normalised. All work
>> node have sufficient memory and disk space.
>> >>>>
>> >>>> Thanks,
>> >>>> David
>> >>>>
>> >>>>
>> >>>> On Fri, 27 Mar 2015 02:48 Burak Yavuz  wrote:
>> >>>>>
>> >>>>> Hi David,
>> >>>>>
>> >>>>> When the number of runs are large and the data is not properly
>> partitioned, it seems that K-Means is hanging according to my experience.
>> Especially setting the number of runs to something high drastically
>> increases the work in executors. If that's not the case, can you give more
>> info on what Spark version you are using, your setup, and your dataset?
>> >>>>>
>> >>>>> Thanks,
>> >>>>> Burak
>> >>>>>
>> >>>>> On Mar 26, 2015 5:10 AM, "Xi Shen"  wrote:
>> >>>>>>
>> >>>>>> Hi,
>> >>>>>>
>> >>>>>> When I run k-means cluster with Spark, I got this in the last two
>> lines in the log:
>> >>>>>>
>> >>>>>> 15/03/26 11:42:42 INFO spark.ContextCleaner: Cleaned broadcast 26
>> >>>>>> 15/03/26 11:42:42 INFO spark.ContextCleaner: Cleaned shuffle 5
>> >>>>>>
>> >>>>>>
>> >>>>>>
>> >>>>>> Then it hangs for a long time. There's no active job. The driver
>> machine is idle. I cannot access the work node, I am not sure if they are
>> busy.
>> >>>>>>
>> >>>>>> I understand k-means may take a long time to finish. But why no
>> active job? no log?
>> >>>>>>
>> >>>>>>
>> >>>>>> Thanks,
>> >>>>>> David
>> >>>>>>
>>
>


k-means can only run on one executor with one thread?

2015-03-26 Thread Xi Shen
Hi,

I have a large data set, and I expect to get 5000 clusters.

I load the raw data and convert it into DenseVectors; then I repartition and
cache; finally I give the RDD[Vector] to KMeans.train().

Now the job is running, and the data is loaded. But according to the Spark UI,
all the data is loaded onto one executor. I checked that executor, and its CPU
workload is very low. I think it is using only 1 of the 8 cores. And all the
other 3 executors are idle.

Did I miss something? Is it possible to distribute the workload to all 4
executors?


Thanks,
David


SparkContext.wholeTextFiles throws not serializable exception

2015-03-26 Thread Xi Shen
Hi,

I want to load my data in this way:

sc.wholeTextFiles(opt.input) map { x => (x._1,
x._2.lines.filter(!_.isEmpty).toSeq) }


But I got

java.io.NotSerializableException: scala.collection.Iterator$$anon$13

But if I use "x._2.split('\n')", I can get the expected result. I want to
know what's wrong with using the "lines()" function.


Thanks,

Xi Shen
about.me/davidshen


Re: SparkContext.wholeTextFiles throws not serializable exception

2015-03-26 Thread Xi Shen
I have to use .lines.toArray.toSeq

A little tricky.
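
For context, a minimal sketch of the workaround (names follow my snippet
above; the explanation of Iterator.toSeq is my understanding, not something
from the docs): String.lines returns a scala.collection.Iterator, and in
Scala 2.10/2.11 Iterator.toSeq produces a lazy Stream that still references
that iterator, so it cannot be serialized inside an RDD record. Materializing
it first avoids the exception:

// throws java.io.NotSerializableException: scala.collection.Iterator$$anon$13
// sc.wholeTextFiles(opt.input) map { x => (x._1, x._2.lines.filter(!_.isEmpty).toSeq) }

// works: materialize the iterator into a serializable Seq first
sc.wholeTextFiles(opt.input) map { x =>
  (x._1, x._2.lines.toArray.toSeq.filter(!_.isEmpty))
}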




Xi Shen
about.me/davidshen

On Fri, Mar 27, 2015 at 4:41 PM, Xi Shen  wrote:

> Hi,
>
> I want to load my data in this way:
>
> sc.wholeTextFiles(opt.input) map { x => (x._1,
> x._2.lines.filter(!_.isEmpty).toSeq) }
>
>
> But I got
>
> java.io.NotSerializableException: scala.collection.Iterator$$anon$13
>
> But if I use "x._2.split('\n')", I can get the expected result. I want to
> know what's wrong with using the "lines()" function.
>
>
> Thanks,
>
> [image: --]
> Xi Shen
> [image: http://]about.me/davidshen
> <http://about.me/davidshen?promo=email_sig>
>   <http://about.me/davidshen>
>


Re: k-means can only run on one executor with one thread?

2015-03-27 Thread Xi Shen
Yes, I have done repartition.

I tried to repartition to the number of cores in my cluster. Not helping...
I tried to repartition to the number of centroids (k value). Not helping...


On Sat, Mar 28, 2015 at 7:27 AM Joseph Bradley 
wrote:

> Can you try specifying the number of partitions when you load the data to
> equal the number of executors?  If your ETL changes the number of
> partitions, you can also repartition before calling KMeans.
>
>
> On Thu, Mar 26, 2015 at 8:04 PM, Xi Shen  wrote:
>
>> Hi,
>>
>> I have a large data set, and I expects to get 5000 clusters.
>>
>> I load the raw data, convert them into DenseVector; then I did
>> repartition and cache; finally I give the RDD[Vector] to KMeans.train().
>>
>> Now the job is running, and data are loaded. But according to the Spark
>> UI, all data are loaded onto one executor. I checked that executor, and its
>> CPU workload is very low. I think it is using only 1 of the 8 cores. And
>> all other 3 executors are at rest.
>>
>> Did I miss something? Is it possible to distribute the workload to all 4
>> executors?
>>
>>
>> Thanks,
>> David
>>
>>
>


Re: k-means can only run on one executor with one thread?

2015-03-28 Thread Xi Shen
I have put more detail of my problem at
http://stackoverflow.com/questions/29295420/spark-kmeans-computation-cannot-be-distributed

It would be really appreciated if you could help me take a look at this
problem. I have tried various settings and ways to load/partition my data,
but I just cannot get rid of that long pause.


Thanks,
David





Xi Shen
about.me/davidshen

On Sat, Mar 28, 2015 at 2:38 PM, Xi Shen  wrote:

> Yes, I have done repartition.
>
> I tried to repartition to the number of cores in my cluster. Not helping...
> I tried to repartition to the number of centroids (k value). Not helping...
>
>
> On Sat, Mar 28, 2015 at 7:27 AM Joseph Bradley 
> wrote:
>
>> Can you try specifying the number of partitions when you load the data to
>> equal the number of executors?  If your ETL changes the number of
>> partitions, you can also repartition before calling KMeans.
>>
>>
>> On Thu, Mar 26, 2015 at 8:04 PM, Xi Shen  wrote:
>>
>>> Hi,
>>>
>>> I have a large data set, and I expects to get 5000 clusters.
>>>
>>> I load the raw data, convert them into DenseVector; then I did
>>> repartition and cache; finally I give the RDD[Vector] to KMeans.train().
>>>
>>> Now the job is running, and data are loaded. But according to the Spark
>>> UI, all data are loaded onto one executor. I checked that executor, and its
>>> CPU workload is very low. I think it is using only 1 of the 8 cores. And
>>> all other 3 executors are at rest.
>>>
>>> Did I miss something? Is it possible to distribute the workload to all 4
>>> executors?
>>>
>>>
>>> Thanks,
>>> David
>>>
>>>
>>


Re: k-means can only run on one executor with one thread?

2015-03-28 Thread Xi Shen
My vector dimension is about 360. The data count is about 270k. My driver
has 2.9 GB of memory. I attached a screenshot of the current executor status.
I submitted this job with "--master yarn-cluster". I have a total of 7 worker
nodes; one of them acts as the driver. In the screenshot, you can see all
worker nodes have loaded some data, but the driver has not loaded any data.

But the funny thing is, when I logged on to the driver and checked its CPU &
memory status, I saw one java process using about 18% of the CPU and about
1.6 GB of memory.

[attached screenshot: executor status]
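
As a rough back-of-the-envelope check (a sketch using the numbers from this
thread, assuming 8-byte doubles):

// the k-means model is roughly k * d doubles
val k = 5000
val d = 360
val modelBytes   = k.toLong * d * 8   // ~14.4 MB, easily fits in the 2.9 GB driver
val datasetBytes = 270000L * d * 8    // ~780 MB of cached vectors across the executors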

On Sat, Mar 28, 2015 at 7:06 PM Reza Zadeh  wrote:

> How many dimensions does your data have? The size of the k-means model is
> k * d, where d is the dimension of the data.
>
> Since you're using k=1000, if your data has dimension higher than say,
> 10,000, you will have trouble, because k*d doubles have to fit in the
> driver.
>
> Reza
>
> On Sat, Mar 28, 2015 at 12:27 AM, Xi Shen  wrote:
>
>> I have put more detail of my problem at http://stackoverflow.com/
>> questions/29295420/spark-kmeans-computation-cannot-be-distributed
>>
>> It is really appreciate if you can help me take a look at this problem. I
>> have tried various settings and ways to load/partition my data, but I just
>> cannot get rid that long pause.
>>
>>
>> Thanks,
>> David
>>
>>
>>
>>
>>
>> [image: --]
>> Xi Shen
>> [image: http://]about.me/davidshen
>> <http://about.me/davidshen?promo=email_sig>
>>   <http://about.me/davidshen>
>>
>> On Sat, Mar 28, 2015 at 2:38 PM, Xi Shen  wrote:
>>
>>> Yes, I have done repartition.
>>>
>>> I tried to repartition to the number of cores in my cluster. Not
>>> helping...
>>> I tried to repartition to the number of centroids (k value). Not
>>> helping...
>>>
>>>
>>> On Sat, Mar 28, 2015 at 7:27 AM Joseph Bradley 
>>> wrote:
>>>
>>>> Can you try specifying the number of partitions when you load the data
>>>> to equal the number of executors?  If your ETL changes the number of
>>>> partitions, you can also repartition before calling KMeans.
>>>>
>>>>
>>>> On Thu, Mar 26, 2015 at 8:04 PM, Xi Shen  wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> I have a large data set, and I expects to get 5000 clusters.
>>>>>
>>>>> I load the raw data, convert them into DenseVector; then I did
>>>>> repartition and cache; finally I give the RDD[Vector] to KMeans.train().
>>>>>
>>>>> Now the job is running, and data are loaded. But according to the
>>>>> Spark UI, all data are loaded onto one executor. I checked that executor,
>>>>> and its CPU workload is very low. I think it is using only 1 of the 8
>>>>> cores. And all other 3 executors are at rest.
>>>>>
>>>>> Did I miss something? Is it possible to distribute the workload to all
>>>>> 4 executors?
>>>>>
>>>>>
>>>>> Thanks,
>>>>> David
>>>>>
>>>>>
>>>>
>>
>


Re: Why KMeans with mllib is so slow ?

2015-03-29 Thread Xi Shen
Hi Burak,

Unfortunately, I am expected to do my work in the HDInsight environment,
which only supports Spark 1.2.0 with Microsoft's flavor. I cannot simply
replace it with Spark 1.3.

I think the problem I am observing is caused by the kmeans|| initialization
step. I will open another thread to discuss it.


Thanks,
David





Xi Shen
about.me/davidshen

On Sun, Mar 29, 2015 at 4:34 PM, Burak Yavuz  wrote:

> Hi David,
>
> Can you also try with Spark 1.3 if possible? I believe there was a 2x
> improvement on K-Means between 1.2 and 1.3.
>
> Thanks,
> Burak
>
>
>
> On Sat, Mar 28, 2015 at 9:04 PM, davidshen84 
> wrote:
>
>> Hi Jao,
>>
>> Sorry to pop up this old thread. I am have the same problem like you did.
>> I
>> want to know if you have figured out how to improve k-means on Spark.
>>
>> I am using Spark 1.2.0. My data set is about 270k vectors, each has about
>> 350 dimensions. If I set k=500, the job takes about 3hrs on my cluster.
>> The
>> cluster has 7 executors, each has 8 cores...
>>
>> If I set k=5000 which is the required value for my task, the job goes on
>> forever...
>>
>>
>> Thanks,
>> David
>>
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/Why-KMeans-with-mllib-is-so-slow-tp20480p22273.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>>
>


kmeans|| in Spark is not real paralleled?

2015-03-29 Thread Xi Shen
Hi,

I have opened a couple of threads asking about a k-means performance problem
in Spark. I think I have made a little progress.

Previously I used the simplest form, KMeans.train(rdd, k, maxIterations). It
uses the "kmeans||" initialization algorithm, which is supposed to be a
faster version of kmeans++ and give better results in general.

But I observed that if the k is very large, the initialization step takes a
long time. From the CPU utilization chart, it looks like only one thread is
working. Please see
https://stackoverflow.com/questions/29326433/cpu-gap-when-doing-k-means-with-spark
.

I read the paper, http://theory.stanford.edu/~sergei/papers/vldb12-kmpar.pdf,
and it points out that the kmeans++ initialization algorithm will suffer if k
is large. That's why the paper contributed the kmeans|| algorithm.


If I invoke KMeans.train using the random initialization algorithm, I do not
observe this problem, even with very large k, like k=5000. This makes me
suspect that the kmeans|| in Spark is not properly implemented and does not
use a parallel implementation.
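
For reference, a minimal sketch of how I switch between the two
initialization modes (the k, maxIterations and runs values are just
examples):

import org.apache.spark.mllib.clustering.KMeans

// kmeans|| (the default) -- this is where I see the long single-threaded pause
val modelParallel = KMeans.train(data, 5000, 500, 1, KMeans.K_MEANS_PARALLEL)

// random initialization -- does not show the problem in my tests
val modelRandom = KMeans.train(data, 5000, 500, 1, KMeans.RANDOM)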


I have also tested my code and data set with Spark 1.3.0, and I still
observe this problem. I quickly checked the PR regarding the KMeans
algorithm change from 1.2.0 to 1.3.0. It seems to be only code improvements
and polish, not a change or improvement to the algorithm.


I originally worked in a Windows 64-bit environment, and I also tested in a
Linux 64-bit environment. I can provide the code and data set if anyone
wants to reproduce this problem.


I hope a Spark developer could comment on this problem and help identify
whether it is a bug.


Thanks,

Xi Shen
about.me/davidshen


Re: Why k-means cluster hang for a long time?

2015-03-30 Thread Xi Shen
For the same amount of data, if I set k=500, the job finishes in about
3 hrs. I wonder whether, with k=5000, the job could finish in 30 hrs... the
longest I have waited was 12 hrs...

If I use random initialization with the same amount of data and k=5000, the
job finishes in less than 2 hrs.

I think the current kmeans|| implementation cannot handle large vector
dimensions properly. In my case, my vectors have about 350 dimensions. I
found another post complaining about k-means performance in Spark, and that
person has vectors of 200 dimensions.

It is possible the large-dimension case was never tested.


Thanks,
David




On Tue, Mar 31, 2015 at 4:00 AM Xiangrui Meng  wrote:

> Hi Xi,
>
> Please create a JIRA if it takes longer to locate the issue. Did you
> try a smaller k?
>
> Best,
> Xiangrui
>
> On Thu, Mar 26, 2015 at 5:45 PM, Xi Shen  wrote:
> > Hi Burak,
> >
> > After I added .repartition(sc.defaultParallelism), I can see from the
> log
> > the partition number is set to 32. But in the Spark UI, it seems all the
> > data are loaded onto one executor. Previously they were loaded onto 4
> > executors.
> >
> > Any idea?
> >
> >
> > Thanks,
> > David
> >
> >
> > On Fri, Mar 27, 2015 at 11:01 AM Xi Shen  wrote:
> >>
> >> How do I get the number of cores that I specified at the command line? I
> >> want to use "spark.default.parallelism". I have 4 executors, each has 8
> >> cores. According to
> >> https://spark.apache.org/docs/1.2.0/configuration.html#
> execution-behavior,
> >> the "spark.default.parallelism" value will be 4 * 8 = 32...I think it
> is too
> >> large, or inappropriate. Please give some suggestion.
> >>
> >> I have already used cache, and count to pre-cache.
> >>
> >> I can try with smaller k for testing, but eventually I will have to use
> k
> >> = 5000 or even large. Because I estimate our data set would have that
> much
> >> of clusters.
> >>
> >>
> >> Thanks,
> >> David
> >>
> >>
> >> On Fri, Mar 27, 2015 at 10:40 AM Burak Yavuz  wrote:
> >>>
> >>> Hi David,
> >>> The number of centroids (k=5000) seems too large and is probably the
> >>> cause of the code taking too long.
> >>>
> >>> Can you please try the following:
> >>> 1) Repartition data to the number of available cores with
> >>> .repartition(numCores)
> >>> 2) cache data
> >>> 3) call .count() on data right before k-means
> >>> 4) try k=500 (even less if possible)
> >>>
> >>> Thanks,
> >>> Burak
> >>>
> >>> On Mar 26, 2015 4:15 PM, "Xi Shen"  wrote:
> >>> >
> >>> > The code is very simple.
> >>> >
> >>> > val data = sc.textFile("very/large/text/file") map { l =>
> >>> >   // turn each line into dense vector
> >>> >   Vectors.dense(...)
> >>> > }
> >>> >
> >>> > // the resulting data set is about 40k vectors
> >>> >
> >>> > KMeans.train(data, k=5000, maxIterations=500)
> >>> >
> >>> > I just kill my application. In the log I found this:
> >>> >
> >>> > 15/03/26 11:42:43 INFO storage.BlockManagerMaster: Updated info of
> >>> > block broadcast_26_piece0
> >>> > 15/03/26 23:02:57 WARN server.TransportChannelHandler: Exception in
> >>> > connection from
> >>> > workernode0.xshe3539-hadoop-sydney.q10.internal.cloudapp.
> net/100.72.84.107:56277
> >>> > java.io.IOException: An existing connection was forcibly closed by
> the
> >>> > remote host
> >>> >
> >>> > Notice the time gap. I think it means the work node did not generate
> >>> > any log at all for about 12hrs...does it mean they are not working
> at all?
> >>> >
> >>> > But when testing with very small data set, my application works and
> >>> > output expected data.
> >>> >
> >>> >
> >>> > Thanks,
> >>> > David
> >>> >
> >>> >
> >>> > On Fri, Mar 27, 2015 at 10:04 AM Burak Yavuz 
> wrote:
> >>> >>
> >>> >> Can you share the code snippet of how you call k-means? Do you cache
> >>> >> the data before k-means? Did you repartition the data?
> >>> >>
> >>> >> On Mar 26, 

Re: kmeans|| in Spark is not real paralleled?

2015-04-03 Thread Xi Shen
Hi Xiangrui,

I have created JIRA https://issues.apache.org/jira/browse/SPARK-6706, and
attached the sample code. But I could not attach the test data. I will
update the bug once I find a place to host the test data.


Thanks,
David


On Tue, Mar 31, 2015 at 8:18 AM Xiangrui Meng  wrote:

> This PR updated the k-means|| initialization:
> https://github.com/apache/spark/commit/ca7910d6dd7693be2a675a0d6a6fcc9eb0aaeb5d,
> which was included in 1.3.0. It should fix kmean|| initialization with
> large k. Please create a JIRA for this issue and send me the code and the
> dataset to produce this problem. Thanks! -Xiangrui
>
> On Sun, Mar 29, 2015 at 1:20 AM, Xi Shen  wrote:
>
>> Hi,
>>
>> I have opened a couple of threads asking about k-means performance
>> problem in Spark. I think I made a little progress.
>>
>> Previous I use the simplest way of KMeans.train(rdd, k, maxIterations).
>> It uses the "kmeans||" initialization algorithm which supposedly to be a
>> faster version of kmeans++ and give better results in general.
>>
>> But I observed that if the k is very large, the initialization step takes
>> a long time. From the CPU utilization chart, it looks like only one thread
>> is working. Please see
>> https://stackoverflow.com/questions/29326433/cpu-gap-when-doing-k-means-with-spark
>> .
>>
>> I read the paper,
>> http://theory.stanford.edu/~sergei/papers/vldb12-kmpar.pdf, and it
>> points out kmeans++ initialization algorithm will suffer if k is large.
>> That's why the paper contributed the kmeans|| algorithm.
>>
>>
>> If I invoke KMeans.train by using the random initialization algorithm, I
>> do not observe this problem, even with very large k, like k=5000. This
>> makes me suspect that the kmeans|| in Spark is not properly implemented and
>> do not utilize parallel implementation.
>>
>>
>> I have also tested my code and data set with Spark 1.3.0, and I still
>> observe this problem. I quickly checked the PR regarding the KMeans
>> algorithm change from 1.2.0 to 1.3.0. It seems to be only code improvement
>> and polish, not changing/improving the algorithm.
>>
>>
>> I originally worked on Windows 64bit environment, and I also tested on
>> Linux 64bit environment. I could provide the code and data set if anyone
>> want to reproduce this problem.
>>
>>
>> I hope a Spark developer could comment on this problem and help
>> identifying if it is a bug.
>>
>>
>> Thanks,
>>
>> [image: --]
>> Xi Shen
>> [image: http://]about.me/davidshen
>> <http://about.me/davidshen?promo=email_sig>
>>   <http://about.me/davidshen>
>>
>
>


IOUtils cannot write anything in Spark?

2015-04-22 Thread Xi Shen
Hi,

I have an RDD of some processed data. I want to write these files to HDFS,
but not for future M/R processing. I want to write plain, old-style text
files. I tried:

rdd foreach {d =>
  val file = // create the file using a HDFS FileSystem
  val lines = d map {
// format data into string
  }

  IOUtils.writeLines(lines, System.lineSeparator(), file)
}

Note, I was using the IOUtils from commons-io, not from the Hadoop package.

The result is that all the files are created in HDFS, but they have no data at all...
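
For reference, a minimal sketch of what I am trying to do, rewritten against
the Hadoop FileSystem API (the output path and record fields below are
placeholders, not my real code). The main difference from my snippet above is
that the output stream is explicitly closed, which my snippet does not do:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

rdd foreach { d =>
  // the FileSystem handle is not serializable, so obtain it on the executor
  val fs  = FileSystem.get(new Configuration())
  val out = fs.create(new Path(s"/output/part-${d.hashCode}.txt"))  // placeholder naming
  try {
    d foreach { record => out.writeBytes(record.toString + System.lineSeparator()) }
  } finally {
    out.close()  // the data only shows up in HDFS once the stream is flushed and closed
  }
}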



Xi Shen
about.me/davidshen


Spark job concurrency problem

2015-05-04 Thread Xi Shen
Hi,

I have two small RDDs, each with about 600 records. In my code, I did

val rdd1 = sc...cache()
val rdd2 = sc...cache()

val result = rdd1.cartesian(rdd2).repartition(num_cpu).map { case (a, b) =>
  some_expensive_job(a,b)
}

I ran my job on the YARN cluster with "--master yarn-cluster". I have 6
executors, each with a large amount of memory.

However, I noticed my job is very slow. I went to the RM page, and found
there are two containers: one is the driver, one is the worker. I guess
this is correct?

I went to the worker's log and monitored the details. My app prints some
information, so I can use it to estimate the progress of the "map"
operation. Looking at the log, it feels like the tasks are done one by one
sequentially, rather than in batches of #cpu tasks at a time.

I checked the worker nodes, and their CPUs are all busy.
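
As a quick sanity check (a sketch reusing the names from my snippet above),
I can confirm how the cartesian product is actually split before running the
expensive map:

val pairs = rdd1.cartesian(rdd2).repartition(num_cpu).cache()
println(s"partitions = ${pairs.partitions.length}")  // expected to equal num_cpu
println(s"pairs      = ${pairs.count()}")            // roughly 600 * 600 = 360,000 pairs

val result = pairs.map { case (a, b) => some_expensive_job(a, b) }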



Xi Shen
about.me/davidshen