Re: Debugging Spark itself in standalone cluster mode

2016-07-01 Thread cbruegg
Thanks for the guidance! Setting the --driver-java-options in spark-shell
instead of SPARK_MASTER_OPTS made the debugger connect to the right JVM. My
breakpoints get hit now.
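
For reference, the working combination can be sketched roughly as below. The hostname, master URL, and port are illustrative placeholders, not the exact values from this cluster:

```shell
# Attach the remote debugger to the DRIVER JVM (where DAGScheduler and
# TaskSchedulerImpl actually run), not to the master/worker daemons.
spark-shell \
  --master spark://master-host:7077 \
  --driver-java-options \
  "-Xdebug -Xrunjdwp:server=y,transport=dt_socket,address=4000,suspend=n"
```

SPARK_MASTER_OPTS only affects the standalone master daemon, which is why breakpoints in the scheduler classes never fired before.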

nirandap [via Apache Spark Developers List] <
ml-node+s1001551n18145...@n3.nabble.com> wrote on Fri, 1 Jul 2016 at 04:39:

> Guys,
>
> Don't the TaskScheduler and DAGScheduler reside in the SparkContext? So
> the debug configs need to be set in the JVM where the SparkContext is
> running? [1]
>
> But yes, I agree: if you really need to inspect the execution itself, you
> need to set those configs in the executors as well [2]
>
> [1]
> https://jaceklaskowski.gitbooks.io/mastering-apache-spark/content/spark-sparkcontext.html
> [2]
> http://spark.apache.org/docs/latest/configuration.html#runtime-environment
>
>
> On Fri, Jul 1, 2016 at 12:30 AM, rxin [via Apache Spark Developers List] 
> <[hidden
> email] > wrote:
>
>> Yes, scheduling is centralized in the driver.
>>
>> For debugging, I think you'd want to set the executor JVM flags, not the
>> worker JVM flags.
>>
>> On Thu, Jun 30, 2016 at 11:36 AM, cbruegg <[hidden email]
>> > wrote:
>>
>>> Hello everyone,
>>>
>>> I'm a student assistant in research at the University of Paderborn,
>>> working on integrating Spark (v1.6.2) with a new network resource
>>> management system. I have already taken a deep dive into the source
>>> code of spark-core w.r.t. its scheduling systems.
>>>
>>> We are running a cluster in standalone mode consisting of a master node
>>> and three slave nodes. Am I right to assume that tasks are scheduled
>>> within the TaskSchedulerImpl using the DAGScheduler in this mode? I need
>>> to find a place where the execution plan (and each stage) for a job is
>>> computed and can be analyzed, so I placed some breakpoints in these two
>>> classes.
>>>
>>> The remote debugging session within IntelliJ IDEA has been established by
>>> running the following commands on the master node before:
>>>
>>>   export SPARK_WORKER_OPTS="-Xdebug
>>> -Xrunjdwp:server=y,transport=dt_socket,address=4000,suspend=n"
>>>   export SPARK_MASTER_OPTS="-Xdebug
>>> -Xrunjdwp:server=y,transport=dt_socket,address=4000,suspend=n"
>>>
>>> Port 4000 has been forwarded to my local machine. Unfortunately, none
>>> of my breakpoints in these classes get hit when I invoke a task like
>>> sc.parallelize(1 to 1000).count() in spark-shell on the master node
>>> (using --master spark://...), though when I pause all threads I can see
>>> that the process I am debugging runs some kind of event queue, which
>>> means that the debugger is connected to /something/.
>>>
>>> Do I rely on false assumptions, or should these breakpoints in fact get
>>> hit? I am not too familiar with Spark, so please bear with me if I got
>>> something wrong. Many thanks in advance for your help.
>>>
>>> Best regards,
>>> Christian Brüggemann
>>>
>
>
>
> --
> Niranda
> @n1r44 
> +94-71-554-8430
> https://pythagoreanscript.wordpress.com/
>
>

Re: MinMaxScaler With features include category variables

2016-07-01 Thread Yanbo Liang
You can combine the columns that need to be normalized into one vector with
VectorAssembler and apply the normalization to it. Assemble the columns that
should not be normalized into a second vector. Finally, assemble the two
vectors into a single vector as the feature column and feed it into model
training.
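
A plain-Python sketch of that column-split idea (deliberately free of Spark APIs, just to make the data flow concrete): scale only the continuous columns, pass the categorical columns through, and reassemble each row.

```python
# Sketch: min-max scale only the columns listed in scale_idx; leave the
# remaining (e.g. categorical) columns untouched, then rebuild the rows.
def min_max_scale_columns(rows, scale_idx):
    """Scale the columns at scale_idx to [0, 1]; pass the rest through."""
    cols = list(zip(*rows))  # column-major view of the data
    scaled = []
    for i, col in enumerate(cols):
        if i in scale_idx:
            lo, hi = min(col), max(col)
            rng = (hi - lo) or 1.0  # avoid division by zero on constant columns
            scaled.append([(v - lo) / rng for v in col])
        else:
            scaled.append(list(col))
    return [list(row) for row in zip(*scaled)]

rows = [[10.0, 1, 200.0],
        [20.0, 0, 400.0],
        [30.0, 1, 300.0]]
# Scale columns 0 and 2; column 1 is categorical and passes through.
print(min_max_scale_columns(rows, {0, 2}))
# -> [[0.0, 1, 0.0], [0.5, 0, 1.0], [1.0, 1, 0.5]]
```

In Spark itself the same split is done with two VectorAssembler stages plus MinMaxScaler, as described above.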

Thanks
Yanbo

2016-06-25 21:16 GMT-07:00 段石石 :

> Hi all:
>
>
> I use MinMaxScaler for data normalization, but I found that the API only
> works on Vector columns, so we must vectorize the features first. However,
> the features usually consist of two parts: one that needs normalization,
> and another that should not be normalized, such as categorical features. I
> want to add an API on DataFrame that normalizes only the columns we want
> to normalize. We could then assemble the result into a vector and send it
> to the ML model API for training. I think that would be very useful for
> machine learning developers.
>
>
>
> Best Regards
>
> Thanks
>


Code Style Formatting

2016-07-01 Thread Anton Okolnychyi
Hi, all.

I've read the Spark code style guide.
I am wondering if there is an easy way to configure the code formatting in
IntelliJ IDEA to match the existing code base style.
IntelliJ IDEA highlights all failed checks from scalastyle-config.xml.
However, I did not find any predefined configuration that I can import into
IntelliJ IDEA to adjust how it does the formatting.
Is it possible to avoid the manual configuration?

Best regards,
Anton Okolnychyi


Deploying ML Pipeline Model

2016-07-01 Thread Rishabh Bhardwaj
Hi All,

I am looking for ways to deploy an ML Pipeline model in production.
Spark has already proved to be one of the best frameworks for model
training and creation, but once the ML Pipeline model is ready, how can I
deploy it outside a Spark context?
MLlib models have a toPMML method, but today a Pipeline model cannot be
saved to PMML. There are frameworks like MLeap that try to abstract the
Pipeline model and provide ML Pipeline model deployment outside a Spark
context, but currently they don't cover most of the ML transformers and
estimators.
I am looking for related work going on in this area.
Any pointers will be helpful.
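
One lightweight pattern sometimes used here, sketched below under the assumption that the learned parameters can be exported (e.g. a linear model's coefficients and intercept): serialize only the parameters and reimplement the small scoring function in the serving environment, with no Spark dependency. The parameter names and values are illustrative, not a real Spark export format.

```python
import json
import math

# Hypothetical parameters exported from a trained logistic-regression
# stage of a pipeline (illustrative values, not a real Spark export).
model_json = json.dumps({"coefficients": [0.8, -1.2], "intercept": 0.1})

def score(model_json, features):
    """Score one feature vector with exported logistic-regression weights."""
    m = json.loads(model_json)
    z = m["intercept"] + sum(w * x for w, x in zip(m["coefficients"], features))
    return 1.0 / (1.0 + math.exp(-z))  # sigmoid

p = score(model_json, [1.0, 0.5])
print(round(p, 4))  # -> 0.5744
```

This obviously does not cover feature transformers, which is exactly the gap MLeap and similar projects are trying to fill.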

Thanks,
Rishabh.


Jenkins networking / port contention

2016-07-01 Thread Cody Koeninger
Can someone familiar with amplab's jenkins setup clarify whether all tests
running at a given time are competing for network ports, or whether there's
some sort of containerization being done?

Based on the use of Utils.startServiceOnPort in the tests, I'd assume the
former.


Jetty 9.3 CVE to be avoided...

2016-07-01 Thread Stephen Hellberg
To anyone contemplating an upgrade of the Jetty component in use with Apache
Spark, please be aware of CVE-2016-4800, and ensure that you only integrate
a version of the Jetty 9.3 stream that is *9.3.9* /or later/.

Hopefully forewarned is forearmed; no need to expose vulnerabilities
unnecessarily!  ;-)




-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: Jenkins networking / port contention

2016-07-01 Thread shane knapp
i assume you're talking about zinc ports?

the tests are designed to run one at a time on randomized ports -- no
containerization.  we're on bare metal.

the test launch code executes this for each build:
# Generate random port for Zinc
export ZINC_PORT
ZINC_PORT=$(python -S -c "import random; print random.randrange(3030,4030)")
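
An alternative to picking from a fixed random range (a sketch of a common technique, not what the build scripts actually do) is to ask the OS for a currently free port by binding to port 0:

```python
import socket

def pick_free_port():
    """Ask the OS for a currently free TCP port by binding to port 0."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("127.0.0.1", 0))
        return s.getsockname()[1]

print(pick_free_port())  # an ephemeral port chosen by the OS
```

Caveat: the port is released before it is used, so there is still a small race window in which another process can grab it.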

On Fri, Jul 1, 2016 at 6:02 AM, Cody Koeninger  wrote:
> Can someone familiar with amplab's jenkins setup clarify whether all tests
> running at a given time are competing for network ports, or whether there's
> some sort of containerization being done?
>
> Based on the use of Utils.startServiceOnPort in the tests, I'd assume the
> former.




Re: Jenkins networking / port contention

2016-07-01 Thread Cody Koeninger
Thanks for the response.  I'm talking about test code that starts up
embedded network services for integration testing.

KafkaTestUtils in particular always attempts to start a kafka broker
on the standard port, 9092.  Utils.startServiceOnPort is intended to
pick a higher port if the starting one has a bind collision... but in
my local testing multiple KafkaTestUtils instances running at the same
time on the same machine don't actually behave correctly.

I already updated the kafka 0.10 consumer tests to use a random port,
and can do the same for the 0.8 consumer tests, but wanted to make
sure I understood what was happening in the Jenkins environment.
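
The retry-on-bind-collision strategy described above can be sketched roughly like this (a simplified Python illustration under the assumption of linear probing, not the actual Utils.startServiceOnPort code):

```python
import socket

def start_on_port(start_port, max_retries=16):
    """Try start_port, then successive ports, until one binds.
    A rough sketch of the retry-on-bind-collision idea."""
    for offset in range(max_retries):
        port = start_port + offset
        s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        try:
            s.bind(("127.0.0.1", port))
            return s, port  # caller owns the bound socket
        except OSError:
            s.close()  # port taken: probe the next one
    raise OSError("no free port in %d tries" % max_retries)

# Occupy a port, then show the fallback kicking in.
blocker = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
blocker.bind(("127.0.0.1", 0))
taken = blocker.getsockname()[1]
sock, port = start_on_port(taken)
print(port > taken)  # True: the collision forced a higher port
sock.close()
blocker.close()
```

The failure mode described in the thread is that two test suites start probing from the same fixed port (9092) at the same moment, so both can race past the bind check; starting from a randomized port sidesteps that.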

On Fri, Jul 1, 2016 at 11:18 AM, shane knapp  wrote:
> i assume you're talking about zinc ports?
>
> the tests are designed to run one at a time on randomized ports -- no
> containerization.  we're on bare metal.
>
> the test launch code executes this for each build:
> # Generate random point for Zinc
> export ZINC_PORT
> ZINC_PORT=$(python -S -c "import random; print random.randrange(3030,4030)")
>
> On Fri, Jul 1, 2016 at 6:02 AM, Cody Koeninger  wrote:
>> Can someone familiar with amplab's jenkins setup clarify whether all tests
>> running at a given time are competing for network ports, or whether there's
>> some sort of containerization being done?
>>
>> Based on the use of Utils.startServiceOnPort in the tests, I'd assume the
>> former.




Re: Jenkins networking / port contention

2016-07-01 Thread shane knapp
gotcha...  adding @joshrosen directly who might be of more assistance...  :)

On Fri, Jul 1, 2016 at 9:38 AM, Cody Koeninger  wrote:
> Thanks for the response.  I'm talking about test code that starts up
> embedded network services for integration testing.
>
> KafkaTestUtils in particular always attempts to start a kafka broker
> on the standard port, 9092.  Util.startServiceInPort is intended to
> pick a higher port if the starting one has a bind collision... but in
> my local testing multiple KafkaTestUtils instances running at the same
> time on the same machine don't actually behave correctly.
>
> I already updated the kafka 0.10 consumer tests to use a random port,
> and can do the same for the 0.8 consumer tests, but wanted to make
> sure I understood what was happening in the Jenkins environment.
>
> On Fri, Jul 1, 2016 at 11:18 AM, shane knapp  wrote:
>> i assume you're talking about zinc ports?
>>
>> the tests are designed to run one at a time on randomized ports -- no
>> containerization.  we're on bare metal.
>>
>> the test launch code executes this for each build:
>> # Generate random point for Zinc
>> export ZINC_PORT
>> ZINC_PORT=$(python -S -c "import random; print random.randrange(3030,4030)")
>>
>> On Fri, Jul 1, 2016 at 6:02 AM, Cody Koeninger  wrote:
>>> Can someone familiar with amplab's jenkins setup clarify whether all tests
>>> running at a given time are competing for network ports, or whether there's
>>> some sort of containerization being done?
>>>
>>> Based on the use of Utils.startServiceOnPort in the tests, I'd assume the
>>> former.




Re: Jenkins networking / port contention

2016-07-01 Thread Reynold Xin
Multiple instances of test runs are usually running in parallel, so they
would need to bind to different ports.

On Friday, July 1, 2016, Cody Koeninger  wrote:

> Thanks for the response.  I'm talking about test code that starts up
> embedded network services for integration testing.
>
> KafkaTestUtils in particular always attempts to start a kafka broker
> on the standard port, 9092.  Util.startServiceInPort is intended to
> pick a higher port if the starting one has a bind collision... but in
> my local testing multiple KafkaTestUtils instances running at the same
> time on the same machine don't actually behave correctly.
>
> I already updated the kafka 0.10 consumer tests to use a random port,
> and can do the same for the 0.8 consumer tests, but wanted to make
> sure I understood what was happening in the Jenkins environment.
>
> On Fri, Jul 1, 2016 at 11:18 AM, shane knapp  > wrote:
> > i assume you're talking about zinc ports?
> >
> > the tests are designed to run one at a time on randomized ports -- no
> > containerization.  we're on bare metal.
> >
> > the test launch code executes this for each build:
> > # Generate random point for Zinc
> > export ZINC_PORT
> > ZINC_PORT=$(python -S -c "import random; print
> random.randrange(3030,4030)")
> >
> > On Fri, Jul 1, 2016 at 6:02 AM, Cody Koeninger  > wrote:
> >> Can someone familiar with amplab's jenkins setup clarify whether all
> tests
> >> running at a given time are competing for network ports, or whether
> there's
> >> some sort of containerization being done?
> >>
> >> Based on the use of Utils.startServiceOnPort in the tests, I'd assume
> the
> >> former.
>
>
>


Re: Jenkins networking / port contention

2016-07-01 Thread Cody Koeninger
Makes sense.  I'll submit a fix for kafka 0.8 and scan through the
other tests to see if I can find similar issues.

On Fri, Jul 1, 2016 at 11:45 AM, Reynold Xin  wrote:
> Multiple instances of test runs are usually running in parallel, so they
> would need to bind to different ports.
>
>
> On Friday, July 1, 2016, Cody Koeninger  wrote:
>>
>> Thanks for the response.  I'm talking about test code that starts up
>> embedded network services for integration testing.
>>
>> KafkaTestUtils in particular always attempts to start a kafka broker
>> on the standard port, 9092.  Util.startServiceInPort is intended to
>> pick a higher port if the starting one has a bind collision... but in
>> my local testing multiple KafkaTestUtils instances running at the same
>> time on the same machine don't actually behave correctly.
>>
>> I already updated the kafka 0.10 consumer tests to use a random port,
>> and can do the same for the 0.8 consumer tests, but wanted to make
>> sure I understood what was happening in the Jenkins environment.
>>
>> On Fri, Jul 1, 2016 at 11:18 AM, shane knapp  wrote:
>> > i assume you're talking about zinc ports?
>> >
>> > the tests are designed to run one at a time on randomized ports -- no
>> > containerization.  we're on bare metal.
>> >
>> > the test launch code executes this for each build:
>> > # Generate random point for Zinc
>> > export ZINC_PORT
>> > ZINC_PORT=$(python -S -c "import random; print
>> > random.randrange(3030,4030)")
>> >
>> > On Fri, Jul 1, 2016 at 6:02 AM, Cody Koeninger 
>> > wrote:
>> >> Can someone familiar with amplab's jenkins setup clarify whether all
>> >> tests
>> >> running at a given time are competing for network ports, or whether
>> >> there's
>> >> some sort of containerization being done?
>> >>
>> >> Based on the use of Utils.startServiceOnPort in the tests, I'd assume
>> >> the
>> >> former.
>>
>>
>




[build system] quick jenkins restart

2016-07-01 Thread shane knapp
i put jenkins in quiet mode as i noticed we have almost no builds
queued.  one of our students needed rust installed on the workers, and
i need to update the PATH on all of the workers.

we should be back up and building within 30 minutes.

thanks!

shane




Re: [build system] quick jenkins restart

2016-07-01 Thread shane knapp
aand we're back.

On Fri, Jul 1, 2016 at 10:10 AM, shane knapp  wrote:
> i put jenkins in quiet mode as i noticed we have almost no builds
> queued.  one of our students needed rust installed on the workers, and
> i need to update the PATH on all of the workers.
>
> we should be back up and building within 30 minutes.
>
> thanks!
>
> shane




Re: Code Style Formatting

2016-07-01 Thread Reynold Xin
There isn't one pre-made, but the defaults work out OK. The main things
you'd need to update are the spacing for function argument indentation
and the import ordering.

On Fri, Jul 1, 2016 at 4:11 AM, Anton Okolnychyi  wrote:

> Hi, all.
>
> I've read the Spark code style guide.
> I am wondering if there is an easy way to configure the code formatting in
> IntelliJ IDEA to match the existing code base style.
> IntelliJ IDEA highlights all failed checks from scalastyle-config.xml.
> However, I did not find any predefined configuration that I can import
> into IntelliJ IDEA to adjust how it does the formatting.
> Is it possible to avoid the manual configuration?
>
> Best regards,
> Anton Okolnychyi
>


branch-2.0 is now 2.0.1-SNAPSHOT?

2016-07-01 Thread Koert Kuipers
is that correct?
where do i get the latest 2.0.0-SNAPSHOT?
thanks,
koert