Re: A Spark Compilation Question

2014-09-28 Thread Yi Tian
I think you should modify the module settings in IDEA instead of pom.xml
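(For example — illustrative, not from this thread — you could mark
target/scala-${scala.binary.version}/src_managed/main/compiled_avro as a
generated-sources root in the module settings so IDEA sees the Avro-generated
classes without touching the pom.)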


Best Regards,

Yi Tian
tianyi.asiai...@gmail.com




On Sep 26, 2014, at 18:09, Yanbo Liang  wrote:

> Hi Hansu,
> 
> I have encountered the same problem. Maven compiles the Avro IDL and generates
> the corresponding Java files into a directory that is not a source directory
> of the project.
> 
> I modified the pom.xml file and it now works. You can make the same changes in
> your spark-*.*.*/external/flume-sink/pom.xml:
> 
> <plugin>
>   <groupId>org.apache.avro</groupId>
>   <artifactId>avro-maven-plugin</artifactId>
>   <version>${avro.version}</version>
>   <configuration>
>     <!-- The stock pom generates into
>          ${project.basedir}/target/scala-${scala.binary.version}/src_managed/main/compiled_avro;
>          generating into src/main/java instead puts the code where IDEA already looks. -->
>     <outputDirectory>${project.basedir}/src/main/java</outputDirectory>
>   </configuration>
>   <executions>
>     <execution>
>       <phase>generate-sources</phase>
>       <goals>
>         <goal>idl-protocol</goal>
>       </goals>
>     </execution>
>   </executions>
> </plugin>
> <plugin>
>   <groupId>org.codehaus.mojo</groupId>
>   <artifactId>build-helper-maven-plugin</artifactId>
>   <version>1.9.1</version>
>   <executions>
>     <execution>
>       <id>add-source</id>
>       <phase>generate-sources</phase>
>       <goals>
>         <goal>add-source</goal>
>       </goals>
>       <configuration>
>         <sources>
>           <source>${project.basedir}/src/main/java</source>
>         </sources>
>       </configuration>
>     </execution>
>   </executions>
> </plugin>
> 
> 
> 
> 2014-09-13 2:45 GMT+08:00 Hansu GU :
> 
>> I downloaded the source and imported it into IntelliJ 13.1 as a Maven
>> project.
>> 
>> When I used IntelliJ Build -> make Project, I encountered:
>> 
>> Error:(44, 66) not found: type SparkFlumeProtocol
>>   val transactionTimeout: Int, val backOffInterval: Int) extends
>>   SparkFlumeProtocol with Logging {
>> 
>> I think there are some avro generated files missing but I am not sure.
>> Could anyone help me understand this in order to successfully compile
>> the source?
>> 
>> Thanks,
>> Hansu
>> 
>> -
>> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>> For additional commands, e-mail: dev-h...@spark.apache.org
>> 
>> 


-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



[MLlib] LogisticRegressionWithSGD and LogisticRegressionWithLBFGS converge with different weights.

2014-09-28 Thread Yanbo Liang
Hi

We have used LogisticRegression with two different optimization methods, SGD
and LBFGS, in MLlib.
With the same dataset and the same training/test split, we get different
weight vectors.

For example, we use
spark-1.1.0/data/mllib/sample_binary_classification_data.txt
as our training and test dataset,
with LogisticRegressionWithSGD and LogisticRegressionWithLBFGS as the training
methods and all other parameters the same.

The precision of both methods is nearly 100% and the AUCs are also near 1.0.
As far as I know, a convex optimization problem should converge to the global
minimum. (We use SGD with a mini-batch fraction of 1.0.)
Yet we get two different weight vectors. Is this expected, and does it make sense?
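
A minimal sketch of how the comparison might be reproduced. The class names and
data path come from the report above; the split, seed, iteration count, and step
size are illustrative assumptions, not the values actually used:

import org.apache.spark.SparkContext
import org.apache.spark.mllib.classification.{LogisticRegressionWithLBFGS, LogisticRegressionWithSGD}
import org.apache.spark.mllib.util.MLUtils

val sc = new SparkContext("local[*]", "CompareLR")
// Load the bundled LIBSVM sample data (path relative to the Spark distribution).
val data = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_binary_classification_data.txt")
val Array(training, test) = data.randomSplit(Array(0.8, 0.2), seed = 42L)
training.cache()

// Mini-batch SGD with a batch fraction of 1.0, as described above.
val sgdModel = LogisticRegressionWithSGD.train(training, 100, 1.0, 1.0)
// L-BFGS on the same training split.
val lbfgsModel = new LogisticRegressionWithLBFGS().run(training)

// The weight vectors can differ even when both models classify the test split well.
println(s"SGD weights:    ${sgdModel.weights}")
println(s"L-BFGS weights: ${lbfgsModel.weights}")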


How to use multi thread in RDD map function ?

2014-09-28 Thread myasuka
Hi, everyone
I have come across a problem with increasing the concurrency. In a
program, after the shuffle write, each node should fetch 16 matrix pairs and
multiply them, for example:

import breeze.linalg.{DenseMatrix => BDM}

pairs.map(t => {
  val b1 = t._2._1.asInstanceOf[BDM[Double]]
  val b2 = t._2._2.asInstanceOf[BDM[Double]]

  val c = (b1 * b2).asInstanceOf[BDM[Double]]

  (new BlockID(t._1.row, t._1.column), c)
})
 
Each node has 16 cores. However, no matter whether I set 16 tasks or more on
each node, CPU utilization cannot get higher than 60%, which means not every
core on the node is computing. When I check the running log on the Web UI,
according to the amount of shuffle read and write in each task, I see that some
tasks do one matrix multiplication, some do two, while some do none.

Thus, I am thinking of using Java multi-threading to increase the concurrency.
I wrote a Scala program that uses Java threads without Spark on a single node;
watching the 'top' monitor, I found that this program can drive CPU usage up to
1500% (meaning nearly every core is computing). But I have no idea how to use
Java multi-threading inside an RDD transformation.

Can anyone provide some example code that uses Java multi-threading in an RDD
transformation, or suggest another way to increase the concurrency?
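
As one rough sketch of that idea (not from this thread; the pool size and
everything not taken from the snippet above — pairs, BlockID, the pair layout —
are illustrative assumptions), the multiplications of each partition could be
run concurrently inside mapPartitions:

import java.util.concurrent.Executors
import scala.concurrent.duration.Duration
import scala.concurrent.{Await, ExecutionContext, Future}
import breeze.linalg.{DenseMatrix => BDM}

val result = pairs.mapPartitions { iter =>
  // One local pool per task, sized to the cores a single task should use.
  val pool = Executors.newFixedThreadPool(16)
  implicit val ec = ExecutionContext.fromExecutorService(pool)
  val futures = iter.map { t =>
    Future {
      val b1 = t._2._1.asInstanceOf[BDM[Double]]
      val b2 = t._2._2.asInstanceOf[BDM[Double]]
      (new BlockID(t._1.row, t._1.column), b1 * b2)
    }
  }.toList // materialize so every future is submitted before we start waiting
  val out = futures.map(f => Await.result(f, Duration.Inf))
  pool.shutdown()
  out.iterator
}

Whether this helps depends on how the cluster is configured; scheduling more,
smaller tasks (as the reply below suggests) is usually the simpler way to keep
all cores busy.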

Thanks for all




--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/How-to-use-multi-thread-in-RDD-map-function-tp8583.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: How to use multi thread in RDD map function ?

2014-09-28 Thread Yi Tian
For yarn-client mode:

SPARK_EXECUTOR_CORES * SPARK_EXECUTOR_INSTANCES = 2 (or 3) * TotalCoresOnYourCluster

For standalone mode:

SPARK_WORKER_INSTANCES * SPARK_WORKER_CORES = 2 (or 3) * TotalCoresOnYourCluster
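
As an illustration (hypothetical numbers, not from this thread): on a 4-node
cluster with 16 cores per node (64 cores total), SPARK_EXECUTOR_INSTANCES=8 with
SPARK_EXECUTOR_CORES=16 gives 8 * 16 = 128 = 2 * 64 task slots in yarn-client
mode; in standalone mode, SPARK_WORKER_INSTANCES=2 with SPARK_WORKER_CORES=16
gives 4 * 2 * 16 = 128.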



Best Regards,

Yi Tian
tianyi.asiai...@gmail.com




On Sep 28, 2014, at 17:59, myasuka  wrote:

> Hi, everyone
> I have come across a problem with increasing the concurrency. In a
> program, after the shuffle write, each node should fetch 16 matrix pairs and
> multiply them, for example:
> 
> import breeze.linalg.{DenseMatrix => BDM}
> 
> pairs.map(t => {
>   val b1 = t._2._1.asInstanceOf[BDM[Double]]
>   val b2 = t._2._2.asInstanceOf[BDM[Double]]
> 
>   val c = (b1 * b2).asInstanceOf[BDM[Double]]
> 
>   (new BlockID(t._1.row, t._1.column), c)
> })
> 
> Each node has 16 cores. However, no matter whether I set 16 tasks or more on
> each node, CPU utilization cannot get higher than 60%, which means not every
> core on the node is computing. When I check the running log on the Web UI,
> according to the amount of shuffle read and write in each task, I see that some
> tasks do one matrix multiplication, some do two, while some do none.
> 
> Thus, I am thinking of using Java multi-threading to increase the concurrency.
> I wrote a Scala program that uses Java threads without Spark on a single node;
> watching the 'top' monitor, I found that this program can drive CPU usage up to
> 1500% (meaning nearly every core is computing). But I have no idea how to use
> Java multi-threading inside an RDD transformation.
> 
> Can anyone provide some example code that uses Java multi-threading in an RDD
> transformation, or suggest another way to increase the concurrency?
> 
> Thanks for all
> 
> 
> 
> 
> --
> View this message in context: 
> http://apache-spark-developers-list.1001551.n3.nabble.com/How-to-use-multi-thread-in-RDD-map-function-tp8583.html
> Sent from the Apache Spark Developers List mailing list archive at Nabble.com.
> 
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
> 



Re: Workflow Scheduler for Spark

2014-09-28 Thread Egor Pahomov
I created a JIRA ticket and a design doc on this matter.

2014-09-17 22:28 GMT+04:00 Reynold Xin :

> There might've been some misunderstanding. I was referring to the MLlib
> pipeline design doc when I said the design doc was posted, in response to
> the first paragraph of your original email.
>
>
> On Wed, Sep 17, 2014 at 2:47 AM, Egor Pahomov 
> wrote:
>
> > It's a doc about the MLlib pipeline functionality. What about an Oozie-like
> > workflow?
> >
> > 2014-09-17 13:08 GMT+04:00 Mark Hamstra :
> >
> > > See https://issues.apache.org/jira/browse/SPARK-3530 and this doc,
> > > referenced in that JIRA:
> > >
> > >
> > >
> >
> https://docs.google.com/document/d/1rVwXRjWKfIb-7PI6b86ipytwbUH7irSNLF1_6dLmh8o/edit?usp=sharing
> > >
> > > On Wed, Sep 17, 2014 at 2:00 AM, Egor Pahomov 
> > > wrote:
> > >
> > >> I have problems using Oozie. For example, it doesn't sustain a Spark
> > >> context the way the Ooyala job server does. Apart from GUI interfaces like
> > >> Hue it's hard to work with: scoozie stopped development a year ago (I spoke
> > >> with the creator) and Oozie XML is very hard to write.
> > >> Oozie still has all its documentation and code in the MapReduce model
> > >> rather than the YARN model, and based on its current speed of development
> > >> I can't expect radical changes in the near future. There is no "Databricks"
> > >> for Oozie that would have people on salary to develop that kind of radical
> > >> change. It's a dinosaur.
> > >>
> > >> Reynold, can you help me find this doc? Do you mean just pipelining Spark
> > >> code, or additional logic for persisting tasks, a job server, task retry,
> > >> data availability, and so on?
> > >>
> > >>
> > >> 2014-09-17 11:21 GMT+04:00 Reynold Xin :
> > >>
> > >> > Hi Egor,
> > >> >
> > >> > I think the design doc for the pipeline feature has been posted.
> > >> >
> > >> > For the workflow, I believe Oozie actually works fine with Spark if you
> > >> > want some external workflow system. Do you have any trouble using that?
> > >> >
> > >> >
> > >> > On Tue, Sep 16, 2014 at 11:45 PM, Egor Pahomov <
> > pahomov.e...@gmail.com>
> > >> > wrote:
> > >> >
> > >> >> There are two things we (Yandex) miss in Spark: good MLlib abstractions
> > >> >> and a good workflow job scheduler. From the threads "Adding abstraction
> > >> >> in MlLib" and "[mllib] State of Multi-Model training" I got the idea
> > >> >> that Databricks is working on the former and we should wait until the
> > >> >> first posted doc, which would guide us.
> > >> >> What about a workflow scheduler? Is anyone already working on it? Does
> > >> >> anyone have a plan to do it?
> > >> >>
> > >> >> P.S. We thought that MLlib abstractions for running multiple algorithms
> > >> >> on the same data would need such a scheduler, which would rerun an
> > >> >> algorithm in case of failure. I understand that Spark provides fault
> > >> >> tolerance out of the box, but we found an "Oozie-like" scheduler more
> > >> >> reliable for such long-living workflows.
> > >> >>
> > >> >> --
> > >> >>
> > >> >>
> > >> >>
> > >> >> Sincerely yours
> > >> >> Egor Pakhomov
> > >> >> Scala Developer, Yandex
> > >> >>
> > >> >
> > >> >
> > >>
> > >>
> > >> --
> > >>
> > >>
> > >>
> > >> Sincerely yours
> > >> Egor Pakhomov
> > >> Scala Developer, Yandex
> > >>
> > >
> > >
> >
> >
> > --
> >
> >
> >
> > Sincerely yours
> > Egor Pakhomov
> > Scala Developer, Yandex
> >
>



-- 



Sincerely yours
Egor Pakhomov
Scala Developer, Yandex


Re: SparkSQL: map type MatchError when inserting into Hive table

2014-09-28 Thread Du Li
It turned out to be a bug in my code. In the select clause, the list of fields
was misaligned with the schema of the target table, so the map data could not
be cast to the type expected at that position in the schema.
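
(As an illustration only, not the actual schema from this thread: if the target
table is defined with columns (k STRING, m MAP<STRING,STRING>) but the insert
reads SELECT m, k FROM src, the planner tries to cast the map column to STRING
and fails with the MatchError quoted below.)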

Thanks anyway.


On 9/26/14, 8:08 PM, "Cheng Lian"  wrote:

>Would you mind providing the DDL of this partitioned table together
>with the query you tried? The stacktrace suggests that the query was
>trying to cast a map into something else, which is not supported in
>Spark SQL. And I doubt whether Hive supports casting a complex type to
>some other type.
>
>On 9/27/14 7:48 AM, Du Li wrote:
>> Hi,
>>
>> I was loading data into a partitioned table on the Spark 1.1.0
>> beeline/thriftserver. The table has complex data types such as nested map
>> and array columns. The query is like "insert overwrite table a partition (…)
>> select …", and the select clause worked if run separately. However, when
>> running the insert query, there was an error as follows.
>>
>> The source code of Cast.scala seems to handle only the primitive data
>> types, which is perhaps why the MatchError was thrown.
>>
>> I just wonder if this is still work in progress, or whether I should do it
>> differently.
>>
>> Thanks,
>> Du
>>
>>
>> scala.MatchError: MapType(StringType,StringType,true) (of class
>> org.apache.spark.sql.catalyst.types.MapType)
>>
>>  org.apache.spark.sql.catalyst.expressions.Cast.cast$lzycompute(Cast.scala:247)
>>  org.apache.spark.sql.catalyst.expressions.Cast.cast(Cast.scala:247)
>>  org.apache.spark.sql.catalyst.expressions.Cast.eval(Cast.scala:263)
>>  org.apache.spark.sql.catalyst.expressions.Alias.eval(namedExpressions.scala:84)
>>  org.apache.spark.sql.catalyst.expressions.InterpretedMutableProjection.apply(Projection.scala:66)
>>  org.apache.spark.sql.catalyst.expressions.InterpretedMutableProjection.apply(Projection.scala:50)
>>  scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>>  scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>>  org.apache.spark.sql.hive.execution.InsertIntoHiveTable.org$apache$spark$sql$hive$execution$InsertIntoHiveTable$$writeToFile$1(InsertIntoHiveTable.scala:149)
>>  org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$saveAsHiveFile$1.apply(InsertIntoHiveTable.scala:158)
>>  org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$saveAsHiveFile$1.apply(InsertIntoHiveTable.scala:158)
>>  org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
>>  org.apache.spark.scheduler.Task.run(Task.scala:54)
>>  org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
>>  java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>>  java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>>  java.lang.Thread.run(Thread.java:722)
>>
>>
>>
>>
>>
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>



view not supported in spark thrift server?

2014-09-28 Thread Du Li

Can anybody confirm whether or not views are currently supported in Spark? I
found “create view translate” in the blacklist of HiveCompatibilitySuite.scala,
and the following scenario also throws a NullPointerException on
beeline/thriftserver (1.1.0). Any plan to support it soon?


> create table src(k string, v string);

> load data local inpath 
> '/home/y/share/yspark/examples/src/main/resources/kv1.txt' into table src;

> create view kv as select k, v from src;

> select * from kv;

Error: java.lang.NullPointerException (state=,code=0)


Re: view not supported in spark thrift server?

2014-09-28 Thread Michael Armbrust
Views are not supported yet. It's not currently on the near-term roadmap,
but that can change if there is sufficient demand or someone in the
community is interested in implementing them. I do not think it would be
very hard.

Michael

On Sun, Sep 28, 2014 at 11:59 AM, Du Li  wrote:

>
>  Can anybody confirm whether or not view is currently supported in spark?
> I found “create view translate” in the blacklist of
> HiveCompatibilitySuite.scala and also the following scenario threw
> NullPointerException on beeline/thriftserver (1.1.0). Any plan to support
> it soon?
>
>  > create table src(k string, v string);
>
> > load data local inpath
> '/home/y/share/yspark/examples/src/main/resources/kv1.txt' into table src;
>
> > create view kv as select k, v from src;
>
> > select * from kv;
>
> Error: java.lang.NullPointerException (state=,code=0)
>


Re: view not supported in spark thrift server?

2014-09-28 Thread Du Li
Thanks, Michael, for your quick response.

Views are critical for my project, which is migrating from Shark to Spark SQL.
I have implemented and tested everything else. It would be perfect if views
could be implemented soon.

Du


From: Michael Armbrust <mich...@databricks.com>
Date: Sunday, September 28, 2014 at 12:13 PM
To: Du Li <l...@yahoo-inc.com.invalid>
Cc: "dev@spark.apache.org" <dev@spark.apache.org>, "u...@spark.apache.org" <u...@spark.apache.org>
Subject: Re: view not supported in spark thrift server?

Views are not supported yet.  Its not currently on the near term roadmap, but 
that can change if there is sufficient demand or someone in the community is 
interested in implementing them.  I do not think it would be very hard.

Michael

On Sun, Sep 28, 2014 at 11:59 AM, Du Li <l...@yahoo-inc.com.invalid> wrote:

Can anybody confirm whether or not view is currently supported in spark? I 
found “create view translate” in the blacklist of HiveCompatibilitySuite.scala 
and also the following scenario threw NullPointerException on 
beeline/thriftserver (1.1.0). Any plan to support it soon?


> create table src(k string, v string);

> load data local inpath 
> '/home/y/share/yspark/examples/src/main/resources/kv1.txt' into table src;

> create view kv as select k, v from src;

> select * from kv;

Error: java.lang.NullPointerException (state=,code=0)



Spark meetup on Oct 15 in NYC

2014-09-28 Thread Reynold Xin
Hi Spark users and developers,

Some of the most active Spark developers (including Matei Zaharia, Michael
Armbrust, Joseph Bradley, TD, Paco Nathan, and me) will be in NYC for
Strata NYC. We are working with the Spark NYC meetup group and Bloomberg to
host a meetup event. This might be the event with the highest committer-to-user
ratio in the history of Spark user meetups. We look forward to meeting more
users in NYC.

You can sign up for that here:
http://www.meetup.com/Spark-NYC/events/209271842/

Cheers.