Getting the execution times of spark job

2014-09-02 Thread Niranda Perera
Hi,

I have been playing around with Spark for a couple of days. I am
using spark-1.0.1-bin-hadoop1 and the Java API. The main idea of the
implementation is to run Hive queries on Spark. I used JavaHiveContext to
achieve this (as per the examples).

I have 2 questions.
1. I am wondering how I could get the execution times of a Spark job. Does
Spark provide monitoring facilities in the form of an API?

2. I used a layman's approach to get the execution time, wrapping the
JavaHiveContext.hql call with System.nanoTime() as follows:

long start, end;
JavaHiveContext hiveCtx;
JavaSchemaRDD hiveResult;

start = System.nanoTime();
hiveResult = hiveCtx.hql(query);
end = System.nanoTime();
// elapsed wall-clock time in milliseconds (end - start, not start - end)
System.out.println((end - start) / 1000000L + " ms");

But the result I got is drastically different from the execution times
recorded in SparkUI. Can you please explain this disparity?

Look forward to hearing from you.

rgds

-- 
*Niranda Perera*
Software Engineer, WSO2 Inc.
Mobile: +94-71-554-8430
Twitter: @n1r44 


Re: Getting the execution times of spark job

2014-09-02 Thread Zongheng Yang
For your second question: hql() (as well as sql()) does not launch a
Spark job immediately; instead, it first runs the Spark SQL
parser/optimizer/planner pipeline, and a Spark job is only started
after a physical execution plan has been selected. Therefore, your
hand-rolled end-to-end measurement includes the time spent in the
Spark SQL code path, while the times reported in the UI are the
execution times of the Spark job(s) only.
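
As for your first question: the SparkListener developer API is the usual
programmatic hook. You can register a listener on the SparkContext and record
job start/end times yourself. Here is a rough, untested sketch (assuming Spark
1.0.x; the JobTimingListener class and its millisecond bookkeeping are my own
illustration, only the listener trait and its event fields come from Spark):

import org.apache.spark.scheduler.{SparkListener, SparkListenerJobEnd, SparkListenerJobStart}
import scala.collection.mutable

class JobTimingListener extends SparkListener {
  // jobId -> wall-clock start time; the 1.0.x events carry no timestamps,
  // so we record them ourselves
  private val starts = mutable.Map.empty[Int, Long]

  override def onJobStart(jobStart: SparkListenerJobStart) {
    starts(jobStart.jobId) = System.currentTimeMillis()
  }

  override def onJobEnd(jobEnd: SparkListenerJobEnd) {
    starts.remove(jobEnd.jobId).foreach { t0 =>
      println("Job " + jobEnd.jobId + " took " + (System.currentTimeMillis() - t0) + " ms")
    }
  }
}

// sc is the underlying SparkContext (javaSparkContext.sc() when using the Java API)
sc.addSparkListener(new JobTimingListener)

Numbers collected this way should line up much more closely with the UI,
because only the submitted job(s) are measured and the parse/analyze/plan time
before submission is excluded.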

On Mon, Sep 1, 2014 at 11:45 PM, Niranda Perera  wrote:
> Hi,
>
> I have been playing around with spark for a couple of days. I am
> using spark-1.0.1-bin-hadoop1 and the Java API. The main idea of the
> implementation is to run Hive queries on Spark. I used JavaHiveContext to
> achieve this (As per the examples).
>
> I have 2 questions.
> 1. I am wondering how I could get the execution times of a spark job? Does
> Spark provide monitoring facilities in the form of an API?
>
> 2. I used a laymen way to get the execution times by enclosing a
> JavaHiveContext.hql method with System.nanoTime() as follows
>
> long start, end;
> JavaHiveContext hiveCtx;
> JavaSchemaRDD hiveResult;
>
> start = System.nanoTime();
> hiveResult = hiveCtx.hql(query);
> end = System.nanoTime();
> System.out.println(start-end);
>
> But the result I got is drastically different from the execution times
> recorded in SparkUI. Can you please explain this disparity?
>
> Look forward to hearing from you.
>
> rgds
>
> --
> *Niranda Perera*
> Software Engineer, WSO2 Inc.
> Mobile: +94-71-554-8430
> Twitter: @n1r44 




Re: Spark SQL Query and join different data sources.

2014-09-02 Thread Yin Huai
Actually, with HiveContext, you can join Hive tables with registered
temporary tables.
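
A rough sketch of what this looks like in the shell (Spark 1.0.x/1.1 API; the
case class, the sample data, and the Hive table/column names below are made up
for illustration):

val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
import hiveContext._   // brings in the createSchemaRDD implicit conversion

case class Visit(userId: Int, page: String)
val visits = sc.parallelize(Seq(Visit(1, "/home"), Visit(2, "/docs")))
visits.registerAsTable("visits_tmp")   // temporary table, visible to this context only

// join a Hive table (a made-up "default.users") against the temporary table
hiveContext.hql(
  "SELECT u.name, v.page FROM default.users u JOIN visits_tmp v ON u.id = v.userId"
).take(10)

The join works because both relations are resolved through the same
HiveContext catalog; the temporary table never touches the metastore.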


On Fri, Aug 22, 2014 at 9:07 PM, chutium  wrote:

> oops, thanks Yan, you are right, i got
>
> scala> sqlContext.sql("select * from a join b").take(10)
> java.lang.RuntimeException: Table Not Found: b
> at scala.sys.package$.error(package.scala:27)
> at
>
> org.apache.spark.sql.catalyst.analysis.SimpleCatalog$$anonfun$1.apply(Catalog.scala:90)
> at
>
> org.apache.spark.sql.catalyst.analysis.SimpleCatalog$$anonfun$1.apply(Catalog.scala:90)
> at scala.Option.getOrElse(Option.scala:120)
> at
>
> org.apache.spark.sql.catalyst.analysis.SimpleCatalog.lookupRelation(Catalog.scala:90)
>
> and with hql
>
> scala> hiveContext.hql("select * from a join b").take(10)
> warning: there were 1 deprecation warning(s); re-run with -deprecation for
> details
> 14/08/22 14:48:45 INFO parse.ParseDriver: Parsing command: select * from a
> join b
> 14/08/22 14:48:45 INFO parse.ParseDriver: Parse Completed
> 14/08/22 14:48:45 ERROR metadata.Hive:
> NoSuchObjectException(message:default.a table not found)
> at
>
> org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$get_table_result$get_table_resultStandardScheme.read(ThriftHiveMetastore.java:27129)
> at
>
> org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$get_table_result$get_table_resultStandardScheme.read(ThriftHiveMetastore.java:27097)
> at
>
> org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$get_table_result.read(ThriftHiveMetastore.java:27028)
> at
> org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:78)
> at
>
> org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.recv_get_table(ThriftHiveMetastore.java:936)
> at
>
> org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.get_table(ThriftHiveMetastore.java:922)
> at
>
> org.apache.hadoop.hive.metastore.HiveMetaStoreClient.getTable(HiveMetaStoreClient.java:854)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at
>
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> at
>
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:601)
> at
>
> org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.invoke(RetryingMetaStoreClient.java:89)
> at com.sun.proxy.$Proxy17.getTable(Unknown Source)
> at org.apache.hadoop.hive.ql.metadata.Hive.getTable(Hive.java:950)
> at org.apache.hadoop.hive.ql.metadata.Hive.getTable(Hive.java:924)
> at
>
> org.apache.spark.sql.hive.HiveMetastoreCatalog.lookupRelation(HiveMetastoreCatalog.scala:59)
>
>
> so sqlContext is looking up table from
> org.apache.spark.sql.catalyst.analysis.SimpleCatalog, Catalog.scala
> hiveContext looking up from org.apache.spark.sql.hive.HiveMetastoreCatalog,
> HiveMetastoreCatalog.scala
>
> maybe we can do something in sqlContext to register a hive table as
> Spark-SQL-Table, need to read column info, partition info, location, SerDe,
> Input/OutputFormat and maybe StorageHandler also, from the hive
> metastore...
>
>
>
>
> --
> View this message in context:
> http://apache-spark-developers-list.1001551.n3.nabble.com/Spark-SQL-Query-and-join-different-data-sources-tp7914p7955.html
> Sent from the Apache Spark Developers List mailing list archive at
> Nabble.com.
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>
>


about spark assembly jar

2014-09-02 Thread scwf

hi, all
  I suggest that Spark not use the assembly jar as the default run-time
dependency (spark-submit/spark-class depend on the assembly jar); using a
library of all the third-party dependency jars, as hadoop/hive/hbase do, seems
more reasonable, for two reasons:

  1. The assembly jar packages all third-party jars into one big jar, so we need to rebuild this
jar whenever we want to update the version of some component (such as Hadoop).
  2. In our practice with Spark we sometimes run into jar compatibility issues, and it
is hard to diagnose a compatibility issue with an assembly jar.










Re: about spark assembly jar

2014-09-02 Thread Sean Owen
Hm, are you suggesting that the Spark distribution be a bag of 100
JARs? It doesn't quite seem reasonable. It does not remove version
conflicts, just pushes them to run-time, which isn't good. The
assembly is also necessary because that's where shading happens. In
development, you want to run against exactly what will be used in a
real Spark distro.

On Tue, Sep 2, 2014 at 9:39 AM, scwf  wrote:
> hi, all
>   I suggest spark not use assembly jar as default run-time
> dependency(spark-submit/spark-class depend on assembly jar),use a library of
> all 3rd dependency jar like hadoop/hive/hbase more reasonable.
>
>   1 assembly jar packaged all 3rd jars into a big one, so we need rebuild
> this jar if we want to update the version of some component(such as hadoop)
>   2 in our practice with spark, sometimes we meet jar compatibility issue,
> it is hard to diagnose compatibility issue with assembly jar
>
>
>
>
>
>
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>




Re: about spark assembly jar

2014-09-02 Thread scwf

Yes. I am not sure exactly what happens when the assembly jar is built; my
understanding is that it just packages all the dependency jars into one big one.

On 2014/9/2 16:45, Sean Owen wrote:

Hm, are you suggesting that the Spark distribution be a bag of 100
JARs? It doesn't quite seem reasonable. It does not remove version
conflicts, just pushes them to run-time, which isn't good. The
assembly is also necessary because that's where shading happens. In
development, you want to run against exactly what will be used in a
real Spark distro.

On Tue, Sep 2, 2014 at 9:39 AM, scwf  wrote:

hi, all
   I suggest spark not use assembly jar as default run-time
dependency(spark-submit/spark-class depend on assembly jar),use a library of
all 3rd dependency jar like hadoop/hive/hbase more reasonable.

   1 assembly jar packaged all 3rd jars into a big one, so we need rebuild
this jar if we want to update the version of some component(such as hadoop)
   2 in our practice with spark, sometimes we meet jar compatibility issue,
it is hard to diagnose compatibility issue with assembly jar







-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org











Re: about spark assembly jar

2014-09-02 Thread Ye Xianjin
Sorry, the quick reply didn't cc the dev list.

Sean, sometimes I have to use the spark-shell to confirm some behavior change.
In that case, I have to reassemble the whole project. Is there another way
around this, without using the big jar, in development? For the original question, I
have no comments.

-- 
Ye Xianjin
Sent with Sparrow (http://www.sparrowmailapp.com/?sig)


On Tuesday, September 2, 2014 at 4:58 PM, Sean Owen wrote:

> No, usually you unit-test your changes during development. That
> doesn't require the assembly. Eventually you may wish to test some
> change against the complete assembly.
> 
> But that's a different question; I thought you were suggesting that
> the assembly JAR should never be created.
> 
> On Tue, Sep 2, 2014 at 9:53 AM, Ye Xianjin  (mailto:advance...@gmail.com)> wrote:
> > Hi, Sean:
> > In development, do I really need to reassembly the whole project even if I
> > only change a line or two code in one component?
> > I used to that but found time-consuming.
> > 
> > --
> > Ye Xianjin
> > Sent with Sparrow
> > 
> > On Tuesday, September 2, 2014 at 4:45 PM, Sean Owen wrote:
> > 
> > Hm, are you suggesting that the Spark distribution be a bag of 100
> > JARs? It doesn't quite seem reasonable. It does not remove version
> > conflicts, just pushes them to run-time, which isn't good. The
> > assembly is also necessary because that's where shading happens. In
> > development, you want to run against exactly what will be used in a
> > real Spark distro.
> > 
> > On Tue, Sep 2, 2014 at 9:39 AM, scwf  > (mailto:wangf...@huawei.com)> wrote:
> > 
> > hi, all
> > I suggest spark not use assembly jar as default run-time
> > dependency(spark-submit/spark-class depend on assembly jar),use a library of
> > all 3rd dependency jar like hadoop/hive/hbase more reasonable.
> > 
> > 1 assembly jar packaged all 3rd jars into a big one, so we need rebuild
> > this jar if we want to update the version of some component(such as hadoop)
> > 2 in our practice with spark, sometimes we meet jar compatibility issue,
> > it is hard to diagnose compatibility issue with assembly jar
> > 
> > 
> > 
> > 
> > 
> > 
> > 
> > -
> > To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org 
> > (mailto:dev-unsubscr...@spark.apache.org)
> > For additional commands, e-mail: dev-h...@spark.apache.org 
> > (mailto:dev-h...@spark.apache.org)
> > 
> > 
> > -
> > To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org 
> > (mailto:dev-unsubscr...@spark.apache.org)
> > For additional commands, e-mail: dev-h...@spark.apache.org 
> > (mailto:dev-h...@spark.apache.org)
> > 
> 
> 
> 




Re: about spark assembly jar

2014-09-02 Thread scwf

Hi Sean Owen,
here are some problems I hit when using the assembly jar:
1. I put spark-assembly-*.jar into the lib directory of my application, and it
throws a compile error:

Error:scalac: Error: class scala.reflect.BeanInfo not found.
scala.tools.nsc.MissingRequirementError: class scala.reflect.BeanInfo not found.

at 
scala.tools.nsc.symtab.Definitions$definitions$.getModuleOrClass(Definitions.scala:655)

at 
scala.tools.nsc.symtab.Definitions$definitions$.getClass(Definitions.scala:608)

at 
scala.tools.nsc.backend.jvm.GenJVM$BytecodeGenerator.<init>(GenJVM.scala:127)

at scala.tools.nsc.backend.jvm.GenJVM$JvmPhase.run(GenJVM.scala:85)

at scala.tools.nsc.Global$Run.compileSources(Global.scala:953)

at scala.tools.nsc.Global$Run.compile(Global.scala:1041)

at xsbt.CachedCompiler0.run(CompilerInterface.scala:126)

at xsbt.CachedCompiler0.liftedTree1$1(CompilerInterface.scala:102)

at xsbt.CachedCompiler0.run(CompilerInterface.scala:102)

at xsbt.CompilerInterface.run(CompilerInterface.scala:27)

at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)

at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)

at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)

at java.lang.reflect.Method.invoke(Method.java:597)

at sbt.compiler.AnalyzingCompiler.call(AnalyzingCompiler.scala:102)

at sbt.compiler.AnalyzingCompiler.compile(AnalyzingCompiler.scala:48)

at sbt.compiler.AnalyzingCompiler.compile(AnalyzingCompiler.scala:41)

at 
org.jetbrains.jps.incremental.scala.local.IdeaIncrementalCompiler.compile(IdeaIncrementalCompiler.scala:28)

at 
org.jetbrains.jps.incremental.scala.local.LocalServer.compile(LocalServer.scala:25)

at org.jetbrains.jps.incremental.scala.remote.Main$.make(Main.scala:58)

at 
org.jetbrains.jps.incremental.scala.remote.Main$.nailMain(Main.scala:21)

at org.jetbrains.jps.incremental.scala.remote.Main.nailMain(Main.scala)

at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)

at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)

at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)

at java.lang.reflect.Method.invoke(Method.java:597)

at com.martiansoftware.nailgun.NGSession.run(NGSession.java:319)
2. I tested my branch, which updates the Hive version to org.apache.hive 0.13.1.
  It runs successfully when using a bag of third-party jars as the dependency, but throws an
error when using the assembly jar, so it seems the assembly jar leads to a conflict:
  ERROR DDLTask: java.lang.NoSuchFieldError: doubleTypeInfo
at 
org.apache.hadoop.hive.ql.io.parquet.serde.ArrayWritableObjectInspector.getObjectInspector(ArrayWritableObjectInspector.java:66)
at 
org.apache.hadoop.hive.ql.io.parquet.serde.ArrayWritableObjectInspector.<init>(ArrayWritableObjectInspector.java:59)
at 
org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe.initialize(ParquetHiveSerDe.java:113)
at 
org.apache.hadoop.hive.metastore.MetaStoreUtils.getDeserializer(MetaStoreUtils.java:339)
at 
org.apache.hadoop.hive.ql.metadata.Table.getDeserializerFromMetaStore(Table.java:283)
at 
org.apache.hadoop.hive.ql.metadata.Table.checkValidity(Table.java:189)
at org.apache.hadoop.hive.ql.metadata.Hive.createTable(Hive.java:597)
at org.apache.hadoop.hive.ql.exec.DDLTask.createTable(DDLTask.java:4194)
at org.apache.hadoop.hive.ql.exec.DDLTask.execute(DDLTask.java:281)
at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:153)
at 
org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:85)




On 2014/9/2 16:45, Sean Owen wrote:

Hm, are you suggesting that the Spark distribution be a bag of 100
JARs? It doesn't quite seem reasonable. It does not remove version
conflicts, just pushes them to run-time, which isn't good. The
assembly is also necessary because that's where shading happens. In
development, you want to run against exactly what will be used in a
real Spark distro.

On Tue, Sep 2, 2014 at 9:39 AM, scwf  wrote:

hi, all
   I suggest spark not use assembly jar as default run-time
dependency(spark-submit/spark-class depend on assembly jar),use a library of
all 3rd dependency jar like hadoop/hive/hbase more reasonable.

   1 assembly jar packaged all 3rd jars into a big one, so we need rebuild
this jar if we want to update the version of some component(such as hadoop)
   2 in our practice with spark, sometimes we meet jar compatibility issue,
it is hard to diagnose compatibility issue with assembly jar







-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org









Re: [VOTE] Release Apache Spark 1.1.0 (RC3)

2014-09-02 Thread Will Benton
Zongheng pointed out in my SPARK-3329 PR
(https://github.com/apache/spark/pull/2220) that Aaron had already fixed this
issue but that it had gotten inadvertently clobbered by another patch.  I don't
know how the project handles this kind of problem, but I've rewritten my
SPARK-3329 branch to cherry-pick Aaron's fix (also fixing a merge conflict and
handling a test case that it didn't cover).

The other weird spurious testsuite failures related to orderings I've seen were 
in "DESCRIBE FUNCTION EXTENDED" for functions with lists of synonyms (e.g. 
STDDEV).  I can't reproduce those now but will take another look later this 
week.



best,
wb

- Original Message -
> From: "Sean Owen" 
> To: "Will Benton" 
> Cc: "Patrick Wendell" , dev@spark.apache.org
> Sent: Sunday, August 31, 2014 12:18:42 PM
> Subject: Re: [VOTE] Release Apache Spark 1.1.0 (RC3)
> 
> Fantastic. As it happens, I just fixed up Mahout's tests for Java 8
> and observed a lot of the same type of failure.
> 
> I'm about to submit PRs for the two issues I identified. AFAICT these
> 3 then cover the failures I mentioned:
> 
> https://issues.apache.org/jira/browse/SPARK-3329
> https://issues.apache.org/jira/browse/SPARK-3330
> https://issues.apache.org/jira/browse/SPARK-3331
> 
> I'd argue that none necessarily block a release, since they just
> represent a problem with test-only code in Java 8, with the test-only
> context of Jenkins and multiple profiles, and with a trivial
> configuration in a style check for Python. Should be fixed but none
> indicate a bug in the release.
> 
> On Sun, Aug 31, 2014 at 6:11 PM, Will Benton  wrote:
> > - Original Message -
> >
> >> dev/run-tests fails two tests (1 Hive, 1 Kafka Streaming) for me
> >> locally on 1.1.0-rc3. Does anyone else see that? It may be my env.
> >> Although I still see the Hive failure on Debian too:
> >>
> >> [info] - SET commands semantics for a HiveContext *** FAILED ***
> >> [info]   Expected Array("spark.sql.key.usedfortestonly=test.val.0",
> >> "spark.sql.key.usedfortestonlyspark.sql.key.usedfortestonly=test.val.0test.val.0"),
> >> but got
> >> Array("spark.sql.key.usedfortestonlyspark.sql.key.usedfortestonly=test.val.0test.val.0",
> >> "spark.sql.key.usedfortestonly=test.val.0") (HiveQuerySuite.scala:541)
> >
> > I've seen this error before.  (In particular, I've seen it on my OS X
> > machine using Oracle JDK 8 but not on Fedora using OpenJDK.)  I've also
> > seen similar errors in topic branches (but not on master) that seem to
> > indicate that tests depend on sets of pairs arriving from Hive in a
> > particular order; it seems that this isn't a safe assumption.
> >
> > I just submitted a (trivial) PR to fix this spurious failure:
> > https://github.com/apache/spark/pull/2220
> >
> >
> > best,
> > wb
> 
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
> 
> 




Re: about spark assembly jar

2014-09-02 Thread Sandy Ryza
This doesn't help for every dependency, but Spark provides an option to
build the assembly jar without Hadoop and its dependencies.  We make use of
this in CDH packaging.

-Sandy


On Tue, Sep 2, 2014 at 2:12 AM, scwf  wrote:

> Hi sean owen,
> here are some problems when i used assembly jar
> 1 i put spark-assembly-*.jar to the lib directory of my application, it
> throw compile error
>
> Error:scalac: Error: class scala.reflect.BeanInfo not found.
> scala.tools.nsc.MissingRequirementError: class scala.reflect.BeanInfo not
> found.
>
> at scala.tools.nsc.symtab.Definitions$definitions$.
> getModuleOrClass(Definitions.scala:655)
>
> at scala.tools.nsc.symtab.Definitions$definitions$.
> getClass(Definitions.scala:608)
>
> at scala.tools.nsc.backend.jvm.GenJVM$BytecodeGenerator.<
> init>(GenJVM.scala:127)
>
> at scala.tools.nsc.backend.jvm.GenJVM$JvmPhase.run(GenJVM.
> scala:85)
>
> at scala.tools.nsc.Global$Run.compileSources(Global.scala:953)
>
> at scala.tools.nsc.Global$Run.compile(Global.scala:1041)
>
> at xsbt.CachedCompiler0.run(CompilerInterface.scala:126)
>
> at xsbt.CachedCompiler0.liftedTree1$1(CompilerInterface.scala:102)
>
> at xsbt.CachedCompiler0.run(CompilerInterface.scala:102)
>
> at xsbt.CompilerInterface.run(CompilerInterface.scala:27)
>
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>
> at sun.reflect.NativeMethodAccessorImpl.invoke(
> NativeMethodAccessorImpl.java:39)
>
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(
> DelegatingMethodAccessorImpl.java:25)
>
> at java.lang.reflect.Method.invoke(Method.java:597)
>
> at sbt.compiler.AnalyzingCompiler.call(
> AnalyzingCompiler.scala:102)
>
> at sbt.compiler.AnalyzingCompiler.compile(
> AnalyzingCompiler.scala:48)
>
> at sbt.compiler.AnalyzingCompiler.compile(
> AnalyzingCompiler.scala:41)
>
> at org.jetbrains.jps.incremental.scala.local.
> IdeaIncrementalCompiler.compile(IdeaIncrementalCompiler.scala:28)
>
> at org.jetbrains.jps.incremental.scala.local.LocalServer.
> compile(LocalServer.scala:25)
>
> at org.jetbrains.jps.incremental.scala.remote.Main$.make(Main.
> scala:58)
>
> at org.jetbrains.jps.incremental.scala.remote.Main$.nailMain(
> Main.scala:21)
>
> at org.jetbrains.jps.incremental.scala.remote.Main.nailMain(
> Main.scala)
>
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>
> at sun.reflect.NativeMethodAccessorImpl.invoke(
> NativeMethodAccessorImpl.java:39)
>
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(
> DelegatingMethodAccessorImpl.java:25)
>
> at java.lang.reflect.Method.invoke(Method.java:597)
>
> at com.martiansoftware.nailgun.NGSession.run(NGSession.java:319)
> 2 i test my branch which updated hive version to org.apache.hive 0.13.1
>   it run successfully when use a bag of 3rd jars as dependency but throw
> error using assembly jar, it seems assembly jar lead to conflict
>   ERROR DDLTask: java.lang.NoSuchFieldError: doubleTypeInfo
> at org.apache.hadoop.hive.ql.io.parquet.serde.
> ArrayWritableObjectInspector.getObjectInspector(
> ArrayWritableObjectInspector.java:66)
> at org.apache.hadoop.hive.ql.io.parquet.serde.
> ArrayWritableObjectInspector.(ArrayWritableObjectInspector.java:59)
> at org.apache.hadoop.hive.ql.io.parquet.serde.
> ParquetHiveSerDe.initialize(ParquetHiveSerDe.java:113)
> at org.apache.hadoop.hive.metastore.MetaStoreUtils.
> getDeserializer(MetaStoreUtils.java:339)
> at org.apache.hadoop.hive.ql.metadata.Table.
> getDeserializerFromMetaStore(Table.java:283)
> at org.apache.hadoop.hive.ql.metadata.Table.checkValidity(
> Table.java:189)
> at org.apache.hadoop.hive.ql.metadata.Hive.createTable(
> Hive.java:597)
> at org.apache.hadoop.hive.ql.exec.DDLTask.createTable(
> DDLTask.java:4194)
> at org.apache.hadoop.hive.ql.exec.DDLTask.execute(DDLTask.
> java:281)
> at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:153)
> at org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(
> TaskRunner.java:85)
>
>
>
>
>
> On 2014/9/2 16:45, Sean Owen wrote:
>
>> Hm, are you suggesting that the Spark distribution be a bag of 100
>> JARs? It doesn't quite seem reasonable. It does not remove version
>> conflicts, just pushes them to run-time, which isn't good. The
>> assembly is also necessary because that's where shading happens. In
>> development, you want to run against exactly what will be used in a
>> real Spark distro.
>>
>> On Tue, Sep 2, 2014 at 9:39 AM, scwf  wrote:
>>
>>> hi, all
>>>I suggest spark not use assembly jar as default run-time
>>> dependency(spark-submit/spark-class depend on assembly jar),use a
>>> library of
>>> all 3rd dependency jar like hadoop/hive/hbase more reasonable.
>>>
>>>1 assembly jar packaged all 3rd jars into a big one, 

hive client.getAllPartitions in lookupRelation can take a very long time

2014-09-02 Thread chutium
in our hive warehouse there are many tables with a lot of partitions, such as
scala> hiveContext.sql("use db_external")
scala> val result = hiveContext.sql("show partitions et_fullorders").count
result: Long = 5879

i noticed that this part of code:
https://github.com/apache/spark/blob/9d006c97371ddf357e0b821d5c6d1535d9b6fe41/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala#L55-L56

reads all of the partition info at the beginning of the planning phase. I added a
logInfo around this val partitions = ...

it shows:

scala> val result = hiveContext.sql("select * from db_external.et_fullorders
limit 5")
14/09/02 16:15:56 INFO ParseDriver: Parsing command: select * from
db_external.et_fullorders limit 5
14/09/02 16:15:56 INFO ParseDriver: Parse Completed
14/09/02 16:15:56 INFO HiveContext$$anon$1: getAllPartitionsForPruner
started
14/09/02 16:17:35 INFO HiveContext$$anon$1: getAllPartitionsForPruner
finished

it took about 2min to get all partitions...

Is there any possible way to avoid this operation, for example by fetching only the
requested partitions somehow?

Thanks



--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/hive-client-getAllPartitions-in-lookupRelation-can-take-a-very-long-time-tp8186.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.




hey spark developers! intro from shane knapp, devops engineer @ AMPLab

2014-09-02 Thread shane knapp
so, i had a meeting w/the databricks guys on friday and they recommended i
send an email out to the list to say 'hi' and give you guys a quick intro.
 :)

hi!  i'm shane knapp, the new AMPLab devops engineer, and will be spending
time getting the jenkins build infrastructure up to production quality.
 much of this will be 'under the covers' work, like better system level
auth, backups, etc, but some will definitely be user facing:  timely
jenkins updates, debugging broken build infrastructure and some plugin
support.

i've been working in the bay area now since 1997 at many different
companies, and my last 10 years has been split between google and palantir.
 i'm a huge proponent of OSS, and am really happy to be able to help with
the work you guys are doing!

if anyone has any requests/questions/comments, feel free to drop me a line!

shane


Re: hey spark developers! intro from shane knapp, devops engineer @ AMPLab

2014-09-02 Thread Reynold Xin
Welcome, Shane!

On Tuesday, September 2, 2014, shane knapp  wrote:

> so, i had a meeting w/the databricks guys on friday and they recommended i
> send an email out to the list to say 'hi' and give you guys a quick intro.
>  :)
>
> hi!  i'm shane knapp, the new AMPLab devops engineer, and will be spending
> time getting the jenkins build infrastructure up to production quality.
>  much of this will be 'under the covers' work, like better system level
> auth, backups, etc, but some will definitely be user facing:  timely
> jenkins updates, debugging broken build infrastructure and some plugin
> support.
>
> i've been working in the bay area now since 1997 at many different
> companies, and my last 10 years has been split between google and palantir.
>  i'm a huge proponent of OSS, and am really happy to be able to help with
> the work you guys are doing!
>
> if anyone has any requests/questions/comments, feel free to drop me a line!
>
> shane
>


Re: hey spark developers! intro from shane knapp, devops engineer @ AMPLab

2014-09-02 Thread Nicholas Chammas
Hi Shane!

Thank you for doing the Jenkins upgrade last week. It's nice to know that
infrastructure is gonna get some dedicated TLC going forward.

Welcome aboard!

Nick


On Tue, Sep 2, 2014 at 1:35 PM, shane knapp  wrote:

> so, i had a meeting w/the databricks guys on friday and they recommended i
> send an email out to the list to say 'hi' and give you guys a quick intro.
>  :)
>
> hi!  i'm shane knapp, the new AMPLab devops engineer, and will be spending
> time getting the jenkins build infrastructure up to production quality.
>  much of this will be 'under the covers' work, like better system level
> auth, backups, etc, but some will definitely be user facing:  timely
> jenkins updates, debugging broken build infrastructure and some plugin
> support.
>
> i've been working in the bay area now since 1997 at many different
> companies, and my last 10 years has been split between google and palantir.
>  i'm a huge proponent of OSS, and am really happy to be able to help with
> the work you guys are doing!
>
> if anyone has any requests/questions/comments, feel free to drop me a line!
>
> shane
>


Re: about spark assembly jar

2014-09-02 Thread Reynold Xin
Having an SSD helps tremendously with assembly time.

Without that, you can do the following in order for Spark to pick up the
compiled classes before assembly at runtime.

export SPARK_PREPEND_CLASSES=true


On Tue, Sep 2, 2014 at 9:10 AM, Sandy Ryza  wrote:

> This doesn't help for every dependency, but Spark provides an option to
> build the assembly jar without Hadoop and its dependencies.  We make use of
> this in CDH packaging.
>
> -Sandy
>
>
> On Tue, Sep 2, 2014 at 2:12 AM, scwf  wrote:
>
> > Hi sean owen,
> > here are some problems when i used assembly jar
> > 1 i put spark-assembly-*.jar to the lib directory of my application, it
> > throw compile error
> >
> > Error:scalac: Error: class scala.reflect.BeanInfo not found.
> > scala.tools.nsc.MissingRequirementError: class scala.reflect.BeanInfo not
> > found.
> >
> > at scala.tools.nsc.symtab.Definitions$definitions$.
> > getModuleOrClass(Definitions.scala:655)
> >
> > at scala.tools.nsc.symtab.Definitions$definitions$.
> > getClass(Definitions.scala:608)
> >
> > at scala.tools.nsc.backend.jvm.GenJVM$BytecodeGenerator.<
> > init>(GenJVM.scala:127)
> >
> > at scala.tools.nsc.backend.jvm.GenJVM$JvmPhase.run(GenJVM.
> > scala:85)
> >
> > at scala.tools.nsc.Global$Run.compileSources(Global.scala:953)
> >
> > at scala.tools.nsc.Global$Run.compile(Global.scala:1041)
> >
> > at xsbt.CachedCompiler0.run(CompilerInterface.scala:126)
> >
> > at
> xsbt.CachedCompiler0.liftedTree1$1(CompilerInterface.scala:102)
> >
> > at xsbt.CachedCompiler0.run(CompilerInterface.scala:102)
> >
> > at xsbt.CompilerInterface.run(CompilerInterface.scala:27)
> >
> > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> >
> > at sun.reflect.NativeMethodAccessorImpl.invoke(
> > NativeMethodAccessorImpl.java:39)
> >
> > at sun.reflect.DelegatingMethodAccessorImpl.invoke(
> > DelegatingMethodAccessorImpl.java:25)
> >
> > at java.lang.reflect.Method.invoke(Method.java:597)
> >
> > at sbt.compiler.AnalyzingCompiler.call(
> > AnalyzingCompiler.scala:102)
> >
> > at sbt.compiler.AnalyzingCompiler.compile(
> > AnalyzingCompiler.scala:48)
> >
> > at sbt.compiler.AnalyzingCompiler.compile(
> > AnalyzingCompiler.scala:41)
> >
> > at org.jetbrains.jps.incremental.scala.local.
> > IdeaIncrementalCompiler.compile(IdeaIncrementalCompiler.scala:28)
> >
> > at org.jetbrains.jps.incremental.scala.local.LocalServer.
> > compile(LocalServer.scala:25)
> >
> > at org.jetbrains.jps.incremental.scala.remote.Main$.make(Main.
> > scala:58)
> >
> > at org.jetbrains.jps.incremental.scala.remote.Main$.nailMain(
> > Main.scala:21)
> >
> > at org.jetbrains.jps.incremental.scala.remote.Main.nailMain(
> > Main.scala)
> >
> > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> >
> > at sun.reflect.NativeMethodAccessorImpl.invoke(
> > NativeMethodAccessorImpl.java:39)
> >
> > at sun.reflect.DelegatingMethodAccessorImpl.invoke(
> > DelegatingMethodAccessorImpl.java:25)
> >
> > at java.lang.reflect.Method.invoke(Method.java:597)
> >
> > at com.martiansoftware.nailgun.NGSession.run(NGSession.java:319)
> > 2 i test my branch which updated hive version to org.apache.hive 0.13.1
> >   it run successfully when use a bag of 3rd jars as dependency but throw
> > error using assembly jar, it seems assembly jar lead to conflict
> >   ERROR DDLTask: java.lang.NoSuchFieldError: doubleTypeInfo
> > at org.apache.hadoop.hive.ql.io.parquet.serde.
> > ArrayWritableObjectInspector.getObjectInspector(
> > ArrayWritableObjectInspector.java:66)
> > at org.apache.hadoop.hive.ql.io.parquet.serde.
> > ArrayWritableObjectInspector.(ArrayWritableObjectInspector.java:59)
> > at org.apache.hadoop.hive.ql.io.parquet.serde.
> > ParquetHiveSerDe.initialize(ParquetHiveSerDe.java:113)
> > at org.apache.hadoop.hive.metastore.MetaStoreUtils.
> > getDeserializer(MetaStoreUtils.java:339)
> > at org.apache.hadoop.hive.ql.metadata.Table.
> > getDeserializerFromMetaStore(Table.java:283)
> > at org.apache.hadoop.hive.ql.metadata.Table.checkValidity(
> > Table.java:189)
> > at org.apache.hadoop.hive.ql.metadata.Hive.createTable(
> > Hive.java:597)
> > at org.apache.hadoop.hive.ql.exec.DDLTask.createTable(
> > DDLTask.java:4194)
> > at org.apache.hadoop.hive.ql.exec.DDLTask.execute(DDLTask.
> > java:281)
> > at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:153)
> > at org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(
> > TaskRunner.java:85)
> >
> >
> >
> >
> >
> > On 2014/9/2 16:45, Sean Owen wrote:
> >
> >> Hm, are you suggesting that the Spark distribution be a bag of 100
> >> JARs? It doesn't quite seem reasonable. It does not remove version
> >> conflicts, just pushes them to run-time, which isn't good. The
> >> as

Re: hey spark developers! intro from shane knapp, devops engineer @ AMPLab

2014-09-02 Thread Patrick Wendell
Hey Shane,

Thanks for your work so far and I'm really happy to see investment in
this infrastructure. This is a key productivity tool for us and
something we'd love to expand over time to improve the development
process of Spark.

- Patrick

On Tue, Sep 2, 2014 at 10:47 AM, Nicholas Chammas
 wrote:
> Hi Shane!
>
> Thank you for doing the Jenkins upgrade last week. It's nice to know that
> infrastructure is gonna get some dedicated TLC going forward.
>
> Welcome aboard!
>
> Nick
>
>
> On Tue, Sep 2, 2014 at 1:35 PM, shane knapp  wrote:
>
>> so, i had a meeting w/the databricks guys on friday and they recommended i
>> send an email out to the list to say 'hi' and give you guys a quick intro.
>>  :)
>>
>> hi!  i'm shane knapp, the new AMPLab devops engineer, and will be spending
>> time getting the jenkins build infrastructure up to production quality.
>>  much of this will be 'under the covers' work, like better system level
>> auth, backups, etc, but some will definitely be user facing:  timely
>> jenkins updates, debugging broken build infrastructure and some plugin
>> support.
>>
>> i've been working in the bay area now since 1997 at many different
>> companies, and my last 10 years has been split between google and palantir.
>>  i'm a huge proponent of OSS, and am really happy to be able to help with
>> the work you guys are doing!
>>
>> if anyone has any requests/questions/comments, feel free to drop me a line!
>>
>> shane
>>




Re: hey spark developers! intro from shane knapp, devops engineer @ AMPLab

2014-09-02 Thread Christopher Nguyen
Welcome, Shane. As a former prof and eng dir at Google, I've been expecting
this to be a first-class engineering college subject. I just didn't expect
it to come through this route :-)

So congrats, and I hope you represent the beginning of a great new trend at
universities.

Sent while mobile. Please excuse typos etc.
On Sep 2, 2014 11:00 AM, "Patrick Wendell"  wrote:

> Hey Shane,
>
> Thanks for your work so far and I'm really happy to see investment in
> this infrastructure. This is a key productivity tool for us and
> something we'd love to expand over time to improve the development
> process of Spark.
>
> - Patrick
>
> On Tue, Sep 2, 2014 at 10:47 AM, Nicholas Chammas
>  wrote:
> > Hi Shane!
> >
> > Thank you for doing the Jenkins upgrade last week. It's nice to know that
> > infrastructure is gonna get some dedicated TLC going forward.
> >
> > Welcome aboard!
> >
> > Nick
> >
> >
> > On Tue, Sep 2, 2014 at 1:35 PM, shane knapp  wrote:
> >
> >> so, i had a meeting w/the databricks guys on friday and they
> recommended i
> >> send an email out to the list to say 'hi' and give you guys a quick
> intro.
> >>  :)
> >>
> >> hi!  i'm shane knapp, the new AMPLab devops engineer, and will be
> spending
> >> time getting the jenkins build infrastructure up to production quality.
> >>  much of this will be 'under the covers' work, like better system level
> >> auth, backups, etc, but some will definitely be user facing:  timely
> >> jenkins updates, debugging broken build infrastructure and some plugin
> >> support.
> >>
> >> i've been working in the bay area now since 1997 at many different
> >> companies, and my last 10 years has been split between google and
> palantir.
> >>  i'm a huge proponent of OSS, and am really happy to be able to help
> with
> >> the work you guys are doing!
> >>
> >> if anyone has any requests/questions/comments, feel free to drop me a
> line!
> >>
> >> shane
> >>
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>
>


Resource allocation

2014-09-02 Thread rapelly kartheek
Hi,

I want to incorporate some intelligence into choosing the resources for
RDD replication. I thought that if we replicate an RDD on nodes chosen
based on their capabilities, the next application that requires this RDD could
be executed more efficiently. But I found that an RDD created by an
application is owned by only that application, and nobody else can access
it.

Can someone tell me what kind of operations can be done on a replicated
RDD? Or, to put it another way, what are the benefits of a replicated RDD, and
what operations can be performed on one? I just want to know
how effective my work is going to be.
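
(For concreteness: the only notion of a "replicated RDD" I am aware of in the
current API is persisting with a *_2 storage level, which keeps each cached
partition on two executors. A minimal sketch, assuming a plain SparkContext sc
and a made-up input path:

import org.apache.spark.storage.StorageLevel

val data = sc.textFile("hdfs:///some/path")     // hypothetical input
data.persist(StorageLevel.MEMORY_AND_DISK_2)    // "_2" = keep two replicas of each block
data.count()                                    // materializes the replicated blocks

The replicas are used transparently for fault tolerance and task locality, but
only by the application that owns the RDD.)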

I'll be happy if other ideas along similar lines are
suggested.

Thank you!!
Karthik


Re: about spark assembly jar

2014-09-02 Thread Cheng Lian
Yea, SSD + SPARK_PREPEND_CLASSES totally changed my life :)

Maybe we should add a "developer notes" page to document all this useful
black magic.


On Tue, Sep 2, 2014 at 10:54 AM, Reynold Xin  wrote:

> Having a SSD help tremendously with assembly time.
>
> Without that, you can do the following in order for Spark to pick up the
> compiled classes before assembly at runtime.
>
> export SPARK_PREPEND_CLASSES=true
>
>
> On Tue, Sep 2, 2014 at 9:10 AM, Sandy Ryza 
> wrote:
>
> > This doesn't help for every dependency, but Spark provides an option to
> > build the assembly jar without Hadoop and its dependencies.  We make use
> of
> > this in CDH packaging.
> >
> > -Sandy
> >
> >
> > On Tue, Sep 2, 2014 at 2:12 AM, scwf  wrote:
> >
> > > Hi sean owen,
> > > here are some problems when i used assembly jar
> > > 1 i put spark-assembly-*.jar to the lib directory of my application, it
> > > throw compile error
> > >
> > > Error:scalac: Error: class scala.reflect.BeanInfo not found.
> > > scala.tools.nsc.MissingRequirementError: class scala.reflect.BeanInfo
> not
> > > found.
> > >
> > > at scala.tools.nsc.symtab.Definitions$definitions$.
> > > getModuleOrClass(Definitions.scala:655)
> > >
> > > at scala.tools.nsc.symtab.Definitions$definitions$.
> > > getClass(Definitions.scala:608)
> > >
> > > at scala.tools.nsc.backend.jvm.GenJVM$BytecodeGenerator.<
> > > init>(GenJVM.scala:127)
> > >
> > > at scala.tools.nsc.backend.jvm.GenJVM$JvmPhase.run(GenJVM.
> > > scala:85)
> > >
> > > at scala.tools.nsc.Global$Run.compileSources(Global.scala:953)
> > >
> > > at scala.tools.nsc.Global$Run.compile(Global.scala:1041)
> > >
> > > at xsbt.CachedCompiler0.run(CompilerInterface.scala:126)
> > >
> > > at
> > xsbt.CachedCompiler0.liftedTree1$1(CompilerInterface.scala:102)
> > >
> > > at xsbt.CachedCompiler0.run(CompilerInterface.scala:102)
> > >
> > > at xsbt.CompilerInterface.run(CompilerInterface.scala:27)
> > >
> > > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> > >
> > > at sun.reflect.NativeMethodAccessorImpl.invoke(
> > > NativeMethodAccessorImpl.java:39)
> > >
> > > at sun.reflect.DelegatingMethodAccessorImpl.invoke(
> > > DelegatingMethodAccessorImpl.java:25)
> > >
> > > at java.lang.reflect.Method.invoke(Method.java:597)
> > >
> > > at sbt.compiler.AnalyzingCompiler.call(
> > > AnalyzingCompiler.scala:102)
> > >
> > > at sbt.compiler.AnalyzingCompiler.compile(
> > > AnalyzingCompiler.scala:48)
> > >
> > > at sbt.compiler.AnalyzingCompiler.compile(
> > > AnalyzingCompiler.scala:41)
> > >
> > > at org.jetbrains.jps.incremental.scala.local.
> > > IdeaIncrementalCompiler.compile(IdeaIncrementalCompiler.scala:28)
> > >
> > > at org.jetbrains.jps.incremental.scala.local.LocalServer.
> > > compile(LocalServer.scala:25)
> > >
> > > at org.jetbrains.jps.incremental.scala.remote.Main$.make(Main.
> > > scala:58)
> > >
> > > at org.jetbrains.jps.incremental.scala.remote.Main$.nailMain(
> > > Main.scala:21)
> > >
> > > at org.jetbrains.jps.incremental.scala.remote.Main.nailMain(
> > > Main.scala)
> > >
> > > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> > >
> > > at sun.reflect.NativeMethodAccessorImpl.invoke(
> > > NativeMethodAccessorImpl.java:39)
> > >
> > > at sun.reflect.DelegatingMethodAccessorImpl.invoke(
> > > DelegatingMethodAccessorImpl.java:25)
> > >
> > > at java.lang.reflect.Method.invoke(Method.java:597)
> > >
> > > at
> com.martiansoftware.nailgun.NGSession.run(NGSession.java:319)
> > > 2 i test my branch which updated hive version to org.apache.hive 0.13.1
> > >   it run successfully when use a bag of 3rd jars as dependency but
> throw
> > > error using assembly jar, it seems assembly jar lead to conflict
> > >   ERROR DDLTask: java.lang.NoSuchFieldError: doubleTypeInfo
> > > at org.apache.hadoop.hive.ql.io.parquet.serde.
> > > ArrayWritableObjectInspector.getObjectInspector(
> > > ArrayWritableObjectInspector.java:66)
> > > at org.apache.hadoop.hive.ql.io.parquet.serde.
> > >
> ArrayWritableObjectInspector.(ArrayWritableObjectInspector.java:59)
> > > at org.apache.hadoop.hive.ql.io.parquet.serde.
> > > ParquetHiveSerDe.initialize(ParquetHiveSerDe.java:113)
> > > at org.apache.hadoop.hive.metastore.MetaStoreUtils.
> > > getDeserializer(MetaStoreUtils.java:339)
> > > at org.apache.hadoop.hive.ql.metadata.Table.
> > > getDeserializerFromMetaStore(Table.java:283)
> > > at org.apache.hadoop.hive.ql.metadata.Table.checkValidity(
> > > Table.java:189)
> > > at org.apache.hadoop.hive.ql.metadata.Hive.createTable(
> > > Hive.java:597)
> > > at org.apache.hadoop.hive.ql.exec.DDLTask.createTable(
> > > DDLTask.java:4194)
> > > at org.apache.hadoop.hive.ql.exec.DDLTask.execute(DDLTask.
> > > java:281)
> > 

Re: about spark assembly jar

2014-09-02 Thread Josh Rosen
SPARK_PREPEND_CLASSES is documented on the Spark Wiki (which could probably be 
easier to find): 
https://cwiki.apache.org/confluence/display/SPARK/Useful+Developer+Tools


On September 2, 2014 at 11:53:49 AM, Cheng Lian (lian.cs@gmail.com) wrote:

Yea, SSD + SPARK_PREPEND_CLASSES totally changed my life :)  

Maybe we should add a "developer notes" page to document all these useful  
black magic.  


On Tue, Sep 2, 2014 at 10:54 AM, Reynold Xin  wrote:  

> Having a SSD help tremendously with assembly time.  
>  
> Without that, you can do the following in order for Spark to pick up the  
> compiled classes before assembly at runtime.  
>  
> export SPARK_PREPEND_CLASSES=true  
>  
>  
> On Tue, Sep 2, 2014 at 9:10 AM, Sandy Ryza   
> wrote:  
>  
> > This doesn't help for every dependency, but Spark provides an option to  
> > build the assembly jar without Hadoop and its dependencies. We make use  
> of  
> > this in CDH packaging.  
> >  
> > -Sandy  
> >  
> >  
> > On Tue, Sep 2, 2014 at 2:12 AM, scwf  wrote:  
> >  
> > > Hi sean owen,  
> > > here are some problems when i used assembly jar  
> > > 1 i put spark-assembly-*.jar to the lib directory of my application, it  
> > > throw compile error  
> > >  
> > > Error:scalac: Error: class scala.reflect.BeanInfo not found.  
> > > scala.tools.nsc.MissingRequirementError: class scala.reflect.BeanInfo  
> not  
> > > found.  
> > >  
> > > at scala.tools.nsc.symtab.Definitions$definitions$.  
> > > getModuleOrClass(Definitions.scala:655)  
> > >  
> > > at scala.tools.nsc.symtab.Definitions$definitions$.  
> > > getClass(Definitions.scala:608)  
> > >  
> > > at scala.tools.nsc.backend.jvm.GenJVM$BytecodeGenerator.<  
> > > init>(GenJVM.scala:127)  
> > >  
> > > at scala.tools.nsc.backend.jvm.GenJVM$JvmPhase.run(GenJVM.  
> > > scala:85)  
> > >  
> > > at scala.tools.nsc.Global$Run.compileSources(Global.scala:953)  
> > >  
> > > at scala.tools.nsc.Global$Run.compile(Global.scala:1041)  
> > >  
> > > at xsbt.CachedCompiler0.run(CompilerInterface.scala:126)  
> > >  
> > > at  
> > xsbt.CachedCompiler0.liftedTree1$1(CompilerInterface.scala:102)  
> > >  
> > > at xsbt.CachedCompiler0.run(CompilerInterface.scala:102)  
> > >  
> > > at xsbt.CompilerInterface.run(CompilerInterface.scala:27)  
> > >  
> > > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)  
> > >  
> > > at sun.reflect.NativeMethodAccessorImpl.invoke(  
> > > NativeMethodAccessorImpl.java:39)  
> > >  
> > > at sun.reflect.DelegatingMethodAccessorImpl.invoke(  
> > > DelegatingMethodAccessorImpl.java:25)  
> > >  
> > > at java.lang.reflect.Method.invoke(Method.java:597)  
> > >  
> > > at sbt.compiler.AnalyzingCompiler.call(  
> > > AnalyzingCompiler.scala:102)  
> > >  
> > > at sbt.compiler.AnalyzingCompiler.compile(  
> > > AnalyzingCompiler.scala:48)  
> > >  
> > > at sbt.compiler.AnalyzingCompiler.compile(  
> > > AnalyzingCompiler.scala:41)  
> > >  
> > > at org.jetbrains.jps.incremental.scala.local.  
> > > IdeaIncrementalCompiler.compile(IdeaIncrementalCompiler.scala:28)  
> > >  
> > > at org.jetbrains.jps.incremental.scala.local.LocalServer.  
> > > compile(LocalServer.scala:25)  
> > >  
> > > at org.jetbrains.jps.incremental.scala.remote.Main$.make(Main.  
> > > scala:58)  
> > >  
> > > at org.jetbrains.jps.incremental.scala.remote.Main$.nailMain(  
> > > Main.scala:21)  
> > >  
> > > at org.jetbrains.jps.incremental.scala.remote.Main.nailMain(  
> > > Main.scala)  
> > >  
> > > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)  
> > >  
> > > at sun.reflect.NativeMethodAccessorImpl.invoke(  
> > > NativeMethodAccessorImpl.java:39)  
> > >  
> > > at sun.reflect.DelegatingMethodAccessorImpl.invoke(  
> > > DelegatingMethodAccessorImpl.java:25)  
> > >  
> > > at java.lang.reflect.Method.invoke(Method.java:597)  
> > >  
> > > at  
> com.martiansoftware.nailgun.NGSession.run(NGSession.java:319)  
> > > 2 i test my branch which updated hive version to org.apache.hive 0.13.1  
> > > it run successfully when use a bag of 3rd jars as dependency but  
> throw  
> > > error using assembly jar, it seems assembly jar lead to conflict  
> > > ERROR DDLTask: java.lang.NoSuchFieldError: doubleTypeInfo  
> > > at org.apache.hadoop.hive.ql.io.parquet.serde.  
> > > ArrayWritableObjectInspector.getObjectInspector(  
> > > ArrayWritableObjectInspector.java:66)  
> > > at org.apache.hadoop.hive.ql.io.parquet.serde.  
> > >  
> ArrayWritableObjectInspector.(ArrayWritableObjectInspector.java:59)  
> > > at org.apache.hadoop.hive.ql.io.parquet.serde.  
> > > ParquetHiveSerDe.initialize(ParquetHiveSerDe.java:113)  
> > > at org.apache.hadoop.hive.metastore.MetaStoreUtils.  
> > > getDeserializer(MetaStoreUtils.java:339)  
> > > at org.apache.hadoop.hive.ql.metadata.Table.  
> > > getDeserializerFromMetaStore(Table.java:283)  
> > > at org.apache.hadoop.hive.ql.metadata.Table.checkValidity(  
> > > Table.java:189)  
> > > at org.apache.hadoop.hive.ql.metadata.

Re: about spark assembly jar

2014-09-02 Thread Cheng Lian
Cool, didn't notice that, thanks Josh!


On Tue, Sep 2, 2014 at 11:55 AM, Josh Rosen  wrote:

> SPARK_PREPEND_CLASSES is documented on the Spark Wiki (which could
> probably be easier to find):
> https://cwiki.apache.org/confluence/display/SPARK/Useful+Developer+Tools
>
>
> On September 2, 2014 at 11:53:49 AM, Cheng Lian (lian.cs@gmail.com)
> wrote:
>
> Yea, SSD + SPARK_PREPEND_CLASSES totally changed my life :)
>
> Maybe we should add a "developer notes" page to document all these useful
> black magic.
>
>
> On Tue, Sep 2, 2014 at 10:54 AM, Reynold Xin  wrote:
>
> > Having a SSD help tremendously with assembly time.
> >
> > Without that, you can do the following in order for Spark to pick up the
> > compiled classes before assembly at runtime.
> >
> > export SPARK_PREPEND_CLASSES=true
> >
> >
> > On Tue, Sep 2, 2014 at 9:10 AM, Sandy Ryza 
> > wrote:
> >
> > > This doesn't help for every dependency, but Spark provides an option
> to
> > > build the assembly jar without Hadoop and its dependencies. We make
> use
> > of
> > > this in CDH packaging.
> > >
> > > -Sandy
> > >
> > >
> > > On Tue, Sep 2, 2014 at 2:12 AM, scwf  wrote:
> > >
> > > > Hi sean owen,
> > > > here are some problems when i used assembly jar
> > > > 1 i put spark-assembly-*.jar to the lib directory of my application,
> it
> > > > throw compile error
> > > >
> > > > Error:scalac: Error: class scala.reflect.BeanInfo not found.
> > > > scala.tools.nsc.MissingRequirementError: class
> scala.reflect.BeanInfo
> > not
> > > > found.
> > > >
> > > > at scala.tools.nsc.symtab.Definitions$definitions$.
> > > > getModuleOrClass(Definitions.scala:655)
> > > >
> > > > at scala.tools.nsc.symtab.Definitions$definitions$.
> > > > getClass(Definitions.scala:608)
> > > >
> > > > at scala.tools.nsc.backend.jvm.GenJVM$BytecodeGenerator.<
> > > > init>(GenJVM.scala:127)
> > > >
> > > > at scala.tools.nsc.backend.jvm.GenJVM$JvmPhase.run(GenJVM.
> > > > scala:85)
> > > >
> > > > at scala.tools.nsc.Global$Run.compileSources(Global.scala:953)
> > > >
> > > > at scala.tools.nsc.Global$Run.compile(Global.scala:1041)
> > > >
> > > > at xsbt.CachedCompiler0.run(CompilerInterface.scala:126)
> > > >
> > > > at
> > > xsbt.CachedCompiler0.liftedTree1$1(CompilerInterface.scala:102)
> > > >
> > > > at xsbt.CachedCompiler0.run(CompilerInterface.scala:102)
> > > >
> > > > at xsbt.CompilerInterface.run(CompilerInterface.scala:27)
> > > >
> > > > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> > > >
> > > > at sun.reflect.NativeMethodAccessorImpl.invoke(
> > > > NativeMethodAccessorImpl.java:39)
> > > >
> > > > at sun.reflect.DelegatingMethodAccessorImpl.invoke(
> > > > DelegatingMethodAccessorImpl.java:25)
> > > >
> > > > at java.lang.reflect.Method.invoke(Method.java:597)
> > > >
> > > > at sbt.compiler.AnalyzingCompiler.call(
> > > > AnalyzingCompiler.scala:102)
> > > >
> > > > at sbt.compiler.AnalyzingCompiler.compile(
> > > > AnalyzingCompiler.scala:48)
> > > >
> > > > at sbt.compiler.AnalyzingCompiler.compile(
> > > > AnalyzingCompiler.scala:41)
> > > >
> > > > at org.jetbrains.jps.incremental.scala.local.
> > > > IdeaIncrementalCompiler.compile(IdeaIncrementalCompiler.scala:28)
> > > >
> > > > at org.jetbrains.jps.incremental.scala.local.LocalServer.
> > > > compile(LocalServer.scala:25)
> > > >
> > > > at org.jetbrains.jps.incremental.scala.remote.Main$.make(Main.
> > > > scala:58)
> > > >
> > > > at org.jetbrains.jps.incremental.scala.remote.Main$.nailMain(
> > > > Main.scala:21)
> > > >
> > > > at org.jetbrains.jps.incremental.scala.remote.Main.nailMain(
> > > > Main.scala)
> > > >
> > > > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> > > >
> > > > at sun.reflect.NativeMethodAccessorImpl.invoke(
> > > > NativeMethodAccessorImpl.java:39)
> > > >
> > > > at sun.reflect.DelegatingMethodAccessorImpl.invoke(
> > > > DelegatingMethodAccessorImpl.java:25)
> > > >
> > > > at java.lang.reflect.Method.invoke(Method.java:597)
> > > >
> > > > at
> > com.martiansoftware.nailgun.NGSession.run(NGSession.java:319)
> > > > 2 i test my branch which updated hive version to org.apache.hive
> 0.13.1
> > > > it run successfully when use a bag of 3rd jars as dependency but
> > throw
> > > > error using assembly jar, it seems assembly jar lead to conflict
> > > > ERROR DDLTask: java.lang.NoSuchFieldError: doubleTypeInfo
> > > > at org.apache.hadoop.hive.ql.io.parquet.serde.
> > > > ArrayWritableObjectInspector.getObjectInspector(
> > > > ArrayWritableObjectInspector.java:66)
> > > > at org.apache.hadoop.hive.ql.io.parquet.serde.
> > > >
> >
> ArrayWritableObjectInspector.(ArrayWritableObjectInspector.java:59)
> > > > at org.apache.hadoop.hive.ql.io.parquet.serde.
> > > > ParquetHiveSerDe.initialize(ParquetHiveSerDe.java:113)
> > > > at org.apache.hadoop.hive.metastore.MetaStoreUtils.
> > > > getDeserializer(MetaStoreUtils.java:339)
> > > > at org.apache.hadoop.hive.ql.metadata.Table.
> > > > getDeserializerFromMetaStore(Table.java:283)
> > > > at org

Re: hey spark developers! intro from shane knapp, devops engineer @ AMPLab

2014-09-02 Thread Henry Saputra
Welcome Shane =)


- Henry

On Tue, Sep 2, 2014 at 10:35 AM, shane knapp  wrote:
> so, i had a meeting w/the databricks guys on friday and they recommended i
> send an email out to the list to say 'hi' and give you guys a quick intro.
>  :)
>
> hi!  i'm shane knapp, the new AMPLab devops engineer, and will be spending
> time getting the jenkins build infrastructure up to production quality.
>  much of this will be 'under the covers' work, like better system level
> auth, backups, etc, but some will definitely be user facing:  timely
> jenkins updates, debugging broken build infrastructure and some plugin
> support.
>
> i've been working in the bay area now since 1997 at many different
> companies, and my last 10 years has been split between google and palantir.
>  i'm a huge proponent of OSS, and am really happy to be able to help with
> the work you guys are doing!
>
> if anyone has any requests/questions/comments, feel free to drop me a line!
>
> shane




Re: hey spark developers! intro from shane knapp, devops engineer @ AMPLab

2014-09-02 Thread Cheng Lian
Welcome Shane! Glad to see that finally a hero jumping out to tame Jenkins
:)


On Tue, Sep 2, 2014 at 12:44 PM, Henry Saputra 
wrote:

> Welcome Shane =)
>
>
> - Henry
>
> On Tue, Sep 2, 2014 at 10:35 AM, shane knapp  wrote:
> > so, i had a meeting w/the databricks guys on friday and they recommended
> i
> > send an email out to the list to say 'hi' and give you guys a quick
> intro.
> >  :)
> >
> > hi!  i'm shane knapp, the new AMPLab devops engineer, and will be
> spending
> > time getting the jenkins build infrastructure up to production quality.
> >  much of this will be 'under the covers' work, like better system level
> > auth, backups, etc, but some will definitely be user facing:  timely
> > jenkins updates, debugging broken build infrastructure and some plugin
> > support.
> >
> > i've been working in the bay area now since 1997 at many different
> > companies, and my last 10 years has been split between google and
> palantir.
> >  i'm a huge proponent of OSS, and am really happy to be able to help with
> > the work you guys are doing!
> >
> > if anyone has any requests/questions/comments, feel free to drop me a
> line!
> >
> > shane
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>
>


Re: [VOTE] Release Apache Spark 1.1.0 (RC3)

2014-09-02 Thread Will Benton
+1

Tested Scala/MLlib apps on Fedora 20 (OpenJDK 7) and OS X 10.9 (Oracle JDK 8).


best,
wb


- Original Message -
> From: "Patrick Wendell" 
> To: dev@spark.apache.org
> Sent: Saturday, August 30, 2014 5:07:52 PM
> Subject: [VOTE] Release Apache Spark 1.1.0 (RC3)
> 
> Please vote on releasing the following candidate as Apache Spark version
> 1.1.0!
> 
> The tag to be voted on is v1.1.0-rc3 (commit b2d0493b):
> https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=b2d0493b223c5f98a593bb6d7372706cc02bebad
> 
> The release files, including signatures, digests, etc. can be found at:
> http://people.apache.org/~pwendell/spark-1.1.0-rc3/
> 
> Release artifacts are signed with the following key:
> https://people.apache.org/keys/committer/pwendell.asc
> 
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1030/
> 
> The documentation corresponding to this release can be found at:
> http://people.apache.org/~pwendell/spark-1.1.0-rc3-docs/
> 
> Please vote on releasing this package as Apache Spark 1.1.0!
> 
> The vote is open until Tuesday, September 02, at 23:07 UTC and passes if
> a majority of at least 3 +1 PMC votes are cast.
> 
> [ ] +1 Release this package as Apache Spark 1.1.0
> [ ] -1 Do not release this package because ...
> 
> To learn more about Apache Spark, please see
> http://spark.apache.org/
> 
> == Regressions fixed since RC1 ==
> - Build issue for SQL support:
> https://issues.apache.org/jira/browse/SPARK-3234
> - EC2 script version bump to 1.1.0.
> 
> == What justifies a -1 vote for this release? ==
> This vote is happening very late into the QA period compared with
> previous votes, so -1 votes should only occur for significant
> regressions from 1.0.2. Bugs already present in 1.0.X will not block
> this release.
> 
> == What default changes should I be aware of? ==
> 1. The default value of "spark.io.compression.codec" is now "snappy"
> --> Old behavior can be restored by switching to "lzf"
> 
> 2. PySpark now performs external spilling during aggregations.
> --> Old behavior can be restored by setting "spark.shuffle.spill" to "false".
> 
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
> 
> 

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org
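
For anyone acting on the two default changes listed in the quoted vote email above, here is a minimal sketch of restoring the 1.0.x behavior from application code; the app name and the trivial job are illustrative only, and the same properties can also be set in spark-defaults.conf:

import org.apache.spark.{SparkConf, SparkContext}

// Opts back into the pre-1.1.0 defaults described in the release notes above.
object LegacyDefaultsExample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("legacy-defaults-example")      // illustrative
      .set("spark.io.compression.codec", "lzf")   // 1.1.0 default is "snappy"
      .set("spark.shuffle.spill", "false")        // 1.1.0 spills externally by default

    val sc = new SparkContext(conf)
    try {
      // Any job run on this context picks up the overridden settings.
      println(sc.parallelize(1 to 100).reduce(_ + _))
    } finally {
      sc.stop()
    }
  }
}

Note that the properties have to be in place before the SparkContext is created; modifying the SparkConf afterwards has no effect on the running context.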



Re: [VOTE] Release Apache Spark 1.1.0 (RC3)

2014-09-02 Thread Cheng Lian
+1

   - Tested Thrift server and SQL CLI locally on OSX 10.9.
   - Checked datanucleus dependencies in distribution tarball built by
   make-distribution.sh without SPARK_HIVE defined.

​


On Tue, Sep 2, 2014 at 2:30 PM, Will Benton  wrote:

> +1
>
> Tested Scala/MLlib apps on Fedora 20 (OpenJDK 7) and OS X 10.9 (Oracle JDK
> 8).
>
>
> best,
> wb
>
>
> - Original Message -
> > From: "Patrick Wendell" 
> > To: dev@spark.apache.org
> > Sent: Saturday, August 30, 2014 5:07:52 PM
> > Subject: [VOTE] Release Apache Spark 1.1.0 (RC3)
> >
> > Please vote on releasing the following candidate as Apache Spark version
> > 1.1.0!
> >
> > The tag to be voted on is v1.1.0-rc3 (commit b2d0493b):
> >
> https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=b2d0493b223c5f98a593bb6d7372706cc02bebad
> >
> > The release files, including signatures, digests, etc. can be found at:
> > http://people.apache.org/~pwendell/spark-1.1.0-rc3/
> >
> > Release artifacts are signed with the following key:
> > https://people.apache.org/keys/committer/pwendell.asc
> >
> > The staging repository for this release can be found at:
> > https://repository.apache.org/content/repositories/orgapachespark-1030/
> >
> > The documentation corresponding to this release can be found at:
> > http://people.apache.org/~pwendell/spark-1.1.0-rc3-docs/
> >
> > Please vote on releasing this package as Apache Spark 1.1.0!
> >
> > The vote is open until Tuesday, September 02, at 23:07 UTC and passes if
> > a majority of at least 3 +1 PMC votes are cast.
> >
> > [ ] +1 Release this package as Apache Spark 1.1.0
> > [ ] -1 Do not release this package because ...
> >
> > To learn more about Apache Spark, please see
> > http://spark.apache.org/
> >
> > == Regressions fixed since RC1 ==
> > - Build issue for SQL support:
> > https://issues.apache.org/jira/browse/SPARK-3234
> > - EC2 script version bump to 1.1.0.
> >
> > == What justifies a -1 vote for this release? ==
> > This vote is happening very late into the QA period compared with
> > previous votes, so -1 votes should only occur for significant
> > regressions from 1.0.2. Bugs already present in 1.0.X will not block
> > this release.
> >
> > == What default changes should I be aware of? ==
> > 1. The default value of "spark.io.compression.codec" is now "snappy"
> > --> Old behavior can be restored by switching to "lzf"
> >
> > 2. PySpark now performs external spilling during aggregations.
> > --> Old behavior can be restored by setting "spark.shuffle.spill" to
> "false".
> >
> > -
> > To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> > For additional commands, e-mail: dev-h...@spark.apache.org
> >
> >
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>
>


Checkpointing Pregel

2014-09-02 Thread Jeffrey Picard
Hey guys,

I’m trying to run connected components on graphs that end up running for a 
fairly large number of iterations (25-30) and take 5-6 hours. I find more than 
half the time I end up getting fetch failures and losing an executor after a 
number of iterations. Then it has to go back and recompute pieces that it lost, 
which don’t seem to be getting persisted at the same level as the graph so 
those iterations take exponentially longer and I have to kill the job because 
it’s not worth waiting for it to finish.

The approach I’m currently trying is checkpointing the vertices and edges (and 
maybe the messages?) in Pregel. What I’ve been testing with so far is the below 
patch, which seems to be working (actually I haven’t had any failures since I 
added this change, so I don’t know if I did get one if it would recompute from 
the start or not) but I’m also seeing things like 5 instances of VertexRDDs 
being persisted all at the same time and “reduce at VertexRDD.scala:111” runs 
twice each time. I was wondering if this is the proper / most efficient way of 
doing this checkpointing, and if not what would work better?

diff --git a/graphx/src/main/scala/org/apache/spark/graphx/Pregel.scala b/graphx/src/main/scala/org/apache/spark/graphx/Pregel.scala
index 5e55620..5be40c3 100644
--- a/graphx/src/main/scala/org/apache/spark/graphx/Pregel.scala
+++ b/graphx/src/main/scala/org/apache/spark/graphx/Pregel.scala
@@ -134,6 +134,11 @@ object Pregel extends Logging {
       g = g.outerJoinVertices(newVerts) { (vid, old, newOpt) => newOpt.getOrElse(old) }
       g.cache()
 
+      g.vertices.checkpoint()
+      g.vertices.count()
+      g.edges.checkpoint()
+      g.edges.count()
+
       val oldMessages = messages
       // Send new messages. Vertices that didn't get any messages don't appear in newVerts, so don't
       // get to send messages. We must cache messages so it can be materialized on the next line,
@@ -142,6 +147,7 @@ object Pregel extends Logging {
       // The call to count() materializes `messages`, `newVerts`, and the vertices of `g`. This
       // hides oldMessages (depended on by newVerts), newVerts (depended on by messages), and the
       // vertices of prevG (depended on by newVerts, oldMessages, and the vertices of g).
+      messages.checkpoint()
       activeMessages = messages.count()
 
       logInfo("Pregel finished iteration " + i)

Best Regards,

Jeffrey Picard


signature.asc
Description: Message signed with OpenPGP using GPGMail
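
One detail the patch above relies on but does not show: RDD.checkpoint() only works after a checkpoint directory has been set on the SparkContext, otherwise the call fails. A minimal driver-side sketch of that setup (the paths are illustrative, and connectedComponents() stands in for whatever Pregel-based job is being run):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.GraphLoader

object CheckpointedConnectedComponents {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("cc-with-checkpointing"))

    // Must be called before any RDD.checkpoint(); use a reliable store such as
    // HDFS so recovery after executor loss restarts from the checkpointed data
    // instead of from iteration 0.
    sc.setCheckpointDir("hdfs:///tmp/spark-checkpoints")

    val graph = GraphLoader.edgeListFile(sc, "hdfs:///data/edges.txt").cache()
    val cc = graph.connectedComponents()  // drives Pregel internally
    println(cc.vertices.count())

    sc.stop()
  }
}

Checkpointing every iteration truncates the lineage but adds a write to stable storage per superstep; checkpointing only every few iterations is a common compromise and may be enough to avoid the long recomputations described above.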


Ask something about spark

2014-09-02 Thread Sanghoon Lee
Hi, I am phoenixlee, a Spark programmer in Korea.

I have a good opportunity this time to teach Spark to college students and
office workers. The course will be run with the support of the government.
Can I use the material (pictures, samples, etc.) on the Spark homepage for
this course? Of course, I will include an acknowledgement and the webpage
URL. It would be a great opportunity, since I found that there are still no
Spark teaching materials or Spark education (or community) in Korea.

Thanks.


Re: Ask something about spark

2014-09-02 Thread Reynold Xin
I think in general that is fine. It would be great if your slides come with
proper attribution.


On Tue, Sep 2, 2014 at 3:31 PM, Sanghoon Lee  wrote:

> Hi, I am phoenixlee, a Spark programmer in Korea.
>
> I have a good opportunity this time to teach Spark to college students and
> office workers. The course will be run with the support of the government.
> Can I use the material (pictures, samples, etc.) on the Spark homepage for
> this course? Of course, I will include an acknowledgement and the webpage
> URL. It would be a great opportunity, since I found that there are still no
> Spark teaching materials or Spark education (or community) in Korea.
>
> Thanks.
>


Re: [VOTE] Release Apache Spark 1.1.0 (RC3)

2014-09-02 Thread Reynold Xin
+1


On Tue, Sep 2, 2014 at 3:08 PM, Cheng Lian  wrote:

> +1
>
>- Tested Thrift server and SQL CLI locally on OSX 10.9.
>- Checked datanucleus dependencies in distribution tarball built by
>make-distribution.sh without SPARK_HIVE defined.
>
> ​
>
>
> On Tue, Sep 2, 2014 at 2:30 PM, Will Benton  wrote:
>
> > +1
> >
> > Tested Scala/MLlib apps on Fedora 20 (OpenJDK 7) and OS X 10.9 (Oracle
> JDK
> > 8).
> >
> >
> > best,
> > wb
> >
> >
> > - Original Message -
> > > From: "Patrick Wendell" 
> > > To: dev@spark.apache.org
> > > Sent: Saturday, August 30, 2014 5:07:52 PM
> > > Subject: [VOTE] Release Apache Spark 1.1.0 (RC3)
> > >
> > > Please vote on releasing the following candidate as Apache Spark
> version
> > > 1.1.0!
> > >
> > > The tag to be voted on is v1.1.0-rc3 (commit b2d0493b):
> > >
> >
> https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=b2d0493b223c5f98a593bb6d7372706cc02bebad
> > >
> > > The release files, including signatures, digests, etc. can be found at:
> > > http://people.apache.org/~pwendell/spark-1.1.0-rc3/
> > >
> > > Release artifacts are signed with the following key:
> > > https://people.apache.org/keys/committer/pwendell.asc
> > >
> > > The staging repository for this release can be found at:
> > >
> https://repository.apache.org/content/repositories/orgapachespark-1030/
> > >
> > > The documentation corresponding to this release can be found at:
> > > http://people.apache.org/~pwendell/spark-1.1.0-rc3-docs/
> > >
> > > Please vote on releasing this package as Apache Spark 1.1.0!
> > >
> > > The vote is open until Tuesday, September 02, at 23:07 UTC and passes
> if
> > > a majority of at least 3 +1 PMC votes are cast.
> > >
> > > [ ] +1 Release this package as Apache Spark 1.1.0
> > > [ ] -1 Do not release this package because ...
> > >
> > > To learn more about Apache Spark, please see
> > > http://spark.apache.org/
> > >
> > > == Regressions fixed since RC1 ==
> > > - Build issue for SQL support:
> > > https://issues.apache.org/jira/browse/SPARK-3234
> > > - EC2 script version bump to 1.1.0.
> > >
> > > == What justifies a -1 vote for this release? ==
> > > This vote is happening very late into the QA period compared with
> > > previous votes, so -1 votes should only occur for significant
> > > regressions from 1.0.2. Bugs already present in 1.0.X will not block
> > > this release.
> > >
> > > == What default changes should I be aware of? ==
> > > 1. The default value of "spark.io.compression.codec" is now "snappy"
> > > --> Old behavior can be restored by switching to "lzf"
> > >
> > > 2. PySpark now performs external spilling during aggregations.
> > > --> Old behavior can be restored by setting "spark.shuffle.spill" to
> > "false".
> > >
> > > -
> > > To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> > > For additional commands, e-mail: dev-h...@spark.apache.org
> > >
> > >
> >
> > -
> > To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> > For additional commands, e-mail: dev-h...@spark.apache.org
> >
> >
>


Re: [VOTE] Release Apache Spark 1.1.0 (RC3)

2014-09-02 Thread Kan Zhang
+1

Verified PySpark InputFormat/OutputFormat examples.


On Tue, Sep 2, 2014 at 4:10 PM, Reynold Xin  wrote:

> +1
>
>
> On Tue, Sep 2, 2014 at 3:08 PM, Cheng Lian  wrote:
>
> > +1
> >
> >- Tested Thrift server and SQL CLI locally on OSX 10.9.
> >- Checked datanucleus dependencies in distribution tarball built by
> >make-distribution.sh without SPARK_HIVE defined.
> >
> > ​
> >
> >
> > On Tue, Sep 2, 2014 at 2:30 PM, Will Benton  wrote:
> >
> > > +1
> > >
> > > Tested Scala/MLlib apps on Fedora 20 (OpenJDK 7) and OS X 10.9 (Oracle
> > JDK
> > > 8).
> > >
> > >
> > > best,
> > > wb
> > >
> > >
> > > - Original Message -
> > > > From: "Patrick Wendell" 
> > > > To: dev@spark.apache.org
> > > > Sent: Saturday, August 30, 2014 5:07:52 PM
> > > > Subject: [VOTE] Release Apache Spark 1.1.0 (RC3)
> > > >
> > > > Please vote on releasing the following candidate as Apache Spark
> > version
> > > > 1.1.0!
> > > >
> > > > The tag to be voted on is v1.1.0-rc3 (commit b2d0493b):
> > > >
> > >
> >
> https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=b2d0493b223c5f98a593bb6d7372706cc02bebad
> > > >
> > > > The release files, including signatures, digests, etc. can be found
> at:
> > > > http://people.apache.org/~pwendell/spark-1.1.0-rc3/
> > > >
> > > > Release artifacts are signed with the following key:
> > > > https://people.apache.org/keys/committer/pwendell.asc
> > > >
> > > > The staging repository for this release can be found at:
> > > >
> > https://repository.apache.org/content/repositories/orgapachespark-1030/
> > > >
> > > > The documentation corresponding to this release can be found at:
> > > > http://people.apache.org/~pwendell/spark-1.1.0-rc3-docs/
> > > >
> > > > Please vote on releasing this package as Apache Spark 1.1.0!
> > > >
> > > > The vote is open until Tuesday, September 02, at 23:07 UTC and passes
> > if
> > > > a majority of at least 3 +1 PMC votes are cast.
> > > >
> > > > [ ] +1 Release this package as Apache Spark 1.1.0
> > > > [ ] -1 Do not release this package because ...
> > > >
> > > > To learn more about Apache Spark, please see
> > > > http://spark.apache.org/
> > > >
> > > > == Regressions fixed since RC1 ==
> > > > - Build issue for SQL support:
> > > > https://issues.apache.org/jira/browse/SPARK-3234
> > > > - EC2 script version bump to 1.1.0.
> > > >
> > > > == What justifies a -1 vote for this release? ==
> > > > This vote is happening very late into the QA period compared with
> > > > previous votes, so -1 votes should only occur for significant
> > > > regressions from 1.0.2. Bugs already present in 1.0.X will not block
> > > > this release.
> > > >
> > > > == What default changes should I be aware of? ==
> > > > 1. The default value of "spark.io.compression.codec" is now "snappy"
> > > > --> Old behavior can be restored by switching to "lzf"
> > > >
> > > > 2. PySpark now performs external spilling during aggregations.
> > > > --> Old behavior can be restored by setting "spark.shuffle.spill" to
> > > "false".
> > > >
> > > > -
> > > > To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> > > > For additional commands, e-mail: dev-h...@spark.apache.org
> > > >
> > > >
> > >
> > > -
> > > To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> > > For additional commands, e-mail: dev-h...@spark.apache.org
> > >
> > >
> >
>


quick jenkins restart

2014-09-02 Thread shane knapp
since our queue is really short, i'm waiting for a couple of builds to
finish and will be restarting jenkins to install/update some plugins.  the
github pull request builder looks like it has some fixes to reduce spammy
github calls, and reduce any potential rate limiting.

i'll let everyone know when it's back up...  this should be super quick
(~15 mins for tests to finish, ~2 mins for jenkins to restart).

thanks in advance!

shane


Re: [VOTE] Release Apache Spark 1.1.0 (RC3)

2014-09-02 Thread Matei Zaharia
+1

Tested on Mac OS X.

Matei

On September 2, 2014 at 5:03:19 PM, Kan Zhang (kzh...@apache.org) wrote:

+1  

Verified PySpark InputFormat/OutputFormat examples.  


On Tue, Sep 2, 2014 at 4:10 PM, Reynold Xin  wrote:  

> +1  
>  
>  
> On Tue, Sep 2, 2014 at 3:08 PM, Cheng Lian  wrote:  
>  
> > +1  
> >  
> > - Tested Thrift server and SQL CLI locally on OSX 10.9.  
> > - Checked datanucleus dependencies in distribution tarball built by  
> > make-distribution.sh without SPARK_HIVE defined.  
> >  
> > ​  
> >  
> >  
> > On Tue, Sep 2, 2014 at 2:30 PM, Will Benton  wrote:  
> >  
> > > +1  
> > >  
> > > Tested Scala/MLlib apps on Fedora 20 (OpenJDK 7) and OS X 10.9 (Oracle  
> > JDK  
> > > 8).  
> > >  
> > >  
> > > best,  
> > > wb  
> > >  
> > >  
> > > - Original Message -  
> > > > From: "Patrick Wendell"   
> > > > To: dev@spark.apache.org  
> > > > Sent: Saturday, August 30, 2014 5:07:52 PM  
> > > > Subject: [VOTE] Release Apache Spark 1.1.0 (RC3)  
> > > >  
> > > > Please vote on releasing the following candidate as Apache Spark  
> > version  
> > > > 1.1.0!  
> > > >  
> > > > The tag to be voted on is v1.1.0-rc3 (commit b2d0493b):  
> > > >  
> > >  
> >  
> https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=b2d0493b223c5f98a593bb6d7372706cc02bebad
>   
> > > >  
> > > > The release files, including signatures, digests, etc. can be found  
> at:  
> > > > http://people.apache.org/~pwendell/spark-1.1.0-rc3/  
> > > >  
> > > > Release artifacts are signed with the following key:  
> > > > https://people.apache.org/keys/committer/pwendell.asc  
> > > >  
> > > > The staging repository for this release can be found at:  
> > > >  
> > https://repository.apache.org/content/repositories/orgapachespark-1030/  
> > > >  
> > > > The documentation corresponding to this release can be found at:  
> > > > http://people.apache.org/~pwendell/spark-1.1.0-rc3-docs/  
> > > >  
> > > > Please vote on releasing this package as Apache Spark 1.1.0!  
> > > >  
> > > > The vote is open until Tuesday, September 02, at 23:07 UTC and passes  
> > if  
> > > > a majority of at least 3 +1 PMC votes are cast.  
> > > >  
> > > > [ ] +1 Release this package as Apache Spark 1.1.0  
> > > > [ ] -1 Do not release this package because ...  
> > > >  
> > > > To learn more about Apache Spark, please see  
> > > > http://spark.apache.org/  
> > > >  
> > > > == Regressions fixed since RC1 ==  
> > > > - Build issue for SQL support:  
> > > > https://issues.apache.org/jira/browse/SPARK-3234  
> > > > - EC2 script version bump to 1.1.0.  
> > > >  
> > > > == What justifies a -1 vote for this release? ==  
> > > > This vote is happening very late into the QA period compared with  
> > > > previous votes, so -1 votes should only occur for significant  
> > > > regressions from 1.0.2. Bugs already present in 1.0.X will not block  
> > > > this release.  
> > > >  
> > > > == What default changes should I be aware of? ==  
> > > > 1. The default value of "spark.io.compression.codec" is now "snappy"  
> > > > --> Old behavior can be restored by switching to "lzf"  
> > > >  
> > > > 2. PySpark now performs external spilling during aggregations.  
> > > > --> Old behavior can be restored by setting "spark.shuffle.spill" to  
> > > "false".  
> > > >  
> > > > -  
> > > > To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org  
> > > > For additional commands, e-mail: dev-h...@spark.apache.org  
> > > >  
> > > >  
> > >  
> > > -  
> > > To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org  
> > > For additional commands, e-mail: dev-h...@spark.apache.org  
> > >  
> > >  
> >  
>  


Re: [VOTE] Release Apache Spark 1.1.0 (RC3)

2014-09-02 Thread Michael Armbrust
+1


On Tue, Sep 2, 2014 at 5:18 PM, Matei Zaharia 
wrote:

> +1
>
> Tested on Mac OS X.
>
> Matei
>
> On September 2, 2014 at 5:03:19 PM, Kan Zhang (kzh...@apache.org) wrote:
>
> +1
>
> Verified PySpark InputFormat/OutputFormat examples.
>
>
> On Tue, Sep 2, 2014 at 4:10 PM, Reynold Xin  wrote:
>
> > +1
> >
> >
> > On Tue, Sep 2, 2014 at 3:08 PM, Cheng Lian 
> wrote:
> >
> > > +1
> > >
> > > - Tested Thrift server and SQL CLI locally on OSX 10.9.
> > > - Checked datanucleus dependencies in distribution tarball built by
> > > make-distribution.sh without SPARK_HIVE defined.
> > >
> > > ​
> > >
> > >
> > > On Tue, Sep 2, 2014 at 2:30 PM, Will Benton  wrote:
> > >
> > > > +1
> > > >
> > > > Tested Scala/MLlib apps on Fedora 20 (OpenJDK 7) and OS X 10.9
> (Oracle
> > > JDK
> > > > 8).
> > > >
> > > >
> > > > best,
> > > > wb
> > > >
> > > >
> > > > - Original Message -
> > > > > From: "Patrick Wendell" 
> > > > > To: dev@spark.apache.org
> > > > > Sent: Saturday, August 30, 2014 5:07:52 PM
> > > > > Subject: [VOTE] Release Apache Spark 1.1.0 (RC3)
> > > > >
> > > > > Please vote on releasing the following candidate as Apache Spark
> > > version
> > > > > 1.1.0!
> > > > >
> > > > > The tag to be voted on is v1.1.0-rc3 (commit b2d0493b):
> > > > >
> > > >
> > >
> >
> https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=b2d0493b223c5f98a593bb6d7372706cc02bebad
> > > > >
> > > > > The release files, including signatures, digests, etc. can be found
> > at:
> > > > > http://people.apache.org/~pwendell/spark-1.1.0-rc3/
> > > > >
> > > > > Release artifacts are signed with the following key:
> > > > > https://people.apache.org/keys/committer/pwendell.asc
> > > > >
> > > > > The staging repository for this release can be found at:
> > > > >
> > >
> https://repository.apache.org/content/repositories/orgapachespark-1030/
> > > > >
> > > > > The documentation corresponding to this release can be found at:
> > > > > http://people.apache.org/~pwendell/spark-1.1.0-rc3-docs/
> > > > >
> > > > > Please vote on releasing this package as Apache Spark 1.1.0!
> > > > >
> > > > > The vote is open until Tuesday, September 02, at 23:07 UTC and
> passes
> > > if
> > > > > a majority of at least 3 +1 PMC votes are cast.
> > > > >
> > > > > [ ] +1 Release this package as Apache Spark 1.1.0
> > > > > [ ] -1 Do not release this package because ...
> > > > >
> > > > > To learn more about Apache Spark, please see
> > > > > http://spark.apache.org/
> > > > >
> > > > > == Regressions fixed since RC1 ==
> > > > > - Build issue for SQL support:
> > > > > https://issues.apache.org/jira/browse/SPARK-3234
> > > > > - EC2 script version bump to 1.1.0.
> > > > >
> > > > > == What justifies a -1 vote for this release? ==
> > > > > This vote is happening very late into the QA period compared with
> > > > > previous votes, so -1 votes should only occur for significant
> > > > > regressions from 1.0.2. Bugs already present in 1.0.X will not
> block
> > > > > this release.
> > > > >
> > > > > == What default changes should I be aware of? ==
> > > > > 1. The default value of "spark.io.compression.codec" is now
> "snappy"
> > > > > --> Old behavior can be restored by switching to "lzf"
> > > > >
> > > > > 2. PySpark now performs external spilling during aggregations.
> > > > > --> Old behavior can be restored by setting "spark.shuffle.spill"
> to
> > > > "false".
> > > > >
> > > > >
> -
> > > > > To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> > > > > For additional commands, e-mail: dev-h...@spark.apache.org
> > > > >
> > > > >
> > > >
> > > > -
> > > > To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> > > > For additional commands, e-mail: dev-h...@spark.apache.org
> > > >
> > > >
> > >
> >
>


Re: [VOTE] Release Apache Spark 1.1.0 (RC3)

2014-09-02 Thread Denny Lee
+1  Tested on Mac OSX, Thrift Server, SparkSQL


On September 2, 2014 at 17:29:29, Michael Armbrust (mich...@databricks.com) 
wrote:

+1  


On Tue, Sep 2, 2014 at 5:18 PM, Matei Zaharia   
wrote:  

> +1  
>  
> Tested on Mac OS X.  
>  
> Matei  
>  
> On September 2, 2014 at 5:03:19 PM, Kan Zhang (kzh...@apache.org) wrote:  
>  
> +1  
>  
> Verified PySpark InputFormat/OutputFormat examples.  
>  
>  
> On Tue, Sep 2, 2014 at 4:10 PM, Reynold Xin  wrote:  
>  
> > +1  
> >  
> >  
> > On Tue, Sep 2, 2014 at 3:08 PM, Cheng Lian   
> wrote:  
> >  
> > > +1  
> > >  
> > > - Tested Thrift server and SQL CLI locally on OSX 10.9.  
> > > - Checked datanucleus dependencies in distribution tarball built by  
> > > make-distribution.sh without SPARK_HIVE defined.  
> > >  
> > > ​  
> > >  
> > >  
> > > On Tue, Sep 2, 2014 at 2:30 PM, Will Benton  wrote:  
> > >  
> > > > +1  
> > > >  
> > > > Tested Scala/MLlib apps on Fedora 20 (OpenJDK 7) and OS X 10.9  
> (Oracle  
> > > JDK  
> > > > 8).  
> > > >  
> > > >  
> > > > best,  
> > > > wb  
> > > >  
> > > >  
> > > > - Original Message -  
> > > > > From: "Patrick Wendell"   
> > > > > To: dev@spark.apache.org  
> > > > > Sent: Saturday, August 30, 2014 5:07:52 PM  
> > > > > Subject: [VOTE] Release Apache Spark 1.1.0 (RC3)  
> > > > >  
> > > > > Please vote on releasing the following candidate as Apache Spark  
> > > version  
> > > > > 1.1.0!  
> > > > >  
> > > > > The tag to be voted on is v1.1.0-rc3 (commit b2d0493b):  
> > > > >  
> > > >  
> > >  
> >  
> https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=b2d0493b223c5f98a593bb6d7372706cc02bebad
>   
> > > > >  
> > > > > The release files, including signatures, digests, etc. can be found  
> > at:  
> > > > > http://people.apache.org/~pwendell/spark-1.1.0-rc3/  
> > > > >  
> > > > > Release artifacts are signed with the following key:  
> > > > > https://people.apache.org/keys/committer/pwendell.asc  
> > > > >  
> > > > > The staging repository for this release can be found at:  
> > > > >  
> > >  
> https://repository.apache.org/content/repositories/orgapachespark-1030/  
> > > > >  
> > > > > The documentation corresponding to this release can be found at:  
> > > > > http://people.apache.org/~pwendell/spark-1.1.0-rc3-docs/  
> > > > >  
> > > > > Please vote on releasing this package as Apache Spark 1.1.0!  
> > > > >  
> > > > > The vote is open until Tuesday, September 02, at 23:07 UTC and  
> passes  
> > > if  
> > > > > a majority of at least 3 +1 PMC votes are cast.  
> > > > >  
> > > > > [ ] +1 Release this package as Apache Spark 1.1.0  
> > > > > [ ] -1 Do not release this package because ...  
> > > > >  
> > > > > To learn more about Apache Spark, please see  
> > > > > http://spark.apache.org/  
> > > > >  
> > > > > == Regressions fixed since RC1 ==  
> > > > > - Build issue for SQL support:  
> > > > > https://issues.apache.org/jira/browse/SPARK-3234  
> > > > > - EC2 script version bump to 1.1.0.  
> > > > >  
> > > > > == What justifies a -1 vote for this release? ==  
> > > > > This vote is happening very late into the QA period compared with  
> > > > > previous votes, so -1 votes should only occur for significant  
> > > > > regressions from 1.0.2. Bugs already present in 1.0.X will not  
> block  
> > > > > this release.  
> > > > >  
> > > > > == What default changes should I be aware of? ==  
> > > > > 1. The default value of "spark.io.compression.codec" is now  
> "snappy"  
> > > > > --> Old behavior can be restored by switching to "lzf"  
> > > > >  
> > > > > 2. PySpark now performs external spilling during aggregations.  
> > > > > --> Old behavior can be restored by setting "spark.shuffle.spill"  
> to  
> > > > "false".  
> > > > >  
> > > > >  
> -  
> > > > > To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org  
> > > > > For additional commands, e-mail: dev-h...@spark.apache.org  
> > > > >  
> > > > >  
> > > >  
> > > > -  
> > > > To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org  
> > > > For additional commands, e-mail: dev-h...@spark.apache.org  
> > > >  
> > > >  
> > >  
> >  
>  


RE: [VOTE] Release Apache Spark 1.1.0 (RC3)

2014-09-02 Thread Sean McNamara
+1

From: Patrick Wendell [pwend...@gmail.com]
Sent: Saturday, August 30, 2014 4:08 PM
To: dev@spark.apache.org
Subject: [VOTE] Release Apache Spark 1.1.0 (RC3)

Please vote on releasing the following candidate as Apache Spark version 1.1.0!

The tag to be voted on is v1.1.0-rc3 (commit b2d0493b):
https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=b2d0493b223c5f98a593bb6d7372706cc02bebad

The release files, including signatures, digests, etc. can be found at:
http://people.apache.org/~pwendell/spark-1.1.0-rc3/

Release artifacts are signed with the following key:
https://people.apache.org/keys/committer/pwendell.asc

The staging repository for this release can be found at:
https://repository.apache.org/content/repositories/orgapachespark-1030/

The documentation corresponding to this release can be found at:
http://people.apache.org/~pwendell/spark-1.1.0-rc3-docs/

Please vote on releasing this package as Apache Spark 1.1.0!

The vote is open until Tuesday, September 02, at 23:07 UTC and passes if
a majority of at least 3 +1 PMC votes are cast.

[ ] +1 Release this package as Apache Spark 1.1.0
[ ] -1 Do not release this package because ...

To learn more about Apache Spark, please see
http://spark.apache.org/

== Regressions fixed since RC1 ==
- Build issue for SQL support: https://issues.apache.org/jira/browse/SPARK-3234
- EC2 script version bump to 1.1.0.

== What justifies a -1 vote for this release? ==
This vote is happening very late into the QA period compared with
previous votes, so -1 votes should only occur for significant
regressions from 1.0.2. Bugs already present in 1.0.X will not block
this release.

== What default changes should I be aware of? ==
1. The default value of "spark.io.compression.codec" is now "snappy"
--> Old behavior can be restored by switching to "lzf"

2. PySpark now performs external spilling during aggregations.
--> Old behavior can be restored by setting "spark.shuffle.spill" to "false".

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org


-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



RE: [VOTE] Release Apache Spark 1.1.0 (RC3)

2014-09-02 Thread Jeremy Freeman
+1



--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/VOTE-Release-Apache-Spark-1-1-0-RC3-tp8147p8211.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: quick jenkins restart

2014-09-02 Thread shane knapp
and we're back and building!


On Tue, Sep 2, 2014 at 5:07 PM, shane knapp  wrote:

> since our queue is really short, i'm waiting for a couple of builds to
> finish and will be restarting jenkins to install/update some plugins.  the
> github pull request builder looks like it has some fixes to reduce spammy
> github calls, and reduce any potential rate limiting.
>
> i'll let everyone know when it's back up...  this should be super quick
> (~15 mins for tests to finish, ~2 mins for jenkins to restart).
>
> thanks in advance!
>
> shane
>


Re: [VOTE] Release Apache Spark 1.1.0 (RC3)

2014-09-02 Thread Paolo Platter
+1
Tested on HDP 2.1 Sandbox, Thrift Server with Simba Shark ODBC

Paolo

From: Jeremy Freeman
Sent: Wednesday, 3 September 2014 02:34
To: d...@spark.incubator.apache.org

+1



--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/VOTE-Release-Apache-Spark-1-1-0-RC3-tp8147p8211.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: [VOTE] Release Apache Spark 1.1.0 (RC3)

2014-09-02 Thread Nicholas Chammas
In light of the discussion on SPARK-, I'll revoke my "-1" vote. The
issue does not appear to be serious.


On Sun, Aug 31, 2014 at 5:14 PM, Nicholas Chammas <
nicholas.cham...@gmail.com> wrote:

> -1: I believe I've found a regression from 1.0.2. The report is captured
> in SPARK- .
>
>
> On Sat, Aug 30, 2014 at 6:07 PM, Patrick Wendell 
> wrote:
>
>> Please vote on releasing the following candidate as Apache Spark version
>> 1.1.0!
>>
>> The tag to be voted on is v1.1.0-rc3 (commit b2d0493b):
>>
>> https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=b2d0493b223c5f98a593bb6d7372706cc02bebad
>>
>> The release files, including signatures, digests, etc. can be found at:
>> http://people.apache.org/~pwendell/spark-1.1.0-rc3/
>>
>> Release artifacts are signed with the following key:
>> https://people.apache.org/keys/committer/pwendell.asc
>>
>> The staging repository for this release can be found at:
>> https://repository.apache.org/content/repositories/orgapachespark-1030/
>>
>> The documentation corresponding to this release can be found at:
>> http://people.apache.org/~pwendell/spark-1.1.0-rc3-docs/
>>
>> Please vote on releasing this package as Apache Spark 1.1.0!
>>
>> The vote is open until Tuesday, September 02, at 23:07 UTC and passes if
>> a majority of at least 3 +1 PMC votes are cast.
>>
>> [ ] +1 Release this package as Apache Spark 1.1.0
>> [ ] -1 Do not release this package because ...
>>
>> To learn more about Apache Spark, please see
>> http://spark.apache.org/
>>
>> == Regressions fixed since RC1 ==
>> - Build issue for SQL support:
>> https://issues.apache.org/jira/browse/SPARK-3234
>> - EC2 script version bump to 1.1.0.
>>
>> == What justifies a -1 vote for this release? ==
>> This vote is happening very late into the QA period compared with
>> previous votes, so -1 votes should only occur for significant
>> regressions from 1.0.2. Bugs already present in 1.0.X will not block
>> this release.
>>
>> == What default changes should I be aware of? ==
>> 1. The default value of "spark.io.compression.codec" is now "snappy"
>> --> Old behavior can be restored by switching to "lzf"
>>
>> 2. PySpark now performs external spilling during aggregations.
>> --> Old behavior can be restored by setting "spark.shuffle.spill" to
>> "false".
>>
>> -
>> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>> For additional commands, e-mail: dev-h...@spark.apache.org
>>
>>
>


Re: about spark assembly jar

2014-09-02 Thread scwf

Yea, SSD + SPARK_PREPEND_CLASSES is great for iterative development!

Then why does it work with a bag of 3rd-party jars but throw an error with the assembly jar? Does anyone have an idea?
one have idea?

On 2014/9/3 2:57, Cheng Lian wrote:

Cool, didn't notice that, thanks Josh!


On Tue, Sep 2, 2014 at 11:55 AM, Josh Rosen (rosenvi...@gmail.com) wrote:

SPARK_PREPEND_CLASSES is documented on the Spark Wiki (which could probably 
be easier to find): 
https://cwiki.apache.org/confluence/display/SPARK/Useful+Developer+Tools


On September 2, 2014 at 11:53:49 AM, Cheng Lian (lian.cs@gmail.com) wrote:


Yea, SSD + SPARK_PREPEND_CLASSES totally changed my life :)

Maybe we should add a "developer notes" page to document all these useful
black magic.


On Tue, Sep 2, 2014 at 10:54 AM, Reynold Xin (r...@databricks.com) wrote:

> Having a SSD help tremendously with assembly time.
>
> Without that, you can do the following in order for Spark to pick up the
> compiled classes before assembly at runtime.
>
> export SPARK_PREPEND_CLASSES=true
>
>
> On Tue, Sep 2, 2014 at 9:10 AM, Sandy Ryza (sandy.r...@cloudera.com) wrote:
>
> > This doesn't help for every dependency, but Spark provides an option to
> > build the assembly jar without Hadoop and its dependencies.  We make use of
> > this in CDH packaging.
> >
> > -Sandy
> >
> >
> > On Tue, Sep 2, 2014 at 2:12 AM, scwf (wangf...@huawei.com) wrote:
> >
> > > Hi sean owen,
> > > here are some problems when i used assembly jar
> > > 1 i put spark-assembly-*.jar to the lib directory of my application, it
> > > throw compile error
> > >
> > > Error:scalac: Error: class scala.reflect.BeanInfo not found.
> > > scala.tools.nsc.MissingRequirementError: class scala.reflect.BeanInfo not found.
> > >
> > > at scala.tools.nsc.symtab.Definitions$definitions$.getModuleOrClass(Definitions.scala:655)
> > > at scala.tools.nsc.symtab.Definitions$definitions$.getClass(Definitions.scala:608)
> > > at scala.tools.nsc.backend.jvm.GenJVM$BytecodeGenerator.<init>(GenJVM.scala:127)
> > > at scala.tools.nsc.backend.jvm.GenJVM$JvmPhase.run(GenJVM.scala:85)
> > > at scala.tools.nsc.Global$Run.compileSources(Global.scala:953)
> > > at scala.tools.nsc.Global$Run.compile(Global.scala:1041)
> > > at xsbt.CachedCompiler0.run(CompilerInterface.scala:126)
> > > at xsbt.CachedCompiler0.liftedTree1$1(CompilerInterface.scala:102)
> > > at xsbt.CachedCompiler0.run(CompilerInterface.scala:102)
> > > at xsbt.CompilerInterface.run(CompilerInterface.scala:27)
> > > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> > > at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> > > at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> > > at java.lang.reflect.Method.invoke(Method.java:597)
> > > at sbt.compiler.AnalyzingCompiler.call(AnalyzingCompiler.scala:102)
> > > at sbt.compiler.AnalyzingCompiler.compile(AnalyzingCompiler.scala:48)
> > > at sbt.compiler.AnalyzingCompiler.compile(AnalyzingCompiler.scala:41)
> > > at org.jetbrains.jps.incremental.scala.local.IdeaIncrementalCompiler.compile(IdeaIncrementalCompiler.scala:28)
> > > at org.jetbrains.jps.incremental.scala.local.LocalServer.compile(LocalServer.scala:25)
> > > at org.jetbrains.jps.incremental.scala.remote.Main$.make(Main.scala:58)
> > > at org.jetbrains.jps.incremental.scala.remote.Main$.nailMain(Main.scala:21)
> > > at org.jetbrains.jps.incremental.scala.remote.Main.nailMain(Main.scala)
> > > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> > > at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> > > at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> > > at java.lang.reflect.Method.invoke(Method.java:597)
> > > at com.martiansoftware.nailgun.NGSession.run(NGSession.java:319)
> > > 2 i test my branch which updated hive version to org.apache.hive 0.13.1
> > >   it run successfully when use a bag of 3rd jars as dependency but throw

Re: [VOTE] Release Apache Spark 1.1.0 (RC3)

2014-09-02 Thread Patrick Wendell
Thanks everyone for voting on this. Two minor issues (one a
blocker) were found that warrant cutting a new RC. For those who voted
+1 on this release, I'd encourage you to +1 rc4 when it comes out
unless you have been testing issues specific to the EC2 scripts. This
will move the release process along.

SPARK-3332 - Issue with tagging in EC2 scripts
SPARK-3358 - Issue with regression for m3.XX instances

- Patrick

On Tue, Sep 2, 2014 at 6:55 PM, Nicholas Chammas
 wrote:
> In light of the discussion on SPARK-, I'll revoke my "-1" vote. The
> issue does not appear to be serious.
>
>
> On Sun, Aug 31, 2014 at 5:14 PM, Nicholas Chammas
>  wrote:
>>
>> -1: I believe I've found a regression from 1.0.2. The report is captured
>> in SPARK-.
>>
>>
>> On Sat, Aug 30, 2014 at 6:07 PM, Patrick Wendell 
>> wrote:
>>>
>>> Please vote on releasing the following candidate as Apache Spark version
>>> 1.1.0!
>>>
>>> The tag to be voted on is v1.1.0-rc3 (commit b2d0493b):
>>>
>>> https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=b2d0493b223c5f98a593bb6d7372706cc02bebad
>>>
>>> The release files, including signatures, digests, etc. can be found at:
>>> http://people.apache.org/~pwendell/spark-1.1.0-rc3/
>>>
>>> Release artifacts are signed with the following key:
>>> https://people.apache.org/keys/committer/pwendell.asc
>>>
>>> The staging repository for this release can be found at:
>>> https://repository.apache.org/content/repositories/orgapachespark-1030/
>>>
>>> The documentation corresponding to this release can be found at:
>>> http://people.apache.org/~pwendell/spark-1.1.0-rc3-docs/
>>>
>>> Please vote on releasing this package as Apache Spark 1.1.0!
>>>
>>> The vote is open until Tuesday, September 02, at 23:07 UTC and passes if
>>> a majority of at least 3 +1 PMC votes are cast.
>>>
>>> [ ] +1 Release this package as Apache Spark 1.1.0
>>> [ ] -1 Do not release this package because ...
>>>
>>> To learn more about Apache Spark, please see
>>> http://spark.apache.org/
>>>
>>> == Regressions fixed since RC1 ==
>>> - Build issue for SQL support:
>>> https://issues.apache.org/jira/browse/SPARK-3234
>>> - EC2 script version bump to 1.1.0.
>>>
>>> == What justifies a -1 vote for this release? ==
>>> This vote is happening very late into the QA period compared with
>>> previous votes, so -1 votes should only occur for significant
>>> regressions from 1.0.2. Bugs already present in 1.0.X will not block
>>> this release.
>>>
>>> == What default changes should I be aware of? ==
>>> 1. The default value of "spark.io.compression.codec" is now "snappy"
>>> --> Old behavior can be restored by switching to "lzf"
>>>
>>> 2. PySpark now performs external spilling during aggregations.
>>> --> Old behavior can be restored by setting "spark.shuffle.spill" to
>>> "false".
>>>
>>> -
>>> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>>> For additional commands, e-mail: dev-h...@spark.apache.org
>>>
>>
>

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: [Spark SQL] off-heap columnar store

2014-09-02 Thread Evan Chan
On Sun, Aug 31, 2014 at 8:27 PM, Ian O'Connell  wrote:
> I'm not sure what you mean here? Parquet is at its core just a format, you
> could store that data anywhere.
>
> Though it sounds like you saying, correct me if i'm wrong: you basically
> want a columnar abstraction layer where you can provide a different backing
> implementation to keep the columns rather than parquet-mr?
>
> I.e. you want to be able to produce a schema RDD from something like
> vertica, where updates should act as a write through cache back to vertica
> itself?

Something like that.

I'd like,

1)  An API to produce a schema RDD from an RDD of columns, not rows.
  However, an RDD[Column] would not make sense, since it would be
spread out across partitions.  Perhaps what is needed is a
Seq[RDD[ColumnSegment]]. The idea is that each RDD would hold the
segments for one column.  The segments represent a range of rows.
This would then read from something like Vertica or Cassandra.

2)  A variant of 1) where you could read this data from Tachyon.
Tachyon is supposed to support a columnar representation of data, it
did for Shark 0.9.x.

The goal is basically to load columnar data from something like
Cassandra into Tachyon, with the compression ratio of columnar
storage, and the speed of InMemoryColumnarTableScan.   If data is
appended into the Tachyon representation, be able to cache it back.
The write back is not as high a priority though.

A workaround would be to read data from Cassandra/Vertica/etc. and
write back into Parquet, but this would take a long time and incur
huge I/O overhead.
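
Purely to illustrate the shape of (1): a rough sketch of such an API, where ColumnSegment, its fields, and createSchemaRDDFromColumns are hypothetical names for discussion, not existing Spark SQL interfaces:

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{SQLContext, SchemaRDD}

// Hypothetical: one compressed run of values for a single column,
// covering the row range [startRow, startRow + numRows).
case class ColumnSegment(
    columnName: String,
    dataTypeName: String,
    startRow: Long,
    numRows: Long,
    compressedBytes: Array[Byte])

// Hypothetical extension point: each inner RDD holds the segments of one
// column (read from e.g. Cassandra, Vertica, or Tachyon); an implementation
// would align segments by row range and feed them to the in-memory columnar
// scan instead of round-tripping through a row-oriented format.
trait ColumnarSchemaRDDBuilder {
  def createSchemaRDDFromColumns(
      sqlContext: SQLContext,
      columns: Seq[RDD[ColumnSegment]]): SchemaRDD
}

The main design question such an API would have to answer is how the per-column RDDs are co-partitioned, so that segments covering the same row range land in the same task without a shuffle.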

>
> I'm sorry it just sounds like its worth clearly defining what your key
> requirement/goal is.
>
>
> On Thu, Aug 28, 2014 at 11:31 PM, Evan Chan  wrote:
>>
>> >
>> >> The reason I'm asking about the columnar compressed format is that
>> >> there are some problems for which Parquet is not practical.
>> >
>> >
>> > Can you elaborate?
>>
>> Sure.
>>
>> - Organization or co has no Hadoop, but significant investment in some
>> other NoSQL store.
>> - Need to efficiently add a new column to existing data
>> - Need to mark some existing rows as deleted or replace small bits of
>> existing data
>>
>> For these use cases, it would be much more efficient and practical if
>> we didn't have to take the origin of the data from the datastore,
>> convert it to Parquet first.  Doing so loses significant latency and
>> causes Ops headaches in having to maintain HDFS. It would be great
>> to be able to load data directly into the columnar format, into the
>> InMemoryColumnarCache.
>>
>> -
>> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>> For additional commands, e-mail: dev-h...@spark.apache.org
>>
>

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org