date:20150211

Hi

Can you share the code you are trying to run.

Thanks
Arush

On Wed, Feb 11, 2015 at 9:12 AM, Tianshuo Deng 
wrote:

> I have seen the same problem, It causes some tasks to fail, but not the
> whole job to fail.
> Hope someone could shed some light on what could be the cause of this.
>
> On Mon, Jan 26, 2015 at 9:49 AM, Aaron Davidson 
> wrote:
>
>> It looks like something weird is going on with your object serialization,
>> perhaps a funny form of self-reference which is not detected by
>> ObjectOutputStream's typical loop avoidance. That, or you have some data
>> structure like a linked list with a parent pointer and you have many
>> thousand elements.
>>
>> Assuming the stack trace is coming from an executor, it is probably a
>> problem with the objects you're sending back as results, so I would
>> carefully examine these and maybe try serializing some using
>> ObjectOutputStream manually.
>>
>> If your program looks like
>> foo.map { row => doComplexOperation(row) }.take(10)
>>
>> you can also try changing it to
>> foo.map { row => doComplexOperation(row); 1 }.take(10)
>>
>> to avoid serializing the result of that complex operation, which should
>> help narrow down where exactly the problematic objects are coming from.
>>
>> On Mon, Jan 26, 2015 at 8:31 AM, octavian.ganea <
>> octavian.ga...@inf.ethz.ch> wrote:
>>
>>> Here is the first error I get at the executors:
>>>
>>> 15/01/26 17:27:04 ERROR ExecutorUncaughtExceptionHandler: Uncaught
>>> exception
>>> in thread Thread[handle-message-executor-16,5,main]
>>> java.lang.StackOverflowError
>>> at
>>>
>>> java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1876)
>>> at
>>>
>>> java.io.ObjectOutputStream$BlockDataOutputStream.write(ObjectOutputStream.java:1840)
>>> at
>>>
>>> java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1533)
>>> at
>>> java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508)
>>> at
>>>
>>> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431)
>>> at
>>> java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
>>> at
>>>
>>> java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547)
>>> at
>>> java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508)
>>> at
>>>
>>> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431)
>>> at
>>> java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
>>> at
>>>
>>> java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547)
>>> at
>>> java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508)
>>> at
>>>
>>> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431)
>>> at
>>> java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
>>> at
>>>
>>> java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547)
>>> at
>>> java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508)
>>> at
>>>
>>> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431)
>>> at
>>> java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
>>> at
>>>
>>> java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547)
>>> at
>>> java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508)
>>> at
>>>
>>> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431)
>>> at
>>> java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
>>> at
>>>
>>> java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547)
>>> at
>>> java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508)
>>> at
>>>
>>> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431)
>>> at
>>> java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
>>> at
>>>
>>> java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547)
>>> at
>>> java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508)
>>> at
>>>
>>> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431)
>>> at
>>> java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
>>> at
>>>
>>> java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547)
>>> at
>>> java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508)
>>> at
>>>
>>> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431)
>>> at
>>> java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
>>> at
>>>
>>> java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547)
>>> at
>>> java.io.ObjectOutputStream.writeSerialData(ObjectOu

Re: Writing to HDFS from spark Streaming

2015-02-11 Thread Sean Owen

That kinda dodges the problem by ignoring generic types. But it may be
simpler than the 'real' solution, which is a bit ugly.

(But first, to double check, are you importing the correct
TextOutputFormat? there are two versions. You use .mapred. with the
old API and .mapreduce. with the new API.)

Here's how I've formally casted around it in similar code:

@SuppressWarnings
Class> outputFormatClass =
(Class>) (Class) TextOutputFormat.class;

and then pass that as the final argument.

On Wed, Feb 11, 2015 at 6:35 AM, Akhil Das  wrote:
> Did you try :
>
> temp.saveAsHadoopFiles("DailyCSV",".txt", String.class, String.class,(Class)
> TextOutputFormat.class);
>
> Thanks
> Best Regards
>
> On Wed, Feb 11, 2015 at 9:40 AM, Bahubali Jain  wrote:
>>
>> Hi,
>> I am facing issues while writing data from a streaming rdd to hdfs..
>>
>> JavaPairDstream temp;
>> ...
>> ...
>> temp.saveAsHadoopFiles("DailyCSV",".txt", String.class,
>> String.class,TextOutputFormat.class);
>>
>>
>> I see compilation issues as below...
>> The method saveAsHadoopFiles(String, String, Class, Class, Class> extends OutputFormat>) in the type JavaPairDStream is
>> not applicable for the arguments (String, String, Class,
>> Class, Class)
>>
>> I see same kind of problem even with saveAsNewAPIHadoopFiles API .
>>
>> Thanks,
>> Baahu
>
>

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org

high GC in the Kmeans algorithm

2015-02-11 Thread lihu

Hi,
I  run the kmeans(MLlib) in a cluster with 12 workers.  Every work own
a 128G RAM, 24Core. I run 48 task in one machine. the total data is just
40GB.

   When the dimension of the data set is about 10^7, for every task the
duration is about 30s, but the cost for GC is about 20s.

   When I reduce the dimension to 10^4, then the gc is small.

So why gc is so high when the dimension is larger? or this is the
reason caused by MLlib?

Re: Datastore HDFS vs Cassandra

2015-02-11 Thread Franc Carter

One additional comment I would make is that you should be careful with
Updates in Cassandra, it does support them but large amounts of Updates
(i.e changing existing keys) tends to cause fragmentation. If you are
(mostly) adding new keys (e.g new records in the the time series) then
Cassandra can be excellent

cheers


On Wed, Feb 11, 2015 at 6:13 PM, Paolo Platter 
wrote:

>   Hi Mike,
>
> I developed a Solution with cassandra and spark, using DSE.
> The main difficult is about cassandra, you need to understand very well
> its data model and its Query patterns.
> Cassandra has better performance than hdfs and it has DR and stronger
> availability.
> Hdfs is a filesystem, cassandra is a dbms.
> Cassandra supports full CRUD without acid.
> Hdfs is more flexible than cassandra.
>
> In my opinion, if you have a real time series, go with Cassandra paying
> attention at your reporting data access patterns.
>
> Paolo
>
> Inviata dal mio Windows Phone
>  --
> Da: Mike Trienis 
> Inviato: ‎11/‎02/‎2015 05:59
> A: user@spark.apache.org
> Oggetto: Datastore HDFS vs Cassandra
>
>   Hi,
>
> I am considering implement Apache Spark on top of Cassandra database after
> listing to related talk and reading through the slides from DataStax. It
> seems to fit well with our time-series data and reporting requirements.
>
>
> http://www.slideshare.net/patrickmcfadin/apache-cassandra-apache-spark-for-time-series-data
>
> Does anyone have any experiences using Apache Spark and Cassandra,
> including
> limitations (and or) technical difficulties? How does Cassandra compare
> with
> HDFS and what use cases would make HDFS more suitable?
>
> Thanks, Mike.
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Datastore-HDFS-vs-Cassandra-tp21590.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


-- 

*Franc Carter* | Systems Architect | Rozetta Technology

franc.car...@rozettatech.com  |
www.rozettatechnology.com

Tel: +61 2 8355 2515

Level 4, 55 Harrington St, The Rocks NSW 2000

PO Box H58, Australia Square, Sydney NSW 1215

AUSTRALIA

Re: Datastore HDFS vs Cassandra

2015-02-11 Thread Christian Betz

Hi

Regarding the Cassandra Data model, there's an excellent post on the ebay tech 
blog: 
http://www.ebaytechblog.com/2012/07/16/cassandra-data-modeling-best-practices-part-1/.
 There's also a slideshare for this somewhere.

Happy hacking

Chris

Von: Franc Carter 
mailto:franc.car...@rozettatech.com>>
Datum: Mittwoch, 11. Februar 2015 10:03
An: Paolo Platter mailto:paolo.plat...@agilelab.it>>
Cc: Mike Trienis mailto:mike.trie...@orcsol.com>>, 
"user@spark.apache.org" 
mailto:user@spark.apache.org>>
Betreff: Re: Datastore HDFS vs Cassandra


One additional comment I would make is that you should be careful with Updates 
in Cassandra, it does support them but large amounts of Updates (i.e changing 
existing keys) tends to cause fragmentation. If you are (mostly) adding new 
keys (e.g new records in the the time series) then Cassandra can be excellent

cheers


On Wed, Feb 11, 2015 at 6:13 PM, Paolo Platter 
mailto:paolo.plat...@agilelab.it>> wrote:
Hi Mike,

I developed a Solution with cassandra and spark, using DSE.
The main difficult is about cassandra, you need to understand very well its 
data model and its Query patterns.
Cassandra has better performance than hdfs and it has DR and stronger 
availability.
Hdfs is a filesystem, cassandra is a dbms.
Cassandra supports full CRUD without acid.
Hdfs is more flexible than cassandra.

In my opinion, if you have a real time series, go with Cassandra paying 
attention at your reporting data access patterns.

Paolo

Inviata dal mio Windows Phone

Da: Mike Trienis
Inviato: ?11/?02/?2015 05:59
A: user@spark.apache.org
Oggetto: Datastore HDFS vs Cassandra

Hi,

I am considering implement Apache Spark on top of Cassandra database after
listing to related talk and reading through the slides from DataStax. It
seems to fit well with our time-series data and reporting requirements.

http://www.slideshare.net/patrickmcfadin/apache-cassandra-apache-spark-for-time-series-data

Does anyone have any experiences using Apache Spark and Cassandra, including
limitations (and or) technical difficulties? How does Cassandra compare with
HDFS and what use cases would make HDFS more suitable?

Thanks, Mike.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Datastore-HDFS-vs-Cassandra-tp21590.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: 
user-unsubscr...@spark.apache.org
For additional commands, e-mail: 
user-h...@spark.apache.org




--

Franc Carter | Systems Architect | Rozetta Technology

franc.car...@rozettatech.com  | 
www.rozettatechnology.com

Tel: +61 2 8355 2515

Level 4, 55 Harrington St, The Rocks NSW 2000

PO Box H58, Australia Square, Sydney NSW 1215

AUSTRALIA

Re: Datastore HDFS vs Cassandra

2015-02-11 Thread Franc Carter

I forgot to mention that if you do decide to use Cassandra I'd highly
recommend jumping on the Cassandra mailing list, if we had taken in come of
the advice on that list things would have been considerably smoother

cheers

On Wed, Feb 11, 2015 at 8:12 PM, Christian Betz <
christian.b...@performance-media.de> wrote:

>   Hi
>
>  Regarding the Cassandra Data model, there's an excellent post on the
> ebay tech blog:
> http://www.ebaytechblog.com/2012/07/16/cassandra-data-modeling-best-practices-part-1/.
> There's also a slideshare for this somewhere.
>
>  Happy hacking
>
>  Chris
>
>   Von: Franc Carter 
> Datum: Mittwoch, 11. Februar 2015 10:03
> An: Paolo Platter 
> Cc: Mike Trienis , "user@spark.apache.org" <
> user@spark.apache.org>
> Betreff: Re: Datastore HDFS vs Cassandra
>
>
> One additional comment I would make is that you should be careful with
> Updates in Cassandra, it does support them but large amounts of Updates
> (i.e changing existing keys) tends to cause fragmentation. If you are
> (mostly) adding new keys (e.g new records in the the time series) then
> Cassandra can be excellent
>
>  cheers
>
>
> On Wed, Feb 11, 2015 at 6:13 PM, Paolo Platter 
> wrote:
>
>>   Hi Mike,
>>
>> I developed a Solution with cassandra and spark, using DSE.
>> The main difficult is about cassandra, you need to understand very well
>> its data model and its Query patterns.
>> Cassandra has better performance than hdfs and it has DR and stronger
>> availability.
>> Hdfs is a filesystem, cassandra is a dbms.
>> Cassandra supports full CRUD without acid.
>> Hdfs is more flexible than cassandra.
>>
>> In my opinion, if you have a real time series, go with Cassandra paying
>> attention at your reporting data access patterns.
>>
>> Paolo
>>
>> Inviata dal mio Windows Phone
>>  --
>> Da: Mike Trienis 
>> Inviato: ?11/?02/?2015 05:59
>> A: user@spark.apache.org
>> Oggetto: Datastore HDFS vs Cassandra
>>
>>   Hi,
>>
>> I am considering implement Apache Spark on top of Cassandra database after
>> listing to related talk and reading through the slides from DataStax. It
>> seems to fit well with our time-series data and reporting requirements.
>>
>>
>> http://www.slideshare.net/patrickmcfadin/apache-cassandra-apache-spark-for-time-series-data
>>
>> Does anyone have any experiences using Apache Spark and Cassandra,
>> including
>> limitations (and or) technical difficulties? How does Cassandra compare
>> with
>> HDFS and what use cases would make HDFS more suitable?
>>
>> Thanks, Mike.
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/Datastore-HDFS-vs-Cassandra-tp21590.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>>
>
>
>  --
>
> *Franc Carter* | Systems Architect | Rozetta Technology
>
> franc.car...@rozettatech.com  |
> www.rozettatechnology.com
>
> Tel: +61 2 8355 2515
>
> Level 4, 55 Harrington St, The Rocks NSW 2000
>
> PO Box H58, Australia Square, Sydney NSW 1215
>
> AUSTRALIA
>
>


-- 

*Franc Carter* | Systems Architect | Rozetta Technology

franc.car...@rozettatech.com  |
www.rozettatechnology.com

Tel: +61 2 8355 2515

Level 4, 55 Harrington St, The Rocks NSW 2000

PO Box H58, Australia Square, Sydney NSW 1215

AUSTRALIA

Re: high GC in the Kmeans algorithm

2015-02-11 Thread Sean Owen

Good, worth double-checking that's what you got. That's barely 1GB per
task though. Why run 48 if you have 24 cores?

On Wed, Feb 11, 2015 at 9:03 AM, lihu  wrote:
> I give 50GB to the executor,  so it seem that  there is no reason the memory
> is not enough.
>
> On Wed, Feb 11, 2015 at 4:50 PM, Sean Owen  wrote:
>>
>> Meaning, you have 128GB per machine but how much memory are you giving
>> the executors?
>>
>> On Wed, Feb 11, 2015 at 8:49 AM, lihu  wrote:
>> > What do you mean?  Yes,I an see there  is some data put in the memory
>> > from
>> > the web ui.
>> >
>> > On Wed, Feb 11, 2015 at 4:25 PM, Sean Owen  wrote:
>> >>
>> >> Are you actually using that memory for executors?
>> >>
>> >> On Wed, Feb 11, 2015 at 8:17 AM, lihu  wrote:
>> >> > Hi,
>> >> > I  run the kmeans(MLlib) in a cluster with 12 workers.  Every
>> >> > work
>> >> > own a
>> >> > 128G RAM, 24Core. I run 48 task in one machine. the total data is
>> >> > just
>> >> > 40GB.
>> >> >
>> >> >When the dimension of the data set is about 10^7, for every task
>> >> > the
>> >> > duration is about 30s, but the cost for GC is about 20s.
>> >> >
>> >> >When I reduce the dimension to 10^4, then the gc is small.
>> >> >
>> >> > So why gc is so high when the dimension is larger? or this is the
>> >> > reason
>> >> > caused by MLlib?
>> >> >
>> >> >
>> >> >
>> >> >
>> >
>> >
>> >
>> >
>> > --
>> > Best Wishes!
>> >
>> > Li Hu(李浒) | Graduate Student
>> > Institute for Interdisciplinary Information Sciences(IIIS)
>> > Tsinghua University, China
>> >
>> > Email: lihu...@gmail.com
>> > Homepage: http://iiis.tsinghua.edu.cn/zh/lihu/
>> >
>> >
>
>
>
>
> --
> Best Wishes!
>
> Li Hu(李浒) | Graduate Student
> Institute for Interdisciplinary Information Sciences(IIIS)
> Tsinghua University, China
>
> Email: lihu...@gmail.com
> Homepage: http://iiis.tsinghua.edu.cn/zh/lihu/
>
>

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org

Re: high GC in the Kmeans algorithm

2015-02-11 Thread lihu

I just want to make the best use of CPU,  and test the performance of spark
if there is a lot of task in a single node.

On Wed, Feb 11, 2015 at 5:29 PM, Sean Owen  wrote:

> Good, worth double-checking that's what you got. That's barely 1GB per
> task though. Why run 48 if you have 24 cores?
>
> On Wed, Feb 11, 2015 at 9:03 AM, lihu  wrote:
> > I give 50GB to the executor,  so it seem that  there is no reason the
> memory
> > is not enough.
> >
> > On Wed, Feb 11, 2015 at 4:50 PM, Sean Owen  wrote:
> >>
> >> Meaning, you have 128GB per machine but how much memory are you giving
> >> the executors?
> >>
> >> On Wed, Feb 11, 2015 at 8:49 AM, lihu  wrote:
> >> > What do you mean?  Yes,I an see there  is some data put in the memory
> >> > from
> >> > the web ui.
> >> >
> >> > On Wed, Feb 11, 2015 at 4:25 PM, Sean Owen 
> wrote:
> >> >>
> >> >> Are you actually using that memory for executors?
> >> >>
> >> >> On Wed, Feb 11, 2015 at 8:17 AM, lihu  wrote:
> >> >> > Hi,
> >> >> > I  run the kmeans(MLlib) in a cluster with 12 workers.  Every
> >> >> > work
> >> >> > own a
> >> >> > 128G RAM, 24Core. I run 48 task in one machine. the total data is
> >> >> > just
> >> >> > 40GB.
> >> >> >
> >> >> >When the dimension of the data set is about 10^7, for every task
> >> >> > the
> >> >> > duration is about 30s, but the cost for GC is about 20s.
> >> >> >
> >> >> >When I reduce the dimension to 10^4, then the gc is small.
> >> >> >
> >> >> > So why gc is so high when the dimension is larger? or this is
> the
> >> >> > reason
> >> >> > caused by MLlib?
> >> >> >
> >> >> >
> >> >> >
> >> >> >
> >> >
> >> >
> >> >
> >> >
> >> > --
> >> > Best Wishes!
> >> >
> >> > Li Hu(李浒) | Graduate Student
> >> > Institute for Interdisciplinary Information Sciences(IIIS)
> >> > Tsinghua University, China
> >> >
> >> > Email: lihu...@gmail.com
> >> > Homepage: http://iiis.tsinghua.edu.cn/zh/lihu/
> >> >
> >> >
> >
> >
> >
> >
> > --
> > Best Wishes!
> >
> > Li Hu(李浒) | Graduate Student
> > Institute for Interdisciplinary Information Sciences(IIIS)
> > Tsinghua University, China
> >
> > Email: lihu...@gmail.com
> > Homepage: http://iiis.tsinghua.edu.cn/zh/lihu/
> >
> >
>



-- 
*Best Wishes!*

*Li Hu(李浒) | Graduate Student*

*Institute for Interdisciplinary Information Sciences(IIIS
)*
*Tsinghua University, China*

*Email: lihu...@gmail.com *
*Homepage: http://iiis.tsinghua.edu.cn/zh/lihu/
*

Re: Bug in ElasticSearch and Spark SQL: Using SQL to query out data, from JSON documents is totally wrong!

2015-02-11 Thread Costin Leau



Aris, if you encountered a bug, it's best to raise an issue with the 
es-hadoop/spark project, namely here [1].

When using SparkSQL the underlying data needs to be present - this is mentioned 
in the docs as well [2]. As for the order,
that does look like a bug and shouldn't occur. Note the reason why the keys are 
re-arranged is in JSON, the object/map
doesn't
guarantee order. To give some predictability, the keys are arranged 
alphabetically.

I suggest continuing this discussion in the issue issue tracker mentioned at 
[1].

[1] https://github.com/elasticsearch/elasticsearch-hadoop/issues
[2] 
http://www.elasticsearch.org/guide/en/elasticsearch/hadoop/master/spark.html#spark-sql

On 2/11/15 3:17 AM, Aris wrote:

I'm using ElasticSearch with elasticsearch-spark-BUILD-SNAPSHOT and 
Spark/SparkSQL 1.2.0, from Costin Leau's advice.

I want to query ElasticSearch for a bunch of JSON documents from within 
SparkSQL, and then use a SQL query to simply
query for a column, which is actually a JSON key -- normal things that SparkSQL 
does using the
SQLContext.jsonFile(filePath) facility. The difference I am using the 
ElasticSearch container.

The big problem: when I do something like

SELECT jsonKeyA from tempTable;

I actually get the WRONG KEY out of the JSON documents! I discovered that if I 
have JSON keys physically in the order D,
C, B, A in the json documents, the elastic search connector discovers those 
keys BUT then sorts them alphabetically as
A,B,C,D - so when I SELECT A from tempTable, I actually get column D (because 
the physical JSONs had key D in the first
position). This only happens when reading from elasticsearch and SparkSQL.

It gets much worse: When a key is missing from one of the documents and that 
key should be NULL, the whole application
actually crashes and gives me a java.lang.IndexOutOfBoundsException -- the 
schema that is inferred is totally screwed up.

In the above example with physical JSONs containing keys in the order D,C,B,A, 
if one of the JSON documents is missing
the key/column I am querying for, I get that 
java.lang.IndexOutOfBoundsException exception.

I am using the BUILD-SNAPSHOT because otherwise I couldn't build the 
elasticsearch-spark project, Costin said so.

Any clues here? Any fixes?





--
Costin



--
Costin

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org

Re: using spark in web services

Hi

Are you able to run the code after eliminating all the spark code. To find
out if the issue is with Jetty or with Spark itself, it could be also due
to conflicting jetty versions in spark and the one you are trying to use.

You can check the dependency graph in maven and check if there are any
dependency conflicts.

mvn dependency:tree -Dverbose


Thanks
Arush

On Mon, Feb 9, 2015 at 4:00 PM, Hafiz Mujadid 
wrote:

> Hi experts! I am trying to use spark in my restful webservices.I am using
> scala lift frramework for writing web services. Here is my boot class
> class Boot extends Bootable {
>   def boot {
> Constants.loadConfiguration
> val sc=new SparkContext(new
> SparkConf().setMaster("local").setAppName("services"))
> // Binding Service as a Restful API
> LiftRules.statelessDispatchTable.append(RestfulService);
> // resolve the trailing slash issue
> LiftRules.statelessRewrite.prepend({
>   case RewriteRequest(ParsePath(path, _, _, true), _, _) if path.last
> ==
> "index" => RewriteResponse(path.init)
> })
>
>   }
> }
>
>
> When i remove this line val sc=new SparkContext(new
> SparkConf().setMaster("local").setAppName("services"))
>
> then it works fine.
> I am starting services using command
>
> java -jar start.jar jetty.port=
>
> and get following exception
>
>
> ERROR net.liftweb.http.provider.HTTPProvider - Failed to Boot! Your
> application may not run properly
> java.lang.NoClassDefFoundError:
> org/eclipse/jetty/server/handler/ContextHandler$NoContext
> at
>
> org.eclipse.jetty.servlet.ServletContextHandler.newServletHandler(ServletContextHandler.java:260)
> at
>
> org.eclipse.jetty.servlet.ServletContextHandler.getServletHandler(ServletContextHandler.java:322)
> at
>
> org.eclipse.jetty.servlet.ServletContextHandler.relinkHandlers(ServletContextHandler.java:198)
> at
>
> org.eclipse.jetty.servlet.ServletContextHandler.(ServletContextHandler.java:157)
> at
>
> org.eclipse.jetty.servlet.ServletContextHandler.(ServletContextHandler.java:135)
> at
>
> org.eclipse.jetty.servlet.ServletContextHandler.(ServletContextHandler.java:129)
> at
>
> org.eclipse.jetty.servlet.ServletContextHandler.(ServletContextHandler.java:99)
> at
> org.apache.spark.ui.JettyUtils$.createServletHandler(JettyUtils.scala:96)
> at
> org.apache.spark.ui.JettyUtils$.createServletHandler(JettyUtils.scala:87)
> at org.apache.spark.ui.WebUI.attachPage(WebUI.scala:67)
> at
> org.apache.spark.ui.WebUI$$anonfun$attachTab$1.apply(WebUI.scala:60)
> at
> org.apache.spark.ui.WebUI$$anonfun$attachTab$1.apply(WebUI.scala:60)
> at
>
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> at
> scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
> at org.apache.spark.ui.WebUI.attachTab(WebUI.scala:60)
> at org.apache.spark.ui.SparkUI.initialize(SparkUI.scala:50)
> at org.apache.spark.ui.SparkUI.(SparkUI.scala:63)
>
>
>
> Any suggestion please?
>
> Am I using right command to run this ?
>
>
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/using-spark-in-web-services-tp21550.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


-- 

[image: Sigmoid Analytics] 

*Arush Kharbanda* || Technical Teamlead

ar...@sigmoidanalytics.com || www.sigmoidanalytics.com

apply function to all elements of a row matrix

2015-02-11 Thread Donbeo

HI,
I have a row matrix x

scala> x
res3: org.apache.spark.mllib.linalg.distributed.RowMatrix =
org.apache.spark.mllib.linalg.distributed.RowMatrix@63949747

and I would like to apply a function to each element of this matrix. I was
looking for something like:
x map (e => exp(-e*e))   

How can I do that?

Thanks!!



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/apply-function-to-all-elements-of-a-row-matrix-tp21596.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org

Re: SparkSQL + Tableau Connector

2015-02-11 Thread Todd Nist

Hi Arush,

So yes I want to create the tables through Spark SQL.  I have placed the
hive-site.xml file inside of the $SPARK_HOME/conf directory I thought that
was all I should need to do to have the thriftserver use it.  Perhaps my
hive-site.xml is worng, it currently looks like this:



  hive.metastore.uris
  
  thrift://sandbox.hortonworks.com:9083
  URI for client to contact metastore server



Which leads me to believe it is going to pull form the thriftserver from
Horton?  I will go look at the docs to see if this is right, it is what
Horton says to do.  Do you have an example hive-site.xml by chance that
works with Spark SQL?

I am using 8.3 of tableau with the SparkSQL Connector.

Thanks for the assistance.

-Todd

On Wed, Feb 11, 2015 at 2:34 AM, Arush Kharbanda  wrote:

> BTW what tableau connector are you using?
>
> On Wed, Feb 11, 2015 at 12:55 PM, Arush Kharbanda <
> ar...@sigmoidanalytics.com> wrote:
>
>>  I am a little confused here, why do you want to create the tables in
>> hive. You want to create the tables in spark-sql, right?
>>
>> If you are not able to find the same tables through tableau then thrift
>> is connecting to a diffrent metastore than your spark-shell.
>>
>> One way to specify a metstore to thrift is to provide the path to
>> hive-site.xml while starting thrift using --files hive-site.xml.
>>
>> similarly you can specify the same metastore to your spark-submit or
>> sharp-shell using the same option.
>>
>>
>>
>> On Wed, Feb 11, 2015 at 5:23 AM, Todd Nist  wrote:
>>
>>> Arush,
>>>
>>> As for #2 do you mean something like this from the docs:
>>>
>>> // sc is an existing SparkContext.val sqlContext = new 
>>> org.apache.spark.sql.hive.HiveContext(sc)
>>> sqlContext.sql("CREATE TABLE IF NOT EXISTS src (key INT, value 
>>> STRING)")sqlContext.sql("LOAD DATA LOCAL INPATH 
>>> 'examples/src/main/resources/kv1.txt' INTO TABLE src")
>>> // Queries are expressed in HiveQLsqlContext.sql("FROM src SELECT key, 
>>> value").collect().foreach(println)
>>>
>>> Or did you have something else in mind?
>>>
>>> -Todd
>>>
>>>
>>> On Tue, Feb 10, 2015 at 6:35 PM, Todd Nist  wrote:
>>>
 Arush,

 Thank you will take a look at that approach in the morning.  I sort of
 figured the answer to #1 was NO and that I would need to do 2 and 3 thanks
 for clarifying it for me.

 -Todd

 On Tue, Feb 10, 2015 at 5:24 PM, Arush Kharbanda <
 ar...@sigmoidanalytics.com> wrote:

> 1.  Can the connector fetch or query schemaRDD's saved to Parquet or
> JSON files? NO
> 2.  Do I need to do something to expose these via hive / metastore
> other than creating a table in hive? Create a table in spark sql to expose
> via spark sql
> 3.  Does the thriftserver need to be configured to expose these in
> some fashion, sort of related to question 2 you would need to configure
> thrift to read from the metastore you expect it read from - by default it
> reads from metastore_db directory present in the directory used to launch
> the thrift server.
>  On 11 Feb 2015 01:35, "Todd Nist"  wrote:
>
>> Hi,
>>
>> I'm trying to understand how and what the Tableau connector to
>> SparkSQL is able to access.  My understanding is it needs to connect to 
>> the
>> thriftserver and I am not sure how or if it exposes parquet, json,
>> schemaRDDs, or does it only expose schemas defined in the metastore / 
>> hive.
>>
>>
>> For example, I do the following from the spark-shell which generates
>> a schemaRDD from a csv file and saves it as a JSON file as well as a
>> parquet file.
>>
>> import *org.apache.sql.SQLContext
>> *import com.databricks.spark.csv._
>> val sqlContext = new SQLContext(sc)
>> val test = 
>> sqlContext.csfFile("/data/test.csv")test.toJSON().saveAsTextFile("/data/out")
>> test.saveAsParquetFile("/data/out")
>>
>> When I connect from Tableau, the only thing I see is the "default"
>> schema and nothing in the tables section.
>>
>> So my questions are:
>>
>> 1.  Can the connector fetch or query schemaRDD's saved to Parquet or
>> JSON files?
>> 2.  Do I need to do something to expose these via hive / metastore
>> other than creating a table in hive?
>> 3.  Does the thriftserver need to be configured to expose these in
>> some fashion, sort of related to question 2.
>>
>> TIA for the assistance.
>>
>> -Todd
>>
>

>>>
>>
>>
>> --
>>
>> [image: Sigmoid Analytics] 
>>
>> *Arush Kharbanda* || Technical Teamlead
>>
>> ar...@sigmoidanalytics.com || www.sigmoidanalytics.com
>>
>
>
>
> --
>
> [image: Sigmoid Analytics] 
>
> *Arush Kharbanda* || Technical Teamlead
>
> ar...@sigmoidanalytics.com || www.sigmoidanalytics.com
>

Re: How to efficiently utilize all cores?

2015-02-11 Thread Harika

Hi Aplysia,

Thanks for the reply.

Could you be more specific in terms of what part of the document to look at
as I have already seen it and tried a few of the relevant settings for no
use. 







--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/How-to-efficiently-utilize-all-cores-tp21569p21597.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org

What do you think about the level of resource manager and file system?

2015-02-11 Thread Fangqi (Roy)

[cid:image004.jpg@01D04629.1F451950] [cid:image005.png@01D04629.1F451950]

Hi guys~

Comparing these two architectures, why BDAS put Yarn and Mesos under the HDFS, 
do you have any special consideration? Or just easy to express the AMPLab stack?

Best regards!

Re: SparkSQL + Tableau Connector

Hi

I used this, though its using a embedded driver and is not a good
approch.It works. You can configure for some other metastore type also. I
have not tried the metastore uri's.






  javax.jdo.option.ConnectionURL


jdbc:derby:;databaseName=/opt/bigdata/spark-1.2.0/metastore_db;create=true

  URL for the DB






  javax.jdo.option.ConnectionDriverName

  org.apache.derby.jdbc.EmbeddedDriver








On Wed, Feb 11, 2015 at 3:59 PM, Todd Nist  wrote:

> Hi Arush,
>
> So yes I want to create the tables through Spark SQL.  I have placed the
> hive-site.xml file inside of the $SPARK_HOME/conf directory I thought that
> was all I should need to do to have the thriftserver use it.  Perhaps my
> hive-site.xml is worng, it currently looks like this:
>
> 
> 
>   hive.metastore.uris
>   
>   thrift://sandbox.hortonworks.com:9083
>   URI for client to contact metastore server
> 
> 
>
> Which leads me to believe it is going to pull form the thriftserver from
> Horton?  I will go look at the docs to see if this is right, it is what
> Horton says to do.  Do you have an example hive-site.xml by chance that
> works with Spark SQL?
>
> I am using 8.3 of tableau with the SparkSQL Connector.
>
> Thanks for the assistance.
>
> -Todd
>
> On Wed, Feb 11, 2015 at 2:34 AM, Arush Kharbanda <
> ar...@sigmoidanalytics.com> wrote:
>
>> BTW what tableau connector are you using?
>>
>> On Wed, Feb 11, 2015 at 12:55 PM, Arush Kharbanda <
>> ar...@sigmoidanalytics.com> wrote:
>>
>>>  I am a little confused here, why do you want to create the tables in
>>> hive. You want to create the tables in spark-sql, right?
>>>
>>> If you are not able to find the same tables through tableau then thrift
>>> is connecting to a diffrent metastore than your spark-shell.
>>>
>>> One way to specify a metstore to thrift is to provide the path to
>>> hive-site.xml while starting thrift using --files hive-site.xml.
>>>
>>> similarly you can specify the same metastore to your spark-submit or
>>> sharp-shell using the same option.
>>>
>>>
>>>
>>> On Wed, Feb 11, 2015 at 5:23 AM, Todd Nist  wrote:
>>>
 Arush,

 As for #2 do you mean something like this from the docs:

 // sc is an existing SparkContext.val sqlContext = new 
 org.apache.spark.sql.hive.HiveContext(sc)
 sqlContext.sql("CREATE TABLE IF NOT EXISTS src (key INT, value 
 STRING)")sqlContext.sql("LOAD DATA LOCAL INPATH 
 'examples/src/main/resources/kv1.txt' INTO TABLE src")
 // Queries are expressed in HiveQLsqlContext.sql("FROM src SELECT key, 
 value").collect().foreach(println)

 Or did you have something else in mind?

 -Todd


 On Tue, Feb 10, 2015 at 6:35 PM, Todd Nist  wrote:

> Arush,
>
> Thank you will take a look at that approach in the morning.  I sort of
> figured the answer to #1 was NO and that I would need to do 2 and 3 thanks
> for clarifying it for me.
>
> -Todd
>
> On Tue, Feb 10, 2015 at 5:24 PM, Arush Kharbanda <
> ar...@sigmoidanalytics.com> wrote:
>
>> 1.  Can the connector fetch or query schemaRDD's saved to Parquet or
>> JSON files? NO
>> 2.  Do I need to do something to expose these via hive / metastore
>> other than creating a table in hive? Create a table in spark sql to 
>> expose
>> via spark sql
>> 3.  Does the thriftserver need to be configured to expose these in
>> some fashion, sort of related to question 2 you would need to configure
>> thrift to read from the metastore you expect it read from - by default it
>> reads from metastore_db directory present in the directory used to launch
>> the thrift server.
>>  On 11 Feb 2015 01:35, "Todd Nist"  wrote:
>>
>>> Hi,
>>>
>>> I'm trying to understand how and what the Tableau connector to
>>> SparkSQL is able to access.  My understanding is it needs to connect to 
>>> the
>>> thriftserver and I am not sure how or if it exposes parquet, json,
>>> schemaRDDs, or does it only expose schemas defined in the metastore / 
>>> hive.
>>>
>>>
>>> For example, I do the following from the spark-shell which generates
>>> a schemaRDD from a csv file and saves it as a JSON file as well as a
>>> parquet file.
>>>
>>> import *org.apache.sql.SQLContext
>>> *import com.databricks.spark.csv._
>>> val sqlContext = new SQLContext(sc)
>>> val test = 
>>> sqlContext.csfFile("/data/test.csv")test.toJSON().saveAsTextFile("/data/out")
>>> test.saveAsParquetFile("/data/out")
>>>
>>> When I connect from Tableau, the only thing I see is the "default"
>>> schema and nothing in the tables section.
>>>
>>> So my questions are:
>>>
>>> 1.  Can the connector fetch or query schemaRDD's saved to Parquet or
>>> JSON files?
>>> 2.  Do I need to do something to expose these via hive / metastore
>>> other than creating a table in hive?
>>> 3.  Does the thriftserver need t

Question related to Spark SQL

2015-02-11 Thread Ashish Mukherjee

Hi,

I am planning to use Spark for a Web-based adhoc reporting tool on massive
date-sets on S3. Real-time queries with filters, aggregations and joins
could be constructed from UI selections.

Online documentation seems to suggest that SharkQL is deprecated and users
should move away from it.  I understand Hive is generally not used for
real-time querying and for Spark SQL to work with other data stores, tables
need to be registered explicitly in code. Also, the This would not be
suitable for a dynamic query construction scenario.

For a real-time , dynamic querying scenario like mine what is the proper
tool to be used with Spark SQL?

Regards,
Ashish

Re: Question related to Spark SQL

I am implementing this approach currently.

A
1.Create data tables in spark-sql and cache them.
2. Configure the hive metastore to read the cached tables and share the
same metastore as spark-sql (You get the spark caching advantage)
3.Run spark code to fetch form the cached tables. In the spark code you can
genrate queries at runtime.


On Wed, Feb 11, 2015 at 4:12 PM, Ashish Mukherjee <
ashish.mukher...@gmail.com> wrote:

> Hi,
>
> I am planning to use Spark for a Web-based adhoc reporting tool on massive
> date-sets on S3. Real-time queries with filters, aggregations and joins
> could be constructed from UI selections.
>
> Online documentation seems to suggest that SharkQL is deprecated and users
> should move away from it.  I understand Hive is generally not used for
> real-time querying and for Spark SQL to work with other data stores, tables
> need to be registered explicitly in code. Also, the This would not be
> suitable for a dynamic query construction scenario.
>
> For a real-time , dynamic querying scenario like mine what is the proper
> tool to be used with Spark SQL?
>
> Regards,
> Ashish
>



-- 

[image: Sigmoid Analytics] 

*Arush Kharbanda* || Technical Teamlead

ar...@sigmoidanalytics.com || www.sigmoidanalytics.com

Re: Question related to Spark SQL

Hi Ashish,

In order to answer your question , I assume that you are planning to
process data and cache them in the memory.If you are using a thrift server
that comes with Spark then you can query on top of it. And multiple
applications can use the cached data as internally all the requests go to
thrift server.

Spark exposes hive query language and allows you access its data through
spark .So you can consider using HiveQL for querying .

Thanks,
Vishnu

On Wed, Feb 11, 2015 at 4:12 PM, Ashish Mukherjee <
ashish.mukher...@gmail.com> wrote:

> Hi,
>
> I am planning to use Spark for a Web-based adhoc reporting tool on massive
> date-sets on S3. Real-time queries with filters, aggregations and joins
> could be constructed from UI selections.
>
> Online documentation seems to suggest that SharkQL is deprecated and users
> should move away from it.  I understand Hive is generally not used for
> real-time querying and for Spark SQL to work with other data stores, tables
> need to be registered explicitly in code. Also, the This would not be
> suitable for a dynamic query construction scenario.
>
> For a real-time , dynamic querying scenario like mine what is the proper
> tool to be used with Spark SQL?
>
> Regards,
> Ashish
>

Spark Streaming: Flume receiver with Kryo serialization

2015-02-11 Thread Antonio Jesus Navarro

Hi,  I want to include if possible Kryo serialization in a project and
first I'm trying to run FlumeEventCount with Kryo. If I comment  setAll
method, runs correctly, but if I use Kryo params it returns several errors.

15/02/11 11:42:16 ERROR SparkDeploySchedulerBackend: Asked to remove
non-existent executor 1
15/02/11 11:42:16 ERROR JobScheduler: Error running job streaming job
142365133 ms.0
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0
in stage 2.0 failed 1 times, most recent failure: Lost task 0.0 in stage
2.0 (TID 8, localhost): ExecutorLostFailure (executor 1 lost)


This is my code.

object flumeKryo {
  def main(args: Array[String]): Unit = {
val sparkConf = new SparkConf(true)
  .setMaster("spark://localhost:7077")
  .setAppName("TestKryo")










*.setAll(Map(  "spark.serializer" ->
"com.gmunoz.flumekryo.WrapperSerializer",
"spark.kryo.registrator" -> "com.gmunoz.flumekryo.MyRegistrator",
"spark.task.maxFailures" -> "1",  "spark.rdd.compress" ->
"true",  "spark.storage.memoryFraction" -> "1",
"spark.core.connection.ack.wait.timeout" -> "600",
"spark.akka.frameSize" -> "50")  )*

val sc = new SparkContext(sparkConf)

sc.addJar("/home/gmunoz/workspace/flumekryo/target/flumekryo-1.0-SNAPSHOT.jar")

val ssc = new StreamingContext(sc, Seconds(2))

// Create a flume stream
val stream = FlumeUtils.createPollingStream(ssc, "localhost",
11000, StorageLevel.MEMORY_ONLY_SER_2)

// Print out the count of events received from this server in each batch
stream.count().map(cnt => "Received " + cnt + " flume events.").print()

ssc.start()
ssc.awaitTermination()
  }
}

class MyRegistrator extends KryoRegistrator {
  override def registerClasses(kryo: Kryo) {
Console.err.println("# MyRegistrator called")
kryo.register(classOf[SparkFlumeEvent])
  }
}

class WrapperSerializer(conf: SparkConf) extends KryoSerializer(conf) {
  override def newKryo(): Kryo = {
println("## Called newKryo!")
super.newKryo()
  }


What am I doing wrong?

Thanks!

OutOfMemoryError with ramdom forest and small training dataset

2015-02-11 Thread poiuytrez

Hello guys, 

I am trying to run a Ramdom Forest on 30MB of data. I have a cluster of 4
machines. Each machine has 106 MB of RAM and 16 cores. 

I am getting: 
15/02/11 11:01:23 ERROR ActorSystemImpl: Uncaught fatal error from thread
[sparkDriver-akka.actor.default-dispatcher-3] shutting down ActorSystem
[sparkDriver]
java.lang.OutOfMemoryError: Java heap space

That's very weird. Any idea of what's wrong with my configuration? 

PS : I am running Spark 1.2



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/OutOfMemoryError-with-ramdom-forest-and-small-training-dataset-tp21598.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org

what is behind matrix multiplications?

2015-02-11 Thread Donbeo

In Spark it is possible to multiply a distribuited matrix  x and a local
matrix w

val x = new RowMatrix(distribuited_data)
val w: Matrix = Matrices.dense(local_data)
val result = x.multiply(w) . 

What is the process behind this command?  Is the matrix w replicated on each
worker?  Is there a reference that I can use for this? 

Thanks  a lot!



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/what-is-behind-matrix-multiplications-tp21599.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org

How to log using log4j to local file system inside a Spark application that runs on YARN?

2015-02-11 Thread Emre Sevinc

Hello,

I'm building an Apache Spark Streaming application and cannot make it log
to a file on the local filesystem *when running it on YARN*. How can
achieve this?

I've set log4.properties file so that it can successfully write to a log
file in /tmp directory on the local file system (shown below partially):

 log4j.appender.file=org.apache.log4j.FileAppender
 log4j.appender.file.File=/tmp/application.log
 log4j.appender.file.append=false
 log4j.appender.file.layout=org.apache.log4j.PatternLayout
 log4j.appender.file.layout.ConversionPattern=%d{-MM-dd HH:mm:ss}
%-5p %c{1}:%L - %m%n

When I run my Spark application locally by using the following command:

 spark-submit --class myModule.myClass --master local[2] --deploy-mode
client myApp.jar

It runs fine and I can see that log messages are written to
/tmp/application.log on my local file system.

But when I run the same application via YARN, e.g.

 spark-submit --class myModule.myClass --master yarn-client  --name
"myModule" --total-executor-cores 1 --executor-memory 1g myApp.jar

or

 spark-submit --class myModule.myClass --master yarn-cluster  --name
"myModule" --total-executor-cores 1 --executor-memory 1g myApp.jar

I cannot see any /tmp/application.log on the local file system of the
machine that runs YARN.

What am I missing?

-- 
Emre Sevinç

Re: Can't access remote Hive table from spark

2015-02-11 Thread guxiaobo1982

Hi Zhan,
My Single Node Cluster of Hadoop is installed by Ambari 1.7.0, I tried to 
create the /user/xiaobogu directory in hdfs, but both failed with user xiaobogu 
and root



[xiaobogu@lix1 current]$ hadoop dfs -mkdir /user/xiaobogu
 
DEPRECATED: Use of this script to execute hdfs command is deprecated.
 
Instead use the hdfs command for it.
 


 
mkdir: Permission denied: user=xiaobogu, access=WRITE, 
inode="/user":hdfs:hdfs:drwxr-xr-x




root@lix1 bin]# hadoop dfs -mkdir /user/xiaobogu

DEPRECATED: Use of this script to execute hdfs command is deprecated.

Instead use the hdfs command for it.




 

mkdir: Permission denied: user=root, access=WRITE, 
inode="/user":hdfs:hdfs:drwxr-xr-x




I notice there is a hdfs account created by ambari, but what's password for it, 
should I user the hdfs account to create the directory?








-- Original --
From:  "Zhan Zhang";;
Send time: Sunday, Feb 8, 2015 4:11 AM
To: ""; 
Cc: "user@spark.apache.org"; "Cheng 
Lian"; 
Subject:  Re: Can't access remote Hive table from spark



 Yes. You need to create xiaobogu under /user and provide right permission to 
xiaobogu. 
 
 Thanks.
 
 
 Zhan Zhang
 
  On Feb 7, 2015, at 8:15 AM, guxiaobo1982  wrote:
 
  Hi Zhan Zhang,
 
 
 With the pre-bulit version 1.2.0 of spark against the yarn cluster installed 
by ambari 1.7.0, I come with the following errors:
  
[xiaobogu@lix1 spark]$ ./bin/spark-submit --class 
org.apache.spark.examples.SparkPi--master yarn-cluster  --num-executors 3 
--driver-memory 512m  --executor-memory 512m   --executor-cores 1  
lib/spark-examples*.jar 10
 

 
 
Spark assembly has been built with Hive, including Datanucleus jars on classpath
 
15/02/08 00:11:53 WARN util.NativeCodeLoader: Unable to load native-hadoop 
library for your platform... using builtin-java classes where applicable
 
15/02/08 00:11:54 INFO client.RMProxy: Connecting to ResourceManager at 
lix1.bh.com/192.168.100.3:8050
 
15/02/08 00:11:56 INFO yarn.Client: Requesting a new application from cluster 
with 1 NodeManagers
 
15/02/08 00:11:57 INFO yarn.Client: Verifying our application has not requested 
more than the maximum memory capability of the cluster (4096 MB per container)
 
15/02/08 00:11:57 INFO yarn.Client: Will allocate AM container, with 896 MB 
memory including 384 MB overhead
 
15/02/08 00:11:57 INFO yarn.Client: Setting up container launch context for our 
AM
 
15/02/08 00:11:57 INFO yarn.Client: Preparing resources for our AM container
 
15/02/08 00:11:58 WARN hdfs.BlockReaderLocal: The short-circuit local reads 
feature cannot be used because libhadoop cannot be loaded.
 
Exception in thread "main" org.apache.hadoop.security.AccessControlException: 
Permission denied: user=xiaobogu, access=WRITE, 
inode="/user":hdfs:hdfs:drwxr-xr-x
 
at 
org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkFsPermission(FSPermissionChecker.java:271)
 
at 
org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:257)
 
at 
org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:238)
 
at 
org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:179)
 
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkPermission(FSNamesystem.java:6515)
 
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkPermission(FSNamesystem.java:6497)
 
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkAncestorAccess(FSNamesystem.java:6449)
 
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.mkdirsInternal(FSNamesystem.java:4251)
 
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.mkdirsInt(FSNamesystem.java:4221)
 
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.mkdirs(FSNamesystem.java:4194)
 
at 
org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.mkdirs(NameNodeRpcServer.java:813)
 
at 
org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.mkdirs(ClientNamenodeProtocolServerSideTranslatorPB.java:600)
 
at 
org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
 
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:619)
 
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:962)
 
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2039)
 
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2035)
 
at java.security.AccessController.doPrivileged(Native Method)
 
at javax.security.auth.Subject.doAs(Subject.java:415)
 
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
 
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2033)
 

 
 
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
 
at 
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
 
at 
sun.reflect.DelegatingConstructorAcces

Hive/Hbase for low latency

2015-02-11 Thread Siddharth Ubale

Hi ,

I am new to Spark . We have recently moved from Apache Storm to Apache Spark to 
build our OLAP tool .
Now ,earlier we were using Hbase & Phoenix.
We need to re-think what to use in case of Spark.
Should we go ahead with Hbase or Hive or Cassandra for query processing with 
Spark Sql.

Please share ur views.

Thanks,
Siddharth Ubale,

A signature in Logging.class refers to type Logger in package org.slf4j which is not available.

2015-02-11 Thread Todd

After compiling the Spark 1.2.0 codebase in Intellj Idea,  and run the LocalPi 
example，I got the following slf4j related issue. Does anyone know how to fix 
this? Thanks


Error:scalac: bad symbolic reference. A signature in Logging.class refers to 
type Logger
in package org.slf4j which is not available.
It may be completely missing from the current classpath, or the version on

the classpath might be incompatible with the version used when compiling 
Logging.class.

Re: Hive/Hbase for low latency

Hi Siddarth,

It depends on what you are trying to solve. But the connectivity for
cassandra and spark is good .

The answer depends upon what exactly you are trying to solve.

Thanks,
Vishnu

On Wed, Feb 11, 2015 at 7:47 PM, Siddharth Ubale <
siddharth.ub...@syncoms.com> wrote:

>  Hi ,
>
>
>
> I am new to Spark . We have recently moved from Apache Storm to Apache
> Spark to build our OLAP tool .
>
> Now ,earlier we were using Hbase & Phoenix.
>
> We need to re-think what to use in case of Spark.
>
> Should we go ahead with Hbase or Hive or Cassandra for query processing
> with Spark Sql.
>
>
>
> Please share ur views.
>
>
>
> Thanks,
>
> Siddharth Ubale,
>
>
>
>
>

Re: Hive/Hbase for low latency

2015-02-11 Thread Ted Yu

Connectivity to hbase is also avaliable. You can take a look at:

examples//src/main/python/hbase_inputformat.py
examples//src/main/python/hbase_outputformat.py
examples//src/main/scala/org/apache/spark/examples/HBaseTest.scala
examples//src/main/scala/org/apache/spark/examples/pythonconverters/HBaseConverters.scala

Cheers

On Wed, Feb 11, 2015 at 6:29 AM, VISHNU SUBRAMANIAN <
johnfedrickena...@gmail.com> wrote:

> Hi Siddarth,
>
> It depends on what you are trying to solve. But the connectivity for
> cassandra and spark is good .
>
> The answer depends upon what exactly you are trying to solve.
>
> Thanks,
> Vishnu
>
> On Wed, Feb 11, 2015 at 7:47 PM, Siddharth Ubale <
> siddharth.ub...@syncoms.com> wrote:
>
>>  Hi ,
>>
>>
>>
>> I am new to Spark . We have recently moved from Apache Storm to Apache
>> Spark to build our OLAP tool .
>>
>> Now ,earlier we were using Hbase & Phoenix.
>>
>> We need to re-think what to use in case of Spark.
>>
>> Should we go ahead with Hbase or Hive or Cassandra for query processing
>> with Spark Sql.
>>
>>
>>
>> Please share ur views.
>>
>>
>>
>> Thanks,
>>
>> Siddharth Ubale,
>>
>>
>>
>>
>>
>
>

Spark ML pipeline

2015-02-11 Thread Jianguo Li

Hi,

I really like the pipeline in the spark.ml in Spark1.2 release. Will there
be more machine learning algorithms implemented for the pipeline framework
in the next major release? Any idea when the next major release comes out?

Thanks,

Jianguo

Re: A signature in Logging.class refers to type Logger in package org.slf4j which is not available.

2015-02-11 Thread Ted Yu

Spark depends on slf4j 1.7.5

Please check your classpath and make sure slf4j is included.

Cheers

On Wed, Feb 11, 2015 at 6:20 AM, Todd  wrote:

> After compiling the Spark 1.2.0 codebase in Intellj Idea,  and run the
> LocalPi example，I got the following slf4j related issue. Does anyone know
> how to fix this? Thanks
>
>
> Error:scalac: bad symbolic reference. A signature in Logging.class refers
> to type Logger
> in package org.slf4j which is not available.
> It may be completely missing from the current classpath, or the version on
> the classpath might be incompatible with the version used when compiling
> Logging.class.
>
>
>
>

Re:Re: A signature in Logging.class refers to type Logger in package org.slf4j which is not available.

2015-02-11 Thread Todd

Thanks for the reply.
I have the following Maven dependencies which looks correct to me?

Maven: org.slf4j:slf4j-log4j12:1.7.5
Maven: org.slf4j:jcl-over-slf4j:1.7.5
Maven: org.slf4j:jul-to-slf4j:1.7.5
Maven: org.slf4j:slf4j-api:1.7.5
Maven: log4j:log4j:1.2.17







At 2015-02-11 23:27:54, "Ted Yu"  wrote:

Spark depends on slf4j 1.7.5


Please check your classpath and make sure slf4j is included.


Cheers


On Wed, Feb 11, 2015 at 6:20 AM, Todd  wrote:

After compiling the Spark 1.2.0 codebase in Intellj Idea,  and run the LocalPi 
example，I got the following slf4j related issue. Does anyone know how to fix 
this? Thanks


Error:scalac: bad symbolic reference. A signature in Logging.class refers to 
type Logger
in package org.slf4j which is not available.
It may be completely missing from the current classpath, or the version on

the classpath might be incompatible with the version used when compiling 
Logging.class.

Re: Why does spark write huge file into temporary local disk even without on-disk persist or checkpoint?

2015-02-11 Thread Peng Cheng

You are right. I've checked the overall stage metrics and looks like the
largest shuffling write is over 9G. The partition completed successfully
but its spilled file can't be removed until all others are finished.
It's very likely caused by a stupid mistake in my design. A lookup table
grows constantly in a loop, every time its union with a new increment will
results in both of them being reshuffled, and partitioner reverted to None.
This can never be efficient with existing API.

Re: spark sql registerFunction with 1.2.1

2015-02-11 Thread Yin Huai

Regarding backticks: Right. You need backticks to quote the column name
timestamp because timestamp is a reserved keyword in our parser.

On Tue, Feb 10, 2015 at 3:02 PM, Mohnish Kodnani 
wrote:

> actually i tried in spark shell , got same error and then for some reason
> i tried to back tick the "timestamp" and it worked.
>  val result = sqlContext.sql("select toSeconds(`timestamp`) as t,
> count(rid) as qps from blah group by toSeconds(`timestamp`),qi.clientName")
>
> so, it seems sql context is supporting UDF.
>
>
>
> On Tue, Feb 10, 2015 at 2:32 PM, Michael Armbrust 
> wrote:
>
>> The simple SQL parser doesn't yet support UDFs.  Try using a HiveContext.
>>
>> On Tue, Feb 10, 2015 at 1:44 PM, Mohnish Kodnani <
>> mohnish.kodn...@gmail.com> wrote:
>>
>>> Hi,
>>> I am trying a very simple registerFunction and it is giving me errors.
>>>
>>> I have a parquet file which I register as temp table.
>>> Then I define a UDF.
>>>
>>> def toSeconds(timestamp: Long): Long = timestamp/10
>>>
>>> sqlContext.registerFunction("toSeconds", toSeconds _)
>>>
>>> val result = sqlContext.sql("select toSeconds(timestamp) from blah");
>>> I get the following error.
>>> java.lang.RuntimeException: [1.18] failure: ``)'' expected but
>>> `timestamp' found
>>>
>>> select toSeconds(timestamp) from blah
>>>
>>> My end goal is as follows:
>>> We have log file with timestamps in microseconds and I would like to
>>> group by entries with second level precision, so eventually I want to run
>>> the query
>>> select toSeconds(timestamp) as t, count(x) from table group by t,x
>>>
>>>
>>>
>>>
>>
>

Re: Which version to use for shuffle service if I'm going to run multiple versions of Spark

2015-02-11 Thread Andrew Or

Hi Jianshi,

For YARN, there may be an issue with how a recently patch changes the
accessibility of the shuffle files by the external shuffle service:
https://issues.apache.org/jira/browse/SPARK-5655. It is likely that you
will hit this with 1.2.1, actually. For this reason I would have to
recommend that you use 1.2.2 when it is released, but for now you should
use 1.2.0 for this specific use case.

-Andrew

2015-02-10 23:38 GMT-08:00 Reynold Xin :

> I think we made the binary protocol compatible across all versions, so you
> should be fine with using any one of them. 1.2.1 is probably the best since
> it is the most recent stable release.
>
> On Tue, Feb 10, 2015 at 8:43 PM, Jianshi Huang 
> wrote:
>
>> Hi,
>>
>> I need to use branch-1.2 and sometimes master builds of Spark for my
>> project. However the officially supported Spark version by our Hadoop admin
>> is only 1.2.0.
>>
>> So, my question is which version/build of spark-yarn-shuffle.jar should I
>> use that works for all four versions? (1.2.0, 1.2.1, 1.2,2, 1.3.0)
>>
>> Thanks,
>> --
>> Jianshi Huang
>>
>> LinkedIn: jianshi
>> Twitter: @jshuang
>> Github & Blog: http://huangjs.github.com/
>>
>
>

Re: How can I read this avro file using spark & scala?

2015-02-11 Thread captainfranz

I am confused as to whether avro support was merged into Spark 1.2 or it is
still an independent library.
I see some people writing sqlContext.avroFile similarly to jsonFile but this
does not work for me, nor do I see this in the Scala docs.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/How-can-I-read-this-avro-file-using-spark-scala-tp19400p21601.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org

Re:Re: How can I read this avro file using spark & scala?

2015-02-11 Thread Todd

Databricks provides a sample code on its website...but i can't find it for now.








At 2015-02-12 00:43:07, "captainfranz"  wrote:
>I am confused as to whether avro support was merged into Spark 1.2 or it is
>still an independent library.
>I see some people writing sqlContext.avroFile similarly to jsonFile but this
>does not work for me, nor do I see this in the Scala docs.
>
>
>
>--
>View this message in context: 
>http://apache-spark-user-list.1001560.n3.nabble.com/How-can-I-read-this-avro-file-using-spark-scala-tp19400p21601.html
>Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
>-
>To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>For additional commands, e-mail: user-h...@spark.apache.org
>

Re: Re: How can I read this avro file using spark & scala?

Check this link.
https://github.com/databricks/spark-avro

Home page for Spark-avro project.

Thanks,
Vishnu

On Wed, Feb 11, 2015 at 10:19 PM, Todd  wrote:

> Databricks provides a sample code on its website...but i can't find it for
> now.
>
>
>
>
>
>
> At 2015-02-12 00:43:07, "captainfranz"  wrote:
> >I am confused as to whether avro support was merged into Spark 1.2 or it is
> >still an independent library.
> >I see some people writing sqlContext.avroFile similarly to jsonFile but this
> >does not work for me, nor do I see this in the Scala docs.
> >
> >
> >
> >--
> >View this message in context: 
> >http://apache-spark-user-list.1001560.n3.nabble.com/How-can-I-read-this-avro-file-using-spark-scala-tp19400p21601.html
> >Sent from the Apache Spark User List mailing list archive at Nabble.com.
> >
> >-
> >To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> >For additional commands, e-mail: user-h...@spark.apache.org
> >
>
>

SPARK_LOCAL_DIRS Issue

2015-02-11 Thread TJ Klein

Hi,

Using Spark 1.2 I ran into issued setting SPARK_LOCAL_DIRS to a different
path then local directory.

On our cluster we have a folder for temporary files (in a central file
system), which is called /scratch.

When setting SPARK_LOCAL_DIRS=/scratch/

I get:

 An error occurred while calling
z:org.apache.spark.api.python.PythonRDD.newAPIHadoopFile.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0
in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0
(TID 3, XXX): java.io.IOException: Function not implemented
at sun.nio.ch.FileDispatcherImpl.lock0(Native Method)
at sun.nio.ch.FileDispatcherImpl.lock(FileDispatcherImpl.java:91)
at sun.nio.ch.FileChannelImpl.lock(FileChannelImpl.java:1022)
at java.nio.channels.FileChannel.lock(FileChannel.java:1052)
at org.apache.spark.util.Utils$.fetchFile(Utils.scala:379)

Using SPARK_LOCAL_DIRS=/tmp, however, works perfectly. Any idea?

Best,
 Tassilo





--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/SPARK-LOCAL-DIRS-Issue-tp21602.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org

Re: Question related to Spark SQL

I dint mean that. When you try the above approach only one client will have
access to the cached data.

But when you expose your data through a thrift server the case is quite
different.

In the case of thrift server all the request goes to the thrift server and
spark will be able to take the advantage of caching.

That is Thrift server be your sole client to the spark cluster.

check this link
http://spark.apache.org/docs/1.1.0/sql-programming-guide.html#running-the-thrift-jdbc-server

Your applications can connect to your spark cluster through jdbc driver.It
works similar to your hive thrift server.

Thanks,
Vishnu

On Wed, Feb 11, 2015 at 10:31 PM, Ashish Mukherjee <
ashish.mukher...@gmail.com> wrote:

> Thanks for your reply, Vishnu.
>
> I assume you are suggesting I build Hive tables and cache them in memory
> and query on top of that for fast, real-time querying.
>
> Perhaps, I should write a generic piece of code like this and submit this
> as a Spark job with the SQL clause as an argument based on user selections
> on the Web interface -
>
> String sqlClause = args[0];
> ...
> JavaHiveContext sqlContext = new 
> org.apache.spark.sql.hive.api.java.HiveContext(sc);// Queries are expressed 
> in HiveQL.Row[] results = sqlContext.sql(sqlClause).collect();
>
>
> Is my understanding right?
>
> Regards,
> Ashish
>
> On Wed, Feb 11, 2015 at 4:42 PM, VISHNU SUBRAMANIAN <
> johnfedrickena...@gmail.com> wrote:
>
>> Hi Ashish,
>>
>> In order to answer your question , I assume that you are planning to
>> process data and cache them in the memory.If you are using a thrift server
>> that comes with Spark then you can query on top of it. And multiple
>> applications can use the cached data as internally all the requests go to
>> thrift server.
>>
>> Spark exposes hive query language and allows you access its data through
>> spark .So you can consider using HiveQL for querying .
>>
>> Thanks,
>> Vishnu
>>
>> On Wed, Feb 11, 2015 at 4:12 PM, Ashish Mukherjee <
>> ashish.mukher...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> I am planning to use Spark for a Web-based adhoc reporting tool on
>>> massive date-sets on S3. Real-time queries with filters, aggregations and
>>> joins could be constructed from UI selections.
>>>
>>> Online documentation seems to suggest that SharkQL is deprecated and
>>> users should move away from it.  I understand Hive is generally not used
>>> for real-time querying and for Spark SQL to work with other data stores,
>>> tables need to be registered explicitly in code. Also, the This would not
>>> be suitable for a dynamic query construction scenario.
>>>
>>> For a real-time , dynamic querying scenario like mine what is the proper
>>> tool to be used with Spark SQL?
>>>
>>> Regards,
>>> Ashish
>>>
>>
>>
>

Re: SPARK_LOCAL_DIRS Issue

A central location, such as NFS?

If they are temporary for the purpose of further job processing you'll want
to keep them local to the node in the cluster, i.e., in /tmp. If they are
centralized you won't be able to take advantage of data locality and the
central file store will become a bottleneck for further processing.

If /tmp isn't an option because you want to be able to monitor the file
outputs as they occur you can also use HDFS (assuming your Spark nodes are
also HDFS members they will benefit from data locality).

It looks like the problem you are seeing is that a lock cannot be acquired
on the output file in the central file system.

On Wed Feb 11 2015 at 11:55:55 AM TJ Klein  wrote:

> Hi,
>
> Using Spark 1.2 I ran into issued setting SPARK_LOCAL_DIRS to a different
> path then local directory.
>
> On our cluster we have a folder for temporary files (in a central file
> system), which is called /scratch.
>
> When setting SPARK_LOCAL_DIRS=/scratch/
>
> I get:
>
>  An error occurred while calling
> z:org.apache.spark.api.python.PythonRDD.newAPIHadoopFile.
> : org.apache.spark.SparkException: Job aborted due to stage failure: Task
> 0
> in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage
> 0.0
> (TID 3, XXX): java.io.IOException: Function not implemented
> at sun.nio.ch.FileDispatcherImpl.lock0(Native Method)
> at sun.nio.ch.FileDispatcherImpl.lock(FileDispatcherImpl.java:91)
> at sun.nio.ch.FileChannelImpl.lock(FileChannelImpl.java:1022)
> at java.nio.channels.FileChannel.lock(FileChannel.java:1052)
> at org.apache.spark.util.Utils$.fetchFile(Utils.scala:379)
>
> Using SPARK_LOCAL_DIRS=/tmp, however, works perfectly. Any idea?
>
> Best,
>  Tassilo
>
>
>
>
>
> --
> View this message in context: http://apache-spark-user-list.
> 1001560.n3.nabble.com/SPARK-LOCAL-DIRS-Issue-tp21602.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>

Re: Hive/Hbase for low latency

2015-02-11 Thread Ravi Kiran

Hi Siddharth,

 With v 4.3 of Phoenix, you can use the PhoenixInputFormat and
OutputFormat classes to pull/push to Phoenix from Spark.

HTH

Thanks
Ravi


On Wed, Feb 11, 2015 at 6:59 AM, Ted Yu  wrote:

> Connectivity to hbase is also avaliable. You can take a look at:
>
> examples//src/main/python/hbase_inputformat.py
> examples//src/main/python/hbase_outputformat.py
> examples//src/main/scala/org/apache/spark/examples/HBaseTest.scala
>
> examples//src/main/scala/org/apache/spark/examples/pythonconverters/HBaseConverters.scala
>
> Cheers
>
> On Wed, Feb 11, 2015 at 6:29 AM, VISHNU SUBRAMANIAN <
> johnfedrickena...@gmail.com> wrote:
>
>> Hi Siddarth,
>>
>> It depends on what you are trying to solve. But the connectivity for
>> cassandra and spark is good .
>>
>> The answer depends upon what exactly you are trying to solve.
>>
>> Thanks,
>> Vishnu
>>
>> On Wed, Feb 11, 2015 at 7:47 PM, Siddharth Ubale <
>> siddharth.ub...@syncoms.com> wrote:
>>
>>>  Hi ,
>>>
>>>
>>>
>>> I am new to Spark . We have recently moved from Apache Storm to Apache
>>> Spark to build our OLAP tool .
>>>
>>> Now ,earlier we were using Hbase & Phoenix.
>>>
>>> We need to re-think what to use in case of Spark.
>>>
>>> Should we go ahead with Hbase or Hive or Cassandra for query processing
>>> with Spark Sql.
>>>
>>>
>>>
>>> Please share ur views.
>>>
>>>
>>>
>>> Thanks,
>>>
>>> Siddharth Ubale,
>>>
>>>
>>>
>>>
>>>
>>
>>
>

getting the cluster elements from kmeans run

2015-02-11 Thread Harini Srinivasan

Hi, 

Is there a way to get the elements of each cluster after running kmeans 
clustering? I am using the Java version.



thanks

Re: getting the cluster elements from kmeans run

You can use model.predict(point) that will help you identify the cluster
center and map it to the point.

rdd.map(x => (x,model.predict(x)))

Thanks,
Vishnu

On Wed, Feb 11, 2015 at 11:06 PM, Harini Srinivasan 
wrote:

> Hi,
>
> Is there a way to get the elements of each cluster after running kmeans
> clustering? I am using the Java version.
>
>
> 
> thanks
>
>

Re: getting the cluster elements from kmeans run

2015-02-11 Thread Suneel Marthi

KMeansModel only returns the "cluster centroids".
To get the # of elements in each cluster, try calling kmeans.predict() on each 
of the points in the data used to build the model.
See 
https://github.com/OryxProject/oryx/blob/master/oryx-app-mllib/src/main/java/com/cloudera/oryx/app/mllib/kmeans/KMeansUpdate.java

Look at method fetchClusterCountsFromModel()

   
 

 From: Harini Srinivasan 
 To: user@spark.apache.org 
 Sent: Wednesday, February 11, 2015 12:36 PM
 Subject: getting the cluster elements from kmeans run
   
Hi, 

Is there a way to get the elements ofeach cluster after running kmeans 
clustering? I am using the Java version.



thanks

Re: OutOfMemoryError with ramdom forest and small training dataset

2015-02-11 Thread poiuytrez

cat ../hadoop/spark-install/conf/spark-env.sh 
export SCALA_HOME=/home/hadoop/scala-install 
export SPARK_WORKER_MEMORY=83971m 
export SPARK_MASTER_IP=spark-m 
export SPARK_DAEMON_MEMORY=15744m 
export SPARK_WORKER_DIR=/hadoop/spark/work 
export SPARK_LOCAL_DIRS=/hadoop/spark/tmp 
export SPARK_LOG_DIR=/hadoop/spark/logs 
export
SPARK_CLASSPATH=$SPARK_CLASSPATH:/home/hadoop/hadoop-install/lib/gcs-connector-1.3.2-hadoop1.jar
 
export MASTER=spark://spark-m:7077 

poiuytrez@spark-m:~$ cat ../hadoop/spark-install/conf/spark-defaults.conf 
spark.master spark://spark-m:7077 
spark.eventLog.enabled true 
spark.eventLog.dir gs://-spark/spark-eventlog-base/spark-m 
spark.executor.memory 83971m 
spark.yarn.executor.memoryOverhead 83971m 


I am using spark-submit.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/OutOfMemoryError-with-ramdom-forest-and-small-training-dataset-tp21598p21605.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org

Re: OutOfMemoryError with ramdom forest and small training dataset

2015-02-11 Thread poiuytrez

cat ../hadoop/spark-install/conf/spark-env.sh
export SCALA_HOME=/home/hadoop/scala-install
export SPARK_WORKER_MEMORY=83971m
export SPARK_MASTER_IP=spark-m
export SPARK_DAEMON_MEMORY=15744m
export SPARK_WORKER_DIR=/hadoop/spark/work
export SPARK_LOCAL_DIRS=/hadoop/spark/tmp
export SPARK_LOG_DIR=/hadoop/spark/logs
export
SPARK_CLASSPATH=$SPARK_CLASSPATH:/home/hadoop/hadoop-install/lib/gcs-connector-1.3.2-hadoop1.jar
export MASTER=spark://spark-m:7077

poiuytrez@spark-m:~$ cat ../hadoop/spark-install/conf/spark-defaults.conf
spark.master spark://spark-m:7077
spark.eventLog.enabled true
spark.eventLog.dir gs://databerries-spark/spark-eventlog-base/spark-m
spark.executor.memory 83971m
spark.yarn.executor.memoryOverhead 83971m


I am using spark-submit.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/OutOfMemoryError-with-ramdom-forest-and-small-training-dataset-tp21598p21604.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org

Re: Can't access remote Hive table from spark

2015-02-11 Thread Zhan Zhang

You need to have right hdfs account, e.g., hdfs,  to create directory and 
assign permission.

Thanks.

Zhan Zhang
On Feb 11, 2015, at 4:34 AM, guxiaobo1982 
mailto:guxiaobo1...@qq.com>> wrote:

Hi Zhan,
My Single Node Cluster of Hadoop is installed by Ambari 1.7.0, I tried to 
create the /user/xiaobogu directory in hdfs, but both failed with user xiaobogu 
and root

[xiaobogu@lix1 current]$ hadoop dfs -mkdir /user/xiaobogu
DEPRECATED: Use of this script to execute hdfs command is deprecated.
Instead use the hdfs command for it.

mkdir: Permission denied: user=xiaobogu, access=WRITE, 
inode="/user":hdfs:hdfs:drwxr-xr-x

root@lix1 bin]# hadoop dfs -mkdir /user/xiaobogu
DEPRECATED: Use of this script to execute hdfs command is deprecated.
Instead use the hdfs command for it.


mkdir: Permission denied: user=root, access=WRITE, 
inode="/user":hdfs:hdfs:drwxr-xr-x

I notice there is a hdfs account created by ambari, but what's password for it, 
should I user the hdfs account to create the directory?



-- Original --
From:  "Zhan Zhang";mailto:zzh...@hortonworks.com>>;
Send time: Sunday, Feb 8, 2015 4:11 AM
To: ""mailto:guxiaobo1...@qq.com>>;
Cc: 
"user@spark.apache.org"mailto:user@spark.apache.org>>;
 "Cheng Lian"mailto:lian.cs@gmail.com>>;
Subject:  Re: Can't access remote Hive table from spark

Yes. You need to create xiaobogu under /user and provide right permission to 
xiaobogu.

Thanks.

Zhan Zhang

On Feb 7, 2015, at 8:15 AM, guxiaobo1982 
mailto:guxiaobo1...@qq.com>> wrote:

Hi Zhan Zhang,

With the pre-bulit version 1.2.0 of spark against the yarn cluster installed by 
ambari 1.7.0, I come with the following errors:

[xiaobogu@lix1 spark]$ ./bin/spark-submit --class 
org.apache.spark.examples.SparkPi--master yarn-cluster  --num-executors 3 
--driver-memory 512m  --executor-memory 512m   --executor-cores 1  
lib/spark-examples*.jar 10


Spark assembly has been built with Hive, including Datanucleus jars on classpath

15/02/08 00:11:53 WARN util.NativeCodeLoader: Unable to load native-hadoop 
library for your platform... using builtin-java classes where applicable

15/02/08 00:11:54 INFO client.RMProxy: Connecting to ResourceManager at 
lix1.bh.com/192.168.100.3:8050

15/02/08 00:11:56 INFO yarn.Client: Requesting a new application from cluster 
with 1 NodeManagers

15/02/08 00:11:57 INFO yarn.Client: Verifying our application has not requested 
more than the maximum memory capability of the cluster (4096 MB per container)

15/02/08 00:11:57 INFO yarn.Client: Will allocate AM container, with 896 MB 
memory including 384 MB overhead

15/02/08 00:11:57 INFO yarn.Client: Setting up container launch context for our 
AM

15/02/08 00:11:57 INFO yarn.Client: Preparing resources for our AM container

15/02/08 00:11:58 WARN hdfs.BlockReaderLocal: The short-circuit local reads 
feature cannot be used because libhadoop cannot be loaded.

Exception in thread "main" org.apache.hadoop.security.AccessControlException: 
Permission denied: user=xiaobogu, access=WRITE, 
inode="/user":hdfs:hdfs:drwxr-xr-x

at 
org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkFsPermission(FSPermissionChecker.java:271)

at 
org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:257)

at 
org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:238)

at 
org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:179)

at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkPermission(FSNamesystem.java:6515)

at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkPermission(FSNamesystem.java:6497)

at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkAncestorAccess(FSNamesystem.java:6449)

at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.mkdirsInternal(FSNamesystem.java:4251)

at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.mkdirsInt(FSNamesystem.java:4221)

at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.mkdirs(FSNamesystem.java:4194)

at 
org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.mkdirs(NameNodeRpcServer.java:813)

at 
org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.mkdirs(ClientNamenodeProtocolServerSideTranslatorPB.java:600)

at 
org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)

at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:619)

at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:962)

at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2039)

at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2035)

at java.security.AccessController.doPrivileged(Native Method)

at javax.security.auth.Subject.doAs(Subject.java:415)

at 
org.apache.hadoop.security.UserGroupInformatio

Re: Can spark job server be used to visualize streaming data?

2015-02-11 Thread Su She

Thank you Felix and Kelvin. I think I'll def be using the k-means tools in
mlib.

It seems the best way to stream data is by storing in hbase and then using
an api in my viz to extract data? Does anyone have any thoughts on this?

Thanks!


On Tue, Feb 10, 2015 at 11:45 PM, Felix C  wrote:

>  Checkout
>
> https://databricks.com/blog/2015/01/28/introducing-streaming-k-means-in-spark-1-2.html
>
> In there are links to how that is done.
>
>
> --- Original Message ---
>
> From: "Kelvin Chu" <2dot7kel...@gmail.com>
> Sent: February 10, 2015 12:48 PM
> To: "Su She" 
> Cc: user@spark.apache.org
> Subject: Re: Can spark job server be used to visualize streaming data?
>
>  Hi Su,
>
>  Out of the box, no. But, I know people integrate it with Spark Streaming
> to do real-time visualization. It will take some work though.
>
>  Kelvin
>
> On Mon, Feb 9, 2015 at 5:04 PM, Su She  wrote:
>
>  Hello Everyone,
>
>  I was reading this blog post:
> http://homes.esat.kuleuven.be/~bioiuser/blog/a-d3-visualisation-from-spark-as-a-service/
>
>  and was wondering if this approach can be taken to visualize streaming
> data...not just historical data?
>
>  Thank you!
>
>  -Suh
>
>
>

iteratively modifying an RDD

2015-02-11 Thread rok

I was having trouble with memory exceptions when broadcasting a large lookup
table, so I've resorted to processing it iteratively -- but how can I modify
an RDD iteratively? 

I'm trying something like :

rdd = sc.parallelize(...)
lookup_tables = {...}

for lookup_table in lookup_tables : 
rdd = rdd.map(lambda x: func(x, lookup_table))

If I leave it as is, then only the last "lookup_table" is applied instead of
stringing together all the maps. However, if add a .cache() to the .map then
it seems to work fine. 

A second problem is that the runtime for each iteration roughly doubles at
each iteration so this clearly doesn't seem to be the way to do it. What is
the preferred way of doing such repeated modifications to an RDD and how can
the accumulation of overhead be minimized? 

Thanks!

Rok



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/iteratively-modifying-an-RDD-tp21606.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org

Re: Need a partner

2015-02-11 Thread Nagesh sarvepalli

Hello,

Hope below link helps you to kick-start. It has videos and hand-outs for
practice.

http://spark-summit.org/2014

Regards
Nagesh


On Wed, Feb 11, 2015 at 5:56 AM, prabeesh k  wrote:

> Also you can refer this course in edx: Introduction to Big Data with
> Apache Spark
> 
>
> 
>
> On 11 February 2015 at 02:54, Joanne Contact 
> wrote:
>
>> me too. I don't have enough nodes. Maybe we can set up a cluster together?
>>
>> On Tue, Feb 10, 2015 at 9:51 AM, Mohit Singh  wrote:
>>
>>> I would be interested too.
>>>
>>> On Tue, Feb 10, 2015 at 9:41 AM, Kartik Mehta 
>>> wrote:
>>>
 Hi Sami and fellow Spark friends,

 I too am looking for joint learning, online.

 I have set up spark but need to do on multi nodes on my home server. We
 can form a group and do group learning?

 Thanks,

 Kartik
 On Feb 10, 2015 11:52 AM, "King sami"  wrote:

> Hi,
>
> As I'm beginner in Spark, I'm looking for someone who's also beginner
> to learn and train on Spark together.
> Please contact me if interested
>
> Cordially,
>

>>>
>>>
>>> --
>>> Mohit
>>>
>>> "When you want success as badly as you want the air, then you will get
>>> it. There is no other secret of success."
>>> -Socrates
>>>
>>
>>
>

Re: How to log using log4j to local file system inside a Spark application that runs on YARN?

2015-02-11 Thread Marcelo Vanzin

For Yarn, you need to upload your log4j.properties separately from
your app's jar, because of some internal issues that are too boring to
explain here. :-)

Basically:

  spark-submit --master yarn --files log4j.properties blah blah blah

Having to keep it outside your app jar is sub-optimal, and I think
there's a bug filed to fix this, but so far no one has really spent
time looking at it.


On Wed, Feb 11, 2015 at 4:29 AM, Emre Sevinc  wrote:
> Hello,
>
> I'm building an Apache Spark Streaming application and cannot make it log to
> a file on the local filesystem when running it on YARN. How can achieve
> this?
>
> I've set log4.properties file so that it can successfully write to a log
> file in /tmp directory on the local file system (shown below partially):
>
>  log4j.appender.file=org.apache.log4j.FileAppender
>  log4j.appender.file.File=/tmp/application.log
>  log4j.appender.file.append=false
>  log4j.appender.file.layout=org.apache.log4j.PatternLayout
>  log4j.appender.file.layout.ConversionPattern=%d{-MM-dd HH:mm:ss} %-5p
> %c{1}:%L - %m%n
>
> When I run my Spark application locally by using the following command:
>
>  spark-submit --class myModule.myClass --master local[2] --deploy-mode
> client myApp.jar
>
> It runs fine and I can see that log messages are written to
> /tmp/application.log on my local file system.
>
> But when I run the same application via YARN, e.g.
>
>  spark-submit --class myModule.myClass --master yarn-client  --name
> "myModule" --total-executor-cores 1 --executor-memory 1g myApp.jar
>
> or
>
>  spark-submit --class myModule.myClass --master yarn-cluster  --name
> "myModule" --total-executor-cores 1 --executor-memory 1g myApp.jar
>
> I cannot see any /tmp/application.log on the local file system of the
> machine that runs YARN.
>
> What am I missing?
>
>
> --
> Emre Sevinç



-- 
Marcelo

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org

Re: iteratively modifying an RDD

If you use mapPartitions to iterate the lookup_tables does that improve the
performance?

This link is to Spark docs 1.1 because both latest and 1.2 for Python give
me a 404:
http://spark.apache.org/docs/1.1.0/api/python/pyspark.rdd.RDD-class.html#mapPartitions

On Wed Feb 11 2015 at 1:48:42 PM rok  wrote:

> I was having trouble with memory exceptions when broadcasting a large
> lookup
> table, so I've resorted to processing it iteratively -- but how can I
> modify
> an RDD iteratively?
>
> I'm trying something like :
>
> rdd = sc.parallelize(...)
> lookup_tables = {...}
>
> for lookup_table in lookup_tables :
> rdd = rdd.map(lambda x: func(x, lookup_table))
>
> If I leave it as is, then only the last "lookup_table" is applied instead
> of
> stringing together all the maps. However, if add a .cache() to the .map
> then
> it seems to work fine.
>
> A second problem is that the runtime for each iteration roughly doubles at
> each iteration so this clearly doesn't seem to be the way to do it. What is
> the preferred way of doing such repeated modifications to an RDD and how
> can
> the accumulation of overhead be minimized?
>
> Thanks!
>
> Rok
>
>
>
> --
> View this message in context: http://apache-spark-user-list.
> 1001560.n3.nabble.com/iteratively-modifying-an-RDD-tp21606.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>

Re: pyspark: Java null pointer exception when accessing broadcast variables

Could you share a short script to reproduce this problem?

On Tue, Feb 10, 2015 at 8:55 PM, Rok Roskar  wrote:
> I didn't notice other errors -- I also thought such a large broadcast is a
> bad idea but I tried something similar with a much smaller dictionary and
> encountered the same problem. I'm not familiar enough with spark internals
> to know whether the trace indicates an issue with the broadcast variables or
> perhaps something different?
>
> The driver and executors have 50gb of ram so memory should be fine.
>
> Thanks,
>
> Rok
>
> On Feb 11, 2015 12:19 AM, "Davies Liu"  wrote:
>>
>> It's brave to broadcast 8G pickled data, it will take more than 15G in
>> memory for each Python worker,
>> how much memory do you have in executor and driver?
>> Do you see any other exceptions in driver and executors? Something
>> related to serialization in JVM.
>>
>> On Tue, Feb 10, 2015 at 2:16 PM, Rok Roskar  wrote:
>> > I get this in the driver log:
>>
>> I think this should happen on executor, or you called first() or
>> take() on the RDD?
>>
>> > java.lang.NullPointerException
>> > at
>> > org.apache.spark.api.python.PythonRDD$.writeUTF(PythonRDD.scala:590)
>> > at
>> > org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(PythonRDD.scala:233)
>> > at
>> > org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(PythonRDD.scala:229)
>> > at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>> > at
>> > scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>> > at
>> > scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
>> > at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
>> > at
>> > org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply$mcV$sp(PythonRDD.scala:229)
>> > at
>> > org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:204)
>> > at
>> > org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:204)
>> > at
>> > org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1460)
>> > at
>> > org.apache.spark.api.python.PythonRDD$WriterThread.run(PythonRDD.scala:203)
>> >
>> > and on one of the executor's stderr:
>> >
>> > 15/02/10 23:10:35 ERROR PythonRDD: Python worker exited unexpectedly
>> > (crashed)
>> > org.apache.spark.api.python.PythonException: Traceback (most recent call
>> > last):
>> >   File
>> > "/cluster/home/roskarr/spark-1.2.0-bin-hadoop2.4/python/pyspark/worker.py",
>> > line 57, in main
>> > split_index = read_int(infile)
>> >   File
>> > "/cluster/home/roskarr/spark-1.2.0-bin-hadoop2.4/python/pyspark/serializers.py",
>> > line 511, in read_int
>> > raise EOFError
>> > EOFError
>> >
>> > at
>> > org.apache.spark.api.python.PythonRDD$$anon$1.read(PythonRDD.scala:137)
>> > at
>> > org.apache.spark.api.python.PythonRDD$$anon$1.(PythonRDD.scala:174)
>> > at
>> > org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:96)
>> > at
>> > org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
>> > at
>> > org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:61)
>> > at org.apache.spark.rdd.RDD.iterator(RDD.scala:228)
>> > at
>> > org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply$mcV$sp(PythonRDD.scala:242)
>> > at
>> > org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:204)
>> > at
>> > org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:204)
>> > at
>> > org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1460)
>> > at
>> > org.apache.spark.api.python.PythonRDD$WriterThread.run(PythonRDD.scala:203)
>> > Caused by: java.lang.NullPointerException
>> > at
>> > org.apache.spark.api.python.PythonRDD$.writeUTF(PythonRDD.scala:590)
>> > at
>> > org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(PythonRDD.scala:233)
>> > at
>> > org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(PythonRDD.scala:229)
>> > at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>> > at
>> > scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>> > at
>> > scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
>> > at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
>> > at
>> > org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply$mcV$sp(PythonRDD.scala:229)
>> > ... 4 more
>> > 15/02/10 23:10:35 ERROR PythonRDD: Python worker exited unexpectedly
>> > (crashed)
>> > org.apache.spark.api.python.PythonException: Traceback (most recent call
>> > last):
>> >   File
>> > "/clus

Re: iteratively modifying an RDD

Yes I actually do use mapPartitions already
On Feb 11, 2015 7:55 PM, "Charles Feduke"  wrote:

> If you use mapPartitions to iterate the lookup_tables does that improve
> the performance?
>
> This link is to Spark docs 1.1 because both latest and 1.2 for Python give
> me a 404:
> http://spark.apache.org/docs/1.1.0/api/python/pyspark.rdd.RDD-class.html#mapPartitions
>
> On Wed Feb 11 2015 at 1:48:42 PM rok  wrote:
>
>> I was having trouble with memory exceptions when broadcasting a large
>> lookup
>> table, so I've resorted to processing it iteratively -- but how can I
>> modify
>> an RDD iteratively?
>>
>> I'm trying something like :
>>
>> rdd = sc.parallelize(...)
>> lookup_tables = {...}
>>
>> for lookup_table in lookup_tables :
>> rdd = rdd.map(lambda x: func(x, lookup_table))
>>
>> If I leave it as is, then only the last "lookup_table" is applied instead
>> of
>> stringing together all the maps. However, if add a .cache() to the .map
>> then
>> it seems to work fine.
>>
>> A second problem is that the runtime for each iteration roughly doubles at
>> each iteration so this clearly doesn't seem to be the way to do it. What
>> is
>> the preferred way of doing such repeated modifications to an RDD and how
>> can
>> the accumulation of overhead be minimized?
>>
>> Thanks!
>>
>> Rok
>>
>>
>>
>> --
>> View this message in context: http://apache-spark-user-list.
>> 1001560.n3.nabble.com/iteratively-modifying-an-RDD-tp21606.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>>

Re: iteratively modifying an RDD

We have moved to use Sphinx to generate the Python API docs, so the
link is different than 1.0/1

http://spark.apache.org/docs/1.2.0/api/python/pyspark.html#pyspark.RDD.mapPartitions

On Wed, Feb 11, 2015 at 10:55 AM, Charles Feduke
 wrote:
> If you use mapPartitions to iterate the lookup_tables does that improve the
> performance?
>
> This link is to Spark docs 1.1 because both latest and 1.2 for Python give
> me a 404:
> http://spark.apache.org/docs/1.1.0/api/python/pyspark.rdd.RDD-class.html#mapPartitions
>
> On Wed Feb 11 2015 at 1:48:42 PM rok  wrote:
>>
>> I was having trouble with memory exceptions when broadcasting a large
>> lookup
>> table, so I've resorted to processing it iteratively -- but how can I
>> modify
>> an RDD iteratively?
>>
>> I'm trying something like :
>>
>> rdd = sc.parallelize(...)
>> lookup_tables = {...}
>>
>> for lookup_table in lookup_tables :
>> rdd = rdd.map(lambda x: func(x, lookup_table))
>>
>> If I leave it as is, then only the last "lookup_table" is applied instead
>> of
>> stringing together all the maps. However, if add a .cache() to the .map
>> then
>> it seems to work fine.
>>
>> A second problem is that the runtime for each iteration roughly doubles at
>> each iteration so this clearly doesn't seem to be the way to do it. What
>> is
>> the preferred way of doing such repeated modifications to an RDD and how
>> can
>> the accumulation of overhead be minimized?
>>
>> Thanks!
>>
>> Rok
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/iteratively-modifying-an-RDD-tp21606.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org

Re: Open file limit settings for Spark on Yarn job

2015-02-11 Thread Arun Luthra

I'm using Spark 1.1.0 with sort-based shuffle.

I found that I can work around the issue by applying repartition(N) with a
small enough N after creating the RDD, though I'm losing some
speed/parallelism by doing this. For my algorithm I need to stay with
groupByKey.

On Tue, Feb 10, 2015 at 11:41 PM, Felix C  wrote:

>  Alternatively, is there another way to do it?
> groupByKey has been called out as expensive and should be avoid (it causes
> shuffling of data).
>
> I've generally found it possible to use reduceByKey instead
>
> --- Original Message ---
>
> From: "Arun Luthra" 
> Sent: February 10, 2015 1:16 PM
> To: user@spark.apache.org
> Subject: Open file limit settings for Spark on Yarn job
>
>  Hi,
>
>  I'm running Spark on Yarn from an edge node, and the tasks on the run
> Data Nodes. My job fails with the "Too many open files" error once it gets
> to groupByKey(). Alternatively I can make it fail immediately if I
> repartition the data when I create the RDD.
>
>  Where do I need to make sure that ulimit -n is high enough?
>
>  On the edge node it is small, 1024, but on the data nodes, the "yarn"
> user has a high limit, 32k. But is the yarn user the relevant user? And, is
> the 1024 limit for myself on the edge node a problem or is that limit not
> relevant?
>
>  Arun
>

Re: pyspark: Java null pointer exception when accessing broadcast variables

I think the problem was related to the broadcasts being too large -- I've
now split it up into many smaller operations but it's still not quite there
-- see
http://apache-spark-user-list.1001560.n3.nabble.com/iteratively-modifying-an-RDD-td21606.html

Thanks,

Rok

On Wed, Feb 11, 2015, 19:59 Davies Liu  wrote:

> Could you share a short script to reproduce this problem?
>
> On Tue, Feb 10, 2015 at 8:55 PM, Rok Roskar  wrote:
> > I didn't notice other errors -- I also thought such a large broadcast is
> a
> > bad idea but I tried something similar with a much smaller dictionary and
> > encountered the same problem. I'm not familiar enough with spark
> internals
> > to know whether the trace indicates an issue with the broadcast
> variables or
> > perhaps something different?
> >
> > The driver and executors have 50gb of ram so memory should be fine.
> >
> > Thanks,
> >
> > Rok
> >
> > On Feb 11, 2015 12:19 AM, "Davies Liu"  wrote:
> >>
> >> It's brave to broadcast 8G pickled data, it will take more than 15G in
> >> memory for each Python worker,
> >> how much memory do you have in executor and driver?
> >> Do you see any other exceptions in driver and executors? Something
> >> related to serialization in JVM.
> >>
> >> On Tue, Feb 10, 2015 at 2:16 PM, Rok Roskar 
> wrote:
> >> > I get this in the driver log:
> >>
> >> I think this should happen on executor, or you called first() or
> >> take() on the RDD?
> >>
> >> > java.lang.NullPointerException
> >> > at
> >> > org.apache.spark.api.python.PythonRDD$.writeUTF(PythonRDD.scala:590)
> >> > at
> >> > org.apache.spark.api.python.PythonRDD$WriterThread$$
> anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(PythonRDD.scala:233)
> >> > at
> >> > org.apache.spark.api.python.PythonRDD$WriterThread$$
> anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(PythonRDD.scala:229)
> >> > at scala.collection.Iterator$class.foreach(Iterator.scala:
> 727)
> >> > at
> >> > scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
> >> > at
> >> > scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
> >> > at scala.collection.AbstractIterable.foreach(
> Iterable.scala:54)
> >> > at
> >> > org.apache.spark.api.python.PythonRDD$WriterThread$$
> anonfun$run$1.apply$mcV$sp(PythonRDD.scala:229)
> >> > at
> >> > org.apache.spark.api.python.PythonRDD$WriterThread$$
> anonfun$run$1.apply(PythonRDD.scala:204)
> >> > at
> >> > org.apache.spark.api.python.PythonRDD$WriterThread$$
> anonfun$run$1.apply(PythonRDD.scala:204)
> >> > at
> >> > org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1460)
> >> > at
> >> > org.apache.spark.api.python.PythonRDD$WriterThread.run(
> PythonRDD.scala:203)
> >> >
> >> > and on one of the executor's stderr:
> >> >
> >> > 15/02/10 23:10:35 ERROR PythonRDD: Python worker exited unexpectedly
> >> > (crashed)
> >> > org.apache.spark.api.python.PythonException: Traceback (most recent
> call
> >> > last):
> >> >   File
> >> > "/cluster/home/roskarr/spark-1.2.0-bin-hadoop2.4/python/
> pyspark/worker.py",
> >> > line 57, in main
> >> > split_index = read_int(infile)
> >> >   File
> >> > "/cluster/home/roskarr/spark-1.2.0-bin-hadoop2.4/python/
> pyspark/serializers.py",
> >> > line 511, in read_int
> >> > raise EOFError
> >> > EOFError
> >> >
> >> > at
> >> > org.apache.spark.api.python.PythonRDD$$anon$1.read(
> PythonRDD.scala:137)
> >> > at
> >> > org.apache.spark.api.python.PythonRDD$$anon$1.(
> PythonRDD.scala:174)
> >> > at
> >> > org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:96)
> >> > at
> >> > org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
> >> > at
> >> > org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:61)
> >> > at org.apache.spark.rdd.RDD.iterator(RDD.scala:228)
> >> > at
> >> > org.apache.spark.api.python.PythonRDD$WriterThread$$
> anonfun$run$1.apply$mcV$sp(PythonRDD.scala:242)
> >> > at
> >> > org.apache.spark.api.python.PythonRDD$WriterThread$$
> anonfun$run$1.apply(PythonRDD.scala:204)
> >> > at
> >> > org.apache.spark.api.python.PythonRDD$WriterThread$$
> anonfun$run$1.apply(PythonRDD.scala:204)
> >> > at
> >> > org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1460)
> >> > at
> >> > org.apache.spark.api.python.PythonRDD$WriterThread.run(
> PythonRDD.scala:203)
> >> > Caused by: java.lang.NullPointerException
> >> > at
> >> > org.apache.spark.api.python.PythonRDD$.writeUTF(PythonRDD.scala:590)
> >> > at
> >> > org.apache.spark.api.python.PythonRDD$WriterThread$$
> anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(PythonRDD.scala:233)
> >> > at
> >> > org.apache.spark.api.python.PythonRDD$WriterThread$$
> anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(PythonRDD.scala:229)
> >> > at scala.collection.Iterator$class.foreach(Iterator.scala:
> 727)
> >> > at
>

Re: SPARK_LOCAL_DIRS Issue

2015-02-11 Thread Tassilo Klein

Thanks for the info. The file system in use is a Lustre file system.

Best,
 Tassilo

On Wed, Feb 11, 2015 at 12:15 PM, Charles Feduke 
wrote:

> A central location, such as NFS?
>
> If they are temporary for the purpose of further job processing you'll
> want to keep them local to the node in the cluster, i.e., in /tmp. If they
> are centralized you won't be able to take advantage of data locality and
> the central file store will become a bottleneck for further processing.
>
> If /tmp isn't an option because you want to be able to monitor the file
> outputs as they occur you can also use HDFS (assuming your Spark nodes are
> also HDFS members they will benefit from data locality).
>
> It looks like the problem you are seeing is that a lock cannot be acquired
> on the output file in the central file system.
>
> On Wed Feb 11 2015 at 11:55:55 AM TJ Klein  wrote:
>
>> Hi,
>>
>> Using Spark 1.2 I ran into issued setting SPARK_LOCAL_DIRS to a different
>> path then local directory.
>>
>> On our cluster we have a folder for temporary files (in a central file
>> system), which is called /scratch.
>>
>> When setting SPARK_LOCAL_DIRS=/scratch/
>>
>> I get:
>>
>>  An error occurred while calling
>> z:org.apache.spark.api.python.PythonRDD.newAPIHadoopFile.
>> : org.apache.spark.SparkException: Job aborted due to stage failure:
>> Task 0
>> in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage
>> 0.0
>> (TID 3, XXX): java.io.IOException: Function not implemented
>> at sun.nio.ch.FileDispatcherImpl.lock0(Native Method)
>> at sun.nio.ch.FileDispatcherImpl.lock(FileDispatcherImpl.java:91)
>> at sun.nio.ch.FileChannelImpl.lock(FileChannelImpl.java:1022)
>> at java.nio.channels.FileChannel.lock(FileChannel.java:1052)
>> at org.apache.spark.util.Utils$.fetchFile(Utils.scala:379)
>>
>> Using SPARK_LOCAL_DIRS=/tmp, however, works perfectly. Any idea?
>>
>> Best,
>>  Tassilo
>>
>>
>>
>>
>>
>> --
>> View this message in context: http://apache-spark-user-list.
>> 1001560.n3.nabble.com/SPARK-LOCAL-DIRS-Issue-tp21602.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>>

Re: spark sql registerFunction with 1.2.1

2015-02-11 Thread Mohnish Kodnani

that explains a lot...
Is there a list of reserved keywords ?


On Wed, Feb 11, 2015 at 7:56 AM, Yin Huai  wrote:

> Regarding backticks: Right. You need backticks to quote the column name
> timestamp because timestamp is a reserved keyword in our parser.
>
> On Tue, Feb 10, 2015 at 3:02 PM, Mohnish Kodnani <
> mohnish.kodn...@gmail.com> wrote:
>
>> actually i tried in spark shell , got same error and then for some reason
>> i tried to back tick the "timestamp" and it worked.
>>  val result = sqlContext.sql("select toSeconds(`timestamp`) as t,
>> count(rid) as qps from blah group by toSeconds(`timestamp`),qi.clientName")
>>
>> so, it seems sql context is supporting UDF.
>>
>>
>>
>> On Tue, Feb 10, 2015 at 2:32 PM, Michael Armbrust > > wrote:
>>
>>> The simple SQL parser doesn't yet support UDFs.  Try using a HiveContext.
>>>
>>> On Tue, Feb 10, 2015 at 1:44 PM, Mohnish Kodnani <
>>> mohnish.kodn...@gmail.com> wrote:
>>>
 Hi,
 I am trying a very simple registerFunction and it is giving me errors.

 I have a parquet file which I register as temp table.
 Then I define a UDF.

 def toSeconds(timestamp: Long): Long = timestamp/10

 sqlContext.registerFunction("toSeconds", toSeconds _)

 val result = sqlContext.sql("select toSeconds(timestamp) from blah");
 I get the following error.
 java.lang.RuntimeException: [1.18] failure: ``)'' expected but
 `timestamp' found

 select toSeconds(timestamp) from blah

 My end goal is as follows:
 We have log file with timestamps in microseconds and I would like to
 group by entries with second level precision, so eventually I want to run
 the query
 select toSeconds(timestamp) as t, count(x) from table group by t,x




>>>
>>
>

Re: iteratively modifying an RDD

On Wed, Feb 11, 2015 at 10:47 AM, rok  wrote:
> I was having trouble with memory exceptions when broadcasting a large lookup
> table, so I've resorted to processing it iteratively -- but how can I modify
> an RDD iteratively?
>
> I'm trying something like :
>
> rdd = sc.parallelize(...)
> lookup_tables = {...}
>
> for lookup_table in lookup_tables :
> rdd = rdd.map(lambda x: func(x, lookup_table))
>
> If I leave it as is, then only the last "lookup_table" is applied instead of
> stringing together all the maps. However, if add a .cache() to the .map then
> it seems to work fine.

This is the something related to Python closure implementation, you should
do it like this:

def create_func(lookup_table):
 return lambda x: func(x, lookup_table)

for lookup_table in lookup_tables:
rdd = rdd.map(create_func(lookup_table))

The Python closure just remember the variable, not copy the value of it.
In the loop, `lookup_table` is the same variable. When we serialize the final
rdd, all the closures are referring to the same `lookup_table`, which points
to the last value.

When we create the closure in a function, Python create a variable for
each closure, so it works.

> A second problem is that the runtime for each iteration roughly doubles at
> each iteration so this clearly doesn't seem to be the way to do it. What is
> the preferred way of doing such repeated modifications to an RDD and how can
> the accumulation of overhead be minimized?
>
> Thanks!
>
> Rok
>
>
>
> --
> View this message in context: 
> http://apache-spark-user-list.1001560.n3.nabble.com/iteratively-modifying-an-RDD-tp21606.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org

exception with json4s render

2015-02-11 Thread Jonathan Haddad

I'm trying to use the json4s library in a spark job to push data back into
kafka.  Everything was working fine when I was hard coding a string, but
now that I'm trying to render a string from a simple map it's failing.  The
code works in sbt console.

working console code:
https://gist.github.com/rustyrazorblade/daa50bf05ff0d48ac6af

failing spark job line:
https://github.com/rustyrazorblade/killranalytics/blob/master/spark/src/main/scala/RawEventProcessing.scala#L114

exception: https://gist.github.com/rustyrazorblade/1e220d87d41cfcad2bb9

I've seen examples of using render / compact when I searched the ML
archives, so I'm kind of at a loss here.

Thanks in advance for any help.

Jon

Re: what is behind matrix multiplications?

2015-02-11 Thread Reza Zadeh

Yes, the local matrix is broadcast to each worker. Here is the code:
https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/linalg/distributed/RowMatrix.scala#L407

In 1.3 we will have Block matrix multiplication too, which will allow
distributed matrix multiplication. This is thanks to Burak Yavuz from
Stanford and now at Databricks.

On Wed, Feb 11, 2015 at 4:12 AM, Donbeo  wrote:

> In Spark it is possible to multiply a distribuited matrix  x and a local
> matrix w
>
> val x = new RowMatrix(distribuited_data)
> val w: Matrix = Matrices.dense(local_data)
> val result = x.multiply(w) .
>
> What is the process behind this command?  Is the matrix w replicated on
> each
> worker?  Is there a reference that I can use for this?
>
> Thanks  a lot!
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/what-is-behind-matrix-multiplications-tp21599.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>

Re: exception with json4s render

2015-02-11 Thread Mohnish Kodnani

I was getting similar error after I upgraded to spark 1.2.1 from 1.1.1
Are you by any chance using json4s 3.2.11.
I downgraded to 3.2.10 and that seemed to have worked. But I didnt try to
spend much time debugging the issue than that.



On Wed, Feb 11, 2015 at 11:13 AM, Jonathan Haddad  wrote:

> I'm trying to use the json4s library in a spark job to push data back into
> kafka.  Everything was working fine when I was hard coding a string, but
> now that I'm trying to render a string from a simple map it's failing.  The
> code works in sbt console.
>
> working console code:
> https://gist.github.com/rustyrazorblade/daa50bf05ff0d48ac6af
>
> failing spark job line:
> https://github.com/rustyrazorblade/killranalytics/blob/master/spark/src/main/scala/RawEventProcessing.scala#L114
>
> exception: https://gist.github.com/rustyrazorblade/1e220d87d41cfcad2bb9
>
> I've seen examples of using render / compact when I searched the ML
> archives, so I'm kind of at a loss here.
>
> Thanks in advance for any help.
>
> Jon
>

Re: SPARK_LOCAL_DIRS Issue

Take a look at this:

http://wiki.lustre.org/index.php/Running_Hadoop_with_Lustre

Particularly: http://wiki.lustre.org/images/1/1b/Hadoop_wp_v0.4.2.pdf
(linked from that article)

to get a better idea of what your options are.

If its possible to avoid writing to [any] disk I'd recommend that route,
since that's the performance advantage Spark has over vanilla Hadoop.

On Wed Feb 11 2015 at 2:10:36 PM Tassilo Klein  wrote:

> Thanks for the info. The file system in use is a Lustre file system.
>
> Best,
>  Tassilo
>
> On Wed, Feb 11, 2015 at 12:15 PM, Charles Feduke  > wrote:
>
>> A central location, such as NFS?
>>
>> If they are temporary for the purpose of further job processing you'll
>> want to keep them local to the node in the cluster, i.e., in /tmp. If they
>> are centralized you won't be able to take advantage of data locality and
>> the central file store will become a bottleneck for further processing.
>>
>> If /tmp isn't an option because you want to be able to monitor the file
>> outputs as they occur you can also use HDFS (assuming your Spark nodes are
>> also HDFS members they will benefit from data locality).
>>
>> It looks like the problem you are seeing is that a lock cannot be
>> acquired on the output file in the central file system.
>>
>> On Wed Feb 11 2015 at 11:55:55 AM TJ Klein  wrote:
>>
>>> Hi,
>>>
>>> Using Spark 1.2 I ran into issued setting SPARK_LOCAL_DIRS to a different
>>> path then local directory.
>>>
>>> On our cluster we have a folder for temporary files (in a central file
>>> system), which is called /scratch.
>>>
>>> When setting SPARK_LOCAL_DIRS=/scratch/
>>>
>>> I get:
>>>
>>>  An error occurred while calling
>>> z:org.apache.spark.api.python.PythonRDD.newAPIHadoopFile.
>>> : org.apache.spark.SparkException: Job aborted due to stage failure:
>>> Task 0
>>> in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage
>>> 0.0
>>> (TID 3, XXX): java.io.IOException: Function not implemented
>>> at sun.nio.ch.FileDispatcherImpl.lock0(Native Method)
>>> at sun.nio.ch.FileDispatcherImpl.lock(FileDispatcherImpl.java:
>>> 91)
>>> at sun.nio.ch.FileChannelImpl.lock(FileChannelImpl.java:1022)
>>> at java.nio.channels.FileChannel.lock(FileChannel.java:1052)
>>> at org.apache.spark.util.Utils$.fetchFile(Utils.scala:379)
>>>
>>> Using SPARK_LOCAL_DIRS=/tmp, however, works perfectly. Any idea?
>>>
>>> Best,
>>>  Tassilo
>>>
>>>
>>>
>>>
>>>
>>> --
>>> View this message in context: http://apache-spark-user-list.
>>> 1001560.n3.nabble.com/SPARK-LOCAL-DIRS-Issue-tp21602.html
>>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>>
>>> -
>>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>>> For additional commands, e-mail: user-h...@spark.apache.org
>>>
>>>
>

Re: SPARK_LOCAL_DIRS Issue

2015-02-11 Thread Tassilo Klein

Thanks a lot. I will have a look at it.

On Wed, Feb 11, 2015 at 2:20 PM, Charles Feduke 
wrote:

> Take a look at this:
>
> http://wiki.lustre.org/index.php/Running_Hadoop_with_Lustre
>
> Particularly: http://wiki.lustre.org/images/1/1b/Hadoop_wp_v0.4.2.pdf
> (linked from that article)
>
> to get a better idea of what your options are.
>
> If its possible to avoid writing to [any] disk I'd recommend that route,
> since that's the performance advantage Spark has over vanilla Hadoop.
>
> On Wed Feb 11 2015 at 2:10:36 PM Tassilo Klein  wrote:
>
>> Thanks for the info. The file system in use is a Lustre file system.
>>
>> Best,
>>  Tassilo
>>
>> On Wed, Feb 11, 2015 at 12:15 PM, Charles Feduke <
>> charles.fed...@gmail.com> wrote:
>>
>>> A central location, such as NFS?
>>>
>>> If they are temporary for the purpose of further job processing you'll
>>> want to keep them local to the node in the cluster, i.e., in /tmp. If they
>>> are centralized you won't be able to take advantage of data locality and
>>> the central file store will become a bottleneck for further processing.
>>>
>>> If /tmp isn't an option because you want to be able to monitor the file
>>> outputs as they occur you can also use HDFS (assuming your Spark nodes are
>>> also HDFS members they will benefit from data locality).
>>>
>>> It looks like the problem you are seeing is that a lock cannot be
>>> acquired on the output file in the central file system.
>>>
>>> On Wed Feb 11 2015 at 11:55:55 AM TJ Klein  wrote:
>>>
 Hi,

 Using Spark 1.2 I ran into issued setting SPARK_LOCAL_DIRS to a
 different
 path then local directory.

 On our cluster we have a folder for temporary files (in a central file
 system), which is called /scratch.

 When setting SPARK_LOCAL_DIRS=/scratch/

 I get:

  An error occurred while calling
 z:org.apache.spark.api.python.PythonRDD.newAPIHadoopFile.
 : org.apache.spark.SparkException: Job aborted due to stage failure:
 Task 0
 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in
 stage 0.0
 (TID 3, XXX): java.io.IOException: Function not implemented
 at sun.nio.ch.FileDispatcherImpl.lock0(Native Method)
 at sun.nio.ch.FileDispatcherImpl.lock(FileDispatcherImpl.java:
 91)
 at sun.nio.ch.FileChannelImpl.lock(FileChannelImpl.java:1022)
 at java.nio.channels.FileChannel.lock(FileChannel.java:1052)
 at org.apache.spark.util.Utils$.fetchFile(Utils.scala:379)

 Using SPARK_LOCAL_DIRS=/tmp, however, works perfectly. Any idea?

 Best,
  Tassilo

 --
 View this message in context: http://apache-spark-user-list.
 1001560.n3.nabble.com/SPARK-LOCAL-DIRS-Issue-tp21602.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org

>>

Re: SPARK_LOCAL_DIRS Issue

And just glancing at the Spark source code around where the stack trace
originates:

val lockFile = new File(localDir, lockFileName)
  val raf = new RandomAccessFile(lockFile, "rw")
  // Only one executor entry.
  // The FileLock is only used to control synchronization for executors
download file,
  // it's always safe regardless of lock type (mandatory or advisory).
  val lock = raf.getChannel().lock()
  val cachedFile = new File(localDir, cachedFileName)
  try {
if (!cachedFile.exists()) {
  doFetchFile(url, localDir, cachedFileName, conf, securityMgr,
hadoopConf)
}
  } finally {
lock.release()
  }

I think Spark is making assumptions about the underlying file system that
isn't safe to make (locking? I don't know enough about POSIX to know
whether locking is part of the spec). Maybe file a bug report after someone
from the dev team chimes in on this issue.

On Wed Feb 11 2015 at 2:20:34 PM Charles Feduke 
wrote:

> Take a look at this:
>
> http://wiki.lustre.org/index.php/Running_Hadoop_with_Lustre
>
> Particularly: http://wiki.lustre.org/images/1/1b/Hadoop_wp_v0.4.2.pdf
> (linked from that article)
>
> to get a better idea of what your options are.
>
> If its possible to avoid writing to [any] disk I'd recommend that route,
> since that's the performance advantage Spark has over vanilla Hadoop.
>
> On Wed Feb 11 2015 at 2:10:36 PM Tassilo Klein  wrote:
>
>> Thanks for the info. The file system in use is a Lustre file system.
>>
>> Best,
>>  Tassilo
>>
>> On Wed, Feb 11, 2015 at 12:15 PM, Charles Feduke <
>> charles.fed...@gmail.com> wrote:
>>
>>> A central location, such as NFS?
>>>
>>> If they are temporary for the purpose of further job processing you'll
>>> want to keep them local to the node in the cluster, i.e., in /tmp. If they
>>> are centralized you won't be able to take advantage of data locality and
>>> the central file store will become a bottleneck for further processing.
>>>
>>> If /tmp isn't an option because you want to be able to monitor the file
>>> outputs as they occur you can also use HDFS (assuming your Spark nodes are
>>> also HDFS members they will benefit from data locality).
>>>
>>> It looks like the problem you are seeing is that a lock cannot be
>>> acquired on the output file in the central file system.
>>>
>>> On Wed Feb 11 2015 at 11:55:55 AM TJ Klein  wrote:
>>>
 Hi,

 Using Spark 1.2 I ran into issued setting SPARK_LOCAL_DIRS to a
 different
 path then local directory.

 On our cluster we have a folder for temporary files (in a central file
 system), which is called /scratch.

 When setting SPARK_LOCAL_DIRS=/scratch/

 I get:

  An error occurred while calling
 z:org.apache.spark.api.python.PythonRDD.newAPIHadoopFile.
 : org.apache.spark.SparkException: Job aborted due to stage failure:
 Task 0
 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in
 stage 0.0
 (TID 3, XXX): java.io.IOException: Function not implemented
 at sun.nio.ch.FileDispatcherImpl.lock0(Native Method)
 at sun.nio.ch.FileDispatcherImpl.lock(FileDispatcherImpl.java:9
 1)
 at sun.nio.ch.FileChannelImpl.lock(FileChannelImpl.java:1022)
 at java.nio.channels.FileChannel.lock(FileChannel.java:1052)
 at org.apache.spark.util.Utils$.fetchFile(Utils.scala:379)

 Using SPARK_LOCAL_DIRS=/tmp, however, works perfectly. Any idea?

 Best,
  Tassilo

 --
 View this message in context: http://apache-spark-user-list.
 1001560.n3.nabble.com/SPARK-LOCAL-DIRS-Issue-tp21602.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org

>>

Re: exception with json4s render

2015-02-11 Thread Jonathan Haddad

Actually, yes, I was using 3.2.11.  I thought I would need the UUID encoder
that seems to have been added in that version, but I'm not using it.  I've
downgraded to 3.2.10 and it seems to work.

I searched through the spark repo and it looks like it's got 3.2.10 in a
pom.  I don't know the first thing about how dependencies are resolved but
I'm guessing it's related?

On Wed Feb 11 2015 at 11:20:42 AM Mohnish Kodnani 
wrote:

> I was getting similar error after I upgraded to spark 1.2.1 from 1.1.1
> Are you by any chance using json4s 3.2.11.
> I downgraded to 3.2.10 and that seemed to have worked. But I didnt try to
> spend much time debugging the issue than that.
>
>
>
> On Wed, Feb 11, 2015 at 11:13 AM, Jonathan Haddad 
> wrote:
>
>> I'm trying to use the json4s library in a spark job to push data back
>> into kafka.  Everything was working fine when I was hard coding a string,
>> but now that I'm trying to render a string from a simple map it's failing.
>> The code works in sbt console.
>>
>> working console code:
>> https://gist.github.com/rustyrazorblade/daa50bf05ff0d48ac6af
>>
>> failing spark job line:
>> https://github.com/rustyrazorblade/killranalytics/blob/master/spark/src/main/scala/RawEventProcessing.scala#L114
>>
>> exception: https://gist.github.com/rustyrazorblade/1e220d87d41cfcad2bb9
>>
>> I've seen examples of using render / compact when I searched the ML
>> archives, so I'm kind of at a loss here.
>>
>> Thanks in advance for any help.
>>
>> Jon
>>
>
>

Re: SPARK_LOCAL_DIRS Issue

2015-02-11 Thread Tassilo Klein

Thanks. Yes, I think it might not always make sense to lock files,
particularly if every executor is getting its own path.

On Wed, Feb 11, 2015 at 2:31 PM, Charles Feduke 
wrote:

> And just glancing at the Spark source code around where the stack trace
> originates:
>
> val lockFile = new File(localDir, lockFileName)
>   val raf = new RandomAccessFile(lockFile, "rw")
>   // Only one executor entry.
>   // The FileLock is only used to control synchronization for
> executors download file,
>   // it's always safe regardless of lock type (mandatory or advisory).
>   val lock = raf.getChannel().lock()
>   val cachedFile = new File(localDir, cachedFileName)
>   try {
> if (!cachedFile.exists()) {
>   doFetchFile(url, localDir, cachedFileName, conf, securityMgr,
> hadoopConf)
> }
>   } finally {
> lock.release()
>   }
>
> I think Spark is making assumptions about the underlying file system that
> isn't safe to make (locking? I don't know enough about POSIX to know
> whether locking is part of the spec). Maybe file a bug report after someone
> from the dev team chimes in on this issue.
>
>
> On Wed Feb 11 2015 at 2:20:34 PM Charles Feduke 
> wrote:
>
>> Take a look at this:
>>
>> http://wiki.lustre.org/index.php/Running_Hadoop_with_Lustre
>>
>> Particularly: http://wiki.lustre.org/images/1/1b/Hadoop_wp_v0.4.2.pdf
>> (linked from that article)
>>
>> to get a better idea of what your options are.
>>
>> If its possible to avoid writing to [any] disk I'd recommend that route,
>> since that's the performance advantage Spark has over vanilla Hadoop.
>>
>> On Wed Feb 11 2015 at 2:10:36 PM Tassilo Klein  wrote:
>>
>>> Thanks for the info. The file system in use is a Lustre file system.
>>>
>>> Best,
>>>  Tassilo
>>>
>>> On Wed, Feb 11, 2015 at 12:15 PM, Charles Feduke <
>>> charles.fed...@gmail.com> wrote:
>>>
 A central location, such as NFS?

 If they are temporary for the purpose of further job processing you'll
 want to keep them local to the node in the cluster, i.e., in /tmp. If they
 are centralized you won't be able to take advantage of data locality and
 the central file store will become a bottleneck for further processing.

 If /tmp isn't an option because you want to be able to monitor the file
 outputs as they occur you can also use HDFS (assuming your Spark nodes are
 also HDFS members they will benefit from data locality).

 It looks like the problem you are seeing is that a lock cannot be
 acquired on the output file in the central file system.

 On Wed Feb 11 2015 at 11:55:55 AM TJ Klein  wrote:

> Hi,
>
> Using Spark 1.2 I ran into issued setting SPARK_LOCAL_DIRS to a
> different
> path then local directory.
>
> On our cluster we have a folder for temporary files (in a central file
> system), which is called /scratch.
>
> When setting SPARK_LOCAL_DIRS=/scratch/
>
> I get:
>
>  An error occurred while calling
> z:org.apache.spark.api.python.PythonRDD.newAPIHadoopFile.
> : org.apache.spark.SparkException: Job aborted due to stage failure:
> Task 0
> in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in
> stage 0.0
> (TID 3, XXX): java.io.IOException: Function not implemented
> at sun.nio.ch.FileDispatcherImpl.lock0(Native Method)
> at sun.nio.ch.FileDispatcherImpl.lock(FileDispatcherImpl.java:
> 91)
> at sun.nio.ch.FileChannelImpl.lock(FileChannelImpl.java:1022)
> at java.nio.channels.FileChannel.lock(FileChannel.java:1052)
> at org.apache.spark.util.Utils$.fetchFile(Utils.scala:379)
>
> Using SPARK_LOCAL_DIRS=/tmp, however, works perfectly. Any idea?
>
> Best,
>  Tassilo
>
>
>
>
>
> --
> View this message in context: http://apache-spark-user-list.
> 1001560.n3.nabble.com/SPARK-LOCAL-DIRS-Issue-tp21602.html
> Sent from the Apache Spark User List mailing list archive at
> Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>
>>>

Re: exception with json4s render

2015-02-11 Thread Mohnish Kodnani

Same here.. I am a newbie to all this as well.
But this is just what I found and I lack the expertise to figure out why
things dont work in 3.2.11 json4s.
May be some one in the group with more expertise can take a crack at it.
But this is what unblocked me from moving forward.


On Wed, Feb 11, 2015 at 11:31 AM, Jonathan Haddad  wrote:

> Actually, yes, I was using 3.2.11.  I thought I would need the UUID
> encoder that seems to have been added in that version, but I'm not using
> it.  I've downgraded to 3.2.10 and it seems to work.
>
> I searched through the spark repo and it looks like it's got 3.2.10 in a
> pom.  I don't know the first thing about how dependencies are resolved but
> I'm guessing it's related?
>
> On Wed Feb 11 2015 at 11:20:42 AM Mohnish Kodnani <
> mohnish.kodn...@gmail.com> wrote:
>
>> I was getting similar error after I upgraded to spark 1.2.1 from 1.1.1
>> Are you by any chance using json4s 3.2.11.
>> I downgraded to 3.2.10 and that seemed to have worked. But I didnt try to
>> spend much time debugging the issue than that.
>>
>>
>>
>> On Wed, Feb 11, 2015 at 11:13 AM, Jonathan Haddad 
>> wrote:
>>
>>> I'm trying to use the json4s library in a spark job to push data back
>>> into kafka.  Everything was working fine when I was hard coding a string,
>>> but now that I'm trying to render a string from a simple map it's failing.
>>> The code works in sbt console.
>>>
>>> working console code:
>>> https://gist.github.com/rustyrazorblade/daa50bf05ff0d48ac6af
>>>
>>> failing spark job line:
>>> https://github.com/rustyrazorblade/killranalytics/blob/master/spark/src/main/scala/RawEventProcessing.scala#L114
>>>
>>> exception: https://gist.github.com/rustyrazorblade/1e220d87d41cfcad2bb9
>>>
>>> I've seen examples of using render / compact when I searched the ML
>>> archives, so I'm kind of at a loss here.
>>>
>>> Thanks in advance for any help.
>>>
>>> Jon
>>>
>>
>>

Re: exception with json4s render

I was having a similar problem to this trying to use the Scala Jackson
module yesterday. I tried setting `spark.files.userClassPathFirst` to true
but I was still having problems due to the older version of Jackson that
Spark has a dependency on. (I think its an old org.codehaus version.)

I ended up solving my problem by using Spray JSON (
https://github.com/spray/spray-json) which has no dependency on Jackson and
has great control over the JSON rendering process.

http://engineering.ooyala.com/blog/comparing-scala-json-libraries - based
on that I looked for something that didn't rely on Jackson.

Now that I see that there is some success with json4s on Spark 1.2.x I'll
have to give that a try.

On Wed Feb 11 2015 at 2:32:59 PM Jonathan Haddad  wrote:

> Actually, yes, I was using 3.2.11.  I thought I would need the UUID
> encoder that seems to have been added in that version, but I'm not using
> it.  I've downgraded to 3.2.10 and it seems to work.
>
> I searched through the spark repo and it looks like it's got 3.2.10 in a
> pom.  I don't know the first thing about how dependencies are resolved but
> I'm guessing it's related?
>
> On Wed Feb 11 2015 at 11:20:42 AM Mohnish Kodnani <
> mohnish.kodn...@gmail.com> wrote:
>
>> I was getting similar error after I upgraded to spark 1.2.1 from 1.1.1
>> Are you by any chance using json4s 3.2.11.
>> I downgraded to 3.2.10 and that seemed to have worked. But I didnt try to
>> spend much time debugging the issue than that.
>>
>>
>>
>> On Wed, Feb 11, 2015 at 11:13 AM, Jonathan Haddad 
>> wrote:
>>
>>> I'm trying to use the json4s library in a spark job to push data back
>>> into kafka.  Everything was working fine when I was hard coding a string,
>>> but now that I'm trying to render a string from a simple map it's failing.
>>> The code works in sbt console.
>>>
>>> working console code:
>>> https://gist.github.com/rustyrazorblade/daa50bf05ff0d48ac6af
>>>
>>> failing spark job line:
>>> https://github.com/rustyrazorblade/killranalytics/blob/master/spark/src/main/scala/RawEventProcessing.scala#L114
>>>
>>> exception: https://gist.github.com/rustyrazorblade/1e220d87d41cfcad2bb9
>>>
>>> I've seen examples of using render / compact when I searched the ML
>>> archives, so I'm kind of at a loss here.
>>>
>>> Thanks in advance for any help.
>>>
>>> Jon
>>>
>>
>>

RE: SparkSQL + Tableau Connector

I'm using mysql as the metastore DB with Spark 1.2.I simply copy the 
hive-site.xml to /etc/spark/ and added the mysql JDBC JAR to spark-env.sh in 
/etc/spark/, everything works fine now.
My setup looks like this.
Tableau => Spark ThriftServer2 => HiveServer2
It's talking to Tableau Desktop 8.3. Interestingly, when I query a Hive table, 
it still invokes Hive queries to HiveServer2 which is running MR or Tez engine. 
 Is this expected?  
I thought it should at least use the catalyst engine and talk to the underlying 
HDFS like what HiveContext API does to pull in the data into RDD.  Did I 
misunderstood the purpose of Spark ThriftServer2?


Date: Wed, 11 Feb 2015 16:07:40 +0530
Subject: Re: SparkSQL + Tableau Connector
From: ar...@sigmoidanalytics.com
To: tsind...@gmail.com
CC: user@spark.apache.org

Hi
I used this, though its using a embedded driver and is not a good approch.It 
works. You can configure for some other metastore type also. I have not tried 
the metastore uri's.















  javax.jdo.option.ConnectionURL

  
jdbc:derby:;databaseName=/opt/bigdata/spark-1.2.0/metastore_db;create=true

  URL for the DB








  javax.jdo.option.ConnectionDriverName

  org.apache.derby.jdbc.EmbeddedDriver










On Wed, Feb 11, 2015 at 3:59 PM, Todd Nist  wrote:
Hi Arush,
So yes I want to create the tables through Spark SQL.  I have placed the 
hive-site.xml file inside of the $SPARK_HOME/conf directory I thought that was 
all I should need to do to have the thriftserver use it.  Perhaps my 
hive-site.xml is worng, it currently looks like this:

  hive.metastore.uris
thrift://sandbox.hortonworks.com:9083  URI for 
client to contact metastore server
Which leads me to believe it is going to pull form the thriftserver from 
Horton?  I will go look at the docs to see if this is right, it is what Horton 
says to do.  Do you have an example hive-site.xml by chance that works with 
Spark SQL?
I am using 8.3 of tableau with the SparkSQL Connector.
Thanks for the assistance.
-Todd
On Wed, Feb 11, 2015 at 2:34 AM, Arush Kharbanda  
wrote:
BTW what tableau connector are you using?
On Wed, Feb 11, 2015 at 12:55 PM, Arush Kharbanda  
wrote:
 I am a little confused here, why do you want to create the tables in hive. You 
want to create the tables in spark-sql, right?
If you are not able to find the same tables through tableau then thrift is 
connecting to a diffrent metastore than your spark-shell.
One way to specify a metstore to thrift is to provide the path to hive-site.xml 
while starting thrift using --files hive-site.xml.
similarly you can specify the same metastore to your spark-submit or 
sharp-shell using the same option.



On Wed, Feb 11, 2015 at 5:23 AM, Todd Nist  wrote:
Arush,
As for #2 do you mean something like this from the docs:

// sc is an existing SparkContext.
val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)

sqlContext.sql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)")
sqlContext.sql("LOAD DATA LOCAL INPATH 'examples/src/main/resources/kv1.txt' 
INTO TABLE src")

// Queries are expressed in HiveQL
sqlContext.sql("FROM src SELECT key, value").collect().foreach(println)Or did 
you have something else in mind?
-Todd

On Tue, Feb 10, 2015 at 6:35 PM, Todd Nist  wrote:
Arush,
Thank you will take a look at that approach in the morning.  I sort of figured 
the answer to #1 was NO and that I would need to do 2 and 3 thanks for 
clarifying it for me.
-Todd
On Tue, Feb 10, 2015 at 5:24 PM, Arush Kharbanda  
wrote:
1.  Can the connector fetch or query schemaRDD's saved to Parquet or JSON 
files? NO

2.  Do I need to do something to expose these via hive / metastore other than 
creating a table in hive? Create a table in spark sql to expose via spark sql

3.  Does the thriftserver need to be configured to expose these in some 
fashion, sort of related to question 2 you would need to configure thrift to 
read from the metastore you expect it read from - by default it reads from 
metastore_db directory present in the directory used to launch the thrift 
server.


On 11 Feb 2015 01:35, "Todd Nist"  wrote:
Hi,
I'm trying to understand how and what the Tableau connector to SparkSQL is able 
to access.  My understanding is it needs to connect to the thriftserver and I 
am not sure how or if it exposes parquet, json, schemaRDDs, or does it only 
expose schemas defined in the metastore / hive.  
For example, I do the following from the spark-shell which generates a 
schemaRDD from a csv file and saves it as a JSON file as well as a parquet file.
import org.apache.sql.SQLContext
import com.databricks.spark.csv._

val sqlContext = new SQLContext(sc)
val test = sqlContext.csfFile("/data/test.csv")
test.toJSON().saveAsTextFile("/data/out")
test.saveAsParquetFile("/data/out")









When I connect from Tableau, the only thing I see is the "default" schema and 
nothing in the tables section.
So my questions are:

1.  Can the connector fetch or query schemaRDD's saved to Parquet or JSON fi

Re: Mesos coarse mode not working (fine grained does)

2015-02-11 Thread Hans van den Bogert

Bumping 1on1 conversation to mailinglist:

On 10 Feb 2015, at 13:24, Hans van den Bogert  wrote:

> 
> It’s self built, I can’t otherwise as I can’t install packages on the cluster 
> here.
> 
> The problem seems with libtool. When compiling Mesos on a host with apr-devel 
> and apr-util-devel the shared libraries are named libapr*.so without prefix 
> (the ones with prefix are also installed of course). On our compute nodes no 
> *devel packages are installed, just the binary package, which have are named 
> libapr*.so.0 . But even the “make install”-ed binaries still refer to the 
> devel-packages’ shared library. I’m not sure if this is intended behaviour by 
> libtool, because it is the one changing at start/runtime the binaries’ RPATH 
> (which are initially well defined) to the  libapr*.so. 
> 
> But this is probably autoconf fu, just hoping someone here has dealt with the 
> same issue.
> 
> On 09 Feb 2015, at 20:37, Tim Chen  wrote:
> 
>> I'm still trying to grasp what your environment setup is like, it's odd to 
>> see a g++ stderr when you running mesos.
>> 
>> Are you building Mesos yourself and running it, or you've installed it 
>> through some package?
>> 
>> Tim
>> 
>> 
>> 
>> On Mon, Feb 9, 2015 at 1:03 AM, Hans van den Bogert  
>> wrote:
>> Okay, I was kind of ambiguous, I assume you mean this one:
>> 
>> [vdbogert@node002 ~]$ cat 
>> /local/vdbogert/var/lib/mesos/slaves/20150206-110658-16813322-5050-5515-S0/frameworks/20150208-200943-16813322-5050-26370-/executors/3/runs/latest/stdout
>> [vdbogert@node002 ~]$
>> 
>> it’s empty.
>> 
>> On 09 Feb 2015, at 06:22, Tim Chen  wrote:
>> 
>>> Hi Hans,
>>> 
>>> I was referring to the stdout/stderr of the task, not the slave.
>>> 
>>> Tim
>>> 
>>> On Sun, Feb 8, 2015 at 1:21 PM, Hans van den Bogert  
>>> wrote:
>>> 
>>> 
>>> 
>>> > Hi there,
>>> >
>>> > It looks like while trying to launch the executor (or one of the process 
>>> > like the fetcher to fetch the uris)
>>> The fetching seems to have succeeded as well as extracting, as the 
>>> “spark-1.2.0-bin-hadoop2.4” directory exists in the slave sandbox. 
>>> Furthermore, it seems the executor URI is superfluous in my environment as 
>>> I’ve checked the code, and if an URI is not provided, the task will not 
>>> refer to an extracted distro, but to a directory with the same path as the 
>>> current  spark distro, which makes sense in a cluster environment where 
>>> data is on a network-shared disk. I’ve tried *not* supplying an 
>>> spark.executor.uri and fine-grained mode still works fine. Coarse-grained 
>>> mode still  fails with the same libapr* errors.
>>> 
>>> > was failing because of the dependencies problem you see. Your mesos-slave 
>>> > shouldn't be able to run though, were you running 0.20.0 slave and 
>>> > upgraded to 0.21.0? We introduced the dependencies for libapr and libsvn 
>>> > for Mesos 0.21.0.
>>> I’ve only ever tried compiling 0.21.0. I’ve checked all binaries in 
>>> MESOS_HOME/build/src/.libs with ‘ldd’ and all are referring to a correct 
>>> existing libapr*-1.so.0 (mind the trailing “.0”).
>>> 
>>> > What's the stdout for the task like?
>>> 
>>> > > Mesos slaves' stdout are empty.
>>> 
>>> 
>>> It’s a pity spark’s logging in this case is pretty marginal, as is mesos’. 
>>> One can’t log the (raw) task-descriptions as far as I can see, which would 
>>> be very helpful in this case.
>>> I could resort to building spark from source as well and add some logging, 
>>> but I’m afraid I will introduce other peculiarities. Do you think it’s my 
>>> only option?
>>> 
>>> Thanks,
>>> 
>>> H.
>>> 
>>> > Tim
>>> >
>>> >
>>> >
>>> >
>>> > On Mon, Feb 9, 2015 at 4:10 AM, Hans van den Bogert 
>>> >  wrote:
>>> > I wasn’t thorough, the complete stderr includes:
>>> >
>>> > g++: /usr/lib64/libaprutil-1.so: No such file or directory
>>> > g++: /usr/lib64/libapr-1.so: No such file or directoryn
>>> > (including that trailing ’n')
>>> >
>>> > Though I can’t figure out how the process indirection is going from the 
>>> > frontend spark application to mesos executors and where this shared 
>>> > library error comes from.
>>> >
>>> > Hope someone can shed some light,
>>> >
>>> > Thanks
>>> >
>>> > On 08 Feb 2015, at 14:15, Hans van den Bogert  
>>> > wrote:
>>> >
>>> > > Hi,
>>> > >
>>> > >
>>> > > I’m trying to get coarse mode to work under mesos(0.21.0), I thought 
>>> > > this would be a trivial change as Mesos was working well in 
>>> > > fine-grained mode.
>>> > >
>>> > > However the mesos tasks fail, I can’t pinpoint where things go wrong.
>>> > >
>>> > > This is a mesos stderr log from a slave:
>>> > >
>>> > >Fetching URI 'http://upperpaste.com/spark-1.2.0-bin-hadoop2.4.tgz'
>>> > >I0208 12:57:45.415575 25720 fetcher.cpp:126] Downloading 
>>> > > 'http://upperpaste.com/spark-1.2.0-bin-hadoop2.4.tgz' to 
>>> > > '/local/vdbogert/var/lib/mesos//slaves/20150206-110658-16813322-5050-5515-S1/frameworks/20150208-125721-906005770-5050-32371-/executors/0/run

RE: Is the Thrift server right for me?

I have ThriftServer2 up and running, however, I notice that it relays the query 
to HiveServer2 when I pass the hive-site.xml to it.
I'm not sure if this is the expected behavior, but based on what I have up and 
running, the ThriftServer2 invokes HiveServer2 that results in MapReduce or Tez 
query. In this case, I could just connect directly to HiveServer2 if Hive is 
all you need.
If you are programmer and want to mash up data from Hive with other tables and 
data in Spark, then Spark ThriftServer2 seems to be a good integration point at 
some use case.
Please correct me if I misunderstood the purpose of Spark ThriftServer2.

> Date: Thu, 8 Jan 2015 14:49:00 -0700
> From: sjbru...@uwaterloo.ca
> To: user@spark.apache.org
> Subject: Is the Thrift server right for me?
> 
> I'm building a system that collects data using Spark Streaming, does some
> processing with it, then saves the data. I want the data to be queried by
> multiple applications, and it sounds like the Thrift JDBC/ODBC server might
> be the right tool to handle the queries. However,  the documentation for the
> Thrift server
> 
>   
> seems to be written for Hive users who are moving to Spark. I never used
> Hive before I started using Spark, so it is not clear to me how best to use
> this.
> 
> I've tried putting data into Hive, then serving it with the Thrift server.
> But I have not been able to update the data in Hive without first shutting
> down the server. This is a problem because new data is always being streamed
> in, and so the data must continuously be updated.
> 
> The system I'm building is supposed to replace a system that stores the data
> in MongoDB. The dataset has now grown so large that the database index does
> not fit in memory, which causes major performance problems in MongoDB.
> 
> If the Thrift server is the right tool for me, how can I set it up for my
> application? If it is not the right tool, what else can I use?
> 
> 
> 
> --
> View this message in context: 
> http://apache-spark-user-list.1001560.n3.nabble.com/Is-the-Thrift-server-right-for-me-tp21044.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
> 
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>

Re: iteratively modifying an RDD

Aha great! Thanks for the clarification!
On Feb 11, 2015 8:11 PM, "Davies Liu"  wrote:

> On Wed, Feb 11, 2015 at 10:47 AM, rok  wrote:
> > I was having trouble with memory exceptions when broadcasting a large
> lookup
> > table, so I've resorted to processing it iteratively -- but how can I
> modify
> > an RDD iteratively?
> >
> > I'm trying something like :
> >
> > rdd = sc.parallelize(...)
> > lookup_tables = {...}
> >
> > for lookup_table in lookup_tables :
> > rdd = rdd.map(lambda x: func(x, lookup_table))
> >
> > If I leave it as is, then only the last "lookup_table" is applied
> instead of
> > stringing together all the maps. However, if add a .cache() to the .map
> then
> > it seems to work fine.
>
> This is the something related to Python closure implementation, you should
> do it like this:
>
> def create_func(lookup_table):
>  return lambda x: func(x, lookup_table)
>
> for lookup_table in lookup_tables:
> rdd = rdd.map(create_func(lookup_table))
>
> The Python closure just remember the variable, not copy the value of it.
> In the loop, `lookup_table` is the same variable. When we serialize the
> final
> rdd, all the closures are referring to the same `lookup_table`, which
> points
> to the last value.
>
> When we create the closure in a function, Python create a variable for
> each closure, so it works.
>
> > A second problem is that the runtime for each iteration roughly doubles
> at
> > each iteration so this clearly doesn't seem to be the way to do it. What
> is
> > the preferred way of doing such repeated modifications to an RDD and how
> can
> > the accumulation of overhead be minimized?
> >
> > Thanks!
> >
> > Rok
> >
> >
> >
> > --
> > View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/iteratively-modifying-an-RDD-tp21606.html
> > Sent from the Apache Spark User List mailing list archive at Nabble.com.
> >
> > -
> > To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> > For additional commands, e-mail: user-h...@spark.apache.org
> >
>

RE: Is the Thrift server right for me?

2015-02-11 Thread Judy Nash

It should relay the queries to spark (i.e. you shouldn't see any MR job on 
Hadoop & you should see activities on the spark app on headnode UI).

Check your hive-site.xml. Are you directing to the hive server 2 port instead 
of spark thrift port?
Their default ports are both 1.

From: Andrew Lee [mailto:alee...@hotmail.com]
Sent: Wednesday, February 11, 2015 12:00 PM
To: sjbrunst; user@spark.apache.org
Subject: RE: Is the Thrift server right for me?

I have ThriftServer2 up and running, however, I notice that it relays the query 
to HiveServer2 when I pass the hive-site.xml to it.

I'm not sure if this is the expected behavior, but based on what I have up and 
running, the ThriftServer2 invokes HiveServer2 that results in MapReduce or Tez 
query. In this case, I could just connect directly to HiveServer2 if Hive is 
all you need.

If you are programmer and want to mash up data from Hive with other tables and 
data in Spark, then Spark ThriftServer2 seems to be a good integration point at 
some use case.

Please correct me if I misunderstood the purpose of Spark ThriftServer2.

> Date: Thu, 8 Jan 2015 14:49:00 -0700
> From: sjbru...@uwaterloo.ca
> To: user@spark.apache.org
> Subject: Is the Thrift server right for me?
>
> I'm building a system that collects data using Spark Streaming, does some
> processing with it, then saves the data. I want the data to be queried by
> multiple applications, and it sounds like the Thrift JDBC/ODBC server might
> be the right tool to handle the queries. However, the documentation for the
> Thrift server
> 
> seems to be written for Hive users who are moving to Spark. I never used
> Hive before I started using Spark, so it is not clear to me how best to use
> this.
>
> I've tried putting data into Hive, then serving it with the Thrift server.
> But I have not been able to update the data in Hive without first shutting
> down the server. This is a problem because new data is always being streamed
> in, and so the data must continuously be updated.
>
> The system I'm building is supposed to replace a system that stores the data
> in MongoDB. The dataset has now grown so large that the database index does
> not fit in memory, which causes major performance problems in MongoDB.
>
> If the Thrift server is the right tool for me, how can I set it up for my
> application? If it is not the right tool, what else can I use?
>
>
>
> --
> View this message in context: 
> http://apache-spark-user-list.1001560.n3.nabble.com/Is-the-Thrift-server-right-for-me-tp21044.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: 
> user-unsubscr...@spark.apache.org
> For additional commands, e-mail: 
> user-h...@spark.apache.org
>

RE: Is the Thrift server right for me?

Thanks Judy.
You are right. The query is going to Spark ThriftServer2. I have it setup on a 
different port number.
I got the wrong perception b/c there were other jobs running at the same time. 
It should be Spark jobs instead of Hive jobs.
From: judyn...@exchange.microsoft.com
To: alee...@hotmail.com; sjbru...@uwaterloo.ca; user@spark.apache.org
Subject: RE: Is the Thrift server right for me?
Date: Wed, 11 Feb 2015 20:12:03 +

It should relay the queries to spark (i.e. you shouldn’t see any MR job on 
Hadoop & you should see activities on the spark app on headnode UI).

Check your hive-site.xml. Are you directing to the hive server 2 port instead 
of spark thrift port?

Their default ports are both 1.

From: Andrew Lee [mailto:alee...@hotmail.com]

Sent: Wednesday, February 11, 2015 12:00 PM

To: sjbrunst; user@spark.apache.org

Subject: RE: Is the Thrift server right for me?

I have ThriftServer2 up and running, however, I notice that it relays the query 
to HiveServer2 when I pass the hive-site.xml to it.

I'm not sure if this is the expected behavior, but based on what I have up and 
running, the ThriftServer2 invokes HiveServer2 that results in MapReduce or Tez 
query. In this case, I could
 just connect directly to HiveServer2 if Hive is all you need.

If you are programmer and want to mash up data from Hive with other tables and 
data in Spark, then Spark ThriftServer2 seems to be a good integration point at 
some use case.

Please correct me if I misunderstood the purpose of Spark ThriftServer2.

> Date: Thu, 8 Jan 2015 14:49:00 -0700

> From: sjbru...@uwaterloo.ca

> To: user@spark.apache.org

> Subject: Is the Thrift server right for me?

> 

> I'm building a system that collects data using Spark Streaming, does some

> processing with it, then saves the data. I want the data to be queried by

> multiple applications, and it sounds like the Thrift JDBC/ODBC server might

> be the right tool to handle the queries. However, the documentation for the

> Thrift server

> 

> seems to be written for Hive users who are moving to Spark. I never used

> Hive before I started using Spark, so it is not clear to me how best to use

> this.

> 

> I've tried putting data into Hive, then serving it with the Thrift server.

> But I have not been able to update the data in Hive without first shutting

> down the server. This is a problem because new data is always being streamed

> in, and so the data must continuously be updated.

> 

> The system I'm building is supposed to replace a system that stores the data

> in MongoDB. The dataset has now grown so large that the database index does

> not fit in memory, which causes major performance problems in MongoDB.

> 

> If the Thrift server is the right tool for me, how can I set it up for my

> application? If it is not the right tool, what else can I use?

> 

> 

> 

> --

> View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Is-the-Thrift-server-right-for-me-tp21044.html

> Sent from the Apache Spark User List mailing list archive at Nabble.com.

> 

> -

> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org

> For additional commands, e-mail: user-h...@spark.apache.org

>

RE: SparkSQL + Tableau Connector

Sorry folks, it is executing Spark jobs instead of Hive jobs. I mis-read the 
logs since there were other activities going on on the cluster.

From: alee...@hotmail.com
To: ar...@sigmoidanalytics.com; tsind...@gmail.com
CC: user@spark.apache.org
Subject: RE: SparkSQL + Tableau Connector
Date: Wed, 11 Feb 2015 11:56:44 -0800

I'm using mysql as the metastore DB with Spark 1.2.I simply copy the 
hive-site.xml to /etc/spark/ and added the mysql JDBC JAR to spark-env.sh in 
/etc/spark/, everything works fine now.
My setup looks like this.
Tableau => Spark ThriftServer2 => HiveServer2
It's talking to Tableau Desktop 8.3. Interestingly, when I query a Hive table, 
it still invokes Hive queries to HiveServer2 which is running MR or Tez engine. 
 Is this expected?  
I thought it should at least use the catalyst engine and talk to the underlying 
HDFS like what HiveContext API does to pull in the data into RDD.  Did I 
misunderstood the purpose of Spark ThriftServer2?

Date: Wed, 11 Feb 2015 16:07:40 +0530
Subject: Re: SparkSQL + Tableau Connector
From: ar...@sigmoidanalytics.com
To: tsind...@gmail.com
CC: user@spark.apache.org

Hi
I used this, though its using a embedded driver and is not a good approch.It 
works. You can configure for some other metastore type also. I have not tried 
the metastore uri's.

  javax.jdo.option.ConnectionURL

jdbc:derby:;databaseName=/opt/bigdata/spark-1.2.0/metastore_db;create=true

  URL for the DB

  javax.jdo.option.ConnectionDriverName

  org.apache.derby.jdbc.EmbeddedDriver

On Wed, Feb 11, 2015 at 3:59 PM, Todd Nist  wrote:
Hi Arush,
So yes I want to create the tables through Spark SQL.  I have placed the 
hive-site.xml file inside of the $SPARK_HOME/conf directory I thought that was 
all I should need to do to have the thriftserver use it.  Perhaps my 
hive-site.xml is worng, it currently looks like this:

  hive.metastore.uris
thrift://sandbox.hortonworks.com:9083  URI for 
client to contact metastore server
Which leads me to believe it is going to pull form the thriftserver from 
Horton?  I will go look at the docs to see if this is right, it is what Horton 
says to do.  Do you have an example hive-site.xml by chance that works with 
Spark SQL?
I am using 8.3 of tableau with the SparkSQL Connector.
Thanks for the assistance.
-Todd
On Wed, Feb 11, 2015 at 2:34 AM, Arush Kharbanda  
wrote:
BTW what tableau connector are you using?
On Wed, Feb 11, 2015 at 12:55 PM, Arush Kharbanda  
wrote:
 I am a little confused here, why do you want to create the tables in hive. You 
want to create the tables in spark-sql, right?
If you are not able to find the same tables through tableau then thrift is 
connecting to a diffrent metastore than your spark-shell.
One way to specify a metstore to thrift is to provide the path to hive-site.xml 
while starting thrift using --files hive-site.xml.
similarly you can specify the same metastore to your spark-submit or 
sharp-shell using the same option.

On Wed, Feb 11, 2015 at 5:23 AM, Todd Nist  wrote:
Arush,
As for #2 do you mean something like this from the docs:

// sc is an existing SparkContext.
val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)

sqlContext.sql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)")
sqlContext.sql("LOAD DATA LOCAL INPATH 'examples/src/main/resources/kv1.txt' 
INTO TABLE src")

// Queries are expressed in HiveQL
sqlContext.sql("FROM src SELECT key, value").collect().foreach(println)Or did 
you have something else in mind?
-Todd

On Tue, Feb 10, 2015 at 6:35 PM, Todd Nist  wrote:
Arush,
Thank you will take a look at that approach in the morning.  I sort of figured 
the answer to #1 was NO and that I would need to do 2 and 3 thanks for 
clarifying it for me.
-Todd
On Tue, Feb 10, 2015 at 5:24 PM, Arush Kharbanda  
wrote:
1.  Can the connector fetch or query schemaRDD's saved to Parquet or JSON 
files? NO

2.  Do I need to do something to expose these via hive / metastore other than 
creating a table in hive? Create a table in spark sql to expose via spark sql

3.  Does the thriftserver need to be configured to expose these in some 
fashion, sort of related to question 2 you would need to configure thrift to 
read from the metastore you expect it read from - by default it reads from 
metastore_db directory present in the directory used to launch the thrift 
server.

On 11 Feb 2015 01:35, "Todd Nist"  wrote:
Hi,
I'm trying to understand how and what the Tableau connector to SparkSQL is able 
to access.  My understanding is it needs to connect to the thriftserver and I 
am not sure how or if it exposes parquet, json, schemaRDDs, or does it only 
expose schemas defined in the metastore / hive.  
For example, I do the following from the spark-shell which generates a 
schemaRDD from a csv file and saves it as a JSON file as well as a parquet file.
import org.apache.sql.SQLContext
import com.databricks.spark.csv._

val sqlContext = new SQLContext(sc)
val test = sq

Re: akka.remote.transport.Transport$InvalidAssociationException: The remote system terminated the association because it is shutting down

2015-02-11 Thread Lan

Hi Alexey and Daniel,

I'm using Spark 1.2.0 and still having the same error, as described below.

Do you have any news on this? Really appreciate your responses!!!

"a Spark cluster of 1 master VM SparkV1 and 1 worker VM SparkV4 (the error
is the same if I have 2 workers). They are connected without a problem now.
But when I submit a job (as in
https://spark.apache.org/docs/latest/quick-start.html) at the master: 

>spark-submit --master spark://SparkV1:7077 examples/src/main/python/pi.py 

it seems to run ok and returns "Pi is roughly...", but the worker has the
following Error: 

15/02/07 15:22:33 ERROR EndpointWriter: AssociationError
[akka.tcp://sparkWorker@SparkV4:47986] <-
[akka.tcp://sparkExecutor@SparkV4:46630]: Error [Shut down address:
akka.tcp://sparkExecutor@SparkV4:46630] [ 
akka.remote.ShutDownAssociation: Shut down address:
akka.tcp://sparkExecutor@SparkV4:46630 
Caused by: akka.remote.transport.Transport$InvalidAssociationException: The
remote system terminated the association because it is shutting down. 
] 

More about the setup: each VM has only 4GB RAM, running Ubuntu, using
spark-1.2.0, built for Hadoop 2.6.0 or 2.4.0. "




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/akka-remote-transport-Transport-InvalidAssociationException-The-remote-system-terminated-the-associan-tp20071p21607.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org

No executors allocated on yarn with latest master branch

2015-02-11 Thread Anders Arpteg

Hi,

Compiled the latest master of Spark yesterday (2015-02-10) for Hadoop 2.2
and failed executing jobs in yarn-cluster mode for that build. Works
successfully with spark 1.2 (and also master from 2015-01-16), so something
has changed since then that prevents the job from receiving any executors
on the cluster.

Basic symptoms are that the jobs fires up the AM, but after examining the
"executors" page in the web ui, only the driver is listed, no executors are
ever received, and the driver keep waiting forever. Has anyone seemed
similar problems?

Thanks for any insights,
Anders

Re: Bug in ElasticSearch and Spark SQL: Using SQL to query out data from JSON documents is totally wrong!

2015-02-11 Thread Aris

Thank you Costin. I wrote out to the user list, I got no replies there.

I will take this exact message and put it on the Github bug tracking system.

One quick clarification: I read the elasticsearch documentation thoroughly,
and I saw the warning about structured data vs. unstructured data, but it's
a bit confusing. Normally, when SparkSQL sees a lot of JSON documents of
the "same schema", but perhaps a few instances of those documents might be
missing a key X, a query which SELECT's X will just come back as NULL for
the JSON documents which don't have that key X.

Are you saying this behavior does not happen when I connect to
ElasticSearch? Every single JSON document must contain every single key, or
else the application crashes? So If one single JSON document is missing key
X from the elasticsearch data, the application throws an Exception?

Thank you!
Aris



On Wed, Feb 11, 2015 at 1:21 AM, Costin Leau 
wrote:

> Aris, if you encountered a bug, it's best to raise an issue with the
> es-hadoop/spark project, namely here [1].
>
> When using SparkSQL the underlying data needs to be present - this is
> mentioned in the docs as well [2]. As for the order,
> that does look like a bug and shouldn't occur. Note the reason why the
> keys are re-arranged is in JSON, the object/map doesn't
> guarantee order. To give some predictability, the keys are arranged
> alphabetically.
>
> I suggest continuing this discussion in the issue issue tracker mentioned
> at [1].
>
> [1] https://github.com/elasticsearch/elasticsearch-hadoop/issues
> [2] http://www.elasticsearch.org/guide/en/elasticsearch/hadoop/
> master/spark.html#spark-sql
>
>
> On 2/11/15 3:17 AM, Aris wrote:
>
>> I'm using ElasticSearch with elasticsearch-spark-BUILD-SNAPSHOT and
>> Spark/SparkSQL 1.2.0, from Costin Leau's advice.
>>
>> I want to query ElasticSearch for a bunch of JSON documents from within
>> SparkSQL, and then use a SQL query to simply
>> query for a column, which is actually a JSON key -- normal things that
>> SparkSQL does using the
>> SQLContext.jsonFile(filePath) facility. The difference I am using the
>> ElasticSearch container.
>>
>> The big problem: when I do something like
>>
>> SELECT jsonKeyA from tempTable;
>>
>> I actually get the WRONG KEY out of the JSON documents! I discovered that
>> if I have JSON keys physically in the order D,
>> C, B, A in the json documents, the elastic search connector discovers
>> those keys BUT then sorts them alphabetically as
>> A,B,C,D - so when I SELECT A from tempTable, I actually get column D
>> (because the physical JSONs had key D in the first
>> position). This only happens when reading from elasticsearch and SparkSQL.
>>
>> It gets much worse: When a key is missing from one of the documents and
>> that key should be NULL, the whole application
>> actually crashes and gives me a java.lang.IndexOutOfBoundsException --
>> the schema that is inferred is totally screwed up.
>>
>> In the above example with physical JSONs containing keys in the order
>> D,C,B,A, if one of the JSON documents is missing
>> the key/column I am querying for, I get that 
>> java.lang.IndexOutOfBoundsException
>> exception.
>>
>> I am using the BUILD-SNAPSHOT because otherwise I couldn't build the
>> elasticsearch-spark project, Costin said so.
>>
>> Any clues here? Any fixes?
>>
>>
>>
>>
> --
> Costin
>

A spark join and groupbykey that is making my containers on EC2 go over their memory limits

Hello,

I have many questions about joins, but arguably just one.

specifically about memory and containers that are overstepping their limits, as
per errors dotted around all over the place, but something like:
http://mail-archives.apache.org/mod_mbox/spark-issues/201405.mbox/%3CJIRA.12716648.1401112206043.18368.1401118322196@arcas%3E

I have a job (mostly like this link: http://hastebin.com/quwamoreko.scala
, but with a write-to-files-based-on-keys
thing at the end) that is doing a join between a medium sized (like, 150,000
entries, classRDD in the link) RDD to a larger (108 million entry, objRDD in
the link) RDD…

the keys and values for each entry are quite small. In the linked join most
objects will have 10 or so classes and most classes 100k associated objects.
Though a few (10 or so?) classes will have millions of objects and some objects
hundreds of classes.

The issue i'm having is that (on an m2.xlarge ec2 instance) my container is
overstepping the memory limits and being shut down

This confuses me and makes me question my fundamental understanding of joins.

I thought joins were a reduce operation that happened on disk. Further, my
joins don’t seem to hold very much in memory, indeed at any given point a pair
of strings and another string is all i seem to hold.

The container limit is 7Gb according to the error in my container logs and has
been apparently reasonable for jobs i’ve run in the past.
But again, I don’t see where in my program i am actually keeping anything in
memory at all.
And yet sure enough, after about 30 minutes of running, over a time period of
like 2 or so minutes, one of the containers grows from about 800mb to 7Gb and
is promptly killed.

So, my questions, what could be going on here and how can i fix it? Is this
just some fundamental feature of my data or is there anything else i can do?

Further rider questions: Is there some logger settings I can use for the logs
to tell me exactly where in my job has been reached? i.e. which RDD is being
constructed or which join is being performed? The RDD numbers and stages aren’t
all that helpful and though i know the spark UI exists some logs i can refer
back to when my cluster has long died would be great.

Cheers
- Sina

A spark join and groupbykey that is making my containers on EC2 go over their memory limits

Hello,

I have many questions about joins, but arguably just one.

The issue i'm having is that (on an m2.xlarge ec2 instance) my container is
overstepping the memory limits and being shut down

This confuses me and makes me question my fundamental understanding of joins.

So, my questions, what could be going on here and how can i fix it? Is this
just some fundamental feature of my data or is there anything else i can do?

Cheers
- Sina

Re: iteratively modifying an RDD

the runtime for each consecutive iteration is still roughly twice as long as 
for the previous one -- is there a way to reduce whatever overhead is 
accumulating? 

On Feb 11, 2015, at 8:11 PM, Davies Liu  wrote:

> On Wed, Feb 11, 2015 at 10:47 AM, rok  wrote:
>> I was having trouble with memory exceptions when broadcasting a large lookup
>> table, so I've resorted to processing it iteratively -- but how can I modify
>> an RDD iteratively?
>> 
>> I'm trying something like :
>> 
>> rdd = sc.parallelize(...)
>> lookup_tables = {...}
>> 
>> for lookup_table in lookup_tables :
>>rdd = rdd.map(lambda x: func(x, lookup_table))
>> 
>> If I leave it as is, then only the last "lookup_table" is applied instead of
>> stringing together all the maps. However, if add a .cache() to the .map then
>> it seems to work fine.
> 
> This is the something related to Python closure implementation, you should
> do it like this:
> 
> def create_func(lookup_table):
> return lambda x: func(x, lookup_table)
> 
> for lookup_table in lookup_tables:
>rdd = rdd.map(create_func(lookup_table))
> 
> The Python closure just remember the variable, not copy the value of it.
> In the loop, `lookup_table` is the same variable. When we serialize the final
> rdd, all the closures are referring to the same `lookup_table`, which points
> to the last value.
> 
> When we create the closure in a function, Python create a variable for
> each closure, so it works.
> 
>> A second problem is that the runtime for each iteration roughly doubles at
>> each iteration so this clearly doesn't seem to be the way to do it. What is
>> the preferred way of doing such repeated modifications to an RDD and how can
>> the accumulation of overhead be minimized?
>> 
>> Thanks!
>> 
>> Rok
>> 
>> 
>> 
>> --
>> View this message in context: 
>> http://apache-spark-user-list.1001560.n3.nabble.com/iteratively-modifying-an-RDD-tp21606.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>> 
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>> 


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org

Re: Spark ML pipeline

2015-02-11 Thread Reynold Xin

Yes. Next release (Spark 1.3) is coming out end of Feb / early Mar.

On Wed, Feb 11, 2015 at 7:22 AM, Jianguo Li 
wrote:

> Hi,
>
> I really like the pipeline in the spark.ml in Spark1.2 release. Will
> there be more machine learning algorithms implemented for the pipeline
> framework in the next major release? Any idea when the next major release
> comes out?
>
> Thanks,
>
> Jianguo
>

Easy way to "partition" an RDD into chunks like Guava's Iterables.partition

2015-02-11 Thread Corey Nolet

I think the word "partition" here is a tad different than the term
"partition" that we use in Spark. Basically, I want something similar to
Guava's Iterables.partition [1], that is, If I have an RDD[People] and I
want to run an algorithm that can be optimized by working on 30 people at a
time, I'd like to be able to say:

val rdd: RDD[People] = .
val partitioned: RDD[Seq[People]] = rdd.partition(30)

I also don't want any shuffling- everything can still be processed locally.


[1]
http://docs.guava-libraries.googlecode.com/git/javadoc/com/google/common/collect/Iterables.html#partition(java.lang.Iterable,%20int)

Re: Easy way to "partition" an RDD into chunks like Guava's Iterables.partition

2015-02-11 Thread Mark Hamstra

rdd.mapPartitions { iter =>
  val grouped = iter.grouped(batchSize)
  for (group <- grouped) { ... }
}

On Wed, Feb 11, 2015 at 2:44 PM, Corey Nolet  wrote:

> I think the word "partition" here is a tad different than the term
> "partition" that we use in Spark. Basically, I want something similar to
> Guava's Iterables.partition [1], that is, If I have an RDD[People] and I
> want to run an algorithm that can be optimized by working on 30 people at a
> time, I'd like to be able to say:
>
> val rdd: RDD[People] = .
> val partitioned: RDD[Seq[People]] = rdd.partition(30)
>
> I also don't want any shuffling- everything can still be processed
> locally.
>
>
> [1]
> http://docs.guava-libraries.googlecode.com/git/javadoc/com/google/common/collect/Iterables.html#partition(java.lang.Iterable,%20int)
>

A spark join and groupbykey that is making my containers on EC2 go over their memory limits

Hello,

I have many questions about joins, but arguably just one.

The issue i'm having is that (on an m2.xlarge ec2 instance) my container is
overstepping the memory limits and being shut down

This confuses me and makes me question my fundamental understanding of joins.

So, my questions, what could be going on here and how can i fix it? Is this
just some fundamental feature of my data or is there anything else i can do?

Cheers
- Sina

Re: Easy way to "partition" an RDD into chunks like Guava's Iterables.partition

2015-02-11 Thread Corey Nolet

Doesn't iter still need to fit entirely into memory?

On Wed, Feb 11, 2015 at 5:55 PM, Mark Hamstra 
wrote:

> rdd.mapPartitions { iter =>
>   val grouped = iter.grouped(batchSize)
>   for (group <- grouped) { ... }
> }
>
> On Wed, Feb 11, 2015 at 2:44 PM, Corey Nolet  wrote:
>
>> I think the word "partition" here is a tad different than the term
>> "partition" that we use in Spark. Basically, I want something similar to
>> Guava's Iterables.partition [1], that is, If I have an RDD[People] and I
>> want to run an algorithm that can be optimized by working on 30 people at a
>> time, I'd like to be able to say:
>>
>> val rdd: RDD[People] = .
>> val partitioned: RDD[Seq[People]] = rdd.partition(30)
>>
>> I also don't want any shuffling- everything can still be processed
>> locally.
>>
>>
>> [1]
>> http://docs.guava-libraries.googlecode.com/git/javadoc/com/google/common/collect/Iterables.html#partition(java.lang.Iterable,%20int)
>>
>
>

Re: Easy way to "partition" an RDD into chunks like Guava's Iterables.partition

2015-02-11 Thread Mark Hamstra

No, only each group should need to fit.

On Wed, Feb 11, 2015 at 2:56 PM, Corey Nolet  wrote:

> Doesn't iter still need to fit entirely into memory?
>
> On Wed, Feb 11, 2015 at 5:55 PM, Mark Hamstra 
> wrote:
>
>> rdd.mapPartitions { iter =>
>>   val grouped = iter.grouped(batchSize)
>>   for (group <- grouped) { ... }
>> }
>>
>> On Wed, Feb 11, 2015 at 2:44 PM, Corey Nolet  wrote:
>>
>>> I think the word "partition" here is a tad different than the term
>>> "partition" that we use in Spark. Basically, I want something similar to
>>> Guava's Iterables.partition [1], that is, If I have an RDD[People] and I
>>> want to run an algorithm that can be optimized by working on 30 people at a
>>> time, I'd like to be able to say:
>>>
>>> val rdd: RDD[People] = .
>>> val partitioned: RDD[Seq[People]] = rdd.partition(30)
>>>
>>> I also don't want any shuffling- everything can still be processed
>>> locally.
>>>
>>>
>>> [1]
>>> http://docs.guava-libraries.googlecode.com/git/javadoc/com/google/common/collect/Iterables.html#partition(java.lang.Iterable,%20int)
>>>
>>
>>
>

Re: How to do broadcast join in SparkSQL

2015-02-11 Thread Dima Zhiyanov

Hello

Has Spark implemented computing statistics for Parquet files? Or is there
any other way I can enable broadcast joins between parquet file RDDs in
Spark Sql?

Thanks
Dima




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/How-to-do-broadcast-join-in-SparkSQL-tp15298p21609.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org

Re: iteratively modifying an RDD

On Wed, Feb 11, 2015 at 2:43 PM, Rok Roskar  wrote:
> the runtime for each consecutive iteration is still roughly twice as long as 
> for the previous one -- is there a way to reduce whatever overhead is 
> accumulating?

Sorry, I didn't fully understand you question, which two are you comparing?

PySpark will try to combine the multiple map() together, then you will get
a task which need all the lookup_tables (the same size as before).

You could add a checkpoint after some of the iterations.

> On Feb 11, 2015, at 8:11 PM, Davies Liu  wrote:
>
>> On Wed, Feb 11, 2015 at 10:47 AM, rok  wrote:
>>> I was having trouble with memory exceptions when broadcasting a large lookup
>>> table, so I've resorted to processing it iteratively -- but how can I modify
>>> an RDD iteratively?
>>>
>>> I'm trying something like :
>>>
>>> rdd = sc.parallelize(...)
>>> lookup_tables = {...}
>>>
>>> for lookup_table in lookup_tables :
>>>rdd = rdd.map(lambda x: func(x, lookup_table))
>>>
>>> If I leave it as is, then only the last "lookup_table" is applied instead of
>>> stringing together all the maps. However, if add a .cache() to the .map then
>>> it seems to work fine.
>>
>> This is the something related to Python closure implementation, you should
>> do it like this:
>>
>> def create_func(lookup_table):
>> return lambda x: func(x, lookup_table)
>>
>> for lookup_table in lookup_tables:
>>rdd = rdd.map(create_func(lookup_table))
>>
>> The Python closure just remember the variable, not copy the value of it.
>> In the loop, `lookup_table` is the same variable. When we serialize the final
>> rdd, all the closures are referring to the same `lookup_table`, which points
>> to the last value.
>>
>> When we create the closure in a function, Python create a variable for
>> each closure, so it works.
>>
>>> A second problem is that the runtime for each iteration roughly doubles at
>>> each iteration so this clearly doesn't seem to be the way to do it. What is
>>> the preferred way of doing such repeated modifications to an RDD and how can
>>> the accumulation of overhead be minimized?
>>>
>>> Thanks!
>>>
>>> Rok
>>>
>>>
>>>
>>> --
>>> View this message in context: 
>>> http://apache-spark-user-list.1001560.n3.nabble.com/iteratively-modifying-an-RDD-tp21606.html
>>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>>
>>> -
>>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>>> For additional commands, e-mail: user-h...@spark.apache.org
>>>
>

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org

Containers on EC2 instances go over their memory limits