Re: Advantage of using cache()

2014-08-22 Thread Nieyuan
Because map-reduce style tasks like join save their shuffle data to disk, the
only difference between the caching and no-caching versions is the final step:
  >> .map { case (x, (n, i)) => (x, n)}
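
For reference, a minimal sketch of the two versions being compared (hypothetical RDDs, spark-shell style with sc available): in both cases the join's shuffle output is already on disk, so without cache() only the trailing map is re-executed on re-use, while with cache() even that is skipped.

val names  = sc.parallelize(Seq((1, "a"), (2, "b")))   // (id, name)
val counts = sc.parallelize(Seq((1, 10), (2, 20)))     // (id, count)

// no cache(): re-using `joined` re-reads the shuffle files and re-runs the map
val joined = names.join(counts).map { case (x, (n, i)) => (x, n) }

// cache(): re-use is served from memory, skipping the map as well
val cached = names.join(counts).map { case (x, (n, i)) => (x, n) }.cache()

joined.count(); joined.count()
cached.count(); cached.count()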



-
Thanks,
Nieyuan
--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Advantage-of-using-cache-tp12480p12634.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: LDA example?

2014-08-22 Thread Burak Yavuz
You can check out this pull request: https://github.com/apache/spark/pull/476

LDA is on the roadmap for the 1.2 release, hopefully we will officially support 
it then!

Best,
Burak

- Original Message -
From: "Denny Lee" 
To: user@spark.apache.org
Sent: Thursday, August 21, 2014 10:10:35 PM
Subject: LDA example?

Quick question - is there a handy sample / example of how to use the LDA 
algorithm within Spark MLLib?  

Thanks!
Denny



-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: LDA example?

2014-08-22 Thread Debasish Das
Hi Burak,

This LDA implementation is friendly to the equality- and positivity-constrained ALS code
that I added in the following JIRA to formulate robust PLSA:

https://issues.apache.org/jira/plugins/servlet/mobile#issue/SPARK-2426

Should I build upon the PR that you pointed to? I want to run some
experiments to see topic purity on the Reuters and 20 Newsgroups datasets...

Right now I see 3 LDA PRs and I am not sure which one will be merged into MLlib.
The matrix factorization formulation outlined in this PR fits the direction
we would like to take...

Thanks.
Deb
On Aug 22, 2014 12:31 AM, "Burak Yavuz"  wrote:

> You can check out this pull request:
> https://github.com/apache/spark/pull/476
>
> LDA is on the roadmap for the 1.2 release, hopefully we will officially
> support it then!
>
> Best,
> Burak
>
> - Original Message -
> From: "Denny Lee" 
> To: user@spark.apache.org
> Sent: Thursday, August 21, 2014 10:10:35 PM
> Subject: LDA example?
>
> Quick question - is there a handy sample / example of how to use the LDA
> algorithm within Spark MLLib?
>
> Thanks!
> Denny
>
>
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: DStream start a separate DStream

2014-08-22 Thread Mayur Rustagi
Why don't you directly use the DStream created as the output of the windowing
process? Any reason?
Regards
Mayur

Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi 



On Thu, Aug 21, 2014 at 8:38 PM, Josh J  wrote:

> Hi,
>
> I would like to have a sliding window dstream perform a streaming
> computation and store these results. Once these results are stored, I then
> would like to process the results. However, I must wait until the final
> computation is done for all tuples in the sliding window before I begin the
> new DStream. How can I accomplish this with Spark?
>
> Sincerely,
> Josh
>
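
A minimal sketch of what Mayur suggests, assuming a StreamingContext ssc and a source DStream `lines` already exist: everything done inside foreachRDD on the windowed DStream only runs once the window's computation for that batch has completed, so the follow-up processing naturally waits for it.

import org.apache.spark.streaming.Seconds

// assumes ssc: StreamingContext and lines: DStream[String] already exist
val windowed = lines.window(Seconds(30), Seconds(10))

windowed.foreachRDD { (rdd, time) =>
  // every tuple of the current window is in `rdd`, so anything done here
  // runs only after the window's computation has completed
  val results = rdd.map(_.length)                                     // placeholder computation
  results.saveAsTextFile(s"/tmp/window-results-${time.milliseconds}") // store intermediate results (hypothetical path)
  val followUp = results.filter(_ > 10).count()                       // then process them further
  println(followUp)
}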


iterator cause NotSerializableException

2014-08-22 Thread Kevin Jung
Hi
The following code gives me 'Task not serializable:
java.io.NotSerializableException: scala.collection.mutable.ArrayOps$ofInt'

var x = sc.parallelize(Array(1,2,3,4,5,6,7,8,9),3)
var iter = Array(5).toIterator
var value = 5
var value2 = iter.next

x.map( q => q*value).collect //Line 1, it works.

x.map( q=> q*value2).collect //Line 2, error

'value' and 'value2' look exactly the same, so why does this happen?
The iterator from RDD.toLocalIterator causes this too.
I tested it in spark-shell on Spark 1.0.2.

Thanks
Kevin
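
For what it's worth, the exception comes from the closure capturing the non-serializable iterator (in spark-shell the REPL line object holding `iter` can get dragged in). A minimal sketch of one workaround, materializing the iterator into a serializable value before any closure references it; in the shell you may additionally need to define the copy on its own line so the iterator is not in the captured scope:

val x = sc.parallelize(Array(1, 2, 3, 4, 5, 6, 7, 8, 9), 3)

// materialize the iterator before it gets anywhere near a closure
val values = Array(5).toIterator.toArray   // Array[Int] is serializable
val v = values.head                        // plain Int

x.map(q => q * v).collect()                // no iterator in the closure's scope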



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/iterator-cause-NotSerializableException-tp12638.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



countByWindow save the count ?

2014-08-22 Thread Josh J
Hi,

Hopefully a simple question: is there an example of how to save the output of
countByWindow? I would like to save the results to external storage (Kafka or
Redis). The examples show only stream.print().

Thanks,
Josh
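
A minimal sketch of saving the counts with foreachRDD instead of print(); the actual Kafka/Redis write is left as a comment since it depends on your client library, and countByWindow needs checkpointing enabled.

import org.apache.spark.streaming.Seconds

// assumes ssc.checkpoint(...) has been called and stream: DStream[String] exists
val counts = stream.countByWindow(Seconds(30), Seconds(10))

counts.foreachRDD { (rdd, time) =>
  rdd.collect().foreach { count =>
    // replace this with a Kafka producer send or a Redis SET using your client library
    println(s"count for window ending at $time: $count")
  }
}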


Installation On Windows machine

2014-08-22 Thread Mishra, Abhishek
Hello Team,

I was trying to install Spark on my Windows Server 2012 machine and use it in my 
project, but unfortunately I cannot find any documentation for this. Please let me 
know if anything has been drafted for Spark users on Windows. I really need it, as 
we are using Windows machines for Hadoop and other tools and so cannot move to a 
Linux OS. We run Hadoop on the Hortonworks HDP 2.0 platform, and I recently came 
across Spark and want to use it in my project for my analytics work. Please suggest 
links or documents so I can move ahead with installation and usage. I want to run it with Java.

Looking forward for a reply,

Thanking you in Advance,
Sincerely,
Abhishek

Thanks,

Abhishek Mishra
Software Engineer
Innovation Delivery CoE (IDC)

Xerox Services India
4th Floor Tapasya, Infopark,
Kochi, Kerala, India 682030

m +91-989-516-8770

www.xerox.com/businessservices



On Spark Standalone mode, Where the driver program will run?

2014-08-22 Thread taoist...@gmail.com
Hi all,
1. In Spark standalone mode, when a client submits an application, where does the 
driver program run: on the client or on the master?
2. Is standalone mode reliable? Can it be used in production?



taoist...@gmail.com


[PySpark][Python 2.7.8][Spark 1.0.2] count() with TypeError: an integer is required

2014-08-22 Thread Earthson
I am using PySpark with IPython notebook.


data = sc.parallelize(range(1000), 10)

#successful
data.map(lambda x: x+1).collect() 

#Error
data.count()



Something similar:
http://apache-spark-user-list.1001560.n3.nabble.com/Exception-on-simple-pyspark-script-td3415.html

But it does not explain how to solve it. Can anyone help?

---
Py4JJavaError Traceback (most recent call last)
 in ()
> 1 data.count()

/home/workspace/spark-1.0.2-bin-hadoop2/python/pyspark/rdd.pyc in
count(self)
735 3
736 """
--> 737 return self.mapPartitions(lambda i: [sum(1 for _ in
i)]).sum()
738 
739 def stats(self):

/home/workspace/spark-1.0.2-bin-hadoop2/python/pyspark/rdd.pyc in sum(self)
726 6.0
727 """
--> 728 return self.mapPartitions(lambda x:
[sum(x)]).reduce(operator.add)
729 
730 def count(self):

/home/workspace/spark-1.0.2-bin-hadoop2/python/pyspark/rdd.pyc in
reduce(self, f)
646 if acc is not None:
647 yield acc
--> 648 vals = self.mapPartitions(func).collect()
649 return reduce(f, vals)
650 

/home/workspace/spark-1.0.2-bin-hadoop2/python/pyspark/rdd.pyc in
collect(self)
610 """
611 with _JavaStackTrace(self.context) as st:
--> 612   bytesInJava = self._jrdd.collect().iterator()
613 return
list(self._collect_iterator_through_file(bytesInJava))
614 

/home/workspace/spark-1.0.2-bin-hadoop2/python/lib/py4j-0.8.1-src.zip/py4j/java_gateway.py
in __call__(self, *args)
535 answer = self.gateway_client.send_command(command)
536 return_value = get_return_value(answer, self.gateway_client,
--> 537 self.target_id, self.name)
538 
539 for temp_arg in temp_args:

/home/workspace/spark-1.0.2-bin-hadoop2/python/lib/py4j-0.8.1-src.zip/py4j/protocol.py
in get_return_value(answer, gateway_client, target_id, name)
298 raise Py4JJavaError(
299 'An error occurred while calling {0}{1}{2}.\n'.
--> 300 format(target_id, '.', name), value)
301 else:
302 raise Py4JError(


org.apache.spark.api.python.PythonException: Traceback (most recent call
last):
  File "/home/workspace/spark-0.9.1-bin-hadoop2/python/pyspark/worker.py",
line 77, in main
serializer.dump_stream(func(split_index, iterator), outfile)
  File
"/home/workspace/spark-0.9.1-bin-hadoop2/python/pyspark/serializers.py",
line 182, in dump_stream
self.serializer.dump_stream(self._batched(iterator), stream)
  File
"/home/workspace/spark-0.9.1-bin-hadoop2/python/pyspark/serializers.py",
line 117, in dump_stream
for obj in iterator:
  File
"/home/workspace/spark-0.9.1-bin-hadoop2/python/pyspark/serializers.py",
line 171, in _batched
for item in iterator:
  File "/home/workspace/spark-1.0.2-bin-hadoop2/python/pyspark/rdd.py", line
642, in func
TypeError: an integer is required

   
org.apache.spark.api.python.PythonRDD$$anon$1.read(PythonRDD.scala:115)
   
org.apache.spark.api.python.PythonRDD$$anon$1.(PythonRDD.scala:145)
org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:78)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:111)
org.apache.spark.scheduler.Task.run(Task.scala:51)
   
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:183)
   
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
   
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
java.lang.Thread.run(Thread.java:722)
Driver stacktrace:
at
org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1049)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1033)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1031)
at
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at
org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1031)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:635)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:635)
at scala.Option.foreach(Option.scala:236)
at
org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:635)
at
org.apache.spark.scheduler.DAGSchedulerEventProcessActo

Re: [PySpark][Python 2.7.8][Spark 1.0.2] count() with TypeError: an integer is required

2014-08-22 Thread Earthson
I'm running pyspark with Python 2.7.8 under Virtualenv

System Python Version: Python 2.6.x 



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/PySpark-Python-2-7-8-Spark-1-0-2-count-with-TypeError-an-integer-is-required-tp12643p12645.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Extracting an element from the feature vector in LabeledPoint

2014-08-22 Thread LPG
Hi all,

Somewhat related to this question and this data structure: what is the best
way to extract features using names instead of positions? Of course, it is
first necessary to store the names in some way...

Thanks in advance
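
LabeledPoint stores features only by position, so one lightweight approach is to keep a name-to-index map alongside the data. A minimal sketch under that assumption (the feature names are hypothetical):

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

// keep the column names (in the same order as the vector) next to the data
val featureNames = Array("age", "income", "score")   // hypothetical names
val nameToIndex  = featureNames.zipWithIndex.toMap

val p = LabeledPoint(1.0, Vectors.dense(34.0, 52000.0, 0.7))

// extract a feature by name instead of by position
val income = p.features(nameToIndex("income"))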



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Extracting-an-element-from-the-feature-vector-in-LabeledPoint-tp0p12644.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



"Block input-* already exists on this machine; not re-adding it" warnings

2014-08-22 Thread Aniket Bhatnagar
Hi everyone

I back-ported kinesis-asl to Spark 1.0.2 and ran a quick test on my local
machine. It seems to be working fine, but I keep getting the following
warning. I am not sure what it means and whether it is something to worry
about or not.

2014-08-22 15:53:43,803 [pool-1-thread-7] WARN
 o.apache.spark.storage.BlockManager - Block input-0-1408703023600 already
exists on this machine; not re-adding it

Thoughts?

Thanks,
Aniket


Re: Finding Rank in Spark

2014-08-22 Thread athiradas
Does anyone know a way to do this?

I tried sorting the data and writing an auto-increment function.

But since the computation is parallel, the result is wrong.

Is there any way? Please reply.
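
A minimal sketch of one common approach: sort globally, then use zipWithIndex(), which assigns consecutive indices across partitions in sorted order (available since Spark 1.0). The (key, score) pairs are hypothetical.

val scores = sc.parallelize(Seq(("a", 42.0), ("b", 17.0), ("c", 99.0)))

// sort by score descending, then attach a rank that is consistent across partitions
val ranked = scores
  .sortBy(_._2, ascending = false)
  .zipWithIndex()
  .map { case ((key, score), idx) => (key, score, idx + 1) }   // 1-based rank

ranked.collect().foreach(println)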



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Finding-Rank-in-Spark-tp12028p12647.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Understanding how to create custom DStreams in Spark streaming

2014-08-22 Thread Aniket Bhatnagar
Hi everyone

Sorry about the noob question, but I am struggling to understand ways to
create DStreams in Spark. Here is my understanding based on what I could
gather from documentation and studying Spark code (as well as some hunch).
Please correct me if I am wrong.

1. In most cases, one would either extend ReceiverInputDStream
or InputDStream to create a custom DStream that pulls data from an external
system.
 - ReceiverInputDStream is used to distribute data-receiving code (i.e. a
Receiver) to workers. N instances of ReceiverInputDStream result in
distribution to N workers, with no control over which worker node executes
which instance of the receiving code.
 - InputDStream is used to run receiving code in the driver. The driver creates
RDDs which are distributed to worker nodes that run the processing logic. There
is no way to control how an RDD gets distributed to workers unless one
repartitions the generated RDDs.

2. DStreams or RDDs get no feedback on whether the processing was
successful or not. This means one can't implement a re-pull in case of
failures.

The above makes me realize that it is not trivial to implement a streaming
use case with an at-least-once processing guarantee. Any thoughts on building a
reliable real-time processing system using Spark would be appreciated.
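
For case 1 above, a minimal sketch of a custom Receiver pushed to workers via a ReceiverInputDStream (created with ssc.receiverStream); the "external system" polling loop and connection details are placeholders.

import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver

// a skeleton receiver; the polling loop below is just a placeholder
class MyReceiver(host: String, port: Int)
    extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) {

  def onStart(): Unit = {
    new Thread("my-receiver") {
      override def run(): Unit = {
        while (!isStopped()) {
          store(s"record from $host:$port")   // hand data to Spark Streaming
          Thread.sleep(1000)
        }
      }
    }.start()
  }

  def onStop(): Unit = { /* close connections to the external system here */ }
}

// usage, assuming ssc: StreamingContext exists
// val stream = ssc.receiverStream(new MyReceiver("localhost", 9999))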


Losing Executors on cluster with RDDs of 100GB

2014-08-22 Thread Yadid Ayzenberg

Hi all,

I have a spark cluster of 30 machines, 16GB / 8 cores on each running in 
standalone mode. Previously my application was working well ( several 
RDDs the largest being around 50G).
When I started processing larger amounts of data (RDDs of 100G) my app 
is losing executors. Im currently just loading them from a database, 
rePartitioning and persisting to disk (with replication x2)
I have spark.executor.memory= 9G, memoryFraction = 0.5, 
spark.worker.timeout =120, spark.akka.askTimeout=30, 
spark.storage.blockManagerHeartBeatMs=3.
I haven't changed the default worker memory, so it's at 512m (should 
this be larger)?


I've been getting the following messages from my app:

 [error] o.a.s.s.TaskSchedulerImpl - Lost executor 3 on myserver1: 
worker lost
[error] o.a.s.s.TaskSchedulerImpl - Lost executor 13 on myserver2: 
Unknown executor exit code (137) (died from signal 9?)
[error] a.r.EndpointWriter - AssociationError 
[akka.tcp://spark@master:59406] -> 
[akka.tcp://sparkExecutor@myserver2:32955]: Error [Association failed 
with [akka.tcp://sparkExecutor@myserver2:32955]] [
akka.remote.EndpointAssociationException: Association failed with 
[akka.tcp://sparkexecu...@myserver2.com:32955]
Caused by: 
akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2: 
Connection refused: myserver2/198.18.102.160:32955

]
[error] a.r.EndpointWriter - AssociationError 
[akka.tcp://spark@master:59406] -> [akka.tcp://spark@myserver1:53855]: 
Error [Association failed with [akka.tcp://spark@myserver1:53855]] [
akka.remote.EndpointAssociationException: Association failed with 
[akka.tcp://spark@myserver1:53855]
Caused by: 
akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2: 
Connection refused: myserver1/198.18.102.160:53855

]

The worker logs and executor logs do not contain errors. Any ideas what 
the problem is ?


Yadid

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: [PySpark][Python 2.7.8][Spark 1.0.2] count() with TypeError: an integer is required

2014-08-22 Thread Earthson
Do I have to deploy Python to every machine to make "$PYSPARK_PYTHON" work
correctly?



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/PySpark-Python-2-7-8-Spark-1-0-2-count-with-TypeError-an-integer-is-required-tp12643p12651.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Do we have to install the snappy when running the shuffle jobs

2014-08-22 Thread carlmartin
Hi everyone!
Nowadays Spark has Snappy set as the default compression codec in 
spark-1.1.0-SNAPSHOT.
So if I want to run a shuffle job, do I have to install Snappy on Linux?

Manipulating/Analyzing CSV files in Spark on local machine

2014-08-22 Thread Hingorani, Vineet
Hello all,

I am new to Spark and I want to analyze a csv file using Spark on my local 
machine. The csv file contains an airline database and I want to get a few 
descriptive statistics (e.g. maximum of one column, mean, standard deviation of 
a column, etc.) for my file. I am reading the file using a simple 
sc.textFile("file.csv"). The questions are:


1.  Is there an optimal way of reading the file so that loading takes less 
time in Spark? The file can be around 3GB.

2.  How do I handle column manipulations for the types of queries 
given above?

Thank you

Regards,
Vineet Hingorani
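
A minimal sketch touching both questions, under the assumption that the column of interest is numeric and at a known index (index 14 here is hypothetical); caching the parsed column helps when computing several statistics over the same 3GB file.

import org.apache.spark.SparkContext._   // implicit conversions for RDD[Double] helpers

val lines = sc.textFile("file.csv")

val col = lines
  .map(_.split(","))
  .filter(_.length > 14)              // skip short/malformed rows
  .map(fields => fields(14).toDouble)
  .cache()                            // reuse the parsed column across several statistics

println("max   = " + col.max())
println("mean  = " + col.mean())
println("stdev = " + col.stdev())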




why classTag not typeTag?

2014-08-22 Thread Mohit Jaggi
Folks,
I am wondering why Spark uses ClassTag in RDD[T: ClassTag] instead of the
more functional TypeTag option.
I have some code that needs TypeTag functionality and I don't know if a
typeTag can be converted to a classTag.

Mohit.
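
A TypeTag can be turned into a ClassTag through the runtime mirror; a minimal sketch:

import scala.reflect.ClassTag
import scala.reflect.runtime.universe._

def typeToClassTag[T: TypeTag]: ClassTag[T] =
  ClassTag[T](typeTag[T].mirror.runtimeClass(typeTag[T].tpe))

// example: obtain the ClassTag that RDD[T: ClassTag] methods need
val ct: ClassTag[String] = typeToClassTag[String]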


Re: pyspark/yarn and inconsistent number of executors

2014-08-22 Thread Sandy Ryza
Hi Calvin,

When you say "until all the memory in the cluster is allocated and the job
gets killed", do you know what's going on?  Spark apps should never be
killed for requesting / using too many resources?  Any associated error
message?

Unfortunately there are no tools currently for tweaking the number of
executors in an automated manner.  An option to use the entire YARN cluster
seems useful. I just filed a JIRA for it -
https://issues.apache.org/jira/browse/SPARK-3183.

-Sandy


On Tue, Aug 19, 2014 at 12:51 PM, Calvin  wrote:

> I've set up a YARN (Hadoop 2.4.1) cluster with Spark 1.0.1 and I've
> been seeing some inconsistencies with out of memory errors
> (java.lang.OutOfMemoryError: unable to create new native thread) when
> increasing the number of executors for a simple job (wordcount).
>
> The general format of my submission is:
>
> spark-submit \
>  --master yarn-client \
>  --num-executors=$EXECUTORS \
>  --executor-cores 1 \
>  --executor-memory 2G \
>  --driver-memory 3G \
>  count.py intput output
>
> If I run without specifying the number of executors, it defaults to
> two (3 containers: 2 executors, 1 driver). Is there any mechanism to
> let a spark application scale to the capacity of the YARN cluster
> automatically?
>
> Similarly, for low numbers of executors I get what I asked for (e.g.,
> 10 executors results in 11 containers running, 20 executors results in
> 21 containers, etc) until a particular threshold... when I specify 50
> containers, Spark seems to start asking for more and more containers
> until all the memory in the cluster is allocated and the job gets
> killed.
>
> I don't understand that particular behavior—if anyone has any
> thoughts, that would be great if you could share your experiences.
>
> Wouldn't it be preferable to have Spark stop requesting containers if
> the cluster is at capacity rather than kill the job or error out?
>
> Does anyone have any recommendations on how to tweak the number of
> executors in an automated manner?
>
> Thanks,
> Calvin
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: Spark SQL Parser error

2014-08-22 Thread Yin Huai
Hi Sankar,

You need to create an external table in order to specify the location of
data (i.e. using CREATE EXTERNAL TABLE user1  LOCATION).  You can take
a look at this page for reference.

Thanks,

Yin
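
A minimal sketch of the suggested form (schema and S3 path copied from the original query; hql is the HiveContext method already used in this thread):

val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)

hiveContext.hql("""
  CREATE EXTERNAL TABLE user1 (time string, id string, u_id string, c_ip string, user_agent string)
  ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\n'
  STORED AS TEXTFILE
  LOCATION 's3n://hadoop.anonymous.com/output/qa/cnv_px_ip_gnc/ds=2014-06-14/'
""")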


On Thu, Aug 21, 2014 at 11:12 PM, S Malligarjunan <
smalligarju...@yahoo.com.invalid> wrote:

> Hello All,
>
> When i execute the following query
>
>
> val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
>
> CREATE TABLE user1 (time string, id string, u_id string, c_ip string,
> user_agent string) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LINES
> TERMINATED BY '\n' STORED AS TEXTFILE LOCATION 's3n://
> hadoop.anonymous.com/output/qa/cnv_px_ip_gnc/ds=2014-06-14/')
>
> I am getting the following error
> org.apache.spark.sql.hive.HiveQl$ParseException: Failed to parse: CREATE
> TABLE user1 (time string, id string, u_id string, c_ip string, user_agent
> string) ROW FORMAT DELIMITED FIELDS TERMINATED BY ' ' LINES TERMINATED BY
> '
> ' STORED AS TEXTFILE LOCATION 's3n://
> hadoop.anonymous.com/output/qa/cnv_px_ip_gnc/ds=2014-06-14/')
>  at org.apache.spark.sql.hive.HiveQl$.parseSql(HiveQl.scala:215)
> at org.apache.spark.sql.hive.HiveContext.hiveql(HiveContext.scala:98)
>  at org.apache.spark.sql.hive.HiveContext.hql(HiveContext.scala:102)
> at $iwC$$iwC$$iwC$$iwC$$iwC.(:22)
>  at $iwC$$iwC$$iwC$$iwC.(:27)
> at $iwC$$iwC$$iwC.(:29)
>  at $iwC$$iwC.(:31)
> at $iwC.(:33)
>  at (:35)
>
> Kindly let me know what could be the issue here.
>
> I have cloned spark from github. Using Hadoop 1.0.3
>
> Thanks and Regards,
> Sankar S.
>
>


Re: Spark SQL Parser error

2014-08-22 Thread S Malligarjunan
Hello Yin,

I have tried  the create external table command as well. I get the same error.
Please help me to find the root cause.
 
Thanks and Regards,
Sankar S.  



On Friday, 22 August 2014, 22:43, Yin Huai  wrote:
 


Hi Sankar,

You need to create an external table in order to specify the location of data 
(i.e. using CREATE EXTERNAL TABLE user1  LOCATION).  You can take a look at 
this page for reference. 

Thanks,

Yin



On Thu, Aug 21, 2014 at 11:12 PM, S Malligarjunan 
 wrote:

Hello All,
>
>
>When i execute the following query 
>
>
>
>
>val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
>
>
>CREATE TABLE user1 (time string, id string, u_id string, c_ip string, 
>user_agent string) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LINES 
>TERMINATED BY '\n' STORED AS TEXTFILE LOCATION 
>'s3n://hadoop.anonymous.com/output/qa/cnv_px_ip_gnc/ds=2014-06-14/')
>
>
>I am getting the following error 
>org.apache.spark.sql.hive.HiveQl$ParseException: Failed to parse: CREATE TABLE 
>user1 (time string, id string, u_id string, c_ip string, user_agent string) 
>ROW FORMAT DELIMITED FIELDS TERMINATED BY '' LINES TERMINATED BY '
>' STORED AS TEXTFILE LOCATION 
>'s3n://hadoop.anonymous.com/output/qa/cnv_px_ip_gnc/ds=2014-06-14/')
>at org.apache.spark.sql.hive.HiveQl$.parseSql(HiveQl.scala:215)
>at org.apache.spark.sql.hive.HiveContext.hiveql(HiveContext.scala:98)
>at org.apache.spark.sql.hive.HiveContext.hql(HiveContext.scala:102)
>at $iwC$$iwC$$iwC$$iwC$$iwC.(:22)
>at $iwC$$iwC$$iwC$$iwC.(:27)
>at $iwC$$iwC$$iwC.(:29)
>at $iwC$$iwC.(:31)
>at $iwC.(:33)
>at (:35)
>
>
>Kindly let me know what could be the issue here.
>
>
>I have cloned spark from github. Using Hadoop 1.0.3 
> 
>Thanks and Regards,
>Sankar S.  
>
>

Re: Spark SQL Parser error

2014-08-22 Thread S Malligarjunan
Hello Yin,

Forgot to mention one thing, the same query works fine in Hive and Shark..
 
Thanks and Regards,
Sankar S.  



On , S Malligarjunan  wrote:
 


Hello Yin,

I have tried  the create external table command as well. I get the same error.
Please help me to find the root cause.
 
Thanks and Regards,
Sankar S.  



On Friday, 22 August 2014, 22:43, Yin Huai  wrote:
 


Hi Sankar,

You need to create an external table in order to specify the location of data 
(i.e. using CREATE EXTERNAL TABLE user1  LOCATION).  You can take a look at 
this page for reference. 

Thanks,

Yin



On Thu, Aug 21, 2014 at 11:12 PM, S Malligarjunan 
 wrote:

Hello All,
>
>
>When i execute the following query 
>
>
>
>
>val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
>
>
>CREATE TABLE user1 (time string, id string, u_id string, c_ip string, 
>user_agent string) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LINES 
>TERMINATED BY '\n' STORED AS TEXTFILE LOCATION 
>'s3n://hadoop.anonymous.com/output/qa/cnv_px_ip_gnc/ds=2014-06-14/')
>
>
>I am getting the following error 
>org.apache.spark.sql.hive.HiveQl$ParseException: Failed to parse: CREATE TABLE 
>user1 (time string, id string, u_id string, c_ip string, user_agent string) 
>ROW FORMAT DELIMITED FIELDS TERMINATED BY '' LINES TERMINATED BY '
>' STORED AS TEXTFILE LOCATION 
>'s3n://hadoop.anonymous.com/output/qa/cnv_px_ip_gnc/ds=2014-06-14/')
>at org.apache.spark.sql.hive.HiveQl$.parseSql(HiveQl.scala:215)
>at org.apache.spark.sql.hive.HiveContext.hiveql(HiveContext.scala:98)
>at org.apache.spark.sql.hive.HiveContext.hql(HiveContext.scala:102)
>at $iwC$$iwC$$iwC$$iwC$$iwC.(:22)
>at $iwC$$iwC$$iwC$$iwC.(:27)
>at $iwC$$iwC$$iwC.(:29)
>at $iwC$$iwC.(:31)
>at $iwC.(:33)
>at (:35)
>
>
>Kindly let me know what could be the issue here.
>
>
>I have cloned spark from github. Using Hadoop 1.0.3 
> 
>Thanks and Regards,
>Sankar S.  
>
>

Re: Spark SQL Parser error

2014-08-22 Thread S Malligarjunan
Hello Yin/All.

@Yin - Thanks for helping. I solved the sql parser error. I am getting the 
following exception now

scala> hiveContext.hql("ADD JAR s3n://hadoop.anonymous.com/lib/myudf.jar");
warning: there were 1 deprecation warning(s); re-run with -deprecation for 
details
14/08/22 17:58:55 INFO SessionState: converting to local 
s3n://hadoop.anonymous.com/lib/myudf.jar
14/08/22 17:58:56 ERROR SessionState: Unable to register 
/tmp/3d273a4c-0494-4bec-80fe-86aa56f11684_resources/myudf.jar
Exception: org.apache.spark.repl.SparkIMain$TranslatingClassLoader cannot be 
cast to java.net.URLClassLoader
java.lang.ClassCastException: 
org.apache.spark.repl.SparkIMain$TranslatingClassLoader cannot be cast to 
java.net.URLClassLoader
at org.apache.hadoop.hive.ql.exec.Utilities.addToClassPath(Utilities.java:1680)

 
Thanks and Regards,
Sankar S.  



On Friday, 22 August 2014, 22:53, S Malligarjunan 
 wrote:
 


Hello Yin,

Forgot to mention one thing, the same query works fine in Hive and Shark..
 
Thanks and Regards,
Sankar S.  



On , S Malligarjunan  wrote:
 


Hello Yin,

I have tried  the create external table command as well. I get the same error.
Please help me to find the root cause.
 
Thanks and Regards,
Sankar S.  



On Friday, 22 August 2014, 22:43, Yin Huai  wrote:
 


Hi Sankar,

You need to create an external table in order to specify the location of data 
(i.e. using CREATE EXTERNAL TABLE user1  LOCATION).  You can take a look at 
this page for reference. 

Thanks,

Yin



On Thu, Aug 21, 2014 at 11:12 PM, S Malligarjunan 
 wrote:

Hello All,
>
>
>When i execute the following query 
>
>
>
>
>val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
>
>
>CREATE TABLE user1 (time string, id string, u_id string, c_ip string, 
>user_agent string) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LINES 
>TERMINATED BY '\n' STORED AS TEXTFILE LOCATION 
>'s3n://hadoop.anonymous.com/output/qa/cnv_px_ip_gnc/ds=2014-06-14/')
>
>
>I am getting the following error 
>org.apache.spark.sql.hive.HiveQl$ParseException: Failed to parse: CREATE TABLE 
>user1 (time string, id string, u_id string, c_ip string, user_agent string) 
>ROW FORMAT DELIMITED FIELDS TERMINATED BY '' LINES TERMINATED BY '
>' STORED AS TEXTFILE LOCATION 
>'s3n://hadoop.anonymous.com/output/qa/cnv_px_ip_gnc/ds=2014-06-14/')
>at org.apache.spark.sql.hive.HiveQl$.parseSql(HiveQl.scala:215)
>at org.apache.spark.sql.hive.HiveContext.hiveql(HiveContext.scala:98)
>at org.apache.spark.sql.hive.HiveContext.hql(HiveContext.scala:102)
>at $iwC$$iwC$$iwC$$iwC$$iwC.(:22)
>at $iwC$$iwC$$iwC$$iwC.(:27)
>at $iwC$$iwC$$iwC.(:29)
>at $iwC$$iwC.(:31)
>at $iwC.(:33)
>at (:35)
>
>
>Kindly let me know what could be the issue here.
>
>
>I have cloned spark from github. Using Hadoop 1.0.3 
> 
>Thanks and Regards,
>Sankar S.  
>
>

Re: Finding previous and next element in a sorted RDD

2014-08-22 Thread cjwang
It would be nice if an RDD that was massaged by OrderedRDDFunctions could know
its "neighbors".




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Finding-previous-and-next-element-in-a-sorted-RDD-tp12621p12664.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



RE: Hive From Spark

2014-08-22 Thread Andrew Lee
Hopefully there could be some progress on SPARK-2420. It looks like shading may 
be the favored solution over downgrading.
Any idea when this will happen? Could it happen in Spark 1.1.1 or Spark 1.1.2? 
By the way, regarding bin/spark-sql: is this more of a debugging tool for Spark 
jobs integrating with Hive? How do people use spark-sql? I'm trying to 
understand the rationale and motivation behind this script; any ideas?

> Date: Thu, 21 Aug 2014 16:31:08 -0700
> Subject: Re: Hive From Spark
> From: van...@cloudera.com
> To: l...@yahoo-inc.com.invalid
> CC: user@spark.apache.org; u...@spark.incubator.apache.org; pwend...@gmail.com
> 
> Hi Du,
> 
> I don't believe the Guava change has made it to the 1.1 branch. The
> Guava doc says "hashInt" was added in 12.0, so what's probably
> happening is that you have and old version of Guava in your classpath
> before the Spark jars. (Hadoop ships with Guava 11, so that may be the
> source of your problem.)
> 
> On Thu, Aug 21, 2014 at 4:23 PM, Du Li  wrote:
> > Hi,
> >
> > This guava dependency conflict problem should have been fixed as of
> > yesterday according to https://issues.apache.org/jira/browse/SPARK-2420
> >
> > However, I just got java.lang.NoSuchMethodError:
> > com.google.common.hash.HashFunction.hashInt(I)Lcom/google/common/hash/HashCode;
> > by the following code snippet and “mvn3 test” on Mac. I built the latest
> > version of spark (1.1.0-SNAPSHOT) and installed the jar files to the local
> > maven repo. From my pom file I explicitly excluded guava from almost all
> > possible dependencies, such as spark-hive_2.10-1.1.0.SNAPSHOT, and
> > hadoop-client. This snippet is abstracted from a larger project. So the
> > pom.xml includes many dependencies although not all are required by this
> > snippet. The pom.xml is attached.
> >
> > Anybody knows what to fix it?
> >
> > Thanks,
> > Du
> > ---
> >
> > package com.myself.test
> >
> > import org.scalatest._
> > import org.apache.hadoop.io.{NullWritable, BytesWritable}
> > import org.apache.spark.{SparkContext, SparkConf}
> > import org.apache.spark.SparkContext._
> >
> > class MyRecord(name: String) extends Serializable {
> >   def getWritable(): BytesWritable = {
> > new
> > BytesWritable(Option(name).getOrElse("\\N").toString.getBytes("UTF-8"))
> >   }
> >
> >   final override def equals(that: Any): Boolean = {
> > if( !that.isInstanceOf[MyRecord] )
> >   false
> > else {
> >   val other = that.asInstanceOf[MyRecord]
> >   this.getWritable == other.getWritable
> > }
> >   }
> > }
> >
> > class MyRecordTestSuite extends FunSuite {
> >   // construct an MyRecord by Consumer.schema
> >   val rec: MyRecord = new MyRecord("James Bond")
> >
> >   test("generated SequenceFile should be readable from spark") {
> > val path = "./testdata/"
> >
> > val conf = new SparkConf(false).setMaster("local").setAppName("test data
> > exchange with Hive")
> > conf.set("spark.driver.host", "localhost")
> > val sc = new SparkContext(conf)
> > val rdd = sc.makeRDD(Seq(rec))
> > rdd.map((x: MyRecord) => (NullWritable.get(), x.getWritable()))
> >   .saveAsSequenceFile(path)
> >
> > val bytes = sc.sequenceFile(path, classOf[NullWritable],
> > classOf[BytesWritable]).first._2
> > assert(rec.getWritable() == bytes)
> >
> > sc.stop()
> > System.clearProperty("spark.driver.port")
> >   }
> > }
> >
> >
> > From: Andrew Lee 
> > Reply-To: "user@spark.apache.org" 
> > Date: Monday, July 21, 2014 at 10:27 AM
> > To: "user@spark.apache.org" ,
> > "u...@spark.incubator.apache.org" 
> >
> > Subject: RE: Hive From Spark
> >
> > Hi All,
> >
> > Currently, if you are running Spark HiveContext API with Hive 0.12, it won't
> > work due to the following 2 libraries which are not consistent with Hive
> > 0.12 and Hadoop as well. (Hive libs aligns with Hadoop libs, and as a common
> > practice, they should be consistent to work inter-operable).
> >
> > These are under discussion in the 2 JIRA tickets:
> >
> > https://issues.apache.org/jira/browse/HIVE-7387
> >
> > https://issues.apache.org/jira/browse/SPARK-2420
> >
> > When I ran the command by tweaking the classpath and build for Spark
> > 1.0.1-rc3, I was able to create table through HiveContext, however, when I
> > fetch the data, due to incompatible API calls in Guava, it breaks. This is
> > critical since it needs to map the cllumns to the RDD schema.
> >
> > Hive and Hadoop are using an older version of guava libraries (11.0.1) where
> > Spark Hive is using guava 14.0.1+.
> > The community isn't willing to downgrade to 11.0.1 which is the current
> > version for Hadoop 2.2 and Hive 0.12.
> > Be aware of protobuf version as well in Hive 0.12 (it uses protobuf 2.4).
> >
> > scala>
> >
> > scala> import org.apache.spark.SparkContext
> > import org.apache.spark.SparkContext
> >
> > scala> import org.apache.spark.sql.hive._
> > import org.apache.spark.sql.hive._
> >
> > scala>
> >
> > scala> val hiveContext = new org.apac

importing scala libraries from python?

2014-08-22 Thread Jonathan Haddad
This is probably a bit ridiculous, but I'm wondering if it's possible
to use Scala libraries from a Python module?  The Cassandra connector
here https://github.com/datastax/spark-cassandra-connector is in
Scala; would I need a Python version of that library to use it from
PySpark?

Personally I have no issue with using Scala, but I'm exploring if
it'll be possible to integrate spark into my Python Cassandra object
mapper, cqlengine.

-- 
Jon Haddad
http://www.rustyrazorblade.com
twitter: rustyrazorblade

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Hive From Spark

2014-08-22 Thread Marcelo Vanzin
SPARK-2420 is fixed. I don't think it will be in 1.1, though - might
be too risky at this point.

I'm not familiar with spark-sql.

On Fri, Aug 22, 2014 at 11:25 AM, Andrew Lee  wrote:
> Hopefully there could be some progress on SPARK-2420. It looks like shading
> may be the voted solution among downgrading.
>
> Any idea when this will happen? Could it happen in Spark 1.1.1 or Spark
> 1.1.2?
>
> By the way, regarding bin/spark-sql? Is this more of a debugging tool for
> Spark job integrating with Hive?
> How does people use spark-sql? I'm trying to understand the rationale and
> motivation behind this script, any idea?
>
>
>> Date: Thu, 21 Aug 2014 16:31:08 -0700
>
>> Subject: Re: Hive From Spark
>> From: van...@cloudera.com
>> To: l...@yahoo-inc.com.invalid
>> CC: user@spark.apache.org; u...@spark.incubator.apache.org;
>> pwend...@gmail.com
>
>>
>> Hi Du,
>>
>> I don't believe the Guava change has made it to the 1.1 branch. The
>> Guava doc says "hashInt" was added in 12.0, so what's probably
>> happening is that you have and old version of Guava in your classpath
>> before the Spark jars. (Hadoop ships with Guava 11, so that may be the
>> source of your problem.)
>>
>> On Thu, Aug 21, 2014 at 4:23 PM, Du Li  wrote:
>> > Hi,
>> >
>> > This guava dependency conflict problem should have been fixed as of
>> > yesterday according to https://issues.apache.org/jira/browse/SPARK-2420
>> >
>> > However, I just got java.lang.NoSuchMethodError:
>> >
>> > com.google.common.hash.HashFunction.hashInt(I)Lcom/google/common/hash/HashCode;
>> > by the following code snippet and “mvn3 test” on Mac. I built the latest
>> > version of spark (1.1.0-SNAPSHOT) and installed the jar files to the
>> > local
>> > maven repo. From my pom file I explicitly excluded guava from almost all
>> > possible dependencies, such as spark-hive_2.10-1.1.0.SNAPSHOT, and
>> > hadoop-client. This snippet is abstracted from a larger project. So the
>> > pom.xml includes many dependencies although not all are required by this
>> > snippet. The pom.xml is attached.
>> >
>> > Anybody knows what to fix it?
>> >
>> > Thanks,
>> > Du
>> > ---
>> >
>> > package com.myself.test
>> >
>> > import org.scalatest._
>> > import org.apache.hadoop.io.{NullWritable, BytesWritable}
>> > import org.apache.spark.{SparkContext, SparkConf}
>> > import org.apache.spark.SparkContext._
>> >
>> > class MyRecord(name: String) extends Serializable {
>> > def getWritable(): BytesWritable = {
>> > new
>> > BytesWritable(Option(name).getOrElse("\\N").toString.getBytes("UTF-8"))
>> > }
>> >
>> > final override def equals(that: Any): Boolean = {
>> > if( !that.isInstanceOf[MyRecord] )
>> > false
>> > else {
>> > val other = that.asInstanceOf[MyRecord]
>> > this.getWritable == other.getWritable
>> > }
>> > }
>> > }
>> >
>> > class MyRecordTestSuite extends FunSuite {
>> > // construct an MyRecord by Consumer.schema
>> > val rec: MyRecord = new MyRecord("James Bond")
>> >
>> > test("generated SequenceFile should be readable from spark") {
>> > val path = "./testdata/"
>> >
>> > val conf = new SparkConf(false).setMaster("local").setAppName("test data
>> > exchange with Hive")
>> > conf.set("spark.driver.host", "localhost")
>> > val sc = new SparkContext(conf)
>> > val rdd = sc.makeRDD(Seq(rec))
>> > rdd.map((x: MyRecord) => (NullWritable.get(), x.getWritable()))
>> > .saveAsSequenceFile(path)
>> >
>> > val bytes = sc.sequenceFile(path, classOf[NullWritable],
>> > classOf[BytesWritable]).first._2
>> > assert(rec.getWritable() == bytes)
>> >
>> > sc.stop()
>> > System.clearProperty("spark.driver.port")
>> > }
>> > }
>> >
>> >
>> > From: Andrew Lee 
>> > Reply-To: "user@spark.apache.org" 
>> > Date: Monday, July 21, 2014 at 10:27 AM
>> > To: "user@spark.apache.org" ,
>> > "u...@spark.incubator.apache.org" 
>> >
>> > Subject: RE: Hive From Spark
>> >
>> > Hi All,
>> >
>> > Currently, if you are running Spark HiveContext API with Hive 0.12, it
>> > won't
>> > work due to the following 2 libraries which are not consistent with Hive
>> > 0.12 and Hadoop as well. (Hive libs aligns with Hadoop libs, and as a
>> > common
>> > practice, they should be consistent to work inter-operable).
>> >
>> > These are under discussion in the 2 JIRA tickets:
>> >
>> > https://issues.apache.org/jira/browse/HIVE-7387
>> >
>> > https://issues.apache.org/jira/browse/SPARK-2420
>> >
>> > When I ran the command by tweaking the classpath and build for Spark
>> > 1.0.1-rc3, I was able to create table through HiveContext, however, when
>> > I
>> > fetch the data, due to incompatible API calls in Guava, it breaks. This
>> > is
>> > critical since it needs to map the cllumns to the RDD schema.
>> >
>> > Hive and Hadoop are using an older version of guava libraries (11.0.1)
>> > where
>> > Spark Hive is using guava 14.0.1+.
>> > The community isn't willing to downgrade to 11.0.1 which is the current
>> > version for Hadoop 2.2 and Hive 0.12.
>> > Be aware of protobuf version as well in Hive 0.12 (it uses protobuf
>

How to start master and workers on Windows

2014-08-22 Thread Steve Lewis
 Thank you for your advice - I really need this help and promise to post a
blog entry once it works

I ran
>bin\spark-class.cmd org.apache.spark.deploy.master.Master
and this ran successfully and I got a web page at http://localhost:8080

this says
Spark Master at spark://192.168.1.4:7077
*My machine could be called localhost, asterix or 192.168.1.4*
Then in a separate cmd window I tried
 >bin\spark-class.cmd org.apache.spark.deploy.worker.Worker spark://
192.168.1.4:7077
14/08/22 11:52:42 INFO SecurityManager: Using Spark's default log4j
profile: org/apache/spark/log4j-defaults.properties
14/08/22 11:52:42 INFO SecurityManager: Changing view acls to: Steve
14/08/22 11:52:42 INFO SecurityManager: SecurityManager: authentication
disabled; ui acls disabled; users with view permissions: Set(Steve)
14/08/22 11:52:43 INFO Slf4jLogger: Slf4jLogger started
14/08/22 11:52:43 INFO Remoting: Starting remoting
14/08/22 11:52:43 INFO Remoting: Remoting started; listening on addresses
:[akka.tcp://sparkWorker@192.168.1.4:53068]
14/08/22 11:52:43 INFO Worker: Starting Spark worker 192.168.1.4:53068 with
6 cores, 14.9 GB RAM
14/08/22 11:52:43 INFO Worker: Spark home: e:\spark-1.0.2-bin-hadoop1\bin\..
14/08/22 11:52:43 INFO WorkerWebUI: Started WorkerWebUI at
http://192.168.1.4:8081
14/08/22 11:52:43 INFO Worker: Connecting to master spark://192.168.1.4:7077
...
14/08/22 11:52:43 INFO Worker: Successfully registered with master spark://
192.168.1.4:7077


Now I fire up IntelliJ and try to run JavaWordCount with the command line
including
-Dspark.master=spark://192.168.1.4:7077

And it says

14/08/22 11:52:58 INFO Worker: Asked to launch executor
app-20140822115258-0002/0 for JavaWordCount
14/08/22 11:52:58 INFO ExecutorRunner: Launch command:
"C:\Progra~1\Java\jdk1.6.0_25/bin/java" "-cp"
"E:\ThirdParty\spark-1.0.2\bin\..\conf;E:\ThirdParty\spark-1.0.2\bin\..\assembly\target\scala-2.10\spark-assembly-1.0.2-hadoop1.0.3-mapr-3.0.3.jar;;E:\ThirdParty\spark-1.0.2\bin\..\lib_managed\jars\datanucleus-api-jdo-3.2.1.jar;E:\ThirdParty\spark-1.0.2\bin\..\lib_managed\jars\datanucleus-core-3.2.2.jar;E:\ThirdParty\spark-1.0.2\bin\..\lib_managed\jars\datanucleus-rdbms-3.2.1.jar"
"-XX:MaxPermSize=128m" "-Xms512M" "-Xmx512M"
"org.apache.spark.executor.CoarseGrainedExecutorBackend" "akka.tcp://
spark@192.168.1.4:53092/user/CoarseGrainedScheduler" "0" "192.168.1.4" "6"
"akka.tcp://sparkWorker@192.168.1.4:53068/user/Worker"
"app-20140822115258-0002"
14/08/22 11:53:01 INFO Worker: Asked to kill executor
app-20140822115258-0002/0
14/08/22 11:53:01 INFO ExecutorRunner: Runner thread for executor
app-20140822115258-0002/0 interrupted
14/08/22 11:53:01 INFO ExecutorRunner: Killing process!
Exception in thread "ExecutorRunner for app-20140822115258-0002/0"
java.lang.InterruptedException
at java.lang.ProcessImpl.waitFor(Native Method)
at org.apache.spark.deploy.worker.ExecutorRunner.org
$apache$spark$deploy$worker$ExecutorRunner$$killProcess(ExecutorRunner.scala:80)
at
org.apache.spark.deploy.worker.ExecutorRunner.fetchAndRunExecutor(ExecutorRunner.scala:157)
at
org.apache.spark.deploy.worker.ExecutorRunner$$anon$1.run(ExecutorRunner.scala:58)
14/08/22 11:53:01 INFO LocalActorRef: Message


 Clearly I am making progress, but now I need to know how to launch the job,
preferably in a debugger, against the worker I have launched.


RE: Hive From Spark

2014-08-22 Thread Jeremy Chambers
> How do people use spark-sql?

See:

http://spark.apache.org/docs/latest/sql-programming-guide.html

From: Andrew Lee [mailto:alee...@hotmail.com]
Sent: Friday, August 22, 2014 2:25 PM
To: Marcelo Vanzin
Cc: user@spark.apache.org; u...@spark.incubator.apache.org; Patrick Wendell
Subject: RE: Hive From Spark

Hopefully there could be some progress on SPARK-2420. It looks like shading may 
be the voted solution among downgrading.

Any idea when this will happen? Could it happen in Spark 1.1.1 or Spark 1.1.2?

By the way, regarding bin/spark-sql? Is this more of a debugging tool for Spark 
job integrating with Hive?
How does people use spark-sql? I'm trying to understand the rationale and 
motivation behind this script, any idea?


> Date: Thu, 21 Aug 2014 16:31:08 -0700
> Subject: Re: Hive From Spark
> From: van...@cloudera.com
> To: l...@yahoo-inc.com.invalid
> CC: user@spark.apache.org; 
> u...@spark.incubator.apache.org; 
> pwend...@gmail.com
>
> Hi Du,
>
> I don't believe the Guava change has made it to the 1.1 branch. The
> Guava doc says "hashInt" was added in 12.0, so what's probably
> happening is that you have and old version of Guava in your classpath
> before the Spark jars. (Hadoop ships with Guava 11, so that may be the
> source of your problem.)
>
> On Thu, Aug 21, 2014 at 4:23 PM, Du Li 
> mailto:l...@yahoo-inc.com.invalid>> wrote:
> > Hi,
> >
> > This guava dependency conflict problem should have been fixed as of
> > yesterday according to https://issues.apache.org/jira/browse/SPARK-2420
> >
> > However, I just got java.lang.NoSuchMethodError:
> > com.google.common.hash.HashFunction.hashInt(I)Lcom/google/common/hash/HashCode;
> > by the following code snippet and "mvn3 test" on Mac. I built the latest
> > version of spark (1.1.0-SNAPSHOT) and installed the jar files to the local
> > maven repo. From my pom file I explicitly excluded guava from almost all
> > possible dependencies, such as spark-hive_2.10-1.1.0.SNAPSHOT, and
> > hadoop-client. This snippet is abstracted from a larger project. So the
> > pom.xml includes many dependencies although not all are required by this
> > snippet. The pom.xml is attached.
> >
> > Anybody knows what to fix it?
> >
> > Thanks,
> > Du
> > ---
> >
> > package com.myself.test
> >
> > import org.scalatest._
> > import org.apache.hadoop.io.{NullWritable, BytesWritable}
> > import org.apache.spark.{SparkContext, SparkConf}
> > import org.apache.spark.SparkContext._
> >
> > class MyRecord(name: String) extends Serializable {
> > def getWritable(): BytesWritable = {
> > new
> > BytesWritable(Option(name).getOrElse("\\N").toString.getBytes("UTF-8"))
> > }
> >
> > final override def equals(that: Any): Boolean = {
> > if( !that.isInstanceOf[MyRecord] )
> > false
> > else {
> > val other = that.asInstanceOf[MyRecord]
> > this.getWritable == other.getWritable
> > }
> > }
> > }
> >
> > class MyRecordTestSuite extends FunSuite {
> > // construct an MyRecord by Consumer.schema
> > val rec: MyRecord = new MyRecord("James Bond")
> >
> > test("generated SequenceFile should be readable from spark") {
> > val path = "./testdata/"
> >
> > val conf = new SparkConf(false).setMaster("local").setAppName("test data
> > exchange with Hive")
> > conf.set("spark.driver.host", "localhost")
> > val sc = new SparkContext(conf)
> > val rdd = sc.makeRDD(Seq(rec))
> > rdd.map((x: MyRecord) => (NullWritable.get(), x.getWritable()))
> > .saveAsSequenceFile(path)
> >
> > val bytes = sc.sequenceFile(path, classOf[NullWritable],
> > classOf[BytesWritable]).first._2
> > assert(rec.getWritable() == bytes)
> >
> > sc.stop()
> > System.clearProperty("spark.driver.port")
> > }
> > }
> >
> >
> > From: Andrew Lee mailto:alee...@hotmail.com>>
> > Reply-To: "user@spark.apache.org" 
> > mailto:user@spark.apache.org>>
> > Date: Monday, July 21, 2014 at 10:27 AM
> > To: "user@spark.apache.org" 
> > mailto:user@spark.apache.org>>,
> > "u...@spark.incubator.apache.org" 
> > mailto:u...@spark.incubator.apache.org>>
> >
> > Subject: RE: Hive From Spark
> >
> > Hi All,
> >
> > Currently, if you are running Spark HiveContext API with Hive 0.12, it won't
> > work due to the following 2 libraries which are not consistent with Hive
> > 0.12 and Hadoop as well. (Hive libs aligns with Hadoop libs, and as a common
> > practice, they should be consistent to work inter-operable).
> >
> > These are under discussion in the 2 JIRA tickets:
> >
> > https://issues.apache.org/jira/browse/HIVE-7387
> >
> > https://issues.apache.org/jira/browse/SPARK-2420
> >
> > When I ran the command by tweaking the classpath and build for Spark
> > 1.0.1-rc3, I was able to create table through HiveContext, however, when I
> > fetch the data, due to incompatible API calls in Guava,

FetchFailed when collect at YARN cluster

2014-08-22 Thread Jiayu Zhou
Hi,

I am having this FetchFailed issue when the driver is about to collect about
2.5M lines of short strings (about 10 characters each line) from a YARN
cluster with 400 nodes:

*14/08/22 11:43:27 WARN scheduler.TaskSetManager: Lost task 205.0 in stage
0.0 (TID 1228, aaa.xxx.com): FetchFailed(BlockManagerId(220, aaa.xxx.com,
37899, 0), shuffleId=0, mapId=420, reduceId=205)
14/08/22 11:43:27 WARN scheduler.TaskSetManager: Lost task 603.0 in stage
0.0 (TID 1626, aaa.xxx.com): FetchFailed(BlockManagerId(220, aaa.xxx.com,
37899, 0), shuffleId=0, mapId=420, reduceId=603)*

And other than this FetchFailed, I am not able to see anything else from the
log file (no OOM errors shown).  

This does not happen when there are only 2M lines. I guess it might be because
of the akka message size, so I used the following:

spark.akka.frameSize  100
spark.akka.timeout  200

That does not help either. Has anyone experienced similar problems? 

Thanks,
Jiayu



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/FetchFailed-when-collect-at-YARN-cluster-tp12670.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Spark SQL Parser error

2014-08-22 Thread Yin Huai
Hello Sankar,

"Add JAR" in SQL is not supported at the moment. We are working on it (
https://issues.apache.org/jira/browse/SPARK-2219). For now, can you try
SparkContext.addJar or using "--jars " to launch spark shell?

Thanks,

Yin
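
A minimal sketch of the two suggested alternatives (the jar path is copied from the thread; whether addJar can read directly from s3n:// depends on your Hadoop configuration):

// 1) register the jar programmatically before using the UDF
sc.addJar("s3n://hadoop.anonymous.com/lib/myudf.jar")

// 2) or ship a local copy when launching the shell / application:
//    bin/spark-shell --jars /local/path/to/myudf.jar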


On Fri, Aug 22, 2014 at 2:01 PM, S Malligarjunan 
wrote:

> Hello Yin/All.
>
> @Yin - Thanks for helping. I solved the sql parser error. I am getting the
> following exception now
>
> scala> hiveContext.hql("ADD JAR s3n://hadoop.anonymous.com/lib/myudf.jar
> ");
> warning: there were 1 deprecation warning(s); re-run with -deprecation for
> details
> 14/08/22 17:58:55 INFO SessionState: converting to local s3n://
> hadoop.anonymous.com/lib/myudf.jar
> 14/08/22 17:58:56 ERROR SessionState: Unable to register
> /tmp/3d273a4c-0494-4bec-80fe-86aa56f11684_resources/myudf.jar
> Exception: org.apache.spark.repl.SparkIMain$TranslatingClassLoader cannot
> be cast to java.net.URLClassLoader
> java.lang.ClassCastException:
> org.apache.spark.repl.SparkIMain$TranslatingClassLoader cannot be cast to
> java.net.URLClassLoader
>  at
> org.apache.hadoop.hive.ql.exec.Utilities.addToClassPath(Utilities.java:1680)
>
>
> Thanks and Regards,
> Sankar S.
>
>
>
>   On Friday, 22 August 2014, 22:53, S Malligarjunan
>  wrote:
>
>
>  Hello Yin,
>
> Forgot to mention one thing, the same query works fine in Hive and Shark..
>
> Thanks and Regards,
> Sankar S.
>
>
>
>   On , S Malligarjunan  wrote:
>
>
>  Hello Yin,
>
> I have tried  the create external table command as well. I get the same
> error.
> Please help me to find the root cause.
>
> Thanks and Regards,
> Sankar S.
>
>
>
>   On Friday, 22 August 2014, 22:43, Yin Huai 
> wrote:
>
>
> Hi Sankar,
>
> You need to create an external table in order to specify the location of
> data (i.e. using CREATE EXTERNAL TABLE user1  LOCATION).  You can
> take a look at this page for reference.
>
> Thanks,
>
> Yin
>
>
> On Thu, Aug 21, 2014 at 11:12 PM, S Malligarjunan <
> smalligarju...@yahoo.com.invalid> wrote:
>
> Hello All,
>
> When i execute the following query
>
>
> val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
>
> CREATE TABLE user1 (time string, id string, u_id string, c_ip string,
> user_agent string) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LINES
> TERMINATED BY '\n' STORED AS TEXTFILE LOCATION 's3n://
> hadoop.anonymous.com/output/qa/cnv_px_ip_gnc/ds=2014-06-14/')
>
> I am getting the following error
> org.apache.spark.sql.hive.HiveQl$ParseException: Failed to parse: CREATE
> TABLE user1 (time string, id string, u_id string, c_ip string, user_agent
> string) ROW FORMAT DELIMITED FIELDS TERMINATED BY ' ' LINES TERMINATED BY
> '
> ' STORED AS TEXTFILE LOCATION 's3n://
> hadoop.anonymous.com/output/qa/cnv_px_ip_gnc/ds=2014-06-14/')
>  at org.apache.spark.sql.hive.HiveQl$.parseSql(HiveQl.scala:215)
> at org.apache.spark.sql.hive.HiveContext.hiveql(HiveContext.scala:98)
>  at org.apache.spark.sql.hive.HiveContext.hql(HiveContext.scala:102)
> at $iwC$$iwC$$iwC$$iwC$$iwC.(:22)
>  at $iwC$$iwC$$iwC$$iwC.(:27)
> at $iwC$$iwC$$iwC.(:29)
>  at $iwC$$iwC.(:31)
> at $iwC.(:33)
>  at (:35)
>
> Kindly let me know what could be the issue here.
>
> I have cloned spark from github. Using Hadoop 1.0.3
>
> Thanks and Regards,
> Sankar S.
>
>
>
>
>
>
>
>
>


[PySpark] order of values in GroupByKey()

2014-08-22 Thread Arpan Ghosh
Is there any way to control the ordering of values for each key during a
groupByKey() operation? Is there some sort of implicit ordering in place
already?

Thanks

Arpan


spark streaming - realtime reports - storing current state of resources

2014-08-22 Thread salemi
Hi All,
I have a set of 1000k workers of a company with different attributes associated
with them. I would like to be able to report on their current state at any time
and update the reports every 5 seconds.

Spark Streaming allows you to receive events about the workers' state changes
and process them. Where would I store the state of the 1000k workers so that I
can change the state of the workers in real time and query them in real time
with Spark Streaming?


thanks,
Ali
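
One option that keeps the state inside Spark Streaming itself is updateStateByKey, which maintains per-key state across batches (checkpointing required). A minimal sketch, assuming events arrive as (workerId, newState) pairs on a DStream with a 5-second batch interval:

import org.apache.spark.streaming.StreamingContext._   // pair DStream operations

// assumes ssc.checkpoint(...) has been called and
// events: DStream[(String, String)] of (workerId, newState) already exists
val currentState = events.updateStateByKey[String] { (updates: Seq[String], old: Option[String]) =>
  updates.lastOption.orElse(old)   // keep the most recent state seen for this worker
}

// every batch, push the full current state somewhere queryable (a DB, Redis, HDFS, ...)
currentState.foreachRDD { rdd =>
  rdd.collect().foreach { case (workerId, state) => println(s"$workerId -> $state") }
}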



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/spark-streaming-realtime-reports-storing-current-state-of-resources-tp12674.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



cache table with JDBC

2014-08-22 Thread ken
I am using Spark's Thrift server to connect to Hive and using JDBC to issue
queries. Is there a way to cache a table in Spark by using a JDBC call?

Thanks,
Ken



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/cache-table-with-JDBC-tp12675.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Hive From Spark

2014-08-22 Thread Du Li
I thought the fix had been pushed to the apache master, ref. commit
"[SPARK-2848] Shade Guava in uber-jars" by Marcelo Vanzin on 8/20. So my
previous email was based on my own build of the apache master, which turned
out not to be working yet.

Marcelo: Please correct me if I got that commit wrong.

Thanks,
Du



On 8/22/14, 11:41 AM, "Marcelo Vanzin"  wrote:

>SPARK-2420 is fixed. I don't think it will be in 1.1, though - might
>be too risky at this point.
>
>I'm not familiar with spark-sql.
>
>On Fri, Aug 22, 2014 at 11:25 AM, Andrew Lee  wrote:
>> Hopefully there could be some progress on SPARK-2420. It looks like
>>shading
>> may be the voted solution among downgrading.
>>
>> Any idea when this will happen? Could it happen in Spark 1.1.1 or Spark
>> 1.1.2?
>>
>> By the way, regarding bin/spark-sql? Is this more of a debugging tool
>>for
>> Spark job integrating with Hive?
>> How does people use spark-sql? I'm trying to understand the rationale
>>and
>> motivation behind this script, any idea?
>>
>>
>>> Date: Thu, 21 Aug 2014 16:31:08 -0700
>>
>>> Subject: Re: Hive From Spark
>>> From: van...@cloudera.com
>>> To: l...@yahoo-inc.com.invalid
>>> CC: user@spark.apache.org; u...@spark.incubator.apache.org;
>>> pwend...@gmail.com
>>
>>>
>>> Hi Du,
>>>
>>> I don't believe the Guava change has made it to the 1.1 branch. The
>>> Guava doc says "hashInt" was added in 12.0, so what's probably
>>> happening is that you have and old version of Guava in your classpath
>>> before the Spark jars. (Hadoop ships with Guava 11, so that may be the
>>> source of your problem.)
>>>
>>> On Thu, Aug 21, 2014 at 4:23 PM, Du Li 
>>>wrote:
>>> > Hi,
>>> >
>>> > This guava dependency conflict problem should have been fixed as of
>>> > yesterday according to
>>>https://issues.apache.org/jira/browse/SPARK-2420
>>> >
>>> > However, I just got java.lang.NoSuchMethodError:
>>> >
>>> > 
>>>com.google.common.hash.HashFunction.hashInt(I)Lcom/google/common/hash/Ha
>>>shCode;
>>> > by the following code snippet and “mvn3 test” on Mac. I built the
>>>latest
>>> > version of spark (1.1.0-SNAPSHOT) and installed the jar files to the
>>> > local
>>> > maven repo. From my pom file I explicitly excluded guava from almost
>>>all
>>> > possible dependencies, such as spark-hive_2.10-1.1.0.SNAPSHOT, and
>>> > hadoop-client. This snippet is abstracted from a larger project. So
>>>the
>>> > pom.xml includes many dependencies although not all are required by
>>>this
>>> > snippet. The pom.xml is attached.
>>> >
>>> > Anybody knows what to fix it?
>>> >
>>> > Thanks,
>>> > Du
>>> > ---
>>> >
>>> > package com.myself.test
>>> >
>>> > import org.scalatest._
>>> > import org.apache.hadoop.io.{NullWritable, BytesWritable}
>>> > import org.apache.spark.{SparkContext, SparkConf}
>>> > import org.apache.spark.SparkContext._
>>> >
>>> > class MyRecord(name: String) extends Serializable {
>>> > def getWritable(): BytesWritable = {
>>> > new BytesWritable(Option(name).getOrElse("\\N").toString.getBytes("UTF-8"))
>>> > }
>>> >
>>> > final override def equals(that: Any): Boolean = {
>>> > if( !that.isInstanceOf[MyRecord] )
>>> > false
>>> > else {
>>> > val other = that.asInstanceOf[MyRecord]
>>> > this.getWritable == other.getWritable
>>> > }
>>> > }
>>> > }
>>> >
>>> > class MyRecordTestSuite extends FunSuite {
>>> > // construct an MyRecord by Consumer.schema
>>> > val rec: MyRecord = new MyRecord("James Bond")
>>> >
>>> > test("generated SequenceFile should be readable from spark") {
>>> > val path = "./testdata/"
>>> >
>>> > val conf = new SparkConf(false).setMaster("local").setAppName("test data exchange with Hive")
>>> > conf.set("spark.driver.host", "localhost")
>>> > val sc = new SparkContext(conf)
>>> > val rdd = sc.makeRDD(Seq(rec))
>>> > rdd.map((x: MyRecord) => (NullWritable.get(), x.getWritable()))
>>> > .saveAsSequenceFile(path)
>>> >
>>> > val bytes = sc.sequenceFile(path, classOf[NullWritable],
>>> > classOf[BytesWritable]).first._2
>>> > assert(rec.getWritable() == bytes)
>>> >
>>> > sc.stop()
>>> > System.clearProperty("spark.driver.port")
>>> > }
>>> > }
>>> >
>>> >
>>> > From: Andrew Lee 
>>> > Reply-To: "user@spark.apache.org" 
>>> > Date: Monday, July 21, 2014 at 10:27 AM
>>> > To: "user@spark.apache.org" ,
>>> > "u...@spark.incubator.apache.org" 
>>> >
>>> > Subject: RE: Hive From Spark
>>> >
>>> > Hi All,
>>> >
>>> > Currently, if you are running Spark HiveContext API with Hive 0.12,
>>>it
>>> > won't
>>> > work due to the following 2 libraries which are not consistent with
>>>Hive
>>> > 0.12 and Hadoop as well. (Hive libs aligns with Hadoop libs, and as a
>>> > common
>>> > practice, they should be consistent to work inter-operable).
>>> >
>>> > These are under discussion in the 2 JIRA tickets:
>>> >
>>> > https://issues.apache.org/jira/browse/HIVE-7387
>>> >
>>> > https://issues.apache.org/jira/browse/SPARK-2420
>>> >
>>> > When I ran the command by tweaking the classpath and build for Spark
>>> > 1.0.1-rc3, I was able to cre

Re: wholeTextFiles not working with HDFS

2014-08-22 Thread pierred
I had the same issue with spark-1.0.2-bin-hadoop*1*, and indeed it seems
related to Hadoop 1. When I switch to spark-1.0.2-bin-hadoop*2*, the issue
disappears.
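
For anyone else hitting this, the kind of call in question is roughly the
following (the HDFS path is just a placeholder):

// read every file under the directory as (path, fullContents) pairs
val pairs = sc.wholeTextFiles("hdfs://namenode:8020/some/dir")
pairs.map { case (path, content) => (path, content.length) }.collect().foreach(println)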




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/wholeTextFiles-not-working-with-HDFS-tp7490p12677.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: wholeTextFiles not working with HDFS

2014-08-22 Thread pierred
I forgot to say, I am using bin/spark-shell, spark-1.0.2
That host has scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java
1.8.0_11)




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/wholeTextFiles-not-working-with-HDFS-tp7490p12678.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: AppMaster OOME on YARN

2014-08-22 Thread Vipul Pandey
This is all that I see related to spark.MapOutputTrackerMaster in the master 
logs after OOME


14/08/21 13:24:45 ERROR ActorSystemImpl: Uncaught fatal error from thread 
[spark-akka.actor.default-dispatcher-27] shutting down ActorSystem [spark]
java.lang.OutOfMemoryError: Java heap space
Exception in thread "Thread-59" org.apache.spark.SparkException: Error 
communicating with MapOutputTracker
at 
org.apache.spark.MapOutputTracker.askTracker(MapOutputTracker.scala:108)
at 
org.apache.spark.MapOutputTracker.sendTracker(MapOutputTracker.scala:114)
at 
org.apache.spark.MapOutputTrackerMaster.stop(MapOutputTracker.scala:319)
at org.apache.spark.SparkEnv.stop(SparkEnv.scala:82)
at org.apache.spark.SparkContext.stop(SparkContext.scala:984)
at 
org.apache.spark.deploy.yarn.ApplicationMaster$$anon$1.run(ApplicationMaster.scala:449)
Caused by: akka.pattern.AskTimeoutException: 
Recipient[Actor[akka://spark/user/MapOutputTracker#112553370]] had already been 
terminated.
at akka.pattern.AskableActorRef$.ask$extension(AskSupport.scala:134)
at 
org.apache.spark.MapOutputTracker.askTracker(MapOutputTracker.scala:104)
 


> 2. Every executor will process 10+TB/2000 = 5GB of data. reduceByKey will
> create a hashtable of the unique lines from this 5GB and keep it in memory;
> it may exceed 16GB.

So you mean the master gets that information from individual nodes and keeps it 
in memory? 


 
On Aug 21, 2014, at 8:18 PM, Nieyuan  wrote:

> 1. At the beginning of the reduce task, the master will deliver the map output
> info to every executor. You can check stderr to find the size of the map output
> info. It should be:
> "spark.MapOutputTrackerMaster: Size of output statuses for shuffle 0 is
> xxx bytes"
> 
> 2. Every executor will process 10+TB/2000 = 5GB of data. reduceByKey will
> create a hashtable of the unique lines from this 5GB and keep it in memory;
> it may exceed 16GB.
> 
> 
> 
> --
> View this message in context: 
> http://apache-spark-user-list.1001560.n3.nabble.com/AppMaster-OOME-on-YARN-tp12612p12627.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
> 
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
> 


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



ODBC and HiveThriftServer2

2014-08-22 Thread prnicolas

Is it possible to connect to the thrift server using an ODBC client
(ODBC-JDBC)?
My thrift server is built from branch-1.0-jdbc using Hive 0.13.1



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/ODBC-and-HiveThriftServer2-tp12680.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: [PySpark] order of values in GroupByKey()

2014-08-22 Thread Matthew Farrellee

On 08/22/2014 04:32 PM, Arpan Ghosh wrote:

Is there any way to control the ordering of values for each key during a
groupByKey() operation? Is there some sort of implicit ordering in place
already?

Thanks

Arpan


there's no implicit ordering in place. the same holds for the order of 
keys, unless you use sortByKey.


what are you trying to achieve?

best,


matt

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: [PySpark] order of values in GroupByKey()

2014-08-22 Thread Arpan Ghosh
I was grouping time series data by a key. I want the values to be sorted by
timestamp after the grouping.


On Fri, Aug 22, 2014 at 7:26 PM, Matthew Farrellee  wrote:

> On 08/22/2014 04:32 PM, Arpan Ghosh wrote:
>
>> Is there any way to control the ordering of values for each key during a
>> groupByKey() operation? Is there some sort of implicit ordering in place
>> already?
>>
>> Thanks
>>
>> Arpan
>>
>
> there's no implicit ordering in place. the same holds for the order of
> keys, unless you use sortByKey.
>
> what are you trying to achieve?
>
> best,
>
>
> matt
>


Spark: Why Standalone mode can not set Executor Number.

2014-08-22 Thread Victor Sheng
As far as I know, only YARN mode can set --num-executors, and it has been shown
that running more executors performs better than running only 1 or 2 executors
with large memory and many cores; see
http://apache-spark-user-list.1001560.n3.nabble.com/executor-cores-vs-num-executors-td9878.html

Why does Standalone mode not provide a "num-executors" parameter, instead of
using the spread-out strategy by default to allocate executors?

Can anyone explain this in detail? Thanks :)
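
For context, the only related knobs I know of in standalone mode are along
these lines (property names from the standalone docs; the values are just
examples):

// Application side: cap total cores and size the one executor per worker.
val conf = new org.apache.spark.SparkConf()
  .set("spark.cores.max", "16")        // total cores the app may take across the cluster
  .set("spark.executor.memory", "8g")  // memory of the single executor on each worker
// Master side: spark.deploy.spreadOut=false packs an app's cores onto as few
// workers as possible instead of spreading them out; there is still no direct
// equivalent of --num-executors.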



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Why-Standalone-mode-can-not-set-Executor-Number-tp12684.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: What about implementing various hypothesis test for LogisticRegression in MLlib

2014-08-22 Thread guxiaobo1982
Hi Xiangrui,


You can refer to <>; there are many standard hypothesis tests for linear
regression and logistic regression, and they should be implemented first. Then
we will list some other tests, which are also important when using logistic
regression to build scorecards.


Xiaobo Gu




-- Original --
From:  "Xiangrui Meng";;
Send time: Wednesday, Aug 20, 2014 2:18 PM
To: ""; 
Cc: "user@spark.apache.org"; 
Subject:  Re: What about implementing various hypothesis test for 
LogisticRegression in MLlib



We implemented chi-squared tests in v1.1:
https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/stat/Statistics.scala#L166
and we will add more after v1.1. Feedback on which tests should come
first would be greatly appreciated. -Xiangrui
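
For reference, a minimal use of that API would look something like this (based
on the linked Statistics.scala; I have not run it against a 1.1 build):

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.stat.Statistics

// goodness-of-fit test of observed counts against a uniform distribution
val observed = Vectors.dense(10.0, 20.0, 30.0)
val result = Statistics.chiSqTest(observed)
println(result.pValue)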

On Tue, Aug 19, 2014 at 9:50 PM, guxiaobo1982  wrote:
> Hi,
>
> From the documentation I think only the model fitting part is implement,
> what about the various hypothesis test and performance indexes used to
> evaluate the model fit?
>
> Regards,
>
> Xiaobo Gu

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org

Re: [PySpark] order of values in GroupByKey()

2014-08-22 Thread Matthew Farrellee
you can kv.mapValues(sorted), but that's definitely less efficient than 
sorting during the groupBy


you could try using combineByKey directly w/ heapq...

from bisect import insort
from heapq import merge

def createCombiner(x):
    # start each key's combiner as a single-element (trivially sorted) list
    return [x]

def mergeValues(xs, x):
    # keep the per-partition list sorted as each value arrives
    insort(xs, x)
    return xs

def mergeCombiners(a, b):
    # merge two already-sorted lists; materialize the iterator merge() returns
    return list(merge(a, b))

rdd.combineByKey(createCombiner, mergeValues, mergeCombiners)

best,


matt

On 08/22/2014 10:41 PM, Arpan Ghosh wrote:

I was grouping time series data by a key. I want the values to be sorted
by timestamp after the grouping.


On Fri, Aug 22, 2014 at 7:26 PM, Matthew Farrellee  wrote:

On 08/22/2014 04:32 PM, Arpan Ghosh wrote:

Is there any way to control the ordering of values for each key
during a
groupByKey() operation? Is there some sort of implicit ordering
in place
already?

Thanks

Arpan


there's no implicit ordering in place. the same holds for the order
of keys, unless you use sortByKey.

what are you trying to achieve?

best,


matt





-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: why classTag not typeTag?

2014-08-22 Thread Matei Zaharia
TypeTags are unfortunately not thread-safe in Scala 2.10. They were still 
somewhat experimental at the time so we decided not to use them. If you want 
though, you can probably design other APIs that pass a TypeTag around (e.g. 
make a method that takes an RDD[T] but also requires an implicit TypeTag[T]).
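For example, something along these lines (a rough sketch; "describeElements" is
a made-up helper, not part of Spark):

import scala.reflect.runtime.universe._
import org.apache.spark.rdd.RDD

// The context bound asks the compiler to supply an implicit TypeTag[T],
// so the full element type is available at runtime.
def describeElements[T: TypeTag](rdd: RDD[T]): String =
  s"RDD of ${typeOf[T]}"   // e.g. "RDD of (Int, String)"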

Matei

On Aug 22, 2014, at 9:15 AM, Mohit Jaggi  wrote:

> Folks,
> I am wondering why Spark uses ClassTag in RDD[T: ClassTag] instead of the 
> more functional TypeTag option.
> I have some code that needs TypeTag functionality and I don't know if a 
> typeTag can be converted to a classTag.
> 
> Mohit.


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Installation On Windows machine

2014-08-22 Thread Matei Zaharia
You should be able to just download / unzip a Spark release and run it on a 
Windows machine with the provided .cmd scripts, such as bin\spark-shell.cmd. 
The scripts to launch a standalone cluster (e.g. start-all.sh) won't work on 
Windows, but you can launch a standalone cluster manually using

bin\spark-class org.apache.spark.deploy.master.Master

and

bin\spark-class org.apache.spark.deploy.worker.Worker spark://master:port

For submitting jobs to YARN instead of the standalone cluster, spark-submit.cmd 
*may* work but I don't think we've tested it heavily. If you find issues with 
that, please let us know. But overall the instructions should be the same as on 
Linux, except you use the .cmd scripts instead of the .sh ones.
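
As a quick sanity check, a driver program can then point at that master like
this (host and port are placeholders; I haven't verified this on Windows
specifically):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("windows-smoke-test")
  .setMaster("spark://master:7077")  // the Master started with bin\spark-class above
val sc = new SparkContext(conf)
println(sc.parallelize(1 to 100).count())  // trivial job to confirm the cluster works
sc.stop()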

Matei

On Aug 22, 2014, at 3:01 AM, Mishra, Abhishek  wrote:

> Hello Team,
>  
> I was trying to install Spark on my Windows Server 2012 machine and use it in 
> my project, but unfortunately I could not find any documentation for it. 
> Please let me know if anything has been drafted for Spark users on Windows. I 
> really need it, as we use Windows machines for Hadoop and other tools and so 
> cannot move to a Linux OS. We run Hadoop on the Hortonworks HDP 2.0 platform, 
> and I recently came across Spark and want to use it in my project for my 
> analytics work. Please suggest links or documents so I can move ahead with my 
> installation and usage. I want to run it with Java.
>  
> Looking forward for a reply,
>  
> Thanking you in Advance,
> Sincerely,
> Abhishek
>  
> Thanks,
>  
> Abhishek Mishra
> Software Engineer
> Innovation Delivery CoE (IDC)
>  
> Xerox Services India
> 4th Floor Tapasya, Infopark,
> Kochi, Kerala, India 682030
>  
> m +91-989-516-8770
>  
> www.xerox.com/businessservices


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



FetchFailedException from Block Manager

2014-08-22 Thread Victor Tso-Guillen
Anyone know why I would see this in a bunch of executor logs? Is it just
classical overloading of the cluster network, OOM, or something else? If
anyone's seen this before, what do I need to tune to make some headway here?

Thanks,
Victor


Caused by: org.apache.spark.FetchFailedException: Fetch failed:
BlockManagerId(116, xxx, 54761, 0) 110 32 38

at org.apache.spark.BlockStoreShuffleFetcher.org
$apache$spark$BlockStoreShuffleFetcher$$unpackBlock$1(BlockStoreShuffleFetcher.scala:67)

at
org.apache.spark.BlockStoreShuffleFetcher$$anonfun$3.apply(BlockStoreShuffleFetcher.scala:77)

at
org.apache.spark.BlockStoreShuffleFetcher$$anonfun$3.apply(BlockStoreShuffleFetcher.scala:77)

at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)

at
org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:30)

at
org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)

at
org.apache.spark.Aggregator.combineValuesByKey(Aggregator.scala:58)

at
org.apache.spark.rdd.PairRDDFunctions$$anonfun$combineByKey$4.apply(PairRDDFunctions.scala:107)

at
org.apache.spark.rdd.PairRDDFunctions$$anonfun$combineByKey$4.apply(PairRDDFunctions.scala:106)

at org.apache.spark.rdd.RDD$$anonfun$14.apply(RDD.scala:582)

at org.apache.spark.rdd.RDD$$anonfun$14.apply(RDD.scala:582)

at
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)

at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)

at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)

at
org.apache.spark.rdd.MappedValuesRDD.compute(MappedValuesRDD.scala:31)

at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)

at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)

at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)

at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)

at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)


Re: Configuration for big worker nodes

2014-08-22 Thread Zhan Zhang
I think it depends on your job. In my personal experience running TB-scale data,
Spark got connection-loss failures when I used a big JVM with a lot of memory,
but with more executors and less memory each it ran very smoothly. I was running
Spark on YARN.

Thanks.

Zhan Zhang


On Aug 21, 2014, at 3:42 PM, soroka21  wrote:

> Hi,
> I have relatively big worker nodes. What would be the best worker
> configuration for them? Should I use all memory for JVM and utilize all
> cores when running my jobs?
> Each node has 2x10 cores CPU and 160GB of RAM. Cluster has 4 nodes connected
> with 10G network.
> 
> 
> 
> --
> View this message in context: 
> http://apache-spark-user-list.1001560.n3.nabble.com/Configuration-for-big-worker-nodes-tp12614.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
> 
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
> 


-- 
CONFIDENTIALITY NOTICE
NOTICE: This message is intended for the use of the individual or entity to 
which it is addressed and may contain information that is confidential, 
privileged and exempt from disclosure under applicable law. If the reader 
of this message is not the intended recipient, you are hereby notified that 
any printing, copying, dissemination, distribution, disclosure or 
forwarding of this communication is strictly prohibited. If you have 
received this communication in error, please contact the sender immediately 
and delete it from your system. Thank You.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org