Re: Submitting Spark application through code

2014-11-02 Thread Marius Soutier
Just a wild guess, but I had to exclude "javax.servlet:servlet-api" from my
Hadoop dependencies to run a SparkContext.

In your build.sbt:

"org.apache.hadoop" % "hadoop-common" % "..." exclude("javax.servlet", "servlet-api"),
"org.apache.hadoop" % "hadoop-hdfs" % "..." exclude("javax.servlet", "servlet-api")

(or whatever Hadoop deps you use)

If you're using Maven:

  <exclusions>
    <exclusion>
      <groupId>javax.servlet</groupId>
      <artifactId>servlet-api</artifactId>
    </exclusion>
  </exclusions>
  ...

(inside the corresponding hadoop-common / hadoop-hdfs dependency entries)


On 31.10.2014, at 07:14, sivarani  wrote:

> I tried running it but it didn't work
> 
> public static final SparkConf batchConf= new SparkConf();
> String master = "spark://sivarani:7077";
> String spark_home ="/home/sivarani/spark-1.0.2-bin-hadoop2/";
> String jar = "/home/sivarani/build/Test.jar";
> public static final JavaSparkContext batchSparkContext = new
> JavaSparkContext(master,"SparkTest",spark_home,new String[] {jar});
> 
> public static void main(String args[]){
>   runSpark(0,"TestSubmit");}
> 
>   public static void runSpark(int crit, String dataFile){
> JavaRDD<String> logData = batchSparkContext.textFile(input, 10);
> // ... flatMap, mapToPair, reduceByKey ...
> List<Tuple2<String, Integer>> output1 = counts.collect();
>}
>   
> 
> This works fine with spark-submit, but when I tried to submit through code
> LeadBatchProcessing.runSpark(0, "TestSubmit.csv");
> 
> I get this following error 
> 
> HTTP Status 500 - javax.servlet.ServletException:
> org.apache.spark.SparkException: Job aborted due to stage failure: Task
> 0.0:0 failed 4 times, most recent failure: TID 29 on host 172.18.152.36
> failed for unknown reason
> Job aborted due to stage failure: Task 0.0:0 failed 4 times, most recent
> failure: TID 29 on host 172.18.152.36 failed for unknown reason Driver
> stacktrace:
> 
> 
> 
> Any Advice on this?
> 
> 
> 
> 
> --
> View this message in context: 
> http://apache-spark-user-list.1001560.n3.nabble.com/Submiting-Spark-application-through-code-tp17452p17797.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
> 
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
> 



Spark on Yarn probably trying to load all the data to RAM

2014-11-02 Thread jan.zikes

Hi,

I am using Spark on Yarn, particularly Spark in Python. I am trying to run:

myrdd = sc.textFile("s3n://mybucket/files/*/*/*.json")
myrdd.getNumPartitions()

Unfortunately, it seems that Spark tries to load everything into RAM; at least, after
this runs for a while everything slows down and then I get the errors in the log below.
Everything works fine for datasets smaller than RAM, but I would expect Spark to handle
this without storing everything in RAM. So I would like to ask whether I am missing
some setting in Spark on YARN.


Thank you in advance for any help.


14/11/01 22:06:57 ERROR actor.ActorSystemImpl: Uncaught fatal error from thread 
[sparkDriver-akka.actor.default-dispatcher-375] shutting down ActorSystem 
[sparkDriver]
java.lang.OutOfMemoryError: GC overhead limit exceeded
14/11/01 22:06:57 ERROR actor.ActorSystemImpl: Uncaught fatal error from thread 
[sparkDriver-akka.actor.default-dispatcher-381] shutting down ActorSystem 
[sparkDriver]
java.lang.OutOfMemoryError: GC overhead limit exceeded
11744,575: [Full GC 1194515K->1192839K(1365504K), 2,2367150 secs]
11746,814: [Full GC 1194507K->1193186K(1365504K), 2,1788150 secs]
11748,995: [Full GC 1194507K->1193278K(1365504K), 1,3511480 secs]
11750,347: [Full GC 1194507K->1193263K(1365504K), 2,2735350 secs]
11752,622: [Full GC 1194506K->1193192K(1365504K), 1,2700110 secs]
Traceback (most recent call last):
  File "", line 1, in 
  File "/home/hadoop/spark/python/pyspark/rdd.py", line 391, in getNumPartitions
    return self._jrdd.partitions().size()
  File 
"/home/hadoop/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", line 
538, in __call__
  File "/home/hadoop/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", 
line 300, in get_return_value
py4j.protocol.Py4JJavaError14/11/01 22:07:07 INFO scheduler.DAGScheduler: 
Failed to run saveAsTextFile at NativeMethodAccessorImpl.java:-2
: An error occurred while calling o112.partitions.
: java.lang.OutOfMemoryError: GC overhead limit exceeded
 

11753,896: [Full GC 1194506K->947839K(1365504K), 2,1483780 secs]

14/11/01 22:07:09 INFO remote.RemoteActorRefProvider$RemotingTerminator: 
Shutting down remote daemon.
14/11/01 22:07:09 ERROR actor.ActorSystemImpl: Uncaught fatal error from thread 
[sparkDriver-akka.actor.default-dispatcher-381] shutting down ActorSystem 
[sparkDriver]
java.lang.OutOfMemoryError: GC overhead limit exceeded
14/11/01 22:07:09 ERROR actor.ActorSystemImpl: Uncaught fatal error from thread 
[sparkDriver-akka.actor.default-dispatcher-309] shutting down ActorSystem 
[sparkDriver]
java.lang.OutOfMemoryError: GC overhead limit exceeded
14/11/01 22:07:09 INFO remote.RemoteActorRefProvider$RemotingTerminator: Remote 
daemon shut down; proceeding with flushing remote transports.
14/11/01 22:07:09 INFO Remoting: Remoting shut down
14/11/01 22:07:09 INFO remote.RemoteActorRefProvider$RemotingTerminator: 
Remoting shut down.
14/11/01 22:07:09 INFO network.ConnectionManager: Removing ReceivingConnection 
to ConnectionManagerId(ip-172-31-18-35.us-west-2.compute.internal,55871)
14/11/01 22:07:09 INFO network.ConnectionManager: Removing SendingConnection to 
ConnectionManagerId(ip-172-31-18-35.us-west-2.compute.internal,55871)
14/11/01 22:07:09 INFO network.ConnectionManager: Removing SendingConnection to 
ConnectionManagerId(ip-172-31-18-35.us-west-2.compute.internal,55871)
14/11/01 22:07:09 INFO network.ConnectionManager: Key not valid ? 
sun.nio.ch.SelectionKeyImpl@5ca1c790
14/11/01 22:07:09 INFO network.ConnectionManager: key already cancelled ? 
sun.nio.ch.SelectionKeyImpl@5ca1c790
java.nio.channels.CancelledKeyException
at org.apache.spark.network.ConnectionManager.run(ConnectionManager.scala:386)
at 
org.apache.spark.network.ConnectionManager$$anon$4.run(ConnectionManager.scala:139)
14/11/01 22:07:09 INFO network.ConnectionManager: Removing SendingConnection to 
ConnectionManagerId(ip-172-31-18-35.us-west-2.compute.internal,52768)
14/11/01 22:07:09 INFO network.ConnectionManager: Removing ReceivingConnection 
to ConnectionManagerId(ip-172-31-18-35.us-west-2.compute.internal,52768)
14/11/01 22:07:09 ERROR network.ConnectionManager: Corresponding 
SendingConnection to 
ConnectionManagerId(ip-172-31-18-35.us-west-2.compute.internal,52768) not found
14/11/01 22:07:10 ERROR cluster.YarnClientSchedulerBackend: Yarn application 
already ended: FINISHED
14/11/01 22:07:10 INFO handler.ContextHandler: stopped 
o.e.j.s.ServletContextHandler{/metrics/json,null}
14/11/01 22:07:10 INFO handler.ContextHandler: stopped 
o.e.j.s.ServletContextHandler{/stages/stage/kill,null}
14/11/01 22:07:10 INFO handler.ContextHandler: stopped 
o.e.j.s.ServletContextHandler{/,null}
14/11/01 22:07:10 INFO handler.ContextHandler: stopped 
o.e.j.s.ServletContextHandler{/static,null}
14/11/01 22:07:10 INFO handler.ContextHandler: stopped 
o.e.j.s.ServletContextHandler{/executors/json,null}
14/11/01 22:07:10 INFO handler.ContextHandler: stopped 
o.e.j.s.ServletContextHandler{/executors,null}
14/11/01 22

Re: Spark speed performance

2014-11-02 Thread jan.zikes

Thank you. I would expect it to work as you write, but I seem to be experiencing it 
working the other way. It now looks like Spark is generally trying to fit everything 
into RAM. I run Spark on YARN and I have written this up as a separate question: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-on-Yarn-probably-trying-to-load-all-the-data-to-RAM-td17908.html
__


coalesce() is a streaming operation if used without the second parameter; it 
does not put all the data in RAM. If used with the second parameter (shuffle = 
true), it performs a shuffle, but still does not put all the data in RAM.
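
A minimal Scala sketch of the two forms described above (the input path and the
partition count are placeholders, not values from this thread):

// Without shuffle: partitions are merged as they stream through, so nothing
// forces the whole dataset into RAM.
val merged = sc.textFile("hdfs:///tmp/input").coalesce(10)

// With shuffle = true (equivalent to repartition(10)): a full shuffle happens,
// but shuffle data spills to disk, so this still does not need everything in RAM.
val reshuffled = sc.textFile("hdfs:///tmp/input").coalesce(10, shuffle = true)
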
On Sat, Nov 1, 2014 at 12:09 PM, > 
wrote:
Now I am running into problems using:

distData = sc.textFile(sys.argv[2]).coalesce(10)
 
The problem is that Spark seems to be trying to put all the data into RAM first and 
only then perform the coalesce. Do you know if there is something that would do the 
coalesce on the fly, for example with a fixed partition size? Do you think something 
like this is possible? Unfortunately, I am not able to find anything like this in the 
Spark documentation.

Thank you in advance for any advice or suggestions.

Best regards,
Jan 
__



Thank you very much. A lot of very small JSON files was exactly the performance 
problem; using coalesce, my Spark program on a single node is only about twice as slow 
(even including Spark startup) as the single-node Python program, which is acceptable.

Jan 
__

Because of the overhead between the JVM and Python, a single task will be
slower than your local Python script, but it's very easy to scale to
many CPUs.

Even on one CPU, it's not common for PySpark to be 100 times slower. You
have many small files; each file will be processed by a task, which
carries about 100 ms of overhead (scheduling and execution), whereas a small
file can be processed by your single-threaded Python script in less than
1 ms.

You could pack your JSON files into larger ones, or you could try to
merge the small tasks into larger ones with coalesce(N), for example:

distData = sc.textFile(sys.argv[2]).coalesce(10)  # which will have 10 partitions (tasks)

Davies

On Sat, Oct 18, 2014 at 12:07 PM,  > wrote:

Hi,

I have a program written for single-computer execution (in Python) and I have also
implemented the same thing for Spark. This program basically only reads .json files,
takes one field from them, and saves it back. Using Spark, my program runs
approximately 100 times slower on 1 master and 1 slave. So I would like to
ask where the problem might be.

My Spark program looks like:



sc = SparkContext(appName="Json data preprocessor")

distData = sc.textFile(sys.argv[2])

json_extractor = JsonExtractor(sys.argv[1])

cleanedData = distData.flatMap(json_extractor.extract_json)

cleanedData.saveAsTextFile(sys.argv[3])

JsonExtractor only selects the data from the field that is given by sys.argv[1].



My data is basically many small JSON files, with one JSON object per line.

I have tried both reading and writing the data from/to Amazon S3 and from/to local
disk on all the machines.

I would like to ask whether there is something I am missing, or whether Spark is
supposed to be this slow in comparison with a local, non-parallelized, single-node
program.



Thank you in advance for any suggestions or hints.



-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org

Re: Spark SQL : how to find element where a field is in a given set

2014-11-02 Thread Rishi Yadav
Did you create a SQLContext?
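
If not, here is a minimal sketch of the setup the DSL example further down assumes
(1.1-era API in the spark-shell, where sc is predefined; "src" is a placeholder table
name, and I am assuming the DSL implicits are exposed through the SQLContext import):

import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.catalyst.expressions.Expression

val sqlContext = new SQLContext(sc)
import sqlContext._  // implicit conversions, e.g. String -> Literal, used by Seq[Expression]("a", "b")

val longList = Seq[Expression]("a", "b")
table("src").where('key in (longList: _*))  // "src" is a placeholder

Without those implicits in scope, "a" and "b" stay plain Strings, which is exactly the
type mismatch reported below.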

On Sat, Nov 1, 2014 at 7:51 PM, abhinav chowdary  wrote:

> I have same requirement of passing list of values to in clause, when i am
> trying to do
>
> i am getting below error
>
> scala> val longList = Seq[Expression]("a", "b")
> :11: error: type mismatch;
>  found   : String("a")
>  required: org.apache.spark.sql.catalyst.expressions.Expression
>val longList = Seq[Expression]("a", "b")
>
> Thanks
>
>
> On Fri, Aug 29, 2014 at 3:52 PM, Michael Armbrust 
> wrote:
>
>> This feature was not part of that version.  It will be in 1.1.
>>
>>
>> On Fri, Aug 29, 2014 at 12:33 PM, Jaonary Rabarisoa 
>> wrote:
>>
>>>
>>> 1.0.2
>>>
>>>
>>> On Friday, August 29, 2014, Michael Armbrust 
>>> wrote:
>>>
 What version are you using?



 On Fri, Aug 29, 2014 at 2:22 AM, Jaonary Rabarisoa 
 wrote:

> Still not working for me. I got a compilation error : *value in is
> not a member of Symbol.* Any ideas ?
>
>
> On Fri, Aug 29, 2014 at 9:46 AM, Michael Armbrust <
> mich...@databricks.com> wrote:
>
>> To pass a list to a variadic function you can use the type ascription
>> :_*
>>
>> For example:
>>
>> val longList = Seq[Expression]("a", "b", ...)
>> table("src").where('key in (longList: _*))
>>
>> Also, note that I had to explicitly specify Expression as the type
>> parameter of Seq to ensure that the compiler converts "a" and "b" into
>> Spark SQL expressions.
>>
>>
>>
>>
>> On Thu, Aug 28, 2014 at 11:52 PM, Jaonary Rabarisoa <
>> jaon...@gmail.com> wrote:
>>
>>> OK, but what if I have a long list? Do I need to hard-code every element of my
>>> list like this, or is there a function that translates a list into a tuple?
>>>
>>>
>>> On Fri, Aug 29, 2014 at 3:24 AM, Michael Armbrust <
>>> mich...@databricks.com> wrote:
>>>
 You don't need the Seq, as in is a variadic function.

 personTable.where('name in ("foo", "bar"))



 On Thu, Aug 28, 2014 at 3:09 AM, Jaonary Rabarisoa <
 jaon...@gmail.com> wrote:

> Hi all,
>
> What is the expression that I should use with the Spark SQL DSL if I
> need to retrieve data with a field in a given set?
> For example :
>
> I have the following schema
>
> case class Person(name: String, age: Int)
>
> And I need to do something like :
>
> personTable.where('name in Seq("foo", "bar")) ?
>
>
> Cheers.
>
>
> Jaonary
>


>>>
>>
>

>>>
>>
>>
>
>
> --
> Warm Regards
> Abhinav Chowdary
>


Re: OOM with groupBy + saveAsTextFile

2014-11-02 Thread Bharath Ravi Kumar
Thanks for responding. This is what I initially suspected, and hence asked
why the library needed to construct the entire value buffer on a single
host before writing it out. The stacktrace appeared to suggest that user
code is not constructing the large buffer. I'm simply calling groupBy and
saveAsText on the resulting grouped rdd. The value after grouping is an
Iterable>. None of the strings are
large. I also do not need a single large string created out of the Iterable
for writing to disk. Instead, I expect the iterable to get written out in
chunks in response to saveAsText. This shouldn't be the default behaviour
of saveAsText perhaps? Hence my original question of the behavior of
saveAsText. The tuning / partitioning attempts were aimed at reducing
memory pressure so that multiple such buffers aren't constructed at the
same time on a host. I'll take a second look at the data and code before
updating this thread. Thanks.

None of your tuning will help here because the problem is actually the way
you are saving the output. If you take a look at the stacktrace, it is
trying to build a single string that is too large for the VM to allocate
memory. The VM is actually not running out of memory, but rather, JVM
cannot support a single String so large.

I suspect this is due to the fact that the value in your key, value pair
after group by is too long (maybe it concatenates every single record). Do
you really want to save the key, value output this way using a text file?
Maybe you can write them out as multiple strings rather than a single super
giant string.




On Sat, Nov 1, 2014 at 9:52 PM, arthur.hk.c...@gmail.com <
arthur.hk.c...@gmail.com> wrote:

>
> Hi,
>
> FYI as follows.  Could you post your heap size settings as well your Spark
> app code?
>
> Regards
> Arthur
>
> 3.1.3 Detail Message: Requested array size exceeds VM limit. The detail
> message "Requested array size exceeds VM limit" indicates that the
> application (or APIs used by that application) attempted to allocate an
> array that is larger than the heap size. For example, if an application
> attempts to allocate an array of 512MB but the maximum heap size is 256MB
> then OutOfMemoryError will be thrown with the reason Requested array size
> exceeds VM limit. In most cases the problem is either a configuration
> issue (heap size too small), or a bug that results in an application
> attempting to create a huge array, for example, when the number of elements
> in the array are computed using an algorithm that computes an incorrect
> size."
>
>
>
>
> On 2 Nov, 2014, at 12:25 pm, Bharath Ravi Kumar 
> wrote:
>
> Resurfacing the thread. Oom shouldn't be the norm for a common groupby /
> sort use case in a framework that is leading in sorting bench marks? Or is
> there something fundamentally wrong in the usage?
> On 02-Nov-2014 1:06 am, "Bharath Ravi Kumar"  wrote:
>
>> Hi,
>>
>> I'm trying to run groupBy(function) followed by saveAsTextFile on an RDD
>> of count ~ 100 million. The data size is 20GB and groupBy results in an RDD
>> of 1061 keys with values being Iterable> String>>. The job runs on 3 hosts in a standalone setup with each host's
>> executor having 100G RAM and 24 cores dedicated to it. While the groupBy
>> stage completes successfully with ~24GB of shuffle write, the
>> saveAsTextFile fails after repeated retries with each attempt failing due
>> to an out of memory error *[1]*. I understand that a few partitions may
>> be overloaded as a result of the groupBy and I've tried the following
>> config combinations unsuccessfully:
>>
>> 1) Repartition the initial rdd (44 input partitions but 1061 keys) across
>> 1061 paritions and have max cores = 3 so that each key is a "logical"
>> partition (though many partitions will end up on very few hosts), and each
>> host likely runs saveAsTextFile on a single key at a time due to max cores
>> = 3 with 3 hosts in the cluster. The level of parallelism is unspecified.
>>
>> 2) Leave max cores unspecified, set the level of parallelism to 72, and
>> leave number of partitions unspecified (in which case the # input
>> partitions was used, which is 44)
>> Since I do not intend to cache RDD's, I have set
>> spark.storage.memoryFraction=0.2 in both cases.
>>
>> My understanding is that if each host is processing a single logical
>> partition to saveAsTextFile and is reading from other hosts to write out
>> the RDD, it is unlikely that it would run out of memory. My interpretation
>> of the spark tuning guide is that the degree of parallelism has little
>> impact in case (1) above since max cores = number of hosts. Can someone
>> explain why there are still OOM's with 100G being available? On a related
>> note, intuitively (though I haven't read the source), it appears that an
>> entire key-value pair needn't fit into memory of a single host for
>> saveAsTextFile since a single shuffle read from a remote can be written to
>> HDFS before the next remote read is carried out. This way, not all data
>> needs t

Re: properties file on a spark cluster

2014-11-02 Thread Akhil Das
The problem here is that when you run a Spark program in cluster mode, it will
look for the file on the worker machines. The best approach would be to put the
file in HDFS and use that instead of a local path. Another approach would be to
create the same file at the same path on all worker machines and hope that it
is picked up from there.
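
A third option, not mentioned above, is to ship the file with the job via
SparkContext.addFile and resolve the local copy on each executor with SparkFiles.get.
A hedged Scala sketch (the job in this thread is Python, where pyspark's SparkFiles
works analogously; all paths and the property name are placeholders):

import org.apache.spark.{SparkConf, SparkContext, SparkFiles}

object PropsFileSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("props-file-sketch"))

    // Ship a local properties file to every executor along with the job.
    sc.addFile("/local/path/app.properties")            // placeholder path

    val processed = sc.textFile("hdfs:///tmp/input")    // placeholder input
      .mapPartitions { iter =>
        // On each executor, resolve the distributed copy by its file name.
        val props = new java.util.Properties()
        val in = new java.io.FileInputStream(SparkFiles.get("app.properties"))
        try props.load(in) finally in.close()
        val prefix = props.getProperty("prefix", "")    // placeholder property
        iter.map(prefix + _)
      }

    processed.saveAsTextFile("hdfs:///tmp/output")      // placeholder output
    sc.stop()
  }
}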

Thanks
Best Regards

On Fri, Oct 31, 2014 at 10:32 PM, Daniel Takabayashi <
takabaya...@scanboo.com.br> wrote:

> Hi Guys,
>
> I'm trying to execute a spark job using python, running on a cluster of
> Yarn (managed by cloudera manager). The python script is using a set of
> python programs installed in each member of cluster. This set of programs
> need an property file found by a local system path.
>
> My problem is:  When this script is sent, using spark-submit, the programs
> can't find this properties file. Running locally as stand-alone job, is no
> problem, the properties file is found.
>
> My questions is:
>
> 1 - What is the problem here ?
> 2 - In this scenario (an script running on a spark yarn cluster that use
> python programs that share same properties file) what is the best approach ?
>
>
> Thank's
> taka
>


Re: ExecutorLostFailure (executor lost)

2014-11-02 Thread Akhil Das
You can check the worker logs for more accurate information (they are found
under the work directory inside the Spark directory). I used to hit this
issue with:

- Too many open files: increasing the ulimit would solve this issue
- Akka connection timeout/frame size: setting the following while creating
the SparkContext would solve it

.set("spark.rdd.compress", "true")
.set("spark.storage.memoryFraction", "1")
.set("spark.core.connection.ack.wait.timeout", "600")
.set("spark.akka.frameSize", "50")
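
For context, a sketch of where those settings go when the context is built (the
values are simply the ones quoted above, not tuned recommendations, and the app
name is a placeholder):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("my-app")                                  // placeholder
  .set("spark.rdd.compress", "true")
  .set("spark.storage.memoryFraction", "1")
  .set("spark.core.connection.ack.wait.timeout", "600")
  .set("spark.akka.frameSize", "50")
val sc = new SparkContext(conf)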




Thanks
Best Regards

On Sat, Nov 1, 2014 at 12:28 AM,  wrote:

> Hi,
>
> I am running my Spark job and I am getting ExecutorLostFailure (executor
> lost) using PySpark. I don't get any error in my code, just this. So I
> would like to ask what could possibly be wrong. From the log it seems like
> some kind of internal problem in Spark.
>
> Thank you in advance for any suggestions and help.
>
>  2014-10-31 18:13:11,423 : INFO : spark:track_progress:300 : Traceback
> (most recent call last):
>
> INFO: File "/home/hadoop/preprocessor.py", line 69, in 
>
> 2014-10-31 18:13:11,423 : INFO : spark:track_progress:300 : File
> "/home/hadoop/preprocessor.py", line 69, in 
>
>
>
> INFO: cleanedData.saveAsTextFile(sys.argv[3])
>
> 2014-10-31 18:13:11,423 : INFO : spark:track_progress:300 :
> cleanedData.saveAsTextFile(sys.argv[3])
>
> INFO: File "/home/hadoop/spark/python/pyspark/rdd.py", line 1324, in
> saveAsTextFile
>
> 2014-10-31 18:13:11,424 : INFO : spark:track_progress:300 : File
> "/home/hadoop/spark/python/pyspark/rdd.py", line 1324, in saveAsTextFile
>
> INFO: keyed._jrdd.map(self.ctx._jvm.BytesToString()).saveAsTextFile(path)
>
> 2014-10-31 18:13:11,424 : INFO : spark:track_progress:300 :
> keyed._jrdd.map(self.ctx._jvm.BytesToString()).saveAsTextFile(path)
>
> INFO: File
> "/home/hadoop/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py",
> line 538, in __call__
>
> 2014-10-31 18:13:11,424 : INFO : spark:track_progress:300 : File
> "/home/hadoop/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py",
> line 538, in __call__
>
> INFO: File
> "/home/hadoop/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", line
> 300, in get_return_value
>
> 2014-10-31 18:13:11,424 : INFO : spark:track_progress:300 : File
> "/home/hadoop/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", line
> 300, in get_return_value
>
> INFO: py4j.protocol.Py4JJavaError: An error occurred while calling
> o47.saveAsTextFile.
>
> 2014-10-31 18:13:11,431 : INFO : spark:track_progress:300 :
> py4j.protocol.Py4JJavaError: An error occurred while calling
> o47.saveAsTextFile.
>
> INFO: : org.apache.spark.SparkException: Job aborted due to stage failure:
> Task 20 in stage 0.0 failed 4 times, most recent failure: Lost task 20.3 in
> stage 0.0 (TID 110, ip-172-31-26-147.us-west-2.compute.internal):
> ExecutorLostFailure (executor lost)
>
> 2014-10-31 18:13:11,431 : INFO : spark:track_progress:300 : :
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 20
> in stage 0.0 failed 4 times, most recent failure: Lost task 20.3 in stage
> 0.0 (TID 110, ip-172-31-26-147.us-west-2.compute.internal):
> ExecutorLostFailure (executor lost)
>
> INFO: Driver stacktrace:
>
> 2014-10-31 18:13:11,431 : INFO : spark:track_progress:300 : Driver
> stacktrace:
>
> INFO: at org.apache.spark.scheduler.DAGScheduler.org
> $apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1185)
>
> 2014-10-31 18:13:11,432 : INFO : spark:track_progress:300 : at
> org.apache.spark.scheduler.DAGScheduler.org
> $apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1185)
>
> INFO: at
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1174)
>
> 2014-10-31 18:13:11,432 : INFO : spark:track_progress:300 : at
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1174)
>
> INFO: at
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1173)
>
> 2014-10-31 18:13:11,432 : INFO : spark:track_progress:300 : at
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1173)
>
> INFO: at
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>
> 2014-10-31 18:13:11,432 : INFO : spark:track_progress:300 : at
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>
> INFO: at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>
> 2014-10-31 18:13:11,432 : INFO : spark:track_progress:300 : at
> scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>
> INFO: at
> org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1173)
>
> 2014-10-31 18:13:11,433 : INFO : spark:track_progress:300 : at
> org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1173)
>
> INFO: at
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.app

Re: Cannot instantiate hive context

2014-11-02 Thread Akhil Das
Adding the libthrift jar to the classpath would resolve this issue.

Thanks
Best Regards

On Sat, Nov 1, 2014 at 12:34 AM, Pala M Muthaia  wrote:

> Hi,
>
> I am trying to load hive datasets using HiveContext, in spark shell. Spark
> ver 1.0.1 and Hive ver 0.12.
>
> We are trying to get Spark to work with Hive datasets. I already have an
> existing Spark deployment. The following is what I did on top of that:
> 1. Build spark using 'mvn -Pyarn,hive -Phadoop-2.4 -Dhadoop.version=2.4.0
> -DskipTests clean package'
> 2. Copy over spark-assembly-1.0.1-hadoop2.4.0.jar into spark deployment
> directory.
> 3. Launch spark-shell with the spark hive jar included in the list.
>
> When I execute
>
> val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
>
> I get the following error stack:
>
> java.lang.NoClassDefFoundError: org/apache/thrift/TBase
> at java.lang.ClassLoader.defineClass1(Native Method)
> at java.lang.ClassLoader.defineClass(ClassLoader.java:792)
> at
> java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
> 
> at
> org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:303)
> at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:55)
> at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> Caused by: java.lang.ClassNotFoundException: org.apache.thrift.TBase
> at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
> at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
> at java.security.AccessController.doPrivileged(Native Method)
> at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
> at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
> ... 55 more
>
> I thought that building with the -Phive option should include all the
> necessary Hive packages in the assembly jar (according to the documentation).
> I tried searching online and in this mailing list archive but haven't found
> any instructions on how to get this working.
>
> I know that there is additional step of updating the assembly jar across
> the whole cluster, not just client side, but right now, even the client is
> not working.
>
> Would appreciate instructions (or link to them) on how to get this working
> end-to-end.
>
>
> Thanks,
> pala
>


Re: hadoop_conf_dir when running spark on yarn

2014-11-02 Thread Akhil Das
You can set HADOOP_CONF_DIR inside the spark-env.sh file

Thanks
Best Regards

On Sat, Nov 1, 2014 at 4:14 AM, ameyc  wrote:

> How do I set up HADOOP_CONF_DIR correctly when I'm running my Spark job on
> YARN? My YARN environment has the correct HADOOP_CONF_DIR settings, but the
> configuration that I pull from sc.hadoopConfiguration() is incorrect.
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/hadoop-conf-dir-when-running-spark-on-yarn-tp17872.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: OOM with groupBy + saveAsTextFile

2014-11-02 Thread Sean Owen
saveAsText means "save every element of the RDD as one line of text".
It works like TextOutputFormat in Hadoop MapReduce since that's what
it uses. So you are causing it to create one big string out of each
Iterable this way.
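
A minimal Scala sketch of the workaround this implies: flatten each group before
saving, so saveAsTextFile writes one modest line per element instead of one enormous
String per key. The object name, toy input, and output path below are illustrative,
not taken from the thread:

import org.apache.spark.{SparkConf, SparkContext}

object GroupBySaveSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("groupby-save-sketch"))

    val records = sc.parallelize(Seq("a,1", "a,2", "b,3"))   // stand-in for the real input
    val grouped = records.groupBy(_.split(",")(0))           // RDD[(String, Iterable[String])]

    // One output line per value, instead of one giant concatenated line per key.
    grouped
      .flatMap { case (key, values) => values.iterator.map(v => key + "\t" + v) }
      .saveAsTextFile("hdfs:///tmp/groupby-output")          // placeholder path

    sc.stop()
  }
}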

On Sun, Nov 2, 2014 at 4:48 PM, Bharath Ravi Kumar  wrote:
> Thanks for responding. This is what I initially suspected, and hence asked
> why the library needed to construct the entire value buffer on a single host
> before writing it out. The stacktrace appeared to suggest that user code is
> not constructing the large buffer. I'm simply calling groupBy and saveAsText
> on the resulting grouped rdd. The value after grouping is an
> Iterable>. None of the strings are
> large. I also do not need a single large string created out of the Iterable
> for writing to disk. Instead, I expect the iterable to get written out in
> chunks in response to saveAsText. This shouldn't be the default behaviour of
> saveAsText perhaps? Hence my original question of the behavior of
> saveAsText. The tuning / partitioning attempts were aimed at reducing memory
> pressure so that multiple such buffers aren't constructed at the same time
> on a host. I'll take a second look at the data and code before updating this
> thread. Thanks.
>

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



RE: Prediction using Classification with text attributes in Apache Spark MLLib

2014-11-02 Thread ashu
Hi,
Sorry to bounce back this old thread.
What is the state now? Is this problem solved? How does Spark handle categorical
data now?

Regards, 
Ashutosh



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Prediction-using-Classification-with-text-attributes-in-Apache-Spark-MLLib-tp8166p17919.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Prediction using Classification with text attributes in Apache Spark MLLib

2014-11-02 Thread Xiangrui Meng
This operation requires two transformers:

1) Indexer, which maps string features into categorical features
2) OneHotEncoder, which flattens categorical features into binary features

We are working on the new dataset implementation, so we can easily
express those transformations. Sorry for the delay! If you want a quick and
dirty solution, you can try hashing:

val rdd: RDD[(Double, Array[String])] = ...
val training = rdd.mapValues { factors =>
  val indices = mutable.Set.empty[Int]
  factors.view.zipWithIndex.foreach { case (f, idx) =>  // pattern-match the (feature, index) pair
    indices += math.abs(f.## ^ idx) % 10                // hash feature and position into 10 buckets
  }
  Vectors.sparse(10, indices.toSeq.map(x => (x, 1.0)))
}

It creates a training dataset with all binary features, with a chance
of collision. You can use it in SVM, LR, or DecisionTree.
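
As a hedged follow-up sketch, the hashed vectors above can be fed straight into one
of those learners, for example logistic regression (1.1-era MLlib API; this assumes
the Double keys in the original rdd are 0/1 labels):

import org.apache.spark.SparkContext._
import org.apache.spark.mllib.classification.LogisticRegressionWithSGD
import org.apache.spark.mllib.regression.LabeledPoint

val labeled = training.map { case (label, features) => LabeledPoint(label, features) }
labeled.cache()

val model = LogisticRegressionWithSGD.train(labeled, 100)  // 100 iterations, defaults otherwise
val accuracy = labeled.map(p => if (model.predict(p.features) == p.label) 1.0 else 0.0).mean()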

Best,
Xiangrui

On Sun, Nov 2, 2014 at 9:20 AM, ashu  wrote:
> Hi,
> Sorry to bounce back this old thread.
> What is the state now? Is this problem solved? How does Spark handle categorical
> data now?
>
> Regards,
> Ashutosh
>
>
>
> --
> View this message in context: 
> http://apache-spark-user-list.1001560.n3.nabble.com/Prediction-using-Classification-with-text-attributes-in-Apache-Spark-MLLib-tp8166p17919.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Spark Master Web UI showing "0 cores" in Completed Applications

2014-11-02 Thread Justin Yip
Hello,

I have a question about the "Completed Applications" table on the Spark
Master web UI page.

For the column "Cores", it used to show the number of cores used by the
application. However, after I added a line "sparkContext.stop()" at the end of
my Spark app, it shows "0 cores".

My application works perfectly fine: while it is running, the "Running
Applications" table shows that it is using "8 Cores", but once it has ended,
the "Completed Applications" table shows "0 Cores".

Has anyone experienced similar issue?

Thanks.

Justin


How do I kill a job submitted with spark-submit

2014-11-02 Thread Steve Lewis
I see the job in the web interface but don't know how to kill it


Re: hadoop_conf_dir when running spark on yarn

2014-11-02 Thread Amey Chaugule
I thought that only applied when you're trying to run a job using
spark-submit or in the shell...

On Sun, Nov 2, 2014 at 8:47 AM, Akhil Das 
wrote:

> You can set HADOOP_CONF_DIR inside the spark-env.sh file
>
> Thanks
> Best Regards
>
> On Sat, Nov 1, 2014 at 4:14 AM, ameyc  wrote:
>
>> How do I set up HADOOP_CONF_DIR correctly when I'm running my Spark job on
>> YARN? My YARN environment has the correct HADOOP_CONF_DIR settings, but the
>> configuration that I pull from sc.hadoopConfiguration() is incorrect.
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/hadoop-conf-dir-when-running-spark-on-yarn-tp17872.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>>
>


-- 
Amey S. Chaugule

B.S Computer Engineering and Mathematics '11,
University of Illinois at Urbana-Champaign


Do Spark executors restrict native heap vs JVM heap?

2014-11-02 Thread Paul Wais
Thanks Sean! My novice understanding is that the 'native heap' is the
address space not allocated to the JVM heap, but I wanted to check to see
if I'm missing something.  I found out my issue appeared to be actual
memory pressure on the executor machine.  There was space for the JVM heap
but not much more.

On Thu, Oct 30, 2014 at 12:49 PM, Sean Owen > wrote:
> No, but, the JVM also does not allocate memory for native code on the
heap.
> I dont think heap has any bearing on whether your native code can't
allocate
> more memory except that of course the heap is also taking memory.
>
> On Oct 30, 2014 6:43 PM, "Paul Wais" >
wrote:
>>
>> Dear Spark List,
>>
>> I have a Spark app that runs native code inside map functions.  I've
>> noticed that the native code sometimes sets errno to ENOMEM indicating
>> a lack of available memory.  However, I've verified that the /JVM/ has
>> plenty of heap space available-- Runtime.getRuntime().freeMemory()
>> shows gigabytes free and the native code needs only megabytes.  Does
>> spark limit the /native/ heap size somehow?  Am poking through the
>> executor code now but don't see anything obvious.
>>
>> Best Regards,
>> -Paul Wais
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org 
>> For additional commands, e-mail: user-h...@spark.apache.org

>>
>


Spark SQL takes unexpected time

2014-11-02 Thread Shailesh Birari
Hello,

I have written a Spark SQL application which reads data from HDFS and queries it.
The data size is around 2 GB (30 million records). The schema and query I am
running are shown below.
The query takes around 5+ seconds to execute.
I tried adding
   rdd.persist(StorageLevel.MEMORY_AND_DISK)
and
   rdd.cache()
but in both cases it takes extra time, even if I issue the query below as the
second query on the data (assuming Spark would have cached it for the first query).

case class EventDataTbl(ID: String, 
ONum: String,
RNum: String,
Timestamp: String,
Duration: String,
Type: String,
Source: String,
OName: String,
RName: String)

sql("SELECT COUNT(*) AS Frequency,ONum,OName,RNum,RName FROM EventDataTbl
GROUP BY ONum,OName,RNum,RName ORDER BY Frequency DESC LIMIT
10").collect().foreach(println)

Can you let me know if I am missing anything ?
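
One thing that may also be worth trying: Spark SQL caches at the table level, so the
usual pattern is to cache the registered table rather than persisting the underlying
RDD. A hedged sketch (eventData and sqlContext are assumed names for the SchemaRDD of
EventDataTbl rows and the SQL context; registerTempTable is the 1.1-era call):

eventData.registerTempTable("EventDataTbl")
sqlContext.cacheTable("EventDataTbl")   // the columnar cache is built lazily by the first query that touches it

sql("SELECT COUNT(*) AS Frequency,ONum,OName,RNum,RName FROM EventDataTbl GROUP BY ONum,OName,RNum,RName ORDER BY Frequency DESC LIMIT 10").collect().foreach(println)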

Thanks,
  Shailesh




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-takes-unexpected-time-tp17925.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Does SparkSQL work with custom defined SerDe?

2014-11-02 Thread Chirag Aggarwal
Did https://issues.apache.org/jira/browse/SPARK-3807 fix the issue you were seeing?
If yes, please note that it will be part of 1.1.1 and 1.2.

Chirag

From: Chen Song <chen.song...@gmail.com>
Date: Wednesday, 15 October 2014 4:03 AM
To: "user@spark.apache.org" <user@spark.apache.org>
Subject: Re: Does SparkSQL work with custom defined SerDe?

Looks like it may be related to 
https://issues.apache.org/jira/browse/SPARK-3807.

I will build from branch 1.1 to see if the issue is resolved.

Chen

On Tue, Oct 14, 2014 at 10:33 AM, Chen Song <chen.song...@gmail.com> wrote:
Sorry for bringing this out again, as I have no clue what could have caused 
this.

I turned on DEBUG logging and did see the jar containing the SerDe class was 
scanned.

More interestingly, I saw the same exception 
(org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Unresolved 
attributes) when running a simple select on both valid column names and malformed 
column names. This led me to suspect that something breaks syntactically somewhere.

select [valid_column] from table limit 5;
select [malformed_typo_column] from table limit 5;


On Mon, Oct 13, 2014 at 6:04 PM, Chen Song <chen.song...@gmail.com> wrote:
In Hive, the table was created with custom SerDe, in the following way.

row format serde "abc.ProtobufSerDe"

with serdeproperties ("serialization.class"="abc.protobuf.generated.LogA$log_a")

When I start spark-sql shell, I always got the following exception, even for a 
simple query.

select user from log_a limit 25;

I can desc the table without any problem. When I explain the query, I got the 
same exception.


14/10/13 22:01:13 INFO impl.AMRMClientImpl: Waiting for application to be 
successfully unregistered.

Exception in thread "Driver" java.lang.reflect.InvocationTargetException

at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)

at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)

at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)

at java.lang.reflect.Method.invoke(Method.java:606)

at 
org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:162)

Caused by: org.apache.spark.sql.catalyst.errors.package$TreeNodeException: 
Unresolved attributes: 'user, tree:

Project ['user]

 Filter (dh#4 = 2014-10-13 05)

  LowerCaseSchema

   MetastoreRelation test, log_a, None


at 
org.apache.spark.sql.catalyst.analysis.Analyzer$CheckResolution$$anonfun$apply$1.applyOrElse(Analyzer.scala:72)

at 
org.apache.spark.sql.catalyst.analysis.Analyzer$CheckResolution$$anonfun$apply$1.applyOrElse(Analyzer.scala:70)

at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:165)

at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:183)

at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)

at scala.collection.Iterator$class.foreach(Iterator.scala:727)

at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)

at 
scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)

at 
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)

at 
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)

at 
scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)

at 
scala.collection.AbstractIterator.to(Iterator.scala:1157)

at 
scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)

at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)

at 
scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)

at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)

at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformChildrenDown(TreeNode.scala:212)

at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:168)

at 
org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:156)

at 
org.apache.spark.sql.catalyst.analysis.Analyzer$CheckResolution$.apply(Analyzer.scala:70)

at 
org.apache.spark.sql.catalyst.analysis.Analyzer$CheckResolution$.apply(Analyzer.scala:68)

at 
org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1$$anonfun$apply$2.apply(RuleExecutor.scala:61)

at 
org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1$$anonfun$apply$2.apply(RuleExecutor.scala:59)

at 
scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:51)

at 
scala.collection.IndexedSeqOptimized$class.foldLeft(IndexedSeqOptimized.scala:60)

at scala.collection.mutable.WrappedArray.foldLeft(WrappedArray.scala:34)

at 
org.apache.spark.sql.catalyst.r

Spark cluster stability

2014-11-02 Thread jatinpreet
Hi,

I am running a small 6 node spark cluster for testing purposes. Recently,
one of the node's physical memory was filled up by temporary files and there
was no space left on the disk. Because of this, my Spark jobs started failing,
even though the node was still shown as 'Alive' on the Spark Web UI. Once I logged on to
the machine and cleaned up some trash, I was able to run the jobs again.

My question is, how reliable can my Spark cluster be if issues like these
can bring down my jobs? I would have expected Spark to stop using this node, or
at least distribute the work to other nodes. But as the node was still marked
alive, Spark kept trying to run tasks on it regardless.

Thanks,
Jatin



-
Novice Big Data Programmer
--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-cluster-stability-tp17929.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Do Spark executors restrict native heap vs JVM heap?

2014-11-02 Thread Sean Owen
Yes, that's correct to my understanding and the probable explanation of
your issue. There are no additional limits or differences from how the JVM
works here.
On Nov 3, 2014 4:40 AM, "Paul Wais"  wrote:

> Thanks Sean! My novice understanding is that the 'native heap' is the
> address space not allocated to the JVM heap, but I wanted to check to see
> if I'm missing something.  I found out my issue appeared to be actual
> memory pressure on the executor machine.  There was space for the JVM heap
> but not much more.
>
> On Thu, Oct 30, 2014 at 12:49 PM, Sean Owen  wrote:
> > No, but, the JVM also does not allocate memory for native code on the
> heap.
> > I dont think heap has any bearing on whether your native code can't
> allocate
> > more memory except that of course the heap is also taking memory.
> >
> > On Oct 30, 2014 6:43 PM, "Paul Wais"  wrote:
> >>
> >> Dear Spark List,
> >>
> >> I have a Spark app that runs native code inside map functions.  I've
> >> noticed that the native code sometimes sets errno to ENOMEM indicating
> >> a lack of available memory.  However, I've verified that the /JVM/ has
> >> plenty of heap space available-- Runtime.getRuntime().freeMemory()
> >> shows gigabytes free and the native code needs only megabytes.  Does
> >> spark limit the /native/ heap size somehow?  Am poking through the
> >> executor code now but don't see anything obvious.
> >>
> >> Best Regards,
> >> -Paul Wais
> >>
> >> -
> >> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> >> For additional commands, e-mail: user-h...@spark.apache.org
> >>
> >
>
>


Re: Spark cluster stability

2014-11-02 Thread Akhil Das
You can enable monitoring (e.g. Nagios) with alerts to tackle these kinds of
issues.

Thanks
Best Regards

On Mon, Nov 3, 2014 at 10:55 AM, jatinpreet  wrote:

> Hi,
>
> I am running a small 6 node spark cluster for testing purposes. Recently,
> one of the node's physical memory was filled up by temporary files and
> there
> was no space left on the disk. Because of this, my Spark jobs started failing,
> even though the node was still shown as 'Alive' on the Spark Web UI. Once I logged on to
> the machine and cleaned up some trash, I was able to run the jobs again.
>
> My question is, how reliable my Spark cluster can be if issues like these
> can bring down my jobs? I would have expected Spark to not use this node or
> at least distribute this work to other nodes. But as the node was still
> alive, it tried to run tasks on it regardless.
>
> Thanks,
> Jatin
>
>
>
> -
> Novice Big Data Programmer
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-cluster-stability-tp17929.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Parquet files are only 6-20MB in size?

2014-11-02 Thread ag007
Hi there,

I have a PySpark job that simply takes a tab-separated CSV and outputs it to a
Parquet file. The code is based on the SQL write-parquet example (using a
different inferred schema, with only 35 columns). The input files range
from 100 MB to 12 GB.

I have tried different block sizes from 10 MB through to 1 GB, and I have tried
different levels of parallelism. The part files show about 1:5 compression in
total.

I am trying to get large Parquet files. Having this many small files will
cause problems for my name node; I have over 500,000 of these files.
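
For what it's worth, a hedged sketch of one way to get fewer, larger files: coalesce
to a small number of partitions before the Parquet write, since each partition becomes
one part file. (This is the Scala equivalent of the PySpark job described above; the
two-column schema and the paths are placeholders for the real 35-column schema.)

import org.apache.spark.sql.SQLContext

case class Record(a: String, b: String)                        // placeholder for the 35-column schema

val sqlContext = new SQLContext(sc)
import sqlContext.createSchemaRDD                              // implicit RDD[Record] -> SchemaRDD

val rows = sc.textFile("hdfs:///tmp/input.tsv")                // placeholder input path
  .map(_.split("\t"))
  .map(f => Record(f(0), f(1)))

// Fewer, larger partitions -> fewer, larger Parquet part files.
rows.coalesce(8).saveAsParquetFile("hdfs:///tmp/out.parquet")  // placeholder output path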

Your assistance would be greatly appreciated.

cheers,
Ag

PS: Another solution may be a Parquet concat tool, if one exists; I couldn't
find one. I understand that such a tool would have to adjust the footer.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Parquet-files-are-only-6-20MB-in-size-tp17935.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



graph x extracting the path

2014-11-02 Thread dizzy5112
Hi all, just wondering if there is a way to extract paths in GraphX. For
example, if I have the graph attached, I would like to return results
along the lines of:

101 -> 103
101 -> 104 -> 108
102 -> 105
102 -> 106 -> 107

 



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/graph-x-extracting-the-path-tp17936.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org