Hi,
I just recently tried to migrate from Spark 1.1 to Spark 1.2, using
PySpark. Initially I was very glad, noticing that Spark 1.2 is much faster
than Spark 1.1. However, the initial joy faded quickly when I noticed that
my jobs no longer terminate successfully. Using Spark
It seems like a bug rather than a feature.
I filed a bug report: https://issues.apache.org/jira/browse/SPARK-5363
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-1-1-slow-working-Spark-1-2-fast-freezing-tp21278p21317.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
Hi,
Using Spark 1.2, I ran into issues setting SPARK_LOCAL_DIRS to a path other
than the local directory.
On our cluster we have a folder for temporary files (in a central file
system), which is called /scratch.
When setting SPARK_LOCAL_DIRS=/scratch/
I get:
An error occurred while calling
z:o
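For reference, a minimal sketch (not from this thread) of pointing Spark's scratch space at a different directory programmatically via the spark.local.dir property; the /scratch/spark-tmp path is hypothetical, and on a cluster a worker-side SPARK_LOCAL_DIRS environment setting overrides this property.

from pyspark import SparkConf, SparkContext

# hypothetical path on the shared /scratch file system
conf = (SparkConf()
        .setAppName("LocalDirSketch")
        .set("spark.local.dir", "/scratch/spark-tmp"))
sc = SparkContext(conf=conf)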
Hi,
I am trying to broadcast large objects (on the order of a couple of hundred MB).
However, I keep getting errors when trying to do so:
Traceback (most recent call last):
File "/LORM_experiment.py", line 510, in
broadcast_gradient_function = sc.broadcast(gradient_function)
File "/scratch/users/2
Yes, thanks, great. This seems to be the issue.
At least running with spark-submit works as well.
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/Using-Hadoop-InputFormat-in-Python-tp12067p12126.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
Hi,
is there a way to group items in an RDD together such that I can
process them using parallelize/map?
Let's say I have data items with keys 1...1000, e.g. loaded as
RDD = sc.newAPIHadoopFile(...).cache()
Now, I would like them to be processed in chunks of, e.g., tens:
chunk1=[0..9],ch
Thanks a lot. Yes, mapPartitions seems a better way of dealing with this
problem, as with groupBy() I would need to collect() the data before applying
parallelize(), which is expensive.
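A rough sketch of the mapPartitions approach discussed above; the process_chunk function and the record contents are hypothetical, and repartition() here only controls how many records end up in each chunk, not which keys they contain.

from pyspark import SparkContext

sc = SparkContext(appName="ChunkSketch")

# hypothetical stand-in for sc.newAPIHadoopFile(...): 1000 keyed records
rdd = sc.parallelize([(k, "payload-%d" % k) for k in range(1000)])

def process_chunk(records):
    # 'records' is an iterator over one partition; materialize it as a chunk
    chunk = list(records)
    # hypothetical batch computation over the whole chunk at once
    yield (len(chunk), sum(key for key, _ in chunk))

# ~10 records per partition, each partition processed as one chunk
results = rdd.repartition(100).mapPartitions(process_chunk).collect()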
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/RDD-Grouping-tp12407p12424.html
Hi,
I wonder if there is something like a (row) index for the elements in the
RDD. Specifically, my RDD is generated from a series of files, where the
value corresponds to the file contents. Ideally, I would like the keys
to be an enumeration of the file number, e.g. (0,),(1,).
Any idea?
T
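One possibility, not necessarily what was suggested in the thread: newer PySpark versions provide zipWithIndex(), which attaches a running index to the elements. The wholeTextFiles() call below is only a hypothetical stand-in for the actual input.

from pyspark import SparkContext

sc = SparkContext(appName="FileIndexSketch")

# hypothetical input: one record per file, value = file contents
files = sc.wholeTextFiles("/data/my_files/*")

# attach a running index and make it the key: (0, contents0), (1, contents1), ...
indexed = (files.values()
                .zipWithIndex()
                .map(lambda pair: (pair[1], pair[0])))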
Hi,
I am using Spark in Python. I wonder if there is a possibility of passing
extra arguments to the mapping function. In my scenario, after each map I
update parameters, which I want to use in the following iteration of
mapping. Any idea?
Thanks in advance.
-Tassilo
--
View this messa
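The example suggested in the thread is not shown in this digest; as a generic illustration, here is a sketch of one common way to pass extra, updating parameters into a mapped function with functools.partial. The score function and its parameters are hypothetical.

from functools import partial
from pyspark import SparkContext

sc = SparkContext(appName="ExtraArgsSketch")

# hypothetical mapping function needing extra parameters besides the record
def score(record, weights, offset):
    return record * weights + offset

rdd = sc.parallelize(range(10))

weights, offset = 1.0, 0.0
for i in range(3):
    # bind the current parameters into the function handed to map()
    total = rdd.map(partial(score, weights=weights, offset=offset)).sum()
    # update the parameters for the next mapping iteration
    weights *= 0.9
    offset = total / rdd.count()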
Thanks. That's pretty much what I need.
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/Mapping-with-extra-arguments-tp12541p12548.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
Thanks for the nice example.
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/Mapping-with-extra-arguments-tp12541p12549.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
---
Thanks. As my files are defined to be non-splittable, I eventually ended up
using mapPartitionsWithIndex(), taking the split ID as the index:

def g(splitIndex, iterator):
    # one file per partition, so the split index enumerates the files
    yield (splitIndex, next(iterator))

myRDD.mapPartitionsWithIndex(g)
--
View this message in context:
http://apache-spark
Hi,
I'd like to run my Python script using "spark-submit" together with a JAR
file containing Java specifications for a Hadoop file system. How can I do
that? It seems I can either provide a JAR file or a Python file to
spark-submit.
So far I have been running my code in IPython with IPYTHON_OPTS
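One hedged possibility, not confirmed in this thread: spark-submit takes the Python script as the main application and can ship additional JARs via its --jars option, roughly (paths are hypothetical):

./bin/spark-submit \
  --jars /path/to/hadoop-inputformat.jar \
  my_script.py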
Hi there,
I am trying to run Spark on a YARN-managed cluster using Python (which
requires yarn-client mode). However, I cannot get it running (the same
happens with the example apps).
Using spark-submit to launch the script I get the following warning:
WARN cluster.YarnClientClusterScheduler: Initial job has not
Hi Andrew,
thanks for trying to help. However, I am a bit confused now. I'm not setting
any 'spark.driver.host'; in particular, spark-defaults.conf is
empty/non-existing. I thought this is only required when running Spark in
standalone mode. Isn't it the case that, when using YARN, all the configuration
nee
Hi Marco,
I have the same issue. Did you fix it by chance? How?
Best,
Tassilo
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/cannot-run-spark-shell-in-yarn-client-mode-tp4013p17603.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
Hi,
I have an issue with running Spark in standalone mode on a cluster.
Everything seems to run fine for a couple of minutes until Spark stops
executing the tasks.
Any idea?
Would appreciate some help.
Thanks in advance,
Tassilo
I get errors like the following at the end:
14/10/31 16:16:59 INFO clien
Hi,
I get exactly the same error. It runs on my local machine but not on the
cluster. I am running the pi.py example.
Best,
Tassilo
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/stage-failure-java-lang-IllegalStateException-unread-block-data-tp1
Hi there,
I am trying to run the example code pi.py on a cluster; however, I have only
got it working on localhost. When trying to run it in standalone mode with
./bin/spark-submit \
--master spark://[mymaster]:7077 \
examples/src/main/python/pi.py \
I get warnings about resources and memory (the works
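One common cause of such warnings in standalone mode is asking for more memory or cores than any worker can offer; a hedged sketch of making the request explicit on spark-submit, with example values that would need to be adapted to the cluster:

./bin/spark-submit \
  --master spark://[mymaster]:7077 \
  --executor-memory 2g \
  --total-executor-cores 8 \
  examples/src/main/python/pi.py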
Hi,
I am using PySpark (1.1) for some image processing tasks.
The images (in an RDD) are on the order of several MB up to the low/mid tens
of MB. However, when using the data and running operations on it with Spark,
the memory usage blows up. Is there anything I can do about it? I
Hi,
I am running some Spark code on my cluster in standalone mode. However, I
have noticed that the most powerful machines (32 cores, 192 GB memory) hardly
get any tasks, whereas my small machines (8 cores, 128 GB memory) all get
plenty of tasks. The resources are all displayed correctly in the WebUI an
Hi,
I am running PySpark on a cluster. Generally it runs. However, I frequently
get the following warning message (and consequently the task is not executed):
WARN TaskSchedulerImpl: Initial job has not accepted any resources; check
your cluster UI to ensure that workers are registered and have suffici
I got it working. It was a bug in Spark 1.1. After upgrading to 1.2 it
worked.
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/Issues-running-spark-on-cluster-tp21138p21140.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
Hi,
I observed a weird performance issue using Spark in combination with
Theano, and I have no real explanation for it. To exemplify the issue I am
using the pi.py example of Spark that computes pi.
When I modify the function from the example:

# unmodified code
def f(_):
    x = random() * 2 - 1
    y = random() * 2 - 1
    return 1 if x ** 2 + y ** 2 < 1 else 0
I suspect that putting a function into a shared variable incurs additional
overhead? Any suggestions on how to avoid that?
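Since the modified code is cut off above, the following is only a rough, hypothetical micro-benchmark in the spirit of pi.py: it broadcasts plain data rather than a Theano function, and reads it from the shared variable on every call, to contrast with the unmodified f(). All names and values are made up.

import time
from random import random
from operator import add
from pyspark import SparkContext

sc = SparkContext(appName="SharedVariableOverheadSketch")

# plain data in a broadcast ("shared") variable; a stand-in for the real object
params = sc.broadcast({"radius_squared": 1.0})

def f_plain(_):
    # unmodified pi.py-style sampling
    x = random() * 2 - 1
    y = random() * 2 - 1
    return 1 if x ** 2 + y ** 2 < 1 else 0

def f_shared(_):
    # same computation, but the threshold is fetched from the broadcast each call
    x = random() * 2 - 1
    y = random() * 2 - 1
    return 1 if x ** 2 + y ** 2 < params.value["radius_squared"] else 0

n = 100 * 1000
for func in (f_plain, f_shared):
    start = time.time()
    count = sc.parallelize(range(n), 8).map(func).reduce(add)
    print("%s: pi ~ %f (%.2f s)" % (func.__name__, 4.0 * count / n, time.time() - start))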
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/Performance-issue-tp21194p21210.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.