unsubscribe

2023-01-20 Thread peng
unsubscribe

Unsubscribe

2023-05-01 Thread peng

Re: spark-repl_1.2.0 was not uploaded to central maven repository.

2014-12-22 Thread peng
eople will still move to databricks cloud, which has far more features than that. Many influential projects already depends on the routinely published Scala-REPL (e.g. playFW), it would be strange for Spark not doing the same. What do you think? Yours Peng On 12/22/2014 04:57 PM, Sean Owen

Re: Announcing Spark Packages

2014-12-22 Thread peng
Me 2 :) On 12/22/2014 06:14 PM, Andrew Ash wrote: Hi Xiangrui, That link is currently returning a 503 Over Quota error message. Would you mind pinging back out when the page is back up? Thanks! Andrew On Mon, Dec 22, 2014 at 12:37 PM, Xiangrui Meng > wrote: D

Re: unsubscribe

2020-06-27 Thread Wesley Peng
please send an empty email to: user-unsubscr...@spark.apache.org to unsubscribe yourself from the list. Sri Kris wrote: Sent from Mail for Windows 10 - To unsubscribe e-mail

Re: Unsubscribe

2020-12-22 Thread Wesley Peng
Bhavya Jain wrote: Unsubscribe please send an email to: user-unsubscr...@spark.apache.org to unsubscribe yourself from the list. thanks. - To unsubscribe e-mail: user-unsubscr...@spark.apache.org

Question about how hadoop configurations populated in driver/executor pod

2021-03-22 Thread Yue Peng
Hi, I am trying to run the SparkPi example via Spark on Kubernetes in my cluster. However, it is consistently failing because the executor does not have the correct Hadoop configurations. I could fix it by pre-creating a ConfigMap and mounting it into the executor by specifying it in a pod template. But I do s
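
For reference, a minimal sketch of the pod-template approach on the Spark side (assuming Spark 3.x on Kubernetes; the template path and ConfigMap name below are illustrative placeholders, not values from this thread):

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("spark-pi-k8s")
             # executor pods are built from a template that mounts the pre-created ConfigMap
             .config("spark.kubernetes.executor.podTemplateFile", "/opt/templates/executor-template.yaml")
             # alternatively, name a ConfigMap holding the HADOOP_CONF_DIR files directly
             .config("spark.kubernetes.hadoop.configMapName", "hadoop-conf")
             .getOrCreate())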

Re: Spark Session error with 30s

2021-04-12 Thread Peng Lei
Hi KhajaAsmath Mohammed Please check the configuration of "spark.speculation.interval", just pass the "30" to it. ''' override def start(): Unit = { backend.start() if (!isLocal && conf.get(SPECULATION_ENABLED)) { logInfo("Starting speculative execution thread") speculationSched
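
If the goal is simply to configure speculative execution on a SparkSession, a minimal PySpark sketch looks like the following (values are illustrative; the accepted format for the interval depends on the Spark version):

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("speculation-example")
             .config("spark.speculation", "true")            # enable speculative execution
             .config("spark.speculation.interval", "30s")    # how often to check for slow tasks
             .getOrCreate())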

Re: [ANNOUNCE] Apache Spark 3.2.0

2021-10-19 Thread Henrik Peng
Congrats and thanks! Gengliang Wang wrote on Tuesday, October 19, 2021 at 10:16 PM: > Hi all, > > Apache Spark 3.2.0 is the third release of the 3.x line. With tremendous > contribution from the open-source community, this release managed to > resolve in excess of 1,700 Jira tickets. > > We'd like to thank our contri

Re: ivy unit test case failing for Spark

2021-12-21 Thread Wes Peng
Are you using IvyVPN which causes this problem? If the VPN software changes the network URL silently you should avoid using them. Regards. On Wed, Dec 22, 2021 at 1:48 AM Pralabh Kumar wrote: > Hi Spark Team > > I am building a spark in VPN . But the unit test case below is failing. > This is p

Re: [Pyspark] How to download Zip file from SFTP location and put in into Azure Data Lake and unzip it

2022-01-18 Thread Wes Peng
How large is the file? From my experience, reading the excel file from data lake and loading as dataframe, works great. Thanks On 2022-01-18 22:16, Heta Desai wrote: Hello, I have zip files on SFTP location. I want to download/copy those files and put into Azure Data Lake. Once the zip files

Re: Profiling spark application

2022-01-19 Thread Wes Peng
Give a look at this: https://github.com/LucaCanali/sparkMeasure On 2022/1/20 1:18, Prasad Bhalerao wrote: Is there any way we can profile spark applications which will show no. of invocations of spark api and their execution time etc etc just the way jprofiler shows all the details? -
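
For reference, a minimal sketch of using sparkMeasure from PySpark (assumes the sparkmeasure Python package and the matching spark-measure jar are available on the cluster):

    from pyspark.sql import SparkSession
    from sparkmeasure import StageMetrics

    spark = SparkSession.builder.appName("profiling-example").getOrCreate()
    stagemetrics = StageMetrics(spark)

    stagemetrics.begin()
    spark.sql("select count(*) from range(1000 * 1000)").show()   # workload to profile
    stagemetrics.end()
    stagemetrics.print_report()   # elapsed time, shuffle and I/O metrics, GC time, etc.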

query time comparison to several SQL engines

2022-04-07 Thread Wes Peng
I made a simple test of query time for several SQL engines including mysql, hive, drill and spark. The report: https://cloudcache.net/data/query-time-mysql-hive-drill-spark.pdf It may have no special meaning, just for fun. :) regards.

Re: Executorlost failure

2022-04-07 Thread Wes Peng
how many executors do you have? rajat kumar wrote: Tested this with executors of size 5 cores, 17GB memory. Data vol is really high around 1TB - To unsubscribe e-mail: user-unsubscr...@spark.apache.org

Re: Executorlost failure

2022-04-07 Thread Wes Peng
I once had a file which is 100+GB getting computed in 3 nodes, each node has 24GB memory only. And the job could be done well. So from my experience spark cluster seems to work correctly for big files larger than memory by swapping them to disk. Thanks rajat kumar wrote: Tested this with exec

Re: Executorlost failure

2022-04-07 Thread Wes Peng
I just did a test, even for a single node (local deployment), spark can handle the data whose size is much larger than the total memory. My test VM (2g ram, 2 cores): $ free -m  total  used  free  shared  buff/cache  available  Mem:  1992  1845
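
A sketch of explicitly allowing cached data to spill to local disk when it does not fit in memory (illustrative input path; this is the general pattern, not the poster's exact test):

    from pyspark.sql import SparkSession
    from pyspark import StorageLevel

    spark = SparkSession.builder.appName("bigger-than-memory").getOrCreate()
    df = spark.read.text("/data/large_input")      # hypothetical input much larger than RAM
    df.persist(StorageLevel.MEMORY_AND_DISK)       # partitions that don't fit in memory spill to disk
    print(df.count())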

Re: A simple comparison for three SQL engines

2022-04-09 Thread Wes Peng
may I forward this report to spark list as well. Thanks. Wes Peng wrote: Hello, This weekend I made a test against a big dataset. spark, drill, mysql, postgresql were involved. This is the final report: https://blog.cloudcache.net/handles-the-file-larger-than-memory/ The simple conclusion

Re: Portability of dockers built on different cloud platforms

2023-04-05 Thread Ken Peng
ashok34...@yahoo.com.INVALID wrote: Is it possible to use Spark docker built on GCP on AWS without rebuilding from new on AWS? I am using the spark image from bitnami for running on k8s. And yes, it's deployed by helm. -- https://kenpeng.pages.dev/

Re: Data Duplication Bug Found - Structured Streaming Versions 3.4.1, 3.2.4, and 3.3.2

2023-09-14 Thread Jerry Peng
Craig, Thanks! Please let us know the result! Best, Jerry On Thu, Sep 14, 2023 at 12:22 PM Mich Talebzadeh wrote: > > Hi Craig, > > Can you please clarify what this bug is and provide sample code causing > this issue? > > HTH > > Mich Talebzadeh, > Distinguished Technologist, Solutions Archit

Re: Data Duplication Bug Found - Structured Streaming Versions 3.4.1, 3.2.4, and 3.3.2

2023-09-18 Thread Jerry Peng
d the chance to review the attached draft, let us know if > there are any questions in the meantime. Again, we welcome the opportunity > to work with the teams on this. > > > > Best- > > Craig > > > > > > > > *From: *Craig Alfieri > *Date: *Thursda

Re: S3NativeFileSystem inefficient implementation when calling sc.textFile

2015-05-21 Thread Peng Cheng
I stumble upon this thread and I conjecture that this may affect restoring a checkpointed RDD as well: http://apache-spark-user-list.1001560.n3.nabble.com/Union-of-checkpointed-RDD-in-Apache-Spark-has-long-gt-10-hour-between-stage-latency-td22925.html#a22928 In my case I have 1600+ fragmented che

[Spark 1.4.0]How to set driver's system property using spark-submit options?

2015-06-12 Thread Peng Cheng
In Spark <1.3.x, the system property of the driver can be set by the --conf option, which is shared between setting spark properties and system properties. In Spark 1.4.0 this feature was removed; the driver instead logs the following warning: Warning: Ignoring non-spark config property: xxx.xxx=v How do s

Re: [Spark 1.4.0]How to set driver's system property using spark-submit options?

2015-06-12 Thread Peng Cheng
know the new way to set the same properties? Yours Peng On 12 June 2015 at 14:20, Andrew Or wrote: > Hi Peng, > > Setting properties through --conf should still work in Spark 1.4. From the > warning it looks like the config you are trying to set does not start with > the prefix &

Re: [Spark 1.4.0]How to set driver's system property using spark-submit options?

2015-06-12 Thread Peng Cheng
On 12 June 2015 at 19:39, Ted Yu wrote: > This is the SPARK JIRA which introduced the warning: > > [SPARK-7037] [CORE] Inconsistent behavior for non-spark config properties > in spark-shell and spark-submit > > On Fri, Jun 12, 2015 at 4:34 PM, Peng Cheng wrote: > >> Hi A

Re: println not appearing in libraries when running job using spark-submit --master local

2016-03-28 Thread Kevin Peng
Ted, What triggerAndWait does is perform a REST call to a specified URL and then wait until the status message that gets returned by that URL in a JSON field says complete. The issue is I put a println at the very top of the method and that doesn't get printed out, and I know that println isn

Re: Weird results with Spark SQL Outer joins

2016-05-02 Thread Kevin Peng
Gourav, Apologies. I edited my post with this information: Spark version: 1.6 Result from spark shell OS: Linux version 2.6.32-431.20.3.el6.x86_64 ( mockbu...@c6b9.bsys.dev.centos.org) (gcc version 4.4.7 20120313 (Red Hat 4.4.7-4) (GCC) ) #1 SMP Thu Jun 19 21:14:45 UTC 2014 Thanks, KP On Mon,

Re: Weird results with Spark SQL Outer joins

2016-05-02 Thread Kevin Peng
Gourav, I wish that was case, but I have done a select count on each of the two tables individually and they return back different number of rows: dps.registerTempTable("dps_pin_promo_lt") swig.registerTempTable("swig_pin_promo_lt") dps.count() RESULT: 42632 swig.count() RESULT: 42034 On

Re: Weird results with Spark SQL Outer joins

2016-05-02 Thread Kevin Peng
Yong, Sorry, let me explain my deduction; it is going to be difficult to get sample data out since the dataset I am using is proprietary. From the above set of queries (ones mentioned in above comments) both inner and outer join are producing the same counts. They are basically pulling out selected co

Re: Weird results with Spark SQL Outer joins

2016-05-03 Thread Kevin Peng
at 11:16 PM, Davies Liu wrote: > as @Gourav said, all the join with different join type show the same > results, > which meant that all the rows from left could match at least one row from > right, > all the rows from right could match at least one row from left, even > the numb

Re: Weird results with Spark SQL Outer joins

2016-05-03 Thread Kevin Peng
it a first look I do think that you have hit something here > >> and this does not look quite fine. I have to work on the multiple AND > >> conditions in ON and see whether that is causing any issues. > >> > >> Regards, > >> Gourav Sengupta > >

Re: Weird results with Spark SQL Outer joins

2016-05-03 Thread Kevin Peng
>>> Hi Kevin, > >>> > >>> Having given it a first look I do think that you have hit something > here > >>> and this does not look quite fine. I have to work on the multiple AND > >>> conditions in ON and see whether that is causing any issues. > &

udf of aggregation in pyspark dataframe ?

2016-09-29 Thread peng yu
Hi, is there a way to write a udf in pyspark that supports agg()? I searched all over the docs and internet, and tested it out... some say yes, some say no. And when I try those "yes" code examples, it just complains about AnalysisException: u"expression 'pythonUDF' is neither present in the group by, nor

Re: udf of aggregation in pyspark dataframe ?

2016-09-29 Thread peng yu
btw, i am using spark 1.6.1 -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/udf-of-aggregation-in-pyspark-dataframe-tp27811p27812.html Sent from the Apache Spark User List mailing list archive at Nabble.com. -

Re: udf of aggregation in pyspark dataframe ?

2016-09-29 Thread peng yu
df:
 a | b | c
 ---------
 1 | m | n
 1 | x | j
 2 | m | x
 ...

import pyspark.sql.functions as F
from pyspark.sql.types import MapType, StringType

def _my_zip(c, d):
    return dict(zip(c, d))

my_zip = F.udf(_my_zip, MapType(StringType(), StringType(), True))

df.groupBy('a').agg(my_zip(F.collect_list('b'), F.collect_list('c')))

Re: Setting Optimal Number of Spark Executor Instances

2017-03-15 Thread Kevin Peng
Mohini, We set that parameter before we went and played with the number of executors and that didn't seem to help at all. Thanks, KP On Tue, Mar 14, 2017 at 3:37 PM, mohini kalamkar wrote: > Hi, > > try using this parameter --conf spark.sql.shuffle.partitions=1000 > > Thanks, > Mohini > > On

The stability of Spark Stream Kafka 010

2017-06-29 Thread Martin Peng
Hi, We planned to upgrade our Spark Kafka library to 0.10 from 0.81 to simplify our infrastructure code logic. Does anybody know when will the 010 version become stable from experimental? May I use this 010 version together with Spark 1.5.1? https://spark.apache.org/docs/latest/streaming-kafka-0-

Spark Job crash due to File Not found when shuffle intermittently

2017-07-21 Thread Martin Peng
Hi, I have several Spark jobs including both batch job and Stream jobs to process the system log and analyze them. We are using Kafka as the pipeline to connect each jobs. Once upgrade to Spark 2.1.0 + Spark Kafka Streaming 010, I found some of the jobs(both batch or streaming) are thrown below e

Re: Spark Job crash due to File Not found when shuffle intermittently

2017-07-24 Thread Martin Peng
Is there anyone who can share some light on this issue? Thanks Martin 2017-07-21 18:58 GMT-07:00 Martin Peng : > Hi, > > I have several Spark jobs including both batch job and Stream jobs to > process the system log and analyze them. We are using Kafka as the pipeline > to co

Re: Spark Job crash due to File Not found when shuffle intermittently

2017-07-25 Thread Martin Peng
> Second, when using BypassMergeSortShuffleWriter, it will first write >>> data file then write an index file. >>> You can check "Failed to delete temporary index file at" or "fail to >>> rename file" in related executor node's log file. &

spark jdbc postgres query results don't match those of postgres query

2018-03-29 Thread Kevin Peng
I am running into a weird issue in Spark 1.6, which I was wondering if anyone has encountered before. I am running a simple select query from spark using a jdbc connection to postgres: val POSTGRES_DRIVER: String = "org.postgresql.Driver" val srcSql = """select total_action_value, last_updated from
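
For comparison, an illustrative PySpark version of that JDBC read (Spark 2.x-style API; connection details and table name are hypothetical placeholders):

    df = (spark.read.format("jdbc")
          .option("url", "jdbc:postgresql://dbhost:5432/mydb")
          .option("driver", "org.postgresql.Driver")
          .option("dbtable", "(select total_action_value, last_updated from some_table) as t")
          .option("user", "spark_user")
          .option("password", "secret")
          .load())
    df.show()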

How to work around NoOffsetForPartitionException when using Spark Streaming

2018-06-01 Thread Martin Peng
Hi, We see the below exception when using Spark Kafka streaming 0.10 on a normal Kafka topic. Not sure why the offset is missing in zk, but since Spark streaming overrides the offset reset policy to none in the code, I can not set the reset policy to latest (I don't really care about data loss now). Is there any qu

Re: [ANNOUNCE] Announcing Apache Spark 2.4.4

2019-09-01 Thread Wesley Peng
on 2019/9/2 5:54, Dongjoon Hyun wrote: We are happy to announce the availability of Spark 2.4.4! Spark 2.4.4 is a maintenance release containing stability fixes. This release is based on the branch-2.4 maintenance branch of Spark. We strongly recommend all 2.4 users to upgrade to this stable

[ML] [How-to]: How to unload the loaded W2V model in Pyspark?

2020-02-17 Thread Zhefu PENG
Hi all, I'm using pyspark and Spark-ml to train and use Word2Vect model, here is the logic of my program: model = Word2VecModel.load("save path") result_list = model.findSynonymsArray(target, top_N) Then I use the graphframe and result_list to create graph and do some computing. However the pro

Re: java.lang.IllegalStateException: unread block data

2015-02-02 Thread Peng Cheng
I got the same problem, maybe java serializer is unstable -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/java-lang-IllegalStateException-unread-block-data-tp20668p21463.html Sent from the Apache Spark User List mailing list archive at Nabble.com. -

Is LogisticRegressionWithSGD in MLlib scalable?

2015-02-03 Thread Peng Zhang
Hi Everyone, Is LogisticRegressionWithSGD in MLlib scalable? If so, what is the idea behind the scalable implementation? Thanks in advance, Peng - Peng Zhang -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Is-LogisticRegressionWithSGD-in-MLlib

Why does spark write huge file into temporary local disk even without on-disk persist or checkpoint?

2015-02-10 Thread Peng Cheng
I'm running a small job on a cluster with 15G of mem and 8G of disk per machine. The job always get into a deadlock where the last error message is: java.io.IOException: No space left on device at java.io.FileOutputStream.writeBytes(Native Method) at java.io.FileOutputStream.write
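
One common mitigation discussed for this class of error is to point Spark's scratch space at a volume with enough room; a minimal sketch (path is illustrative; on YARN/standalone the cluster manager's SPARK_LOCAL_DIRS/LOCAL_DIRS setting takes precedence, as noted later in these threads):

    from pyspark import SparkConf, SparkContext

    conf = (SparkConf()
            .setAppName("spill-dir-example")
            .set("spark.local.dir", "/mnt/bigdisk/spark-tmp"))   # where shuffle/spill files are written
    sc = SparkContext(conf=conf)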

Re: Why does spark write huge file into temporary local disk even without on-disk persist or checkpoint?

2015-02-11 Thread Peng Cheng
You are right. I've checked the overall stage metrics and looks like the largest shuffling write is over 9G. The partition completed successfully but its spilled file can't be removed until all others are finished. It's very likely caused by a stupid mistake in my design. A lookup table grows const

Re: Shuffle write increases in spark 1.2

2015-02-14 Thread Peng Cheng
s not related to Spark 1.2.0's new features Yours Peng -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Shuffle-write-increases-in-spark-1-2-tp20894p21656.html Sent from the Apache Spark User List mailing list archive at

Re: Shuffle write increases in spark 1.2

2015-02-14 Thread Peng Cheng
I double check the 1.2 feature list and found out that the new sort-based shuffle manager has nothing to do with HashPartitioner :-< Sorry for the misinformation. In another hand. This may explain increase in shuffle spill as a side effect of the new shuffle manager, let me revert spark.shuffle.ma

Re: Issues with maven dependencies for version 1.2.0 but not version 1.1.0

2015-03-04 Thread Kevin Peng
Marcelo, Yes that is correct, I am going through a mirror, but 1.1.0 works properly, while 1.2.0 does not. I suspect there is crc in the 1.2.0 pom file. On Wed, Mar 4, 2015 at 4:10 PM, Marcelo Vanzin wrote: > Seems like someone set up "m2.mines.com" as a mirror in your pom file > or ~/.m2/sett

Re: Issues with maven dependencies for version 1.2.0 but not version 1.1.0

2015-03-04 Thread Kevin Peng
Ted, I have tried wiping out ~/.m2/org.../spark directory multiple times. It doesn't seem to work. On Wed, Mar 4, 2015 at 4:20 PM, Ted Yu wrote: > kpeng1: > Try wiping out ~/.m2 and build again. > > Cheers > > On Wed, Mar 4, 2015 at 4:10 PM, Marcelo Vanzin > wrote: > >> Seems like someone s

Re: Issues with maven dependencies for version 1.2.0 but not version 1.1.0

2015-03-04 Thread Kevin Peng
thread: http://search-hadoop.com/m/JW1q5Vfe6X1 > > Cheers > > On Wed, Mar 4, 2015 at 4:18 PM, Kevin Peng wrote: > >> Marcelo, >> >> Yes that is correct, I am going through a mirror, but 1.1.0 works >> properly, while 1.2.0 does not. I suspect there is crc in

Re: Issues with maven dependencies for version 1.2.0 but not version 1.1.0

2015-03-04 Thread Kevin Peng
loudera.com/content/cloudera/en/documentation/core/v5-2-x/topics/cdh_vd_cdh5_maven_repo.html > > On Wed, Mar 4, 2015 at 4:34 PM, Kevin Peng wrote: > > Ted, > > > > I am currently using CDH 5.3 distro, which has Spark 1.2.0, so I am not > too > > sure about the compatibilit

error on training with logistic regression sgd

2015-03-09 Thread Peng Xia
(ForkJoinPool.java:1979) at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) The data are transformed to LabeledPoint and I was using pyspark for this. Can anyone help me on this? Thanks, Best, Peng

Re: error on training with logistic regression sgd

2015-03-10 Thread Peng Xia
algorithm in python. 3. train a logistic regression model with the converted labeled points. Can any one give some advice for how to avoid the 2gb, if this is the cause? Thanks very much for the help. Best, Peng On Mon, Mar 9, 2015 at 3:54 PM, Peng Xia wrote: > Hi, > > I was launchin

Re: spark sql writing in avro

2015-03-12 Thread Kevin Peng
Dale, I basically have the same maven dependency above, but my code will not compile due to not being able to reference to AvroSaver, though the saveAsAvro reference compiles fine, which is weird. Eventhough saveAsAvro compiles for me, it errors out when running the spark job due to it not being

Re: spark sql writing in avro

2015-03-13 Thread Kevin Peng
n will pick up the latest version of > spark-avro (for this machine). > > Now you should be able to compile and run. > > HTH, > Markus > > > On 03/12/2015 11:55 PM, Kevin Peng wrote: > > Dale, > > I basically have the same maven dependency above, but my

spark there is no space on the disk

2015-03-13 Thread Peng Xia
Hi I was running a logistic regression algorithm on an 8-node spark cluster, each node has 8 cores and 56 GB Ram (each node is running a windows system). And the spark installation drive has 1.9 TB capacity. The dataset I was training on has around 40 million records with around 6600 feature

Re: Loading in json with spark sql

2015-03-13 Thread Kevin Peng
Yin, Yup thanks. I fixed that shortly after I posted and it worked. Thanks, Kevin On Fri, Mar 13, 2015 at 8:28 PM, Yin Huai wrote: > Seems you want to use array for the field of "providers", like > "providers":[{"id": > ...}, {"id":...}] instead of "providers":{{"id": ...}, {"id":...}} > >

Re: spark there is no space on the disk

2015-03-14 Thread Peng Xia
Hi Sean, Thank you very much for your reply. I tried to config it from the below code: sf = SparkConf().setAppName("test").set("spark.executor.memory", "45g").set("spark.cores.max", 62).set("spark.local.dir", "C:\\tmp") But still get the er

Re: spark there is no space on the disk

2015-03-14 Thread Peng Xia
And I have 2 TB free space on the C drive. On Sat, Mar 14, 2015 at 8:29 PM, Peng Xia wrote: > Hi Sean, > > Thank very much for your reply. > I tried to config it from below code: > > sf = SparkConf().setAppName("test").set("spark.executor.memory", &

Re: Can I start multiple executors in local mode?

2015-03-16 Thread xu Peng
Hi David, You can try the local-cluster master. The numbers in local-cluster[2,2,1024] mean there are 2 workers, 2 cores per worker, and 1024 MB of memory per worker. Best Regards Peng Xu 2015-03-16 19:46 GMT+08:00 Xi Shen : > Hi, > > In YARN mode you can specify the number of executors. I wonder if we can >
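
A sketch of launching against that master URL from PySpark (illustrative; local-cluster mode is meant for testing and generally requires a local Spark distribution/SPARK_HOME):

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .master("local-cluster[2,2,1024]")   # 2 workers, 2 cores each, 1024 MB per worker
             .appName("local-cluster-test")
             .getOrCreate())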

Re: spark there is no space on the disk

2015-03-31 Thread Peng Xia
.0 and later this will be overriden by >> > SPARK_LOCAL_DIRS (Standalone, Mesos) or LOCAL_DIRS (YARN) >> > >> > On Sat, Mar 14, 2015 at 5:29 PM, Peng Xia >> wrote: >> >> Hi Sean, >> >> >> >> Thank very much for your repl

refer to dictionary

2015-03-31 Thread Peng Xia
dict: rdd1.map(lambda line: [dict1[item] for item in line]) But this task is not distributed, I believe the reason is the dict1 is a local instance. Can any one provide suggestions on this to parallelize this? Thanks, Best, Peng

Re: refer to dictionary

2015-03-31 Thread Peng Xia
Hi Ted, Thanks very much, yea, using broadcast is much faster. Best, Peng On Tue, Mar 31, 2015 at 8:49 AM, Ted Yu wrote: > You can use broadcast variable. > > See also this thread: > > http://search-hadoop.com/m/JW1q5GX7U22/Spark+broadcast+variable&subj=How+Broad
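
A sketch of that broadcast fix, reusing the names from the post above (sc, rdd1 and dict1 as in the original snippet):

    # ship the lookup table to every executor once instead of capturing a driver-local dict per task
    bc_dict = sc.broadcast(dict1)
    mapped = rdd1.map(lambda line: [bc_dict.value[item] for item in line])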

How to avoid “Invalid checkpoint directory” error in apache Spark?

2015-04-17 Thread Peng Cheng
I'm using Amazon EMR + S3 as my spark cluster infrastructure. When I'm running a job with periodic checkpointing (it has a long dependency tree, so truncating by checkpointing is mandatory, each checkpoint has 320 partitions). The job stops halfway, resulting an exception: (On driver) org.apache.s
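
For reference, the general shape of the periodic RDD checkpointing being described, reusing sc from an existing SparkContext (the directory and iteration counts are illustrative, not the poster's job):

    sc.setCheckpointDir("hdfs:///tmp/spark-checkpoints")   # checkpoint storage; the thread discusses S3 vs HDFS

    rdd = sc.parallelize(range(1000))
    for i in range(100):                 # long iterative dependency chain
        rdd = rdd.map(lambda x: x + 1)
        if i % 10 == 0:
            rdd.checkpoint()             # truncate the lineage every 10 iterations
            rdd.count()                  # force materialization of the checkpoint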

Re: Spark Performance on Yarn

2015-04-20 Thread Peng Cheng
I got exactly the same problem, except that I'm running on a standalone master. Can you tell me the counterpart parameter on standalone master for increasing the same memory overhead? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Performance-on-Yarn-

What are the likely causes of org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output location for shuffle?

2015-04-24 Thread Peng Cheng
I'm deploying a Spark data processing job on an EC2 cluster, the job is small for the cluster (16 cores with 120G RAM in total), the largest RDD has only 76k+ rows. But heavily skewed in the middle (thus requires repartitioning) and each row has around 100k of data after serialization. The job alwa

Union of checkpointed RDD in Apache Spark has long (> 10 hour) between-stage latency

2015-05-17 Thread Peng Cheng
I'm implementing one of my machine learning/graph analysis algorithm on Apache Spark: The algorithm is very iterative (like all other ML algorithms), but it has a rather strange workflow: first a subset of all training data (called seeds RDD: {S_1} is randomly selected) and loaded, in each iterati

Re: Union of checkpointed RDD in Apache Spark has long (> 10 hour) between-stage latency

2015-05-17 Thread Peng Cheng
BTW: My thread dump of the driver's main thread looks like it is stuck on waiting for Amazon S3 bucket metadata for a long time (which may suggests that I should move checkpointing directory from S3 to HDFS): Thread 1: main (RUNNABLE) java.net.SocketInputStream.socketRead0(Native Method) java.net

Re: Union of checkpointed RDD in Apache Spark has long (> 10 hour) between-stage latency

2015-05-17 Thread Peng Cheng
Looks like this problem has been mentioned before: http://qnalist.com/questions/5666463/downloads-from-s3-exceedingly-slow-when-running-on-spark-ec2 and a temporarily solution is to deploy on a dedicated EMR/S3 configuration. I'll go for that one for a shot. -- View this message in context: h

Re: Union of checkpointed RDD in Apache Spark has long (> 10 hour) between-stage latency

2015-05-17 Thread Peng Cheng
Turns out the above thread is unrelated: it was caused by using s3:// instead of s3n://. Which I already avoided in my checkpointDir configuration. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Union-of-checkpointed-RDD-in-Apache-Spark-has-long-10-hour-bet

Re: spark1.0 spark sql saveAsParquetFile Error

2014-06-09 Thread Peng Cheng
I wasn't using spark sql before. But by default spark should retry the exception for 4 times. I'm curious why it aborted after 1 failure -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/spark1-0-spark-sql-saveAsParquetFile-Error-tp7006p7252.html Sent from the

Re: How to enable fault-tolerance?

2014-06-09 Thread Peng Cheng
I speculate that Spark will only retry on exceptions that are registered with TaskSetScheduler, so a definitely-will-fail task will fail quickly without taking more resources. However I haven't found any documentation or web page on it -- View this message in context: http://apache-spark-user-l

Re: Occasional failed tasks

2014-06-09 Thread Peng Cheng
I think these failed tasks must have been retried automatically if you can't see any error in your results. Otherwise the entire application will throw a SparkException and abort. Unfortunately I don't know how to do this, my application always aborts. -- View this message in context: http://apache-s

Re: How to enable fault-tolerance?

2014-06-09 Thread Peng Cheng
fortunately they never got pushed into the documentation, and you got config parameters scattered in two different places (masterURL and $spark.task.maxFailures). I'm thinking of adding a new config parameter $spark.task.maxLocalFailures to override 1, how do you think? Thanks again buddy.
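
For reference, local-mode task retries can be requested through the master URL itself; a minimal sketch (thread and failure counts are illustrative):

    from pyspark import SparkContext

    # local[N, F]: N worker threads, with up to F attempts per task before the job aborts
    sc = SparkContext(master="local[4,3]", appName="local-retries")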

Re: How to enable fault-tolerance?

2014-06-09 Thread Peng Cheng
Oh, and to make things worse, they forgot '\*' in their regex. Am I the first to encounter this problem before? On Mon 09 Jun 2014 02:24:43 PM EDT, Peng Cheng wrote: Thanks a lot! That's very responsive, somebody definitely has encountered the same problem before, and added two

Re: How to enable fault-tolerance?

2014-06-09 Thread Peng Cheng
Hi Matei, Yeah you are right this is very niche (my user case is as a web crawler), but I glad you also like an additional property. Let me open a JIRA. Yours Peng On Mon 09 Jun 2014 03:08:29 PM EDT, Matei Zaharia wrote: If this is a useful feature for local mode, we should open a JIRA to

What is the best way to handle transformations or actions that takes forever?

2014-06-16 Thread Peng Cheng
My transformations or actions have some external toolset dependencies and sometimes they just get stuck somewhere and there is no way I can fix them. If I don't want the job to run forever, do I need to implement several monitor threads that throw an exception when they get stuck? Or the framework can alrea

Re: What is the best way to handle transformations or actions that takes forever?

2014-06-17 Thread Peng Cheng
I've tried enabling the speculative jobs; this seems to have partially solved the problem, however I'm not sure if it can handle large-scale situations as it only starts when 75% of the job is finished. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/What-is-the-best

Re: What is the best way to handle transformations or actions that takes forever?

2014-06-20 Thread Peng Cheng
Wow, that sounds a lot of work (need a mini-thread), thanks a lot for the answer. It might be a nice-to-have feature. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/What-is-the-best-way-to-handle-transformations-or-actions-that-takes-forever-tp7664p8024.htm

Spark throws NoSuchFieldError when testing on cluster mode

2014-06-21 Thread Peng Cheng
o fix because I cannot debug them. Is there a local cluster simulation mode that can throw all errors yet allows me to debug? Yours Peng -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-throws-NoSuchFieldError-when-testing-on-cluster-mode-tp8064.html Sent

Re: Spark throws NoSuchFieldError when testing on cluster mode

2014-06-21 Thread Peng Cheng
Thanks a lot! Let me check my maven shade plugin config and see if there is a fix -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-throws-NoSuchFieldError-when-testing-on-cluster-mode-tp8064p8073.html Sent from the Apache Spark User List mailing list ar

Re: Spark throws NoSuchFieldError when testing on cluster mode

2014-06-21 Thread Peng Cheng
Indeed I see a lot of duplicate package warning in the maven-shade assembly package output, so I tried to eliminate them: First I set scope of dependency to apache-spark to 'provided', as suggested in this page: http://spark.apache.org/docs/latest/submitting-applications.html But spark master gav

Re: Spark throws NoSuchFieldError when testing on cluster mode

2014-06-21 Thread Peng Cheng
Latest Advancement: I found the cause of NoClassDef exception: I wasn't using spark-submit, instead I tried to run the spark application directly with SparkConf set in the code. (this is handy in local debugging). However the old problem remains: Even my maven-shade plugin doesn't give any warning

Re: Spark throws NoSuchFieldError when testing on cluster mode

2014-06-21 Thread Peng Cheng
I also found that any buggy application submitted in --deploy-mode = cluster mode will crash the worker (turn status to 'DEAD'). This shouldn't really happen, otherwise nobody will use this mode. It is yet unclear whether all workers will crash or only the one running the driver will (as I only hav

Re: Spark throws NoSuchFieldError when testing on cluster mode

2014-06-21 Thread Peng Cheng
Hi Sean, OK I'm about 90% sure about the cause of this problem: Just another classic Dependency conflict: Myproject -> Selenium -> apache.httpcomponents:httpcore 4.3.1 (has ContentType) Spark -> Spark SQL Hive -> Hive -> Thrift -> apache.httpcomponents:httpcore 4.1.3 (has no ContentType) Though I

Re: Spark Processing Large Data Stuck

2014-06-21 Thread Peng Cheng
JVM will quit after spending most of its time on GC (about 95%), but usually before that you have to wait for a long time, particularly if your job is already at massive scale. Since it is hard to run profiling online, maybe its easier for debugging if you make a lot of partitions (so you can watc

Re: Spark throws NoSuchFieldError when testing on cluster mode

2014-06-22 Thread Peng Cheng
Right problem solved in a most disgraceful manner. Just add a package relocation in maven shade config. The downside is that it is not compatible with my IDE (IntelliJ IDEA), will cause: Error:scala.reflect.internal.MissingRequirementError: object scala.runtime in compiler mirror not found.: objec

Spark slave fails to start with weird error information

2014-06-24 Thread Peng Cheng
I'm trying to link a spark slave with an already-setup master, using: $SPARK_HOME/sbin/start-slave.sh spark://ip-172-31-32-12:7077 However the result shows that it cannot open a log file it is supposed to create: failed to launch org.apache.spark.deploy.worker.Worker: tail: cannot open '/opt/spa

Re: Spark slave fails to start with weird error information

2014-06-24 Thread Peng Cheng
I haven't setup a passwordless login from slave to master node yet (I was under impression that this is not necessary since they communicate using port 7077) -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-slave-fail-to-start-with-wierd-error-informati

Re: Serialization problem in Spark

2014-06-24 Thread Peng Cheng
I encounter the same problem with hadoop.fs.Configuration (very complex, unserializable class) basically if your closure contains any instance (not constant object/singleton! they are in the jar, not closure) that doesn't inherit Serializable, or their properties doesn't inherit Serializable, you a

Re: ElasticSearch enrich

2014-06-24 Thread Peng Cheng
make sure all queries are called through class methods and wrap your query info with a class having only simple properties (strings, collections etc). If you can't find such wrapper you can also use SerializableWritable wrapper out-of-the-box, but its not recommended. (developer-api and make fat cl

Re: How to Reload Spark Configuration Files

2014-06-24 Thread Peng Cheng
I've read somewhere that in 1.0 there is a bash tool called 'spark-config.sh' that allows you to propagate your config files to a number of master and slave nodes. However I haven't use it myself -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-Reload-

Re: Upgrading to Spark 1.0.0 causes NoSuchMethodError

2014-06-24 Thread Peng Cheng
I got 'NoSuchFieldError' which is of the same type. its definitely a dependency jar conflict. spark driver will load jars of itself which in recent version get many dependencies that are 1-2 years old. And if your newer version dependency is in the same package it will be shaded (Java's first come

Re: ElasticSearch enrich

2014-06-24 Thread Peng Cheng
I'm afraid persisting connection across two tasks is a dangerous act as they can't be guaranteed to be executed on the same machine. Your ES server may think its a man-in-the-middle attack! I think its possible to invoke a static method that give you a connection in a local 'pool', so nothing will
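
A sketch of the per-partition pattern being alluded to: build the client inside the task rather than serializing it from the driver (get_es_client and the lookup call are hypothetical placeholders, not an API from this thread; assumes an existing rdd of dict-like rows):

    def enrich_partition(rows):
        es = get_es_client()    # hypothetical helper that builds/borrows a client on the worker
        for row in rows:
            yield (row, es.get(index="docs", id=row["id"]))   # illustrative per-row lookup

    enriched = rdd.mapPartitions(enrich_partition)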

Re: Spark slave fails to start with weird error information

2014-06-24 Thread Peng Cheng
anyone encounter this situation? Also, I'm very sure my slave and master are in the same security group, with port 7077 opened -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-slave-fail-to-start-with-wierd-error-information-tp8203p8227.html Sent from

Does the PUBLIC_DNS environment parameter really work?

2014-06-24 Thread Peng Cheng
I'm deploying a cluster to Amazon EC2, trying to override its internal ip addresses with public dns I start a cluster with environment parameter: SPARK_PUBLIC_DNS=[my EC2 public DNS] But it doesn't change anything on the web UI, it still shows internal ip address Spark Master at spark://ip-172-3

Re: Spark slave fails to start with weird error information

2014-06-25 Thread Peng Cheng
Sorry I just realize that start-slave is for a different task. Please close this -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-slave-fail-to-start-with-wierd-error-information-tp8203p8246.html Sent from the Apache Spark User List mailing list archive

TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient memory

2014-06-25 Thread Peng Cheng
y per Node | Submitted Time | User | State | Duration
app-20140625083158- | org.tribbloid.spookystuff.example.GoogleImage$ | 2 | 512.0 MB | 2014/06/25 08:31:58 | peng | RUNNING | 17 min
However when submitting the job in client mode: $SPARK_HOME/bin/spark-submit \ --
