RE: How to get progress information of an RDD operation

2016-02-24 Thread Wang, Ningjun (LNG-NPV)
? Ningjun From: Ted Yu [mailto:yuzhih...@gmail.com] Sent: Tuesday, February 23, 2016 2:30 PM To: Kevin Mellott Cc: Wang, Ningjun (LNG-NPV); user@spark.apache.org Subject: Re: How to get progress information of an RDD operation I think Ningjun was looking for a programmatic way of tracking progress. I took

How to get progress information of an RDD operation

2016-02-23 Thread Wang, Ningjun (LNG-NPV)
How can I get progress information of an RDD operation? For example val lines = sc.textFile("c:/temp/input.txt") // an RDD of millions of lines lines.foreach(line => { handleLine(line) }) The input.txt contains millions of lines. The entire operation takes 6 hours. I want to print out h
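One programmatic option (a sketch only; handleLine is the per-line function from the question above) is to attach a SparkListener before running the action, so task completions are reported while the long foreach runs:

    import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}

    // Count finished tasks as the job runs; listener events arrive on a single
    // thread, so a plain var is sufficient for this sketch.
    sc.addSparkListener(new SparkListener {
      var finishedTasks = 0
      override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
        finishedTasks += 1
        println(s"completed $finishedTasks tasks in stage ${taskEnd.stageId}")
      }
    })

    val lines = sc.textFile("c:/temp/input.txt")
    lines.foreach(line => handleLine(line))   // progress is printed per finished task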

RE: How to create dataframe from SQL Server SQL query

2015-12-07 Thread Wang, Ningjun (LNG-NPV)
This is a very helpful article. Thanks for the help. Ningjun From: Sujit Pal [mailto:sujitatgt...@gmail.com] Sent: Monday, December 07, 2015 12:42 PM To: Wang, Ningjun (LNG-NPV) Cc: user@spark.apache.org Subject: Re: How to create dataframe from SQL Server SQL query Hi Ningjun, Haven't

How to create dataframe from SQL Server SQL query

2015-12-07 Thread Wang, Ningjun (LNG-NPV)
How can I create an RDD from a SQL query against a SQL Server database? Here is the dataframe example from http://spark.apache.org/docs/latest/sql-programming-guide.html#overview val jdbcDF = sqlContext.read.format("jdbc").options( Map("url" -> "jdbc:postgresql:dbserver", "dbtable" -> "schema.tab
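A minimal sketch of reading from SQL Server the same way, assuming the Microsoft JDBC driver jar is on the classpath; the server, database, and query below are placeholders:

    val jdbcDF = sqlContext.read.format("jdbc").options(Map(
      "url"     -> "jdbc:sqlserver://dbserver:1433;databaseName=mydb",
      "driver"  -> "com.microsoft.sqlserver.jdbc.SQLServerDriver",
      // an arbitrary SQL query can be pushed down as a derived table
      "dbtable" -> "(SELECT id, name FROM dbo.myTable WHERE category = 'A') AS t"
    )).load()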

RE: Why is my spark executor terminated?

2015-10-14 Thread Wang, Ningjun (LNG-NPV)
, October 13, 2015 10:42 AM To: user@spark.apache.org Subject: Re: Why is my spark executor terminated? Hi Ningjun, Nothing special in the master log ? Regards JB On 10/13/2015 04:34 PM, Wang, Ningjun (LNG-NPV) wrote: > We use spark on windows 2008 R2 servers. We use one spark context > which

Why is my spark executor terminated?

2015-10-13 Thread Wang, Ningjun (LNG-NPV)
We use spark on windows 2008 R2 servers. We use one spark context which creates one spark executor. We run spark master, slave, driver, executor on one single machine. From time to time, we found that the executor JAVA process was terminated. I cannot figure out why it was terminated. Can anybody

RE: How to register array class with Kryo in spark-defaults.conf

2015-07-31 Thread Wang, Ningjun (LNG-NPV)
: Friday, July 31, 2015 11:49 AM To: Wang, Ningjun (LNG-NPV) Cc: user@spark.apache.org Subject: Re: How to register array class with Kryo in spark-defaults.conf For the second exception, was there anything following SparkException which would give us more clue ? Can you tell us how EsDoc is

RE: How to register array class with Kryo in spark-defaults.conf

2015-07-31 Thread Wang, Ningjun (LNG-NPV)
Does anybody have any idea how to solve this problem? Ningjun From: Wang, Ningjun (LNG-NPV) Sent: Thursday, July 30, 2015 11:06 AM To: user@spark.apache.org Subject: How to register array class with Kryo in spark-defaults.conf I register my class with Kryo in spark-defaults.conf as follows

How to register array class with Kryo in spark-defaults.conf

2015-07-30 Thread Wang, Ningjun (LNG-NPV)
I register my class with Kryo in spark-defaults.conf as follows spark.serializer org.apache.spark.serializer.KryoSerializer spark.kryo.registrationRequired true spark.kryo.classesToRegister ltn.analytics.es.EsDoc But I got the following excep
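With spark.kryo.registrationRequired set to true, the array type Array[EsDoc] must be registered in addition to EsDoc itself. One sketch (using the class name from the message above) is to register both programmatically on the SparkConf rather than in spark-defaults.conf:

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .set("spark.kryo.registrationRequired", "true")
      // register the class and its array form, which is also serialized
      // when registration is required
      .registerKryoClasses(Array(
        classOf[ltn.analytics.es.EsDoc],
        classOf[Array[ltn.analytics.es.EsDoc]]
      ))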

RE: java.lang.NoClassDefFoundError: Could not initialize class org.fusesource.jansi.internal.Kernel32

2015-07-17 Thread Wang, Ningjun (LNG-NPV)
Does anybody have any idea what causes this problem? Thanks. Ningjun From: Wang, Ningjun (LNG-NPV) Sent: Wednesday, July 15, 2015 11:09 AM To: user@spark.apache.org Subject: java.lang.NoClassDefFoundError: Could not initialize class org.fusesource.jansi.internal.Kernel32 I just installed spark

java.lang.NoClassDefFoundError: Could not initialize class org.fusesource.jansi.internal.Kernel32

2015-07-15 Thread Wang, Ningjun (LNG-NPV)
I just installed spark 1.3.1 on windows 2008 server. When I start spark-shell, I got the following error Failed to created SparkJLineReader: java.lang.NoClassDefFoundError: Could not initialize class org.fusesource.jansi.internal.Kernel32 Please advise. Thanks. Ningjun

Cannot iterate items in rdd.mapPartitions()

2015-06-26 Thread Wang, Ningjun (LNG-NPV)
In rdd.mapPartitions(...), if I try to iterate through the items in the partition, nothing works as expected. For example val rdd = sc.parallelize(1 to 1000, 3) val count = rdd.mapPartitions(iter => { println(iter.length) iter }).count() The count is 0. This is incorrect. The count should be 1000. If
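The count is 0 because iter.length consumes the iterator, so nothing is left to hand back. A sketch of the usual fix is to materialize the partition once:

    val rdd = sc.parallelize(1 to 1000, 3)
    val count = rdd.mapPartitions { iter =>
      val items = iter.toList             // iter.length would exhaust the iterator
      println(s"partition size = ${items.size}")
      items.iterator                      // return a fresh iterator over the same items
    }.count()
    // count is 1000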

Does spark performance really scale out with multiple machines?

2015-06-15 Thread Wang, Ningjun (LNG-NPV)
I try to measure how spark standalone cluster performance scales out with multiple machines. I did a test of training the SVM model, which is heavy on in-memory computation. I measured the run time for a spark standalone cluster of 1 - 3 nodes; the results are as follows: 1 node: 35 minutes 2 nodes: 30.1 m

RE: How to set spark master URL to contain domain name?

2015-06-12 Thread Wang, Ningjun (LNG-NPV)
I think the problem is that in my local etc/hosts file, I have 10.196.116.95 WIN02 I will remove it and try. Thanks for the help. Ningjun From: prajod.vettiyat...@wipro.com [mailto:prajod.vettiyat...@wipro.com] Sent: Friday, June 12, 2015 1:44 AM To: Wang, Ningjun (LNG-NPV) Cc: user

How to set spark master URL to contain domain name?

2015-06-11 Thread Wang, Ningjun (LNG-NPV)
I start spark master on windows using bin\spark-class.cmd org.apache.spark.deploy.master.Master Then I go to http://localhost:8080/ to find the master URL; it is spark://WIN02:7077 Here WIN02 is my machine name. Why is it missing the domain name? If I start the spark master on other machines,
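One possible workaround (a sketch only; win02.mydomain.com is a placeholder for the fully qualified name) is to pass the host explicitly when starting the master, so the URL shown on port 8080 contains the domain:

    bin\spark-class.cmd org.apache.spark.deploy.master.Master --host win02.mydomain.com --port 7077

Workers and drivers would then connect with spark://win02.mydomain.com:7077.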

RE: spark on Windows 2008 failed to save RDD to windows shared folder

2015-05-26 Thread Wang, Ningjun (LNG-NPV)
: Friday, May 22, 2015 5:02 PM To: Wang, Ningjun (LNG-NPV) Cc: user@spark.apache.org Subject: Re: spark on Windows 2008 failed to save RDD to windows shared folder The stack trace is related to hdfs. Can you tell us which hadoop release you are using ? Is this a secure cluster ? Thanks On Fri

spark on Windows 2008 failed to save RDD to windows shared folder

2015-05-22 Thread Wang, Ningjun (LNG-NPV)
I used spark standalone cluster on Windows 2008. I kept on getting the following error when trying to save an RDD to a windows shared folder rdd.saveAsObjectFile("file:///T:/lab4-win02/IndexRoot01/tobacco-07/myrdd.obj") 15/05/22 16:49:05 ERROR Executor: Exception in task 0.0 in stage 12.0 (TID 1

RE: rdd.sample() method very slow

2015-05-21 Thread Wang, Ningjun (LNG-NPV)
to:so...@cloudera.com] Sent: Thursday, May 21, 2015 11:30 AM To: Wang, Ningjun (LNG-NPV) Cc: user@spark.apache.org Subject: Re: rdd.sample() method very slow I guess the fundamental issue is that these aren't stored in a way that allows random access to a Document. Underneath, Hadoop has a concept of

RE: rdd.sample() method very slow

2015-05-21 Thread Wang, Ningjun (LNG-NPV)
document). How can I do this quickly? The rdd.sample() method does not help because it needs to read the entire RDD of 7 million Documents from disk, which takes a very long time. Ningjun From: Sean Owen [mailto:so...@cloudera.com] Sent: Tuesday, May 19, 2015 4:51 PM To: Wang, Ningjun (LNG-NPV) Cc

rdd.sample() method very slow

2015-05-19 Thread Wang, Ningjun (LNG-NPV)
Hi I have an RDD[Document] that contains 7 million objects and it is saved in the file system as an object file. I want to get a random sample of about 70 objects from it using the rdd.sample() method. It is very slow val rdd : RDD[Document] = sc.objectFile[Document]("C:/temp/docs.obj").sample(false, 0.0
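If several samples are needed, one sketch of a workaround is to pay the full read of the 7 million objects once, persist the deserialized RDD, and sample from the cached copy (takeSample also returns an exact sample size):

    import org.apache.spark.storage.StorageLevel

    val docs = sc.objectFile[Document]("C:/temp/docs.obj")
    docs.persist(StorageLevel.MEMORY_AND_DISK_SER)
    docs.count()                                        // materialize the cache once

    val sample70 = docs.takeSample(withReplacement = false, num = 70)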

RE: java.io.IOException: org.apache.spark.SparkException: Failed to get broadcast_2_piece0

2015-05-07 Thread Wang, Ningjun (LNG-NPV)
a:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) Ningjun From: Jonathan Coveney [mailto:jcove...@gmail.com] Sent: Wednesday, May 06, 2015 5:23 PM To: Wang, Ningjun (LNG-NPV) Cc: Ted Yu; user@spark.apac

RE: java.io.IOException: org.apache.spark.SparkException: Failed to get broadcast_2_piece0

2015-05-06 Thread Wang, Ningjun (LNG-NPV)
:32 AM To: Wang, Ningjun (LNG-NPV) Cc: user@spark.apache.org Subject: Re: java.io.IOException: org.apache.spark.SparkException: Failed to get broadcast_2_piece0 Which release of Spark are you using ? Thanks On May 6, 2015, at 8:03 AM, Wang, Ningjun (LNG-NPV) mailto:ningjun.w...@lexisnexis.com

java.io.IOException: org.apache.spark.SparkException: Failed to get broadcast_2_piece0

2015-05-06 Thread Wang, Ningjun (LNG-NPV)
I run a job on a spark standalone cluster and got the exception below Here is the line of code that causes the problem val myRdd: RDD[(String, String, String)] = ... // RDD of (docid, category, path) myRdd.persist(StorageLevel.MEMORY_AND_DISK_SER) val cats: Array[String] = myRdd.map(t => t._2).disti

RE: How can I merge multiple DataFrame and remove duplicated key

2015-04-30 Thread Wang, Ningjun (LNG-NPV)
onvert a DataFrame to RDD and then invoke the reduceByKey Ningjun From: ayan guha [mailto:guha.a...@gmail.com] Sent: Thursday, April 30, 2015 3:41 AM To: Wang, Ningjun (LNG-NPV) Cc: user@spark.apache.org Subject: RE: How can I merge multiple DataFrame and remove duplicated key 1. Do a group

RE: How can I merge multiple DataFrame and remove duplicated key

2015-04-29 Thread Wang, Ningjun (LNG-NPV)
, value2, 2015-01-02 id2, value4, 2015-01-02 I can use reduceByKey() on an RDD but how do I do it using a DataFrame? Can you give an example code snippet? Thanks Ningjun From: ayan guha [mailto:guha.a...@gmail.com] Sent: Wednesday, April 29, 2015 5:54 PM To: Wang, Ningjun (LNG-NPV) Cc: user

How can I merge multiple DataFrame and remove duplicated key

2015-04-29 Thread Wang, Ningjun (LNG-NPV)
I have multiple DataFrame objects each stored in a parquet file. The DataFrame just contains 3 columns (id, value, timeStamp). I need to union all the DataFrame objects together but for duplicated id only keep the record with the latest timestamp. How can I do that? I can do this for RDDs b
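A sketch of the RDD route mentioned in the reply above, assuming the three columns are (id, value, timeStamp) with ISO-formatted date strings (so string comparison matches chronological order) and df1/df2 stand for the loaded parquet DataFrames:

    val merged = df1.unionAll(df2).rdd
      .map(row => (row.getString(0), row))                                      // key by id
      .reduceByKey((a, b) => if (a.getString(2) >= b.getString(2)) a else b)    // keep latest
      .values

    val dedupedDF = sqlContext.createDataFrame(merged, df1.schema)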

Can I index a column in parquet file to make it join faster

2015-04-22 Thread Wang, Ningjun (LNG-NPV)
I have two RDDs each saved in a parquet file. I need to join these two RDDs by the "id" column. Can I create an index on the id column so they can join faster? Here is the code case class Example(val id: String, val category: String) case class DocVector(val id: String, val vector: Vector) val

RE: implicits is not a member of org.apache.spark.sql.SQLContext

2015-04-22 Thread Wang, Ningjun (LNG-NPV)
Ningjun From: Ted Yu [mailto:yuzhih...@gmail.com] Sent: Tuesday, April 21, 2015 11:05 AM To: Wang, Ningjun (LNG-NPV) Cc: user@spark.apache.org Subject: Re: implicits is not a member of org.apache.spark.sql.SQLContext Have you tried the following ? import sqlContext._ import sqlContext.implicits

implicits is not a member of org.apache.spark.sql.SQLContext

2015-04-21 Thread Wang, Ningjun (LNG-NPV)
I tried to convert an RDD to a data frame using the example code on the spark website case class Person(name: String, age: Int) val sqlContext = new org.apache.spark.sql.SQLContext(sc) import sqlContext.implicits._ val people = sc.textFile("examples/src/main/resources/people.txt").map(_.split
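Two common causes, sketched below: sqlContext.implicits only exists from Spark 1.3 onward (so an older spark-sql jar on the classpath gives exactly this error), and the import only compiles from a stable identifier (a val), with the case class defined outside the method:

    case class Person(name: String, age: Int)                  // defined at top level

    val sqlContext = new org.apache.spark.sql.SQLContext(sc)   // a val, not a var
    import sqlContext.implicits._

    val people = sc.textFile("examples/src/main/resources/people.txt")
      .map(_.split(","))
      .map(p => Person(p(0), p(1).trim.toInt))
      .toDF()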

How to persist RDD returned from partitionBy() to disk?

2015-04-17 Thread Wang, Ningjun (LNG-NPV)
I have a huge RDD[Document] with millions of items. I partitioned it using HashPartitioner and saved it as an object file. But when I load the object file back into an RDD, I lose the HashPartitioner. How do I preserve the partitions when loading the object file? Here is the code val docVectors : RDD[D
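saveAsObjectFile does not write the partitioner to disk, so one sketch of a workaround (the key type, path, and partition count below are assumptions) is to re-apply the same HashPartitioner after loading and cache the result before doing partition-dependent work:

    import org.apache.spark.HashPartitioner

    // one shuffle to restore the layout, then keep it in memory
    val loaded = sc.objectFile[(String, Document)]("file:///tmp/docVectors.obj")
    val repartitioned = loaded.partitionBy(new HashPartitioner(16)).cache()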

RE: How to join RDD keyValuePairs efficiently

2015-04-16 Thread Wang, Ningjun (LNG-NPV)
ing called IndexedRDD on the web https://github.com/amplab/spark-indexedrdd Has anybody used it? Ningjun -Original Message- From: Evo Eftimov [mailto:evo.efti...@isecc.com] Sent: Thursday, April 16, 2015 12:18 PM To: 'Sean Owen'; Wang, Ningjun (LNG-NPV) Cc: user@spark.apache.org Sub

RE: How to join RDD keyValuePairs efficiently

2015-04-16 Thread Wang, Ningjun (LNG-NPV)
Does anybody have a solution for this? From: Wang, Ningjun (LNG-NPV) Sent: Tuesday, April 14, 2015 10:41 AM To: user@spark.apache.org Subject: How to join RDD keyValuePairs efficiently I have an RDD that contains millions of Document objects. Each document has a unique Id that is a string. I

RE: Is the disk space in SPARK_LOCAL_DIRS cleaned up?

2015-04-14 Thread Wang, Ningjun (LNG-NPV)
worker.cleanup.enabled=true -Dspark.worker.cleanup.appDataTtl=" On 11.04.2015, at 00:01, Wang, Ningjun (LNG-NPV) mailto:ningjun.w...@lexisnexis.com wrote: Does anybody have an answer for this? Thanks Ningjun From: Wang, Ningjun (LNG-NPV) Sent: Thursday, April 02, 2015 12:14 PM To: user@spark.apache.org

How to join RDD keyValuePairs efficiently

2015-04-14 Thread Wang, Ningjun (LNG-NPV)
I have an RDD that contains millions of Document objects. Each document has a unique Id that is a string. I need to find the documents by ids quickly. Currently I use an RDD join as follows. First I save the RDD as an object file allDocs : RDD[Document] = getDocs() // this RDD contains 7 million Doc
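One sketch of a faster lookup path (assuming Document exposes its id as a field; the partition count and id string are placeholders): key the RDD by id, hash-partition it, and persist it, so that lookup(id) only scans the single partition that can contain the key:

    import org.apache.spark.HashPartitioner

    val docsById = allDocs.map(d => (d.id, d))
      .partitionBy(new HashPartitioner(64))
      .persist()

    val hits: Seq[Document] = docsById.lookup("some-doc-id")   // placeholder id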

RE: Is the disk space in SPARK_LOCAL_DIRS cleaned up?

2015-04-10 Thread Wang, Ningjun (LNG-NPV)
Does anybody have an answer for this? Thanks Ningjun From: Wang, Ningjun (LNG-NPV) Sent: Thursday, April 02, 2015 12:14 PM To: user@spark.apache.org Subject: Is the disk space in SPARK_LOCAL_DIRS cleaned up? I set SPARK_LOCAL_DIRS to C:\temp\spark-temp. When RDDs are shuffled, spark

Is the disk space in SPARK_LOCAL_DIRS cleaned up?

2015-04-02 Thread Wang, Ningjun (LNG-NPV)
I set SPARK_LOCAL_DIRS to C:\temp\spark-temp. When RDDs are shuffled, spark writes to this folder. I found that the disk space of this folder keeps increasing quickly and at a certain point I will run out of disk space. I wonder whether spark cleans up the disk space in this folder once the shuffle
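For a standalone worker, the cleanup properties mentioned in the reply above can be enabled, for example via SPARK_WORKER_OPTS (the values here are examples only; the interval and TTL are in seconds):

    SPARK_WORKER_OPTS="-Dspark.worker.cleanup.enabled=true -Dspark.worker.cleanup.interval=1800 -Dspark.worker.cleanup.appDataTtl=604800"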

RE: How to get rdd count() without double evaluation of the RDD?

2015-03-30 Thread Wang, Ningjun (LNG-NPV)
: Mark Hamstra [mailto:m...@clearstorydata.com] Sent: Thursday, March 26, 2015 12:37 PM To: Sean Owen Cc: Wang, Ningjun (LNG-NPV); user@spark.apache.org Subject: Re: How to get rdd count() without double evaluation of the RDD? You can also always take the more extreme approach of using SparkContext

How to get rdd count() without double evaluation of the RDD?

2015-03-26 Thread Wang, Ningjun (LNG-NPV)
I have an RDD that is expensive to compute. I want to save it as an object file and also print the count. How can I avoid double computation of the RDD? val rdd = sc.textFile(someFile).map(line => expensiveCalculation(line)) val count = rdd.count() // this forces computation of the rdd println(count
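One sketch that avoids caching the whole RDD: count inside the single pass that writes the file, using an accumulator (the output path is a placeholder, and accumulator values can overcount if tasks are retried):

    val counter = sc.accumulator(0L)

    val rdd = sc.textFile(someFile).map { line =>
      counter += 1L
      expensiveCalculation(line)
    }
    rdd.saveAsObjectFile("file:///tmp/expensive.obj")   // the only action

    println(counter.value)                              // read on the driver afterwards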

Total size of serialized results is bigger than spark.driver.maxResultSize

2015-03-25 Thread Wang, Ningjun (LNG-NPV)
Hi I ran a spark job and got the following error. Can anybody tell me how to work around this problem? For example how can I increase spark.driver.maxResultSize? Thanks. org.apache.spark.SparkException: Job aborted due to stage failure: Total size of serialized results of 128 tasks (1029.1 MB)
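spark.driver.maxResultSize is an ordinary Spark property, so as a sketch (2g is an arbitrary value) it can be raised on the SparkConf, in spark-defaults.conf, or on spark-submit with --conf spark.driver.maxResultSize=2g:

    import org.apache.spark.SparkConf

    val conf = new SparkConf().set("spark.driver.maxResultSize", "2g")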

RE: sc.textFile() on windows cannot access UNC path

2015-03-12 Thread Wang, Ningjun (LNG-NPV)
: Wednesday, March 11, 2015 2:40 AM To: Wang, Ningjun (LNG-NPV) Cc: java8964; user@spark.apache.org Subject: Re: sc.textFile() on windows cannot access UNC path I don't have a complete example for your use case, but you can see a lot of code showing how to use newAPIHadoopFile from here

RE: Is it possible to use windows service to start and stop spark standalone cluster

2015-03-11 Thread Wang, Ningjun (LNG-NPV)
Thanks for the suggestion. I will try that. Ningjun From: Silvio Fiorito [mailto:silvio.fior...@granturing.com] Sent: Wednesday, March 11, 2015 12:40 AM To: Wang, Ningjun (LNG-NPV); user@spark.apache.org Subject: Re: Is it possible to use windows service to start and stop spark standalone

Is it possible to use windows service to start and stop spark standalone cluster

2015-03-10 Thread Wang, Ningjun (LNG-NPV)
We are using a spark standalone cluster on Windows 2008 R2. I can start spark clusters by opening a command prompt and running the following bin\spark-class.cmd org.apache.spark.deploy.master.Master bin\spark-class.cmd org.apache.spark.deploy.worker.Worker spark://mywin.mydomain.com:7077 I can stop s

RE: sc.textFile() on windows cannot access UNC path

2015-03-10 Thread Wang, Ningjun (LNG-NPV)
, March 10, 2015 9:14 AM To: java8964 Cc: Wang, Ningjun (LNG-NPV); user@spark.apache.org Subject: Re: sc.textFile() on windows cannot access UNC path You can create your own Input Reader (using java.nio.*) and pass it to the sc.newAPIHadoopFile while reading. Thanks Best Regards On Tue, Mar 10, 2015

RE: sc.textFile() on windows cannot access UNC path

2015-03-09 Thread Wang, Ningjun (LNG-NPV)
(...)? Ningjun From: java8964 [mailto:java8...@hotmail.com] Sent: Monday, March 09, 2015 5:33 PM To: Wang, Ningjun (LNG-NPV); user@spark.apache.org Subject: RE: sc.textFile() on windows cannot access UNC path This is a Java problem, not really Spark. From this page: http://stackoverfl

sc.textFile() on windows cannot access UNC path

2015-03-09 Thread Wang, Ningjun (LNG-NPV)
I am running Spark on windows 2008 R2. I use sc.textFile() to load a text file using a UNC path, but it does not work. sc.textFile(raw"file:10.196.119.230/folder1/abc.txt", 4).count() Input path does not exist: file:/10.196.119.230/folder1/abc.txt org.apache.hadoop.mapred.InvalidInputException: Inp

RE: How to union RDD and remove duplicated keys

2015-02-13 Thread Wang, Ningjun (LNG-NPV)
appreciated because I am new to Spark. Ningjun From: Boromir Widas [mailto:vcsub...@gmail.com] Sent: Friday, February 13, 2015 1:28 PM To: Wang, Ningjun (LNG-NPV) Cc: user@spark.apache.org Subject: Re: How to union RDD and remove duplicated keys reduceByKey should work, but you need to define the ordering

How to union RDD and remove duplicated keys

2015-02-13 Thread Wang, Ningjun (LNG-NPV)
I have multiple RDD[(String, String)] that store (docId, docText) pairs, e.g. rdd1: ("id1", "Long text 1"), ("id2", "Long text 2"), ("id3", "Long text 3") rdd2: ("id1", "Long text 1 A"), ("id2", "Long text 2 A") rdd3: ("id1", "Long text 1 B") Then, I want to merge all RDDs. If there is dup
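A sketch of the reduceByKey approach from the reply above, assuming that when an id appears in several RDDs the text from the later RDD should win (rdd3 over rdd2 over rdd1):

    val merged = (rdd1.map { case (id, text) => (id, (1, text)) } union
                  rdd2.map { case (id, text) => (id, (2, text)) } union
                  rdd3.map { case (id, text) => (id, (3, text)) })
      .reduceByKey((a, b) => if (a._1 >= b._1) a else b)   // keep the highest-priority copy
      .mapValues { case (_, text) => text }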

RE: Fail to launch spark-shell on windows 2008 R2

2015-02-03 Thread Wang, Ningjun (LNG-NPV)
integrate with our existing app easily. Has anybody used spark on windows for a production system? Is spark reliable on windows? Ningjun From: gen tang [mailto:gen.tan...@gmail.com] Sent: Thursday, January 29, 2015 12:53 PM To: Wang, Ningjun (LNG-NPV) Cc: user@spark.apache.org Subject: Re: Fail to

RE: Fail to launch spark-shell on windows 2008 R2

2015-01-29 Thread Wang, Ningjun (LNG-NPV)
only use local file system and do not have any hdfs file system at all. I don’t understand why spark generates so many Hadoop-related errors while we don’t even need hdfs. Ningjun From: gen tang [mailto:gen.tan...@gmail.com] Sent: Thursday, January 29, 2015 10:45 AM To: Wang, Ningjun (LNG-NPV) Cc: user

Fail to launch spark-shell on windows 2008 R2

2015-01-29 Thread Wang, Ningjun (LNG-NPV)
I deployed spark-1.1.0 on Windows 7 and was able to launch the spark-shell. I then deployed it to windows 2008 R2 and launched the spark-shell, and I got the error java.lang.RuntimeException: Error while running command to get file permissions : java.io.IOException: Cannot run program "ls": CreateProce

RE: Spark on Windows 2008 R2 server does not work

2015-01-29 Thread Wang, Ningjun (LNG-NPV)
n Windows? Regards, Ningjun Wang Consulting Software Engineer LexisNexis 121 Chanlon Road New Providence, NJ 07974-1541 -Original Message- From: Marcelo Vanzin [mailto:van...@cloudera.com] Sent: Wednesday, January 28, 2015 5:15 PM To: Wang, Ningjun (LNG-NPV) Cc: user@spark.apache.org S

RE: Spark on Windows 2008 R2 server does not work

2015-01-28 Thread Wang, Ningjun (LNG-NPV)
Has anybody successfully installed and run spark-1.2.0 on windows 2008 R2 or windows 7? How did you get it to work? Regards, Ningjun Wang Consulting Software Engineer LexisNexis 121 Chanlon Road New Providence, NJ 07974-1541 From: Wang, Ningjun (LNG-NPV) [mailto:ningjun.w...@lexisnexis.com] Sent

Spark on Windows 2008 R2 server does not work

2015-01-27 Thread Wang, Ningjun (LNG-NPV)
I download and install spark-1.2.0-bin-hadoop2.4.tgz pre-built version on Windows 2008 R2 server. When I submit a job using spark-submit, I got the following error WARN org.apache.hadoop.util.NativeCodeLoader: Unable to load native-hadoop library for your platform ... using builtin-java classe

RE: How to start spark master on windows

2015-01-27 Thread Wang, Ningjun (LNG-NPV)
Never mind, the problem was that JAVA was not installed on windows. I installed JAVA and the problem went away. Regards, Ningjun Wang Consulting Software Engineer LexisNexis 121 Chanlon Road New Providence, NJ 07974-1541 From: Wang, Ningjun (LNG-NPV) [mailto:ningjun.w...@lexisnexis.com] Sent

How to start spark master on windows

2015-01-27 Thread Wang, Ningjun (LNG-NPV)
I downloaded spark 1.2.0 on my windows server 2008. How do I start the spark master? I tried to run the following at a command prompt C:\spark-1.2.0-bin-hadoop2.4> bin\spark-class.cmd org.apache.spark.deploy.master.Master I got the error "else was unexpected at this time." Ningjun

RE: sparkcontext.objectFile return thousands of partitions

2015-01-22 Thread Wang, Ningjun (LNG-NPV)
contains thousands of partitions instead of 8 partitions Regards, Ningjun Wang Consulting Software Engineer LexisNexis 121 Chanlon Road New Providence, NJ 07974-1541 From: Sean Owen [mailto:so...@cloudera.com] Sent: Wednesday, January 21, 2015 2:32 PM To: Wang, Ningjun (LNG-NPV) Cc: user

sparkcontext.objectFile return thousands of partitions

2015-01-21 Thread Wang, Ningjun (LNG-NPV)
Why does sc.objectFile(...) return an RDD with thousands of partitions? I save an rdd to the file system using rdd.saveAsObjectFile("file:///tmp/mydir") Note that the rdd contains 7 million objects. I checked the directory /tmp/mydir/; it contains 8 partitions part-0 part-2 part-4 part-6
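Two sketches for getting back to a small number of partitions (MyType and the counts are placeholders): objectFile takes a minimum-partition hint, and coalesce can shrink an already-loaded RDD without a shuffle:

    val rdd = sc.objectFile[MyType]("file:///tmp/mydir", minPartitions = 8)
    // or, after loading:
    val small = rdd.coalesce(8)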

RE: Can I save RDD to local file system and then read it back on spark cluster with multiple nodes?

2015-01-20 Thread Wang, Ningjun (LNG-NPV)
Can anybody answer this? Do I have to have hdfs to achieve this? Regards, Ningjun Wang Consulting Software Engineer LexisNexis 121 Chanlon Road New Providence, NJ 07974-1541 From: Wang, Ningjun (LNG-NPV) [mailto:ningjun.w...@lexisnexis.com] Sent: Friday, January 16, 2015 1:15 PM To: Imran

RE: Can I save RDD to local file system and then read it back on spark cluster with multiple nodes?

2015-01-16 Thread Wang, Ningjun (LNG-NPV)
, Ningjun From: imranra...@gmail.com [mailto:imranra...@gmail.com] On Behalf Of Imran Rashid Sent: Friday, January 16, 2015 12:14 PM To: Wang, Ningjun (LNG-NPV) Cc: user@spark.apache.org Subject: Re: Can I save RDD to local file system and then read it back on spark cluster with multiple nodes? I&#

RE: How to force parallel processing of RDD using multiple thread

2015-01-16 Thread Wang, Ningjun (LNG-NPV)
: Friday, January 16, 2015 9:44 AM To: Wang, Ningjun (LNG-NPV) Cc: Sean Owen; user@spark.apache.org Subject: Re: How to force parallel processing of RDD using multiple thread Spark will use the number of cores available in the cluster. If your cluster is 1 node with 4 cores, Spark will execute up to

Can I save RDD to local file system and then read it back on spark cluster with multiple nodes?

2015-01-16 Thread Wang, Ningjun (LNG-NPV)
I have asked this question before but got no answer. Asking again. Can I save an RDD to the local file system and then read it back on a spark cluster with multiple nodes? rdd.saveAsObjectFile("file:///home/data/rdd1") val rdd2 = sc.objectFile("file:///home/data/rdd1") This will work if the clus

RE: How to force parallel processing of RDD using multiple thread

2015-01-16 Thread Wang, Ningjun (LNG-NPV)
: Wang, Ningjun (LNG-NPV) Cc: user@spark.apache.org Subject: Re: How to force parallel processing of RDD using multiple thread Check the number of partitions in your input. It may be much less than the available parallelism of your small cluster. For example, input that lives in just 1 partition

RE: How to force parallel processing of RDD using multiple thread

2015-01-15 Thread Wang, Ningjun (LNG-NPV)
Providence, NJ 07974-1541 -Original Message- From: Sean Owen [mailto:so...@cloudera.com] Sent: Thursday, January 15, 2015 4:29 PM To: Wang, Ningjun (LNG-NPV) Cc: user@spark.apache.org Subject: Re: How to force parallel processing of RDD using multiple thread What is your cluster manager

How to force parallel processing of RDD using multiple thread

2015-01-15 Thread Wang, Ningjun (LNG-NPV)
I have a standalone spark cluster with only one node with 4 CPU cores. How can I force spark to do parallel processing of my RDD using multiple threads? For example I can do the following spark-submit --master local[4] However I really want to use the cluster as follows spark-submit --master
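Besides the master URL, parallelism is bounded by the number of partitions in the RDD, so a sketch of the usual fix (4 matches the core count; the path is a placeholder) is to give the input at least that many partitions, or to repartition it:

    val lines = sc.textFile("file:///data/input.txt", minPartitions = 4)
    // or, for an RDD that already exists:
    val rdd4 = lines.repartition(4)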

Can I save RDD to local file system and then read it back on spark cluster with multiple nodes?

2015-01-14 Thread Wang, Ningjun (LNG-NPV)
Can I save an RDD to the local file system and then read it back on a spark cluster with multiple nodes? rdd.saveAsObjectFile("file:///home/data/rdd1") val rdd2 = sc.objectFile("file:///home/data/rdd1") This will work if the cluster has only one node. But my cluster has 3 nodes and each node has

RE: Failed to save RDD as text file to local file system

2015-01-13 Thread Wang, Ningjun (LNG-NPV)
From: Prannoy [via Apache Spark User List] [mailto:[hidden email]] Sent: Monday, January 12, 2015 4:18 AM To: Wang, Ningjun (LNG-NPV) Subject: Re: Failed to save RDD as text file to local file system Have you tried simple

RE: Failed to save RDD as text file to local file system

2015-01-13 Thread Wang, Ningjun (LNG-NPV)
1.run(Server.java:2009) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1642)

Cannot save RDD as text file to local file system

2015-01-08 Thread Wang, Ningjun (LNG-NPV)
I try to save an RDD as a text file to the local file system (Linux) but it does not work. Launch spark-shell and run the following val r = sc.parallelize(Array("a", "b", "c")) r.saveAsTextFile("file:///home/cloudera/tmp/out1") IOException: Mkdirs failed to create file:/home/cloudera/tmp/out1/_temporary/

subscribe me to the list

2014-12-05 Thread Wang, Ningjun (LNG-NPV)
I would like to subscribe to the user@spark.apache.org Regards, Ningjun Wang Consulting Software Engineer LexisNexis 121 Chanlon Road New Providence, NJ 07974-1541

SparkContext.textFile() cannot load file using UNC path on windows

2014-11-26 Thread Wang, Ningjun (LNG-NPV)
SparkContext.textFile() cannot load file using UNC path on windows I run the following on Windows XP val conf = new SparkConf().setAppName("testproj1.ClassificationEngine").setMaster("local") val sc = new SparkContext(conf) sc.textFile(raw"\\10.209.128.150\TempShare\SvmPocData\reute