Do existing R packages work with SparkR data frames

2015-12-22 Thread Lan
Hello, Is it possible for existing R machine learning packages (which work with R data frames), such as bnlearn, to work with SparkR data frames? Or do I need to convert SparkR data frames to R data frames? Is "collect" the function to do the conversion, or how else can it be done? Many Thanks, Lan

ERROR EndpointWriter: AssociationError

2015-02-07 Thread Lan
…the remote system terminated the association because it is shutting down.] More about the setup: each VM has only 4GB RAM, running Ubuntu, using spark-1.2.0 built for Hadoop 2.6.0. I have struggled with this error for a few days. Could anyone please tell me what the problem is and how to fix it? Thanks, Lan

Re: akka.remote.transport.Transport$InvalidAssociationException: The remote system terminated the association because it is shutting down

2015-02-11 Thread Lan
Hi Alexey and Daniel, I'm using Spark 1.2.0 and still getting the same error, as described below. Do you have any news on this? I really appreciate your responses! "a Spark cluster of 1 master VM SparkV1 and 1 worker VM SparkV4 (the error is the same if I have 2 workers). They are connected witho…

Why is Columnar Parquet used as default for saving Row-based DataFrames/RDD?

2015-04-20 Thread Lan
Hello, I have the above naive question if anyone could help. Why not use a row-based file format to save row-based DataFrames/RDDs? Thanks, Lan

getting WARN ReliableDeliverySupervisor

2015-07-01 Thread xiaohe lan
Hi experts, Hadoop version: 2.4, Spark version: 1.3.1. I am running the SparkPi example application: bin/spark-submit --class org.apache.spark.examples.SparkPi --master yarn-client --executor-memory 2G lib/spark-examples-1.3.1-hadoop2.4.0.jar 2. The same command sometimes gets WARN ReliableDeliverySupervisor…

Re: getting WARN ReliableDeliverySupervisor

2015-07-02 Thread xiaohe lan
Changing the JDK from 1.8.0_45 to 1.7.0_79 solved this issue. I saw https://issues.apache.org/jira/browse/SPARK-6388, but it is not a problem, however. On Thu, Jul 2, 2015 at 1:30 PM, xiaohe lan wrote: > Hi experts, > > Hadoop version: 2.4 > Spark version: 1.3.1 > > I am running t…

MLLib + Streaming

2016-03-05 Thread Lan Jiang
…should be able to run in the streaming application. Am I wrong? Lan

Spark ML and Streaming

2016-03-06 Thread Lan Jiang
…should be able to run in the streaming application. Am I wrong? Thanks in advance. Lan

Re: Spark ML and Streaming

2016-03-06 Thread Lan Jiang
Sorry, accidentally sent again. My apologies. > On Mar 6, 2016, at 1:22 PM, Lan Jiang wrote: > > Hi there, > > I hope someone can clarify this for me. It seems that some of the MLlib > algorithms such as KMeans, Linear Regression and Logistic Regression have a > Streaming…
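
For context, a minimal sketch of what those streaming variants look like in Scala, assuming `sc` is an existing SparkContext and the HDFS path is hypothetical:

```scala
import org.apache.spark.mllib.clustering.StreamingKMeans
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(sc, Seconds(10))

// Each line in the monitored directory is a space-separated feature vector.
val trainingData = ssc.textFileStream("hdfs:///training")
  .map(line => Vectors.dense(line.split(' ').map(_.toDouble)))

// Cluster centers are updated incrementally with every micro-batch.
val model = new StreamingKMeans()
  .setK(3)
  .setDecayFactor(1.0)
  .setRandomCenters(dim = 2, weight = 0.0)

model.trainOn(trainingData)
ssc.start()
ssc.awaitTermination()
```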

Re: MLLib + Streaming

2016-03-06 Thread Lan Jiang
…online and offline learning. Lan > On Mar 6, 2016, at 2:43 AM, Chris Miller wrote: > > Guru: This is a really great response. Thanks for taking the time to explain > all of this. Helpful for me too. > > > -- > Chris Miller > > On Sun, Mar 6, 2016 at 1:54…

Processing json document

2016-07-06 Thread Lan Jiang
Hi there, Spark has provided a JSON document processing feature for a long time. In most examples I see, each line is a JSON object in the sample file. That is the easiest case. But how can we process a JSON document that does not conform to this standard format (one line per JSON object)? Here is…

Re: Processing json document

2016-07-07 Thread Lan Jiang
…this would only work in a single executor, which I think will >> end up with an OutOfMemoryException. >> >> The Spark JSON data source does not support multi-line JSON as input due to >> the limitation of TextInputFormat and LineRecordReader. >> >> You may have to just extrac…
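
In the Spark 1.x era discussed here, a common workaround was to read each document whole with wholeTextFiles. A hedged sketch, assuming the files are individually small and `sqlContext` exists (Spark 2.2 later added a built-in multiLine option to the JSON reader):

```scala
// Each (small) JSON document becomes one record: (path, fullContent).
val docs = sc.wholeTextFiles("hdfs:///data/docs/*.json")

// read.json(RDD[String]) parses each element as one JSON document,
// so embedded newlines are no longer a problem.
val df = sqlContext.read.json(docs.map { case (_, content) => content })
df.printSchema()
```

As the reply notes, this keeps each whole file in memory at once, so it only suits modestly sized documents.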

Spark Yarn executor container memory

2016-08-15 Thread Lan Jiang
My question is why it does not count permgen size and the memory used by thread stacks; they are not part of the max heap size. IMHO, the YARN executor container memory should be set to: spark.executor.memory + [-XX:MaxPermSize] + number_of_threads * [-Xss] + spark.yarn.executor.memoryOverhead. What did I…
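
A worked example of the proposed formula; all numbers below are illustrative assumptions, not Spark defaults except where noted:

```scala
// Illustrative sizing only; real values depend on your JVM flags and workload.
val executorMemoryMiB = 4096                                  // spark.executor.memory
val maxPermSizeMiB    = 256                                   // -XX:MaxPermSize (pre-Java-8 permgen)
val threadStackMiB    = 1                                     // -Xss, per thread
val numThreads        = 100
val overheadMiB       = math.max(executorMemoryMiB / 10, 384) // Spark 1.4+ default: 10%, min 384

val proposed = executorMemoryMiB + maxPermSizeMiB + numThreads * threadStackMiB + overheadMiB
// What YARN actually requests is executorMemory + memoryOverhead; the overhead is
// the bucket intended to absorb permgen, thread stacks, and other off-heap usage.
println(s"proposed container size: $proposed MiB")
```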

Re: Scala VS Java VS Python

2015-12-16 Thread Lan Jiang
…does not have a REPL shell, which is a major drawback from my perspective. Lan > On Dec 16, 2015, at 3:46 PM, Stephen Boesch wrote: > > There are solid reasons to have built Spark on the JVM vs Python. The > question for Daniel appears at this point to be Scala vs Java 8. For that…

Question about Spark Streaming checkpoint interval

2015-12-18 Thread Lan Jiang
I cannot find an answer in the documentation saying whether metadata checkpointing is done for each batch and whether the checkpointInterval setting applies to both types of checkpointing. Maybe I missed it. If anyone can point me to the right documentation, I would highly appreciate it. Best Regards, Lan
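
For reference, a sketch of the two knobs in Scala (paths, host, and intervals are hypothetical; `sc` is assumed). Once a checkpoint directory is set, metadata checkpointing is tied to batch generation, while the per-DStream checkpoint interval governs data (RDD) checkpointing:

```scala
import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(sc, Seconds(10))

// Enables checkpointing; metadata checkpoints are written as batches are generated.
ssc.checkpoint("hdfs:///checkpoints/myApp")

val lines = ssc.socketTextStream("somehost", 9999)

// The data (RDD) checkpoint interval is set per DStream; a common rule of
// thumb is 5-10x the batch interval.
lines.checkpoint(Seconds(50))
```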

broadcast join in SparkSQL requires analyze table noscan

2016-02-10 Thread Lan Jiang
Hi there, I am looking at the SparkSQL setting spark.sql.autoBroadcastJoinThreshold. According to the programming guide: *Note that currently statistics are only supported for Hive Metastore tables where the command ANALYZE TABLE COMPUTE STATISTICS noscan has been run.* My question is: is the "NOSCAN" option a must?…

Re: broadcast join in SparkSQL requires analyze table noscan

2016-02-10 Thread Lan Jiang
Michael, thanks for the reply. On Wed, Feb 10, 2016 at 11:44 AM, Michael Armbrust wrote: > My question is: is the "NOSCAN" option a must? If I execute the "ANALYZE TABLE >> compute statistics" command in the Hive shell, are the statistics >> going to be used by SparkSQL to decide on a broadcast join? >…
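
A hedged sketch of the flow being discussed, assuming `sqlContext` is a HiveContext and the table names are made up:

```scala
// Writes table-level stats (e.g., totalSize) to the Hive metastore without
// scanning the data; SparkSQL reads these to decide on broadcast joins.
sqlContext.sql("ANALYZE TABLE small_dim COMPUTE STATISTICS noscan")

// Tables whose estimated size falls under this threshold (default ~10 MB)
// are broadcast to all executors instead of being shuffled.
sqlContext.sql("SET spark.sql.autoBroadcastJoinThreshold=10485760")

val joined = sqlContext.sql(
  "SELECT f.*, d.name FROM fact f JOIN small_dim d ON f.dim_id = d.id")
joined.explain()  // look for BroadcastHashJoin in the plan
```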

unintended consequence of using coalesce operation

2015-09-29 Thread Lan Jiang
…5. Is my understanding correct? In this case, I think repartition is a better choice than coalesce. Lan
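
The unintended consequence in question, sketched below; `bigRdd` and `expensiveTransform` stand in for the real workload:

```scala
// coalesce(n) avoids a shuffle by narrowing the dependency, which means the
// upstream map work itself runs in only n tasks: parallelism is lost for the
// whole stage, not just the final write.
val narrowed = bigRdd.map(expensiveTransform).coalesce(5)

// repartition(n) is coalesce(n, shuffle = true): the map runs at full
// parallelism and only the shuffle collapses the result into 5 partitions.
val reshuffled = bigRdd.map(expensiveTransform).repartition(5)
```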

How to access lost executor log file

2015-10-01 Thread Lan Jiang
…executors to find out why they were lost? Thanks, Lan

Re: How to access lost executor log file

2015-10-01 Thread Lan Jiang
L" in the application overview section. When I click it, it brings me to the spark history server UI, where I cannot find the lost exectuors. The only logs link I can find one the YARN RM site is the ApplicationMaster log, which is not what I need. Did I miss something? Lan On Thu, Oct 1, 20

"java.io.IOException: Filesystem closed" on executors

2015-10-01 Thread Lan Jiang
Hi there, here is the problem I ran into when executing a Spark job (Spark 1.3). The Spark job loads a bunch of Avro files using the Spark SQL spark-avro 1.0.0 library, then does some filter/map transformations, repartitions to 1 partition, and writes to HDFS. It creates 2 stages. The total H…

Re: "java.io.IOException: Filesystem closed" on executors

2015-10-05 Thread Lan Jiang
…to 1, write the result to HDFS. I use Spark 1.3 with spark-avro (1.0.0). The error only happens when running on the whole dataset. When running on 1/3 of the files, the same job completes without error. On Thu, Oct 1, 2015 at 2:41 PM, Lan Jiang wrote: > Hi there, > > Here is the prob…

Spark cache memory storage

2015-10-06 Thread Lan Jiang
…be 6g; thus I expect the memory cache to be 6 * 0.9 * 0.6 = 3.24g. However, the Spark history server shows the reserved cache size for each executor as 3.1g, so it does not add up. What do I miss? Lan
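
One plausible explanation, not confirmed in the thread: in Spark 1.x the storage fractions are applied to Runtime.getRuntime.maxMemory, which is smaller than -Xmx because the JVM excludes one survivor space, so the result lands below the naive 3.24g:

```scala
// Illustrative arithmetic only: the exact maxMemory value is JVM/GC dependent.
val xmx       = 6.0                    // -Xmx in GiB
val maxMemory = xmx * 0.95             // Runtime.getRuntime.maxMemory, roughly
val storage   = maxMemory * 0.6 * 0.9  // memoryFraction * safetyFraction
println(f"usable storage memory: $storage%.2f GiB")  // ~3.08, close to the observed 3.1g
```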

failed spark job reports on YARN as successful

2015-10-08 Thread Lan Jiang
…still a bug, or is there something I need to do in the Spark application to report the correct job status to YARN? Lan

Re: How to increase Spark partitions for the DataFrame?

2015-10-08 Thread Lan Jiang
The partition number should be the same as the HDFS block count, not the file count. Did you confirm from the Spark UI that only 12 partitions were created? What is your ORC orc.stripe.size? Lan > On Oct 8, 2015, at 1:13 PM, unk1102 wrote: > > Hi, I have the following code whe…

Re: How to increase Spark partitions for the DataFrame?

2015-10-08 Thread Lan Jiang
Hmm, that's odd. You can always use repartition(n) to increase the partition number, but then there will be a shuffle. How large are your ORC files? Have you used the NameNode UI to check how many HDFS blocks each ORC file has? Lan > On Oct 8, 2015, at 2:08 PM, Umesh Kacha wrote:…

Re: "java.io.IOException: Filesystem closed" on executors

2015-10-14 Thread Lan Jiang
…the problem. After I increased spark.yarn.executor.memoryOverhead, it worked fine. I was using Spark 1.3, which has a default value of executorMemory * 0.07, with a minimum of 384. In Spark 1.4 and later, the default was changed to executorMemory * 0.10, with a minimum of 384. Lan On…

Re: Ahhhh... Spark creates >30000 partitions... What can I do?

2015-10-20 Thread Lan Jiang
…splittable and will not create so many partitions. Lan > On Oct 20, 2015, at 8:03 AM, François Pelletier wrote: > > You should aggregate your files into larger chunks before doing anything else. > HDFS is not fit for small files. It will bloat it and cause you a lot of…

Re: Ahhhh... Spark creates >30000 partitions... What can I do?

2015-10-20 Thread Lan Jiang
I think the data files are binary per the original post, so in this case sc.binaryFiles should be used. However, I still recommend against using so many small binary files, as 1. they are not good for batch I/O and 2. they put too much memory pressure on the NameNode. Lan > On Oct 20, 2015, at 11…
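
A sketch of the sc.binaryFiles route, with the caveats above; the path is hypothetical:

```scala
import org.apache.spark.input.PortableDataStream
import org.apache.spark.rdd.RDD

// One record per file: (path, lazily opened stream). Binary files are not
// split, so each small file maps to exactly one element.
val bins: RDD[(String, PortableDataStream)] = sc.binaryFiles("hdfs:///data/bins")

val sizes = bins.map { case (path, stream) =>
  val bytes = stream.toArray()  // safe only because each file is small
  (path, bytes.length)
}
```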

specify yarn-client for --master from a laptop

2015-10-27 Thread xiaohe lan
Hi, I have a Hadoop 2.4 cluster running on some remote VMs; can I start spark-shell or spark-submit from my laptop? For example: bin/spark-shell --master yarn-client. If this is possible, how can I do it? I have copied the same Hadoop to my laptop (but I don't run Hadoop on my laptop), and I have also set:…

Re: Protobuff 3.0 for Spark

2015-11-04 Thread Lan Jiang
…protobuf 3 jar file, either through --jars during spark-submit or by packaging it into an uber jar file with your own classes. Lan > On Nov 4, 2015, at 4:07 PM, Cassa L wrote: > > Hi, > Does Spark support protobuf 3.0? I used protobuf 2.5 with spark-1.4 built > for HDP 2.3. Given…

Re: Protobuff 3.0 for Spark

2015-11-09 Thread Lan Jiang
I have not run into any linkage problems, but maybe I was lucky. :-) The reason I wanted to use protobuf 3 is mainly for map type support. On Thu, Nov 5, 2015 at 4:43 AM, Steve Loughran wrote: > > > On 5 Nov 2015, at 00:12, Lan Jiang wrote: > > > > I have used protobu…
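
One common way to head off such linkage problems with the uber-jar approach is to shade protobuf inside it. A sketch for build.sbt, assuming sbt-assembly 0.14+; the rename target is arbitrary:

```scala
// build.sbt (sbt-assembly): relocate protobuf 3 classes so they cannot clash
// with the protobuf 2.x that ships with Spark/Hadoop.
assemblyShadeRules in assembly := Seq(
  ShadeRule.rename("com.google.protobuf.**" -> "my_shaded.protobuf.@1").inAll
)
```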

create a table for csv files

2015-11-19 Thread xiaohe lan
Hi, I have some CSV files in HDFS with headers like col1, col2, col3. I want to add a column named id, so that each record carries one. How can I do this using Spark SQL? Can id be auto-increment? Thanks, Xiaohe
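
A hedged sketch of one way to do this in that era (Spark 1.x with the external spark-csv package; the file path is hypothetical). Note the generated IDs are unique and increasing, but not consecutive:

```scala
import org.apache.spark.sql.functions.monotonicallyIncreasingId

// Spark 1.x reads CSV with headers through the Databricks spark-csv package.
val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .load("hdfs:///data/input.csv")

// Unique, monotonically increasing 64-bit IDs; the partition ID lives in the
// upper bits, so this is not a true consecutive auto-increment.
val withId = df.withColumn("id", monotonicallyIncreasingId())
```

A truly consecutive sequence would require something like zipWithIndex on the underlying RDD, at the cost of an extra job.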

SparkPi is geting java.lang.NoClassDefFoundError: scala/collection/Seq

2015-08-16 Thread xiaohe lan
Hi, I am trying to run SparkPi in IntelliJ and am getting a NoClassDefFoundError. Has anyone else seen this issue before? Exception in thread "main" java.lang.NoClassDefFoundError: scala/collection/Seq at org.apache.spark.examples.SparkPi.main(SparkPi.scala) at sun.reflect.NativeMethodAccessorImpl.invoke0…

Re: SparkPi is geting java.lang.NoClassDefFoundError: scala/collection/Seq

2015-08-17 Thread xiaohe lan
…is provided; you need to change > it to compile to run SparkPi in IntelliJ. As I remember, you also need to > change the guava and jetty related libraries to compile too. > > On Mon, Aug 17, 2015 at 2:14 AM, xiaohe lan wrote: > >> Hi, >> >> I am trying to run Spark…
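
The scope change being described, sketched for an sbt build; the version number is only an example for that timeframe:

```scala
// build.sbt: cluster builds mark Spark "provided" so it is excluded from the
// assembly, but then the IDE's run configuration has no Spark on its classpath:
// libraryDependencies += "org.apache.spark" %% "spark-core" % "1.4.1" % "provided"

// For running examples inside IntelliJ, keep it on the compile classpath:
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.4.1"
```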

add external jar file to Spark shell vs. Scala Shell

2015-09-14 Thread Lan Jiang
Hi there, I ran into a problem when I tried to pass an external jar file to spark-shell. I have an uber jar file that contains all the Java code I created for protobuf and all its dependencies. If I simply execute my code using the Scala shell, it works fine without error. I use -cp to pass the extern…

Change protobuf version or any other third party library version in Spark application

2015-09-14 Thread Lan Jiang
…and to extend the question to any third-party library: how do you deal with a version conflict for any third-party library included in the Spark distribution? Thanks! Lan

Re: Change protobuf version or any other third party library version in Spark application

2015-09-15 Thread Lan Jiang
…how to configure spark-shell to use my uber jar first. java8964, I appreciate the link and I will try the configuration; it looks promising. However, the "user classpath first" attribute does not apply to spark-shell, am I correct? Lan On Tue, Sep 15, 2015 at 8:24 AM, java8964 wrote:…

Re: Change protobuf version or any other third party library version in Spark application

2015-09-15 Thread Lan Jiang
…that, and it did not work either. Lan > On Sep 15, 2015, at 10:31 AM, java8964 wrote: > > If you use Standalone mode, just start spark-shell like the following: > > spark-shell --jars your_uber_jar --conf spark.files.userClassPathFirst=true > > Yong > > Date: Tue,…

Re: Change protobuf version or any other third party library version in Spark application

2015-09-15 Thread Lan Jiang
I am happy to report that after setting spark.driver.userClassPathFirst, I can use protobuf 3 with spark-shell. It looks like the classloading issue is in the driver, not the executor. Marcelo, thank you very much for the tip! Lan > On Sep 15, 2015, at 1:40 PM, Marcelo Vanzin wrote: > > Hi,…

Spark Streaming proactive monitoring

2017-01-23 Thread Lan Jiang
…them proactively? For example, if processing time/scheduling delay exceeds a certain threshold, send an alert to the admin/developer? Lan
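
One built-in hook for this is a StreamingListener. A minimal sketch, assuming `ssc` is the running StreamingContext and the println stands in for your own notification code:

```scala
import org.apache.spark.streaming.scheduler.{StreamingListener, StreamingListenerBatchCompleted}

class DelayAlertListener(maxDelayMs: Long) extends StreamingListener {
  override def onBatchCompleted(batch: StreamingListenerBatchCompleted): Unit = {
    // schedulingDelay is an Option; it is None until the batch has been scheduled.
    for (delay <- batch.batchInfo.schedulingDelay if delay > maxDelayMs) {
      println(s"ALERT: scheduling delay $delay ms exceeded $maxDelayMs ms")  // swap in real alerting
    }
  }
}

ssc.addStreamingListener(new DelayAlertListener(maxDelayMs = 30000))
```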

Does monotonically_increasing_id generates the same id even when executor fails or being evicted out of memory

2017-02-28 Thread Lan Jiang
…241, which is fixed in 2.0. Lan

BinaryClassificationMetrics only supports AreaUnderPR and AreaUnderROC?

2017-05-11 Thread Lan Jiang
…or a MulticlassClassificationEvaluator for multiclass problems*." https://spark.apache.org/docs/2.1.0/ml-tuning.html Can someone shed some light on this issue? Lan
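
For what it's worth, the lower-level mllib class does expose more than the two area metrics. A sketch, assuming `scoreAndLabels` is an RDD[(Double, Double)] of (score, label):

```scala
import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics

val metrics = new BinaryClassificationMetrics(scoreAndLabels)
metrics.areaUnderROC()          // the two metrics the ML evaluator exposes
metrics.areaUnderPR()

metrics.precisionByThreshold()  // RDD[(threshold, precision)]
metrics.recallByThreshold()
metrics.fMeasureByThreshold()
metrics.roc()                   // the full curve, not just its area
```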

unsubscribe

2020-04-16 Thread Jiang, Lan
-- Lan Jiang https://hpi.de/naumann/people/lan-jiang Hasso-Plattner-Institut an der Universität Potsdam Prof.-Dr.-Helmert-Str. 2-3, D-14482 Potsdam Tel +49 331 5509 280

Re: Configuring logging properties for executor

2015-04-20 Thread Lan Jiang
…automatically, without you copying them manually. Lan > On Apr 20, 2015, at 9:26 AM, Michael Ryabtsev wrote: > > Hi all, > > I need to configure the Spark executor log4j.properties on a standalone cluster. > It looks like placing the relevant properties file in the spark > con…

Re: Configuring logging properties for executor

2015-04-20 Thread Lan Jiang
Each application gets its own executor processes, so there should be no problem running them in parallel. Lan > On Apr 20, 2015, at 10:25 AM, Michael Ryabtsev wrote: > > Hi Lan, > > Thanks for the fast response. It could be a solution if it works. I have more > than one lo…

Re: HiveContext vs SQLContext

2015-04-20 Thread Lan Jiang
…Spark. Future releases will focus on bringing SQLContext up to feature parity with HiveContext." Lan > On Apr 20, 2015, at 4:17 PM, Daniel Mahler wrote: > > Is HiveContext still preferred over SQLContext? > What are the current (1.3.1) differences between them? > > Thanks, > Daniel

Re: Scheduling across applications - Need suggestion

2015-04-22 Thread Lan Jiang
The YARN capacity scheduler supports hierarchical queues, to which you can assign cluster resources as percentages. Your Spark application/shell can be submitted to different queues. Mesos supports a fine-grained mode, which allows the machines/cores used by each executor to ramp up and down. Lan On Wed, Apr 22…

How to install spark in spark on yarn mode

2015-04-29 Thread xiaohe lan
Hi experts, I see Spark on YARN has yarn-client and yarn-cluster modes. I also have a 5-node Hadoop cluster (Hadoop 2.4). How do I install Spark if I want to try the Spark-on-YARN mode? Do I need to install Spark on each node of the Hadoop cluster? Thanks, Xiaohe

Re: How to install spark in spark on yarn mode

2015-04-30 Thread xiaohe lan
> http://mbonaci.github.io/mbo-spark/ > You don't need to install Spark on every node. Just install it on one node, > or you can install it on a remote system and make a Spark cluster. > Thanks, > Madhvi > > On Thursday 30 April 2015 09:31 AM, xiaohe lan wrote: > >> Hi experts…

number of executors

2015-05-16 Thread xiaohe lan
Hi, I have a 5-node YARN cluster and used spark-submit to submit a simple app: spark-submit --master yarn target/scala-2.10/simple-project_2.10-1.0.jar --class scala.SimpleApp --num-executors 5. I have set the number of executors to 5, but from the Spark UI I could see only two executors, and it ran ve…

println in spark-shell

2015-05-17 Thread xiaohe lan
Hi, when I start spark-shell by passing yarn to the master option, println does not print elements of the RDD: bash-4.1$ spark-shell --master yarn 15/05/17 01:50:08 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable Welcome to…
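
What is likely happening, sketched below with `rdd` standing in for any RDD: in yarn mode the closure runs on the executors, so println output goes to executor stdout rather than the driver console:

```scala
// Prints on the executors: on YARN the output lands in each executor's stdout
// log, not in the spark-shell console.
rdd.foreach(println)

// Bring the data (or a sample) back to the driver to see it in the shell:
rdd.take(10).foreach(println)
rdd.collect().foreach(println)  // only safe for small RDDs
```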

Re: number of executors

2015-05-17 Thread xiaohe lan
…executor-cores param? While you submit the job, do a ps aux > | grep spark-submit and see the exact command parameters. > > Thanks > Best Regards > > On Sat, May 16, 2015 at 12:31 PM, xiaohe lan wrote: > >> Hi, >> >> I have a 5-node YARN cluster and used…

Re: number of executors

2015-05-17 Thread xiaohe lan
[Spark UI executor summary for host2, garbled in the archive: task counts, input size, shuffle read/write, and memory used.] On Sun, May 17, 2015 at 11:50 PM, xiaohe lan wrote: > bash-4.1$ ps aux | grep SparkSubmit > xilan 1704 13.2 1.2 5275520 380244 pts/0 Sl+ 08:39 0:13 > /scratch/xilan/jdk1.8.0_45/bin/java -cp…

Re: number of executors

2015-05-18 Thread xiaohe lan
…, Sandy Ryza wrote: > Hi Xiaohe, > > All the Spark options must go before the jar or they won't take effect. > > -Sandy > > On Sun, May 17, 2015 at 8:59 AM, xiaohe lan wrote: > >> Sorry, they are both actually assigned tasks. >> >> Aggreg…

Re: number of executors

2015-05-18 Thread xiaohe lan
…> Awesome! > > It's documented here: > https://spark.apache.org/docs/latest/submitting-applications.html > > -Sandy > > On Mon, May 18, 2015 at 8:03 PM, xiaohe lan wrote: > >> Hi Sandy, >> >> Thanks for your information. Yes, spark-submit --master y…
