BinaryClassificationMetrics only supports AreaUnderPR and AreaUnderROC?

2017-05-11 Thread Lan Jiang
I realized that in Spark ML, BinaryClassificationMetrics only supports AreaUnderPR and AreaUnderROC. Why is that? What if I need other metrics such as F-score or accuracy? I tried to use MulticlassClassificationEvaluator to evaluate other metrics such as Accuracy for a binary classification pro
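A small sketch of the workaround mentioned above; "predictions" is an assumed DataFrame produced by a fitted binary classifier, and the column names are the Spark ML defaults, not details from this thread:

    import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator

    // Works on binary predictions too, since binary is just the two-class case.
    val evaluator = new MulticlassClassificationEvaluator()
      .setLabelCol("label")
      .setPredictionCol("prediction")
      .setMetricName("f1")    // "accuracy", "weightedPrecision", "weightedRecall" also supported
    val f1 = evaluator.evaluate(predictions)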

Does monotonically_increasing_id generate the same id even when an executor fails or is evicted out of memory

2017-02-28 Thread Lan Jiang
Hi, there I am trying to generate a unique ID for each record in a dataframe, so that I can save the dataframe to a relational database table. My question is: when the dataframe is regenerated due to executor failure or eviction from the cache, does the ID stay the same as before? According
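A minimal sketch of the pattern in question (df is an assumed source DataFrame); the usual caveat is that the ids are unique but not consecutive, and only as stable as the lineage that produces them:

    import org.apache.spark.sql.functions.monotonically_increasing_id

    val withId = df.withColumn("id", monotonically_increasing_id())
    // If the ids must not change when partitions are recomputed (executor loss, cache
    // eviction), materialize them first, e.g. by checkpointing or writing withId out
    // before any downstream use.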

Spark Streaming proactive monitoring

2017-01-23 Thread Lan Jiang
Hi, there From the Spark UI, we can monitor the following two metrics: • Processing Time - The time to process each batch of data. • Scheduling Delay - the time a batch waits in a queue for the processing of previous batches to finish. However, what is the best way to monitor th
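One commonly used hook for this, sketched below under the assumption that ssc is an existing StreamingContext; the listener could just as well push the delays to an external monitoring system instead of printing them:

    import org.apache.spark.streaming.scheduler.{StreamingListener, StreamingListenerBatchCompleted}

    ssc.addStreamingListener(new StreamingListener {
      override def onBatchCompleted(batch: StreamingListenerBatchCompleted): Unit = {
        val info = batch.batchInfo
        // processingDelay and schedulingDelay are reported in milliseconds
        println(s"batch ${info.batchTime}: processing=${info.processingDelay.getOrElse(-1L)} ms, " +
          s"scheduling=${info.schedulingDelay.getOrElse(-1L)} ms")
      }
    })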

Spark Yarn executor container memory

2016-08-15 Thread Lan Jiang
Hello, My understanding is that YARN executor container memory is based on "spark.executor.memory" + "spark.yarn.executor.memoryOverhead". The first one is for heap memory and the second one is for off-heap memory. spark.executor.memory is passed to -Xmx to set the max heap size. Now my question
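As a hypothetical worked example (the actual numbers in the message are cut off): with --executor-memory 8g and the default overhead of 10% of executor memory with a 384 MB floor, the overhead comes to roughly 819 MB, so YARN is asked for about 8.8 GB per executor container, and that request is then rounded up to a multiple of yarn.scheduler.minimum-allocation-mb.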

Re: Processing json document

2016-07-07 Thread Lan Jiang
e, this would only work in single executor which I think will >> end up with OutOfMemoryException. >> >> Spark JSON data source does not support multi-line JSON as input due to >> the limitation of TextInputFormat and LineRecordReader. >> >> You may have to just extrac

Processing json document

2016-07-06 Thread Lan Jiang
Hi, there Spark has provided a JSON document processing feature for a long time. In most examples I see, each line in the sample file is a JSON object. That is the easiest case. But how can we process a JSON document that does not conform to this standard format (one line per JSON object)? Here is
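A rough sketch of one way around this at the time, assuming each input file holds exactly one (possibly multi-line) JSON document and the files are small enough to be read whole; the path is hypothetical:

    val docs = sc.wholeTextFiles("hdfs:///data/json-docs").map(_._2)   // one whole document per record
    val df = sqlContext.read.json(docs)                                // RDD[String] -> DataFrame
    df.printSchema()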

Re: MLLib + Streaming

2016-03-06 Thread Lan Jiang
be > constantly monitored to see how it is performing. > > Hope this helps in understanding offline learning vs. online learning and > which algorithms you can choose for online learning in MLlib. > > Guru Medasani > gdm...@gmail.com > > >

Re: Spark ML and Streaming

2016-03-06 Thread Lan Jiang
Sorry, accidentally sent again. My apologies. > On Mar 6, 2016, at 1:22 PM, Lan Jiang wrote: > > Hi, there > > I hope someone can clarify this for me. It seems that some of the MLlib > algorithms such as KMeans, Linear Regression and Logistic Regression have a > Streami

Spark ML and Streaming

2016-03-06 Thread Lan Jiang
Hi, there I hope someone can clarify this for me. It seems that some of the MLlib algorithms such as KMeans, Linear Regression and Logistic Regression have a Streaming version, which can do online machine learning. But does that mean other MLlib algorithms cannot be used in Spark streaming appl

MLLib + Streaming

2016-03-05 Thread Lan Jiang
Hi, there I hope someone can clarify this for me. It seems that some of the MLlib algorithms such as KMeans, Linear Regression and Logistic Regression have a Streaming version, which can do online machine learning. But does that mean other MLlib algorithms cannot be used in Spark streaming appl
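A hedged sketch of the usual answer: any MLlib model trained offline can be applied inside a streaming job; only continuous updating of the model is limited to the Streaming* variants. Here trainingData (an RDD of LabeledPoint) and featureStream (a DStream of Vector) are assumed to exist:

    import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS

    val model = new LogisticRegressionWithLBFGS().run(trainingData)   // batch / offline training
    val scored = featureStream.map(v => (v, model.predict(v)))        // per-micro-batch scoring
    scored.print()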

Re: broadcast join in SparkSQL requires analyze table noscan

2016-02-10 Thread Lan Jiang
Michael, Thanks for the reply. On Wed, Feb 10, 2016 at 11:44 AM, Michael Armbrust wrote: > My question is that is "NOSCAN" option a must? If I execute "ANALYZE TABLE >> compute statistics" command in Hive shell, is the statistics >> going to be used by SparkSQL to decide broadcast join? > > >

broadcast join in SparkSQL requires analyze table noscan

2016-02-10 Thread Lan Jiang
Hi, there I am looking at the SparkSQL setting spark.sql.autoBroadcastJoinThreshold. According to the programming guide *Note that currently statistics are only supported for Hive Metastore tables where the command ANALYZE TABLE COMPUTE STATISTICS noscan has been run.* My question is that is "N
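Two hedged sketches related to the question (table and DataFrame names are made up): statistics can be computed through HiveContext, or the statistics requirement can be sidestepped with an explicit broadcast hint, available since Spark 1.5:

    // 1) let autoBroadcastJoinThreshold see a size estimate for the small table
    sqlContext.sql("ANALYZE TABLE dim_table COMPUTE STATISTICS noscan")

    // 2) or force the broadcast regardless of statistics
    import org.apache.spark.sql.functions.broadcast
    val joined = largeDf.join(broadcast(smallDf), "key")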

Question about Spark Streaming checkpoint interval

2015-12-18 Thread Lan Jiang
Need some clarification about the documentation. According to Spark doc "the default interval is a multiple of the batch interval that is at least 10 seconds. It can be set by using dstream.checkpoint(checkpointInterval). Typically, a checkpoint interval of 5 - 10 sliding intervals of a DStream
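For illustration only (the values are hypothetical, not from the thread): with a 10-second batch interval, a checkpoint interval of 5 batches would be set like this on the DStream that carries state:

    import org.apache.spark.streaming.Seconds

    ssc.checkpoint("hdfs:///checkpoints/myapp")   // ssc: an assumed StreamingContext
    stateStream.checkpoint(Seconds(50))           // 5 x the 10-second batch interval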

Re: Scala VS Java VS Python

2015-12-16 Thread Lan Jiang
For a Spark data science project, Python might be a good choice. However, for Spark Streaming, the Python API is still lagging. For example, for the Kafka no-receiver (direct) connector, according to the Spark 1.5.2 doc: "Spark 1.4 added a Python API, but it is not yet at full feature parity". Java does not hav

Re: Protobuff 3.0 for Spark

2015-11-09 Thread Lan Jiang
I have not run into any linkage problem, but maybe I was lucky. :-). The reason I wanted to use protobuf 3 is mainly for Map type support. On Thu, Nov 5, 2015 at 4:43 AM, Steve Loughran wrote: > > > On 5 Nov 2015, at 00:12, Lan Jiang wrote: > > > > I have used protobu

Re: Protobuff 3.0 for Spark

2015-11-04 Thread Lan Jiang
I have used protobuf 3 successfully with Spark on CDH 5.4, even though Hadoop itself comes with protobuf 2.5. I think the steps apply to HDP too. You need to do the following 1. Set the below parameter spark.executor.userClassPathFirst=true spark.driver.userClassPathFirst=true 2. Include proto

Re: Ahhhh... Spark creates >30000 partitions... What can I do?

2015-10-20 Thread Lan Jiang
Context.html#wholeTextFiles(java.lang.String,%20int)> > > On 20 October 2015 at 15:04, Lan Jiang <ljia...@gmail.com> wrote: > As Francois pointed out, you are encountering a classic small file > anti-pattern. One solution I used in the past is to wrap all these small &g

Re: Ahhhh... Spark creates >30000 partitions... What can I do?

2015-10-20 Thread Lan Jiang
As Francois pointed out, you are encountering a classic small file anti-pattern. One solution I used in the past is to wrap all these small binary files into a sequence file or avro file. For example, the avro schema can have two fields: filename: string and binaryname:byte[]. Thus your file is
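A hedged sketch of the SequenceFile variant of that idea (paths are hypothetical): each small file becomes one (fileName, bytes) record, so the downstream job reads a handful of large splits instead of tens of thousands of tiny files:

    val packed = sc.binaryFiles("hdfs:///input/small-files")
      .mapValues(_.toArray())                        // (fileName, fileBytes)
    packed.saveAsSequenceFile("hdfs:///input/packed-seq")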

Re: "java.io.IOException: Filesystem closed" on executors

2015-10-14 Thread Lan Jiang
Mon, Oct 12, 2015 at 8:34 AM, Akhil Das wrote: > Can you look a bit deeper in the executor logs? It could be filling up the > memory and getting killed. > > Thanks > Best Regards > > On Mon, Oct 5, 2015 at 8:55 PM, Lan Jiang wrote: > >> I am still facing th

Re: How to increase Spark partitions for the DataFrame?

2015-10-08 Thread Lan Jiang
> Hi Lan, thanks for the response yes I know and I have confirmed in UI that it > has only 12 partitions because of 12 HDFS blocks and hive orc file strip size > is 33554432. > > On Thu, Oct 8, 2015 at 11:55 PM, Lan Jiang <ljia...@gmail.com> wrote: > The partition numb

Re: How to increase Spark partitions for the DataFrame?

2015-10-08 Thread Lan Jiang
The partition number should be the same as the HDFS block number instead of the file number. Did you confirm from the Spark UI that only 12 partitions were created? What is your ORC orc.stripe.size? Lan > On Oct 8, 2015, at 1:13 PM, unk1102 wrote: > > Hi I have the following code where I read
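If more parallelism is needed downstream, the usual move (sketched with an arbitrary target number; df is assumed to be the DataFrame read from the ORC table) is an explicit repartition, which inserts a shuffle and is not capped by the number of input blocks the way coalesce is:

    val widened = df.repartition(48)          // 48 is only illustrative
    println(widened.rdd.partitions.length)    // should now report 48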

failed spark job reports on YARN as successful

2015-10-08 Thread Lan Jiang
Hi, there I have a spark batch job running on CDH5.4 + Spark 1.3.0. The job is submitted in “yarn-client” mode. The job itself failed because YARN killed several executor containers that exceeded the memory limit imposed by YARN. However, when I went to the YARN resource manager site,

Spark cache memory storage

2015-10-06 Thread Lan Jiang
Hi, there My understanding is that the cache storage is calculated as follows: executor heap size * spark.storage.safetyFraction * spark.storage.memoryFraction. The default value for safetyFraction is 0.9 and memoryFraction is 0.6. When I started a Spark job on YARN, I set executor-memory to be
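As a hypothetical worked example (the actual executor-memory value is cut off): with executor-memory set to 8g, that formula gives roughly 8 GB * 0.9 * 0.6 ≈ 4.3 GB of storage memory; the figure on the Executors tab is usually a little lower still, because Spark computes it from the JVM's Runtime.maxMemory, which comes out smaller than the -Xmx value.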

Re: "java.io.IOException: Filesystem closed" on executors

2015-10-05 Thread Lan Jiang
to 1, write the result to HDFS. I use spark 1.3 with spark-avro (1.0.0). The error only happens when running on the whole dataset. When running on 1/3 of the files, the same job completes without error. On Thu, Oct 1, 2015 at 2:41 PM, Lan Jiang wrote: > Hi, there > > Here is the prob

"java.io.IOException: Filesystem closed" on executors

2015-10-01 Thread Lan Jiang
Hi, there Here is the problem I ran into when executing a Spark job (Spark 1.3). The Spark job loads a bunch of Avro files using the Spark SQL spark-avro 1.0.0 library. Then it does some filter/map transformations, repartitions to 1 partition and then writes to HDFS. It creates 2 stages. The total H

Re: How to access lost executor log file

2015-10-01 Thread Lan Jiang
15 at 1:30 PM, Ted Yu wrote: > Can you go to YARN RM UI to find all the attempts for this Spark Job ? > > The two lost executors should be found there. > > On Thu, Oct 1, 2015 at 10:30 AM, Lan Jiang wrote: > >> Hi, there >> >> When running a Spark job on

How to access lost executor log file

2015-10-01 Thread Lan Jiang
Hi, there When running a Spark job on YARN, 2 executors somehow got lost during the execution. The message on the history server GUI is “CANNOT find address”. Two extra executors were launched by YARN and eventually finished the job. Usually I go to the “Executors” tab on the UI to check the e

unintended consequence of using coalesce operation

2015-09-29 Thread Lan Jiang
Hi, there I ran into an issue when using Spark (v 1.3) to load Avro files through Spark SQL. The code sample is below: val df = sqlContext.load("path-to-avro-file", "com.databricks.spark.avro") val myrdd = df.select("Key", "Name", "binaryfield").rdd val results = myrdd.map(...) val finalResults = r
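A hedged sketch of the usual fix (output path made up): repartition(1) adds a shuffle boundary, so the select/map above still runs with the input's full parallelism and only the final write happens in a single task, whereas coalesce(1) collapses the whole preceding stage into one task:

    val singlePartition = results.repartition(1)   // results: the mapped RDD from the snippet above
    singlePartition.saveAsTextFile("hdfs:///output/final")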

Re: Change protobuf version or any other third party library version in Spark application

2015-09-15 Thread Lan Jiang
> > If that doesn't work, I'd recommend shading, as others already have. > > On Tue, Sep 15, 2015 at 9:19 AM, Lan Jiang wrote: >> I used the --conf spark.files.userClassPathFirst=true in the spark-shell >> option, it still gave me the error: java.

Re: Change protobuf version or any other third party library version in Spark application

2015-09-15 Thread Lan Jiang
version or any other third party library version > in Spark application > From: ste...@hortonworks.com > To: ljia...@gmail.com > CC: user@spark.apache.org > Date: Tue, 15 Sep 201

Re: Change protobuf version or any other third party library version in Spark application

2015-09-15 Thread Lan Jiang
> To: ljia...@gmail.com > CC: user@spark.apache.org > Date: Tue, 15 Sep 2015 09:19:28 + > > > > On 15 Sep 2015, at 05:47, Lan Jiang wrote: > > Hi, there, > > I am using Spark 1.4.1. The protobuf 2.5 is included by Spark 1.4.1 by > default. However, I wou

Change protobuf version or any other third party library version in Spark application

2015-09-14 Thread Lan Jiang
Hi, there, I am using Spark 1.4.1. Protobuf 2.5 is included with Spark 1.4.1 by default. However, I would like to use Protobuf 3 in my Spark application so that I can use some new features such as Map support. Is there any way to do that? Right now if I build an uber.jar with dependencies includ

add external jar file to Spark shell vs. Scala Shell

2015-09-14 Thread Lan Jiang
Hi, there I ran into a problem when I try to pass an external jar file to spark-shell. I have an uber jar file that contains all the Java code I created for protobuf and all its dependencies. If I simply execute my code using the Scala shell, it works fine without error. I use -cp to pass the extern
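The usual way to ship such a jar to a Spark shell session (path hypothetical) is the --jars flag, which, unlike -cp, puts the jar on the executor classpath as well as the driver's:

    spark-shell --jars /path/to/my-uber.jar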

Re: Scheduling across applications - Need suggestion

2015-04-22 Thread Lan Jiang
The YARN capacity scheduler supports hierarchical queues, to which you can assign cluster resources as percentages. Your Spark application/shell can be submitted to different queues. Mesos supports fine-grained mode, which allows the machines/cores used by each executor to ramp up and down. Lan On Wed, Apr 22,

Re: HiveContext vs SQLContext

2015-04-20 Thread Lan Jiang
Daniel, HiveContext is a subclass of SQLContext and thus offers a superset of features not available in SQLContext, such as access to Hive UDFs, Hive tables, Hive SerDes, etc. This does not change in 1.3.1. Quote from the 1.3.1 documentation: “… using HiveContext is recommended for the 1.3 release of Spa
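A minimal sketch of creating one (app name made up; the spark-hive artifact must be on the classpath); HiveContext does not require an existing Hive installation:

    val sc = new org.apache.spark.SparkContext(
      new org.apache.spark.SparkConf().setAppName("hivecontext-demo"))
    val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
    hiveContext.sql("SHOW TABLES").show()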

Re: Configuring logging properties for executor

2015-04-20 Thread Lan Jiang
ame? Can I > conclude that going this way I will not be able to run several applications > on the same cluster in parallel? > > Regarding submit, I am not using it now, I submit from the code, but I think > I should consider this option. > > Thanks. > > On Mon, Apr 20,

Re: Configuring logging properties for executor

2015-04-20 Thread Lan Jiang
Rename your log4j_special.properties file to log4j.properties and place it under the root of your jar file, and you should be fine. If you are using Maven to build your jar, place the log4j.properties in the src/main/resources folder. However, please note that if you have other dependency jar file
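To make that layout concrete: a file at src/main/resources/log4j.properties is copied by Maven to the root of the packaged jar, which is where Log4j looks for it. A commonly used alternative with spark-submit, not discussed above, is to ship the file with --files log4j_special.properties and point the executors at it via spark.executor.extraJavaOptions=-Dlog4j.configuration=log4j_special.properties.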