I realized that in Spark ML, BinaryClassificationMetrics only supports
areaUnderPR and areaUnderROC. Why is that?
What if I need other metrics such as F-score or accuracy? I tried to use
MulticlassClassificationEvaluator to evaluate other metrics such as
accuracy for a binary classification problem
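For what it is worth, here is a minimal sketch of that kind of evaluation, assuming a
DataFrame named predictions with "label" and "prediction" columns produced by a binary
classifier (all names are illustrative, not from the original post):

import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator

// evaluate a binary classifier's output with the multiclass evaluator
val evaluator = new MulticlassClassificationEvaluator()
  .setLabelCol("label")
  .setPredictionCol("prediction")
  .setMetricName("f1")   // available metric names vary by Spark version (e.g. "f1", "precision", "recall")
val f1 = evaluator.evaluate(predictions)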
Hi, there
I am trying to generate a unique ID for each record in a dataframe, so that I
can save the dataframe to a relational database table. My question is: when
the dataframe is regenerated due to executor failure or being evicted
from the cache, does the ID stay the same as before?
According
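For context, a rough sketch of the kind of ID generation being asked about, assuming a
DataFrame named df (the column name is made up):

import org.apache.spark.sql.functions.monotonicallyIncreasingId   // monotonically_increasing_id in Spark 2.x

// tag each row with a generated 64-bit ID before saving to the database
val withId = df.withColumn("record_id", monotonicallyIncreasingId())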
Hi, there
From the Spark UI, we can monitor the following two metrics:
• Processing Time - The time to process each batch of data.
• Scheduling Delay - the time a batch waits in a queue for the
processing of previous batches to finish.
However, what is the best way to monitor th
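One way to get at those same numbers programmatically is a StreamingListener; a rough
sketch, assuming ssc is the StreamingContext:

import org.apache.spark.streaming.scheduler.{StreamingListener, StreamingListenerBatchCompleted}

// print processing time and scheduling delay as each batch completes
ssc.addStreamingListener(new StreamingListener {
  override def onBatchCompleted(batch: StreamingListenerBatchCompleted): Unit = {
    val info = batch.batchInfo
    println(s"batch ${info.batchTime}: processing=${info.processingDelay.getOrElse(-1L)} ms, " +
      s"scheduling=${info.schedulingDelay.getOrElse(-1L)} ms")
  }
})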
Hello,
My understanding is that the YARN executor container memory is based on
"spark.executor.memory" + "spark.yarn.executor.memoryOverhead". The first one
is for heap memory and the second one is for off-heap memory. The
spark.executor.memory value is passed to -Xmx to set the max heap size. Now my question
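As a purely illustrative example of that split (the actual value used in the original
question is cut off above): with spark.executor.memory=4g, -Xmx is set to 4g, and the
default spark.yarn.executor.memoryOverhead (a percentage of executor memory, with a
384 MB floor) adds a few hundred MB on top, so YARN is asked for a container of roughly
4.4-4.5 GB, rounded up to its allocation granularity.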
e, this would only work in a single executor, which I think will
>> end up with an OutOfMemoryException.
>>
>> Spark JSON data source does not support multi-line JSON as input due to
>> the limitation of TextInputFormat and LineRecordReader.
>>
>> You may have to just extract
Hi, there
Spark has provided a JSON document processing feature for a long time. In
most examples I see, each line is a JSON object in the sample file. That is
the easiest case. But how can we process a JSON document that does not
conform to this standard format (one line per JSON object)? Here is
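One commonly suggested workaround, sketched below with made-up paths, is to read each
file as a whole and hand the resulting strings to the JSON reader; this works as long
as each file holds one JSON document small enough to fit in memory:

// read whole files so a JSON document may span multiple lines
val wholeDocs = sc.wholeTextFiles("hdfs:///path/to/json/dir").values   // one string per file
val df = sqlContext.read.json(wholeDocs)   // Spark 1.4+; on 1.3 use sqlContext.jsonRDD(wholeDocs)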
be
> constantly monitored to see how it is performing.
>
> Hope this helps in understanding offline learning vs. online learning and
> which algorithms you can choose for online learning in MLlib.
>
> Guru Medasani
> gdm...@gmail.com
>
>
>
Sorry, I accidentally sent this again. My apologies.
> On Mar 6, 2016, at 1:22 PM, Lan Jiang wrote:
>
> Hi, there
>
> I hope someone can clarify this for me. It seems that some of the MLlib
> algorithms such as KMeans, Linear Regression and Logistic Regression have a
> Streaming
Hi, there
I hope someone can clarify this for me. It seems that some of the MLlib
algorithms such as KMeans, Linear Regression and Logistic Regression have a
Streaming version, which can do online machine learning. But does that mean
the other MLlib algorithms cannot be used in a Spark Streaming application
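For reference, the streaming variants look roughly like this (a sketch; the
DStream[Vector] inputs and the parameters are made up):

import org.apache.spark.mllib.clustering.StreamingKMeans

// train continuously on one DStream and score another as data arrives
val model = new StreamingKMeans()
  .setK(3)
  .setDecayFactor(1.0)
  .setRandomCenters(numDimensions, 0.0)
model.trainOn(trainingStream)
val predictions = model.predictOn(testStream)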
Michael,
Thanks for the reply.
On Wed, Feb 10, 2016 at 11:44 AM, Michael Armbrust
wrote:
> My question is: is the "NOSCAN" option a must? If I execute the "ANALYZE TABLE
>> compute statistics" command in the Hive shell, are the statistics
>> going to be used by SparkSQL to decide on a broadcast join?
>
>
>
Hi, there
I am looking at the SparkSQL setting spark.sql.autoBroadcastJoinThreshold.
According to the programming guide
*Note that currently statistics are only supported for Hive Metastore
tables where the command ANALYZE TABLE COMPUTE STATISTICS
noscan has been run.*
My question is: is the "NOSCAN" option a must?
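For completeness, a sketch of the two pieces involved (the table name and threshold
value are made up):

// statistics have to be computed through Hive, e.g. via a HiveContext
hiveContext.sql("ANALYZE TABLE my_table COMPUTE STATISTICS noscan")

// tables whose statistics fall below this size in bytes become broadcast-join candidates
hiveContext.setConf("spark.sql.autoBroadcastJoinThreshold", (10 * 1024 * 1024).toString)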
Need some clarification about the documentation. According to the Spark doc,
"the default interval is a multiple of the batch interval that is at least 10
seconds. It can be set by using dstream.checkpoint(checkpointInterval).
Typically, a checkpoint interval of 5 - 10 sliding intervals of a DStream
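In code, that setting looks roughly like this (a sketch; the batch interval, path and
DStream name are made up):

import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(conf, Seconds(2))   // 2-second batch interval
ssc.checkpoint("hdfs:///checkpoints/my-app")       // checkpoint directory
stateStream.checkpoint(Seconds(20))                // per-DStream checkpoint interval, ~10x the batch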
For a Spark data science project, Python might be a good choice. However, for
Spark Streaming, the Python API is still lagging. For example, for the Kafka
no-receiver (direct) connector, according to the Spark 1.5.2 doc: "Spark 1.4
added a Python API, but it is not yet at full feature parity".
Java does not have
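For reference, the receiver-less (direct) Kafka API mentioned above looks roughly like
this in Scala (broker and topic names are made up, ssc is an existing StreamingContext):

import kafka.serializer.StringDecoder
import org.apache.spark.streaming.kafka.KafkaUtils

// direct Kafka stream without a receiver (Spark 1.3+)
val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")
val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, Set("my_topic"))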
I have not run into any linkage problem, but maybe I was lucky. :-). The
reason I wanted to use protobuf 3 is mainly for Map type support.
On Thu, Nov 5, 2015 at 4:43 AM, Steve Loughran
wrote:
>
> > On 5 Nov 2015, at 00:12, Lan Jiang wrote:
> >
> > I have used protobuf
I have used protobuf 3 successfully with Spark on CDH 5.4, even though Hadoop
itself comes with protobuf 2.5. I think the steps apply to HDP too. You need to
do the following
1. Set the below parameters
spark.executor.userClassPathFirst=true
spark.driver.userClassPathFirst=true
2. Include proto
Context.html#wholeTextFiles(java.lang.String,%20int)>
>
> On 20 October 2015 at 15:04, Lan Jiang wrote:
> As Francois pointed out, you are encountering a classic small file
> anti-pattern. One solution I used in the past is to wrap all these small
As Francois pointed out, you are encountering a classic small file
anti-pattern. One solution I used in the past is to wrap all these small binary
files into a sequence file or avro file. For example, the avro schema can have
two fields: filename: string and binaryname:byte[]. Thus your file is
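A rough sketch of that packing step, assuming the small files sit under one HDFS
directory (paths are made up):

// pack many small binary files into a single sequence file of (filename, bytes) pairs
val packed = sc.binaryFiles("hdfs:///input/smallfiles")
  .map { case (path, stream) => (path, stream.toArray) }
packed.saveAsSequenceFile("hdfs:///output/packed-seq")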
Mon, Oct 12, 2015 at 8:34 AM, Akhil Das
wrote:
> Can you look a bit deeper in the executor logs? It could be filling up the
> memory and getting killed.
>
> Thanks
> Best Regards
>
> On Mon, Oct 5, 2015 at 8:55 PM, Lan Jiang wrote:
>
>> I am still facing th
> Hi Lan, thanks for the response. Yes, I know, and I have confirmed in the UI that it
> has only 12 partitions because of 12 HDFS blocks and the Hive ORC file stripe size
> is 33554432.
>
> On Thu, Oct 8, 2015 at 11:55 PM, Lan Jiang wrote:
> The partition numb
The partition number should be the same as the HDFS block number instead of
the file number. Did you confirm from the Spark UI that only 12 partitions were
created? What is your ORC orc.stripe.size?
Lan
> On Oct 8, 2015, at 1:13 PM, unk1102 wrote:
>
> Hi I have the following code where I read
Hi, there
I have a Spark batch job running on CDH 5.4 + Spark 1.3.0. The job is submitted in
"yarn-client" mode. The job itself failed because YARN killed several executor
containers when the containers exceeded the memory limit imposed by YARN.
However, when I went to the YARN resource manager site,
Hi, there
My understanding is that the cache storage is calculated as follows:
executor heap size * spark.storage.safetyFraction *
spark.storage.memoryFraction.
The default value for safetyFraction is 0.9 and memoryFraction is 0.6. When
I started a spark job on YARN, I set executor-memory to be
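As a purely illustrative example (the actual executor-memory value is cut off above):
with a hypothetical 4 GB executor heap and those defaults, the cache capacity would be
roughly 4 GB * 0.9 * 0.6 ≈ 2.2 GB, which should be close to what the Storage tab of the
UI reports per executor.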
to 1, write the result to HDFS. I use spark 1.3 with
spark-avro (1.0.0). The error only happens when running on the whole
dataset. When running on 1/3 of the files, the same job completes without
error.
On Thu, Oct 1, 2015 at 2:41 PM, Lan Jiang wrote:
> Hi, there
>
> Here is the prob
Hi, there
Here is the problem I ran into when executing a Spark job (Spark 1.3). The
Spark job loads a bunch of Avro files using the Spark SQL spark-avro 1.0.0
library. Then it does some filter/map transformations, repartitions to 1
partition and then writes to HDFS. It creates 2 stages. The total H
15 at 1:30 PM, Ted Yu wrote:
> Can you go to YARN RM UI to find all the attempts for this Spark Job ?
>
> The two lost executors should be found there.
>
> On Thu, Oct 1, 2015 at 10:30 AM, Lan Jiang wrote:
>
>> Hi, there
>>
>> When running a Spark job on
Hi, there
When running a Spark job on YARN, 2 executors somehow got lost during the
execution. The message on the history server GUI is “CANNOT find address”. Two
extra executors were launched by YARN and eventually finished the job. Usually
I go to the “Executors” tab on the UI to check the e
Hi, there
I ran into an issue when using Spark (v1.3) to load Avro files through Spark
SQL. The code sample is below:
val df = sqlContext.load("path-to-avro-file", "com.databricks.spark.avro")
val myrdd = df.select("Key", "Name", "binaryfield").rdd
val results = myrdd.map(...)
val finalResults = r
>
> If that doesn't work, I'd recommend shading, as others already have.
>
> On Tue, Sep 15, 2015 at 9:19 AM, Lan Jiang wrote:
>> I used the --conf spark.files.userClassPathFirst=true in the spark-shell
>> option, but it still gave me the error: java.
version or any other third party library version
> in Spark application
> From: ste...@hortonworks.com
> To: ljia...@gmail.com
> CC: user@spark.apache.org
> Date: Tue, 15 Sep 201
> To: ljia...@gmail.com
> CC: user@spark.apache.org
> Date: Tue, 15 Sep 2015 09:19:28 +
>
>
>
>
> On 15 Sep 2015, at 05:47, Lan Jiang wrote:
>
> Hi, there,
>
> I am using Spark 1.4.1. The protobuf 2.5 is included by Spark 1.4.1 by
> default. However, I wou
Hi, there,
I am using Spark 1.4.1. The protobuf 2.5 is included by Spark 1.4.1 by
default. However, I would like to use Protobuf 3 in my spark application so
that I can use some new features such as Map support. Is there any way to
do that?
Right now, if I build an uber.jar with dependencies includ
Hi, there
I ran into a problem when I try to pass an external jar file to spark-shell.
I have an uber jar file that contains all the Java code I created for protobuf
and all its dependencies.
If I simply execute my code using the Scala shell, it works fine without error. I
use -cp to pass the extern
The YARN capacity scheduler supports hierarchical queues, to which you can assign
cluster resources as percentages. Your Spark application/shell can be
submitted to different queues. Mesos supports a fine-grained mode, which
allows the machines/cores used by each executor to ramp up and down.
Lan
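As an illustration, pointing a Spark application at a particular capacity-scheduler
queue is just a configuration setting (the queue name is made up):

import org.apache.spark.SparkConf

val conf = new SparkConf().set("spark.yarn.queue", "analytics")   // or spark-submit --queue analytics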
On Wed, Apr 22,
Daniel,
HiveContext is a subclass of SQLContext and thus offers a superset of SQLContext's
features, including some not available in SQLContext, such as access to Hive UDFs,
Hive tables, Hive SerDes, etc. This does not change in 1.3.1.
Quote from 1.3.1 documentation
“… using HiveContext is recommended for the 1.3 release of Spark.”
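A minimal illustration (the table name is made up):

import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc)
hiveContext.sql("SELECT * FROM my_hive_table").show()   // Hive tables, UDFs and SerDes are available here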
ame? Can I
> conclude that going this way I will not be able to run several applications
> on the same cluster in parallel?
>
> Regarding submit, I am not using it now, I submit from the code, but I think
> I should consider this option.
>
> Thanks.
>
> On Mon, Apr 20,
Rename your log4j_special.properties file to log4j.properties and place it
under the root of your jar file, and you should be fine.
If you are using Maven to build your jar, place the log4j.properties file in the
src/main/resources folder.
However, please note that if you have other dependency jar file
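A sketch of the Maven layout being described (only the resources path matters here):

my-app/
  pom.xml
  src/main/resources/log4j.properties   <-- ends up at the root of the jar
  src/main/scala/...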