Hello,
Is it possible for existing R Machine Learning packages (which work with R
data frames) such as bnlearn, to work with SparkR data frames? Or do I need
to convert SparkR data frames to R data frames? Is "collect" the function to
do the conversion, or how else to do that?
Many Thanks,
ciation because it is shutting down.
]
More about the setup: each VM has only 4GB RAM, running Ubuntu, using
spark-1.2.0, built for Hadoop 2.6.0.
I have struggled with this error for a few days. Could anyone please tell me
what the problem is and how to fix it?
Thanks,
Lan
Hi Alexey and Daniel,
I'm using Spark 1.2.0 and still having the same error, as described below.
Do you have any news on this? Really appreciate your responses!!!
"a Spark cluster of 1 master VM SparkV1 and 1 worker VM SparkV4 (the error
is the same if I have 2 workers). They are connected witho
Hello,
I have the above naive question, if anyone could help: why not use a
row-based file format to save row-based DataFrames/RDDs?
Thanks,
Lan
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/Why-is-Columnar-Parquet-used-as-default-for-saving-Row-based
Hi Expert,
Hadoop version: 2.4
Spark version: 1.3.1
I am running the SparkPi example application.
bin/spark-submit --class org.apache.spark.examples.SparkPi --master
yarn-client --executor-memory 2G lib/spark-examples-1.3.1-hadoop2.4.0.jar
2
The same command sometimes gets WARN ReliableDeli
Changing the JDK from 1.8.0_45 to 1.7.0_79 solved this issue.
I saw https://issues.apache.org/jira/browse/SPARK-6388,
but apparently it is not considered a problem there.
On Thu, Jul 2, 2015 at 1:30 PM, xiaohe lan wrote:
> Hi Expert,
>
> Hadoop version: 2.4
> Spark version: 1.3.1
>
> I am running t
should be
able to run in the streaming application. Am I wrong?
Lan
-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org
should be
able to run in the streaming application. Am I wrong?
Thanks in advance.
Lan
-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org
Sorry, accidentally sent again. My apology.
> On Mar 6, 2016, at 1:22 PM, Lan Jiang wrote:
>
> Hi, there
>
> I hope someone can clarify this for me. It seems that some of the MLlib
> algorithms such as KMeans, Linear Regression and Logistic Regression have a
> Streami
online and
offline learning.
Lan
> On Mar 6, 2016, at 2:43 AM, Chris Miller wrote:
>
> Guru: This is a really great response. Thanks for taking the time to explain
> all of this. Helpful for me too.
>
>
> --
> Chris Miller
>
> On Sun, Mar 6, 2016 at 1:54
Hi, there
Spark has provided a JSON document processing feature for a long time. In
most examples I see, each line in the sample file is a JSON object. That is
the easiest case. But how can we process a JSON document that does not
conform to this standard format (one JSON object per line)? Here is
e, this would only work in a single executor, which I think will
>> end up with an OutOfMemoryException.
>>
>> Spark JSON data source does not support multi-line JSON as input due to
>> the limitation of TextInputFormat and LineRecordReader.
>>
>> You may have to just extrac
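For reference, the whole-file workaround that keeps coming up looks roughly like
this in spark-shell (Spark 1.4+; the path is illustrative, and per the caveat
above each file must fit in the memory of a single task):
// read each file as one (path, content) record, then let the JSON data source
// parse each whole document instead of each line
val docs = sc.wholeTextFiles("hdfs:///data/json/").map { case (_, content) => content }
val df = sqlContext.read.json(docs)  // each RDD element is parsed as one JSON document
df.printSchema()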
My question is why it does not count the permgen size and the memory used by
thread stacks. They are not part of the max heap size. IMHO, the YARN executor
container memory should be set to: spark.executor.memory + [-XX:MaxPermSize] +
number_of_threads * [-Xss] + spark.yarn.executor.memoryOverhead. What did I
miss?
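To make that concrete with made-up numbers: with spark.executor.memory=4g,
-XX:MaxPermSize=256m, 100 threads at -Xss1m and a 384m memoryOverhead, the
formula above asks YARN for roughly 4096 + 256 + 100 * 1 + 384 = 4836 MB per
container, noticeably more than heap plus overhead alone.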
s not have REPL shell, which is a major drawback from my perspective.
Lan
> On Dec 16, 2015, at 3:46 PM, Stephen Boesch wrote:
>
> There are solid reasons to have built spark on the jvm vs python. The
> question for Daniel appears at this point to be scala vs java8. For that
I cannot find an answer in the documentation as to whether metadata
checkpointing is done for each batch and whether the checkpointInterval setting
applies to both types of checkpointing. Maybe I missed it. If anyone can point
me to the right documentation, I would highly appreciate it.
Best Regards,
Lan
Hi, there
I am looking at the SparkSQL setting spark.sql.autoBroadcastJoinThreshold.
According to the programming guide
*Note that currently statistics are only supported for Hive Metastore
tables where the command ANALYZE TABLE COMPUTE STATISTICS
noscan has been run.*
My question is: is "N
Michael,
Thanks for the reply.
On Wed, Feb 10, 2016 at 11:44 AM, Michael Armbrust
wrote:
> My question is: is the "NOSCAN" option a must? If I execute the "ANALYZE TABLE
>> compute statistics" command in the Hive shell, are the statistics
>> going to be used by SparkSQL to decide on a broadcast join?
>
>
>
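For anyone searching the archives later, the two statements being compared are,
concretely (table name illustrative):
ANALYZE TABLE my_table COMPUTE STATISTICS noscan;   -- the form the Spark docs mention
ANALYZE TABLE my_table COMPUTE STATISTICS;          -- the form I ran in the Hive shell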
5.
Is my understanding correct? In this case, I think repartition is a better
choice than coalesce.
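A minimal illustration of the trade-off, given some existing rdd (the partition
counts are made up):
val narrowed = rdd.coalesce(1)       // no shuffle, but one task ends up doing all the work
val balanced = rdd.repartition(200)  // full shuffle, data spread over 200 roughly equal partitions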
Lan
executors to find out why they were lost?
Thanks
Lan
L" in the application overview section. When I click it, it
brings me to the spark history server UI, where I cannot find the lost
exectuors. The only logs link I can find one the YARN RM site is the
ApplicationMaster log, which is not what I need. Did I miss something?
Lan
On Thu, Oct 1, 20
Hi, there
Here is the problem I ran into when executing a Spark job (Spark 1.3). The
job loads a bunch of Avro files using the Spark SQL spark-avro 1.0.0
library. Then it does some filter/map transformations, repartitions to 1
partition and writes to HDFS. It creates 2 stages. The total H
to 1, write the result to HDFS. I use spark 1.3 with
spark-avro (1.0.0). The error only happens when running on the whole
dataset. When running on 1/3 of the files, the same job completes without
error.
On Thu, Oct 1, 2015 at 2:41 PM, Lan Jiang wrote:
> Hi, there
>
> Here is the prob
be 6g. Thus I
expect the memory cache to be 6 * 0.9 * 0.6 = 3.24g. However, on the Spark
history server, it shows the reserved cache size for each executor is
3.1g. So it does not add up. What do I miss?
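The 0.6 and 0.9 factors I am assuming are the Spark 1.x defaults:
spark.storage.memoryFraction = 0.6
spark.storage.safetyFraction = 0.9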
Lan
still a bug, or is there something I need to do in the Spark application to
report the correct job status to YARN?
Lan
The partition number should be the same as the HDFS block count, not the file
count. Did you confirm from the Spark UI that only 12 partitions were created?
What is your ORC orc.stripe.size?
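For example, from spark-shell (with a Hive-enabled sqlContext; the path is
illustrative) you can check the partition count with:
val df = sqlContext.read.format("orc").load("hdfs:///path/to/orc_dir")
df.rdd.partitions.length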
Lan
> On Oct 8, 2015, at 1:13 PM, unk1102 wrote:
>
> Hi I have the following code whe
Hmm, that’s odd.
You can always use repartition(n) to increase the partition number, but then
there will be shuffle. How large is your ORC file? Have you used NameNode UI to
check how many HDFS blocks each ORC file has?
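Besides the NameNode UI, the block breakdown is also visible from the command
line (path illustrative):
hdfs fsck /path/to/file.orc -files -blocks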
Lan
> On Oct 8, 2015, at 2:08 PM, Umesh Kacha wrote:
>
>
the problem.
After I increased spark.yarn.executor.memoryOverhead, it worked
fine. I was using Spark 1.3, where the default value is executorMemory *
0.07, with a minimum of 384. In Spark 1.4 and later, the default value was
changed to executorMemory * 0.10, with a minimum of 384.
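For anyone hitting the same thing: the override is just an extra option on
spark-submit (the value here is illustrative, size it for your job):
--conf spark.yarn.executor.memoryOverhead=1024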
Lan
On
splittable and will not create so many partitions.
Lan
> On Oct 20, 2015, at 8:03 AM, François Pelletier
> wrote:
>
> You should aggregate your files into larger chunks before doing anything else.
> HDFS is not fit for small files. They will bloat the NameNode and cause you a lot of
>
I think the data file is binary per the original post, so in this case
sc.binaryFiles should be used. However, I still recommend against using so many
small binary files, as:
1. They are not good for batch I/O.
2. They put too much memory pressure on the NameNode.
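A minimal sketch of the binaryFiles route, for the record (the path is
illustrative):
val bins = sc.binaryFiles("hdfs:///data/bin/")                  // RDD[(String, PortableDataStream)]
val sizes = bins.mapValues(stream => stream.toArray().length)   // bytes per file
sizes.take(5).foreach(println)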
Lan
> On Oct 20, 2015, at 11
Hi,
I have a Hadoop 2.4 cluster running on some remote VMs. Can I start spark-shell
or spark-submit from my laptop? For example:
bin/spark-shell --master yarn-client
If this is possible, how can I do it?
I have copied the same Hadoop distribution to my laptop (but I don't run Hadoop
on my laptop), and I have also set:
protobuf 3 jar file either through --jars during spark-submit or
package it into an uber jar file with your own classes.
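i.e., something along these lines (jar and class names are illustrative):
spark-submit --jars protobuf-java-3.0.0-beta-1.jar --class com.example.MyJob my-app.jar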
Lan
> On Nov 4, 2015, at 4:07 PM, Cassa L wrote:
>
> Hi,
> Does spark support protobuf 3.0? I used protobuf 2.5 with spark-1.4 built
> for HDP 2.3. Given
I have not run into any linkage problem, but maybe I was lucky. :-). The
reason I wanted to use protobuf 3 is mainly for Map type support.
On Thu, Nov 5, 2015 at 4:43 AM, Steve Loughran
wrote:
>
> > On 5 Nov 2015, at 00:12, Lan Jiang wrote:
> >
> > I have used protobu
Hi,
I have some CSV files in HDFS with headers like col1, col2, col3. I want to
add a column named id, so that a record would be
How can I do this using Spark SQL? Can id be auto-increment?
Thanks,
Xiaohe
Hi,
I am trying to run SparkPi in IntelliJ and I am getting a NoClassDefFoundError.
Has anyone else seen this issue before?
Exception in thread "main" java.lang.NoClassDefFoundError:
scala/collection/Seq
at org.apache.spark.examples.SparkPi.main(SparkPi.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0
is provided, you need to change
> it to compile to run SparkPi in IntelliJ. As I remember, you also need to
> change the guava and jetty related libraries to compile too.
>
> On Mon, Aug 17, 2015 at 2:14 AM, xiaohe lan
> wrote:
>
>> Hi,
>>
>> I am trying to run Spark
Hi, there
I ran into a problem when trying to pass an external jar file to spark-shell.
I have an uber jar file that contains all the Java code I created for protobuf
and all its dependencies.
If I simply execute my code using the Scala shell, it works fine without error. I
use -cp to pass the extern
and extend the question to any third-party libraries: how do we deal with
version conflicts for third-party libraries included in the Spark
distribution?
Thanks!
Lan
how to
configure spark shell to use my uber jar first.
java8964 -- appreciate the link and I will try the configuration. Looks
promising. However, the "user classpath first" attribute does not apply to
spark-shell, am I correct?
Lan
On Tue, Sep 15, 2015 at 8:24 AM, java8964 wrote:
>
that and it did not work either.
Lan
> On Sep 15, 2015, at 10:31 AM, java8964 wrote:
>
> If you use Standalone mode, just start spark-shell like following:
>
> spark-shell --jars your_uber_jar --conf spark.files.userClassPathFirst=true
>
> Yong
>
> Date: Tue,
I am happy to report that after setting spark.driver.userClassPathFirst, I can use
protobuf 3 with spark-shell. It looks like the classloading issue is in the driver,
not the executor.
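For the archives, the invocation that worked was along these lines (the jar name
is illustrative):
spark-shell --jars my-uber.jar --conf spark.driver.userClassPathFirst=true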
Marcelo, thank you very much for the tip!
Lan
> On Sep 15, 2015, at 1:40 PM, Marcelo Vanzin wrote:
>
> Hi,
them proactively? For example, if the
processing time/scheduling delay exceeds a certain threshold, send an alert to the
admin/developer?
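One direction I am considering is a StreamingListener along these lines; the
threshold and the alert hook below are placeholders, not an existing alerting API:
import org.apache.spark.streaming.scheduler.{StreamingListener, StreamingListenerBatchCompleted}

class DelayAlertListener(thresholdMs: Long) extends StreamingListener {
  override def onBatchCompleted(batch: StreamingListenerBatchCompleted): Unit = {
    val info = batch.batchInfo
    val scheduling = info.schedulingDelay.getOrElse(0L)
    val processing = info.processingDelay.getOrElse(0L)
    if (scheduling > thresholdMs || processing > thresholdMs) {
      // placeholder: hook in mail/pager notification here
      println(s"ALERT: batch ${info.batchTime} schedulingDelay=${scheduling}ms processingDelay=${processing}ms")
    }
  }
}
// register it on the StreamingContext: ssc.addStreamingListener(new DelayAlertListener(30000))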
Lan
-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org
241, which
is fixed in 2.0.
Lan
or a
MulticlassClassificationEvaluator for multiclass problems*. "
https://spark.apache.org/docs/2.1.0/ml-tuning.html
Can someone shed some lights on the issue?
Lan
--
Lan Jiang
https://hpi.de/naumann/people/lan-jiang
Hasso-Plattner-Institut an der Universität Potsdam
Prof.-Dr.-Helmert-Str. 2-3, D-14482 Potsdam
Tel +49 331 5509 280
automatically without you
copying them manually.
Lan
> On Apr 20, 2015, at 9:26 AM, Michael Ryabtsev wrote:
>
> Hi all,
>
> I need to configure spark executor log4j.properties on a standalone cluster.
> It looks like placing the relevant properties file in the spark
> con
Each application gets its own executor processes, so there should be no
problem running them in parallel.
Lan
> On Apr 20, 2015, at 10:25 AM, Michael Ryabtsev wrote:
>
> Hi Lan,
>
> Thanks for fast response. It could be a solution if it works. I have more
> than one lo
Spark. Future
releases will focus on bringing SQLContext up to feature parity with a
HiveContext.”
Lan
> On Apr 20, 2015, at 4:17 PM, Daniel Mahler wrote:
>
> Is HiveContext still preferred over SQLContext?
> What are the current (1.3.1) diferences between them?
>
> thanks
> Daniel
The YARN capacity scheduler supports hierarchical queues, to which you can assign
cluster resources as percentages. Your Spark application/shell can be
submitted to different queues. Mesos supports a fine-grained mode, which
allows the machines/cores used by each executor to ramp up and down.
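For example, a shell session can be pinned to a particular queue (queue name
illustrative):
spark-shell --master yarn-client --queue analytics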
Lan
On Wed, Apr 22
Hi experts,
I see Spark on YARN has yarn-client and yarn-cluster modes. I also have a
5-node Hadoop cluster (Hadoop 2.4). How do I install Spark if I want to try
the Spark on YARN mode?
Do I need to install Spark on each node of the Hadoop cluster?
Thanks,
Xiaohe
> http://mbonaci.github.io/mbo-spark/
> You don't need to install Spark on every node. Just install it on one node,
> or you can install it on a remote system and make a Spark cluster.
> Thanks
> Madhvi
>
> On Thursday 30 April 2015 09:31 AM, xiaohe lan wrote:
>
>> Hi experts
Hi,
I have a 5-node YARN cluster, and I used spark-submit to submit a simple app:
spark-submit --master yarn target/scala-2.10/simple-project_2.10-1.0.jar
--class scala.SimpleApp --num-executors 5
I have set the number of executors to 5, but from the Spark UI I could see only
two executors and it ran ve
Hi,
When I start spark-shell by passing yarn to the master option, println does not
print the elements of an RDD:
bash-4.1$ spark-shell --master yarn
15/05/17 01:50:08 WARN NativeCodeLoader: Unable to load native-hadoop
library for your platform... using builtin-java classes where applicable
Welcome to
executor-cores param? While you submit the job, do a ps aux
> | grep spark-submit and see the exact command parameters.
>
> Thanks
> Best Regards
>
> On Sat, May 16, 2015 at 12:31 PM, xiaohe lan
> wrote:
>
>> Hi,
>>
>> I have a 5 nodes yarn cluster, I used
[garbled table pasted from the Spark UI: per-host task time, input size, and
shuffle read/write in MB]
On Sun, May 17, 2015 at 11:50 PM, xiaohe lan wrote:
> bash-4.1$ ps aux | grep SparkSubmit
> xilan 1704 13.2 1.2 5275520 380244 pts/0 Sl+ 08:39 0:13
> /scratch/xilan/jdk1.8.0_45/bin/java -cp
>
, Sandy Ryza
wrote:
> Hi Xiaohe,
>
> All the Spark options must go before the jar or they won't take effect.
>
> -Sandy
>
> On Sun, May 17, 2015 at 8:59 AM, xiaohe lan
> wrote:
>
>> Sorry, both of them are actually assigned tasks.
>>
>> Aggreg
:
> Awesome!
>
> It's documented here:
> https://spark.apache.org/docs/latest/submitting-applications.html
>
> -Sandy
>
> On Mon, May 18, 2015 at 8:03 PM, xiaohe lan
> wrote:
>
>> Hi Sandy,
>>
>> Thanks for your information. Yes, spark-submit --master y
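i.e., with every option placed before the application jar, the command from my
first mail becomes:
spark-submit --master yarn --class scala.SimpleApp --num-executors 5 target/scala-2.10/simple-project_2.10-1.0.jar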