Hi Jikai,
The reason I ask is that your stack trace includes this section:
com.hadoop.compression.lzo.GPLNativeCodeLoader.<clinit>(GPLNativeCodeLoader.java:32)
at com.hadoop.compression.lzo.LzoCodec.<clinit>(LzoCodec.java:71)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(C
Actually the program hangs just by calling dataAllRDD.count(). I suspect
the RDD is not created successfully when its elements are too big. When nY =
3000, dataAllRDD.count() works (each element of dataAll = 3000*400*64 bits =
9.6 MB), but when nY = 4000, it hangs (4000*400*64 bits = 12.8 MB).
Wha
-incubator, +user
So, are you trying to "transpose" your data?
val rdd = sc.parallelize(List(List(1,2,3,4,5),List(6,7,8,9,10))).repartition(2)
First you could pair each value with its position in its list:
val withIndex = rdd.flatMap(_.zipWithIndex)
then group by that position, and discard the
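A rough sketch of those remaining steps, continuing from withIndex above (the ordering inside each group is not guaranteed):
// Group the (value, index) pairs by the index, then keep only the values.
val transposed = withIndex.groupBy(_._2).values.map(_.map(_._1))
transposed.collect().foreach(println)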
Hi,
when I run a Spark application on my local machine using spark-submit:
$SPARK_HOME/bin/spark-submit --driver-memory 1g
When I want to interrupt the computation with Ctrl-C, it interrupts the current stage,
but then it waits and exits after around 5 minutes, and sometimes doesn't exit at all,
and the only way
The long lineage causes a long/deep Java object tree (a DAG of RDD objects),
which needs to be serialized as part of task creation. When
serializing, the whole object DAG needs to be traversed, leading to the
StackOverflowError.
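A minimal sketch of one common remedy, periodic checkpointing to truncate the lineage (the checkpoint directory, loop, and interval here are only illustrative):
// Checkpointing writes the RDD to stable storage and drops its lineage,
// so the object graph serialized with each task stays shallow.
sc.setCheckpointDir("hdfs:///tmp/spark-checkpoints")
var current = sc.parallelize(1 to 1000000).map(_.toLong)
for (i <- 1 to 200) {
  current = current.map(_ + 1)
  if (i % 20 == 0) {
    current.cache()
    current.checkpoint()   // truncate the lineage here
    current.count()        // force materialization of the checkpoint
  }
}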
TD
On Mon, Aug 11, 2014 at 7:14 PM, randylu wrote:
> hi, TD. I
Assuming that your data is very sparse, I would recommend
RDD.repartition. But if that is not the case and you don't want to
shuffle the data, you can try a CombineInputFormat and then parse the
lines into labeled points. Coalesce may cause locality problems if you
don't use the right number of part
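A rough sketch of the repartition route (the path, the partition count, and the line format, label followed by space-separated features, are all illustrative):
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

// Repartition the raw text up front, then parse each line into a LabeledPoint.
val raw = sc.textFile("hdfs:///data/train.txt").repartition(200)
val points = raw.map { line =>
  val tokens = line.split(' ')
  LabeledPoint(tokens.head.toDouble, Vectors.dense(tokens.tail.map(_.toDouble)))
}.cache()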
For linear models, the constructors are now public. You can save the
weights to HDFS, then load the weights back and use the constructor to
create the model. -Xiangrui
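A rough sketch of that approach (LogisticRegressionModel and the HDFS path are just examples, and saveAsObjectFile is only one of several ways to persist the weights):
import org.apache.spark.mllib.classification.LogisticRegressionModel
import org.apache.spark.mllib.linalg.Vector

// Save the weights and intercept of a trained model ("model") to HDFS.
sc.parallelize(Seq((model.weights, model.intercept)), 1)
  .saveAsObjectFile("hdfs:///models/lr")

// Later: load them back and rebuild the model with the public constructor.
val (weights, intercept) =
  sc.objectFile[(Vector, Double)]("hdfs:///models/lr").first()
val restored = new LogisticRegressionModel(weights, intercept)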
On Mon, Aug 11, 2014 at 10:27 PM, XiaoQinyu wrote:
> hello:
>
> I want to know, if I use historical data to train a model and I want
What did you set for driver memory? The default value is 256m or 512m,
which is too small. Try to set "--driver-memory 10g" with spark-submit
or spark-shell and see whether it works or not. -Xiangrui
On Mon, Aug 11, 2014 at 6:26 PM, durin wrote:
> I'm trying to apply KMeans training to some text
Thanks for your answer.
Yes, I want to transpose data.
At this point, I have one more question.
I tested it with
RDD1
List(1, 2, 3, 4, 5)
List(6, 7, 8, 9, 10)
List(11, 12, 13, 14, 15)
List(16, 17, 18, 19, 20)
And the result is...
ArrayBuffer(11, 1, 16, 6)
ArrayBuffer(2, 12, 7, 17)
ArrayBuffer(3,
I was hoping I could make the system behave like a blocking queue: if the
output is too slow, the buffers (storage space for RDDs) fill up and then block,
instead of dropping existing RDDs, until the input itself blocks (slows down
its consumption).
On a side note I was wondering: is there the same
Sure, just add ".toList.sorted" in there. Putting it all together in one big
expression:
val rdd = sc.parallelize(List(List(1,2,3,4,5),List(6,7,8,9,10)))
val result =
rdd.flatMap(_.zipWithIndex).groupBy(_._2).values.map(_.map(_._1).toList.sorted)
List(2, 7)
List(1, 6)
List(4, 9)
List(3, 8)
List(5, 10)
spark.speculation was not set; is there any speculative execution on the Tachyon side?
tachyon-env.sh only has the following changes:
export TACHYON_MASTER_ADDRESS=test01.zala
#export TACHYON_UNDERFS_ADDRESS=$TACHYON_HOME/underfs
export TACHYON_UNDERFS_ADDRESS=hdfs://test01.zala:8020
export TACHYON_WORKER_MEMORY_SIZE=
More interesting: if spark-shell is started on the master node (test01),
then
parquetFile.saveAsParquetFile("tachyon://test01.zala:19998/parquet_tablex")
14/08/12 11:42:06 INFO : initialize(tachyon://...
...
...
14/08/12 11:42:06 INFO : File does not exist:
tachyon://test01.zala:19998/parquet_tablex/_
I figured it out. My indices parameters for the sparse vector were messed up. It
was a good lesson for me:
When using Vectors.sparse(int size, int[] indices, double[] values) to
generate a vector, size is the size of the whole vector, not just the number of
elements with values. The indices a
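A minimal example of the distinction (the values here are made up):
import org.apache.spark.mllib.linalg.Vectors

// A 5-dimensional vector with non-zeros only at indices 1 and 3:
// size is the full dimensionality (5), not the number of non-zero entries (2).
val v = Vectors.sparse(5, Array(1, 3), Array(2.5, 4.0))
// v.toArray == Array(0.0, 2.5, 0.0, 4.0, 0.0)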
hi, TD. Thanks very much! I got it.
Hi All,
I have a question regarding LDA topic modeling. I need to do topic modeling
on ad data.
Does MLBase support LDA topic modeling?
Or is there any stable, tested LDA implementation on Spark?
BR,
Aslan
Hi Xiangrui,
Thanks for your reply!
Yes, our data is very sparse, but RDD.repartition invokes
RDD.coalesce(numPartitions, shuffle = true) internally, so I think it has
the same effect as #2, right?
As for CombineInputFormat, although I haven't tried it, it sounds like it
will combine multiple
1. Be careful: HDFS is better for large files, not bunches of small files.
2. If that's really what you want, roll your own.
def writeLines(iterator: Iterator[(String, String)]) = {
val writers = new mutable.HashMap[String, BufferedWriter] // (key, writer) map
try {
while (iterator.hasN
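A more complete sketch of the same idea (the output paths and file naming are placeholders); it would be invoked per partition, e.g. rdd.foreachPartition(writeLines):
import java.io.{BufferedWriter, File, FileWriter}
import scala.collection.mutable

def writeLines(iterator: Iterator[(String, String)]): Unit = {
  val writers = new mutable.HashMap[String, BufferedWriter] // (key, writer) map
  try {
    while (iterator.hasNext) {
      val (key, line) = iterator.next()
      // Open a writer for each key on first use, then append the line.
      val writer = writers.getOrElseUpdate(
        key, new BufferedWriter(new FileWriter(new File(s"/tmp/out-$key.txt"), true)))
      writer.write(line)
      writer.newLine()
    }
  } finally {
    writers.values.foreach(_.close())
  }
}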
Yin helped me with that, and I appreciate the on-list follow-up. A few
questions: Why is this the case? I guess, does building it with the
Thrift server add much more time/size to the final build? It seems that
unless it is documented well, people will miss that and this situation will
occur. Why would we no
Thanks for putting this together, Andrew.
On Tue, Aug 12, 2014 at 2:11 AM, Andrew Ash wrote:
> Hi Chen,
>
> Please see the bug I filed at
> https://issues.apache.org/jira/browse/SPARK-2984 with the
> FileNotFoundException on _temporary directory issue.
>
> Andrew
>
>
> On Mon, Aug 11, 2014 at 1
Hello.
Is there an integration of Spark (PySpark) with Anaconda?
I googled a lot and didn't find relevant information.
Could you please point me to a tutorial or a simple example?
Thanks in advance
Oleg.
An on-list follow-up: http://prof.ict.ac.cn/BigDataBench/#Benchmarks looks
promising, as it has Spark as one of the platforms used.
Bests,
-Monir
From: Mozumder, Monir
Sent: Monday, August 11, 2014 7:18 PM
To: user@spark.apache.org
Subject: Benchmark on physical Spark cluster
I am trying to get
Sorry, I missed #2. My suggestion is the same as #2. You need to set a
bigger numPartitions to avoid hitting integer bound or 2G limitation,
at the cost of increased shuffle size per iteration. If you use a
CombineInputFormat and then cache, it will try to give you roughly the
same size per partiti
Hive pulls in a ton of dependencies that we were afraid would break
existing Spark applications. For this reason, all Hive submodules are
optional.
On Tue, Aug 12, 2014 at 7:43 AM, John Omernik wrote:
> Yin helped me with that, and I appreciate the onlist followup. A few
> questions: Why is th
I have a class defining an inner static class (a map function). The inner
class tries to refer to a variable instantiated in the outer class, which
results in a NullPointerException. Sample code as follows:
class SampleOuterClass {
private static ArrayList someVariable;
SampleOuterClass
I don't think static members are going to be serialized in the
closure? The instance of Parse will be looking at its local
SampleOuterClass, which may not be initialized on the remote JVM.
On Tue, Aug 12, 2014 at 6:02 PM, Sunny Khatri wrote:
> I have a class defining an inner static class (map
Are there any other workarounds that could be used to pass in the values
from someVariable to the transformation function?
On Tue, Aug 12, 2014 at 10:48 AM, Sean Owen wrote:
> I don't think static members are going to be serialized in the
> closure? the instance of Parse will be looking at i
You could create a copy of the variable inside your "Parse" class;
that way it would be serialized with the instance you create when
calling map() below.
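A sketch of that suggestion, in Scala for brevity (the thread's code is Java, and the names here are placeholders): capture a local copy of the value in the function object so it is serialized with the closure, instead of reading a static field that is only initialized on the driver JVM.
class Parse(val someVariable: Seq[String]) extends Serializable {
  // someVariable travels with this serialized instance to the executors.
  def parse(record: String): String = record + "/" + someVariable.size
}

// Driver side: copy the shared value once, then hand it to Parse.
val copied: Seq[String] = Seq("a", "b")           // stands in for the static field
val parser = new Parse(copied)
val parsed = sc.parallelize(Seq("x", "y")).map(parser.parse)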
On Tue, Aug 12, 2014 at 10:56 AM, Sunny Khatri wrote:
> Are there any other workarounds that could be used to pass in the values
> from someVar
We are probably still in the minority, but our analytics platform based on
Spark + HDFS does not have MapReduce installed. I'm wondering if there is
a DistCp equivalent that leverages Spark to do the work.
Our team is trying to find the best way to do cross-datacenter replication
of our HDFS data t
That should work. Gonna give it a try. Thanks !
On Tue, Aug 12, 2014 at 11:01 AM, Marcelo Vanzin
wrote:
> You could create a copy of the variable inside your "Parse" class;
> that way it would be serialized with the instance you create when
> calling map() below.
>
> On Tue, Aug 12, 2014 at 10:
Hello Spark aficionados,
We upgraded from Spark 1.0.0 to 1.0.1 when the new release came out and
started noticing some weird errors. Even a simple operation like
"reduceByKey" or "count" on an RDD gets stuck in cluster mode. This issue
does not occur with Spark 1.0.0 (in cluster or local mode)
Hi,
Today I created a table with 3 regions and 2 job trackers, but the Spark
job is still taking a lot of time.
I also noticed one thing: the memory of the client was increasing
linearly. Is it that the Spark job first brings the complete data into
memory?
On Thu, Aug 7, 2014 at 7:31 PM, Ted Yu [vi
Hi,
I'm trying to run the streaming.NetworkWordCount example on a Mesos cluster
(0.19.1). I am able to run the SparkPi examples on my Mesos cluster and can run
the streaming example in local[n] mode, but no luck so far with the Spark
Streaming examples running on my Mesos cluster.
I don't particularly see any
Hi Jenny,
Have you copied hive-site.xml to the spark/conf directory? If not, can you put
it in conf/ and try again?
Thanks,
Yin
On Mon, Aug 11, 2014 at 8:57 PM, Jenny Zhao wrote:
>
> Thanks Yin!
>
> here is my hive-site.xml, which I copied from $HIVE_HOME/conf, didn't
> experience problem conne
Good question; I don't know of one but I believe people at Cloudera had some
thoughts of porting Sqoop to Spark in the future, and maybe they'd consider
DistCP as part of this effort. I agree it's missing right now.
Matei
On August 12, 2014 at 11:04:28 AM, Gary Malouf (malouf.g...@gmail.com) wr
This is a back-to-basics question. How do we know when Spark will clone an
object and distribute it with task closures versus synchronizing access to it?
For example, the old rookie mistake of random number generation:
import scala.util.Random
val randRDD = sc.parallelize(0 until 1000).map(ii => R
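For what it's worth, a common pattern that sidesteps the question for RNGs is to create the generator inside the task, so nothing driver-side has to be cloned or shared (a sketch; sizes and seeds are arbitrary):
import scala.util.Random

val perPartitionRand = sc.parallelize(0 until 1000, 4).mapPartitionsWithIndex { (idx, it) =>
  // One Random per partition, created on the executor; the seed is offset by
  // the partition index so partitions don't repeat each other's sequence.
  val rng = new Random(42 + idx)
  it.map(_ => rng.nextDouble())
}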
I tried to access worker info from the SparkContext, but it seems the SparkContext
does not expose such an API. The reason for doing that is that the SparkContext
itself does not seem to have logic to detect whether its workers are dead, so I
would like to add such logic myself.
BTW, it seems the Spark web UI has
Understood, thank you.
Small files are a problem; I am considering processing the data before putting
them in HDFS.
On Tue, Aug 12, 2014 at 9:37 PM, Fengyun RAO wrote:
> 1. be careful, HDFS are better for large files, not bunches of small files.
>
> 2. if that's really what you want, roll it your own.
>
> de
Hi Yin,
hive-site.xml was copied to spark/conf and is the same as the one under
$HIVE_HOME/conf.
Through the Hive CLI, I don't see any problem, but for Spark in yarn-cluster
mode I am not able to switch to a database other than the default one; in
yarn-client mode it works fine.
Thanks!
Jenny
On
Hi, sorry for the delay. Would you have YARN available to test? Given
the discussion in SPARK-2878, this might be a different incarnation of
the same underlying issue.
The option on YARN is spark.yarn.user.classpath.first.
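For reference, one way to set that flag from code (a sketch; it can equally be put in spark-defaults.conf or passed on the spark-submit command line):
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("classpath-first-example")
  .set("spark.yarn.user.classpath.first", "true")  // prefer the user's jars on YARN
val sc = new SparkContext(conf)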
On Mon, Aug 11, 2014 at 1:33 PM, DNoteboom wrote:
> I'm currently running
Uh, for some reason I don't seem to automatically reply to the list any
more.
Here again is my message to Tom.
-- Forwarded message --
Tom,
On Wed, Aug 13, 2014 at 5:35 AM, Tom Vacek wrote:
> This is a back-to-basics question. How do we know when Spark will clone
> an object a
Hi,
On Wed, Aug 13, 2014 at 4:24 AM, Zia Syed wrote:
>
> I dont particularly see any errors on my logs, either on console, or on
> slaves. I see slave downloads the spark-1.0.2-bin-hadoop1.tgz file and
> unpacks them as well. Mesos Master shows quiet alot of Tasks created and
> Finished. I dont
Hi All,
I ran into a problem where, for each stage, most workers finish fast (around 1 min),
but a few workers take about 7 min to finish, which significantly slows down the
process.
As shown below, the running time is very unevenly distributed over the workers.
I wonder whether this is normal. Is
You should try watching this video:
https://www.youtube.com/watch?v=sPhyePwo7FA. For more details, please
search the archives; I had the same kind of question and other people
helped me solve the problem.
On Tue, Aug 12, 2014 at 12:36 PM, XiaoQinyu
wrote:
> Have you solved this problem??
>
>
In MLlib, I found the method to train a matrix factorization model to predict
users' tastes. This function has some parameters, such as lambda and rank; I
cannot find the best values for these parameters or how to optimize them.
Could you please give me some recommendations?
--
T
I checked the javacore file, and there is:
Dump Event "systhrow" (0004) Detail "java/lang/OutOfMemoryError" "Java
heap space" received
After checking the failing thread, I found it occurs in the
SparkFlumeEvent.readExternal() method:
71 for (i <- 0 until numHeaders) {
72 val keyLength = in.rea
Sometimes workers are dead, but the SparkContext does not know it and still sends
them jobs.
On Tuesday, August 12, 2014 7:14 PM, Stanley Shi wrote:
Why do you need to detect the worker status in the application? Your application
generally doesn't need to know where it is executed.
On Wed, Aug 13, 2
This seems like a bug, right? It's not the user's responsibility to manage the
workers.
On Wed, Aug 13, 2014 at 11:28 AM, S. Zhou wrote:
> Sometimes workers are dead but spark context does not know it and still
> send jobs.
>
>
> On Tuesday, August 12, 2014 7:14 PM, Stanley Shi
> wrote:
>
>
> Why
Actually, if you search the Spark mailing list archives you will find many similar
topics. For now, I just want to manage it myself.
On Tuesday, August 12, 2014 8:46 PM, Stanley Shi wrote:
This seems like a bug, right? It's not the user's responsibility to manage the
workers.
On Wed, Aug 13,
Hi Nick,
Thank you for the link. Do I understand correctly that the AMI is for the cloud
only?
Currently we are NOT on the cloud. Is there an option to use Anaconda with
Spark when not on the Amazon cloud?
Thanks
On Wed, Aug 13, 2014 at 12:38 AM, Nick Pentreath
wrote:
> You may want to take a look at this thread
Hi Cheng,
I also hit some issues when I try to start the Thrift server based on a build from
the master branch (I could successfully run it from the branch-1.0-jdbc
branch). Below is my build command:
./make-distribution.sh --skip-java-test -Phadoop-2.4 -Phive -Pyarn
-Dyarn.version=2.4.0 -Dhadoop.version=2.
Just found that this is because the lines below in make-distribution.sh don't work:
if [ "$SPARK_HIVE" == "true" ]; then
cp "$FWDIR"/lib_managed/jars/datanucleus*.jar "$DISTDIR/lib/"
fi
There is no definition of $SPARK_HIVE in make-distribution.sh, so I have to set
it explicitly.
On Wed, Aug 13, 2014 at
Oh, thanks for reporting this. This should be a bug: since SPARK_HIVE was
deprecated, we shouldn't rely on it any more.
On Wed, Aug 13, 2014 at 1:23 PM, ZHENG, Xu-dong wrote:
> Just find this is because below lines in make_distribution.sh doesn't work:
>
> if [ "$SPARK_HIVE" == "true" ]; then
You can define an evaluation metric first and then use a grid search
to find the best set of training parameters. Ampcamp has a tutorial
showing how to do this for ALS:
http://ampcamp.berkeley.edu/big-data-mini-course/movie-recommendation-with-mllib.html
-Xiangrui
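A rough sketch of such a grid search (trainSet and validationSet stand in for RDD[Rating] splits of your data; the parameter grids are arbitrary):
import org.apache.spark.SparkContext._
import org.apache.spark.mllib.recommendation.{ALS, Rating}

val ranks   = Seq(8, 12, 20)
val lambdas = Seq(0.01, 0.1, 1.0)
val numIter = 20

// Train on one split, compute RMSE on the other, keep the best (rank, lambda).
val results = for (rank <- ranks; lambda <- lambdas) yield {
  val model = ALS.train(trainSet, rank, numIter, lambda)
  val predictions = model
    .predict(validationSet.map(r => (r.user, r.product)))
    .map(p => ((p.user, p.product), p.rating))
  val rmse = math.sqrt(
    validationSet.map(r => ((r.user, r.product), r.rating))
      .join(predictions)
      .values
      .map { case (actual, predicted) => (actual - predicted) * (actual - predicted) }
      .mean())
  ((rank, lambda), rmse)
}
val ((bestRank, bestLambda), bestRmse) = results.minBy(_._2)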
On Tue, Aug 12, 2014 at 8:01 PM,
This happens when you are playing around with operations like sortByKey,
mapPartitions, groupBy, and reduceByKey.
One thing you can try is providing the number of partitions (possibly more
than 2x the number of CPUs) while doing these operations.
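A tiny illustration (pairs stands in for any key-value RDD; the partition count is arbitrary):
import org.apache.spark.SparkContext._  // pair-RDD operations in Spark 1.x

val numPartitions = 2 * 16                          // e.g. roughly 2x the total cores
val counts = pairs.reduceByKey(_ + _, numPartitions)
val sorted = counts.sortByKey(true, numPartitions)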
Thanks
Best Regards
On Wed, Aug 13, 2014 at 7:54 AM, Bin wrote: