It's an open issue: https://issues.apache.org/jira/browse/SPARK-4587
That being said, you can work around the issue by serializing the Model
(simple Java serialization) and then restoring it before calling the
prediction job.
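A minimal sketch of that workaround, assuming a trained model value `model` already exists on the driver (the path and the LogisticRegressionModel type are just placeholders for illustration):
import java.io.{FileInputStream, FileOutputStream, ObjectInputStream, ObjectOutputStream}
import org.apache.spark.mllib.classification.LogisticRegressionModel  // example model type only
// save the trained model on the driver (hypothetical local path)
val out = new ObjectOutputStream(new FileOutputStream("/tmp/model.ser"))
out.writeObject(model)
out.close()
// restore it before running the prediction job
val in = new ObjectInputStream(new FileInputStream("/tmp/model.ser"))
val restored = in.readObject().asInstanceOf[LogisticRegressionModel]
in.close()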
Best Regards,
On 22/10/2015 14:33, Sebastian Kuepers wrote:
> Hey,
When I use that I get a "Caused by: org.postgresql.util.PSQLException:
ERROR: column "none" does not exist"
On Thu, Oct 22, 2015 at 9:31 PM, Kayode Odeyemi wrote:
> Hi,
>
> I've trying to load a postgres table using the following expression:
>
> val cachedIndex = cache.get("latest_legacy_group_i
Have a look at this: https://github.com/koeninger/kafka-exactly-once
especially:
https://github.com/koeninger/kafka-exactly-once/blob/master/src/main/scala/example/TransactionalPerBatch.scala
https://github.com/koeninger/kafka-exactly-once/blob/master/src/main/scala/example/TransactionalPerPartiti
There is a heartbeat stream pattern that you can use: Create a service
(perhaps a thread in your driver) that pushes a heartbeat event to a
different stream every N seconds. Consume that stream as well in your
streaming application, and perform an action on every heartbeat.
This has worked well in
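For illustration, a minimal sketch of the heartbeat pattern using an in-memory queue-backed stream instead of an external source like Kafka (an existing StreamingContext `ssc` is assumed):
import scala.collection.mutable
import org.apache.spark.rdd.RDD
val heartbeatQueue = new mutable.Queue[RDD[Long]]()
val heartbeats = ssc.queueStream(heartbeatQueue)
// driver-side thread that enqueues one "tick" every N seconds
val ticker = new Thread {
  override def run(): Unit = {
    while (true) {
      heartbeatQueue += ssc.sparkContext.parallelize(Seq(System.currentTimeMillis()))
      Thread.sleep(10000)
    }
  }
}
ticker.setDaemon(true)
ticker.start()
// perform an action on every heartbeat
heartbeats.foreachRDD(rdd => rdd.foreach(ts => println("heartbeat at " + ts)))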
When fixing the port to the same values as in the stack trace it works too.
The network config of the slaves seems correct.
Thanks,
Eugen
2015-10-23 8:30 GMT+02:00 Akhil Das :
> Mostly a network issue, you need to check your network configuration from
> the aws console and make sure the ports ar
oh no wonder... it undoes the glob (i was reading from /some/path/*),
creates a hadoopRdd for every path, and then creates a union of them using
UnionRDD.
that's not what i want... no need to do a union. AvroInputFormat already has
the ability to handle globs (or multiple paths, comma separated) very
e
Hi Sujit, and All,
I am currently stuck on a big difficulty, and I am eager to get some help from you.
There is a big linear system of equations Ax = b, where A has N rows and N
columns, N is very large, and b = [0, 0, ..., 0, 1]^T. Then I will
solve it to get x = [x1, x2, ..., xn]^T.
The si
I need to run Spark jobs as a service in my project, so there is a
"ServiceManager" in it and it uses
SparkLauncher (org.apache.spark.launcher.SparkLauncher) to submit Spark jobs.
First, I tried to write a demo, putting only the SparkLauncher code in the
main method and running it with java -jar; it's fine
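For reference, a minimal SparkLauncher sketch (paths, class name, and master are assumptions for illustration, not taken from the original message):
import org.apache.spark.launcher.SparkLauncher
val process = new SparkLauncher()
  .setSparkHome("/opt/spark")               // assumption
  .setAppResource("/path/to/my-job.jar")    // assumption
  .setMainClass("com.example.MyJob")        // assumption
  .setMaster("yarn-cluster")
  .setConf(SparkLauncher.DRIVER_MEMORY, "2g")
  .launch()                                 // returns a java.lang.Process
val exitCode = process.waitFor()
println("Spark job finished with exit code " + exitCode)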
Hi there,
we have a set of relatively lightweight jobs that we would like to run
repeatedly on our Spark cluster.
The situation is as follows: we have a reliable source of data, a Cassandra
database. One table contains time series data, which we would like to
analyse. To do so we read a window o
You can do the following. Start the spark-shell. Register the UDFs in the
shell using sqlContext, then start the Thrift Server using startWithContext
from the spark shell:
https://github.com/apache/spark/blob/master/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/HiveThr
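A minimal sketch of that flow from the spark-shell (the UDF is made up for illustration; the cast assumes the shell's sqlContext is a HiveContext, as it is in a -Phive build):
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.hive.thriftserver.HiveThriftServer2
// register a UDF on the shell's sqlContext
sqlContext.udf.register("to_upper", (s: String) => s.toUpperCase)
// expose the same context (and its UDFs) through the JDBC/ODBC Thrift server
HiveThriftServer2.startWithContext(sqlContext.asInstanceOf[HiveContext])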
Hi,
I have a big sorted RDD sRdd (~962 million elements), and need to scan its
elements in order (using sRdd.toLocalIterator).
But the process failed when the scan had reached around 893 million
elements, returning the following exception:
Anyone has idea? Thanks!
Exception in thread "mai
Hi Yifan,
I think this is a result of Kryo trying to serialize something too large.
Have you tried increasing your partitioning?
Cheers,
Jem
On Fri, Oct 23, 2015 at 11:24 AM Yifan LI wrote:
> Hi,
>
> I have a big sorted RDD sRdd(~962million elements), and need to scan its
> elements in orde
Thanks for your advice, Jem. :)
I will increase the partitioning and see if it helps.
Best,
Yifan LI
> On 23 Oct 2015, at 12:48, Jem Tucker wrote:
>
> Hi Yifan,
>
> I think this is a result of Kryo trying to serialize something too large.
> Have you tried to increase your partitioning
Hi Yifan,
You could also try increasing spark.kryoserializer.buffer.max.mb
(64 MB by default): useful if your
default buffer size needs to go beyond 64 MB.
Per the docs:
Maximum allowable size of Kryo serialization buffer. This must be larger
than any object y
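For example, a minimal SparkConf sketch (property name as above; note that newer Spark versions spell it spark.kryoserializer.buffer.max with a size suffix such as "256m"):
import org.apache.spark.SparkConf
val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryoserializer.buffer.max.mb", "256")  // raise from the 64 MB default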
Hi, I run Spark to write data to HBase, but got a NoSuchMethodError:
15/10/23 18:45:21 WARN TaskSetManager: Lost task 0.0 in stage 1.0 (TID 1,
dn18-formal.i.nease.net): java.lang.NoSuchMethodError:
com.google.common.io.ByteStreams.limit(Ljava/io/InputStream;J)Ljava/io/InputStream;
I found g
Hi Sander,
Thank you for your very informative email. From your email, I've learned
quite a bit.
>>>Is the condition determined somehow from the data coming through
streamLogs, and is newData streamLogs again (rather than a whole data
source?)
No, they are two different Streams. I have two str
xjlin0 wrote
> I cannot enter REPL shell in 1.4.0/1.4.1/1.5.0/1.5.1(with pre-built with
> or without Hadoop or home compiled with ant or maven). There was no error
> message in v1.4.x, system prompt nothing. On v1.5.x, once I enter
> $SPARK_HOME/bin/pyspark or spark-shell, I got
>
> Error: Could
Hello!
How to adjust the memory settings properly for SparkR with master="local[*]"
in R?
When running from R, SparkR doesn't accept memory settings :(
I use the following commands:
R> library(SparkR)
R> sc <- sparkR.init(master = "local[*]", sparkEnvir =
list(spark.driver.memory = "5g"
Hi Matej,
I'm also using this and I'm seeing the same behavior here; my driver has
only 530MB, which is the default value.
Maybe this is a bug.
2015-10-23 9:43 GMT-02:00 Matej Holec :
> Hello!
>
> How to adjust the memory settings properly for SparkR with
> master="local[*]"
> in R?
>
>
> *When r
just try dropping in that Jar. Hadoop core ships with an out-of-date Guava JAR
to avoid breaking old code downstream, but 2.7.x is designed to work with later
versions too (i.e. it has moved off any of the now-removed methods). See
https://issues.apache.org/jira/browse/HADOOP-10101 for the speci
Sandy
The assembly jar does contain org.apache.spark.deploy.yarn.ExecutorLauncher.
I am trying to find out how i can increase the logging level, so I know the
exact classpath used by Yarn ContainerLaunch.
Deenar
On 23 October 2015 at 03:30, Sandy Ryza wrote:
> Hi Deenar,
>
> The version of Spa
Thanks for the advice. In my case it turned out to be two issues:
- use Java rather than Scala to launch the process, putting the core Scala libs
on the class path.
- I needed a merge strategy of Concat for reference.conf files in my build.sbt
(see the sketch below).
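For reference, a minimal build.sbt fragment for that merge strategy (assuming sbt-assembly 0.14.x syntax):
assemblyMergeStrategy in assembly := {
  case "reference.conf" => MergeStrategy.concat  // concatenate all reference.conf files
  case x =>
    val oldStrategy = (assemblyMergeStrategy in assembly).value
    oldStrategy(x)
}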
Regards,
Mike
> On 23 Oct 2015, at 01:00, Ted Yu
Hi,
I can't seem to get a successful maven build. Please see command output
below:
bash-3.2$ ./make-distribution.sh --name spark-latest --tgz --mvn mvn
-Dhadoop.version=2.7.0 -Phadoop-2.7 -Phive -Phive-thriftserver -DskipTests
clean package
+++ dirname ./make-distribution.sh
++ cd .
++ pwd
+ SPAR
This doesn't show the actual error output from Maven. I have a strong
guess that you haven't set MAVEN_OPTS to increase the memory Maven can
use.
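For context, the Spark build documentation of this era suggests something along these lines (treat the exact values as an assumption for your environment):
export MAVEN_OPTS="-Xmx2g -XX:MaxPermSize=512M -XX:ReservedCodeCacheSize=512m"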
On Fri, Oct 23, 2015 at 6:14 AM, Kayode Odeyemi wrote:
> Hi,
>
> I can't seem to get a successful maven build. Please see command output
> below:
>
> b
do you have JAVA_HOME set to a java 7 jdk?
2015-10-23 7:12 GMT-04:00 emlyn :
> xjlin0 wrote
> > I cannot enter REPL shell in 1.4.0/1.4.1/1.5.0/1.5.1(with pre-built with
> > or without Hadoop or home compiled with ant or maven). There was no
> error
> > message in v1.4.x, system prompt nothing.
JAVA_HOME is unset.
I've also tried setting it with:
export JAVA_HOME=$(/usr/libexec/java_home)
which sets it to
"/Library/Java/JavaVirtualMachines/jdk1.8.0_31.jdk/Contents/Home" and I
still get the same problem.
On 23 October 2015 at 14:37, Jonathan Coveney wrote:
> do you have JAVA_HOME set to
Here is the Spark configuration and error log:
spark.dynamicAllocation.enabled        true
spark.shuffle.service.enabled          true
spark.dynamicAllocation.minExecutors   10
spark.executor.cores                   1
spark.executor.memory                  6G
Spark asked YARN to let an executor use 7GB of memory, but it used
more, so it was killed. In each case you can see that the executor memory plus
overhead equals the YARN allocation requested. What's the issue with
that?
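For illustration, the arithmetic works out roughly as follows, assuming the 1.x default overhead of max(384 MB, 10% of executor memory) and YARN rounding requests up to its allocation increment:
  6144 MB  (spark.executor.memory 6G)
+  614 MB  (overhead: max(384 MB, 0.10 * 6144 MB))
= 6758 MB, which YARN rounds up, i.e. roughly the ~7 GB request in the log.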
On Fri, Oct 23, 2015 at 6:46 AM, JoneZhang wrote:
> Here is the spark configure and
Both Spark 1.5 and 1.5.1 are released so it certainly shouldn't be a problem
-
Robin East
Spark GraphX in Action Michael Malak and Robin East
Manning Publications Co.
http://www.manning.com/books/spark-graphx-in-action
If you can reproduce it, then I think you can open up a JIRA for this.
Thanks
Best Regards
On Fri, Oct 23, 2015 at 1:37 PM, Eugen Cepoi wrote:
> When fixing the port to the same values as in the stack trace it works
> too. The network config of the slaves seems correct.
>
> Thanks,
> Eugen
>
> 201
I saw this when I tested manually (without ./make-distribution)
Detected Maven Version: 3.2.2 is not in the allowed range 3.3.3.
So I simply upgraded maven to 3.3.3.
Resolved. Thanks
On Fri, Oct 23, 2015 at 3:17 PM, Sean Owen wrote:
> This doesn't show the actual error output from Maven. I ha
Thanks for the suggestion.
1. Heartbeat:
As a matter of fact, the heartbeat solution is what I thought of as well.
However, that needs to live outside spark-streaming.
Furthermore, it cannot be generalized to all Spark applications. For example,
I am doing several filtering operations before I reach th
emlyn wrote
>
> xjlin0 wrote
>> I cannot enter REPL shell in 1.4.0/1.4.1/1.5.0/1.5.1(with pre-built with
>> or without Hadoop or home compiled with ant or maven). There was no
>> error message in v1.4.x, system prompt nothing. On v1.5.x, once I enter
>> $SPARK_HOME/bin/pyspark or spark-shell, I
Hi all,
I am looking into how Spark handles data locality wrt Tachyon. My main concern
is how this is coordinated. Will it send a task based on a file loaded from
Tachyon to a node that it knows has that file locally, and how does it know
which nodes have what?
Kind regards,
Shane
I have a Spark job that creates 6 million rows in RDDs. I convert the RDDs
into DataFrames and write them to HDFS. Currently it takes 3 minutes to write
them to HDFS.
Here is the snippet:-
RDDList.parallelStream().forEach(mapJavaRDD -> {
if (mapJavaRDD != null) {
Hi,
I created 2 workers on the same machine, each with 4 cores and 6GB RAM.
I submitted the first job, and it allocated 2 cores on each of the worker
processes, and utilized the full 4GB RAM for each executor process.
When I submit my second job, it always stays in the WAITING state.
Cheers!!
On Tue, Oct 20, 20
I have a Spark job that creates 6 million rows in RDDs. I convert the RDDs
into DataFrames and write them to HDFS. Currently it takes 3 minutes to write
them to HDFS.
I am using spark 1.5.1 with YARN.
Here is the snippet:-
RDDList.parallelStream().forEach(mapJavaRDD -> {
if (mapJava
Hi Bin,
Very likely the RedisClientPool is being closed too quickly before map has
a chance to get to it. One way to verify would be to comment out the .close
line and see what happens. FWIW I saw a similar problem writing to Solr
where I put a commit where you have a close, and noticed that the c
You can run 2 threads in the driver and Spark will FIFO-schedule the 2 jobs on
the same Spark context you created (executors and cores)... the same idea is
used for the Spark SQL Thrift server flow.
For streaming, I think it lets you run only one stream at a time even if you
run them on multiple threads on the dri
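A minimal sketch of the two-threads-on-one-context idea (the input paths are hypothetical; `sc` is an existing SparkContext shared by both jobs):
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration
val jobA = Future { sc.textFile("/data/a").count() }  // hypothetical path
val jobB = Future { sc.textFile("/data/b").count() }  // hypothetical path
// the scheduler (FIFO by default, FAIR if configured) runs both jobs on the shared executors
val Seq(countA, countB) = Await.result(Future.sequence(Seq(jobA, jobB)), Duration.Inf)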
I got this working. For others trying this: it turns out that in Spark 1.3/CDH 5.4
spark.yarn.jar=local:/opt/cloudera/parcels/
I had changed this to reflect the 1.5.1 version of spark assembly jar
spark.yarn.jar=/opt/spark-1.5.1-bin/...
and this didn't work, I had to drop the "local:" prefix
spar
https://github.com/databricks/spark-avro/pull/95
On Fri, Oct 23, 2015 at 5:01 AM, Koert Kuipers wrote:
> oh no wonder... it undoes the glob (i was reading from /some/path/*),
> creates a hadoopRdd for every path, and then creates a union of them using
> UnionRDD.
>
> thats not what i want... no
Hi Zhiliang,
For a system of equations AX = y, Linear Regression will give you a
best-fit estimate for A (coefficient vector) for a matrix of feature
variables X and corresponding target variable y for a subset of your data.
OTOH, what you are looking for here is to solve for x a system of equatio
Hello.
I have activated file checkpointing for a DStream to enable
updateStateByKey.
My unit test worked with no problem, but when I integrated this into my
full stream I got this exception:
java.io.NotSerializableException: DStream checkpointing has been enabled but
the DStreams wi
Mind sharing your code, if possible ?
Thanks
On Fri, Oct 23, 2015 at 9:49 AM, crakjie wrote:
> Hello.
>
> I have activated the file checkpointing for a DStream to unleach the
> updateStateByKey.
> My unit test worked with no problem but when I have integrated this in my
> full stream I got this
Hello,
Data about my spark job is below. My source data is only 916MB (stage 0)
and 231MB (stage 1), but when I join the two data sets (stage 2) it takes a
very long time, and I see that the shuffled data is 614GB. Is this something
expected? Both data sets produce 200 partitions.
Stage Id    Descri
Hi all,
I am running a simple word count job on a cluster of 4 nodes (24 cores per
node). I am varying two parameters in the configuration:
spark.python.worker.memory and the number of partitions in the RDD. My job
is written in python.
I am observing a discontinuity in the run time of the job whe
Sorry, I sent the wrong join code snippet; the actual snippet is
ggImpsDf.join(
aggRevenueDf,
aggImpsDf("id_1") <=> aggRevenueDf("id_1")
&& aggImpsDf("id_2") <=> aggRevenueDf("id_2")
&& aggImpsDf("day_hour") <=> aggRevenueDf("day_hour")
&& aggImpsDf("day_hour_2") <=> aggRevenue
Hi All,
got this weird error when I tried to run Spark in YARN-CLUSTER mode. I have
33 files and I am running Spark over them in a bash loop, one file at a time;
most of them worked OK except for a few files.
Is the error below an HDFS or Spark error?
Exception in thread "Driver" java.lang.IllegalArgumentException: Pathname
/
I faced a similar issue. The actual problem was not in the file name.
We run Spark on YARN. The actual error was seen in the logs by running
the command:
$ yarn logs -applicationId
Scroll from the beginning to find the actual error.
~Pratik
On Fri, Oct 23, 2015 at 11:40 AM kali.tumm...@gma
You might be referring to some class-level variables from your code.
I got to see the actual field which caused the error when I marked the
class as serializable and ran it on the cluster:
class MyClass extends java.io.Serializable
The following resources will also help:
https://youtu.be/mHF3UPqLOL8?
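A minimal sketch of the two usual fixes for this kind of serialization error (class, field, and method names are hypothetical):
class MyClass(config: String) extends java.io.Serializable {  // 1) make the enclosing class serializable
  def process(rdd: org.apache.spark.rdd.RDD[String]) = {
    val localConfig = config             // 2) or copy the field into a local val so the closure
    rdd.filter(_.contains(localConfig))  //    captures only the value, not the whole enclosing class
  }
}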
Full Error:-
at
org.apache.hadoop.hdfs.DistributedFileSystem.getPathName(DistributedFileSystem.java:195)
at
org.apache.hadoop.hdfs.DistributedFileSystem.access$000(DistributedFileSystem.java:104)
at
org.apache.hadoop.hdfs.DistributedFileSystem$16.doCall(DistributedFileSystem.java:83
Check what you have at SimpleMktDataFlow.scala:106
~Pratik
On Fri, Oct 23, 2015 at 11:47 AM kali.tumm...@gmail.com <
kali.tumm...@gmail.com> wrote:
> Full Error:-
> at
>
> org.apache.hadoop.hdfs.DistributedFileSystem.getPathName(DistributedFileSystem.java:195)
> at
>
> org.apache.hadoop.
If I have two columns
StructType(Seq(
StructField("id", LongType),
StructField("phones", ArrayType(StringType
I want to add an index for “phones” before I explode it.
Can this be implemented as a GenericUDF?
I tried DataFrame.explode. It worked for simple types like string, but I
could not f
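A sketch of one way to do this with a plain Scala UDF rather than a GenericUDF (assumes a DataFrame `df` with the schema above; output column names are illustrative):
import org.apache.spark.sql.functions.{col, explode, udf}
// zip each phone with its position, then explode the resulting array of structs
val withIndex = udf((xs: Seq[String]) =>
  if (xs == null) Seq.empty[(String, Int)] else xs.zipWithIndex)
val exploded = df
  .withColumn("phone_idx", explode(withIndex(col("phones"))))
  .select(col("id"), col("phone_idx._1").as("phone"), col("phone_idx._2").as("index"))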
Hi Sujit,
First, I must express my deep appreciation and respect for your kind help
and excellent knowledge. It would be best if you and I were in the same
place, so that I could thank you in person.
I will try your approach with Spark MLlib SVD.
For Linear Regressi
The user facing type mapping is documented here:
http://spark.apache.org/docs/latest/sql-programming-guide.html#data-types
On Fri, Oct 23, 2015 at 12:10 PM, Benyi Wang wrote:
> If I have two columns
>
> StructType(Seq(
> StructField("id", LongType),
> StructField("phones", ArrayType(StringTy
How did you specify the number of cores each executor can use?
Be sure to use this when submitting jobs with spark-submit:
--total-executor-cores 100
Other options won't work, in my experience.
On Fri, Oct 23, 2015 at 8:36 AM, gaurav sharma
wrote:
> Hi,
>
> I created 2 workers on same machine
Don't use groupBy; use reduceByKey instead. groupBy should always be
avoided as it leads to a lot of shuffle reads/writes.
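A minimal RDD sketch of the difference (the key/value pairs are made up; `sc` is an existing SparkContext):
val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
// groupByKey shuffles every value for a key across the network before summing
val viaGroup = pairs.groupByKey().mapValues(_.sum)
// reduceByKey combines values map-side first, so far less data is shuffled
val viaReduce = pairs.reduceByKey(_ + _)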
On Fri, Oct 23, 2015 at 11:39 AM, pratik khadloya
wrote:
> Sorry i sent the wrong join code snippet, the actual snippet is
>
> ggImpsDf.join(
>aggRevenueDf,
>aggImps
This e-mail and any files transmitted with it are for the sole use of the
intended recipient(s) and may contain confidential and privileged information.
If you are not the intended recipient(s), please reply to the sender and
destroy all copies of the original message. Any unauthorized review, u
Take a look at first section of https://spark.apache.org/community
On Fri, Oct 23, 2015 at 1:46 PM, wrote:
> This e-mail and any files transmitted with it are for the sole use of the
> intended recipient(s) and may contain confidential and privileged
> information. If you are not the intended re
Actually the groupBy is not taking a lot of time.
The join that I do later takes most (95%) of the time.
Also, the grouping I am doing is based on the DataFrame API, which does not
contain any reduceBy function... I guess the DF automatically uses
reduceBy when we do a groupBy.
~Prat
Hi, I am having a weird issue. I have a Spark job which has a bunch of
hiveContext.sql() calls and creates ORC files as part of Hive tables with
partitions, and it runs fine on 1.4.1 and Hadoop 2.4.
Now I tried to move to Spark 1.5.1/Hadoop 2.6, and the Spark job does not work
as expected; it does not create ORC files
Hi,
Here's my situation: I have some kind of offline dataset, but I want to
form a virtual data stream feeding into Spark Streaming. My code looks like
this:
// sort offline data by time
1) JavaRDD sortedByTime = offlineDataRDD.sortBy( );
// compute a list of JavaRDD, each element JavaRDD
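One way to sketch this (in Scala rather than Java) is with a queue-backed stream; `sc` is assumed to exist, and `batches` stands in for the time-ordered list of RDDs computed above:
import scala.collection.mutable
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Seconds, StreamingContext}
val ssc = new StreamingContext(sc, Seconds(5))
val queue = new mutable.Queue[RDD[String]]()
val virtualStream = ssc.queueStream(queue, oneAtATime = true)  // one queued RDD per batch interval
virtualStream.foreachRDD { rdd =>
  // same processing the live pipeline would do
  println("batch size: " + rdd.count())
}
ssc.start()
batches.foreach(queue += _)  // enqueue the offline batches in time order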
In an RDD map function, is there a way I can know the list of host names where
the map runs? Any code sample would be appreciated.
thx,
Weide
Can you outline your use case a bit more ?
Do you want to know all the hosts which would run the map ?
Cheers
On Fri, Oct 23, 2015 at 5:16 PM, weoccc wrote:
> in rdd map function, is there a way i can know the list of host names
> where the map runs ? any code sample would be appreciated ?
>
>
yea,
my use case is that I want to have some external communication where the rdd
is being run in map. The external communication might be handled separately,
transparent to Spark. What would be the hacky way and the non-hacky way to do
that? :)
Weide
On Fri, Oct 23, 2015 at 5:32 PM, Ted Yu wrote:
I need to save the Twitter statuses I receive so that I can do additional
batch-based processing on them in the future. Is it safe to assume HDFS is
the best way to go?
Any idea what the best way is to save Twitter statuses to HDFS?
JavaStreamingContext ssc = new JavaStreamingContext(jsc, new
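A Scala sketch of one simple option (the HDFS path is hypothetical; TwitterUtils comes from the spark-streaming-twitter artifact, and `ssc` is the StreamingContext):
import org.apache.spark.streaming.twitter.TwitterUtils
val tweets = TwitterUtils.createStream(ssc, None)
tweets
  .map(status => status.getText.replace("\n", " "))  // or serialize the full Status as JSON
  .saveAsTextFiles("hdfs:///data/tweets/batch", "txt")  // one directory per batch interval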
I have not been able to run spark-shell in yarn-cluster mode since 1.5.0 due to
the same issue described by [SPARK-9776]. Did this pull request fix the issue?
https://github.com/apache/spark/pull/8947
I still have the same problem with 1.5.1 (I am running on HDP 2.2.6 with Hadoop
2.6)
Thanks.
-Y
Hi Shane,
Tachyon provides an api to get the block locations of the file which Spark
uses when scheduling tasks.
Hope this helps,
Calvin
On Fri, Oct 23, 2015 at 8:15 AM, Kinsella, Shane
wrote:
> Hi all,
>
>
>
> I am looking into how Spark handles data locality wrt Tachyon. My main
> concern is
I noticed in the comments for HadoopFsRelation.buildScan it says:
 * @param inputFiles For a non-partitioned relation, it contains paths of
 *                   all data files in the relation. For a partitioned
 *                   relation, it contains paths of all data files in a
 *                   single selected partition.
do i
How many rows are you joining? How many rows in the output?
Regards
Sab
On 24-Oct-2015 2:32 am, "pratik khadloya" wrote:
> Actually the groupBy is not taking a lot of time.
> The join that i do later takes the most (95 %) amount of time.
> Also, the grouping i am doing is based on the DataFrame
I specified spark.cores.max = 4,
but it started 2 executors with 2 cores each, one on each of the 2 workers.
In standalone cluster mode, though we can specify Worker cores, there is no
way to specify the number of cores an executor must use on a particular
worker machine.
On Sat, Oct 24, 2015 at 1:41