Hi David,
Can you also try with Spark 1.3 if possible? I believe there was a 2x
improvement on K-Means between 1.2 and 1.3.
Thanks,
Burak
On Sat, Mar 28, 2015 at 9:04 PM, davidshen84 wrote:
> Hi Jao,
>
> Sorry to pop up this old thread. I am having the same problem you did. I
> want to know if you have figured out how to improve k-means on Spark.
Hi Jao,
Sorry to pop up this old thread. I am having the same problem you did. I
want to know if you have figured out how to improve k-means on Spark.
I am using Spark 1.2.0. My data set is about 270k vectors, each with about
350 dimensions. If I set k=500, the job takes about 3 hours on my cluster.
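For reference, this is roughly how such a job is set up with MLlib (the input
path and parsing here are illustrative, not the exact code):

  import org.apache.spark.mllib.clustering.KMeans
  import org.apache.spark.mllib.linalg.Vectors

  // ~270k lines of ~350 comma-separated features (hypothetical path/format)
  val data = sc.textFile("hdfs:///data/vectors.csv")
    .map(line => Vectors.dense(line.split(',').map(_.toDouble)))
    .cache()  // k-means iterates over the data, so caching the vectors matters

  val model = KMeans.train(data, k = 500, maxIterations = 20)
  println(s"cost = ${model.computeCost(data)}")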
This is something we are hoping to support in Spark 1.4. We'll post more
information to JIRA when there is a design.
On Thu, Mar 26, 2015 at 11:22 PM, Jianshi Huang
wrote:
> Hi,
>
> Does anyone have a similar request?
>
> https://issues.apache.org/jira/browse/SPARK-6561
>
> When we save a DataFrame int
In this case I'd probably just store it as a String. Our casting rules
(which come from Hive) are such that when you use a string as a number or
boolean it will be cast to the desired type.
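A minimal sketch of what that looks like in practice (the table and column
names here are made up, Spark 1.3 style):

  // "amount" is stored as a string, but using it in a numeric context
  // casts it for you; an explicit CAST is shown as well for clarity
  sqlContext.sql("SELECT * FROM events WHERE amount > 100").show()
  sqlContext.sql("SELECT CAST(amount AS DOUBLE) + 1 FROM events").show()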
Thanks for the PR btw :)
On Fri, Mar 27, 2015 at 2:31 PM, Eran Medan wrote:
> Hi everyone,
>
> I had
Got it, thanks. Making sure everything is idempotent is definitely a
critical piece for peace of mind.
On Sat, Mar 28, 2015 at 1:47 PM, Aaron Davidson wrote:
> Note that speculation is off by default to avoid these kinds of unexpected
> issues.
>
> On Sat, Mar 28, 2015 at 6:21 AM, Steve Loughran
Note that speculation is off by default to avoid these kinds of unexpected
issues.
On Sat, Mar 28, 2015 at 6:21 AM, Steve Loughran
wrote:
>
> It's worth adding that there's no guarantee that re-evaluated work would
> be on the same host as before, and in the case of node failure, it is not
> guaranteed to be elsewhere.
I've also been having trouble running 1.3.0 on HDP. The
spark.yarn.am.extraJavaOptions -Dhdp.version=2.2.0.0-2041
configuration directive seems to work with pyspark, but does not propagate
when using spark-shell. (That is, everything works fine with pyspark,
and spark-shell fails with the "bad substitution" error.)
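For reference, the commonly suggested form of this setting is to put it in
conf/spark-defaults.conf for both the driver and the YARN AM, so spark-shell
picks it up as well (the hdp.version value below is just the one quoted above;
use your cluster's):

  spark.driver.extraJavaOptions   -Dhdp.version=2.2.0.0-2041
  spark.yarn.am.extraJavaOptions  -Dhdp.version=2.2.0.0-2041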
Looking at SparkSubmit#addJarToClasspath():

  uri.getScheme match {
    case "file" | "local" =>
      ...
    case _ =>
      printWarning(s"Skip remote jar $uri.")
  }

It seems the hdfs scheme is not recognized.
FYI
On Thu, Feb 26, 2015 at 6:09 PM, dilm wrote:
> I'm trying to run a spark applicat
Hi, did you resolve this issue or just work around it by keeping your
application jar local? I am running into the same issue with 1.3.
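A sketch of that local-jar workaround (paths and class name are illustrative):

  # pull the application jar out of HDFS, then submit the local copy
  hdfs dfs -get hdfs:///apps/my-app.jar /tmp/my-app.jar
  ./bin/spark-submit --master yarn-cluster --class com.example.MyApp /tmp/my-app.jar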
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-submit-not-working-when-application-jar-is-in-hdfs-tp21840p22272.html
Se
Hi Ankur
If your hardware is OK, it looks like a config problem. Can you show me
your spark-env.sh or JVM config?
Thanks
Wisely Chen
2015-03-28 15:39 GMT+08:00 Ankur Srivastava :
> Hi Wisely,
> I have 26gb for driver and the master is running on m3.2xlarge machines.
>
> I see OOM err
Hi,
I’ve been trying to use Spark Streaming for my real-time analysis
application using the Kafka Stream API on a cluster (running on YARN) of 6
executors with 4 dedicated cores and 8192 MB of dedicated RAM.
The thing is, my application should run 24/7, but the disk usage is
leaking. This le
See
https://databricks.com/blog/2015/03/24/spark-sql-graduates-from-alpha-in-spark-1-3.html
I haven't tried the SQL statements in the above blog myself.
Cheers
On Sat, Mar 28, 2015 at 5:39 AM, Vincent He
wrote:
> Thanks for your information. I have read it; I can run the samples with Scala
> or Python
Hi All,
I'm facing performance issues with my Spark implementation, and while briefly
investigating the Web UI logs, I noticed that my RDD size is 55 GB, the
Shuffle Write is 10 GB, and the Input Size is 200 GB. The application is a web
application that does predictive analytics, so we keep most of our data in
memory.
Thanks for the follow-up, Dale.
bq. hdp 2.3.1
Minor correction: should be hdp 2.1.3
Cheers
On Sat, Mar 28, 2015 at 2:28 AM, Johnson, Dale wrote:
> Actually I did figure this out eventually.
>
> I’m running on a Hortonworks cluster hdp 2.3.1 (hadoop 2.4.1). Spark
> bundles the org/apache/ha
It's worth adding that there's no guarantee that re-evaluated work would be on
the same host as before, and in the case of node failure, it is not guaranteed
to be elsewhere.
This means things that depend on host-local information are going to generate
different numbers even if there are no ot
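A tiny illustration of that point (the RDD name is made up): any task whose
output depends on the host it runs on can change when speculation or a retry
re-executes it on a different node.

  import java.net.InetAddress

  // The recorded hostname differs if this task is re-run elsewhere,
  // so the result is not reproducible across retries.
  val tagged = records.map { r =>
    (InetAddress.getLocalHost.getHostName, r)
  }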
Hi all,
I am working with Spark 1.0.0, mainly for GraphX, and wish to apply some
custom partitioning strategies on the edge list of the graph.
I have generated an edge list file which has the partition number after the
source and destination id in each line. Initially I am loading the
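A rough sketch of the sort of thing described above (identifiers, file format
and partition count are illustrative, not the actual code):

  import org.apache.spark.Partitioner
  import org.apache.spark.SparkContext._          // pair RDD functions
  import org.apache.spark.graphx.{Edge, Graph}

  // Routes each record to the partition id carried in its key
  // (assumes partition numbers are in [0, parts))
  class ExplicitPartitioner(parts: Int) extends Partitioner {
    override def numPartitions: Int = parts
    override def getPartition(key: Any): Int = key.asInstanceOf[Int]
  }

  // Assumed line format: "<srcId> <dstId> <partition>"
  val keyedEdges = sc.textFile("edges.txt").map { line =>
    val Array(src, dst, part) = line.trim.split("\\s+")
    (part.toInt, Edge(src.toLong, dst.toLong, 1))
  }

  val edges = keyedEdges.partitionBy(new ExplicitPartitioner(16)).values
  val graph = Graph.fromEdges(edges, defaultValue = 1)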
Thanks for your information. I have read it; I can run the samples with Scala
or Python, but with the spark-sql shell I cannot get an example running
successfully. Can you give me an example I can run with "./bin/spark-sql"
without writing any code? Thanks
On Sat, Mar 28, 2015 at 7:35 AM, Ted Yu wrote:
Please take a look at
https://spark.apache.org/docs/latest/sql-programming-guide.html
Cheers
> On Mar 28, 2015, at 5:08 AM, Vincent He wrote:
>
>
> I am learning Spark SQL and trying the spark-sql example. I am running the following code,
> but I got the exception "ERROR CliDriver: org.apache.spark.sql.Ana
I am learning Spark SQL and trying the spark-sql example. I am running the
following code, but I got the exception "ERROR CliDriver:
org.apache.spark.sql.AnalysisException: cannot recognize input near
'CREATE' 'TEMPORARY' 'TABLE' in ddl statement; line 1 pos 17". I have two
questions:
1. Do we have a list of the st
The input file is of the format: userid, movieid, rating
From this, I want to extract all possible combinations of movies and the
difference between the ratings for each user:
(movie1, movie2),(rating(movie1)-rating(movie2))
This should be done for each user in the dataset. Finally, I
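A rough sketch of the kind of computation described (the input path and field
separator are assumptions):

  import org.apache.spark.SparkContext._   // pair RDD functions

  // Assumed input: "userid,movieid,rating" per line
  val ratings = sc.textFile("ratings.csv").map { line =>
    val Array(user, movie, rating) = line.split(",")
    (user, (movie, rating.toDouble))
  }

  // For every user, emit each unordered movie pair with the rating difference
  val pairDiffs = ratings.groupByKey().flatMap { case (user, seen) =>
    seen.toSeq.combinations(2).map {
      case Seq((m1, r1), (m2, r2)) => (user, (m1, m2), r1 - r2)
    }
  }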
My vector dimension is 360 or so. The data count is about 270k. My
driver has 2.9 GB of memory. I attached a screenshot of the current executor
status. I submitted this job with "--master yarn-cluster". I have a total of 7
worker nodes; one of them acts as the driver. In the screenshot, you can see
all w
Actually I did figure this out eventually.
I’m running on a Hortonworks cluster hdp 2.3.1 (hadoop 2.4.1). Spark bundles
the org/apache/hadoop/hdfs/… classes along with the spark-assembly jar. This
turns out to introduce a small incompatibility with hdp 2.3.1. I carved these
classes out of th
How many dimensions does your data have? The size of the k-means model is k
* d, where d is the dimension of the data.
Since you're using k=1000, if your data has dimension higher than, say,
10,000, you will have trouble, because k*d doubles have to fit in the
driver (k = 1000 with d = 10,000 is already 10 million doubles, roughly 80 MB).
Reza
On Sat, Mar 28, 2015 at
This is from my Hive installation
-sh-4.1$ ls /apache/hive/lib | grep derby
derby-10.10.1.1.jar
derbyclient-10.10.1.1.jar
derbynet-10.10.1.1.jar
-sh-4.1$ ls /apache/hive/lib | grep datanucleus
datanucleus-api-jdo-3.2.6.jar
datanucleus-core-3.2.10.jar
datanucleus-rdbms-3.2.9.jar
-sh-4.1
Hi Wisely,
I have 26 GB for the driver and the master is running on m3.2xlarge machines.
I see OOM errors on the workers even though they are running with 26 GB of memory.
Thanks
On Fri, Mar 27, 2015, 11:43 PM Wisely Chen wrote:
> Hi
>
> In broadcast, spark will collect the whole 3gb object into master nod
I tried with a different version of the driver, but I get the same error:
./bin/spark-submit -v --master yarn-cluster --driver-class-path
/apache/hadoop/share/hadoop/common/hadoop-common-2.4.1-EBAY-2.jar:/apache/hadoop/lib/hadoop-lzo-0.6.0.jar:/apache/hadoop-2.4.1-2.1.3.0-2-EBAY/share/hadoop/yarn/lib/guava-11.0.2
I have put more detail of my problem at
http://stackoverflow.com/questions/29295420/spark-kmeans-computation-cannot-be-distributed
I would really appreciate it if you could help me take a look at this problem. I
have tried various settings and ways to load/partition my data, but I just
cannot get rid of tha
This is what I am seeing:
./bin/spark-submit -v --master yarn-cluster --driver-class-path
/apache/hadoop/share/hadoop/common/hadoop-common-2.4.1-EBAY-2.jar:/apache/hadoop/lib/hadoop-lzo-0.6.0.jar:/apache/hadoop-2.4.1-2.1.3.0-2-EBAY/share/hadoop/yarn/lib/guava-11.0.2.jar
--jars
/home/dvasthimal/spar
Yes, I am using yarn-cluster and I did add it via --files. I get a "Suitable
driver not found" error.
Please share the spark-submit command that shows the mysql jar containing the
driver class used to connect to the Hive MySQL metastore.
Even after including it through
--driver-class-path
/home/dvasthimal/spark1
Could someone please share the spark-submit command that shows their mysql
jar containing the driver class used to connect to the Hive MySQL metastore.
Even after including it through
--driver-class-path
/home/dvasthimal/spark1.3/mysql-connector-java-5.1.34.jar
OR (AND)
--jars /home/dvasthimal/spark1.