Hi David,
Can you also try with Spark 1.3 if possible? I believe there was a 2x
improvement on K-Means between 1.2 and 1.3.
Thanks,
Burak
On Sat, Mar 28, 2015 at 9:04 PM, davidshen84 wrote:
> Hi Jao,
>
> Sorry to pop up this old thread. I am having the same problem you did. I
> want to know if you have figured out how to improve k-means on Spark.
Hi Jao,
Sorry to pop up this old thread. I am having the same problem you did. I
want to know if you have figured out how to improve k-means on Spark.
I am using Spark 1.2.0. My data set is about 270k vectors, each with about
350 dimensions. If I set k=500, the job takes about 3 hours on my cluster.
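For reference, this is roughly how such a job is set up with MLlib (the input
path and parsing here are illustrative, not the exact code):

  import org.apache.spark.mllib.clustering.KMeans
  import org.apache.spark.mllib.linalg.Vectors

  // ~270k lines of ~350 comma-separated features (hypothetical path/format)
  val data = sc.textFile("hdfs:///data/vectors.csv")
    .map(line => Vectors.dense(line.split(',').map(_.toDouble)))
    .cache()  // k-means iterates over the data, so caching the vectors matters

  val model = KMeans.train(data, k = 500, maxIterations = 20)
  println(s"cost = ${model.computeCost(data)}")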
This is something we are hoping to support in Spark 1.4. We'll post more
information to JIRA when there is a design.
On Thu, Mar 26, 2015 at 11:22 PM, Jianshi Huang
wrote:
> Hi,
>
> Does anyone have a similar request?
>
> https://issues.apache.org/jira/browse/SPARK-6561
>
> When we save a DataFrame int
In this case I'd probably just store it as a String. Our casting rules
(which come from Hive) are such that when you use a string as a number or
boolean it will be cast to the desired type.
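A minimal sketch of what that looks like in practice (the table and column
names here are made up, Spark 1.3 style):

  // "amount" is stored as a string, but using it in a numeric context
  // casts it for you; an explicit CAST is shown as well for clarity
  sqlContext.sql("SELECT * FROM events WHERE amount > 100").show()
  sqlContext.sql("SELECT CAST(amount AS DOUBLE) + 1 FROM events").show()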
Thanks for the PR btw :)
On Fri, Mar 27, 2015 at 2:31 PM, Eran Medan wrote:
> Hi everyone,
>
> I had
Got it, thanks. Making sure everything is idempotent is definitely a
critical piece for peace of mind.
On Sat, Mar 28, 2015 at 1:47 PM, Aaron Davidson wrote:
> Note that speculation is off by default to avoid these kinds of unexpected
> issues.
>
> On Sat, Mar 28, 2015 at 6:21 AM, Steve Loughran
Note that speculation is off by default to avoid these kinds of unexpected
issues.
On Sat, Mar 28, 2015 at 6:21 AM, Steve Loughran
wrote:
>
> It's worth adding that there's no guarantee that re-evaluated work would
> be on the same host as before, and in the case of node failure, it is not
> guaranteed to be elsewhere.
I've also been having trouble running 1.3.0 on HDP. The
spark.yarn.am.extraJavaOptions -Dhdp.version=2.2.0.0-2041
configuration directive seems to work with pyspark, but does not propagate
when using spark-shell. (That is, everything works fine with pyspark,
and spark-shell fails with the "bad substitution" error.)
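For reference, the commonly suggested form of this setting is to put it in
conf/spark-defaults.conf for both the driver and the YARN AM, so spark-shell
picks it up as well (the hdp.version value below is just the one quoted above;
use your cluster's):

  spark.driver.extraJavaOptions   -Dhdp.version=2.2.0.0-2041
  spark.yarn.am.extraJavaOptions  -Dhdp.version=2.2.0.0-2041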
Looking at SparkSubmit#addJarToClasspath():

  uri.getScheme match {
    case "file" | "local" =>
      ...
    case _ =>
      printWarning(s"Skip remote jar $uri.")
  }

It seems the hdfs scheme is not recognized.
FYI
On Thu, Feb 26, 2015 at 6:09 PM, dilm wrote:
> I'm trying to run a spark applicat
Hi, did you resolve this issue or just work around it by keeping your
application jar local? I am running into the same issue with 1.3.
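A sketch of that local-jar workaround (paths and class name are illustrative):

  # pull the application jar out of HDFS, then submit the local copy
  hdfs dfs -get hdfs:///apps/my-app.jar /tmp/my-app.jar
  ./bin/spark-submit --master yarn-cluster --class com.example.MyApp /tmp/my-app.jar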
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-submit-not-working-when-application-jar-is-in-hdfs-tp21840p22272.html
Se
Hi Ankur
If your hardware is OK, it looks like a config problem. Can you show me
your spark-env.sh or JVM config?
Thanks
Wisely Chen
2015-03-28 15:39 GMT+08:00 Ankur Srivastava :
> Hi Wisely,
> I have 26gb for driver and the master is running on m3.2xlarge machines.
>
> I see OOM err
Hi,
I’ve been trying to use Spark Streaming for my real-time analysis
application using the Kafka Stream API on a cluster (running on YARN) of 6
executors with 4 dedicated cores and 8192 MB of dedicated RAM.
The thing is, my application should run 24/7, but the disk usage is
leaking. This le
See
https://databricks.com/blog/2015/03/24/spark-sql-graduates-from-alpha-in-spark-1-3.html
I haven't tried the SQL statements in the above blog myself.
Cheers
On Sat, Mar 28, 2015 at 5:39 AM, Vincent He
wrote:
> Thanks for your information. I have read it; I can run the samples with Scala
> or Python
Hi All,
I'm facing performance issues with my Spark implementation, and while briefly
investigating the Web UI logs, I noticed that my RDD size is 55 GB, the
Shuffle Write is 10 GB, and the Input Size is 200 GB. The application is a web
application that does predictive analytics, so we keep most of our data in
memory.
Thanks for the follow-up, Dale.
bq. hdp 2.3.1
Minor correction: should be hdp 2.1.3
Cheers
On Sat, Mar 28, 2015 at 2:28 AM, Johnson, Dale wrote:
> Actually I did figure this out eventually.
>
> I’m running on a Hortonworks cluster hdp 2.3.1 (hadoop 2.4.1). Spark
> bundles the org/apache/ha
It's worth adding that there's no guarantee that re-evaluated work would be on
the same host as before, and in the case of node failure, it is not guaranteed
to be elsewhere.
This means things that depend on host-local information are going to generate
different numbers even if there are no ot
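A tiny illustration of that point (the RDD name is made up): any task whose
output depends on the host it runs on can change when speculation or a retry
re-executes it on a different node.

  import java.net.InetAddress

  // The recorded hostname differs if this task is re-run elsewhere,
  // so the result is not reproducible across retries.
  val tagged = records.map { r =>
    (InetAddress.getLocalHost.getHostName, r)
  }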
Hi all,
I am working with Spark 1.0.0, mainly for GraphX, and wish to apply some
custom partitioning strategies on the edge list of the graph.
I have generated an edge list file which has the partition number after the
source and destination id in each line. Initially I am loading the
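A rough sketch of the sort of thing described above (identifiers, file format
and partition count are illustrative, not the actual code):

  import org.apache.spark.Partitioner
  import org.apache.spark.SparkContext._          // pair RDD functions
  import org.apache.spark.graphx.{Edge, Graph}

  // Routes each record to the partition id carried in its key
  // (assumes partition numbers are in [0, parts))
  class ExplicitPartitioner(parts: Int) extends Partitioner {
    override def numPartitions: Int = parts
    override def getPartition(key: Any): Int = key.asInstanceOf[Int]
  }

  // Assumed line format: "<srcId> <dstId> <partition>"
  val keyedEdges = sc.textFile("edges.txt").map { line =>
    val Array(src, dst, part) = line.trim.split("\\s+")
    (part.toInt, Edge(src.toLong, dst.toLong, 1))
  }

  val edges = keyedEdges.partitionBy(new ExplicitPartitioner(16)).values
  val graph = Graph.fromEdges(edges, defaultValue = 1)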
Thanks for your information. I have read it; I can run the samples with Scala
or Python, but with the spark-sql shell I cannot get an example running
successfully. Can you give me an example I can run with "./bin/spark-sql"
without writing any code? Thanks
On Sat, Mar 28, 2015 at 7:35 AM, Ted Yu wrote:
Please take a look at
https://spark.apache.org/docs/latest/sql-programming-guide.html
Cheers
> On Mar 28, 2015, at 5:08 AM, Vincent He wrote:
>
>
> I am learning Spark SQL and trying the spark-sql example. I am running the following code,
> but I got the exception "ERROR CliDriver: org.apache.spark.sql.Ana
I am learning Spark SQL and trying the spark-sql example. I am running the
following code, but I got the exception "ERROR CliDriver:
org.apache.spark.sql.AnalysisException: cannot recognize input near
'CREATE' 'TEMPORARY' 'TABLE' in ddl statement; line 1 pos 17". I have two
questions:
1. Do we have a list of the st
The input file is of the format: userid, movieid, rating
From this, I want to extract all possible combinations of movies and the
difference between the ratings for each user:
(movie1, movie2),(rating(movie1)-rating(movie2))
This should be done for each user in the dataset. Finally, I
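A rough sketch of the kind of computation described (the input path and field
separator are assumptions):

  import org.apache.spark.SparkContext._   // pair RDD functions

  // Assumed input: "userid,movieid,rating" per line
  val ratings = sc.textFile("ratings.csv").map { line =>
    val Array(user, movie, rating) = line.split(",")
    (user, (movie, rating.toDouble))
  }

  // For every user, emit each unordered movie pair with the rating difference
  val pairDiffs = ratings.groupByKey().flatMap { case (user, seen) =>
    seen.toSeq.combinations(2).map {
      case Seq((m1, r1), (m2, r2)) => (user, (m1, m2), r1 - r2)
    }
  }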
My vector dimension is 360 or so. The data count is about 270k. My
driver has 2.9 GB of memory. I attached a screenshot of the current executor
status. I submitted this job with "--master yarn-cluster". I have a total of 7
worker nodes; one of them acts as the driver. In the screenshot, you can see
all w
Actually I did figure this out eventually.
I’m running on a Hortonworks cluster hdp 2.3.1 (hadoop 2.4.1). Spark bundles
the org/apache/hadoop/hdfs/… classes along with the spark-assembly jar. This
turns out to introduce a small incompatibility with hdp 2.3.1. I carved these
classes out of th
How many dimensions does your data have? The size of the k-means model is k
* d, where d is the dimension of the data.
Since you're using k=1000, if your data has dimension higher than, say,
10,000, you will have trouble, because k*d doubles have to fit in the
driver (k = 1000 with d = 10,000 is already 10 million doubles, roughly 80 MB).
Reza
On Sat, Mar 28, 2015 at
This is from my Hive installation
-sh-4.1$ ls /apache/hive/lib | grep derby
derby-10.10.1.1.jar
derbyclient-10.10.1.1.jar
derbynet-10.10.1.1.jar
-sh-4.1$ ls /apache/hive/lib | grep datanucleus
datanucleus-api-jdo-3.2.6.jar
datanucleus-core-3.2.10.jar
datanucleus-rdbms-3.2.9.jar
-sh-4.1
Hi Wisely,
I have 26 GB for the driver and the master is running on m3.2xlarge machines.
I see OOM errors on the workers even though they are running with 26 GB of memory.
Thanks
On Fri, Mar 27, 2015, 11:43 PM Wisely Chen wrote:
> Hi
>
> In broadcast, spark will collect the whole 3gb object into master nod
I tried with a different version of the driver, but I get the same error:
./bin/spark-submit -v --master yarn-cluster --driver-class-path
/apache/hadoop/share/hadoop/common/hadoop-common-2.4.1-EBAY-2.jar:/apache/hadoop/lib/hadoop-lzo-0.6.0.jar:/apache/hadoop-2.4.1-2.1.3.0-2-EBAY/share/hadoop/yarn/lib/guava-11.0.2
I have put more detail of my problem at
http://stackoverflow.com/questions/29295420/spark-kmeans-computation-cannot-be-distributed
I would really appreciate it if you could help me take a look at this problem. I
have tried various settings and ways to load/partition my data, but I just
cannot get rid of tha
This is what I am seeing:
./bin/spark-submit -v --master yarn-cluster --driver-class-path
/apache/hadoop/share/hadoop/common/hadoop-common-2.4.1-EBAY-2.jar:/apache/hadoop/lib/hadoop-lzo-0.6.0.jar:/apache/hadoop-2.4.1-2.1.3.0-2-EBAY/share/hadoop/yarn/lib/guava-11.0.2.jar
--jars
/home/dvasthimal/spar
Yes, I am using yarn-cluster and I did add it via --files. I get a "Suitable
driver not found" error.
Please share the spark-submit command that shows the mysql jar containing the
driver class used to connect to the Hive MySQL metastore.
Even after including it through
--driver-class-path
/home/dvasthimal/spark1
Could someone please share the spark-submit command that shows their mysql
jar containing the driver class used to connect to the Hive MySQL metastore.
Even after including it through
--driver-class-path
/home/dvasthimal/spark1.3/mysql-connector-java-5.1.34.jar
OR (AND)
--jars /home/dvasthimal/spark1.