Re: Hbase

2014-07-31 Thread Akhil Das
You can use a map function like the following and do whatever you want with the Result. Function<Tuple2<ImmutableBytesWritable, Result>, Iterator<String>>() { > public Iterator<String> call(Tuple2<ImmutableBytesWritable, Result> test) { > Result tmp = (Result) test._2; > List<KeyValue> kvl = tmp.getColumn("post".getBytes(), > "title".getBytes()); > for(KeyValue
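A rough Scala rendering of the (HTML-mangled) Java snippet above, for readers following along; it assumes the hBaseRDD of (ImmutableBytesWritable, Result) pairs built in the original "Hbase" post further down, and 0.94-era HBase APIs (Result.getColumn returning KeyValue cells):

```scala
import scala.collection.JavaConverters._
import org.apache.hadoop.hbase.KeyValue
import org.apache.hadoop.hbase.client.Result
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.util.Bytes
import org.apache.spark.rdd.RDD

// Pull every value stored under family "post", qualifier "title" out of each Result.
def postTitles(hBaseRDD: RDD[(ImmutableBytesWritable, Result)]): RDD[String] =
  hBaseRDD.flatMap { case (_, result) =>
    val kvl: java.util.List[KeyValue] =
      result.getColumn(Bytes.toBytes("post"), Bytes.toBytes("title"))
    kvl.asScala.map(kv => Bytes.toString(kv.getValue))
  }
```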

Re: Ports required for running spark

2014-07-31 Thread Andrew Ash
Also Konstantin do you have a firewall between your Spark services? If that's what's causing these issues, then you may be interested in the ability to configure every port a Spark service listens on -- https://issues.apache.org/jira/browse/SPARK-2157 On Thu, Jul 31, 2014 at 8:47 AM, Haiyang Fu
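For context, a couple of listener ports were already settable before SPARK-2157 landed; a minimal sketch, with arbitrary placeholder port numbers:

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.driver.port", "51000") // port the driver binds to for driver/executor traffic
  .set("spark.ui.port", "4040")      // web UI port
```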

graphx and subgraph query

2014-07-31 Thread dizzy5112
Hi, I have a small problem using GraphX. I have a graph whose triplets are represented as: ((101,User({101=0},0,3)),(104,User({101=1},1,0)),1) ((101,User({101=0},0,3)),(105,User({101=1},1,0)),2) ((102,User({102=0},0,3)),(106,User({102=1},1,0)),3) ((102,User({102=0},0,3)),(107,User({102=1},1,1)),4) (

Re:Re: Re:Re: [GraphX] The best way to construct a graph

2014-07-31 Thread Bin
Thanks for the advice. But since I am not the administrator of our spark cluster, I can't do this. Is there any better solution based on the current spark? At 2014-08-01 02:38:15, "shijiaxin" wrote: >Have you tried to write another similar function like edgeListFile in the >same file, and then

Re: Re:Re: [GraphX] The best way to construct a graph

2014-07-31 Thread shijiaxin
Have you tried to write another similar function like edgeListFile in the same file, and then compile the project again? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/GraphX-The-best-way-to-construct-a-graph-tp11122p11138.html Sent from the Apache Spark Us

Re: SQLCtx cacheTable

2014-07-31 Thread Gurvinder Singh
Thanks Michael for the explanation. Actually I tried caching the RDD and making a table on it, but the performance of cacheTable was 3X better than caching the RDD. Now I know why it is better. But is it possible to add support for a persistence level in cacheTable itself, like RDD? Maybe it is not rel

Re: Issue using kryo serialization

2014-07-31 Thread gpatcham
No, it doesn't implement Serializable. It's a third-party class. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Issue-using-kryo-serilization-tp11129p11136.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: Issue using kryo serialization

2014-07-31 Thread ratabora
Does the class you're serializing implement Serializable? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Issue-using-kryo-serilization-tp11129p11134.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: Issue using kryo serialization

2014-07-31 Thread gpatcham
Yes, I did enable that: conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer") conf.set("spark.kryo.registrator", "com.bigdata.MyRegistrator") -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Issue-using-kryo-serilization-tp

Re: Issue using kryo serialization

2014-07-31 Thread Andrew Ash
Did you enable Kryo and have it use your registrator using spark.serializer=org.apache.spark.serializer.KryoSerializer and spark.kryo.registrator=mypackage.MyRegistrator ? It looks like the serializer being used is the default Java one http://spark.apache.org/docs/latest/tuning.html#data-serializ
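A minimal sketch of the configuration Andrew describes. DeviceApi and the com.bigdata package come from this thread; here DeviceApi is declared as a stand-in so the snippet compiles, and the registrator string must match the fully-qualified name of wherever the real registrator class lives.

```scala
import com.esotericsoftware.kryo.Kryo
import org.apache.spark.SparkConf
import org.apache.spark.serializer.KryoRegistrator

class DeviceApi // stand-in for the third-party, non-Serializable class from the thread

class MyRegistrator extends KryoRegistrator {
  override def registerClasses(kryo: Kryo) {
    kryo.register(classOf[DeviceApi]) // register the class Kryo should handle
  }
}

val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryo.registrator", "com.bigdata.MyRegistrator") // fully-qualified registrator name
```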

Re: java.lang.OutOfMemoryError: Java heap space

2014-07-31 Thread Haiyang Fu
http://spark.apache.org/docs/latest/tuning.html#level-of-parallelism On Fri, Aug 1, 2014 at 1:29 PM, Haiyang Fu wrote: > Hi, > here are two tips for you, > 1. increase the parallelism level > 2. increase the driver memory > > > On Fri, Aug 1, 2014 at 12:58 AM, Sameer Tilak wrote: > >> Hi everyon

Re: java.lang.OutOfMemoryError: Java heap space

2014-07-31 Thread Haiyang Fu
Hi, here are two tips for you: 1. increase the parallelism level; 2. increase the driver memory. On Fri, Aug 1, 2014 at 12:58 AM, Sameer Tilak wrote: > Hi everyone, > I have the following configuration. I am currently running my app in local > mode. > > val conf = new > SparkConf().setMaster("loca
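A hedged illustration of the two tips; the values are placeholders, and the app name mirrors the poster's local-mode setup:

```scala
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setMaster("local[2]")
  .setAppName("ApproxStrMatch")
  .set("spark.default.parallelism", "48") // tip 1: a higher parallelism level
val sc = new SparkContext(conf)

// tip 2: driver memory must be set before the JVM starts, e.g.
//   spark-submit --driver-memory 4g ...
// (in plain local mode, a larger -Xmx on the launching JVM has the same effect)
```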

Issue using kryo serialization

2014-07-31 Thread gpatcham
I'm new to Spark programming and here I'm trying to use a third-party class in a map with the Kryo serializer: val deviceApi = new DeviceApi() deviceApi.loadDataFromStream(this.getClass.getClassLoader.getResourceAsStream("20140730.json")) val properties = uaRDD1.map(line => deviceApi.getProperties(lin

Re:Re: [GraphX] The best way to construct a graph

2014-07-31 Thread Bin
It seems that I cannot specify the weights. I have also tried to imitate GraphLoader.edgeListFile, but I can't call the methods and classes used in GraphLoader.edgeListFile. Have you successfully done this? At 2014-08-01 12:47:08, "shijiaxin" wrote: >I think you can try GraphLoader.edgeListFil

Re: [GraphX] The best way to construct a graph

2014-07-31 Thread shijiaxin
I think you can try GraphLoader.edgeListFile, and then use join to associate the attributes with each vertex -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/GraphX-The-best-way-to-construct-a-graph-tp11122p11127.html Sent from the Apache Spark User List mail
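A sketch of that suggestion, assuming the spark-shell (so sc exists), a whitespace-separated edge list, and a separate vertexId-attribute file; the paths and field layout are made up.

```scala
import org.apache.spark.graphx.{Graph, GraphLoader}

// structure only: GraphLoader.edgeListFile yields a Graph[Int, Int]
val graph = GraphLoader.edgeListFile(sc, "hdfs:///data/edges.txt")

// (vertexId, attribute) pairs loaded separately
val attrs = sc.textFile("hdfs:///data/users.txt").map { line =>
  val fields = line.split("\\s+")
  (fields(0).toLong, fields(1))
}

// join the attributes onto the structurally loaded graph
val enriched: Graph[String, Int] = graph.outerJoinVertices(attrs) {
  (_, _, attr) => attr.getOrElse("unknown")
}
```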

Re: configuration needed to run twitter(25GB) dataset

2014-07-31 Thread shijiaxin
Is it possible to reduce the number of edge partitions and exploit parallelism fully at the same time? For example, one partition per node, and the threads in the same node share the same partition. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/configurati

Re: Re: java.util.concurrent.TimeoutException: Futures timed out after [30 seconds]

2014-07-31 Thread Haiyang Fu
Glad to help you. On Fri, Aug 1, 2014 at 11:28 AM, Bin wrote: > Hi Haiyang, > > Thanks, that really was the reason. > > Best, > Bin > > > On 2014-07-31 08:05:34, "Haiyang Fu" wrote: > > Have you tried to increase the driver memory? > > > On Thu, Jul 31, 2014 at 3:54 PM, Bin wrote: > >> Hi All, >> >> T

Re:Re: java.util.concurrent.TimeoutException: Futures timed out after [30 seconds]

2014-07-31 Thread Bin
Hi Haiyang, Thanks, that really was the reason. Best, Bin On 2014-07-31 08:05:34, "Haiyang Fu" wrote: Have you tried to increase the driver memory? On Thu, Jul 31, 2014 at 3:54 PM, Bin wrote: Hi All, The data size of my task is about 30mb. It runs smoothly in local mode. However, when I su

[GraphX] The best way to construct a graph

2014-07-31 Thread Bin
Hi All, I am wondering what is the best way to construct a graph? Say I have some attributes for each user, and specific weight for each user pair. The way I am currently doing is first read user information and edge triple into two arrays, then use sc.parallelize to create vertexRDD and edg

Re: HiveContext is creating metastore warehouse locally instead of in hdfs

2014-07-31 Thread Andrew Lee
Could you enable HistoryServer and provide the properties and CLASSPATH for the spark-shell? And 'env' command to list your environment variables? By the way, what does the spark logs says? Enable debug mode to see what's going on in spark-shell when it tries to interact and init HiveContext.

Re: SQLCtx cacheTable

2014-07-31 Thread Michael Armbrust
cacheTable uses a special columnar caching technique that is optimized for SchemaRDDs. It something similar to MEMORY_ONLY_SER but not quite. You can specify the persistence level on the SchemaRDD itself and register that as a temporary table, however it is likely you will not get as good perform
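The two options contrasted above, roughly, for the 1.0-era API; this assumes a SQLContext named sqlContext and a table "events" already registered, both placeholders:

```scala
import org.apache.spark.storage.StorageLevel

// (1) columnar in-memory caching; no persistence level to choose
sqlContext.cacheTable("events")

// (2) pick a persistence level on the SchemaRDD and register that instead;
//     per the reply above, this usually performs worse than cacheTable
val schemaRdd = sqlContext.sql("SELECT * FROM events")
schemaRdd.persist(StorageLevel.MEMORY_AND_DISK_SER)
schemaRdd.registerAsTable("events_persisted")
```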

Re: Issue with Spark on EC2 using spark-ec2 script

2014-07-31 Thread ratabora
Hey Dean! Thanks! Did you try running this on a local environment or one generated by the spark-ec2 script? The environment I am running on is a 4 data node 1 master spark cluster generated by the spark-ec2 script. I haven't modified anything in the environment except for adding data to the ephem

Re: sbt package failed: wrong libraryDependencies for spark-streaming?

2014-07-31 Thread durin
Hi Tathagata, I was using the "raw" tag in the web-editor. Seems like this doesn't make it into the mail. Here's the message again, this time without those tags: I've added the following to my spark-env.sh: SPARK_CLASSPATH="/disk.b/spark-master-2014-07-28/external/twitter/target/spark-streamin

Re: Fwd: pyspark crash on mesos

2014-07-31 Thread daijia
I met the same problem. Do you have some solution? Thanks Daijia -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Fwd-pyspark-crash-on-mesos-tp2256p5.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: Extracting an element from the feature vector in LabeledPoint

2014-07-31 Thread Yanbo Liang
Which version are you using? data.features(1) is OK for Spark 1.0. 2014-08-01 10:01 GMT+08:00 SK : > > Hi, > > I want to extract the individual elements of a feature vector that is part > of a LabeledPoint. I tried the following: > > data.features._1 > data.features(1) > data.features.map(_.1) > >

Re: sbt package failed: wrong libraryDependencies for spark-streaming?

2014-07-31 Thread Tathagata Das
Hey Simon, The stuff you are trying to show - logs, contents of spark-env.sh, etc. are missing from the email. At least I am not able to see it (viewing through gmail). Are you pasting screenshots? Those might get blocked out somehow! TD On Thu, Jul 31, 2014 at 6:55 PM, durin wrote: > I've adde

spark-submit registers the driver twice

2014-07-31 Thread salemi
Hi All, I am using the spark-submit command to submit my jar to a standalone cluster with two executors. When I use spark-submit, it deploys the application twice and I see two application entries in the master UI. The master logs, as shown below, also indicate that submit tries to deploy the app

Re: HiveContext is creating metastore warehouse locally instead of in hdfs

2014-07-31 Thread chenjie
Hi, Yin and Andrew, thank you for your reply. When I create table in hive cli, it works correctly and the table will be found in hdfs. I forgot start hiveserver2 before and I started it today. Then I run the command below: spark-shell --master spark://192.168.40.164:7077 --driver-class-path co

Extracting an element from the feature vector in LabeledPoint

2014-07-31 Thread SK
Hi, I want to extract the individual elements of a feature vector that is part of a LabeledPoint. I tried the following: data.features._1 data.features(1) data.features.map(_.1) data is a LabeledPoint with a feature vector containing 3 features. All of the above resulted in compilation errors

Re: spark.shuffle.consolidateFiles seems not working

2014-07-31 Thread Aaron Davidson
Make sure to set it before you start your SparkContext -- it cannot be changed afterwards. Be warned that there are some known issues with shuffle file consolidation, which should be fixed in 1.1. On Thu, Jul 31, 2014 at 12:40 PM, Jianshi Huang wrote: > I got the number from the Hadoop admin. I
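Illustrating "set it before you start your SparkContext"; the app name is a placeholder:

```scala
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("shuffle-consolidation-example")
  .set("spark.shuffle.consolidateFiles", "true") // must be in the conf up front
val sc = new SparkContext(conf)                  // changing the setting after this point has no effect
```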

Re: sbt package failed: wrong libraryDependencies for spark-streaming?

2014-07-31 Thread durin
I've added the following to my spark-env.sh: I can now execute without an error in the shell. However, I will get an error when doing this: What am I missing? Do I have to import another jar? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/sbt-package

Re: Accessing spark context from executors?

2014-07-31 Thread Sung Hwan Chung
Nevermind. Just creating an empty hadoop configuration from executors did the trick. On Thu, Jul 31, 2014 at 6:16 PM, Sung Hwan Chung wrote: > Is there any way to get SparkContext object from executor? Or hadoop > configuration, etc. The reason is that I would like to write to HDFS from > execu
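A minimal sketch of that trick: build a fresh Hadoop Configuration inside the task, on the executor, and write through it. The helper name and output location are placeholders.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.rdd.RDD

def writeFromExecutors[T](rdd: RDD[T], dir: String): Unit =
  rdd.foreachPartition { iter =>
    val conf = new Configuration()       // empty Hadoop conf created on the executor
    val path = new Path(s"$dir/part-${java.util.UUID.randomUUID}")
    val fs   = path.getFileSystem(conf)  // resolved from the path's scheme, e.g. hdfs://
    val out  = fs.create(path)
    try iter.foreach(rec => out.writeBytes(rec.toString + "\n"))
    finally out.close()
  }

// e.g. writeFromExecutors(someRdd, "hdfs://namenode:8020/tmp/executor-output")
```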

Re: sbt package failed: wrong libraryDependencies for spark-streaming?

2014-07-31 Thread durin
Hi Tathagata, I didn't mean to say this was an error. According to the other thread I linked, right now there shouldn't be any conflicts, so I wanted to use streaming in the shell for easy testing. I thought I had to create my own project in which I'd add streaming as a dependency, but if I can a

Accessing spark context from executors?

2014-07-31 Thread Sung Hwan Chung
Is there any way to get SparkContext object from executor? Or hadoop configuration, etc. The reason is that I would like to write to HDFS from executors.

Re: sbt package failed: wrong libraryDependencies for spark-streaming?

2014-07-31 Thread Tathagata Das
I don't see the error. The twitter stuff (as well as kafka and flume stuff) are treated as "external" projects and are not included in the spark shell. This is because we don't want the dependencies of such non-core functionalities to cause random conflicts with that of core spark. Hence it's not po

sbt package failed: wrong libraryDependencies for spark-streaming?

2014-07-31 Thread durin
As suggested here , I want to create a minimal project using sbt to be able to use org.apache.spark.streaming.twitter in the shell. My Spark version is the latest Master bran

Re: spark.scheduler.pool seems not working in spark streaming

2014-07-31 Thread Tathagata Das
I filed a JIRA for this task for future reference. https://issues.apache.org/jira/browse/SPARK-2780 On Thu, Jul 31, 2014 at 5:37 PM, Tathagata Das wrote: > Whoa! That worked! I was half afraid it wouldn't, since I hadn't tried it myself. > > TD > > On Wed, Jul 30, 2014 at 8:32 PM, liuwei wrote: >> H

Re: spark.scheduler.pool seems not working in spark streaming

2014-07-31 Thread Tathagata Das
Whoa! That worked! I was half afraid it wouldn't, since I hadn't tried it myself. TD On Wed, Jul 30, 2014 at 8:32 PM, liuwei wrote: > Hi, Tathagata Das: > > I followed your advice and solved this problem, thank you very much! > > > On July 31, 2014, at 3:07 AM, Tathagata Das wrote: > >> This is because set

Re: Example standalone app error!

2014-07-31 Thread Tathagata Das
When are you guys getting the error? When Sparkcontext is created? Or when it is being shutdown? If this error is being thrown when the SparkContext is created, then one possible reason maybe conflicting versions of Akka. Spark depends on a version of Akka which is different from that of Scala, and

Re: Spark job finishes then command shell is blocked/hangs?

2014-07-31 Thread bumble123
Oops, I completely forgot sc.stop(). Thanks! -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-job-finishes-then-command-shell-is-blocked-hangs-tp11095p11098.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: Spark job finishes then command shell is blocked/hangs?

2014-07-31 Thread nit
Which version of Spark are you running? Have you tried sc.stop as the last line of your program? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-job-finishes-then-command-shell-is-blocked-hangs-tp11095p11097.html Sent from the Apache Spark User List mai

Re: Installing Spark 0.9.1 on EMR Cluster

2014-07-31 Thread nit
Have you tried the "--spark-version" flag of spark-ec2? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Installing-Spark-0-9-1-on-EMR-Cluster-tp11084p11096.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

Spark job finishes then command shell is blocked/hangs?

2014-07-31 Thread bumble123
Hi, My spark job finishes with this output: 14/07/31 16:33:25 INFO SparkContext: Job finished: count at RetrieveData.scala:18, took 0.013189 s However, the command line doesn't go back to normal and instead just hangs. This is my first time running a spark job - is this normal? If not, how do I f

Re: Installing Spark 0.9.1 on EMR Cluster

2014-07-31 Thread Rahul Bhojwani
Can you share the link from where you downloaded the tarball? Also, can you explain your process so that I can proceed with that instead of writing the custom script? On Fri, Aug 1, 2014 at 3:26 AM, chaitu reddy wrote: > Hi Rahul, > > I am not sure about bootstrapping while creating but we have downloade

Reading from Amazon S3 behaves inconsistently: returns different numbers of lines...

2014-07-31 Thread nit
*First Question:* On Amazon S3 I have a directory with 1024 files, where each file size is ~9Mb; and each line in a file has two entries separated by '\t'. Here is my program, which is calculating total number of entries in the dataset -- val inputId = sc.textFile(inputhPath, noParts).flat

Re: Issue with Spark on EC2 using spark-ec2 script

2014-07-31 Thread Dean Wampler
Forgot to add that I tried your program with the same input file path. It worked fine. (I used local[2], however...) Dean Wampler, Ph.D. Author: Programming Scala, 2nd Edition (O'Reilly) Typesafe @deanwampler

Re: Issue with Spark on EC2 using spark-ec2 script

2014-07-31 Thread Dean Wampler
The stack trace suggests it was trying to create a temporary file, not read your file. Of course, it doesn't say what file it couldn't create. Could there be a configuration file, like a Hadoop config file, that was read with a temp dir setting that's invalid for your machine? dean Dean Wampler,

makeLinkRDDs in MLlib ALS

2014-07-31 Thread alwayforver
It seems to me that the way makeLinkRDDs works is by taking advantage of the fact that partition IDs happen to coincide with what we get from userPartitioner, since the HashPartitioner in *val grouped = ratingsByUserBlock.partitionBy(new HashPartitioner(numUserBlocks))* is actually preserving the

Issue with Spark on EC2 using spark-ec2 script

2014-07-31 Thread Ryan Tabora
Hey all, I was able to spawn up a cluster, but when I'm trying to submit a simple jar via spark-submit it fails to run. I am trying to run the simple "Standalone Application" from the quickstart. Oddly enough, I could get another application running through the spark-shell. What am I doing wrong

Re: Installing Spark 0.9.1 on EMR Cluster

2014-07-31 Thread chaitu reddy
Hi Rahul, I am not sure about bootstrapping while creating, but we have downloaded the tarball, extracted and configured it accordingly, and it worked fine. I believe you would want to write a custom script which does all these things and add it as a bootstrap action. Thanks, Sai On Jul 31, 2014

Re: SchemaRDD select expression

2014-07-31 Thread Buntu Dev
Thanks Michael for confirming! On Thu, Jul 31, 2014 at 2:43 PM, Michael Armbrust wrote: > The performance should be the same using the DSL or SQL strings. > > > On Thu, Jul 31, 2014 at 2:36 PM, Buntu Dev wrote: > >> I was not sure if registerAsTable() and then query against that table >> have

Re: SchemaRDD select expression

2014-07-31 Thread Michael Armbrust
The performance should be the same using the DSL or SQL strings. On Thu, Jul 31, 2014 at 2:36 PM, Buntu Dev wrote: > I was not sure if registerAsTable() and then query against that table have > additional performance impact and if DSL eliminates that. > > > On Thu, Jul 31, 2014 at 2:33 PM, Zong

Installing Spark 0.9.1 on EMR Cluster

2014-07-31 Thread Rahul Bhojwani
I wanted to install Spark version 0.9.1 on an Amazon EMR cluster. Can anyone give me the install script which I can pass as the custom bootstrap action while creating a cluster? Thanks -- Rahul K Bhojwani, about.me/rahul_bhojwani

Re: SchemaRDD select expression

2014-07-31 Thread Buntu Dev
I was not sure if registerAsTable() and then query against that table have additional performance impact and if DSL eliminates that. On Thu, Jul 31, 2014 at 2:33 PM, Zongheng Yang wrote: > Looking at what this patch [1] has to do to achieve it, I am not sure > if you can do the same thing in 1.

Re: SchemaRDD select expression

2014-07-31 Thread Zongheng Yang
Looking at what this patch [1] has to do to achieve it, I am not sure if you can do the same thing in 1.0.0 using DSL only. Just curious, why don't you use the hql() / sql() methods and pass a query string in? [1] https://github.com/apache/spark/pull/1211/files On Thu, Jul 31, 2014 at 2:20 PM, Bu

Re: configuration needed to run twitter(25GB) dataset

2014-07-31 Thread Ankur Dave
On Thu, Jul 31, 2014 at 08:28 PM, Jiaxin Shi wrote: > We have a 6-nodes cluster , each node has 64GB memory. > [...] > But it ran out of memory. I also try 2D and 1D partition. > > And I also try Giraph under the same configuration, and it runs for 10 > iterations , and then it ran out of memory a

Re: SchemaRDD select expression

2014-07-31 Thread Buntu Dev
Thanks Zongheng for the pointer. Is there a way to achieve the same in 1.0.0 ? On Thu, Jul 31, 2014 at 1:43 PM, Zongheng Yang wrote: > countDistinct is recently added and is in 1.0.2. If you are using that > or the master branch, you could try something like: > > r.select('keyword, countDis

Re: How do you debug a PythonException?

2014-07-31 Thread Nicholas Chammas
Davies, That was it. Removing the call to cache() let the job run successfully, but this challenges my understanding of how Spark handles caching data. I thought it was safe to cache data sets larger than the cluster could hold in memory. What Spark would do is cache as much as it could and leave

Re: store spark streaming dstream in hdfs or cassandra

2014-07-31 Thread Gerard Maas
To read/write from/to Cassandra I recommend the Spark-Cassandra connector at [1]. Using it, saving a Spark Streaming RDD to Cassandra is fairly easy: sparkConfig.set(CassandraConnectionHost, cassandraHost) val sc = new SparkContext(sparkConfig) val ssc = new StreamingContext(sc, Seconds
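A condensed sketch of that recipe; the keyspace, table, and column names are placeholders, and it assumes a spark-cassandra-connector build that ships the streaming implicits:

```scala
import com.datastax.spark.connector._
import com.datastax.spark.connector.streaming._
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.streaming.{Seconds, StreamingContext}

val cassandraHost = "127.0.0.1" // placeholder
val sparkConfig = new SparkConf()
  .setAppName("StreamToCassandra")
  .set("spark.cassandra.connection.host", cassandraHost)

val sc  = new SparkContext(sparkConfig)
val ssc = new StreamingContext(sc, Seconds(10))

// each batch of (word, count) pairs is written to the given keyspace/table
ssc.socketTextStream("localhost", 9999)
   .map(line => (line, 1))
   .saveToCassandra("my_keyspace", "my_table", SomeColumns("word", "total"))

ssc.start()
ssc.awaitTermination()
```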

Re: How do you debug a PythonException?

2014-07-31 Thread Davies Liu
Maybe it's because you are trying to cache all the data in memory, but the JVM heap is not big enough. If you remove the .cache(), is there still this problem? On Thu, Jul 31, 2014 at 1:33 PM, Nicholas Chammas wrote: > Hmm, looking at this stack trace a bit more carefully, it looks like the > code in the Hadoop

Re: SchemaRDD select expression

2014-07-31 Thread Zongheng Yang
countDistinct is recently added and is in 1.0.2. If you are using that or the master branch, you could try something like: r.select('keyword, countDistinct('userId)).groupBy('keyword) On Thu, Jul 31, 2014 at 12:27 PM, buntu wrote: > I'm looking to write a select statement to get a distinct c

Re: How do you debug a PythonException?

2014-07-31 Thread Nicholas Chammas
Hmm, looking at this stack trace a bit more carefully, it looks like the code in the Hadoop API for reading data from the source choked. Is that correct? Perhaps, there is a missing newline (or two. or more) that make 1 line of data too much to read in at once? I'm just guessing here. Gonna try to

Re: RDD operation examples with data?

2014-07-31 Thread Jacob Eisinger
I would check out the source examples on Spark's Github: https://github.com/apache/spark/tree/master/examples/src/main/scala/org/apache/spark/examples And, Zhen He put together a great web page with summaries and examples of each function: http://apache-spark-user-list.1001560.n3.nabble.com/A-new-

Re: How do you debug a PythonException?

2014-07-31 Thread Nicholas Chammas
So if I try this again but in the Scala shell (as opposed to the Python one), this is what I get: scala> val a = sc.textFile("s3n://some-path/*.json", minPartitions=sc.defaultParallelism * 3).cache() a: org.apache.spark.rdd.RDD[String] = MappedRDD[1] at textFile at :12 scala> a.map(_.length).max1

RDD operation examples with data?

2014-07-31 Thread Chris Curtin
Hi, I'm learning Spark and I am confused about when to use the many different operations on RDDs. Does anyone have any examples which show example inputs and resulting outputs for the various RDD operations and, if the operation takes a Function, a simple example of the code? For example, somethin

Re: RDD.coalesce got compilation error

2014-07-31 Thread Jianshi Huang
I see. Thanks Akhil. On Thu, Jul 31, 2014 at 6:08 PM, Akhil Das wrote: > Hi > > According to the documentation > http://spark.apache.org/docs/1.0.0/api/java/index.html it says *coalesce >

Re: spark.shuffle.consolidateFiles seems not working

2014-07-31 Thread Jianshi Huang
I got the number from the Hadoop admin. It's 1M actually. I suspect the consolidation didn't work as expected? Any other reason? On Thu, Jul 31, 2014 at 11:01 AM, Shao, Saisai wrote: > I don’t think it’s a bug of consolidated shuffle, it’s a Linux > configuration problem. The default open file

SchemaRDD select expression

2014-07-31 Thread buntu
I'm looking to write a select statement to get a distinct count on userId grouped by keyword column on a parquet file SchemaRDD equivalent of: SELECT keyword, count(distinct(userId)) from table group by keyword How to write it using the chained select().groupBy() operations? Thanks! -- View

Re: store spark streaming dstream in hdfs or cassandra

2014-07-31 Thread Hari Shreedharan
Off the top of my head, you can use the ForEachDStream to which you pass in the code that writes to Hadoop, and then register that as an output stream, so the function you pass in is periodically executed and causes the data to be written to HDFS. If you are ok with the data being in text forma
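Two hedged ways to land a text DStream in HDFS along those lines; the dstream handle and output paths are placeholders:

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.Time
import org.apache.spark.streaming.dstream.DStream

def persistToHdfs(dstream: DStream[String]): Unit = {
  // simplest route: the built-in periodic text output
  dstream.saveAsTextFiles("hdfs:///streaming/out/batch")

  // or register an output operation explicitly (this is what creates a
  // ForEachDStream under the hood) and write each batch's RDD yourself
  dstream.foreachRDD { (rdd: RDD[String], time: Time) =>
    rdd.saveAsTextFile(s"hdfs:///streaming/out/batch-${time.milliseconds}")
  }
}
```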

Re: Number of partitions and Number of concurrent tasks

2014-07-31 Thread Darin McBeath
Ok, I set the number of spark worker instances to 2 (below is my startup command).  But, this essentially had the effect of increasing my number of workers from 3 to 6 (which was good) but it also reduced my number of cores per worker from 8 to 4 (which was not so good).  In the end, I would sti

Re: Inconsistent Spark SQL behavior when column names contain dots

2014-07-31 Thread Yin Huai
I have created https://issues.apache.org/jira/browse/SPARK-2775 to track it. On Thu, Jul 31, 2014 at 11:47 AM, Budde, Adam wrote: > I still see the same “Unresolved attributes” error when using hql + > backticks. > > Here’s a code snippet that replicates this behavior: > > val hiveContext =

Re: Inconsistent Spark SQL behavior when column names contain dots

2014-07-31 Thread Budde, Adam
I still see the same “Unresolved attributes” error when using hql + backticks. Here’s a code snippet that replicates this behavior: val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc) val sampleRDD = sc.parallelize(Array("""{"key.one": "value1", "key.two": "value2"}""")) val sampleTa

store spark streaming dstream in hdfs or cassandra

2014-07-31 Thread salemi
Hi, I was wondering what the best way is to store off DStreams in HDFS or Cassandra. Could somebody provide an example? Thanks, Ali -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/store-spark-streaming-dstream-in-hdfs-or-cassandra-tp11064.html Sent from th

Re: Inconsistent Spark SQL behavior when column names contain dots

2014-07-31 Thread Michael Armbrust
Ideally you'd use backticks to reference columns that contain weird characters. I don't believe this works in sql parser, but I'm curious if using the hql parser in HiveContext would work for you? If you wanted to add support for this in the sql parser I'd check out SqlParser.scala. Thought it i

Inconsistent Spark SQL behavior when column names contain dots

2014-07-31 Thread Budde, Adam
I’m working with a dataset where each row is stored as a single-line flat JSON object. I want to leverage Spark SQL to run relational queries on this data. Many of the object keys in this dataset have dots in them, e.g.: { “key.number1”: “value1”, “key.number2”: “value2” … } I can successfully

Re: the EC2 setup script often will not allow me to SSH into my machines. Ideas?

2014-07-31 Thread William Cox
Ah, thanks for the help! That worked great. On Wed, Jul 30, 2014 at 10:31 AM, Zongheng Yang wrote: > To add to this: for this many (>= 20) machines I usually use at least > --wait 600. > > On Wed, Jul 30, 2014 at 9:10 AM, Nicholas Chammas > wrote: > > William, > > > > The error you are seeing

Re: Shark/Spark running on EC2 can read from S3 bucket but cannot write to it - "Wrong FS"

2014-07-31 Thread William Cox
I am running Spark 0.9.1 and Shark 0.9.1. Sorry I didn't include that. On Thu, Jul 31, 2014 at 9:50 AM, William Cox wrote: > *The Shark-specific group appears to be in moderation pause, so I'm asking > here.* > > I'm running Shark/Spark on EC2. I am using Shark to query data from a S3 > bucket

Re: HiveContext is creating metastore warehouse locally instead of in hdfs

2014-07-31 Thread Yin Huai
Another way is to set "hive.metastore.warehouse.dir" explicitly to the HDFS dir storing Hive tables by using SET command. For example: hiveContext.hql("SET hive.metastore.warehouse.dir=hdfs://localhost:54310/user/hive/warehouse") On Thu, Jul 31, 2014 at 8:05 AM, Andrew Lee wrote: > Hi All, >

java.lang.OutOfMemoryError: Java heap space

2014-07-31 Thread Sameer Tilak
Hi everyone, I have the following configuration. I am currently running my app in local mode. val conf = new SparkConf().setMaster("local[2]").setAppName("ApproxStrMatch").set("spark.executor.memory", "3g").set("spark.storage.memoryFraction", "0.1") I am getting the following error. I tried set

Shark/Spark running on EC2 can read from S3 bucket but cannot write to it - "Wrong FS"

2014-07-31 Thread William Cox
*The Shark-specific group appears to be in moderation pause, so I'm asking here.* I'm running Shark/Spark on EC2. I am using Shark to query data from a S3 bucket and then write the results back to a S3 bucket. The data is read fine, but when I write I get an error: 14/07/31 16:42:30 INFO schedule

Hbase

2014-07-31 Thread Madabhattula Rajesh Kumar
Hi Team, I'm using the below code to read a table from HBase: Configuration conf = HBaseConfiguration.create(); conf.set(TableInputFormat.INPUT_TABLE, "table1"); JavaPairRDD<ImmutableBytesWritable, Result> hBaseRDD = sc.newAPIHadoopRDD( conf, TableInputFormat.class, ImmutableBytesWritable.class, Result.class);
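A Scala sketch of the same read, assuming the spark-shell (so sc exists) and the same "table1" table:

```scala
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.Result
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableInputFormat

val hbaseConf = HBaseConfiguration.create()
hbaseConf.set(TableInputFormat.INPUT_TABLE, "table1")

// newAPIHadoopRDD yields (ImmutableBytesWritable, Result) pairs
val hBaseRDD = sc.newAPIHadoopRDD(hbaseConf, classOf[TableInputFormat],
  classOf[ImmutableBytesWritable], classOf[Result])
println(s"rows read: ${hBaseRDD.count()}")
```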

Re: Spark Deployment Patterns - Automated Deployment & Performance Testing

2014-07-31 Thread Andrew Lee
You should be able to use either SBT or maven to create your JAR files (not a fat jar), and only deploying the JAR for spark-submit. 1. Sync spark libs and versions with your development env and CLASSPATH in your IDE (unfortunately this needs to be hard copied, and may result in split-brain syn

Re: HiveContext is creating metastore warehouse locally instead of in hdfs

2014-07-31 Thread Andrew Lee
Hi All, It has been awhile, but what I did to make it work is to make sure the followings: 1. Hive is working when you run Hive CLI and JDBC via Hiveserver2 2. Make sure you have the hive-site.xml from above Hive configuration. The problem here is that you want the hive-site.xml from the Hive

SparkStreaming -- Supported directory structure

2014-07-31 Thread Yana Kadiyska
Hi, trying to figure out how to process files under the following directory structure: /batch_dir/*.snappy I have a base directory and every so often a new batch gets dropped with a bunch of .snappy files as well as a bunch of .xml files. I'd like to only process the snappy files but can't figur

Re: Number of partitions and Number of concurrent tasks

2014-07-31 Thread Daniel Siegmann
I haven't configured this myself. I'd start with setting SPARK_WORKER_CORES to a higher value, since that's a bit simpler than adding more workers. This defaults to "all available cores" according to the documentation, so I'm not sure if you can actually set it higher. If not, you can get around th

How to share a NonSerializable variable among tasks in the same worker node?

2014-07-31 Thread Fengyun RAO
As shown here: 2 - Why Is My Spark Job so Slow and Only Using a Single Thread? object JSONParser { def parse(raw: String): String = ... } object MyFirstSparkJob { def main
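A sketch of the per-JVM singleton pattern that post describes: a helper held in a Scala object is built once per executor JVM and reused by every task on that worker, instead of being serialized from the driver. Jackson is used here purely as a stand-in for an expensive, non-serializable helper.

```scala
import com.fasterxml.jackson.databind.ObjectMapper

object JSONParser {
  lazy val mapper = new ObjectMapper()                // built lazily, once per executor JVM
  def parse(raw: String): String = mapper.readTree(raw).toString
}

// in the job: uaRDD.map(line => JSONParser.parse(line))
```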

RE: Example standalone app error!

2014-07-31 Thread Alex Minnaar
I am eager to solve this problem. So if anyone has any suggestions, I would be glad to hear them. Thanks, Alex From: Andrew Or Sent: Tuesday, July 29, 2014 4:53 PM To: user@spark.apache.org Subject: Re: Example standalone app error! Hi Alex, Very strange.

SQLCtx cacheTable

2014-07-31 Thread Gurvinder Singh
Hi, I am wondering how I can specify the persistence level in cacheTable, as it takes only the table name as a parameter. It should be possible to specify the persistence level. - Gurvinder

Re: Ports required for running spark

2014-07-31 Thread Haiyang Fu
Hi Konstantin, Could you please post your first container's stderr log here, which is always the AM log? As far as I know, ports other than 8020, 8080, 8081, 50070, and 50071 are all random socket ports determined by each job. So 33007 may be just a temporary port for data transfer. The deeper reason for

configuration needed to run twitter(25GB) dataset

2014-07-31 Thread Jiaxin Shi
We have a 6-nodes cluster , each node has 64GB memory. here is the command: ./bin/spark-submit --class org.apache.spark.examples.graphx.LiveJournalPageRank examples/target/scala-2.10/spark-examples-1.0.1-hadoop1.0.4.jar hdfs://dataset/twitter --tol=0.01 --numEPart=144 --numIter=10 But it ran out

Re: Ports required for running spark

2014-07-31 Thread Konstantin Kudryavtsev
Hi Haiyang, you are right, YARN takes over the resource management, but I constantly get a ConnectionRefused exception on the mentioned port. So, I suppose some Spark internal communications are done via this port... but I don't know what exactly, or how I can change it... Thank you, Konstantin Kudryav

Re: Ports required for running spark

2014-07-31 Thread Larry Xiao
Sorry, I don't have experience with YARN. I checked the YARN page http://spark.apache.org/docs/latest/running-on-yarn.html And for configuration, it refers to http://spark.apache.org/docs/latest/configuration.html " Most of the configs are the same for Spark on YARN as for other deployment mod

Re: java.util.concurrent.TimeoutException: Futures timed out after [30 seconds]

2014-07-31 Thread Haiyang Fu
Have you tried to increase the driver memory? On Thu, Jul 31, 2014 at 3:54 PM, Bin wrote: > Hi All, > > The data size of my task is about 30mb. It runs smoothly in local mode. > However, when I submit it to the cluster, it throws the titled error > (Please see below for the complete output). >

set spark.local.dir on driver program doesn't take effect

2014-07-31 Thread redocpot
Hi, When running Spark on an EC2 cluster, I find that setting spark.local.dir in the driver program doesn't take effect. INFO: - standalone mode - cluster launched via python script along with spark - instance type R3.large - ebs attached (using persistent-hdfs) - spark version: 1.0.0 prebuilt-hadoop1,sbt do

Re: Ports required for running spark

2014-07-31 Thread Haiyang Fu
Hi Konstantin, Would you please post some more details? Error info or an exception from the log, and in what situation? When you run a Spark job in YARN cluster mode, YARN will take over all the resource management. On Thu, Jul 31, 2014 at 6:17 PM, Konstantin Kudryavtsev < kudryavtsev.konstan...@gmail.com>

Re: understanding use of "filter" function in Spark

2014-07-31 Thread Sean Owen
groupByKey will give you a PairRDD, where for each key k, you have an Iterable over all corresponding (x,y). You can then call mapValues and apply your clustering to the points, to yield a result R. You end up with a PairRDD of (k,R) pairs. This of course happens in parallel. On Thu, Jul 31,
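A sketch of that pattern; clusterFor stands in for whatever single-machine k-means the poster already has (here it just returns the centroid so the snippet compiles).

```scala
import org.apache.spark.rdd.RDD

// stand-in for a real sequential k-means over one key's points
def clusterFor(points: Iterable[(Double, Double)]): (Double, Double) = {
  val n = points.size.toDouble
  (points.map(_._1).sum / n, points.map(_._2).sum / n)
}

def clusterPerKey(data: RDD[(Int, (Double, Double))]): RDD[(Int, (Double, Double))] =
  data.groupByKey()          // (k, Iterable[(x, y)]): all points for a key land together
      .mapValues(clusterFor) // clustering runs per key, in parallel across keys
```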

understanding use of "filter" function in Spark

2014-07-31 Thread Greg
Hi, suppose I have some data of the form k,(x,y) which are all numbers. For each key value (k) I want to do kmeans clustering on all corresponding (x,y) points. For each key value I have few enough points that I'm happy to use a traditional (non-mapreduce) kmeans implementation. The challenge is th

Re: Ports required for running spark

2014-07-31 Thread Konstantin Kudryavtsev
Hi Larry, I'm afraid this is standalone mode; I'm interested in YARN. Also, I don't see the problem port 33007, which I believe is related to Akka. Thank you, Konstantin Kudryavtsev On Thu, Jul 31, 2014 at 1:11 PM, Larry Xiao wrote: > Hi Konstantin, > > I think you can find it at > https://spar

Re: Implementing percentile through top Vs take

2014-07-31 Thread Bharath Ravi Kumar
Thanks. Turns out I needed the RDD sorted for another purpose, so keeping a sorted pair rdd anyway made sense. And apologies for not reading the source for top before asking the question (/*poor attempt to save time*/). On Thu, Jul 31, 2014 at 12:34 AM, Sean Owen wrote: > No, it's definitely no

Re: Ports required for running spark

2014-07-31 Thread Larry Xiao
Hi Konstantin, I think you can find it at https://spark.apache.org/docs/latest/spark-standalone.html#configuring-ports-for-network-security and you can specify port for master or worker at conf/spark-env.sh Larry On 7/31/14, 6:04 PM, Konstantin Kudryavtsev wrote: Hi there, I'm trying to run
