Re: setting heap space

2014-10-13 Thread Akhil Das
Few things to keep in mind:
- I believe Driver memory should not exceed executor memory
- Set spark.storage.memoryFraction (default is 0.6)
- Set spark.rdd.compress (default is set to false)
- Always specify the level of parallelism while doing a groupBy, reduceBy, join, sortBy etc. (see the sketch below)
- If you don't have
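For the parallelism tip, a minimal sketch; the RDD names and the partition count here are placeholders, not from the thread:

    // Passing an explicit partition count to shuffle operations avoids a few huge
    // partitions blowing the heap; 200 is only a placeholder, tune it to the cluster.
    val counts = pairs.reduceByKey(_ + _, 200)
    val joined = counts.join(otherPairs, 200)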

Re: setting heap space

2014-10-13 Thread Chengi Liu
Hi Akhil, Thanks for the response.. Another query... do you know how to use "spark.executor.extraJavaOptions" option? SparkConf.set("spark.executor.extraJavaOptions","what value should go in here")? I am trying to find an example but cannot seem to find the same.. On Mon, Oct 13, 2014 at 12:03

Re: setting heap space

2014-10-13 Thread Akhil Das
You can set it like this: sparkConf.set("spark.executor.extraJavaOptions", "-XX:+UseCompressedOops -XX:+UseConcMarkSweepGC -XX:+AggressiveOpts -XX:FreqInlineSize=300 -XX:MaxInlineSize=300") Here's a benchmark example
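For completeness, a minimal sketch of setting this before the context is created (the app name and the chosen JVM options are placeholders):

    import org.apache.spark.{SparkConf, SparkContext}

    // Executor JVM options must be on the SparkConf before the SparkContext exists,
    // because executors are launched with whatever options the conf holds at that point.
    val sparkConf = new SparkConf()
      .setAppName("heap-tuning-example")
      .set("spark.executor.extraJavaOptions",
           "-XX:+UseCompressedOops -XX:+UseConcMarkSweepGC")
    val sc = new SparkContext(sparkConf)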

Re: setting heap space

2014-10-13 Thread Chengi Liu
Cool.. Thanks.. And one last final question.. conf = SparkConf.set().set(...) matrix = get_data(..) rdd = sc.parallelize(matrix) # heap error here... How and where do I set the storage level.. seems like conf is the wrong place to set this thing up..?? as I get this error: py4j.protocol.Py4

Re: Spark SQL - Exception only when using cacheTable

2014-10-13 Thread poiuytrez
This is how the table was created: transactions = parts.map(lambda p: Row(customer_id=long(p[0]), chain=int(p[1]), dept=int(p[2]), category=int(p[3]), company=int(p[4]), brand=int(p[5]), date=str(p[6]), productsize=float(p[7]), productmeasure=str(p[8]), purchasequantity=int(p[9]), purchaseamount=f

Re: setting heap space

2014-10-13 Thread Akhil Das
Like this: import org.apache.spark.storage.StorageLevel val rdd = sc.parallelize(1 to 100).persist(StorageLevel.MEMORY_AND_DISK_SER) Thanks Best Regards On Mon, Oct 13, 2014 at 12:50 PM, Chengi Liu wrote: > Cool.. Thanks.. And one last final question.. > conf = SparkConf.set().set(...)

SPARK-3106 fixed?

2014-10-13 Thread Jianshi Huang
https://issues.apache.org/jira/browse/SPARK-3106 I'm having the same errors described in SPARK-3106 (no other types of errors confirmed), running a bunch of SQL queries on Spark 1.2.0 built from the latest master HEAD. Any updates on this issue? My main task is to join a huge fact table with a dozen

Re: SparkSQL on Hive error

2014-10-13 Thread Kevin Paul
Thanks Michael, your patch works for me :) Regards, Kelvin Paul On Fri, Oct 3, 2014 at 3:52 PM, Michael Armbrust wrote: > Are you running master? There was briefly a regression here that is > hopefully fixed by spark#2635. > > On Fri, Oct 3, 2014 at 1

Re: SPARK-3106 fixed?

2014-10-13 Thread Jianshi Huang
Hmm... it failed again, it just lasted a little bit longer. Jianshi On Mon, Oct 13, 2014 at 4:15 PM, Jianshi Huang wrote: > https://issues.apache.org/jira/browse/SPARK-3106 > > I'm having the same errors described in SPARK-3106 (no other types of > errors confirmed), running a bunch of SQL queries

Setting SparkSQL configuration

2014-10-13 Thread Kevin Paul
Hi all, I tried to set the configuration spark.sql.inMemoryColumnarStorage.compressed, and spark.sql.inMemoryColumnarStorage.batchSize in spark.executor.extraJavaOptions but it does not work, my spark.executor.extraJavaOptions contains "-Dspark.sql.inMemoryColumnarStorage.compressed=true -Dspark.sql

Issue with Spark Twitter Streaming

2014-10-13 Thread Jahagirdar, Madhu
All, We are using Spark Streaming to receive data from the Twitter stream. This is running behind a proxy. We have done the following configuration inside Spark Streaming for twitter4j to work behind the proxy. def main(args: Array[String]) { val filters = Array("Modi") System.setProperty("twi

Is "Array Of Struct" supported in json RDDs? is it possible to query this?

2014-10-13 Thread shahab
Hello, Given the following structure, is it possible to query, e.g. session[0].id ? In general, is it possible to query "Array Of Struct" in json RDDs?

root
 |-- createdAt: long (nullable = true)
 |-- id: string (nullable = true)
 |-- sessions: array (nullable = true)
 |    |-- element: str

Adding Supersonic to the "Powered by Spark" list

2014-10-13 Thread Maya Bercovitch
Hi, We are using Spark for several months now and will be glad to join the Spark family officially :) *Company Name*: Supersonic Supersonic is a mobile advertising company. *URL*: http://www.supersonicads.com/ *Components and use-cases*: Using Spark core for big data crunching, MLlib for predic

Configuration is not effective or configuration errors?

2014-10-13 Thread pol
Hi ALL, I set "spark.storage.blockManagerSlaveTimeoutMs=10" in the "spark-default.conf" file, but the setting does not show up under the "Environment" tab of the http://:4040 page. Why? PS: I use "new SparkConf()" in the code and run in Spark standalone mode. Thanks, chepoo --

Re: How To Implement More Than One Subquery in Scala/Spark

2014-10-13 Thread arthur.hk.c...@gmail.com
Hi, Thank you so much! By the way, what is the DATEADD function in Scala/Spark? or how to implement "DATEADD(MONTH, 3, '2013-07-01')" and "DATEADD(YEAR, 1, '2014-01-01')" in Spark or Hive? Regards Arthur On 12 Oct, 2014, at 12:03 pm, Ilya Ganelin wrote: > Because of how closures work in

Re: Setting SparkSQL configuration

2014-10-13 Thread Cheng Lian
Currently Spark SQL doesn’t support reading SQL specific configurations via system properties. But for HiveContext, you can put them in hive-site.xml. On 10/13/14 4:28 PM, Kevin Paul wrote: Hi all, I tried to set the configuration spark.sql.inMemoryColumnarStorage.compressed, and spark.s
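Another option that may apply here, assuming the setConf method available on SQLContext/HiveContext in this line of releases; a minimal sketch with placeholder values and an existing SparkContext `sc`:

    import org.apache.spark.sql.hive.HiveContext

    // SQL-specific options go on the SQL/Hive context itself, not on executor JVM properties.
    val hiveContext = new HiveContext(sc)
    hiveContext.setConf("spark.sql.inMemoryColumnarStorage.compressed", "true")
    hiveContext.setConf("spark.sql.inMemoryColumnarStorage.batchSize", "10000")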

Issue on running spark application in Yarn-cluster mode

2014-10-13 Thread vishnu86
When I execute the following in yarn-client mode it works fine and gives the result properly, but when I try to run in yarn-cluster mode I get an error: spark-submit --class org.apache.spark.examples.SparkPi --master yarn-client /home/rug885/spark/examples/lib/spark-examples_2.10-1.0.0-cdh5

Re: Spark SQL - custom aggregation function (UDAF)

2014-10-13 Thread Pierre B
Is it planned in the "near" future? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-custom-aggregation-function-UDAF-tp15784p16275.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: SparkSQL LEFT JOIN problem

2014-10-13 Thread invkrh
Hi, Thank you Liquan. I just missed some information in my previous post. I just solved the problem. Actually, I use the first line (schema header) of the CSV file to generate StructType and StructField. However, the input file is UTF-8 Unicode (with BOM), so the first char of the file is #6

Inconsistency of implementing accumulator in Java

2014-10-13 Thread WonderfullDay
Hello! I'm new to Spark, and am writing a Java program with it. But when I implement the Accumulator, something confuses me. Here is my implementation:

class WeightAccumulatorParam implements AccumulatorParam<Weight> {
  @Override
  public Weight zero(Weight initialValue) {
    return new Weight();

Re: How To Implement More Than One Subquery in Scala/Spark

2014-10-13 Thread Yin Huai
Question 1: Please check http://spark.apache.org/docs/1.1.0/sql-programming-guide.html#hive-tables. Question 2: One workaround is to re-write it. You can use LEFT SEMI JOIN to implement the subquery with EXISTS and use LEFT OUTER JOIN + IS NULL to implement the subquery with NOT EXISTS. SELECT S_NA
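As a hedged illustration of that rewrite (the table and column names are placeholders, not the query from the thread, and `hiveContext` is assumed):

    // EXISTS rewritten as LEFT SEMI JOIN, NOT EXISTS as LEFT OUTER JOIN + IS NULL.
    val exists = hiveContext.sql(
      "SELECT t1.name FROM t1 LEFT SEMI JOIN t2 ON t1.id = t2.id")
    val notExists = hiveContext.sql(
      "SELECT t1.name FROM t1 LEFT OUTER JOIN t2 ON t1.id = t2.id WHERE t2.id IS NULL")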

Re: Inconsistency of implementing accumulator in Java

2014-10-13 Thread Sean Owen
Are you sure you aren't actually trying to extend AccumulableParam instead of AccumulatorParam? The example you cite does the latter. I do not get a compile error from this example. You also didn't say what version of Spark. (Although there are a few things about the example that could be updated

RE: Spark SQL parser bug?

2014-10-13 Thread Mohammed Guller
Hi Cheng, I am using version 1.1.0. Looks like that bug was fixed sometime after 1.1.0 was released. Interestingly, I tried your code on 1.1.0 and it gives me a different (incorrect) result: case class T(a:String, ts:java.sql.Timestamp) val sqlContext = new org.apache.spark.sql.SQLContext(sc) im

Re: Is "Array Of Struct" supported in json RDDs? is it possible to query this?

2014-10-13 Thread Yin Huai
If you are using HiveContext, it should work in 1.1. Thanks, Yin On Mon, Oct 13, 2014 at 5:08 AM, shahab wrote: > Hello, > > Given the following structure, is it possible to query, e.g. session[0].id > ? > > In general, is it possible to query "Array Of Struct" in json RDDs? > > root > > |--
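A minimal sketch of that approach; the JSON path is a placeholder and the field names follow the schema shown in the question:

    import org.apache.spark.sql.hive.HiveContext

    // HiveQL supports indexing into arrays of structs, e.g. sessions[0].id.
    val hiveContext = new HiveContext(sc)
    val events = hiveContext.jsonFile("hdfs:///data/events.json")
    events.registerTempTable("events")
    hiveContext.sql("SELECT sessions[0].id FROM events").collect()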

Re: Nested Query using SparkSQL 1.1.0

2014-10-13 Thread Yin Huai
Hi Shahab, Can you try to use HiveContext? It should work in 1.1. For SQLContext, this issue was not fixed in 1.1 and you need to use the master branch at the moment. Thanks, Yin On Sun, Oct 12, 2014 at 5:20 PM, shahab wrote: > Hi, > > Apparently is it is possible to query nested json using sp

Question about SVM mlllib...

2014-10-13 Thread Alfonso Muñoz Muñoz
Dear friends, Is there any way to know what the "predicted label" is for each "input label"? [CODE] … val model = SVMWithSGD.train(training, numIterations) model.clearThreshold() val scoreAndLabels = test.map { point => val score = model.predict(point.features) (predi

Re: Spark SQL parser bug?

2014-10-13 Thread Yin Huai
It seems the "wrong" results you got were caused by the timezone. The time in java.sql.Timestamp(long time) means "milliseconds since January 1, 1970, 00:00:00 GMT. A negative number is the number of milliseconds before January 1, 1970, 00:00:00 GMT." However, in ts>='1970-01-01 00:00:00'

Regarding java version requirement in spark 1.2.0 or upcoming releases

2014-10-13 Thread twinkle sachdeva
Hi, Can somebody please share the plans regarding Java version support for Apache Spark 1.2.0 or near-future releases? Will Java 8 become the fully supported version in Apache Spark 1.2, or will Java 1.7 suffice? Thanks,

Re: Question about SVM mlllib...

2014-10-13 Thread Sean Owen
Aside from some syntax errors, this looks like exactly how you do it, right? Except that you call clearThreshold(), which causes it to return the margin, not a 0/1 prediction. Don't call that. It will default to the behavior you want. On Mon, Oct 13, 2014 at 3:03 PM, Alfonso Muñoz Muñoz wrote: >
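A minimal sketch of what Sean describes; the training/test RDDs and the iteration count are assumed to exist as in the original code:

    import org.apache.spark.mllib.classification.SVMWithSGD

    // Without clearThreshold() predict() returns 0.0/1.0 class labels, so the
    // prediction can be paired directly with the input label.
    val model = SVMWithSGD.train(training, numIterations)
    val predictionAndLabel = test.map(point => (model.predict(point.features), point.label))
    val accuracy = predictionAndLabel.filter { case (p, l) => p == l }.count.toDouble / test.count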

Re: Regarding java version requirement in spark 1.2.0 or upcoming releases

2014-10-13 Thread Sean Owen
I have not heard any plans to even drop support for Java 6. I imagine it will remain that way for a while. Java 6 is sufficient. On Mon, Oct 13, 2014 at 3:37 PM, twinkle sachdeva wrote: > Hi, > > Can somebody please share the plans regarding java version's support for > apache spark 1.2.0 or near

Re: Nested Query using SparkSQL 1.1.0

2014-10-13 Thread shahab
Thanks Yin. I tried HiveQL and it solved that problem. But now I have a second query requirement: since you are the main developer behind the JSON-Spark integration (I saw your presentation on YouTube, "Easy JSON Data Manipulation in Spark"), is it possible to perform aggregation-kind queries, for e

Re: Inconsistency of implementing accumulator in Java

2014-10-13 Thread WonderfullDay
I am sorry that I forgot to mention the version of Spark. The version is spark-1.1.0. I am pretty sure I am not trying to extend AccumulableParam; you can see the code of the implementation of WeightAccumulatorParam in my post, it implements the interface "AccumulatorParam". And I am not even i

RE: Spark SQL parser bug?

2014-10-13 Thread Mohammed Guller
Good guess, but that is not the reason. Look at this code: scala> val data = sc.parallelize(132554880L::133554880L::Nil).map(i=> T(i.toString, new java.sql.Timestamp(i))) data: org.apache.spark.rdd.RDD[T] = MappedRDD[17] at map at :17 scala> data.collect res3: Array[T] = Array(T(13255488

Re: Nested Query using SparkSQL 1.1.0

2014-10-13 Thread Yin Huai
Hi Shahab, Do you mean queries with group by and aggregation functions? Once you register the json dataset as a table, you can write queries like querying a regular table. You can join it with other tables and do aggregations. Is it what you were asking for? If not, can you give me a more concrete

Re: Inconsistency of implementing accumulator in Java

2014-10-13 Thread WonderfullDay
Furthermore, the second build passes, but an exception is thrown while running: NoClassDefFoundError. The situation is the same as the online sample -- VectorAccumulatorParam. So to me, if I don't implement the addAccumulator function, I cannot customize the AccumulatorPara

Re: Spark SQL parser bug?

2014-10-13 Thread Yin Huai
Yeah, it is not related to the timezone. I think you hit this issue and it was fixed after the 1.1 release. On Mon, Oct 13, 2014 at 11:24 AM, Mohammed Guller wrote: > Good guess, but that is not the reason. Look at this code: > > > > scala> val data =

SparkSQL: StringType for numeric comparison

2014-10-13 Thread invkrh
Hi, I am using SparkSQL 1.1.0. Actually, I have a table as follows:

root
 |-- account_id: string (nullable = false)
 |-- Birthday: string (nullable = true)
 |-- preferstore: string (nullable = true)
 |-- registstore: string (nullable = true)
 |-- gender: string (nullable = true)
 |-- city_name

persist table schema in spark-sql

2014-10-13 Thread Sadhan Sood
We want to persist the table schema of a parquet file so as to use the spark-sql CLI on that table later on. Is that possible, or is the spark-sql CLI only good for tables in the Hive metastore? We are reading parquet data using this example: // Read in the parquet file created above. Parquet files are self-describi

read all parquet files in a directory in spark-sql

2014-10-13 Thread Sadhan Sood
How can we read all parquet files in a directory in spark-sql? We are following this example which shows a way to read one file: // Read in the parquet file created above. Parquet files are self-describing so the schema is preserved. // The result of loading a Parquet file is also a SchemaRDD. val

SparkSQL: select syntax

2014-10-13 Thread invkrh
Hi all, A quick question on SparkSql SELECT syntax. Does it support queries like: SELECT t1.*, t2.d, t2.e FROM t1 LEFT JOIN t2 on t1.a = t2.a It always ends with the exception: Exception in thread "main" java.lang.RuntimeException: [2.12] failure: string literal expected SELECT t1.*, t2.

Re: read all parquet files in a directory in spark-sql

2014-10-13 Thread Nicholas Chammas
Right now I believe the only supported option is to pass a comma-delimited list of paths. I've opened SPARK-3928: Support wildcard matches on Parquet files to request this feature. Nick On Mon, Oct 13, 2014 at 12:21 PM, Sadhan Sood wrote: > Ho

Re: parquetFile and wilcards

2014-10-13 Thread Nicholas Chammas
SPARK-3928: Support wildcard matches on Parquet files On Wed, Sep 24, 2014 at 2:14 PM, Michael Armbrust wrote: > We could certainly do this. The comma separated support is something I > added. > > On Wed, Sep 24, 2014 at 10:20 AM, Nicholas Cham

RE: Spark SQL parser bug?

2014-10-13 Thread Mohammed Guller
That explains it. Thanks! Mohammed From: Yin Huai [mailto:huaiyin@gmail.com] Sent: Monday, October 13, 2014 8:47 AM To: Mohammed Guller Cc: Cheng, Hao; Cheng Lian; user@spark.apache.org Subject: Re: Spark SQL parser bug? Yeah, it is not related to timezone. I think you hit this issue

S3 Bucket Access

2014-10-13 Thread Ranga
Hi I am trying to access files/buckets in S3 and encountering a permissions issue. The buckets are configured to authenticate using an IAMRole provider. I have set the KeyId and Secret using environment variables ( AWS_SECRET_ACCESS_KEY and AWS_ACCESS_KEY_ID). However, I am still unable to access

RowMatrix PCA out of heap space error

2014-10-13 Thread Yang
I got this error when trying to perform PCA on a sparse matrix, each row has a nominal length of 8000, and there are 36k rows. each row has on average 3 elements being non-zero. I guess the total size is not that big. Exception in thread "main" java.lang.OutOfMemoryError: Java heap space at java.

Re: Hung spark executors don't count toward worker memory limit

2014-10-13 Thread Keith Simmons
Maybe I should put this another way. If spark has two jobs, A and B, both of which consume the entire allocated memory pool, is it expected that spark can launch B before the executor processes tied to A are completely terminated? On Thu, Oct 9, 2014 at 6:57 PM, Keith Simmons wrote: > Actually,

Re: RowMatrix PCA out of heap space error

2014-10-13 Thread Sean Owen
The Gramian is 8000 x 8000, dense, and full of 8-byte doubles. It's symmetric so can get away with storing it in ~256MB. The catch is that it's going to send around copies of this 256MB array. You may easily be running your driver out of memory given all the overheads and copies, or your executors,
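A back-of-the-envelope check of those numbers, assuming 8-byte doubles:

    val n = 8000L
    val denseBytes  = n * n * 8            // ~512 MB for the full dense Gramian
    val packedBytes = n * (n + 1) / 2 * 8  // ~256 MB if only the upper triangle is stored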

Re: Spark SQL HiveContext Projection Pushdown

2014-10-13 Thread Michael Armbrust
> > Is there any plan to support windowing queries? I know that Shark > supported it in its last release and expected it to be already included. > Someone from redhat is working on this. Unclear if it will make the 1.2 release.

Re: Spark SQL - custom aggregation function (UDAF)

2014-10-13 Thread Michael Armbrust
Its not on the roadmap for 1.2. I'd suggest opening a JIRA. On Mon, Oct 13, 2014 at 4:28 AM, Pierre B < pierre.borckm...@realimpactanalytics.com> wrote: > Is it planned in a "near" future ? > > > > -- > View this message in context: > http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL

Re: SparkSQL: StringType for numeric comparison

2014-10-13 Thread Michael Armbrust
This conversion is done implicitly anytime you use a string column in an operation with a numeric column. If you run explain on your query you should see the cast that is inserted. This is intentional and based on the type semantics of Apache Hive. On Mon, Oct 13, 2014 at 9:03 AM, invkrh wrote:

Re: persist table schema in spark-sql

2014-10-13 Thread Michael Armbrust
If you are running a version > 1.1 you can create external parquet tables. I'd recommend setting spark.sql.hive.convertMetastoreParquet=true. Here's a helper function to do it automatically: /** * Sugar for creating a Hive external table from a parquet path. */ def createParquetTable(name: Strin
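The helper above is cut off; a rough sketch of what such a helper might look like. The serde/format class names are assumptions for a Hive 0.12-era parquet setup, and the column list (e.g. "id INT, name STRING") must be supplied by the caller:

    import org.apache.spark.sql.hive.HiveContext

    // Registers an external Hive table over an existing parquet directory so the
    // schema survives across spark-sql sessions. Everything here is a sketch.
    def createParquetTable(hc: HiveContext, name: String, columns: String, path: String): Unit = {
      hc.sql(s"""
        CREATE EXTERNAL TABLE IF NOT EXISTS $name ($columns)
        ROW FORMAT SERDE 'parquet.hive.serde.ParquetHiveSerDe'
        STORED AS
          INPUTFORMAT 'parquet.hive.DeprecatedParquetInputFormat'
          OUTPUTFORMAT 'parquet.hive.DeprecatedParquetOutputFormat'
        LOCATION '$path'
      """)
    }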

Why is parsing a CSV incredibly wasteful with Java Heap memory?

2014-10-13 Thread Aris
Hi guys, I am trying to just parse out values from a CSV; everything is a numeric (Double) value, and the input text CSV data is about 1.3 GB in size. When I inspect the Java heap space used by SparkSubmit using VisualVM, I end up eating up 8GB of memory! Moreover, by inspecting the BlockManager

Re: SPARK-3106 fixed?

2014-10-13 Thread Jianshi Huang
Turned out it was caused by this issue: https://issues.apache.org/jira/browse/SPARK-3923 Setting spark.akka.heartbeat.interval to 100 solved it. Jianshi On Mon, Oct 13, 2014 at 4:24 PM, Jianshi Huang wrote: > Hmm... it failed again, just lasted a little bit longer. > > Jianshi > > On Mon, Oct 13,

Re: Why is parsing a CSV incredibly wasteful with Java Heap memory?

2014-10-13 Thread Sean Owen
A CSV element like "3.2," takes 4 bytes as text on disk, but as a Double it will always take 8 bytes. Is your input like this? That could explain it. You can map to Float in this case to halve the memory, if that works for your use case. This is just kind of how Strings and floating-point work in th
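A minimal sketch of the Float suggestion; the path, delimiter and storage level are placeholders:

    import org.apache.spark.storage.StorageLevel

    // Parsing straight to Float halves the per-value memory compared to Double,
    // and serialized storage avoids boxed-object overhead.
    val rows = sc.textFile("hdfs:///data/matrix.csv")
      .map(_.split(',').map(_.toFloat))
      .persist(StorageLevel.MEMORY_ONLY_SER)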

Multipart uploads to Amazon S3 from Apache Spark

2014-10-13 Thread Nick Chammas
Cross posting an interesting question on Stack Overflow. Nick -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Multipart-uploads-to-Amazon-S3-from-Apache-Spark-t

Re: SPARK-3106 fixed?

2014-10-13 Thread Jianshi Huang
One thing that made me very confused during debugging is the error message. The important one, "WARN ReliableDeliverySupervisor: Association with remote system [akka.tcp://sparkDriver@xxx:50278] has failed, address is now gated for [5000] ms. Reason is: [Disassociated].", is only at log level WARN. Jianshi

Re: mlib model viewing and saving

2014-10-13 Thread Joseph Bradley
Currently, printing (toString) gives a human-readable version of the tree, but it is not a format which is easy to save and load. That sort of serialization is in the works, but not available for trees right now. (Note that the current master actually has toString (for a short summary of the tree

Re: Multipart uploads to Amazon S3 from Apache Spark

2014-10-13 Thread Daniil Osipov
Not directly related, but FWIW, EMR seems to back away from s3n usage: "Previously, Amazon EMR used the S3 Native FileSystem with the URI scheme, s3n. While this still works, we recommend that you use the s3 URI scheme for the best performance, security, and reliability." http://docs.aws.amazon.c

Re: Multipart uploads to Amazon S3 from Apache Spark

2014-10-13 Thread Nicholas Chammas
Oh, that's a straight reversal from their position up until earlier this year. Was there an announcement explaining the change in recommendation? Nick On Mon, Oct 13, 2014 at 4:54 PM, Daniil O

distributing Scala Map datatypes to RDD

2014-10-13 Thread jon.g.massey
Hi guys, Just starting out with Spark and following through a few tutorials, it seems the easiest way to get ones source data into an RDD is using the sc.parallelize function. Unfortunately, my local data is in multiple instances of Map types, and the parallelize function only works on objects with

Re: S3 Bucket Access

2014-10-13 Thread Ranga
Is there a way to specify a request header during the .textFile call? - Ranga On Mon, Oct 13, 2014 at 11:03 AM, Ranga wrote: > Hi > > I am trying to access files/buckets in S3 and encountering a permissions > issue. The buckets are configured to authenticate using an IAMRole provider. > I have

Re: Breaking the previous large-scale sort record with Spark

2014-10-13 Thread Matei Zaharia
The biggest scaling issue was supporting a large number of reduce tasks efficiently, which the JIRAs in that post handle. In particular, our current default shuffle (the hash-based one) has each map task open a separate file output stream for each reduce task, which wastes a lot of memory (since

Re: Why is parsing a CSV incredibly wasteful with Java Heap memory?

2014-10-13 Thread Aris
Thank you Sean. Moving over my data types from Double to Float was an (obvious) big win, and I discovered one more good optimization from the Tuning section -- I modified my original code to call .persist(MEMORY_ONLY_SER) from the FIRST import of the data, and I pass in "--conf spark.rdd.compress=

Problems building Spark for Hadoop 1.0.3

2014-10-13 Thread mildebrandt
Hello, After reading the following pages: https://spark.apache.org/docs/latest/building-with-maven.html http://spark.apache.org/docs/latest/hadoop-third-party-distributions.html I came up with this build command: mvn -Dhadoop.version=1.0.3 -DskipTests -Pdeb -Phadoop-1.0.3 -Phive -Phadoop-provided

Re: Problems building Spark for Hadoop 1.0.3

2014-10-13 Thread Sean Owen
Yes, there is no hadoop-1.0.3 profile. You can look at the parent pom.xml to see the profiles that exist. It doesn't mean 1.0.3 doesn't work, just that there is nothing specific to activate for this version range. I don't think the docs suggest this profile exists. You don't need any extra profile.

Re: read all parquet files in a directory in spark-sql

2014-10-13 Thread DB Tsai
For now, with the SPARK-3462 parquet pushdown for unionAll PR, you can do the following to unionAll SchemaRDDs: val files = Array("hdfs://file1.parquet", "hdfs://file2.parquet", "hdfs://file3.parquet") val rdds = files.map(hc.parquetFile(_)) val unionedRDD = { var temp = rdds(0) fo
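The snippet above is truncated; a compact equivalent of the same idea, where the paths are placeholders and `hc` is assumed to be the HiveContext/SQLContext from the thread:

    // Load each parquet file and fold the resulting SchemaRDDs together with unionAll.
    val parquetPaths = Seq("hdfs:///data/file1.parquet",
                           "hdfs:///data/file2.parquet",
                           "hdfs:///data/file3.parquet")
    val unioned = parquetPaths.map(hc.parquetFile).reduce(_ unionAll _)
    unioned.registerTempTable("all_parquet_files")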

Re: Problems building Spark for Hadoop 1.0.3

2014-10-13 Thread mildebrandt
Hi Sean, Thanks for the quick response. I'll give that a try. I'm still a little concerned that yarn support will be built into my target assembly. Are you aware of something I can check after the build completes to be sure that Spark doesn't look for Yarn at runtime? Thanks, -Chris -- View

SparkSQL IndexOutOfBoundsException when reading from Parquet

2014-10-13 Thread Terry Siu
I am currently using Spark 1.1.0 that has been compiled against Hadoop 2.3. Our cluster is CDH5.1.2, which runs Hive 0.12. I have two external Hive tables that point to Parquet (compressed with Snappy), which were converted over from Avro if that matters. I am trying to perform a join with th

Re: distributing Scala Map datatypes to RDD

2014-10-13 Thread Stephen Boesch
is the following what you are looking for? scala> sc.parallelize(myMap.map{ case (k,v) => (k,v) }.toSeq) res2: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[0] at parallelize at :21 2014-10-13 14:02 GMT-07:00 jon.g.massey : > Hi guys, > Just starting out with Spark and fol

Spark Cluster health check

2014-10-13 Thread Tarun Garg
Hi All, I am doing a POC and have written a job in Java, so the architecture has Kafka and Spark. Now I want a process to notify me whenever system performance is degrading or there is a crunch of resources, like CPU or RAM. I understand org.apache.spark.streaming.scheduler.StreamingListener, but it has ver

Re: distributing Scala Map datatypes to RDD

2014-10-13 Thread Sean Owen
Map.toSeq already does that even. You can skip the map. You can put together Maps with ++ too. You should have an RDD of pairs then, but to get the special RDD functions you're looking for remember to import SparkContext._ On Mon, Oct 13, 2014 at 10:58 PM, Stephen Boesch wrote: > is the following
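A minimal sketch putting those pieces together; the maps themselves are hypothetical:

    import org.apache.spark.SparkContext._   // pair-RDD functions such as reduceByKey

    val m1 = Map("a" -> 1, "b" -> 2)
    val m2 = Map("b" -> 3, "c" -> 4)
    // Concatenating the key/value pairs keeps duplicates so they can be combined in the RDD.
    val pairs = sc.parallelize(m1.toSeq ++ m2.toSeq)   // RDD[(String, Int)]
    val totals = pairs.reduceByKey(_ + _)              // ("b", 5), etc.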

Does SparkSQL work with custom defined SerDe?

2014-10-13 Thread Chen Song
In Hive, the table was created with custom SerDe, in the following way. row format serde "abc.ProtobufSerDe" with serdeproperties ("serialization.class"= "abc.protobuf.generated.LogA$log_a") When I start spark-sql shell, I always got the following exception, even for a simple query. select user

Re: S3 Bucket Access

2014-10-13 Thread Daniil Osipov
(Copying the user list) You should use the spark_ec2 script to configure the cluster. If you use the trunk version you can use the new --copy-aws-credentials option to configure the S3 parameters automatically, otherwise either include them in your SparkConf variable or add them to /root/spark/ephemeral-hd
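For the SparkConf/Hadoop-configuration option, a minimal sketch; the bucket and path are placeholders and the property names are the standard s3n Hadoop keys:

    // Credentials can also be pushed into the Hadoop configuration used by textFile.
    sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", sys.env("AWS_ACCESS_KEY_ID"))
    sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", sys.env("AWS_SECRET_ACCESS_KEY"))
    val data = sc.textFile("s3n://my-bucket/path/to/data/*")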

Re: read all parquet files in a directory in spark-sql

2014-10-13 Thread Sadhan Sood
Thanks Nick, DB - that was helpful. On Mon, Oct 13, 2014 at 5:44 PM, DB Tsai wrote: > For now, with SparkSPARK-3462 parquet pushdown for unionAll PR, you > can do the following for unionAll schemaRDD. > > val files = Array("hdfs://file1.parquet", "hdfs://file2.parquet", > "hdfs://file3.parquet

How to construct graph in graphx

2014-10-13 Thread Soumitra Siddharth Johri
Hi, I am new to scala/graphx and am having problems converting a tsv file to a graph. I have a flat tab separated file like below:

n1 P1 n2
n3 P1 n4
n2 P2 n3
n3 P2 n1
n1 P3 n4
n3 P3 n2

where n1,n2,n3,n4 are the nodes of the graph and P1,P2,P3 are the properties which should form the edges b

Re: Breaking the previous large-scale sort record with Spark

2014-10-13 Thread Ilya Ganelin
Thank you for the details! Would you mind speaking to what tools proved most useful as far as identifying bottlenecks or bugs? Thanks again. On Oct 13, 2014 5:36 PM, "Matei Zaharia" wrote: > The biggest scaling issue was supporting a large number of reduce tasks > efficiently, which the JIRAs in

Re: S3 Bucket Access

2014-10-13 Thread Ranga
Hi Daniil Could you provide some more details on how the cluster should be launched/configured? The EC2 instance that I am dealing with uses the concept of IAMRoles. I don't have any "keyfile" to specify to the spark-ec2 script. Thanks for your help. - Ranga On Mon, Oct 13, 2014 at 3:04 PM, Dan

Re: Breaking the previous large-scale sort record with Spark

2014-10-13 Thread Krishna Sankar
Well done guys. MapReduce sort at that time was a good feat and Spark now has raised the bar with the ability to sort a PB. Like some of the folks in the list, a summary of what worked (and didn't) as well as the monitoring practices would be good. Cheers P.S: What are you folks planning next ? O

Re: How to construct graph in graphx

2014-10-13 Thread Ankur Dave
At 2014-10-13 18:22:44 -0400, Soumitra Siddharth Johri wrote: > I have a flat tab separated file like below: > > [...] > > where n1,n2,n3,n4 are the nodes of the graph and P1,P2,P3 are the > properties which should form the edges between the nodes. > > How can I construct a graph from the above f

Re: S3 Bucket Access

2014-10-13 Thread Daniil Osipov
There is detailed information available in the official documentation[1]. If you don't have a key pair, you can generate one as described in AWS documentation [2]. That should be enough to get started. [1] http://spark.apache.org/docs/latest/ec2-scripts.html [2] http://docs.aws.amazon.com/AWSEC2/l

Re: SparkSQL IndexOutOfBoundsException when reading from Parquet

2014-10-13 Thread Michael Armbrust
There are some known bugs with the parquet serde and Spark 1.1. You can try setting spark.sql.hive.convertMetastoreParquet=true to cause Spark SQL to use built-in parquet support when the serde looks like parquet. On Mon, Oct 13, 2014 at 2:57 PM, Terry Siu wrote: > I am currently using Spark 1.

Spark can't find jars

2014-10-13 Thread Andy Srine
Hi Guys, Spark rookie here. I am getting a file not found exception on the --jars. This is on the yarn cluster mode and I am running the following command on our recently upgraded Spark 1.1.1 environment. ./bin/spark-submit --verbose --master yarn --deploy-mode cluster --class myEngine --driver

Re: Spark can't find jars

2014-10-13 Thread Jimmy
Having the exact same error with the exact same jar... Do you work for Altiscale? :) J Sent from my iPhone > On Oct 13, 2014, at 5:33 PM, Andy Srine wrote: > > Hi Guys, > > Spark rookie here. I am getting a file not found exception on the --jars. > This is on the yarn cluster mode and I am

Re: Spark can't find jars

2014-10-13 Thread HARIPRIYA AYYALASOMAYAJULA
Hello, Can you check if the jar file is available in the target->scala-2.10 folder? When you use sbt package to make the jar file, that is where the jar file would be located. The following command works well for me: spark-submit --class “Classname" --master yarn-cluster jarfile(withcomplete

Re: Spark can't find jars

2014-10-13 Thread Sean McNamara
I’ve run into this as well. I haven’t had a chance to troubleshoot what exactly was going on, but I got around it by building my app as a single uberjar. Sean On Oct 13, 2014, at 6:40 PM, HARIPRIYA AYYALASOMAYAJULA <aharipriy...@gmail.com> wrote: Hello, Can you check if the jar file

Re: Spark can't find jars

2014-10-13 Thread Jimmy McErlain
That didn't seem to work... the jar files are in the target > scala-2.10 folder when I package, then I move the jar to the cluster and launch the app... still the same error... Thoughts? J

Re: Spark can't find jars

2014-10-13 Thread Jimmy McErlain
BTW this has always worked for me before, until we upgraded the cluster to Spark 1.1.1... J

Re: Spark can't find jars

2014-10-13 Thread HARIPRIYA AYYALASOMAYAJULA
Well, in the cluster, can you try copying the entire folder and then running? For example, my home folder, say helloWorld, consists of src, target, etc. Can you copy the entire folder to the cluster? I suspect it is looking for some dependencies and is missing them when it runs your jar file. Or if you

Re: Spark can't find jars

2014-10-13 Thread HARIPRIYA AYYALASOMAYAJULA
Or if it has something to do with the way you package your files - try another alternative method and see if it works On Monday, October 13, 2014, HARIPRIYA AYYALASOMAYAJULA < aharipriy...@gmail.com> wrote: > Well in the cluster, can you try copying the entire folder and then run? > For example m

Re: ClasssNotFoundExeception was thrown while trying to save rdd

2014-10-13 Thread Tao Xiao
Thanks Akhil. Both ways work for me, but I'd like to know why that exception was thrown. The class HBaseApp and related classes were all contained in my application jar, so why was com.xt.scala.HBaseApp$$anonfun$testHBase$1 not found? 2014-10-13 14:53 GMT+08:00 Akhil Das : > Adding your applica

Re: Processing order in Spark

2014-10-13 Thread Tobias Pfeiffer
Sean, thanks, I didn't know about repartitionAndSortWithinPartitions, that seems very helpful! Tobias

Can's create Kafka stream in spark shell

2014-10-13 Thread Gary Zhao
Hello, I'm trying to connect to Kafka in the Spark shell, but failed as below. Could you take a look at what I missed? scala> val kafkaStream = KafkaUtils.createStream(ssc, "test-vip.snc1:2181", "test_spark", Map("user-test"->1)) error: bad symbolic reference. A signature in KafkaUtils.class refers to term

some more heap space error

2014-10-13 Thread Chengi Liu
Hi, I posted a query yesterday and have tried out all the options given in the responses.. Basically, I am reading a very fat matrix (2000 by 50 dimension matrix) and am trying to run kmeans on it. I keep on getting a heap error.. Now, I am even using the persist(StorageLevel.DISK_ONLY_2) option.. Ho

Re: How to construct graph in graphx

2014-10-13 Thread Ankur Dave
At 2014-10-13 21:08:15 -0400, Soumitra Johri wrote: > There is no 'long' field in my file. So when I form the edge I get a type > mismatch error. Is it mandatory for GraphX that every vertex should have a > distinct id. ? > > in my case n1,n2,n3,n4 are all strings. (+user so others can see the s
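The reply above is cut off; a rough sketch of one common approach, assigning Long IDs to the string node names (the file path is a placeholder and the format follows the tab-separated file from the thread):

    import org.apache.spark.SparkContext._            // pair-RDD functions (collectAsMap, ...)
    import org.apache.spark.graphx.{Edge, Graph}

    // Each line is "src \t property \t dst"; GraphX vertex IDs must be Long,
    // so every distinct node name gets a unique Long first.
    val triples = sc.textFile("hdfs:///data/edges.tsv").map(_.split("\t"))
    val nodeIds = triples.flatMap(t => Seq(t(0), t(2))).distinct().zipWithUniqueId()
    val idOf = nodeIds.collectAsMap()                  // fine while the vertex set is small
    val edges = triples.map(t => Edge(idOf(t(0)), idOf(t(2)), t(1)))
    val graph = Graph(nodeIds.map(_.swap), edges)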

Re: Spark Cluster health check

2014-10-13 Thread Akhil Das
Hi Tarun, You can use Ganglia for monitoring the entire cluster, and if you want some more custom functionality like sending emails etc, then you can go after nagios. Thanks Best Regards On Tue, Oct 14, 2014 at 3:31 AM, Tarun Garg wrote: > Hi All, > > I am doing a POC and written a Job in java