re: How to incrementally compile spark examples using mvn

2014-11-19 Thread Sean Owen
Why not install them? It doesn't take any work and is the only correct way to do it. mvn install is all you need. On Nov 20, 2014 2:35 AM, "Yiming (John) Zhang" wrote: > Hi Sean, > > Thank you for your reply. I was wondering whether there is a method of > reusing locally-built components without

Re: Spark Streaming with Flume or Kafka?

2014-11-19 Thread Guillermo Ortiz
Thank you, but I'm only considering free options. 2014-11-20 7:53 GMT+01:00 Akhil Das : > You can also look at Amazon's Kinesis if you don't want to handle the > pain of maintaining Kafka/Flume infra. > > Thanks > Best Regards > > On Thu, Nov 20, 2014 at 3:32 AM, Guillermo Ortiz > wrote: >

Re: Joining DStream with static file

2014-11-19 Thread Akhil Das
1. You don't have to create another SparkContext; you can simply call *ssc.sparkContext*. 2. Maybe after the transformation on geoData you could do a persist, so next time it will be read from memory. Thanks Best Regards On Thu, Nov 20, 2014 at 6:43 AM, YaoPau wrote: > Here is my attempt:
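A minimal sketch of both suggestions combined (a Spark 1.1-style streaming job; the socket source and field layout are illustrative, not from the thread):

    import org.apache.spark.SparkConf
    import org.apache.spark.SparkContext._
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val sparkConf = new SparkConf().setAppName("LogCounter")
    val ssc = new StreamingContext(sparkConf, Seconds(2))

    // 1. Reuse the context the StreamingContext already holds.
    val geoData = ssc.sparkContext.textFile("data/geoRegion.csv")
      .map(_.split(','))
      .map(f => (f(0), (f(1), f(2), f(3))))
      .persist()  // 2. cached after the first batch, read from memory afterwards

    // Join each micro-batch against the cached static RDD.
    val logs = ssc.socketTextStream("localhost", 9999)
      .map(line => (line.split(',')(0), line))
    logs.transform(_.join(geoData)).print()

    ssc.start()
    ssc.awaitTermination()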

Re: Spark Streaming with Flume or Kafka?

2014-11-19 Thread Akhil Das
You can also look at Amazon's Kinesis if you don't want to handle the pain of maintaining Kafka/Flume infra. Thanks Best Regards On Thu, Nov 20, 2014 at 3:32 AM, Guillermo Ortiz wrote: > Thank you for your answer, I don't know if I typed the question > correctly. But your answer helps me. >

Re: Transform RDD.groupBY result to multiple RDDs

2014-11-19 Thread Sean Owen
What's your use case? You would not generally want to make so many small RDDs. On Nov 20, 2014 6:19 AM, "Dai, Kevin" wrote: > Hi, all > > > > Suppose I have an RDD of (K, V) tuples and I do groupBy on it with the key K. > > > > My question is how to make each groupBy result, which is (K, Iterable[V
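For reference, the only direct way to get one RDD per key is a filter per key, which is exactly the pattern being warned against here: each filter rescans the parent, so many distinct keys mean many jobs. A sketch, assuming an existing sc:

    import org.apache.spark.SparkContext._

    val pairs = sc.parallelize(Seq(("a", 1), ("a", 2), ("b", 3)))

    // Collect the distinct keys to the driver, then build one RDD per key.
    // Fine for a handful of keys; pathological for thousands.
    val keys = pairs.keys.distinct().collect()
    val perKey = keys.map(k => k -> pairs.filter(_._1 == k).values).toMap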

Re: Spark Streaming not working in YARN mode

2014-11-19 Thread Akhil Das
Make sure the executor cores are set to a value which is >= 2 while submitting the job (a streaming receiver permanently occupies one core, so with a single core nothing is left to process the data). Thanks Best Regards On Thu, Nov 20, 2014 at 10:36 AM, kam lee wrote: > I created a simple Spark Streaming program - it received numbers and > computed averages and sent the results to Kafka. > > It worked pe

Re: How to apply schema to queried data from Hive before saving it as parquet file?

2014-11-19 Thread Daniel Haviv
You can save the results as a parquet file or as a text file and create a Hive table based on these files. Daniel > On 20 Nov 2014, at 08:01, akshayhazari wrote: > > Sorry about the confusion I created. I just started learning this week. > Silly me, I was actually writing the schema to a
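A sketch of that flow (table name, columns, and path are illustrative, and STORED AS PARQUET assumes a Hive version that supports it):

    import org.apache.spark.sql.hive.HiveContext

    val hctx = new HiveContext(sc)

    // Save the query result as Parquet files...
    hctx.sql("SELECT * FROM sparkHive1").saveAsParquetFile("/user/hive/out_parquet")

    // ...then create a Hive table over those files.
    hctx.sql("CREATE EXTERNAL TABLE out_table (id INT, name STRING) " +
      "STORED AS PARQUET LOCATION '/user/hive/out_parquet'")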

Re: How to apply schema to queried data from Hive before saving it as parquet file?

2014-11-19 Thread akshayhazari
Sorry about the confusion I created. I just started learning this week. Silly me, I was actually writing the schema to a txt file and expecting records. This is what I was supposed to do. Also, if you could let me know about adding the data from the jsonFile/jsonRDD methods of hiveContext to hive

Re: How to view log on yarn-client mode?

2014-11-19 Thread Sandy Ryza
While the app is running, you can find logs from the YARN web UI by navigating to containers through the "Nodes" link. After the app has completed, you can use the YARN logs command: yarn logs -applicationId <app ID> -Sandy On Wed, Nov 19, 2014 at 6:01 PM, innowireless TaeYun Kim < taeyun@innowirele

Naive Bayes classification confidence

2014-11-19 Thread jatinpreet
I have been trying the Naive Bayes implementation in Spark's MLlib. During the testing phase, I wish to eliminate data with low prediction confidence. My data set primarily consists of form-based documents like reports and application forms. They contain key-value pair type text and hence I assume

Transform RDD.groupBY result to multiple RDDs

2014-11-19 Thread Dai, Kevin
Hi, all Suppose I have an RDD of (K, V) tuples and I do groupBy on it with the key K. My question is how to make each groupBy result, which is (K, Iterable[V]), an RDD. BTW, can we transform it into a DStream such that each groupBy result is an RDD in it? Best Regards, Kevin.

Spark Streaming not working in YARN mode

2014-11-19 Thread kam lee
I created a simple Spark Streaming program - it received numbers and computed averages and sent the results to Kafka. It worked perfectly in local mode as well as standalone master/slave mode across a two-node cluster. It did not work however in yarn-client or yarn-cluster mode. The job was acce

Re: How to apply schema to queried data from Hive before saving it as parquet file?

2014-11-19 Thread akshayhazari
Thanks for replying. I was unable to figure out how, after using jsonFile/jsonRDD, to load the data into a hive table. Also, I was able to save the SchemaRDD I got via hiveContext.sql(...).saveAsParquetFile(path), i.e. save a SchemaRDD as a parquet file, but when I tried to fetch data from the parquet file ba

insertIntoTable failure deleted pre-existing _metadata file

2014-11-19 Thread Daniel Haviv
Hello, I'm loading and saving json files into an existing directory with parquet files using the insertIntoTable method. If the method fails for some reason (differences in the schema in my case), the _metadata file of the parquet dir is automatically deleted, rendering the existing parquet file

Re: Can we make EdgeRDD and VertexRDD storage level to MEMORY_AND_DISK?

2014-11-19 Thread Harihar Nahak
Just figured it out: using the Graph constructor you can pass the storage level for both edges and vertices: Graph.fromEdges(edges, defaultValue = ("",""), StorageLevel.MEMORY_AND_DISK, StorageLevel.MEMORY_AND_DISK) Thanks to this post: https://issues.apache.org/jira/browse/SPARK-1991 - --Hari

How to view log on yarn-client mode?

2014-11-19 Thread innowireless TaeYun Kim
Hi, How can I view logs in yarn-client mode? When I insert the following line in the mapToPair function, for example: System.out.println("TEST TEST"); In local mode, it is displayed on the console. But in yarn-client mode, it is nowhere to be found. When I use the yarn resource manager web UI, the siz

Re: Cannot access data after a join (error: value _1 is not a member of Product with Serializable)

2014-11-19 Thread Tobias Pfeiffer
Hi, it looks like what you are trying to use as a Tuple cannot be inferred to be a Tuple by the compiler. Try adding type declarations and maybe you will see where things fail. Tobias
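A small illustration of how this error typically arises: mixed tuple shapes get inferred to their least upper bound, Product with Serializable, which has no _1.

    val flag = true
    val mixed = if (flag) (1, "a") else (1, "a", "b")  // inferred: Product with Serializable
    // mixed._1  // error: value _1 is not a member of Product with Serializable

    // An explicit annotation either fixes it or points at the offending expression:
    val pair: (Int, String) = (1, "a")
    println(pair._1)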

re: How to incrementally compile spark examples using mvn

2014-11-19 Thread Yiming (John) Zhang
Hi Sean, Thank you for your reply. I was wondering whether there is a method of reusing locally-built components without installing them? That is, if I have successfully built the spark project as a whole, how should I configure it so that I can incrementally build (only) the "spark-examples" s

Re: PairRDDFunctions with Tuple2 subclasses

2014-11-19 Thread Daniel Siegmann
Casting to Tuple2 is easy, but the output of reduceByKey is presumably a new Tuple2 instance so I'll need to map those to new instances of my class. Not sure how much overhead will be added by the creation of those new instances. If I do that everywhere in my code though, it will make the code rea

Joining DStream with static file

2014-11-19 Thread YaoPau
Here is my attempt:

    val sparkConf = new SparkConf().setAppName("LogCounter")
    val ssc = new StreamingContext(sparkConf, Seconds(2))
    val sc = new SparkContext()
    val geoData = sc.textFile("data/geoRegion.csv")
      .map(_.split(','))
      .map(line => (line(0), (line(1),line(2),line(3

Re: spark-shell giving me error of unread block data

2014-11-19 Thread Ritesh Kumar Singh
As Marcelo mentioned, the issue occurs mostly when incompatible classes are used by executors or drivers. Check whether the output appears in spark-shell. If yes, then most probably in your case there is some issue with your configuration files. It would be helpful if you could paste the conten

Re: PairRDDFunctions with Tuple2 subclasses

2014-11-19 Thread Michael Armbrust
I think you should also be able to get away with casting it back and forth in this case using .asInstanceOf. On Wed, Nov 19, 2014 at 4:39 PM, Daniel Siegmann wrote: > I have a class which is a subclass of Tuple2, and I want to use it with > PairRDDFunctions. However, I seem to be limited by the
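A sketch of the cast-and-map-back approach (MyPair stands in for the poster's Tuple2 subclass; assuming Spark 1.1, where the pair implicits come from SparkContext._):

    import org.apache.spark.SparkContext._
    import org.apache.spark.rdd.RDD

    class MyPair(k: String, v: Int) extends Tuple2[String, Int](k, v)

    val myPairs: RDD[MyPair] = sc.parallelize(Seq(new MyPair("a", 1), new MyPair("a", 2)))

    // Cast down to plain Tuple2 for reduceByKey, then map back to the subclass.
    val reduced = myPairs.asInstanceOf[RDD[(String, Int)]]
      .reduceByKey(_ + _)
      .map { case (k, v) => new MyPair(k, v) }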

PairRDDFunctions with Tuple2 subclasses

2014-11-19 Thread Daniel Siegmann
I have a class which is a subclass of Tuple2, and I want to use it with PairRDDFunctions. However, I seem to be limited by the invariance of T in RDD[T] (see SPARK-1296). My Scala-fu is weak: the only way I could think to make this work would be t

Re: spark-shell giving me error of unread block data

2014-11-19 Thread Anson Abraham
Sorry meant cdh 5.2 w/ spark 1.1. On Wed, Nov 19, 2014, 17:41 Anson Abraham wrote: > yeah CDH distribution (1.1). > > On Wed Nov 19 2014 at 5:29:39 PM Marcelo Vanzin > wrote: > >> On Wed, Nov 19, 2014 at 2:13 PM, Anson Abraham >> wrote: >> > yeah but in this case i'm not building any files. j

Re: NEW to spark and sparksql

2014-11-19 Thread Michael Armbrust
I would use just textFile unless you actually need a guarantee that you will be seeing a whole file at a time (textFile splits on newlines). RDDs are immutable, so you cannot add data to them. You can however union two RDDs, returning a new RDD that contains all the data. On Wed, Nov 19, 2014 at
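For example (paths are illustrative):

    // Each hour's files as their own RDD, unioned into a new immutable RDD...
    val hour00 = sc.textFile("hdfs:///logs/2014-11-19/00")
    val hour01 = sc.textFile("hdfs:///logs/2014-11-19/01")
    val both = hour00.union(hour01)

    // ...or read the whole directory (or a glob) as a single RDD in the first place.
    val allHours = sc.textFile("hdfs:///logs/2014-11-19/*")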

Spark Standalone Scheduling

2014-11-19 Thread TJ Klein
Hi, I am running some Spark code on my cluster in standalone mode. However, I have noticed that the most powerful machines (32 cores, 192 GB mem) hardly get any tasks, whereas my smaller machines (8 cores, 128 GB mem) all get plenty of tasks. The resources are all displayed correctly in the WebUI an

Re: spark-shell giving me error of unread block data

2014-11-19 Thread Anson Abraham
yeah CDH distribution (1.1). On Wed Nov 19 2014 at 5:29:39 PM Marcelo Vanzin wrote: > On Wed, Nov 19, 2014 at 2:13 PM, Anson Abraham > wrote: > > yeah but in this case i'm not building any files. just deployed out > config > > files in CDH5.2 and initiated a spark-shell to just read and output

Re: Reading nested JSON data with Spark SQL

2014-11-19 Thread Simone Franzini
This works great, thank you! Simone Franzini, PhD http://www.linkedin.com/in/simonefranzini On Wed, Nov 19, 2014 at 3:40 PM, Michael Armbrust wrote: > You can extract the nested fields in sql: SELECT field.nestedField ... > > If you don't do that then nested fields are represented as rows with

Re: spark-shell giving me error of unread block data

2014-11-19 Thread Marcelo Vanzin
On Wed, Nov 19, 2014 at 2:13 PM, Anson Abraham wrote: > yeah but in this case i'm not building any files. just deployed out config > files in CDH5.2 and initiated a spark-shell to just read and output a file. In that case it is a little bit weird. Just to be sure, you are using CDH's version of

Re: Spark on YARN

2014-11-19 Thread Sean Owen
I think your config may be the issue then. It sounds like one server is configured in a different YARN group that states it has far fewer resources than it actually does. On Wed, Nov 19, 2014 at 5:27 PM, Alan Prando wrote: > Hi all! > > Thanks for answering! > > @Sean, I tried to run with 30 executor-cores ,

Re: spark-shell giving me error of unread block data

2014-11-19 Thread Anson Abraham
yeah but in this case i'm not building any files. just deployed out config files in CDH5.2 and initiated a spark-shell to just read and output a file. On Wed Nov 19 2014 at 4:52:51 PM Marcelo Vanzin wrote: > Hi Anson, > > We've seen this error when incompatible classes are used in the driver >

Re: Spark Streaming with Flume or Kafka?

2014-11-19 Thread Guillermo Ortiz
Thank you for your answer, I don't know if I typed the question correctly, but your answer helps me. I'm going to ask the question again to check whether you understood me. I have this topology: DataSource1, ..., DataSourceN --> Kafka --> SparkStreaming --> HDFS DataSource1, ..., DataSourceN

Re: Strategies for reading large numbers of files

2014-11-19 Thread soojin
Hi Landon, I tried this but it didn't work for me. I get a Task not serializable exception: Caused by: java.io.NotSerializableException: org.apache.hadoop.conf.Configuration How do you make the org.apache.hadoop.conf.Configuration hadoopConfiguration available to tasks?
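One common workaround (a sketch, not from this thread; SerializableWritable is a Spark developer API and the paths are illustrative) is to wrap the Configuration and broadcast it:

    import org.apache.hadoop.fs.{FileSystem, Path}
    import org.apache.spark.SerializableWritable

    // Configuration is not serializable, so ship a wrapped copy to the tasks.
    val confBc = sc.broadcast(new SerializableWritable(sc.hadoopConfiguration))

    val sizes = sc.parallelize(Seq("/data/a", "/data/b")).map { p =>
      val fs = FileSystem.get(confBc.value.value)  // unwrap inside the task
      (p, fs.getFileStatus(new Path(p)).getLen)
    }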

Re: spark-shell giving me error of unread block data

2014-11-19 Thread Marcelo Vanzin
Hi Anson, We've seen this error when incompatible classes are used in the driver and executors (e.g., same class name, but the classes are different and thus the serialized data is different). This can happen for example if you're including some 3rd party libraries in your app's jar, or changing t

Re: NEW to spark and sparksql

2014-11-19 Thread Michael Armbrust
In general you should be able to read full directories of files as a single RDD/SchemaRDD. For documentation I'd suggest the programming guides: http://spark.apache.org/docs/latest/quick-start.html http://spark.apache.org/docs/latest/sql-programming-guide.html For Avro in particular, I have been

Re: Reading nested JSON data with Spark SQL

2014-11-19 Thread Michael Armbrust
You can extract the nested fields in sql: SELECT field.nestedField ... If you don't do that then nested fields are represented as rows within rows and can be retrieved as follows: t.getAs[Row](0).getInt(0) Also, I would write t.getAs[Buffer[CharSequence]](12) as t.getAs[Seq[String]](12) since we
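Putting those pieces together (field names are illustrative; sqc is the SQLContext from the original post):

    import org.apache.spark.sql.Row

    val people = sqc.jsonFile("people.json")
    people.registerTempTable("people")

    // Flatten in SQL...
    val cities = sqc.sql("SELECT address.city FROM people")

    // ...or drill into the nested Row by position.
    val alsoCities = sqc.sql("SELECT address FROM people").map { t =>
      t.getAs[Row](0).getString(0)
    }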

[SQL]Proper use of spark.sql.thriftserver.scheduler.pool

2014-11-19 Thread Yana Kadiyska
Hi sparkers, I'm trying to use spark.sql.thriftserver.scheduler.pool for the first time (earlier I was stuck because of https://issues.apache.org/jira/browse/SPARK-4037). I have two pools set up and would like to issue a query against the "low priority" pool. I am doing t

How to get list of edges between two Vertex ?

2014-11-19 Thread Harihar Nahak
Hi, I have a graph where more than one edge between two vertices is possible. Now I need to find the top vertex pairs with the most calls between them. The output should look like (V1, V2, no. of edges). So I need to know how to find the total no. of edges between two verti
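A sketch of one way to do this with GraphX (the toy edges are illustrative; note this treats (a -> b) and (b -> a) as different pairs, so normalize the pair ordering first if direction should not matter):

    import org.apache.spark.SparkContext._
    import org.apache.spark.graphx.{Edge, Graph}

    val edges = sc.parallelize(Seq(
      Edge(1L, 2L, "call"), Edge(1L, 2L, "call"), Edge(2L, 3L, "call")))
    val graph = Graph.fromEdges(edges, defaultValue = "user")

    // Count parallel edges per (src, dst) pair and take the busiest pairs.
    val topPairs = graph.edges
      .map(e => ((e.srcId, e.dstId), 1))
      .reduceByKey(_ + _)
      .sortBy(_._2, ascending = false)
      .take(10)
      .map { case ((src, dst), n) => (src, dst, n) }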

Reading nested JSON data with Spark SQL

2014-11-19 Thread Simone Franzini
I have been using Spark SQL to read in JSON data, like so: val myJsonFile = sqc.jsonFile(args("myLocation")) myJsonFile.registerTempTable("myTable") sqc.sql("mySQLQuery").map { row => myFunction(row) } And then in myFunction(row) I can read the various columns with the Row.getX methods. However, t

Can we make EdgeRDD and VertexRDD storage level to MEMORY_AND_DISK?

2014-11-19 Thread Harihar Nahak
Hi, I'm running out of memory when I run a GraphX program on a dataset of more than 10 GB. This was handled pretty well by normal Spark operations when using StorageLevel.MEMORY_AND_DISK. In the case of GraphX I found it only allows storing in memory, and that is because in the Graph constructor, this pro

Re: spark-shell giving me error of unread block data

2014-11-19 Thread Anson Abraham
Question... when you say different versions, do you mean different versions of dependency files? What are the dependency files for spark? On Tue Nov 18 2014 at 5:27:18 PM Anson Abraham wrote: > when cdh cluster was running, i did not set up spark role. When I did for > the first time, it was working ie,

Re: Efficient way to split an input data set into different output files

2014-11-19 Thread Nicholas Chammas
I don't have a solution for you, but it sounds like you might want to follow this issue: SPARK-3533 - Add saveAsTextFileByKey() method to RDDs On Wed Nov 19 2014 at 6:41:11 AM Tom Seddon wrote: > I'm trying to set up a PySpark ETL job that take
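Until that lands, the usual workaround (a sketch in Scala using Hadoop's MultipleTextOutputFormat; the event data is illustrative) is:

    import org.apache.hadoop.io.NullWritable
    import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat
    import org.apache.spark.SparkContext._

    // Name each output file after the record's key.
    class KeyedOutput extends MultipleTextOutputFormat[Any, Any] {
      override def generateFileNameForKeyValue(key: Any, value: Any, name: String) =
        key.toString + "/" + name
      override def generateActualKey(key: Any, value: Any): Any =
        NullWritable.get()  // keep the key out of the file contents
    }

    val events = sc.parallelize(Seq(("click", "log line 1"), ("view", "log line 2")))
    events.saveAsHadoopFile("/out/events", classOf[Any], classOf[Any], classOf[KeyedOutput])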

Re: Bug in Accumulators...

2014-11-19 Thread Jake Mannix
I'm running into similar problems with accumulators failing to serialize properly. Are there any examples of accumulators being used in more complex environments than simply initializing them in the same class and then using them in a .foreach() on an RDD referenced a few lines below? From the a
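For contrast, the simple pattern that does work keeps everything in one scope (a sketch, assuming an existing sc; the path is illustrative):

    // Created on the driver, incremented in tasks, read back on the driver.
    val errorCount = sc.accumulator(0)
    sc.textFile("/var/log/app.log").foreach { line =>
      if (line.contains("ERROR")) errorCount += 1
    }
    println(errorCount.value)  // only valid to read on the driver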

NEW to spark and sparksql

2014-11-19 Thread Sam Flint
Hi, I am new to spark. I have begun to read to understand Spark's RDDs as well as SparkSQL. My question is more about how to build out the RDDs and best practices. I have data that is broken down by hour into files on HDFS in avro format. Do I need to create a separate RDD for each

Re: Spark Streaming with Flume or Kafka?

2014-11-19 Thread Hari Shreedharan
Btw, if you want to write to Spark Streaming from Flume -- there is a sink (it is a part of Spark, not Flume). See Approach 2 here: http://spark.apache.org/docs/latest/streaming-flume-integration.html On Wed, Nov 19, 2014 at 12:41 PM, Hari Shreedharan < hshreedha...@cloudera.com> wrote: > As of

Re: Spark Streaming with Flume or Kafka?

2014-11-19 Thread Hari Shreedharan
As of now, you can feed Spark Streaming from both kafka and flume. Currently though there is no API to write data back to either of the two directly. I sent a PR which should eventually add something like this: https://github.com/harishreedharan/spark/blob/Kafka-output/external/kafka/src/main/scal

Re: rack-topology.sh no such file or directory

2014-11-19 Thread Matei Zaharia
Your Hadoop configuration is set to look for this file to determine racks. Is the file present on cluster nodes? If not, look at your hdfs-site.xml and remove the setting for a rack topology script there (or it might be in core-site.xml). Matei > On Nov 19, 2014, at 12:13 PM, Arun Luthra wrot

Re: Converting a json struct to map

2014-11-19 Thread Yin Huai
Oh, actually, we do not support MapType provided by the schema given to jsonRDD at the moment (my bad...). Daniel, you need to wait for the patch for SPARK-4476 (I should have one soon). Thanks, Yin On Wed, Nov 19, 2014 at 2:32 PM, Daniel Haviv wrote: > Thank you Michael > I will try it out tomorrow >

rack-topology.sh no such file or directory

2014-11-19 Thread Arun Luthra
I'm trying to run Spark on Yarn on a hortonworks 2.1.5 cluster. I'm getting this error: 14/11/19 13:46:34 INFO cluster.YarnClientSchedulerBackend: Registered executor: Actor[akka.tcp://sparkExecutor@#/user/Executor#- 2027837001] with ID 42 14/11/19 13:46:34 WARN net.ScriptBasedMapping:

querying data from Cassandra through the Spark SQL Thrift JDBC server

2014-11-19 Thread Mohammed Guller
Hi - I was curious if anyone is using the Spark SQL Thrift JDBC server with Cassandra. It would be great if you could share how you got it working. For example, what config changes have to be done in hive-site.xml, what additional jars are required, etc.? I have a Spark app that can programm

Re: spark streaming and the spark shell

2014-11-19 Thread Tian Zhang
I am hitting the same issue, i.e., after running for some time, if the spark streaming job loses or times out the kafka connection, it will just start to return empty RDDs... Is there a timeline for when this issue will be fixed so that I can plan accordingly? Thanks. Tian

Re: Converting a json struct to map

2014-11-19 Thread Daniel Haviv
Thank you Michael I will try it out tomorrow Daniel > On 19 Nov 2014, at 21:07, Michael Armbrust wrote: > > You can override the schema inference by passing a schema as the second > argument to jsonRDD, however that's not a super elegant solution. We are > considering one option to make thi

Re: [SQL] HiveThriftServer2 failure detection

2014-11-19 Thread Yana Kadiyska
https://issues.apache.org/jira/browse/SPARK-4497 On Wed, Nov 19, 2014 at 1:48 PM, Michael Armbrust wrote: > This is not by design. Can you please file a JIRA? > > On Wed, Nov 19, 2014 at 9:19 AM, Yana Kadiyska > wrote: > >> Hi all, I am running HiveThriftServer2 and noticed that the process st

Re: Converting a json struct to map

2014-11-19 Thread Michael Armbrust
You can override the schema inference by passing a schema as the second argument to jsonRDD, however that's not a super elegant solution. We are considering one option to make this easier here: https://issues.apache.org/jira/browse/SPARK-4476 On Tue, Nov 18, 2014 at 11:06 PM, Akhil Das wrote: >
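A sketch of passing an explicit schema (field names are illustrative; per Yin's follow-up below, a MapType in this schema won't work yet):

    import org.apache.spark.sql._

    val sqlContext = new SQLContext(sc)
    val json = sc.parallelize(Seq("""{"name": "a", "city": "b"}"""))

    // Hand jsonRDD a schema instead of letting it infer one.
    val schema = StructType(Seq(
      StructField("name", StringType, nullable = true),
      StructField("city", StringType, nullable = true)))
    val table = sqlContext.jsonRDD(json, schema)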

Re: Shuffle Intensive Job: sendMessageReliably failed because ack was not received within 60 sec

2014-11-19 Thread Michael Armbrust
That error can mean a whole bunch of things (and we've been working recently to make it more descriptive). Often the actual cause is in the executor logs. On Wed, Nov 19, 2014 at 10:50 AM, Gary Malouf wrote: > Has anyone else received this type of error? We are not sure what the > issue is

Re: tableau spark sql cassandra

2014-11-19 Thread Michael Armbrust
The whole stack trace/exception would be helpful. Hive is an optional dependency of Spark SQL, but you will need to include it if you are planning to use the Thrift server to connect to Tableau. You can enable it by adding -Phive when you build Spark. You might also try asking on the cassandra maili

Re: How to apply schema to queried data from Hive before saving it as parquet file?

2014-11-19 Thread Michael Armbrust
I am not very familiar with the JSONSerDe for Hive, but in general you should not need to manually create a schema for data that is loaded from hive. You should just be able to call saveAsParquetFile on any SchemaRDD that is returned from hctx.sql(...). I'd also suggest you check out the jsonFile

Re: Merging Parquet Files

2014-11-19 Thread Michael Armbrust
On Wed, Nov 19, 2014 at 12:41 AM, Daniel Haviv wrote: > > Another problem I have is that I get a lot of small json files and as a > result a lot of small parquet files, I'd like to merge the json files into > a few parquet files... how do I do that? > You can use `coalesce` on any RDD to merge files.
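A sketch, reusing the paths from the original post (assuming SchemaRDD's coalesce override, which keeps the schema):

    // Each partition becomes one output file, so coalescing to a small
    // number of partitions rewrites many small files as a few larger ones.
    val small = sqlContext.parquetFile("/requests_parquet")
    small.coalesce(4).saveAsParquetFile("/requests_parquet_merged")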

Shuffle Intensive Job: sendMessageReliably failed because ack was not received within 60 sec

2014-11-19 Thread Gary Malouf
Has anyone else received this type of error? We are not sure what the issue is nor how to correct it to get our job to complete...

Re: SparkSQL and Hive/Hive metastore testing - LocalHiveContext

2014-11-19 Thread Michael Armbrust
On Tue, Nov 18, 2014 at 10:34 PM, Night Wolf wrote: > > Is there a better way to mock this out and test Hive/metastore with > SparkSQL? > I would use TestHive which creates a fresh metastore each time it is invoked.
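For example (a sketch; TestHive ships in the spark-hive test artifact, so it needs to be on the test classpath):

    import org.apache.spark.sql.hive.test.TestHive

    // TestHive stands up a throwaway local metastore and warehouse.
    TestHive.sql("CREATE TABLE IF NOT EXISTS t (key INT, value STRING)")
    TestHive.sql("SHOW TABLES").collect().foreach(println)

    TestHive.reset()  // wipe state between tests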

Re: [SQL] HiveThriftServer2 failure detection

2014-11-19 Thread Michael Armbrust
This is not by design. Can you please file a JIRA? On Wed, Nov 19, 2014 at 9:19 AM, Yana Kadiyska wrote: > Hi all, I am running HiveThriftServer2 and noticed that the process stays > up even though there is no driver connected to the Spark master. > > I started the server via sbin/start-thrifts

Re: Debugging spark java application

2014-11-19 Thread Akhil Das
For debugging you can refer these two threads http://apache-spark-user-list.1001560.n3.nabble.com/How-do-you-hit-breakpoints-using-IntelliJ-In-functions-used-by-an-RDD-td12754.html http://mail-archives.apache.org/mod_mbox/spark-user/201410.mbox/%3ccahuq+_ygfioj2aa3e2zsh7zfsv_z-wsorhvbipahxjlm2fj.

[SQL] HiveThriftServer2 failure detection

2014-11-19 Thread Yana Kadiyska
Hi all, I am running HiveThriftServer2 and noticed that the process stays up even though there is no driver connected to the Spark master. I started the server via sbin/start-thriftserver and my namenodes are currently not operational. I can see from the log that there was an error on startup: 14

tableau spark sql cassandra

2014-11-19 Thread jererc
Hello! I'm working on a POC with Spark SQL, where I'm trying to get data from Cassandra into Tableau using Spark SQL. Here is the stack:
- cassandra (v2.1)
- spark SQL (pre-built v1.1, hadoop v2.4)
- cassandra / spark sql connector (https://github.com/datastax/spark-cassandra-connector)
- hive
- m

Re: Spark on YARN

2014-11-19 Thread Alan Prando
Hi all! Thanks for answering! @Sean, I tried to run with 30 executor-cores, and 1 machine was still left without processing. @Vanzin, I checked the RM's web UI, and all nodes were detected and "RUNNING". The interesting fact is that the available memory and available cores of one node were different from the other two, wit

Spark Streaming with Flume or Kafka?

2014-11-19 Thread Guillermo Ortiz
Hi, I'm starting with Spark and I'm just trying to understand: if I want to use Spark Streaming, should I feed it with Flume or Kafka? I think there's no official sink for Flume to Spark Streaming, and it seems that Kafka fits better since it gives you reliability. Could someone give a good scen

Re: Getting spark job progress programmatically

2014-11-19 Thread Mark Hamstra
This is already being covered by SPARK-2321 and SPARK-4145. There are pull requests that are already merged or already very far along -- e.g., https://github.com/apache/spark/pull/3009 If there is anything that needs to be added, please add it to those issues or PRs. On Wed, Nov 19, 2014 at 7:55

Re: Getting spark job progress programmatically

2014-11-19 Thread Aniket Bhatnagar
Thanks for pointing this out Mark. Had totally missed the existing JIRA items On Wed Nov 19 2014 at 21:42:19 Mark Hamstra wrote: > This is already being covered by SPARK-2321 and SPARK-4145. There are > pull requests that are already merged or already very far along -- e.g., > https://github.co

Re: Getting spark job progress programmatically

2014-11-19 Thread Aniket Bhatnagar
I have for now submitted a JIRA ticket @ https://issues.apache.org/jira/browse/SPARK-4473. I will collate all my experiences (& hacks) and submit them as a feature request for public API. On Tue Nov 18 2014 at 20:35:00 andy petrella wrote: > yep, we should also propose to add this stuffs in the p

Re: Is sorting persisted after pair rdd transformations?

2014-11-19 Thread Aniket Bhatnagar
Thanks Daniel :-). It seems to make sense and something I was hoping for. I will proceed with this assumption and report back if I see any anomalies. On Wed Nov 19 2014 at 19:30:02 Daniel Darabos < daniel.dara...@lynxanalytics.com> wrote: > Ah, so I misunderstood you too :). > > My reading of org

Re: Cannot access data after a join (error: value _1 is not a member of Product with Serializable)

2014-11-19 Thread Olivier Girardot
Can you please post the full source of your code and some sample data to run it on? 2014-11-19 16:23 GMT+01:00 YaoPau : > I joined two datasets together, and my resulting logs look like this: > > > (975894369,((72364,20141112T170627,web,MEMPHIS,AR,US,Central),(Male,John,Smith))) > > (253142991,(

Re: Why is ALS class serializable ?

2014-11-19 Thread Cheng Lian
When a field of an object is referenced in a closure, the object itself is also enclosed automatically, thus the object needs to be serializable. On 11/19/14 6:39 PM, Hao Ren wrote: Hi, When reading through the ALS code, I find that: class ALS private ( private var numUserBlocks: Int, priv
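A small illustration of that capture rule (nothing ALS-specific; Model is a made-up class):

    import org.apache.spark.rdd.RDD

    class Model {
      val factor = 2  // a field, not a local variable

      // `factor` compiles to `this.factor`, so the closure captures the
      // whole Model instance, which must therefore be serializable.
      def scale(rdd: RDD[Int]): RDD[Int] = rdd.map(_ * factor)

      // Copy to a local first and only `f` is captured.
      def scaleSafely(rdd: RDD[Int]): RDD[Int] = {
        val f = factor
        rdd.map(_ * f)
      }
    }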

Cannot access data after a join (error: value _1 is not a member of Product with Serializable)

2014-11-19 Thread YaoPau
I joined two datasets together, and my resulting logs look like this: (975894369,((72364,20141112T170627,web,MEMPHIS,AR,US,Central),(Male,John,Smith))) (253142991,((30058,20141112T171246,web,ATLANTA16,GA,US,Southeast),(Male,Bob,Jones))) (295305425,((28110,20141112T170454,iph,CHARLOTTE2,NC,US,South

"can not found scala.reflect related methods" when running spark program

2014-11-19 Thread Dingfei Zhang
Hi, I wrote the simple spark code below, and hit a runtime issue where the system seems unable to find some methods of the scala.reflect library.

    package org.apache.spark.examples
    import scala.io.Source
    import scala.reflect._
    import scala.reflect.api.JavaUniverse
    import scala.reflect.runtime.universe i

GraphX bug re-opened

2014-11-19 Thread Gary Malouf
We keep running into https://issues.apache.org/jira/browse/SPARK-2823 when trying to use GraphX. The cost of repartitioning the data is really high for us (lots of network traffic) which is killing the job performance. I understand the bug was reverted to stabilize unit tests, but frankly it make

Debugging spark java application

2014-11-19 Thread Mukesh Jha
Hello experts, Is there an easy way to debug a spark java application? I'm putting debug logs in the map's function but there aren't any logs on the console. Also, can I include my custom jars while launching spark-shell and do my POC there? This might be a naive question but any help here is ap

Re: Is sorting persisted after pair rdd transformations?

2014-11-19 Thread Daniel Darabos
Ah, so I misunderstood you too :). My reading of org/apache/spark/Aggregator.scala is that your function will always see the items in the order that they are in the input RDD. An RDD partition is always accessed as an iterator, so it will not be read out of order. On Wed, Nov 19, 2014 at 2:28 PM

Re: Is sorting persisted after pair rdd transformations?

2014-11-19 Thread Aniket Bhatnagar
Thanks Daniel. I can understand that the keys will not be in sorted order but what I am trying to understanding is whether the functions are passed values in sorted order in a given partition. For example: sc.parallelize(1 to 8).map(i => (1, i)).sortBy(t => t._2).foldByKey(0)((a, b) => b).collect

Re: Is sorting persisted after pair rdd transformations?

2014-11-19 Thread Daniel Darabos
Akhil, I think Aniket uses the word "persisted" in a different way than what you mean. I.e. not in the RDD.persist() way. Aniket asks if running combineByKey on a sorted RDD will result in a sorted RDD. (I.e. the sorting is preserved.) I think the answer is no. combineByKey uses AppendOnlyMap, whi

Efficient way to split an input data set into different output files

2014-11-19 Thread Tom Seddon
I'm trying to set up a PySpark ETL job that takes in JSON log files and spits out fact table files for upload to Redshift. Is there an efficient way to send different event types to different outputs without having to just read the same cached RDD twice? I have my first RDD which is just a json p

Re: Understanding spark operation pipeline and block storage

2014-11-19 Thread Hao Ren
Anyone have an idea on this? Thx

Why is ALS class serializable ?

2014-11-19 Thread Hao Ren
Hi, When reading through the ALS code, I find that:

    class ALS private (
      private var numUserBlocks: Int,
      private var numProductBlocks: Int,
      private var rank: Int,
      private var iterations: Int,
      private var lambda: Double,
      private var implicitPrefs: Boolean,
      private var alpha:

RE: Spark to eliminate full-table scan latency

2014-11-19 Thread bchazalet
You can serve queries over your RDD data, yes, and return results to the user/client as long as your driver is alive. For example, I have built a play! application that acts as a driver (creating a spark context), loads up data from my database, organizes it, and subsequently receives and processes use

Re: Merging Parquet Files

2014-11-19 Thread Daniel Haviv
Very cool thank you! On Wed, Nov 19, 2014 at 11:15 AM, Marius Soutier wrote: > You can also insert into existing tables via .insertInto(tableName, > overwrite). You just have to import sqlContext._ > > On 19.11.2014, at 09:41, Daniel Haviv wrote: > > Hello, > I'm writing a process that ingests

Getting Parts of Iterables in Function's call method

2014-11-19 Thread jelgh
Hello, I run groupBy on a JavaRDD so that I get a JavaPairRDD<K, Iterable<V>>. If I then run, for instance, a reduceByKey, could I get partitions of the grouped Iterable in the reduce function's call method? Or will I always get a full group's Iterable? If you always get a full group's Iterable, you know you

How to apply schema to queried data from Hive before saving it as parquet file?

2014-11-19 Thread akshayhazari
The code below contains a part which creates a table in hive from data, and another part below it which creates a Schema. *Now if I try to save the queried data as a parquet file, hctx.sql("Select * from sparkHive1") returns me a SchemaRDD which contains records from the table.* hctx.sq

Re: Merging Parquet Files

2014-11-19 Thread Marius Soutier
You can also insert into existing tables via .insertInto(tableName, overwrite). You just have to import sqlContext._ On 19.11.2014, at 09:41, Daniel Haviv wrote: > Hello, > I'm writing a process that ingests json files and saves them a parquet files. > The process is as such: > > val sqlContex
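Putting the thread's pieces together, a sketch of what that could look like for the original process:

    val sqlContext = new org.apache.spark.sql.SQLContext(sc)
    import sqlContext._  // implicits that insertInto relies on

    // Register the existing parquet data as a table...
    sqlContext.parquetFile("/requests_parquet").registerTempTable("requests_parquet")

    // ...and append each batch of ingested json to it.
    val jsonRequests = sqlContext.jsonFile("/requests")
    jsonRequests.insertInto("requests_parquet")  // pass overwrite = true to replace instead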

Merging Parquet Files

2014-11-19 Thread Daniel Haviv
Hello, I'm writing a process that ingests json files and saves them as parquet files. The process is as such:

    val sqlContext = new org.apache.spark.sql.SQLContext(sc)
    val jsonRequests = sqlContext.jsonFile("/requests")
    val parquetRequests = sqlContext.parquetFile("/requests_parquet")
    jsonRequests.regi

Re: Is sorting persisted after pair rdd transformations?

2014-11-19 Thread Akhil Das
If something is persisted you can easily see it under the Storage tab in the web UI. Thanks Best Regards On Tue, Nov 18, 2014 at 7:26 PM, Aniket Bhatnagar < aniket.bhatna...@gmail.com> wrote: > I am trying to figure out if sorting is persisted after applying Pair RDD > transformations and I am