I looked at the documentation on this. It is not clear what goes on behind
the scenes; there is very little documentation on it.
First, in Hive a database has to exist before it can be used, so sql("use
mytable") will not create a database for you.
Also, you cannot call your table mytable in database mytable!
Reme
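For example, a minimal sketch of creating the database explicitly first (the database and table names are placeholders; this assumes a HiveContext bound to sqlContext, as in spark-shell):

// "USE mydb" alone will not create the database, so create it explicitly first
sqlContext.sql("CREATE DATABASE IF NOT EXISTS mydb")
sqlContext.sql("USE mydb")
sqlContext.sql("CREATE TABLE IF NOT EXISTS mytab (id INT, name STRING)")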
That's very useful information.
The reason for the weird problem is the non-determinism of the RDD
before applying randomSplit.
By caching the RDD, we make it deterministic, and so the problem is
solved.
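For reference, a minimal sketch of that fix (the RDD name, weights and seed are placeholders):

val data = rawRdd.cache()   // cache so both splits see the same materialized data
data.count()                // force materialization before splitting
val Array(train, test) = data.randomSplit(Array(0.8, 0.2), seed = 42L)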
Thank you for your help.
2016-02-21 11:12 GMT+07:00 Ted Yu :
> Have you looked at:
>
I have a DataFrame that has a Python dict() as one of the columns. I'd like
to filter the DataFrame for those Rows where the dict() contains a
specific value, e.g. something like this:
DF2 = DF1.filter('name' in DF1.params)
but that gives me this error
ValueError: Cannot convert column i
Hello Spark users,
I have to aggregate messages from Kafka and, at some fixed interval (say
every half hour), update a memory-persisted RDD and run some computation.
This computation uses the last one day of data. Steps are:
- Read from realtime Kafka topic X in spark streaming batches of 5 seconds
- Filt
It sounds like another window operation on top of the 30-min window will
achieve the desired objective.
Just keep in mind that you'll need to set the cleaner TTL (spark.cleaner.ttl)
to a long enough value, and you will require enough resources (mem & disk)
to keep the required data.
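A rough sketch of what that could look like (keyedStream is an assumed DStream[(String, Long)] built from the Kafka input; the window lengths follow the question):

import org.apache.spark.streaming.Minutes
// 30-minute aggregation, emitted every 30 minutes
val halfHourly = keyedStream.reduceByKeyAndWindow(
  (a: Long, b: Long) => a + b, Minutes(30), Minutes(30))
// a second window on top of it, covering the last day of half-hourly results
val lastDay = halfHourly.window(Minutes(24 * 60), Minutes(30))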
-kr, Gerard.
O
I am trying to parse an xml file using spark-xml. But for some reason when I
print the schema it only shows root instead of the hierarchy. I am using
sqlContext to read the data. I am proceeding according to this video:
https://www.youtube.com/watch?v=NemEp53yGbI
The structure of the xml file is somewhat l
Can you paste the code you are using?
On Sun, 21 Feb 2016, 13:19 Prathamesh Dharangutte
wrote:
> I am trying to parse xml file using spark-xml. But for some reason when i
> print schema it only shows root instead of the hierarchy. I am using
> sqlcontext to read the data. I am proceeding accord
This is the code I am using for parsing xml file:
import org.apache.spark.{SparkConf,SparkContext}
import org.apache.spark.sql.{DataFrame,SQLContext}
import com.databricks.spark.xml
object XmlProcessing {
def main(args : Array[String]) = {
val conf = new SparkConf()
.setAppName("
Just ran that code and it works fine, here is the output:
What version are you using?
val ctx = SQLContext.getOrCreate(sc)
val df = ctx.read.format("com.databricks.spark.xml").option("rowTag",
"book").load("file:///tmp/sample.xml")
df.printSchema()
root
|-- name: long (nullable = true)
|-- ord
I am using spark 1.4.0 with scala 2.10.4 and 0.3.2 of spark-xml.
Orderid is empty for some books and there are multiple entries of it for
other books; did you include that in your xml file?
No because you didn't say that explicitly. Can you share a sample file too?
On Sun, 21 Feb 2016, 14:34 Prathamesh Dharangutte
wrote:
> I am using spark 1.4.0 with scala 2.10.4 and 0.3.2 of spark-xml
> Orderid is empty for some books and multiple entries of it for other
> books,did you include
Hi, everyone.
I have a spark program, where df0 is a DataFrame
val idList = List("1", "2", "3", ...)
val rdd0 = df0.rdd
val schema = df0.schema
val rdd1 = rdd0.filter(r => !idList.contains(r(0)))
val df1 = sc.createDataFrame(rdd1, schema)
When I run df1.count(), I got the following Exception at
I tried the following in spark-shell:
scala> val df0 = Seq(("a", "b", "c", 3), ("c", "b", "a", 3)).toDF("A", "B",
"C", "num")
df0: org.apache.spark.sql.DataFrame = [A: string, B: string ... 2 more
fields]
scala> val idList = List("1", "2", "3")
idList: List[String] = List(1, 2, 3)
scala> val rdd
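For reference, the same filter can also be expressed directly on the DataFrame without dropping to the RDD; a sketch assuming Spark 1.5+ (where Column.isin is available) and reusing the names from the shell session above:

import org.apache.spark.sql.functions.col
val df0 = Seq(("a", "b", "c", 3), ("c", "b", "a", 3)).toDF("A", "B", "C", "num")
val idList = List("1", "2", "3")
// keep only rows whose column "A" is NOT in idList
val df1 = df0.filter(!col("A").isin(idList: _*))
df1.count()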
Make sure the xml input file is well formed (check your end tags).
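A quick local well-formedness check before handing the file to spark-xml (a sketch; scala.xml ships with Scala 2.10/2.11 and the path is a placeholder):

// throws org.xml.sax.SAXParseException, pointing at the offending line,
// if the file is not well formed (e.g. a missing end tag)
val doc = scala.xml.XML.loadFile("/tmp/sample.xml")
println(doc.label)   // prints the root element name if parsing succeeded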
Sent from my iPhone
> On Feb 21, 2016, at 8:14 AM, Prathamesh Dharangutte
> wrote:
>
> This is the code I am using for parsing xml file:
>
>
>
> import org.apache.spark.{SparkConf,SparkContext}
> import org.apache.spark.sq
Hello,
I have input lines like below
*Input*
t1, file1, 1, 1, 1
t1, file1, 1, 2, 3
t1, file2, 2, 2, 2, 2
t2, file1, 5, 5, 5
t2, file2, 1, 1, 2, 2
and I want to achieve the output below, which is a vertical (element-wise)
addition of the corresponding numbers.
*Output*
"file1" : [ 1+1+5, 1+2+5, 1+3+5
w.r.t. cleaner TTL, please see:
[SPARK-7689] Remove TTL-based metadata cleaning in Spark 2.0
FYI
On Sun, Feb 21, 2016 at 4:16 AM, Gerard Maas wrote:
>
> It sounds like another window operation on top of the 30-min window will
> achieve the desired objective.
> Just keep in mind that you'll n
Thanks a lot!
Best Regards,
Weiwei
On Sat, Feb 20, 2016 at 11:53 PM, Hemant Bhanawat
wrote:
> toDF internally calls sqlcontext.createDataFrame which transforms the RDD
> to RDD[InternalRow]. This RDD[InternalRow] is then mapped to a dataframe.
>
> Type conversions (from scala types to catalyst
Hi,
I'm running a spark streaming application on a spark cluster that spans 6
machines/workers. I'm using spark standalone cluster mode. Each machine has
8 cores. Is there any way to specify that I want to run my application on
all 6 machines and use just 2 cores on each machine?
Thanks
Hello Vinti,
One way to get this done is to split your input line into a key and value
tuple and then simply use groupByKey and handle the values the way
you want. For example:
Assuming you have already split the values into a 5 tuple:
myDStream.map(record => (record._2, (record._3, record
Thanks for your reply Jatin. I changed my parsing logic to what you
suggested:
def parseCoverageLine(str: String) = {
val arr = str.split(",")
...
...
(arr(0), arr(1) :: count.toList) // (test, [file, 1, 1, 2])
}
Then in the grouping, can I use a global hash map p
Never mind, it seems an executor-level mutable map is not recommended, as
stated in
http://blog.guillaume-pitel.fr/2015/06/spark-trick-shot-node-centered-aggregator/
On Sun, Feb 21, 2016 at 9:54 AM, Vinti Maheshwari
wrote:
> Thanks for your reply Jatin. I changed my parsing logic to what you
> s
You will need to do a collect and update a global map if you want to.
myDStream.map(record => (record._2, (record._3, record._4, record._5)))
.groupByKey((r1, r2) => (r1._1 + r2._1, r1._2 + r2._2, r1._3 +
r2._3))
.foreachRDD(rdd => {
rdd.collect().foreach((fileName, valu
Hi Ted,
Thanks a lot for your reply.
I tried your code in spark-shell on my laptop and it works well.
But when I tried it on another computer with Spark installed, I got an error:
scala> val df0 = Seq(("a", "b", "c", 3), ("c", "b", "a", 3)).toDF("A", "B",
"C", "num")
:11: error: value toDF is not a m
good catch on the cleaner.ttl
@jatin - when you say "memory-persisted RDD", what do you mean exactly? And
how much data are you talking about? Remember that Spark can evict these
memory-persisted RDDs at any time. They can be recovered from Kafka, but this
is not a good situation to be in.
Do you have the following in your IDEA Scala console ?
import scala.language.implicitConversions
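For reference, a minimal console sequence that brings toDF into scope on Spark 1.x (a sketch; spark-shell performs the implicits import automatically, but an IDE console does not):

import scala.language.implicitConversions
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._   // provides toDF on local Seqs and RDDs
val df0 = Seq(("a", "b", "c", 3), ("c", "b", "a", 3)).toDF("A", "B", "C", "num")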
On Sun, Feb 21, 2016 at 10:16 AM, Tenghuan He wrote:
> Hi Ted,
>
> Thanks a lot for you reply
>
> I tried your code in spark-shell on my laptop it works well.
> But when I tried it on another compu
w.r.t. the new mapWithState(), there have been some bug fixes since the
release of 1.6.0
e.g.
SPARK-13121 java mapWithState mishandles scala Option
Looks like 1.6.1 RC should come out next week.
FYI
On Sun, Feb 21, 2016 at 10:47 AM, Chris Fregly wrote:
> good catch on the cleaner.ttl
>
> @jat
Hi Tamara,
few basic questions first.
How many executors are you using?
Is the data getting all cached into the same executor?
How many partitions do you have of the data?
How many fields are you trying to use in the join?
If you need any help in finding answers to these questions, please let me
k
Sorry,
please add the following questions to the list above:
the SPARK version?
whether you are using RDD or DataFrames?
is the code run locally or in SPARK Cluster mode or in AWS EMR?
Regards,
Gourav Sengupta
On Sun, Feb 21, 2016 at 7:37 PM, Gourav Sengupta
wrote:
> Hi Tamara,
>
> few b
I believe the best way would be to use the reduceByKey operation.
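A rough sketch of that approach for the vertical addition (lines is an assumed DStream[String] of raw records; the field positions follow the sample input earlier in the thread):

import org.apache.spark.streaming.dstream.DStream

def verticalSum(lines: DStream[String]): DStream[(String, Array[Int])] =
  lines
    .map { line =>
      val arr = line.split(",").map(_.trim)
      (arr(1), arr.drop(2).map(_.toInt))   // (fileName, numbers)
    }
    .reduceByKey { (a, b) =>
      // element-wise ("vertical") addition; pad the shorter array with 0s
      a.zipAll(b, 0, 0).map { case (x, y) => x + y }
    }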
On Mon, Feb 22, 2016 at 5:10 AM, Jatin Kumar <
jku...@rocketfuelinc.com.invalid> wrote:
> You will need to do a collect and update a global map if you want to.
>
> myDStream.map(record => (record._2, (record._3, record._4, record._5)))
>
df.write.saveAsTable("db_name.tbl_name") // works, spark-shell, latest spark
version 1.6.0
df.write.saveAsTable("db_name.tbl_name") // NOT work, spark-shell, old spark
version 1.4
--
Jacky Wang
At 2016-02-21 17:35:53, "Mich Talebzadeh" wrote:
I looked at doc on this. It is not clear
Yeah, I tried with reduceByKey and was able to do it. Thanks Ayan and Jatin.
For reference, the final solution:
def main(args: Array[String]): Unit = {
val conf = new SparkConf().setAppName("HBaseStream")
val sc = new SparkContext(conf)
// create a StreamingContext, the main entry point for a
>
> Yeah, i tried with reduceByKey, was able to do it. Thanks Ayan and Jatin.
> For reference, final solution:
>
> def main(args: Array[String]): Unit = {
> val conf = new SparkConf().setAppName("HBaseStream")
> val sc = new SparkContext(conf)
> // create a StreamingContext, the main en
Well my version of Spark is 1.5.2
On 21/02/2016 23:54, Jacky Wang wrote:
> df.write.saveAsTable("db_name.tbl_name") // works, spark-shell, latest spark
> version 1.6.0
> df.write.saveAsTable("db_name.tbl_name") // NOT work, spark-shell, old spark
> version 1.4
>
> --
>
> Jacky Wang
>
Hi there,
I had a similar problem in Java with the standalone cluster on Linux but got
it working by passing the following option:
-Dspark.jars=file:/path/to/sparkapp.jar
sparkapp.jar contains the launch application.
Hope that helps.
Regards,
Patrick
-Original Message-
From: Arko Provo Mu
Hi,
I have observed that Spark SQL is not returning records for Hive bucketed
ORC tables on HDP.
In Spark SQL, I am able to list all tables, but queries on Hive bucketed
tables are not returning records.
I have also tried the same for non-bucketed Hive tables; that works fine.
Same is
Hi,
Is the transaction attribute set on your table? I observed that the Hive
transactional storage structure does not work with Spark yet. You can confirm
this by looking at the transactional attribute in the output of "desc
extended <table_name>" in the Hive console.
If you need to access a transactional table, conside
Hi Varadharajan,
Thanks for your response.
Yes, it is a transactional table; see below *show create table*.
The table has hardly 3 records, and after triggering compaction on the
tables, it started showing results in Spark SQL.
> *ALTER TABLE hivespark COMPACT 'major';*
> *show create table hiv
Yes, I was burned by this issue a couple of weeks back. This also means
that after every insert job, a compaction should be run to access new rows
from Spark. Sad that this issue is not documented / mentioned anywhere.
On Mon, Feb 22, 2016 at 9:27 AM, @Sanjiv Singh
wrote:
> Hi Varadharajan,
>
>
Hi,
I am trying to dynamically create DataFrames by reading subdirectories under
a parent directory.
My code looks like
> import org.apache.spark._
> import org.apache.spark.sql._
> val hadoopConf = new org.apache.hadoop.conf.Configuration()
> val hdfsConn = org.apache.hadoop.fs.FileSystem.get(new
>
The max number of cores per executor can be controlled using
spark.executor.cores, and the maximum number of executors on a single worker
can be controlled by the environment variable SPARK_WORKER_INSTANCES.
However, to ensure that all available cores are used, you will have to take
care of how the stream is
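A sketch of the relevant settings for the standalone-mode question above (the values follow that question: 6 workers with 2 cores each, i.e. 12 cores in total; the app name is a placeholder):

import org.apache.spark.SparkConf
val conf = new SparkConf()
  .setAppName("MyStreamingApp")
  .set("spark.executor.cores", "2")   // cores per executor
  .set("spark.cores.max", "12")       // total cores for the app across the cluster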
Compaction should have been triggered automatically, as the following
properties are already set in *hive-site.xml*, and the *NO_AUTO_COMPACTION*
property has not been set for these tables:
hive.compactor.initiator.on = true
hive.compactor.worker.threads = 1
Do
Thanks Gourav, Eduardo.
I tried http://localhost:8080 and http://OAhtvJ5MCA:8080/ . In both
cases Firefox just hangs.
I also tried with the lynx text-based browser. I get the message "HTTP
request sent; waiting for response." and it hangs as well.
Is there a way to enable debug logs in spark
On the line preceding the one that the compiler is complaining about (which
doesn't actually have a problem in itself), you declare df as
"df"+fileName, making it a string. Then you try to assign a DataFrame to
df, but it's already a string. I don't quite understand your intent with
that previous l
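One way around the dynamic-variable-name problem is to key the DataFrames by subdirectory name in a Map instead; a sketch assuming spark-shell with sqlContext in scope, a placeholder parent HDFS path, and Parquet data (all assumptions):

import org.apache.hadoop.fs.{FileSystem, Path}
val hadoopConf = new org.apache.hadoop.conf.Configuration()
val fs = FileSystem.get(hadoopConf)
val parent = new Path("/data/parent")   // placeholder parent directory
val dfByName = fs.listStatus(parent)
  .filter(_.isDirectory)
  .map(s => s.getPath.getName -> sqlContext.read.parquet(s.getPath.toString))
  .toMap
dfByName.foreach { case (name, df) => df.printSchema() }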
Hi
I am using an EMR cluster for running my spark jobs, but after the job
finishes the logs disappear.
I have added a log4j.properties in my jar, but all the logs still redirect
to the EMR resource manager, which vanishes after the job completes. Is there a way
I could redirect the logs to a location in f
Hi Folks,
I am exploring spark for streaming from two sources (a) Kinesis and (b)
HDFS for some of our use-cases. Since we maintain state gathered over the
last x hours in spark streaming, we would like to replay the data from the
last x hours as batches during deployment. I have gone through the Spark
Your logs are getting archived in your logs bucket in S3.
http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-plan-debugging.html
Regards
Sab
On Mon, Feb 22, 2016 at 12:14 PM, HARSH TAKKAR
wrote:
> Hi
>
> In am using an EMR cluster for running my spark jobs, but after the jo
Hi,
Can anybody help me by providing an example of how we can read the schema of
a data set from a file?
Thanks,
Divya
On Mon, Feb 22, 2016 at 12:18 PM, ashokkumar rajendran <
ashokkumar.rajend...@gmail.com> wrote:
> Hi Folks,
>
>
>
> I am exploring spark for streaming from two sources (a) Kinesis and (b)
> HDFS for some of our use-cases. Since we maintain state gathered over last
> x hours in spark streaming, we