hi Ted,
thanks for your reply. Is there any other way to do this with Spark 1.3,
such as writing the ORC file manually in a foreachPartition method?
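One rough sketch I am considering instead of writing the files by hand: register the
data as a temp table and insert it into a Hive table stored as ORC via HiveContext
(a sketch only, assuming Hive support is compiled in; table and column names below
are made up):

import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc)
import hiveContext.implicits._

val df = rdd.toDF("id", "value")   // rdd: e.g. an RDD[(Int, String)]
df.registerTempTable("tmp_result")
hiveContext.sql("CREATE TABLE IF NOT EXISTS result_orc (id INT, value STRING) STORED AS ORC")
hiveContext.sql("INSERT OVERWRITE TABLE result_orc SELECT id, value FROM tmp_result")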
On Sat, Aug 22, 2015 at 12:19 PM, Ted Yu wrote:
> ORC support was added in Spark 1.4
> See SPARK-2883
>
> On Fri, Aug 21, 2015 at 7:36 PM, dong.yajun
ORC support was added in Spark 1.4
See SPARK-2883
On Fri, Aug 21, 2015 at 7:36 PM, dong.yajun wrote:
> Hi list,
>
> Is there a way to save the RDD result as Orcfile in spark1.3? due to some
> reasons we can't upgrade our spark version to 1.4 now.
>
> --
> *Ric Dong*
>
>
It comes at the start of each task when new data is inserted into Kafka
(the amount of data inserted is very small).
The Kafka topic has 300 partitions - the data inserted is ~10 MB.
Tasks fail and are retried, which succeeds, but after a certain number of failed
tasks the job is killed.
On Sat, Aug 22, 2015 at 2:08 AM, A
You can get the list of all the persisted RDDs using the Spark context...
On Aug 21, 2015 12:06 AM, "Rishitesh Mishra"
wrote:
> I am not sure if you can view all RDDs in a session. Tables are maintained
> in a catalogue, hence it's easier. However, you can see the DAG
> representation, which lists all th
Hi Joseph,
I have used the built-in save as you suggested. The directory gets created,
but the complete model doesn't get written to the file - only a part of the
model is written. Please find attached the part that was written when tested
with the above method.
On 18 August 2015 at
Hi list,
Is there a way to save the RDD result as an ORC file in Spark 1.3? For some
reasons we can't upgrade our Spark version to 1.4 now.
--
*Ric Dong*
I've been reading documentation on accessing offsetRanges and updating ZK
yourself when using DirectKafkaInputDStream (from createDirectStream),
along with the code changes in this PR:
https://github.com/apache/spark/pull/4805.
I'm planning on adding a listener to update ZK (for monitoring purpose
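For reference, the pattern from that PR / the Kafka integration docs looks roughly
like this (a sketch; the actual ZK update is left as a stub):

import org.apache.spark.streaming.kafka.{HasOffsetRanges, OffsetRange}

var offsetRanges = Array.empty[OffsetRange]

directKafkaStream.transform { rdd =>
  // capture the offset ranges before any transformation changes the partitioning
  offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  rdd
}.foreachRDD { rdd =>
  for (o <- offsetRanges) {
    // update ZK here so existing monitoring tools can see the consumed offsets
    println(s"${o.topic} ${o.partition} ${o.fromOffset} ${o.untilOffset}")
  }
}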
Is there a workaround without updating Hadoop? I would really appreciate it if
someone could explain what Spark is trying to do here and what an easy way to
turn this off would be. Thanks all!
On Fri, Aug 21, 2015 at 11:09 AM, Raghavendra Pandey <
raghavendra.pan...@gmail.com> wrote:
> Did you try with hadoop
Following is a method that retrieves the list of executors registered to a
spark context. It worked perfectly with spark-submit in standalone mode for my
project.
/**
 * A simplified method that just returns the current active/registered
 * executors, excluding the driver.
 * @param sc
 */
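The body is essentially the following (a sketch, assuming the entries returned by
getExecutorMemoryStatus are "host:port" strings and the driver host is available
as spark.driver.host):

import org.apache.spark.SparkContext

def currentActiveExecutors(sc: SparkContext): Seq[String] = {
  // one "host:port" entry per executor, plus one for the driver
  val allExecutors = sc.getExecutorMemoryStatus.keys.map(_.split(":")(0)).toSeq
  val driverHost = sc.getConf.get("spark.driver.host")
  allExecutors.filter(_ != driverHost)
}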
Nathan,
I achieve this using rowNumber. Here is a Python DataFrame example:
from pyspark.sql.window import Window
from pyspark.sql.functions import desc, rowNumber

yourOutputDF = (
    yourInputDF
        # the ordering column is assumed to be "datetime" (per the original question);
        # rows where "first" == 1 are then the oldest view per user
        .withColumn("first", rowNumber()
            .over(Window.partitionBy("userID").orderBy("datetime")))
)
In the test job I am running in Spark 1.3.1 in our stage cluster, I can see
following information on the application stage information:
Metric     Min     25th percentile   Median    75th percentile   Max
Duration   0 ms    1.1 min           1.5 min   1.7 min           3.4 min
GC Time    11 s    16 s              21 s      25 s              54 s
From the GC output log, I can see it is abou
Hi Akhil,
I'm using spark 1.4.1.
The number of executors is not in the command line, nor in
getExecutorMemoryStatus
(I already mentioned that I tried that; it works in spark-shell but not when
executed via spark-submit). I tried looking at "defaultParallelism" too,
it's 112 (7 executors * 16 cores) w
Could you periodically (say every 10 mins) run System.gc() on the driver?
The cleaning up of shuffles is tied to garbage collection.
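Something like this on the driver would do it (a rough sketch; the 10-minute
interval is arbitrary):

val gcThread = new Thread(new Runnable {
  override def run(): Unit = {
    while (true) {
      Thread.sleep(10 * 60 * 1000L)  // every 10 minutes
      System.gc()                    // lets the ContextCleaner clean up old shuffle files
    }
  }
})
gcThread.setDaemon(true)  // don't keep the driver JVM alive just for this thread
gcThread.start()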
On Fri, Aug 21, 2015 at 2:59 AM, gaurav sharma
wrote:
> Hi All,
>
>
> I have a 24x7 running Streaming Process, which runs on 2 hour windowed data
>
> The issue i
That looks like you are choking your Kafka machine. Run top on the Kafka
machines and check the workload; it may be that you are spending too much
time on disk I/O etc.
On Aug 21, 2015 7:32 AM, "Cody Koeninger" wrote:
> Sounds like that's happening consistently, not an occasional network
> prob
Which version spark are you using? There was a discussion happened over
here
http://apache-spark-user-list.1001560.n3.nabble.com/Determine-number-of-running-executors-td19453.html
http://mail-archives.us.apache.org/mod_mbox/spark-user/201411.mbox/%3ccacbyxk+ya1rbbnkwjheekpnbsbh10rykuzt-laqgpdanvhm
It may be that the version of the spark-ec2 script you are using is buggy,
or sometimes AWS has problems provisioning machines.
On Aug 21, 2015 7:56 AM, "Garry Chen" wrote:
> Hi All,
>
> I am trying to lunch a spark ec2 cluster by running
> spark-ec2 --key-pair=key --identity-file=
You can try adding a human-readable entry to the /etc/hosts file of the
worker machine and then set SPARK_LOCAL_IP to this hostname in that
machine's spark-env.sh file.
On Aug 21, 2015 11:57 AM, wrote:
> Hi,
>
> Is it possible in standalone to set up worker ID names? to avo
You can look at spark.streaming.concurrentJobs; by default it runs a
single job. If you set it to 2 then it can run 2 jobs in parallel. It's an
experimental flag, but go ahead and give it a try.
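For example, a sketch of setting it programmatically:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf()
  .setAppName("MyStreamingApp")
  .set("spark.streaming.concurrentJobs", "2")  // allow 2 streaming jobs to run concurrently
val ssc = new StreamingContext(conf, Seconds(10))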
On Aug 21, 2015 3:36 AM, "Sateesh Kavuri" wrote:
> Hi,
>
> My scenario goes like this:
> I have an algorit
Did you try sorting it by datetime and doing a groupBy on the userID?
On Aug 21, 2015 12:47 PM, "Nathan Skone" wrote:
> Raghavendra,
>
> Thanks for the quick reply! I don’t think I included enough information in
> my question. I am hoping to get fields that are not directly part of the
> aggregat
You could also rename them with the names() function.
Unfortunately the API docs don't show an example of that:
https://spark.apache.org/docs/latest/api/R/index.html
On Thu, Aug 20, 2015 at 7:43 PM -0700, "Sun, Rui" wrote:
Hi,
You can create a DataFrame using load.df() with a specified schema.
Something like
Raghavendra,
Thanks for the quick reply! I don’t think I included enough information in my
question. I am hoping to get fields that are not directly part of the
aggregation. Imagine a dataframe representing website views with a userID,
datetime, and a webpage address. How could I find the oldes
Hi Sateesh,
It is interesting to know how you determined that the DStream runs on
a single core. Did you mean receivers?
Coming back to your question, could you not start the disk I/O in a separate
thread, so that the scheduler can go ahead and assign other tasks?
On 21 Aug 2015 16:06, "Sateesh Ka
You've probably hit this bug:
https://issues.apache.org/jira/browse/SPARK-7180
It's fixed in Spark 1.4.1+. Try setting spark.serializer.extraDebugInfo to
false and see if it goes away.
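For example, either pass --conf spark.serializer.extraDebugInfo=false to
spark-submit or set it on the SparkConf:

val conf = new org.apache.spark.SparkConf()
  .set("spark.serializer.extraDebugInfo", "false")  // disables the extra serialization debug path involved in SPARK-7180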
On Fri, Aug 21, 2015 at 3:37 AM, Eugene Morozov
wrote:
> Hi,
>
> I'm using spark 1.3.1 built against hadoop 1
Hi,
Is it possible in standalone to set up worker ID names? to avoid the
worker-19248891237482379-ip..-port ??
Thanks,
Saif
Thanks Reynold, that helps a lot. I'm glad you're involved with that Google
Doc community effort. I think it's because of that doc that the JEP's
wording and scope changed for the better since it originally got
introduced.
Marek
On Fri, Aug 21, 2015 at 11:18 AM, Reynold Xin wrote:
> I'm actuall
I'm actually somewhat involved with the Google Docs you linked to.
I don't think Oracle will remove Unsafe in JVM 9. As you said, JEP 260
already proposes making Unsafe available. Given the widespread use of
Unsafe for performance and advanced functionalities, I don't think Oracle
can just remove
Impact,
You can group the data by key, then sort it by timestamp and take the max to
select the oldest value.
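As a rough DataFrame sketch (df, "key" and "timestamp" are placeholders for your
actual DataFrame and columns; min() picks the oldest timestamp per key, and the
join recovers the rest of the row):

import org.apache.spark.sql.functions.min

val oldestPerKey = df.groupBy("key").agg(min("timestamp").as("oldest_ts"))
val oldestRows = df.join(oldestPerKey,
  df("key") === oldestPerKey("key") && df("timestamp") === oldestPerKey("oldest_ts"))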
On Aug 21, 2015 11:15 PM, "Impact" wrote:
> I am also looking for a way to achieve the reducebykey functionality on
> data
> frames. In my case I need to select one particular row (the oldest,
Did you try with Hadoop version 2.7.1? It is known that s3a works really
well with Parquet in 2.7. They fixed a lot of issues
related to metadata reading there...
On Aug 21, 2015 11:24 PM, "Jerrick Hoang" wrote:
> @Cheng, Hao : Physical plans show that it got stuck on scanning
@Cheng, Hao : Physical plans show that it got stuck on scanning S3!
(table is partitioned by date_prefix and hour)
explain select count(*) from test_table where date_prefix='20150819' and
hour='00';
TungstenAggregate(key=[], value=[(count(1),mode=Final,isDistinct=false)]
TungstenExchange SingleP
I am also looking for a way to achieve the reducebykey functionality on data
frames. In my case I need to select one particular row (the oldest, based on
a timestamp column value) by key.
--
I can't find a way to specify the datediff function to give me the difference
in minutes or seconds. Can anyone please help?
Thanks,
Stefan Panayotov
Sent from my Windows Phone
I believe this was caused by some network configuration on my machines. After
installing VirtualBox, some new network interfaces were installed on the
machines and the Akka software was binding to one of the VirtualBox interfaces
and not the interface that belonged to my Ethernet card. Once I d
That was only true until Spark 1.3. Spark 1.4 can be built with JDK7
and pyspark will still work.
On Fri, Aug 21, 2015 at 8:29 AM, Chen Song wrote:
> Thanks Sean.
>
> So how PySpark is supported. I thought PySpark needs jdk 1.6.
>
> Chen
>
> On Fri, Aug 21, 2015 at 11:16 AM, Sean Owen wrote:
>>
Hi Naveen,
As I mentioned before, the code is private and therefore not accessible. Just
copy and use the snippet that I sent. Copying it here again:
https://github.com/apache/spark/blob/43e0135421b2262cbb0e06aae53523f663b4f959/mllib/src/main/scala/org/apache/spark/mllib/linalg/Matrices.scala#L270
Bes
Hi Sunil,
Have you seen this fix in Spark 1.5 that may fix the locality issue?:
https://issues.apache.org/jira/browse/SPARK-4352
On Thu, Aug 20, 2015 at 4:09 AM, Sunil wrote:
> Hello . I am seeing some unexpected issues with achieving HDFS
> data
> locality. I expect the tasks to be ex
Spark 1.4 requires Java 7.
On Fri, Aug 21, 2015, 3:12 PM Chen Song wrote:
> I tried to build Spark 1.4.1 on cdh 5.4.0. Because we need to support
> PySpark, I used JDK 1.6.
>
> I got the following error,
>
> [INFO] --- scala-maven-plugin:3.2.0:testCompile (scala-test-compile-first)
> @ spark-str
No, the messages never end. I have to Ctrl-C out of it.
Garry
From: shahid ashraf [mailto:sha...@trialx.com]
Sent: Friday, August 21, 2015 11:13 AM
To: Garry Chen
Cc: user@spark.apache.org
Subject: Re: Spark ec2 lunch problem
Does the cluster work at the end ?
On Fri, Aug 21, 2015 at 8:25 PM,
Thanks Sean.
So how is PySpark supported? I thought PySpark needed JDK 1.6.
Chen
On Fri, Aug 21, 2015 at 11:16 AM, Sean Owen wrote:
> Spark 1.4 requires Java 7.
>
> On Fri, Aug 21, 2015, 3:12 PM Chen Song wrote:
>
>> I tried to build Spark 1.4.1 on cdh 5.4.0. Because we need to support
>> PySp
Have you seen the Spark SQL paper?:
https://people.csail.mit.edu/matei/papers/2015/sigmod_spark_sql.pdf
On Thu, Aug 20, 2015 at 11:35 PM, Dawid Wysakowicz <
wysakowicz.da...@gmail.com> wrote:
> Hi,
>
> thanks for answers. I have read answers you provided, but I rather look
> for some materials on
Does the cluster work in the end?
On Fri, Aug 21, 2015 at 8:25 PM, Garry Chen wrote:
> Hi All,
>
> I am trying to lunch a spark ec2 cluster by running
> spark-ec2 --key-pair=key --identity-file=my.pem --vpc-id=myvpc
> --subnet-id=subnet-011 --spark-version=1.4.1 launch spark-cl
Hi All,
I am trying to launch a Spark EC2 cluster by running spark-ec2
--key-pair=key --identity-file=my.pem --vpc-id=myvpc --subnet-id=subnet-011
--spark-version=1.4.1 launch spark-cluster, but I keep getting the following
message endlessly. Please help.
Warning: SSH connection error. (Thi
I've been able to almost halve my memory usage with no instability issues.
I lowered my storage.memoryFraction and increased my shuffle.memoryFraction
(essentially swapping them). I set spark.yarn.executor.memoryOverhead to
6GB. And I lowered executor-cores in case other jobs are using the
availab
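For reference, the equivalent settings look roughly like this (a sketch with
illustrative values; memoryOverhead is in MB):

val conf = new org.apache.spark.SparkConf()
  .set("spark.storage.memoryFraction", "0.2")        // lowered from the 0.6 default
  .set("spark.shuffle.memoryFraction", "0.6")        // raised from the 0.2 default
  .set("spark.yarn.executor.memoryOverhead", "6144") // 6GB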
Is there any reliable way to find out the number of executors
programmatically - regardless of how the job is run? A method that
preferably works for Spark standalone, YARN and Mesos, regardless of whether
the code runs from the shell or not?
Things that I tried and don't work:
- sparkContext.getExecuto
Sounds like that's happening consistently, not an occasional network
problem?
Look at the Kafka broker logs
Make sure you've configured the correct kafka broker hosts / ports (note
that direct stream does not use zookeeper host / port).
Make sure that host / port is reachable from your driver an
Have you read this ?
http://stackoverflow.com/questions/22716346/how-to-use-lz4-compression-in-linux-3-11
> On Aug 21, 2015, at 6:57 AM,
> wrote:
>
> Hi all,
>
> I am using pre-compiled spark with hadoop 2.6. LZ4 Codec is not on hadoop’s
> native libraries, so I am not being able to use
I tried to build Spark 1.4.1 on cdh 5.4.0. Because we need to support
PySpark, I used JDK 1.6.
I got the following error,
[INFO] --- scala-maven-plugin:3.2.0:testCompile (scala-test-compile-first)
@ spark-streaming_2.10 ---
java.lang.UnsupportedClassVersionError: org/apache/hadoop/io/LongWritabl
Which version of Spark are you using - the one that comes with DSE 4.7?
We just cannot reproduce it in Spark.

yzhang@localhost>$ more test.spark
val pairs = sc.makeRDD(Seq((0,1),(0,2),(1,20),(1,30),(2,40)))
pairs.reduceByKey((x,y) => x + y).collect

yzhang@localhost>$ ~/spark/bin/spark-shell --master local -i t
Hi all,
I am using pre-compiled Spark with Hadoop 2.6. The LZ4 codec is not in Hadoop's
native libraries, so I am not able to use it.
Can anyone suggest how to proceed? Hopefully I won't have to recompile
Hadoop. I tried changing the --driver-library-path to point directly to the lz4
stand a
Hi Abhishek,
I have even tried that, but rdd2 is empty.
Regards,
Satish
On Fri, Aug 21, 2015 at 6:47 PM, Abhishek R. Singh <
abhis...@tetrationanalytics.com> wrote:
> You had:
>
> > RDD.reduceByKey((x,y) => x+y)
> > RDD.take(3)
>
> Maybe try:
>
> > rdd2 = RDD.reduceByKey((x,y) => x+y)
> > rdd2.ta
You had:
> RDD.reduceByKey((x,y) => x+y)
> RDD.take(3)
Maybe try:
> rdd2 = RDD.reduceByKey((x,y) => x+y)
> rdd2.take(3)
-Abhishek-
On Aug 20, 2015, at 3:05 AM, satish chandra j wrote:
> HI All,
> I have data in RDD as mentioned below:
>
> RDD : Array[(Int),(Int)] = Array((0,1), (0,2),(1,20)
Hi All,
Any inputs on the actual problem statement?
Regards,
Satish
On Fri, Aug 21, 2015 at 5:57 PM, Jeff Zhang wrote:
> Yong, Thanks for your reply.
>
> I tried spark-shell -i , it works fine for me. Not sure the
> different with
> dse spark --master local --jars postgresql-9.4-1201.jar -i
Hello,
I attended the Tungsten-related presentations at Spark Summit (by Josh
Rosen) and at Big Data Scala (by Matei Zaharia). Needless to say, this
project holds great promise for major performance improvements.
At Josh's talk, I heard about the use of sun.misc.Unsafe as a way of
achieving some
Hi,
An exception is thrown when using a HAVING clause with variance or stddev. It
works perfectly when using other aggregate functions (like
sum, count, min, max...).
SELECT SUM(1) AS `sum_number_of_records_ok` FROM
`some_db`.`some_table` `some_table`
GROUP BY 1 HAVING (STDDEV(1) > 0)
SELECT SUM(1) AS `sum_
Hi *,
We are trying to run Spark on top of Mesos using fine-grained mode. While
talking to a few people I came to know that running a Spark job in
fine-grained mode on Mesos is not a good idea.
I could not find anything regarding fine-grained mode getting deprecated,
nor whether coarse-grained mode
2015-08-21 3:17 GMT-07:00 smagadi :
> teenagers .toJSON gives the json but it does not preserve the parent ids
>
> meaning if the input was {"name":"Yin",
> "address":{"city":"Columbus","state":"Ohio"},"age":20}
>
> val x= sqlContext.sql("SELECT name, address.city, address.state ,age FROM
> people
I believe "spark-shell -i scriptFile" is there. We also use it, at least in
Spark 1.3.1.
"dse spark" will just wrap "spark-shell" command, underline it is just invoking
"spark-shell".
I don't know too much about the original problem though.
Yong
Date: Fri, 21 Aug 2015 18:19:49 +0800
Subject: Re:
This is something of a wild guess, but I find that when executors start
disappearing for no obvious reason, this is usually because the YARN
node managers have decided that the containers are using too much memory and
have then terminated the executors.
Unfortunately, to see evidence of this, one needs
Hi,
I'm using spark 1.3.1 built against hadoop 1.0.4 and java 1.7 and I'm
trying to save my data frame to parquet.
The issue I'm stuck on looks like serialization trying to do a pretty weird
thing: it tries to write to an empty array.
The last (per the stack trace) line of Spark code that leads to the excepti
Hi,
My scenario goes like this:
I have an algorithm running in Spark streaming mode on a 4 core virtual
machine. Majority of the time, the algorithm does disk I/O and database
I/O. Question is, during the I/O, where the CPU is not considerably loaded,
is it possible to run any other task/thread so
The ML package provides Machine Learning pipelines so that users can do
machine learning more efficiently.
It's more general to chain StringIndexer with any kind of Estimator.
I think we can make StringIndexer and the reverse process automatic in the
future.
If you want to know your original labels, you ca
Hi Satish,
I don't see where Spark supports "-i", so I suspect it is provided by DSE. In
that case, it might be a bug in DSE.
On Fri, Aug 21, 2015 at 6:02 PM, satish chandra j
wrote:
> HI Robin,
> Yes, it is DSE but issue is related to Spark only
>
> Regards,
> Satish Chandra
>
> On Fri, Aug 21, 2
teenagers.toJSON gives the JSON but it does not preserve the parent IDs,
meaning if the input was {"name":"Yin",
"address":{"city":"Columbus","state":"Ohio"},"age":20}
val x= sqlContext.sql("SELECT name, address.city, address.state ,age FROM
people where age>19 and age <=30 ").toJSON
x.collect
HI Robin,
Yes, it is DSE but issue is related to Spark only
Regards,
Satish Chandra
On Fri, Aug 21, 2015 at 3:06 PM, Robin East wrote:
> Not sure, never used dse - it’s part of DataStax Enterprise right?
>
> On 21 Aug 2015, at 10:07, satish chandra j
> wrote:
>
> HI Robin,
> Yes, below mention
Hi All,
I have a 24x7 running Streaming Process, which runs on 2 hour windowed data
The issue I am facing is that my worker machines are running OUT OF DISK space.
I checked, and the SHUFFLE FILES are not getting cleaned up.
/log/spark-2b875d98-1101-4e61-86b4-67c9e71954cc/executor-5bbb53c1-cee9-4438
Yes, DSE 4.7
Regards,
Satish Chandra
On Fri, Aug 21, 2015 at 3:06 PM, Robin East wrote:
> Not sure, never used dse - it’s part of DataStax Enterprise right?
>
> On 21 Aug 2015, at 10:07, satish chandra j
> wrote:
>
> HI Robin,
> Yes, below mentioned piece or code works fine in Spark Shell but
The easiest option I found is to put the jars in the SPARK CLASSPATH.
On 21 Aug 2015 06:20, "Burak Yavuz" wrote:
> If you would like to try using spark-csv, please use
> `pyspark --packages com.databricks:spark-csv_2.11:1.2.0`
>
> You're missing a dependency.
>
> Best,
> Burak
>
> On Thu, Aug 20, 2015 at 1:0
Please try DataFrame.toJSON; it will give you an RDD of JSON strings.
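For example, with the query from your mail (assuming the people table is
registered as before):

val teenagers = sqlContext.sql(
  "SELECT name FROM people WHERE age >= 13 AND age <= 19")

val teenagersJson = teenagers.toJSON      // an RDD[String], one JSON document per row
teenagersJson.collect().foreach(println)  // prints one JSON string per row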
At 2015-08-21 15:59:43, "smagadi" wrote:
>val teenagers = sqlContext.sql("SELECT name FROM people WHERE age >= 13 AND
>age <= 19")
>
>I need teenagers to be a JSON object rather a simple row .How can we get
>that done ?
>
HI Robin,
Yes, the below-mentioned piece of code works fine in the Spark shell, but when
the same is placed in a script file and executed with -i, it creates an
empty RDD
scala> val pairs = sc.makeRDD(Seq((0,1),(0,2),(1,20),(1,30),(2,40)))
pairs: org.apache.spark.rdd.RDD[(Int, Int)] = ParallelCollectionRDD[77]
Hi
I am getting the below error in Spark Streaming 1.3 while consuming from Kafka
using the direct Kafka stream. A few tasks are failing in each run.
What is the reason for / solution to this error?
15/08/21 08:54:54 ERROR executor.Executor: Exception in task 262.0 in
stage 130.0 (TID 16332)
java.io.EOF
It seems like you want simultaneous processing of multiple jobs but, at the
same time, serialization of a few tasks within those jobs. I don't know how to
achieve that in Spark.
But why would you worry about the interleaved processing when the data
that is being aggregated in different jobs is per
Have you considered asking this question on
https://groups.google.com/a/lists.datastax.com/forum/#!forum/spark-connector-user
?
Cheers
On Thu, Aug 20, 2015 at 10:57 PM, Samya wrote:
> Hi All,
>
> I need to write an RDD to Cassandra & using the sparkCassandraConnector
> from
> DataStax. My appli
You may have seen this:
http://search-hadoop.com/m/q3RTt31EKP1BTHxY1
On Fri, Aug 21, 2015 at 1:01 AM, smagadi wrote:
> Does spark sql supports XML the same way as it supports json ?
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/SPARK-SQL-support
It's good
Thanks for your reply Michael.
On Thu, Aug 20, 2015 at 11:03 PM, Michael Armbrust
wrote:
> We will probably fix this in Spark 1.6
>
> https://issues.apache.org/jira/browse/SPARK-10040
>
> On Thu, Aug 20, 2015 at 5:18 AM, Aram Mkrtchyan <
> aram.mkrtchyan...@gmail.com> wrote:
>
>> We wa
The OP wants to understand what determines the size of the task code that is
shipped to each executor so it can run the task. I don't know the answer, but I
would be interested to know too.
Sent from my iPhone
> On 21 Aug 2015, at 08:26, oubrik [via Apache Spark User List]
> wrote:
>
> Hi
>
Does Spark SQL support XML the same way as it supports JSON?
--
--
val teenagers = sqlContext.sql("SELECT name FROM people WHERE age >= 13 AND
age <= 19")
I need teenagers to be a JSON object rather than a simple row. How can we get
that done?
--
Hi,
I was trying to programmatically specify a schema and apply it to an RDD of
Rows and save the resulting DataFrame as a parquet file, but I got
"java.lang.ClassCastException:
java.lang.String cannot be cast to java.lang.Long" on the last step.
Here's what I did:
1. Created an RDD of Rows from
Hi,
I was trying to programmatically specify a schema and apply it to an RDD of
Rows and save the resulting DataFrame as a parquet file.
Here's what I did:
1. Created an RDD of Rows from RDD[Array[String]]:
val gameId= Long.valueOf(line(0))
val accountType = Long.valueOf(line(1))
val w
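Roughly, the flow I am attempting looks like this (a simplified sketch; field
names as above, everything else made up):

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{LongType, StructField, StructType}

val schema = StructType(Seq(
  StructField("gameId", LongType, nullable = false),
  StructField("accountType", LongType, nullable = false)))

// every field is converted to Long so that it matches the LongType schema
val rowRDD = lines.map(_.split(",")).map(a => Row(a(0).toLong, a(1).toLong))

val df = sqlContext.createDataFrame(rowRDD, schema)
df.saveAsParquetFile("/tmp/out.parquet")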