Re: OOM When Running with Mesos Fine-grained Mode

2016-03-05 Thread SLiZn Liu
On Saturday, 5 March 2016, SLiZn Liu wrote: > >> Hi Spark Mailing List, >> >> I’m running terabytes of text files with Spark on Mesos; the job ran >> fine until we decided to switch to Mesos fine-grained mode. >> >> At first glance, we spotted a massive numbe

OOM When Running with Mesos Fine-grained Mode

2016-03-04 Thread SLiZn Liu
Hi Spark Mailing List, I’m running terabytes of text files with Spark on Mesos; the job ran fine until we decided to switch to Mesos fine-grained mode. At first glance, we spotted a massive number of task-lost errors in the logs: 16/03/05 04:01:20 ERROR TaskSchedulerImpl: Ignoring update with state L
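
For reference, a minimal sketch of the mode switch being discussed (Spark 1.x on Mesos; the master URL and memory figures below are placeholders, not taken from the thread):

    import org.apache.spark.{SparkConf, SparkContext}

    // spark.mesos.coarse = true  -> coarse-grained mode: one long-running executor per node
    // spark.mesos.coarse = false -> fine-grained mode: one Mesos task per Spark task
    val conf = new SparkConf()
      .setMaster("mesos://zk://zk1:2181,zk2:2181/mesos") // placeholder Mesos master URL
      .setAppName("MesosModeSketch")
      .set("spark.mesos.coarse", "true")                 // revert to coarse-grained if fine-grained misbehaves
      .set("spark.executor.memory", "8g")                // placeholder executor heap
    val sc = new SparkContext(conf)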

Re: Imported CSV file content isn't identical to the original file

2016-02-14 Thread SLiZn Liu
This error message no longer appears since I upgraded to 1.6.0. -- Cheers, Todd Leo On Tue, Feb 9, 2016 at 9:07 AM SLiZn Liu wrote: > At least it works for me, though; I temporarily disabled the Kryo serializer until > upgrading to 1.6.0. Thanks for your update. :) > Luciano Resende wrote on 2016-02-09

Is this Task Scheduler Error normal?

2016-02-10 Thread SLiZn Liu
Hi Spark Users, I’m running Spark jobs on Mesos, and sometimes I get a vast number of Task Scheduler errors: ERROR TaskSchedulerImpl: Ignoring update with state FINISHED for TID 1161 because its task set is gone (this is likely the result of receiving duplicate task finished status updates). It lo

Re: Imported CSV file content isn't identical to the original file

2016-02-08 Thread SLiZn Liu
At least it works for me, though; I temporarily disabled the Kryo serializer until upgrading to 1.6.0. Thanks for your update. :) Luciano Resende wrote on Tuesday, 2016-02-09 at 02:37: > Sorry, same expected results with trunk and Kryo serializer > > On Mon, Feb 8, 2016 at 4:15 AM, SLiZn Liu wrote: >

Re: Imported CSV file content isn't identical to the original file

2016-02-08 Thread SLiZn Liu
I’ve found the trigger of my issue: if I start spark-shell or submit via spark-submit with --conf spark.serializer=org.apache.spark.serializer.KryoSerializer, the DataFrame content goes wrong, as I described earlier. On Mon, Feb 8, 2016 at 5:42 PM SLiZn Liu wrote: > Thanks Luciano, now
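
For completeness, a minimal sketch of the same serializer switch done programmatically (equivalent to the --conf flag above; the app name is a placeholder):

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .setAppName("KryoReproSketch")                     // placeholder app name
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    // Removing the line above falls back to the default JavaSerializer, which
    // reportedly avoided the corrupted DataFrame content until the 1.6.0 upgrade.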

Re: Imported CSV file content isn't identical to the original file

2016-02-08 Thread SLiZn Liu
5-11-0400:00:31| > |1446566431 | 2015-11-0400:00:31| > +--+----------+ > > > > > On Sat, Feb 6, 2016 at 11:44 PM, SLiZn Liu wrote: > >> Hi Spark Users Group, >> >> I have a CSV file to analyze with Spark, but I’m having trouble >> importing it as a DataFram

Re: Imported CSV file content isn't identical to the original file

2016-02-07 Thread SLiZn Liu
have great fortune in the Year of the Monkey! — BR, Todd Leo On Sun, Feb 7, 2016 at 6:09 PM SLiZn Liu wrote: > Hi Igor, > > In my case, it’s not a matter of *truncate*. As the show() function in > the Spark API doc reads, > > truncate: Whether truncate long strings. If true, st

Re: Imported CSV file content isn't identical to the original file

2016-02-07 Thread SLiZn Liu
missing. Good to know how to show the whole content of a cell. — BR, Todd Leo On Sun, Feb 7, 2016 at 5:42 PM Igor Berman wrote: > show has a truncate argument > pass false so it won’t truncate your results > > On 7 February 2016 at 11:01, SLiZn Liu wrote: > >> Plus, I
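
A minimal sketch of the truncate argument mentioned above (Spark 1.5-era DataFrame API; the example data is made up):

    import org.apache.spark.sql.SQLContext

    val sqlContext = new SQLContext(sc)   // assumes an existing SparkContext `sc`
    import sqlContext.implicits._

    val df = Seq((1, "a fairly long string value that would normally be cut off"))
      .toDF("id", "text")

    df.show()            // default: string cells longer than 20 characters are truncated
    df.show(false)       // truncate = false: print full cell contents
    df.show(100, false)  // first 100 rows, no truncation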

Re: Imported CSV file content isn't identical to the original file

2016-02-07 Thread SLiZn Liu
Plus, I’m using *Spark 1.5.2* with *spark-csv 1.3.0*. I also tried HiveContext, but the result is exactly the same. On Sun, Feb 7, 2016 at 3:44 PM SLiZn Liu wrote: > Hi Spark Users Group, > > I have a CSV file to analyze with Spark, but I’m having trouble importing > it as a DataFrame

Imported CSV file content isn't identical to the original file

2016-02-06 Thread SLiZn Liu
Hi Spark Users Group, I have a CSV file to analyze with Spark, but I’m having trouble importing it as a DataFrame. Here’s a minimal reproducible example. Suppose I have a *10 (rows) x 2 (cols)* *space-delimited csv* file, shown below: 1446566430 2015-11-0400:00:30 1446566430 2015-11-0400:00:30
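
For context, a minimal sketch of reading such a space-delimited file with spark-csv (assuming Spark 1.5.x with the spark-csv 1.3.0 package on the classpath; the HDFS path is a placeholder):

    import org.apache.spark.sql.SQLContext

    val sqlContext = new SQLContext(sc)    // assumes an existing SparkContext `sc`
    val df = sqlContext.read
      .format("com.databricks.spark.csv")
      .option("header", "false")
      .option("delimiter", " ")            // space-delimited, as in the example file
      .option("inferSchema", "true")
      .load("hdfs:///path/to/input.csv")   // placeholder path
    df.show(false)                         // print untruncated content to compare with the source file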

Re: Save GraphX to disk

2015-11-13 Thread SLiZn Liu
Hi Gaurav, Your graph can be saved to graph databases like Neo4j or Titan through their drivers, which eventually persist it to disk. BR, Todd Gaurav Kumar gauravkuma...@gmail.com> wrote on Friday, 2015-11-13 at 22:08: > Hi, > > I was wondering how to save a graph to disk and load it back again. I know > how to
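
Besides a graph database, the vertex and edge RDDs can be persisted directly; a minimal sketch (paths and element types are placeholders, and object files are just one of several possible storage formats):

    import org.apache.spark.SparkContext
    import org.apache.spark.graphx.{Edge, Graph, VertexId}

    // Save the two underlying RDDs, then rebuild the graph from them.
    def saveGraph(graph: Graph[String, Int], path: String): Unit = {
      graph.vertices.saveAsObjectFile(s"$path/vertices")
      graph.edges.saveAsObjectFile(s"$path/edges")
    }

    def loadGraph(sc: SparkContext, path: String): Graph[String, Int] = {
      val vertices = sc.objectFile[(VertexId, String)](s"$path/vertices")
      val edges = sc.objectFile[Edge[Int]](s"$path/edges")
      Graph(vertices, edges)
    }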

Re: Spark executor on Mesos - how to set effective user id?

2015-10-19 Thread SLiZn Liu
Hi Jerry, I think you are referring to --no-switch_user. =) chiling...@gmail.com> wrote on Monday, 2015-10-19 at 21:05: > Can you try setting SPARK_USER at the driver? It is used to impersonate > users at the executor. So if you have a user set up for launching Spark jobs > on the executor machines, simply se

Re: Spark DataFrame GroupBy into List

2015-10-14 Thread SLiZn Liu
be more specific on `collect_set`? Is it a built-in function or, >> if it is a UDF, how is it defined? >> >> BR, >> Todd Leo >> >> On Wed, Oct 14, 2015 at 2:12 AM Michael Armbrust >> wrote: >> >> import org.apache.spark.sql.functions._ >>

Re: OutOfMemoryError When Reading Many json Files

2015-10-14 Thread SLiZn Liu
ow => (k, v) ) > .combineByKey() > > Deenar > > On 14 October 2015 at 05:18, SLiZn Liu wrote: > >> Hey Spark Users, >> >> I kept getting java.lang.OutOfMemoryError: Java heap space while reading a >> massive number of JSON files iteratively via read.json(). Eve

OutOfMemoryError When Reading Many json Files

2015-10-13 Thread SLiZn Liu
Hey Spark Users, I kept getting java.lang.OutOfMemoryError: Java heap space while reading a massive number of JSON files iteratively via read.json(). Even though the resulting RDD is rather small, I still get the OOM error. The brief structure of my program reads as follows, in pseudo-code: file_path_list.m
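
A minimal sketch of the single-read alternative (the glob pattern below is a placeholder; the point is to avoid one read.json() call per file in a loop):

    import org.apache.spark.sql.SQLContext

    val sqlContext = new SQLContext(sc)    // assumes an existing SparkContext `sc`

    // Looping variant described in the original post (one read.json per file):
    //   val all = filePathList.map(p => sqlContext.read.json(p)).reduce(_ unionAll _)

    // Single-read variant: one glob pattern covers all the files in a single read.
    val all = sqlContext.read.json("hdfs:///data/json/2015/*/*.json")  // placeholder glob
    all.registerTempTable("events")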

Re: Spark DataFrame GroupBy into List

2015-10-13 Thread SLiZn Liu
lUDF("collect_set", df("id")).as("id_list")) > > On Mon, Oct 12, 2015 at 11:08 PM, SLiZn Liu > wrote: > >> Hey Spark users, >> >> I'm trying to group by a dataframe, by appending occurrences into a list >>

Re: Spark DataFrame GroupBy into List

2015-10-13 Thread SLiZn Liu
> > You can always convert the obtained RDD back to a DataFrame after the transformation and reduce. > > > Regards, > Rishitesh Mishra, > SnappyData . (http://www.snappydata.io/) > > > https://www.linkedin.com/profile/view?id=AAIAAAIFdkMB_v-nolCrFH6_pKf9oH6tZD8Qlgo&trk=n

Spark DataFrame GroupBy into List

2015-10-12 Thread SLiZn Liu
Hey Spark users, I'm trying to group a DataFrame, appending occurrences into a list instead of counting them. Let's say we have a DataFrame as shown below:
| category | id |
|----------|:--:|
| A        | 1  |
| A        | 2  |
| B        | 3  |
| B        | 4  |
| C        | 5  |
ideally, after
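
The approach that comes up in the replies above uses Hive's collect_set through callUDF; a minimal sketch on the Spark 1.5-era API, assuming a HiveContext since collect_set is a Hive UDAF (the sample data mirrors the table above):

    import org.apache.spark.sql.functions.callUDF
    import org.apache.spark.sql.hive.HiveContext

    val sqlContext = new HiveContext(sc)    // assumes an existing SparkContext `sc`
    import sqlContext.implicits._

    val df = Seq(("A", 1), ("A", 2), ("B", 3), ("B", 4), ("C", 5)).toDF("category", "id")
    val grouped = df.groupBy("category")
      .agg(callUDF("collect_set", df("id")).as("id_list"))
    grouped.show(false)   // e.g. category A -> [1, 2] (set order is not guaranteed)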

Re: Streaming Receiver Imbalance Problem

2015-09-23 Thread SLiZn Liu
m> wrote: Also, you could switch to the Direct Kafka API, which was first released as > experimental in 1.3. In 1.5 we graduated it from experimental, but it's > quite usable in Spark 1.3.1 > > TD > > On Tue, Sep 22, 2015 at 7:45 PM, SLiZn Liu wrote: > >> Cool, we are s
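
A minimal sketch of that direct stream (spark-streaming-kafka artifact, Spark 1.3+ API; broker addresses, topic names, and the batch interval are placeholders):

    import kafka.serializer.StringDecoder
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka.KafkaUtils

    val conf = new SparkConf().setAppName("DirectKafkaSketch")
    val ssc = new StreamingContext(conf, Seconds(10))

    val kafkaParams = Map("metadata.broker.list" -> "broker1:9092,broker2:9092")  // placeholders
    val topics = Set("my-topic")                                                  // placeholder

    // No long-running receivers to balance: each Kafka partition maps to one RDD partition.
    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, topics)

    stream.map(_._2).count().print()
    ssc.start()
    ssc.awaitTermination()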

Re: Streaming Receiver Imbalance Problem

2015-09-22 Thread SLiZn Liu
-8882 > > On Tue, Sep 22, 2015 at 12:17 AM, SLiZn Liu > wrote: > >> Hi Spark users, >> >> In our Spark Streaming app using the Kafka integration on Mesos, we initialized 3 >> receivers to receive 3 Kafka partitions, but an imbalance in record receiving rate >> has been

Streaming Receiver Imbalance Problem

2015-09-22 Thread SLiZn Liu
Hi Spark users, In our Spark Streaming app using the Kafka integration on Mesos, we initialized 3 receivers to receive 3 Kafka partitions, but an imbalance in record receiving rate has been observed: with spark.streaming.receiver.maxRate set to 120, sometimes one of them receives very close to the limit whil

Re: Can Dependencies Be Resolved on Spark Cluster?

2015-07-01 Thread SLiZn Liu
uot; at "http://some.other.repo2"; > ``` > > call `sbt package`, and then run spark-submit as: > > $ bin/spark-submit --packages org.apache.hbase:hbase:1.1.1, junit:junit:x > --repositories http://some.other.repo,http://some.other.repo2 $YOUR_JAR > > Best, > Bu

Re: Can Dependencies Be Resolved on Spark Cluster?

2015-06-29 Thread SLiZn Liu
On Mon, Jun 29, 2015 at 10:46 PM, SLiZn Liu > wrote: > >> Hey Spark Users, >> >> I'm writing a demo with Spark and HBase. What I've done is package a >> **fat jar**: declare dependencies in `build.sbt` and use `sbt assembly` to >> package **all depen

Can Dependencies Be Resolved on Spark Cluster?

2015-06-29 Thread SLiZn Liu
Hey Spark Users, I'm writing a demo with Spark and HBase. What I've done is package a **fat jar**: declare dependencies in `build.sbt` and use `sbt assembly` to package **all dependencies** into one big jar. The rest of the work is to copy the fat jar to the Spark master node and then launch it with `spark-submit`.

Re: Reading Really Big File Stream from HDFS

2015-06-11 Thread SLiZn Liu
eed not to be in > memory. Give it a try with a high number of partitions. > On 11 Jun 2015 23:09, "SLiZn Liu" wrote: > >> Hi Spark Users, >> >> I'm trying to load a genuinely big file (50 GB when compressed as a gzip >> file, stored in HDFS) by receivi

Reading Really Big File Stream from HDFS

2015-06-11 Thread SLiZn Liu
Hi Spark Users, I'm trying to load a genuinely big file (50 GB when compressed as a gzip file, stored in HDFS) by receiving a DStream using `ssc.textFileStream`, as this file cannot fit in my memory. However, it looks like no RDD will be received until I copy this big file to a prior-specified
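
Two things are worth noting here, with a small sketch below: ssc.textFileStream only picks up files that appear in the monitored directory after the stream starts, and a plain batch read does not require the file to fit in memory; since a .gz file is not splittable it arrives as a single partition and is usually repartitioned, along the lines of the "high number of partitions" suggestion above (the path and partition count are placeholders):

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("BigGzipReadSketch"))

    // A gzip file is read as one partition; repartition to spread the work.
    val lines = sc.textFile("hdfs:///data/big-file.txt.gz")   // placeholder path
      .repartition(400)                                       // placeholder partition count
    println(lines.count())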

Re: DataFrame Column Alias problem

2015-05-22 Thread SLiZn Liu
$"col1").as("c")).show() > > > On Thu, May 21, 2015 at 11:22 PM, SLiZn Liu > wrote: > >> However this returns a single column of c, without showing the original >> col1. >> ​ >> >> On Thu, May 21, 2015 at 11:25 PM Ram Sriharsha >&g

Re: DataFrame Column Alias problem

2015-05-21 Thread SLiZn Liu
However, this returns a single column c, without showing the original col1. On Thu, May 21, 2015 at 11:25 PM Ram Sriharsha wrote: > df.groupBy($"col1").agg(count($"col1").as("c")).show > > On Thu, May 21, 2015 at 3:09 AM, SLiZn Liu wrote: >

DataFrame Column Alias problem

2015-05-21 Thread SLiZn Liu
Hi Spark Users Group, I’m doing groupBy operations on my DataFrame *df* as follows, to get a count for each value of col1: > df.groupBy("col1").agg("col1" -> "count").show // I don't know if I should > write it like this. col1 COUNT(col1#347) aaa 2 bbb 4 ccc 4 ... and more... As I’d li
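
A minimal sketch of the aliasing discussed in the replies (Spark 1.3+ functions API; the sample data is made up):

    import org.apache.spark.sql.SQLContext
    import org.apache.spark.sql.functions.count

    val sqlContext = new SQLContext(sc)    // assumes an existing SparkContext `sc`
    import sqlContext.implicits._

    val df = Seq("aaa", "aaa", "bbb", "ccc").toDF("col1")   // made-up example data
    // The grouping column is kept, and the aggregate is renamed to "c" instead of COUNT(col1#...).
    df.groupBy($"col1").agg(count($"col1").as("c")).show()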

Fwd: value toDF is not a member of RDD object

2015-05-13 Thread SLiZn Liu
RDD object To: SLiZn Liu Are you sure that you are submitting it correctly? Can you post the entire command you are using to run the .jar file via spark-submit? On Wed, May 13, 2015 at 4:07 PM, SLiZn Liu wrote: > No, creating DF using createDataFrame won’t work: > > val

Re: value toDF is not a member of RDD object

2015-05-13 Thread SLiZn Liu
> val schema: StructType = ... > > sqlContext.createDataFrame(rdd, schema) > > > > 2015-05-13 12:00 GMT+02:00 SLiZn Liu : > >> Additionally, after I successfully packaged the code and submitted it via >> spark-submit >> webcat_2.11-1.0.jar, the following error was

Re: value toDF is not a member of RDD object

2015-05-13 Thread SLiZn Liu
documents. What else should I try? Regards, Todd Leo On Wed, May 13, 2015 at 11:27 AM SLiZn Liu wrote: > Thanks folks, I really appreciate all your replies! I tried each of your > suggestions and, in particular, *Animesh*'s second suggestion of *making > the case class definition global* he

Re: value toDF is not a member of RDD object

2015-05-12 Thread SLiZn Liu
, May 12, 2015 at 9:33 AM, Olivier Girardot > wrote: > >> you need to instantiate a SQLContext: >> val sc: SparkContext = ... >> val sqlContext = new SQLContext(sc) >> import sqlContext.implicits._ >> >> On Tue, 12 May 2015 at 12:29, SLiZn Liu wrote: >

Re: value toDF is not a member of RDD object

2015-05-12 Thread SLiZn Liu
> toDF is part of Spark SQL, so you need the Spark SQL dependency plus import > sqlContext.implicits._ to get the toDF method. > > Regards, > > Olivier. > > On Tue, 12 May 2015 at 11:36, SLiZn Liu wrote: > >> Hi User Group, >> >> I'm trying to reproduce t

value toDF is not a member of RDD object

2015-05-12 Thread SLiZn Liu
Hi User Group, I’m trying to reproduce the example in the Spark SQL Programming Guide, and got a compile error when packaging with sbt: [error] myfile.scala:30: value toDF is not a member of org.ap
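
A minimal sketch of the fix that emerges later in the thread: add the Spark SQL dependency, create a SQLContext, import its implicits, and keep the case class at the top level rather than inside a method (the names below are placeholders):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext

    // Defined at the top level, not inside main(), as suggested in the thread.
    case class Record(key: Int, value: String)

    object ToDFSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("ToDFSketch"))
        val sqlContext = new SQLContext(sc)
        import sqlContext.implicits._      // brings toDF on RDDs of case classes into scope

        val df = sc.parallelize(Seq(Record(1, "a"), Record(2, "b"))).toDF()
        df.show()
      }
    }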

OutOfMemoryError when using DataFrame created by Spark SQL

2015-03-25 Thread SLiZn Liu
Hi, I am using *Spark SQL* to query my *Hive cluster*, following the Spark SQL and DataFrame Guide step by step. However, my HiveQL query via sqlContext.sql() fails and a java.lang.OutOfMemoryError is raised. The expected result of such a que
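
A general sketch of the usual first steps for this kind of failure, not necessarily the resolution of this thread: give the driver more heap (e.g. spark-submit --driver-memory 4g, since spark.driver.memory cannot be changed after the JVM starts) and inspect a bounded sample instead of collecting the whole result (the query below is a placeholder):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.hive.HiveContext

    val sc = new SparkContext(new SparkConf().setAppName("HiveQuerySketch"))
    val hiveContext = new HiveContext(sc)

    val result = hiveContext.sql("SELECT col_a, COUNT(*) FROM some_table GROUP BY col_a")  // placeholder query
    result.limit(100).show()   // look at a bounded sample rather than collect()-ing everything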