Re: Benchmark results between Flink and Spark

2015-07-05 Thread Jerry Lam
Hi guys, I just read the paper too. There is not much information regarding why Flink is faster than Spark for data science type of workloads in the benchmark. It is very difficult to generalize the conclusion of a benchmark from my point of view. How much experience the author has with Spark is in

Re: Benchmark results between Flink and Spark

2015-07-14 Thread Jerry Lam
-Bits-and-Bytes.html > > Probably if re-ran the benchmarks with 1.5/tungsten line would close the > gap a bit(or a lot) with spark moving towards similar style off-heap memory > mgmt, more planning optimizations > > > *From:* Jerry Lam [mailto:chiling...@gmail.com ] > *Sent:* Sun

Re: Counting distinct values for a key?

2015-07-19 Thread Jerry Lam
You mean this does not work? SELECT key, count(value) from table group by key On Sun, Jul 19, 2015 at 2:28 PM, N B wrote: > Hello, > > How do I go about performing the equivalent of the following SQL clause in > Spark Streaming? I will be using this on a Windowed DStream. > > SELECT key, coun
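Note that the query above counts all values per key; counting distinct values needs COUNT(DISTINCT value). A minimal sketch, assuming a Spark 1.x shell with a temp table named events (table and column names are illustrative):

```scala
import org.apache.spark.sql.functions.countDistinct

// SQL form of the distinct count per key.
val distinctCounts = sqlContext.sql(
  "SELECT key, COUNT(DISTINCT value) AS distinct_values FROM events GROUP BY key")

// Equivalent DataFrame API form.
val df = sqlContext.table("events")
df.groupBy("key").agg(countDistinct("value").as("distinct_values")).show()
```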

[Spark SQL] Unexpected Behaviour

2016-03-28 Thread Jerry Lam
Hi spark users and developers, I'm using spark 1.5.1 (I have no choice because this is what we used). I ran into some very unexpected behaviour when I did some join operations lately. I cannot post my actual code here and the following code is not practical, but it should demonstrate th

Re: [Spark SQL] Unexpected Behaviour

2016-03-28 Thread Jerry Lam
nk and I do not see any exceptions. > > val d1=base.where($"label" === 0) as("d1") > val d2=base.where($"label" === 1).as("d2") > > d1.join(d2, $"d1.id" === $"d2.id", > "left_outer").drop($"d2.label"
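For readers hitting the same behaviour, a minimal sketch of the rename-before-join workaround discussed in this thread (base is the DataFrame from the snippet above, column names are illustrative, and a later reply notes that even renaming is not reliable in every case):

```scala
import sqlContext.implicits._  // for the $"..." column syntax in the shell

// Rename one side's columns so nothing is ambiguous after the self-join.
val d1 = base.where($"label" === 0)
val d2 = base.where($"label" === 1)
  .withColumnRenamed("id", "id2")
  .withColumnRenamed("label", "label2")

val joined = d1.join(d2, $"id" === $"id2", "left_outer").drop("id2").drop("label2")
```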

Re: [Spark SQL] Unexpected Behaviour

2016-03-28 Thread Jerry Lam
Scan PhysicalRDD[_1#12,_2#13] Again, this is just the tip of the iceberg. I have spent hours tracking down this weird behaviour. Best Regards, Jerry On Tue, Mar 29, 2016 at 2:01 AM, Jerry Lam wrote: > Hi Sunitha, > > Thank you for the reference Jira. It looks l

Re: [Spark SQL] Unexpected Behaviour

2016-03-29 Thread Jerry Lam
n using "key". > // With that assumption, this hack turns the trivially true condition into > equality on join > // keys that are resolved to both sides. > // Trigger analysis so in the case of self-join, the analyzer will clone > the plan. > // After the cloning, left and right

Re: [Spark SQL] Unexpected Behaviour

2016-03-29 Thread Jerry Lam
rong answer though. As I said, this is just the tip of the iceberg. I have experienced worse than this. For example, you might think renaming fields will work, but in some cases it still returns wrong results. Best Regards, Jerry On Tue, Mar 29, 2016 at 7:38 AM, Jerry Lam wrote: > Hi Divya, >

[Spark SQL]: UDF with Array[Double] as input

2016-04-01 Thread Jerry Lam
Hi spark users and developers, Has anyone tried to pass in an Array[Double] as an input to a UDF? I tried it for many hours reading spark sql code but I still couldn't figure out a way to do this. Best Regards, Jerry
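For anyone searching the archive later, a minimal sketch that works for this case in the Spark 1.x Scala API (column names are illustrative). The catch is that an array&lt;double&gt; column arrives in a Scala UDF as a Seq, so the parameter should be declared as Seq[Double] rather than Array[Double]:

```scala
import org.apache.spark.sql.functions.udf

// Sum an array<double> column; the elements arrive as a Seq, not an Array.
val sumArray = udf { (xs: Seq[Double]) => xs.sum }

val withSum = df.withColumn("features_sum", sumArray(df("features")))
```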

Re: ideal number of executors per machine

2015-12-15 Thread Jerry Lam
Hi Veljko, I usually ask the following questions: "how much memory per task?" then "how many cpu per task?" Then I calculate based on the memory and cpu requirements per task. You might be surprised (maybe not you, but at least I am :) ) that many OOM issues are actually because of this. Best

Re: SparkSQL integration issue with AWS S3a

2015-12-30 Thread Jerry Lam
Hi Kostiantyn, Can you define those properties in hdfs-site.xml and make sure it is visible in the class path when you spark-submit? It looks like a conf sourcing issue to me. Cheers, Sent from my iPhone > On 30 Dec, 2015, at 1:59 pm, KOSTIANTYN Kudriavtsev > wrote: > > Chris, > > thanks

Re: SparkSQL integration issue with AWS S3a

2015-12-30 Thread Jerry Lam
Kudryavtsev > >> On Wed, Dec 30, 2015 at 2:10 PM, Jerry Lam wrote: >> Hi Kostiantyn, >> >> Can you define those properties in hdfs-site.xml and make sure it is visible >> in the class path when you spark-submit? It looks like a conf sourcing issue >> to me

Re: SparkSQL integration issue with AWS S3a

2016-01-01 Thread Jerry Lam
> what you suggested looks to be working (I put hdfs-site.xml into > $SPARK_HOME/conf folder), but could you shed some light on how it can be > federated per user? > Thanks in advance! > > Thank you, > Konstantin Kudryavtsev > >> On Wed, Dec 30, 2015 at 2:37 P

Re: SparkSQL integration issue with AWS S3a

2016-01-06 Thread Jerry Lam
>> On Jan 2, 2016, at 11:13 AM, KOSTIANTYN Kudriavtsev >> wrote: >> >> thanks Jerry, it works! >> really appreciate your help >> >> Thank you, >> Konstantin Kudryavtsev >> >>> On Fri, Jan 1, 2016 at 4:35 PM, Jerry Lam wrote: >>> Hi K

[Spark SQL]: Issues with writing dataframe with Append Mode to Parquet

2016-01-12 Thread Jerry Lam
Hi spark users and developers, I wonder if the following observed behaviour is expected. I'm writing dataframe to parquet into s3. I'm using append mode when I'm writing to it. Since I'm using org.apache.spark.sql. parquet.DirectParquetOutputCommitter as the spark.sql.parquet.output.committer.clas
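For context, the setup described above looks roughly like the sketch below (paths are illustrative; as the reply that follows notes, Spark disables the direct committer automatically in some cases because it can lose data when speculation is on):

```scala
// The committer class named in this thread, set on the SQLContext.
sqlContext.setConf(
  "spark.sql.parquet.output.committer.class",
  "org.apache.spark.sql.parquet.DirectParquetOutputCommitter")

df.write.mode("append").parquet("s3a://my-bucket/events/")
```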

Re: [Spark SQL]: Issues with writing dataframe with Append Mode to Parquet

2016-01-12 Thread Jerry Lam
ere can be dataloss when you are using the DirectOutputCommitter and > speculation is turned on, so we disable it automatically. > > On Tue, Jan 12, 2016 at 1:11 PM, Jerry Lam wrote: > >> Hi spark users and developers, >> >> I wonder if the following observed b

Re: DataFrameWriter on partitionBy for parquet eat all RAM

2016-01-14 Thread Jerry Lam
Hi Arkadiusz, partitionBy is not designed to handle many distinct values, at least the last time I used it. If you search the mailing list, I think there are a couple of people who also faced similar issues. For example, in my case, it won't work over a million distinct user ids. It will require a lot of memor
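One common workaround (not from this thread, just an illustration under assumed column names) is to partition by a bounded hash bucket derived from the high-cardinality key instead of the key itself, which keeps the number of output directories small:

```scala
import org.apache.spark.sql.functions.{col, crc32, lit, pmod}

// Roughly 256 partition directories instead of one per distinct user id.
val bucketed = df.withColumn(
  "bucket", pmod(crc32(col("user_id").cast("string")), lit(256)))

bucketed.write.partitionBy("bucket").parquet("/tmp/output")
```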

Re: DataFrameWriter on partitionBy for parquet eat all RAM

2016-01-15 Thread Jerry Lam
issues.apache.org/jira/browse/SPARK-12546 > >> On Thu, Jan 14, 2016 at 6:46 PM, Jerry Lam wrote: >> Hi Arkadiusz, >> >> the partitionBy is not designed to have many distinct value the last time I >> used it. If you search in the mailing list, I think there are cou

Re: How To Save TF-IDF Model In PySpark

2016-01-15 Thread Jerry Lam
Can you save it to parquet with the vector in one field? Sent from my iPhone > On 15 Jan, 2016, at 7:33 pm, Andy Davidson > wrote: > > Are you using 1.6.0 or an older version? > > I think I remember something in 1.5.1 saying save was not implemented in > python. > > > The current doc does

[Spark-SQL] from_unixtime with user-specified timezone

2016-01-18 Thread Jerry Lam
Hi spark users and developers, what do you do if you want the from_unixtime function in spark sql to render the time in the timezone you want instead of the system timezone? Best Regards, Jerry

Re: [Spark-SQL] from_unixtime with user-specified timezone

2016-01-18 Thread Jerry Lam
nk I will make a UDF if this is the only way out of the box. Thanks! Jerry On Mon, Jan 18, 2016 at 2:32 PM, Alexander Pivovarov wrote: > Look at > to_utc_timestamp > > from_utc_timestamp > On Jan 18, 2016 9:39 AM, "Jerry Lam" wrote: > >> Hi spark user
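A minimal sketch of that built-in route, assuming the cluster's default time zone is UTC (column name and target zone are illustrative):

```scala
import org.apache.spark.sql.functions.{from_unixtime, from_utc_timestamp}

// Render the epoch seconds, then shift the result into the desired zone.
val withLocal = df.withColumn(
  "ts_toronto",
  from_utc_timestamp(from_unixtime(df("epoch_seconds")), "America/Toronto"))
```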

Re: sqlContext.cacheTable("tableName") vs dataFrame.cache()

2016-01-19 Thread Jerry Lam
Is cacheTable similar to asTempTable before? Sent from my iPhone > On 19 Jan, 2016, at 4:18 am, George Sigletos wrote: > > Thanks Kevin for your reply. > > I was suspecting the same thing as well, although it still does not make much > sense to me why would you need to do both: > myData.cach

Re: Spark, Mesos, Docker and S3

2016-01-26 Thread Jerry Lam
Hi Mao, Can you try --jars to include those jars? Best Regards, Jerry Sent from my iPhone > On 26 Jan, 2016, at 7:02 pm, Mao Geng wrote: > > Hi there, > > I am trying to run Spark on Mesos using a Docker image as executor, as > mentioned > http://spark.apache.org/docs/latest/running-on-m

Re: Spark DataFrame Catalyst - Another Oracle like query optimizer?

2016-02-02 Thread Jerry Lam
I think spark dataframe supports more than just SQL. It is more like a pandas dataframe. (I rarely use the SQL feature.) There are a lot of novelties in dataframe so I think it is quite optimized for many tasks. The in-memory data structure is very memory efficient. I just changed a very slow RDD prog

Re: Spark DataFrame Catalyst - Another Oracle like query optimizer?

2016-02-02 Thread Jerry Lam
Hi Michael, Is there a section in the spark documentation that demonstrates how to serialize arbitrary objects in a Dataframe? The last time I did it, I used a User Defined Type (copied from VectorUDT). Best Regards, Jerry On Tue, Feb 2, 2016 at 8:46 PM, Michael Armbrust wrote: > A principal differen

Re: Spark DataFrame Catalyst - Another Oracle like query optimizer?

2016-02-02 Thread Jerry Lam
e I have to > convert all my existing transformation to things like > df1.join(df2,df1('abc') == df2('abc'), 'left_outer') .. that's plain ugly > and error prone in my opinion. > > On Tue, Feb 2, 2016 at 5:49 PM, Jerry Lam wrote: > >> Hi M

Union of RDDs without the overhead of Union

2016-02-02 Thread Jerry Lam
Hi Spark users and developers, does anyone know how to union two RDDs without the overhead of it? Say rdd1.union(rdd2).saveAsTextFile(..) This requires a stage to union the 2 rdds before saveAsTextFile (2 stages). Is there a way to skip the union step but have the contents of the two rdds save to the s

Re: Spark DataFrame Catalyst - Another Oracle like query optimizer?

2016-02-03 Thread Jerry Lam
j("field1")' instead of 'caseObj.field1'; >> more over multiple tables having similar column names causing parsing >> issues; and when you start writing constants for your columns it just become >> another schema maintenance inside your app. It feels like th

Re: Spark 1.5.2 memory error

2016-02-03 Thread Jerry Lam
Hi guys, I was processing 300GB of data with a lot of joins today. I have a combination of RDD->Dataframe->RDD due to legacy code. I had memory issues at the beginning. After fine-tuning the configurations that many have already suggested above, it works with 0 tasks failed. I think it is fair to say any

Re: Convert Iterable to RDD

2016-02-12 Thread Jerry Lam
Not sure if I understand your problem well but why don't you create the file locally and then upload to hdfs? Sent from my iPhone > On 12 Feb, 2016, at 9:09 am, "seb.arzt" wrote: > > I have an Iterator of several million elements, which unfortunately won't fit > into the driver memory at the s

Re: Streaming with broadcast joins

2016-02-19 Thread Jerry Lam
Hi guys, I also encountered the broadcast dataframe issue, not for streaming jobs but for a regular dataframe join. In my case, the executors died, probably due to OOM, although I don't think it should use that much memory. Anyway, I'm going to craft an example and send it here to see if it is a bug or something

Pattern Matching over a Sequence of rows using Spark

2016-02-28 Thread Jerry Lam
Hi spark users and developers, Does anyone have experience developing pattern matching over a sequence of rows using Spark? I'm talking about functionality similar to matchpath in Hive or match_recognize in Oracle DB. It is used for path analysis on clickstream data. If you know of any libraries that do
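There is no built-in match_recognize, but a rough, illustrative sketch of the usual do-it-yourself approach (user_id, ts and event are assumed column names) is to order each user's events and run a plain regex over the concatenated path:

```scala
// Group clickstream rows per user, sort by time, and regex-match the sequence.
val paths = df.select("user_id", "ts", "event").rdd
  .map(r => (r.getString(0), (r.getLong(1), r.getString(2))))
  .groupByKey()
  .mapValues(evts => evts.toSeq.sortBy(_._1).map(_._2).mkString(">"))

// e.g. users who searched and later purchased within the captured window
val matched = paths.filter { case (_, path) => path.matches(".*search>.*purchase.*") }
```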

Spark SQL: Native Support for LATERAL VIEW EXPLODE

2015-09-25 Thread Jerry Lam
Hi sparkers, Does anyone know how to do LATERAL VIEW EXPLODE without HiveContext? I don't want to start up a metastore and derby just because I need LATERAL VIEW EXPLODE. I have been trying but I always get an exception like this: Name: java.lang.RuntimeException Message: [1.68] failure: ``union''

Re: Spark SQL: Native Support for LATERAL VIEW EXPLODE

2015-09-26 Thread Jerry Lam
k.sql.functions._ > table("purchases").select(explode(df("purchase_items")).as("item")) > > > > On Fri, Sep 25, 2015 at 4:21 PM, Jerry Lam wrote: > >> Hi sparkers, >> >> Anyone knows how to do LATERAL VIEW EXPLODE without HiveContext? >
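Spelled out a little more, the suggestion in this reply needs only a plain SQLContext (names are illustrative):

```scala
import org.apache.spark.sql.functions.explode

val purchases = sqlContext.read.parquet("purchases.parquet")

// One output row per element of the purchase_items array, i.e. the same
// effect as LATERAL VIEW EXPLODE, without a HiveContext or metastore.
val items = purchases.select(
  purchases("customer_id"),
  explode(purchases("purchase_items")).as("item"))
```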

Spark SQL: Implementing Custom Data Source

2015-09-28 Thread Jerry Lam
Hi spark users and developers, I'm trying to learn how to implement a custom data source for Spark SQL. Is there documentation that I can use as a reference? I'm not sure exactly what needs to be extended/implemented. A general workflow would be greatly helpful! Best Regards, Jerry
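For the record, a bare-bones, illustrative sketch of the Spark 1.x external data source API (the class names and schema below are made up; a real source would read its options, such as a path, from the parameters map):

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.sources.{BaseRelation, RelationProvider, TableScan}
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

// Entry point discovered via sqlContext.read.format("my.pkg").load().
class DefaultSource extends RelationProvider {
  override def createRelation(
      sqlContext: SQLContext,
      parameters: Map[String, String]): BaseRelation =
    new DummyRelation(sqlContext)
}

// A relation only needs a schema plus a way to produce Rows.
class DummyRelation(val sqlContext: SQLContext) extends BaseRelation with TableScan {
  override def schema: StructType = StructType(Seq(
    StructField("id", IntegerType),
    StructField("name", StringType)))

  override def buildScan(): RDD[Row] =
    sqlContext.sparkContext.parallelize(Seq(Row(1, "a"), Row(2, "b")))
}
```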

Re: Spark SQL: Implementing Custom Data Source

2015-09-29 Thread Jerry Lam
hread: >> >> http://search-hadoop.com/m/q3RTttmiYDqGc202 >> >> And: >> >> >> http://spark.apache.org/docs/latest/sql-programming-guide.html#data-sources >> >> On Sep 28, 2015, at 8:22 PM, Jerry Lam wrote: >> >> Hi spark users and developer

spark-submit --packages using different resolver

2015-10-01 Thread Jerry Lam
Hi spark users and developers, I'm trying to use spark-submit --packages against a private s3 repository. With sbt, I'm using fm-sbt-s3-resolver with proper aws s3 credentials. I wonder how I can add this resolver to spark-submit so that --packages can resolve dependencies from the private repo? Th

Re: Limiting number of cores per job in multi-threaded driver.

2015-10-04 Thread Jerry Lam
Philip, the guy is trying to help you. Calling him silly is a bit too far. He might assume your problem is IO bound, which might not be the case. If you need only 4 cores per job no matter what, there is little advantage to using spark in my opinion, because you can easily do this with just a worker

Re: spark-submit --packages using different resolver

2015-10-06 Thread Jerry Lam
e try using > the --repositories flag and provide the address: > `$ spark-submit --packages my:awesome:package --repositories > s3n://$aws_ak:$aws_sak@bucket/path/to/repo` > > If that doesn't work, could you please file a JIRA? > > Best, > Burak > > > On Thu, Oct 1,

Re: spark-submit --packages using different resolver

2015-10-06 Thread Jerry Lam
This is the ticket SPARK-10951 <https://issues.apache.org/jira/browse/SPARK-10951> Cheers~ On Tue, Oct 6, 2015 at 11:33 AM, Jerry Lam wrote: > Hi Burak, > > Thank you for the tip. > Unfortunately it does not work. It throws: > > java.net.MalformedURLException: unk

Re: Dataframes - sole data structure for parallel computations?

2015-10-08 Thread Jerry Lam
I just read the article by ogirardot but I don't agree. It is like saying pandas dataframe is the sole data structure for analyzing data in python. Can Pandas dataframe replace Numpy array? The answer is simply no from an efficiency perspective for some computations. Unless there is a computer s

Re: Indexing Support

2015-10-18 Thread Jerry Lam
I'm interested in it but I doubt there is r-tree indexing support in the near future as spark is not a database. You might have better luck looking at databases with spatial indexing support out of the box. Cheers Sent from my iPad On 2015-10-18, at 17:16, Mustafa Elbehery wrote: > Hi All

Re: Spark executor on Mesos - how to set effective user id?

2015-10-19 Thread Jerry Lam
Can you try setting SPARK_USER at the driver? It is used to impersonate users at the executor. So if you have a user set up for launching spark jobs on the executor machines, simply set it to that user name for SPARK_USER. There is another configuration that will prevent jobs from being launched with a

Spark SQL: Preserving Dataframe Schema

2015-10-20 Thread Jerry Lam
Hi Spark users and developers, I have a dataframe with the following schema (Spark 1.5.1): StructType(StructField(type,StringType,true), StructField(timestamp,LongType,false)) After I save the dataframe in parquet and read it back, I get the following schema: StructType(StructField(timestamp,Lo
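(The message goes on to show the non-nullable LongType coming back nullable.) One workaround sometimes used, not necessarily the thread's conclusion: capture the schema before writing and re-apply it on read (path illustrative; whether nullability is fully honoured depends on the version):

```scala
// Keep the original schema and impose it again when reading back.
val originalSchema = df.schema
df.write.parquet("/tmp/events.parquet")

val restored = sqlContext.read.schema(originalSchema).parquet("/tmp/events.parquet")
println(restored.schema)
```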

Spark SQL: Issues with using DirectParquetOutputCommitter with APPEND mode and OVERWRITE mode

2015-10-22 Thread Jerry Lam
Hi Spark users and developers, I read the ticket [SPARK-8578] (Should ignore user defined output committer when appending data) which ignore DirectParquetOutputCommitter if append mode is selected. The logic was that it is unsafe to use because it is not possible to revert a failed job in append m

[Spark SQL]: Spark Job Hangs on the refresh method when saving over 1 million files

2015-10-24 Thread Jerry Lam
Hi Spark users and developers, Does anyone encounter an issue where a spark SQL job that produces a lot of files (over 1 million) hangs on the refresh method? I'm using spark 1.5.1. Below is the stack trace. I saw the parquet files being produced, but the driver is doing something very intensive

Re: [Spark SQL]: Spark Job Hangs on the refresh method when saving over 1 million files

2015-10-25 Thread Jerry Lam
ate over 1 million files. Not sure why it OOM the driver after the job is marked _SUCCESS in the output folder. Best Regards, Jerry On Sat, Oct 24, 2015 at 9:35 PM, Jerry Lam wrote: > Hi Spark users and developers, > > Does anyone encounter any issue when a spark SQL job produces a lot

Re: [Spark SQL]: Spark Job Hangs on the refresh method when saving over 1 million files

2015-10-25 Thread Jerry Lam
parameters to make it more memory efficient? Best Regards, Jerry On Sun, Oct 25, 2015 at 8:39 PM, Jerry Lam wrote: > Hi guys, > > After waiting for a day, it actually causes OOM on the spark driver. I > configure the driver to have 6GB. Note that I didn't call refresh myself. > T

Re: [Spark SQL]: Spark Job Hangs on the refresh method when saving over 1 million files

2015-10-25 Thread Jerry Lam
ns of driver memory via the > OutputCommitCoordinator's bookkeeping data structures. > > On Sun, Oct 25, 2015 at 5:50 PM, Jerry Lam wrote: > >> Hi spark guys, >> >> I think I hit the same issue SPARK-8890 >> https://issues.apache.org/jira/browse/SPARK-8890. I

Re: [Spark SQL]: Spark Job Hangs on the refresh method when saving over 1 million files

2015-10-25 Thread Jerry Lam
) org.apache.spark.sql.execution.datasources.LogicalRelation.(LogicalRelation.scala:31) org.apache.spark.sql.SQLContext.baseRelationToDataFrame(SQLContext.scala:395) org.apache.spark.sql.DataFrameReader.parquet(DataFrameReader.scala:267) On Sun, Oct 25, 2015 at 10:25 PM, Jerry Lam wrote: > Hi Josh, > > No I don't have speculation enabl

Re: [Spark SQL]: Spark Job Hangs on the refresh method when saving over 1 million files

2015-10-26 Thread Jerry Lam
partition is over 100. Best Regards, Jerry Sent from my iPhone > On 26 Oct, 2015, at 2:50 am, Fengdong Yu wrote: > > How many partitions you generated? > if Millions generated, then there is a huge memory consumed. > > > > > >> On Oct 26, 2015, at 10:5

[Spark-SQL]: Unable to propagate hadoop configuration after SparkContext is initialized

2015-10-27 Thread Jerry Lam
Hi Spark users and developers, Has anyone experienced issues in setting hadoop configurations after SparkContext is initialized? I'm using Spark 1.5.1. I'm trying to use s3a, which requires access and secret keys set in the hadoop configuration. I tried to set the properties in the hadoop configuration f

Re: [Spark-SQL]: Unable to propagate hadoop configuration after SparkContext is initialized

2015-10-27 Thread Jerry Lam
t 2:05 PM, Marcelo Vanzin wrote: > On Tue, Oct 27, 2015 at 10:43 AM, Jerry Lam wrote: > > Anyone experiences issues in setting hadoop configurations after > > SparkContext is initialized? I'm using Spark 1.5.1. > > > > I'm trying to use s3a which requires access

Re: [Spark-SQL]: Unable to propagate hadoop configuration after SparkContext is initialized

2015-10-27 Thread Jerry Lam
> code. > > On Tue, Oct 27, 2015 at 11:22 AM, Jerry Lam wrote: > > Hi Marcelo, > > > > Thanks for the advice. I understand that we could set the configurations > > before creating SparkContext. My question is > > SparkContext.hadoopConfiguration.set("ke
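For anyone landing here from a search, the "set it before creating SparkContext" route from this thread can also be expressed through the spark.hadoop.* prefix, which Spark copies into the Hadoop Configuration on every node (a sketch; keys are the usual s3a ones, credentials are placeholders):

```scala
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("s3a-example")
  .set("spark.hadoop.fs.s3a.access.key", "<ACCESS_KEY>")
  .set("spark.hadoop.fs.s3a.secret.key", "<SECRET_KEY>")

val sc = new SparkContext(conf)
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val df = sqlContext.read.parquet("s3a://my-bucket/data/")
```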

Re: Spark -- Writing to Partitioned Persistent Table

2015-10-28 Thread Jerry Lam
Hi Bryan, Did you read the email I sent a few days ago? There are more issues with partitionBy down the road: https://www.mail-archive.com/user@spark.apache.org/msg39512.html Best Regards, Jerry > On Oct 28, 2015, at 4:52 PM, B

Re: Spark -- Writing to Partitioned Persistent Table

2015-10-28 Thread Jerry Lam
or the note. It sounds like you were able to get further than I > have been - any insight? Just a Spark 1.4.1 vs Spark 1.5? > > Regards, > > Bryan Jeffrey > From: Jerry Lam > Sent: 10/28/2015 6:29 PM > To: Bryan Jeffrey > Cc: Susan Zhang; user > Subject: Re: Spa

Re: spark sql partitioned by date... read last date

2015-11-01 Thread Jerry Lam
Hi Koert, If the partitioned table is implemented properly, I would think "select distinct(date) as dt from table order by dt DESC limit 1" would return the latest dates without scanning the whole dataset. I haven't tried it myself. It would be great if you can report back whether this actually work
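A quick way to check the claim, as a sketch (table name is illustrative; the partition column is assumed to be called date as in the thread): run the query and inspect the physical plan for partition pruning before trusting it on a large table.

```scala
val latest = sqlContext.sql(
  "SELECT DISTINCT(date) AS dt FROM events ORDER BY dt DESC LIMIT 1")

latest.explain(true)   // the scan should only touch the pruned partitions
latest.show()
```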

Re: spark sql partitioned by date... read last date

2015-11-01 Thread Jerry Lam
, Nov 1, 2015 at 4:08 PM, Koert Kuipers wrote: > >> it seems to work but i am not sure if its not scanning the whole dataset. >> let me dig into tasks a a bit >> >> On Sun, Nov 1, 2015 at 3:18 PM, Jerry Lam wrote: >> >>> Hi Koert, >>> >>> If

Re: spark sql partitioned by date... read last date

2015-11-01 Thread Jerry Lam
to be faster though. > > On Sun, Nov 1, 2015 at 4:36 PM, Jerry Lam wrote: > >> Hi Koert, >> >> You should be able to see if it requires scanning the whole data by >> "explain" the query. The physical plan should say something about it. I >> wonder

Re: spark sql partitioned by date... read last date

2015-11-01 Thread Jerry Lam
physical plan, you can navigate the actual execution in the web UI to see how much data is actually read to satisfy this request. I hope it only requires a few bytes for a few dates. Best Regards, Jerry On Sun, Nov 1, 2015 at 5:56 PM, Jerry Lam wrote: > I agreed the max date will satisfy

Re: Please reply if you use Mesos fine grained mode

2015-11-03 Thread Jerry Lam
We "used" Spark on Mesos to build interactive data analysis platform because the interactive session could be long and might not use Spark for the entire session. It is very wasteful of resources if we used the coarse-grained mode because it keeps resource for the entire session. Therefore, fine-gr

Re: Spark EC2 script on Large clusters

2015-11-05 Thread Jerry Lam
Does Qubole use Yarn or Mesos for resource management? Sent from my iPhone > On 5 Nov, 2015, at 9:02 pm, Sabarish Sasidharan > wrote: > > Qubole - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional comma

[Spark-SQL]: Disable HiveContext from instantiating in spark-shell

2015-11-06 Thread Jerry Lam
Hi spark users and developers, Is it possible to disable HiveContext from being instantiated when using spark-shell? I got the following errors when more than one session starts. Since I don't use HiveContext, it would be great if I could have more than one spark-shell running at the same time.

Re: [Spark-SQL]: Disable HiveContext from instantiating in spark-shell

2015-11-06 Thread Jerry Lam
't see config of skipping the above call. > > FYI > > On Fri, Nov 6, 2015 at 8:53 AM, Jerry Lam <mailto:chiling...@gmail.com>> wrote: > Hi spark users and developers, > > Is it possible to disable HiveContext from being instantiated when using > spark-shell? I got

Re: [Spark-SQL]: Disable HiveContext from instantiating in spark-shell

2015-11-06 Thread Jerry Lam
e above call. > > FYI > > On Fri, Nov 6, 2015 at 8:53 AM, Jerry Lam <mailto:chiling...@gmail.com>> wrote: > Hi spark users and developers, > > Is it possible to disable HiveContext from being instantiated when using > spark-shell? I got the following errors when

Re: [Spark-SQL]: Disable HiveContext from instantiating in spark-shell

2015-11-06 Thread Jerry Lam
by HiveContext? > > Thanks. > > Zhan Zhang > > On Nov 6, 2015, at 10:43 AM, Jerry Lam <mailto:chiling...@gmail.com>> wrote: > >> What is interesting is that pyspark shell works fine with multiple session >> in the same host even though multiple HiveContext ha

Re: [Spark-SQL]: Disable HiveContext from instantiating in spark-shell

2015-11-06 Thread Jerry Lam
> > > > hive.metastore.uris > thrift://zzhang-yarn11:9083 > > > > > HW11188:spark zzhang$ > > By the way, I don't know whether there is any caveat for this workaround. > > Thanks. > > Zhan Zhang > > > >

Re: Very slow startup for jobs containing millions of tasks

2015-11-14 Thread Jerry Lam
ixes such as SPARK-9952 > > Cheers > >> On Sat, Nov 14, 2015 at 6:35 PM, Jerry Lam wrote: >> Hi spark users and developers, >> >> Has anyone experienced the slow startup of a job when it contains a stage >> with over 4 million tasks? >> The job has

Re: spark-ec2 vs. EMR

2015-12-01 Thread Jerry Lam
Simply put: EMR = Hadoop Ecosystem (Yarn, HDFS, etc) + Spark + EMRFS + Amazon EMR API + Selected Instance Types + Amazon EC2 Friendly (bootstrapping) spark-ec2 = HDFS + Yarn (Optional) + Spark (Standalone Default) + Any Instance Type I use spark-ec2 for prototyping and I have never used it for p

Re: spark-ec2 vs. EMR

2015-12-02 Thread Jerry Lam
>> 10. I think Amazon will include spark-jobserver to EMR soon. >>> >>> 11. You do not need to be aws expert to start EMR cluster. Users can use >>> EMR web ui to start cluster to run some jobs or work in Zeppelin during the >>> day >>> >>>

Re: df.partitionBy().parquet() java.lang.OutOfMemoryError: GC overhead limit exceeded

2015-12-02 Thread Jerry Lam
Hi Don, It sounds familiar to this: https://mail-archives.apache.org/mod_mbox/spark-user/201510.mbox/%3CCAG+ckK9L=htfyrwx3ux2oeqjjkyukkpmxjq+tns1xrwh-ff...@mail.gmail.com%3E

Re: Counting distinct values for a key?

2015-07-19 Thread Jerry Lam
val2")); >> >> list.add(new Tuple2("key2", "val22")); >> >> >> >> JavaPairRDD rdd = jsc.parallelize(list).mapToPair(t -> new >> Tuple2(t._1 + " " +t._2, 1)); >> >> >> JavaPairRDD rdd2 = rd

Re: Spark Mesos Dispatcher

2015-07-19 Thread Jerry Lam
Yes. Sent from my iPhone > On 19 Jul, 2015, at 10:52 pm, "Jahagirdar, Madhu" > wrote: > > All, > > Can we run different version of Spark using the same Mesos Dispatcher. For > example we can run drivers with Spark 1.3 and Spark 1.4 at the same time ? > > Regards, > Madhu Jahagirdar > > Th

Re: Spark Mesos Dispatcher

2015-07-19 Thread Jerry Lam
sing that dispatcher ? > ---------- > *From:* Jerry Lam [chiling...@gmail.com] > *Sent:* Monday, July 20, 2015 8:27 AM > *To:* Jahagirdar, Madhu > *Cc:* user; d...@spark.apache.org > *Subject:* Re: Spark Mesos Dispatcher > > Yes. > > Sent from my iPhone > > On 19 Jul,

Re: Parquet problems

2015-07-22 Thread Jerry Lam
Hi guys, I noticed that too. Anders, can you confirm that it works on the Spark 1.5 snapshot? This is what I tried in the end. It seems to be a 1.4 issue. Best Regards, Jerry On Wed, Jul 22, 2015 at 11:46 AM, Anders Arpteg wrote: > No, never really resolved the problem, except by increasing the per

Re: Partition parquet data by ENUM column

2015-07-23 Thread Jerry Lam
Hi Cheng, I ran into issues related to ENUM when I tried to use Filter push down. I'm using Spark 1.5.0 (which contains fixes for parquet filter push down). The exception is the following: java.lang.IllegalArgumentException: FilterPredicate column: item's declared type (org.apache.parquet.io.api.

Unexpected performance issues with Spark SQL using Parquet

2015-07-27 Thread Jerry Lam
Hi spark users and developers, I have been trying to understand how Spark SQL works with Parquet for the past couple of days. There is an unexpected performance problem when using column pruning. Here is a dummy example: The parquet file has the 3 fields: |-- customer_id: string (nullable = tr
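When chasing this kind of problem, the first sanity check is usually whether the pruned columns actually disappear from the Parquet scan in the physical plan (a generic sketch; path and column name are illustrative):

```scala
val df = sqlContext.read.parquet("/path/to/parquet")
val pruned = df.select("customer_id")

pruned.explain(true)   // the Parquet scan should list only customer_id
```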

Spark Master Build Git Commit Hash

2015-07-30 Thread Jerry Lam
Hi Spark users and developers, I wonder which git commit was used to build the latest master-nightly build found at: http://people.apache.org/~pwendell/spark-nightly/spark-master-bin/latest/? I downloaded the build but I couldn't find the information related to it. Thank you! Best Regards, Jerry

Re: Spark Master Build Git Commit Hash

2015-07-30 Thread Jerry Lam
-search for the > commits made on Jul 16th. > There may be other ways of determining the latest commit. > > Cheers > > On Thu, Jul 30, 2015 at 7:39 AM, Jerry Lam wrote: > >> Hi Spark users and developers, >> >> I wonder which git commit was used to build the

Poor HDFS Data Locality on Spark-EC2

2015-08-04 Thread Jerry Lam
Hi Spark users and developers, I have been trying to use spark-ec2. After I launched the spark cluster (1.4.1) with ephemeral hdfs (using hadoop 2.4.0), I tried to execute a job where the data is stored in the ephemeral hdfs. It does not matter what I tried to do, there is no data locality at all.

Re: Accessing S3 files with s3n://

2015-08-09 Thread Jerry Lam
Hi Akshat, Is there a particular reason you don't use s3a? From my experience, s3a performs much better than the rest. I believe the inefficiency is from the implementation of the s3 interface. Best Regards, Jerry Sent from my iPhone > On 9 Aug, 2015, at 5:48 am, Akhil Das wrote: > > Depend

Re: Controlling number of executors on Mesos vs YARN

2015-08-11 Thread Jerry Lam
My experience with Mesos + Spark is not great. I saw one executor with 30 CPU and the other executor with 6. So I don't think you can easily configure it without some tweaking at the source code. Sent from my iPad On 2015-08-11, at 2:38, Haripriya Ayyalasomayajula wrote: > Hi Tim, > > Spark

Re: Parquet without hadoop: Possible?

2015-08-11 Thread Jerry Lam
Just out of curiosity, what is the advantage of using parquet without hadoop? Sent from my iPhone > On 11 Aug, 2015, at 11:12 am, wrote: > > I confirm that it works, > > I was just having this issue: https://issues.apache.org/jira/browse/SPARK-8450 > > Saif > > From: Ellafi, Saif A. > S

Re: Controlling number of executors on Mesos vs YARN

2015-08-12 Thread Jerry Lam
omayajula > wrote: > Spark evolved as an example framework for Mesos - thats how I know it. It is > surprising to see that the options provided by mesos in this case are less. > Tweaking the source code, haven't done it yet but I would love to see what > options could be there!

Re: [survey] [spark-ec2] What do you like/dislike about spark-ec2?

2015-08-17 Thread Jerry Lam
Hi Nick, I forgot to mention in the survey that ganglia is never installed properly for some reason. I have this exception every time I launch the cluster: Starting httpd: httpd: Syntax error on line 154 of /etc/httpd/conf/httpd.conf: Cannot load /etc/httpd/modules/mod_authz_core.so into serv

Spark + Jupyter (IPython Notebook)

2015-08-18 Thread Jerry Lam
Hi spark users and developers, Has anyone deployed IPython Notebook (Jupyter) in production using Spark as the computational engine? I know Databricks Cloud provides similar features with deeper integration with Spark. However, Databricks Cloud has to be hosted by Databricks so we cannot

Re: Spark + Jupyter (IPython Notebook)

2015-08-18 Thread Jerry Lam
ing this for one of my projects on a cluster as well. > > Also, here is a blog that describes how to configure this. > > > http://blog.cloudera.com/blog/2014/08/how-to-use-ipython-notebook-with-apache-spark/ > > > Guru Medasani > gdm...@gmail.com > > > > On Aug 1

Re: Spark + Jupyter (IPython Notebook)

2015-08-18 Thread Jerry Lam
Hi Prabeesh, That's even better! Thanks for sharing Jerry On Tue, Aug 18, 2015 at 1:31 PM, Prabeesh K. wrote: > Refer this post > http://blog.prabeeshk.com/blog/2015/06/19/pyspark-notebook-with-docker/ > > Spark + Jupyter + Docker > > On 18 August 2015 at 21:29, Je

Re: Java vs. Scala for Spark

2015-09-08 Thread Jerry Lam
Hi Bryan, I would choose a language based on the requirements. It does not make sense if you have a lot of dependencies that are java-based components and interoperability between java and scala is not always obvious. I agree with the above comments that Java is much more verbose than Scala in ma

Spark SQL DataFrame 1.5.0 is extremely slow for take(1) or head() or first()

2015-09-21 Thread Jerry Lam
Hi Spark Developers, I just ran some very simple operations on a dataset. I was surprised by the execution plan of take(1), head() or first(). For your reference, this is what I did in pyspark 1.5: df=sqlContext.read.parquet("someparquetfiles") df.head() The above lines take over 15 minutes. I wa

Re: Spark SQL DataFrame 1.5.0 is extremely slow for take(1) or head() or first()

2015-09-21 Thread Jerry Lam
>> Hi Jerry, >> >> Looks like it is a Python-specific issue. Can you create a JIRA? >> >> Thanks, >> >> Yin >> >> On Mon, Sep 21, 2015 at 8:56 AM, Jerry Lam wrote: >> >>> Hi Spark Developers, >>> >>> I just

Re: Spark SQL DataFrame 1.5.0 is extremely slow for take(1) or head() or first()

2015-09-21 Thread Jerry Lam
I just noticed you found 1.4 has the same issue. I added that as well in the ticket. On Mon, Sep 21, 2015 at 1:43 PM, Jerry Lam wrote: > Hi Yin, > > You are right! I just tried the scala version with the above lines, it > works as expected. > I'm not sure if it happens als

Re: How does one use s3 for checkpointing?

2015-09-21 Thread Jerry Lam
Hi Amit, Have you looked at Amazon EMR? Most people using EMR use s3 for persistency (both as input and output of spark jobs). Best Regards, Jerry Sent from my iPhone > On 21 Sep, 2015, at 9:24 pm, Amit Ramesh wrote: > > > A lot of places in the documentation mention using s3 for checkpoi

Re: Spark standalone/Mesos on top of Ceph

2015-09-22 Thread Jerry Lam
Do you have specific reasons to use Ceph? I used Ceph before; I'm not too in love with it, especially when I was using the Ceph Object Gateway S3 API. There are some incompatibilities with the aws s3 api. You really really need to try it before making the commitment. Did you manage to install it? On

Re: Re: Spark standalone/Mesos on top of Ceph

2015-09-22 Thread Jerry Lam
way s3 rest api, agreed for such > inconvenience and some incompatibilities. However, we had not > > yet quite researched and tested over radosgw a lot. But we had some little > requirements using gw in some use cases. > > Hope for more considerations and talks. > > Best, > Sun. >

Collaborative Filtering Implicit Feedback Impl.

2016-12-05 Thread Jerry Lam
Hello spark users and developers, I read the paper from Yahoo about CF with implicit feedback and other papers using implicit feedback. Their implementation requires setting the missing ratings to 0. That is, for unobserved ratings, the confidence is set to 1 (c=1). Therefore, the matrix

Re: Collaborative Filtering Implicit Feedback Impl.

2016-12-05 Thread Jerry Lam
his pair. (x^Ty)^2 + regularization. Do I misunderstand the paper? Best Regards, Jerry On Mon, Dec 5, 2016 at 2:43 PM, Sean Owen wrote: > What are you referring to in what paper? implicit input would never > materialize 0s for missing values. > > On Tue, Dec 6, 2016 at 3:42 AM J
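For reference, the objective from the cited paper (Hu, Koren & Volinsky, http://yifanhu.net/PUB/cf.pdf) that this exchange is about; the sum runs over all (u, i) pairs, which is where the "0 rating with confidence 1" for unobserved pairs comes from, without those zeros ever being materialized in the input:

```latex
% Implicit-feedback ALS objective from the cited paper.
\min_{x_\ast,\, y_\ast} \;
  \sum_{u,i} c_{ui}\,\bigl(p_{ui} - x_u^{\top} y_i\bigr)^2
  \;+\; \lambda \Bigl(\sum_u \lVert x_u \rVert^2 + \sum_i \lVert y_i \rVert^2\Bigr),
\qquad
p_{ui} = \begin{cases} 1 & r_{ui} > 0 \\ 0 & r_{ui} = 0 \end{cases},
\qquad
c_{ui} = 1 + \alpha\, r_{ui}.
```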

Re: Collaborative Filtering Implicit Feedback Impl.

2016-12-05 Thread Jerry Lam
sn't mean this 0 value is literally included in the input. There's > no need for that. > > On Tue, Dec 6, 2016 at 4:24 AM Jerry Lam wrote: > >> Hi Sean, >> >> I'm referring to the paper (http://yifanhu.net/PUB/cf.pdf) Section 2: >> " However, with i

Dataset Question: No Encoder found for Set[(scala.Long, scala.Long)]

2017-01-31 Thread Jerry Lam
Hi guys, I got an exception like the following, when I tried to implement a user defined aggregation function. Exception in thread "main" java.lang.UnsupportedOperationException: No Encoder found for Set[(scala.Long, scala.Long)] The Set[(Long, Long)] is a field in the case class which is the o

Re: Dataset Question: No Encoder found for Set[(scala.Long, scala.Long)]

2017-02-01 Thread Jerry Lam
other work around that i know of. > > import org.apache.spark.sql.{ Encoder, Encoders } > implicit def setEncoder[X]: Encoder[Set[X]] = Encoders.kryo[Set[X]] > > On Tue, Jan 31, 2017 at 7:33 PM, Jerry Lam wrote: > >> Hi guys, >> >> I got an exception like the followi
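The workaround quoted above, spelled out for the concrete type in this thread (a sketch; how the implicit is picked up depends on where the encoder is required, for example as an Aggregator's buffer encoder):

```scala
import org.apache.spark.sql.{Encoder, Encoders}

// Supply a Kryo-serialized encoder where no built-in encoder exists for Set.
implicit val pairSetEncoder: Encoder[Set[(Long, Long)]] =
  Encoders.kryo[Set[(Long, Long)]]
```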
