Re: welcome a new batch of committers

2018-10-03 Thread Ted Yu
Congratulations to all ! Original message From: Jungtaek Lim Date: 10/3/18 2:41 AM (GMT-08:00) To: Marco Gaido Cc: dev Subject: Re: welcome a new batch of committers Congrats all! You all deserved it. On Wed, 3 Oct 2018 at 6:35 PM Marco Gaido wrote: Congrats you all! Il g

Re: [VOTE] SPARK 2.4.0 (RC2)

2018-10-01 Thread Ted Yu
+1 Original message From: Denny Lee Date: 9/30/18 10:30 PM (GMT-08:00) To: Stavros Kontopoulos Cc: Sean Owen , Wenchen Fan , dev Subject: Re: [VOTE] SPARK 2.4.0 (RC2) +1 (non-binding) On Sat, Sep 29, 2018 at 10:24 AM Stavros Kontopoulos wrote: +1 Stavros On Sat, Sep 2

Re: from_csv

2018-09-19 Thread Ted Yu
+1 Original message From: Dongjin Lee Date: 9/19/18 7:20 AM (GMT-08:00) To: dev Subject: Re: from_csv Another +1. I already experienced this case several times. On Mon, Sep 17, 2018 at 11:03 AM Hyukjin Kwon wrote: +1 for this idea since text parsing in CSV/JSON is quite co

Re: Upgrade SBT to the latest

2018-08-31 Thread Ted Yu
+1 Original message From: Sean Owen Date: 8/31/18 6:40 AM (GMT-08:00) To: Darcy Shen Cc: dev@spark.apache.org Subject: Re: Upgrade SBT to the latest Certainly worthwhile. I think this should target Spark 3, which should come after 2.4, which is itself already just about rea

Re: SPIP: Executor Plugin (SPARK-24918)

2018-08-31 Thread Ted Yu
+1 Original message From: Reynold Xin Date: 8/30/18 11:11 PM (GMT-08:00) To: Felix Cheung Cc: dev Subject: Re: SPIP: Executor Plugin (SPARK-24918) I actually had a similar use case a while ago, but not entirely the same. In my use case, Spark is already up, but I want to m

Re: Spark Kafka adapter questions

2018-08-20 Thread Ted Yu
spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:203) > > 18/08/20 22:29:33 INFO AbstractCoordinator: Marking the coordinator > :9093 (id: 2147483647 rack: null) dead for group > spark-kafka-source-1aa50598-99d1-4c53-a73c-fa6637a219b2--1338794993-dri

Re: Spark Kafka adapter questions

2018-08-17 Thread Ted Yu
If you have picked up all the changes for SPARK-18057, the Kafka “broker” supporting v1.0+ should be compatible with Spark's Kafka adapter. Can you post more details about the “failed to send SSL close message” errors ? (The default Kafka version is 2.0.0 in Spark Kafka adapter after SPARK-18057

Re: Re: Welcome Zhenhua Wang as a Spark committer

2018-04-02 Thread Ted Yu
Congratulations, Zhenhua Original message From: 雨中漫步 <601450...@qq.com> Date: 4/1/18 11:30 PM (GMT-08:00) To: Yuanjian Li , Wenchen Fan Cc: dev Subject: Re: Welcome Zhenhua Wang as a Spark committer Congratulations Zhenhua Wang -- Original message -- From

Re: DataSourceV2 write input requirements

2018-03-30 Thread Ted Yu
+1 Original message From: Ryan Blue Date: 3/30/18 2:28 PM (GMT-08:00) To: Patrick Woody Cc: Russell Spitzer , Wenchen Fan , Ted Yu , Spark Dev List Subject: Re: DataSourceV2 write input requirements You're right. A global sort would change the clustering if it had

Re: DataSourceV2 write input requirements

2018-03-28 Thread Ted Yu
>>> provide >>>>>>>>>> Spark a hash function for the other side of a join. It seems >>>>>>>>>> unlikely to me >>>>>>>>>> that many data sources would have partitioning that happens

Re: DataSourceV2 write input requirements

2018-03-26 Thread Ted Yu
wrote: > Actually clustering is already supported, please take a look at > SupportsReportPartitioning > > Ordering is not proposed yet, might be similar to what Ryan proposed. > > On Mon, Mar 26, 2018 at 6:11 PM, Ted Yu wrote: > >> Interesting. >> >> Sh

Re: DataSourceV2 write input requirements

2018-03-26 Thread Ted Yu
Interesting. Should requiredClustering return a Set of Expression's ? This way, we can determine the order of Expression's by looking at what requiredOrdering() returns. On Mon, Mar 26, 2018 at 5:45 PM, Ryan Blue wrote: > Hi Pat, > > Thanks for starting the discussion on this, we’re really inte

Re: [VOTE] Spark 2.3.0 (RC1)

2018-01-16 Thread Ted Yu
Is there going to be another RC ? With KafkaContinuousSourceSuite hanging, it is hard to get the rest of the tests going. Cheers On Sat, Jan 13, 2018 at 7:29 AM, Sean Owen wrote: > The signatures and licenses look OK. Except for the missing k8s package, > the contents look OK. Tests look prett

Re: Broken SQL Visualization?

2018-01-15 Thread Ted Yu
Did you include any picture ? Looks like the picture didn't go through. Please use a third-party image site. Thanks Original message From: Tomasz Gawęda Date: 1/15/18 2:07 PM (GMT-08:00) To: dev@spark.apache.org, u...@spark.apache.org Subject: Broken SQL Visualization? Hi, today I hav

Re: Welcoming Saisai (Jerry) Shao as a committer

2017-08-28 Thread Ted Yu
Congratulations, Jerry ! On Mon, Aug 28, 2017 at 6:28 PM, Matei Zaharia wrote: > Hi everyone, > > The PMC recently voted to add Saisai (Jerry) Shao as a committer. Saisai > has been contributing to many areas of the project for a long time, so it’s > great to see him join. Join me in thanking an

Spark 2.1.x client with 2.2.0 cluster

2017-08-10 Thread Ted Yu
Hi, Has anyone used Spark 2.1.x client with Spark 2.2.0 cluster ? If so, is there any compatibility issue observed ? Thanks

Re: Performance Benchmark Hbase vs Cassandra

2017-06-29 Thread Ted Yu
For Cassandra, I found: https://www.instaclustr.com/multi-data-center-sparkcassandra-benchmark-round-2/ My coworker (on vacation at the moment) was doing a benchmark with HBase. When he comes back, the result can be published. Note: it is hard to find comparison results with the same setup (hardware,

Re: Spark Hbase Connector

2017-06-29 Thread Ted Yu
Please take a look at HBASE-16179 (work in progress). On Thu, Jun 29, 2017 at 4:30 PM, Raj, Deepu wrote: > Hi Team, > > > > Is there stable Spark HBase connector for Spark 2.0 ? > > > > Thanks, > > Deepu Raj > > >

Re: how to mention others in JIRA comment please?

2017-06-26 Thread Ted Yu
You can find the JIRA handle of the person you want to mention by going to a JIRA where that person has commented. e.g. you want to find the handle for Joseph. You can go to: https://issues.apache.org/jira/browse/SPARK-6635 and click on his name in comment: https://issues.apache.org/jira/secure/V

Re: the compile of spark stoped without any hints, would you like help me please?

2017-06-25 Thread Ted Yu
Does adding -X to the mvn command give you more information ? Cheers On Sun, Jun 25, 2017 at 5:29 AM, 萝卜丝炒饭 <1427357...@qq.com> wrote: > Hi all, > > Today I used a new PC to compile Spark. > At the beginning, it worked well. > But it stopped at some point. > The content in the console is : > ==

Re: Should we consider a Spark 2.1.1 release?

2017-03-20 Thread Ted Yu
Timur: Mind starting a new thread ? I have the same question as you have. > On Mar 20, 2017, at 11:34 AM, Timur Shenkao wrote: > > Hello guys, > > Spark benefits from stable versions not frequent ones. > A lot of people still have 1.6.x in production. Those who wants the freshest > (like me)

Re: HBaseContext with Spark

2017-01-25 Thread Ted Yu
Does the storage handler provide bulk load capability ? Cheers > On Jan 25, 2017, at 3:39 AM, Amrit Jangid wrote: > > Hi chetan, > > If you just need HBase Data into Hive, You can use Hive EXTERNAL TABLE with > STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'. > > Try this if you

Re: HBaseContext with Spark

2017-01-25 Thread Ted Yu
The references are vendor-specific. Suggest contacting the vendor's mailing list for your PR. I initially assumed the HBase repository referred to was the Apache one. Cheers On Wed, Jan 25, 2017 at 7:38 AM, Chetan Khatri wrote: > @Ted Yu, Correct but HBase-Spark module available at HBase re

Re: HBaseContext with Spark

2017-01-25 Thread Ted Yu
Though no hbase release has the hbase-spark module, you can find the backport patch on HBASE-14160 (for Spark 1.6) You can build the hbase-spark module yourself. Cheers On Wed, Jan 25, 2017 at 3:32 AM, Chetan Khatri wrote: > Hello Spark Community Folks, > > Currently I am using HBase 1.2.4 and

Re: Approach: Incremental data load from HBASE

2016-12-21 Thread Ted Yu
processing is delivered to hbase. Cheers On Wed, Dec 21, 2016 at 8:00 AM, Chetan Khatri wrote: > Ok, Sure will ask. > > But what would be generic best practice solution for Incremental load from > HBASE. > > On Wed, Dec 21, 2016 at 8:42 PM, Ted Yu wrote: > >> I haven

Re: Approach: Incremental data load from HBASE

2016-12-21 Thread Ted Yu
I haven't used Gobblin. You can consider asking the Gobblin mailing list about the first option. The second option would work. On Wed, Dec 21, 2016 at 2:28 AM, Chetan Khatri wrote: > Hello Guys, > > I would like to understand different approaches for Distributed Incremental > load from HBase, Is there

Re: Difference between netty and netty-all

2016-12-05 Thread Ted Yu
This should be in netty-all : $ jar tvf /home/x/.m2/repository/io/netty/netty-all/4.0.29.Final/netty-all-4.0.29.Final.jar | grep ThreadLocalRandom 967 Tue Jun 23 11:10:30 UTC 2015 io/netty/util/internal/ThreadLocalRandom$1.class 1079 Tue Jun 23 11:10:30 UTC 2015 io/netty/util/internal/ThreadL
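Since a jar is just a zip archive, the `jar tvf ... | grep ThreadLocalRandom` lookup above can be sketched without the JDK tooling. The entries below are fabricated stand-ins for the real netty-all contents, so treat this purely as an illustration of the technique:

```python
import io
import zipfile

# Build a tiny in-memory "jar" (a jar is just a zip archive) whose entry
# names are made-up stand-ins for the real netty-all class files.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as jar:
    jar.writestr("io/netty/util/internal/ThreadLocalRandom.class", b"")
    jar.writestr("io/netty/util/internal/ThreadLocalRandom$1.class", b"")
    jar.writestr("io/netty/buffer/ByteBuf.class", b"")

# Equivalent of `jar tvf netty-all-*.jar | grep ThreadLocalRandom`:
with zipfile.ZipFile(io.BytesIO(buf.getvalue())) as jar:
    hits = sorted(n for n in jar.namelist() if "ThreadLocalRandom" in n)

print(hits)
```

The same `namelist()` filter works against a real jar on disk, which is a quick way to settle "which artifact ships this class" questions like the netty vs. netty-all one in this thread.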

Re: PSA: JIRA resolutions and meanings

2016-10-08 Thread Ted Yu
Makes sense. I trust Hyukjin, Holden and Cody's judgment in their respective areas. I just wish to see more participation from the committers. Thanks > On Oct 8, 2016, at 8:27 AM, Sean Owen wrote: > > Hyukjin -

Re: PSA: JIRA resolutions and meanings

2016-10-08 Thread Ted Yu
I think only committers should resolve JIRAs that were not created by the resolvers themselves. > On Oct 8, 2016, at 6:53 AM, Hyukjin Kwon wrote: > > I am uncertain too. It'd be great if these were documented too. > > FWIW, in my case, I privately asked and told Sean first that I am going to > look

Re: Issues in compiling spark 2.0.0 code using scala-maven-plugin

2016-09-30 Thread Ted Yu
Was there any error prior to 'LifecycleExecutionException' ? On Fri, Sep 30, 2016 at 2:43 PM, satyajit vegesna < satyajit.apas...@gmail.com> wrote: > >> i am trying to compile code using maven ,which was working with spark >> 1.6.2, but when i try for spark 2.0.0 then i get below error, >> >> org

Replacement for SparkSqlSerializer.deserialize[

2016-09-06 Thread Ted Yu
Hi, In hbase-spark module of hbase, we previously had this code: def hbaseFieldToScalaType( f: Field, src: Array[Byte], offset: Int, length: Int): Any = { ... case BinaryType => val newArray = new Array[Byte](length) System.arraycopy(src, offse
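The truncated BinaryType branch allocates a fresh array and copies `length` bytes starting at `offset` via System.arraycopy. A minimal sketch of that copy, with made-up input data, looks like:

```python
def binary_field_to_bytes(src: bytes, offset: int, length: int) -> bytes:
    # Equivalent of `new Array[Byte](length)` followed by
    # System.arraycopy(src, offset, newArray, 0, length): take exactly
    # `length` bytes of `src` beginning at `offset`.
    return src[offset:offset + length]

# Hypothetical serialized row: 2 header bytes, then the field payload.
row = b"\x00\x01rowkey-payload\xff"
print(binary_field_to_bytes(row, 2, 6))  # b'rowkey'
```

The point of the copy in the original Scala is to detach the field value from the larger backing buffer, which is why a slice (rather than a view) is the right analogue here.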

Re: Spark 1.x/2.x qualifiers in downstream artifact names

2016-08-24 Thread Ted Yu
'Spark 1.x and Scala 2.10 & 2.11' was repeated. I guess your second line should read: org.bdgenomics.adam:adam-{core,apis,cli}-spark2_2.1[0,1] for Spark 2.x and Scala 2.10 & 2.11 On Wed, Aug 24, 2016 at 9:41 AM, Michael Heuer wrote: > Hello, > > We're a project downstream of Spark and need to

Re: Welcoming Felix Cheung as a committer

2016-08-08 Thread Ted Yu
Congratulations, Felix. On Mon, Aug 8, 2016 at 11:15 AM, Matei Zaharia wrote: > Hi all, > > The PMC recently voted to add Felix Cheung as a committer. Felix has been > a major contributor to SparkR and we're excited to have him join > officially. Congrats and welcome, Felix! > > Matei >

Re: SQL Based Authorization for SparkSQL

2016-08-02 Thread Ted Yu
There was SPARK-12008, which was closed. Not sure if there is an active JIRA in this regard. On Tue, Aug 2, 2016 at 6:40 PM, 马晓宇 wrote: > Hi guys, > > I wonder if anyone is working on SQL-based authorization already or not. > > This is something we need badly right now and we tried to embed a > H

Re: Build speed

2016-07-22 Thread Ted Yu
I assume you have enabled Zinc. Cheers On Fri, Jul 22, 2016 at 7:54 AM, Mikael Ståldal wrote: > Is there any way to speed up an incremental build of Spark? > > For me it takes 8 minutes to build the project with just a few code > changes. > > -- > [image: MagineTV] > > *Mikael Ståldal* > Senior

Re: Spark performance regression test suite

2016-07-08 Thread Ted Yu
Found a few issues: [SPARK-6810] Performance benchmarks for SparkR [SPARK-2833] performance tests for linear regression [SPARK-15447] Performance test for ALS in Spark 2.0 Haven't found one for Spark core. On Fri, Jul 8, 2016 at 8:58 AM, Michael Allman wrote: > Hello, > > I've seen a few messa

Re: Spark 2.0.0 performance; potential large Spark core regression

2016-07-08 Thread Ted Yu
bq. we turned it off when fixing a bug Adam: Can you refer to the bug JIRA ? Thanks On Fri, Jul 8, 2016 at 9:22 AM, Adam Roberts wrote: > Thanks Michael, we can give your options a try and aim for a 2.0.0 tuned > vs 2.0.0 default vs 1.6.2 default comparison, for future reference the > defaults

Re: [VOTE] Release Apache Spark 2.0.0 (RC2)

2016-07-06 Thread Ted Yu
Running the following command: build/mvn clean -Phive -Phive-thriftserver -Pyarn -Phadoop-2.6 -Psparkr -Dhadoop.version=2.7.0 package The build stopped with this test failure: ^[[31m- SPARK-9757 Persist Parquet relation with decimal column *** FAILED ***^[[0m On Wed, Jul 6, 2016 at 6:25 AM, Sea

Re: Hello

2016-06-17 Thread Ted Yu
You can use a JIRA filter to find JIRAs of the component(s) you're interested in. Then sort by Priority. Maybe comment on the JIRA if you want to work on it. On Fri, Jun 17, 2016 at 3:22 PM, Pedro Rodriguez wrote: > What is the best way to determine what the library maintainers believe is > imp

Re: [VOTE] Release Apache Spark 1.6.2 (RC1)

2016-06-17 Thread Ted Yu
Docker Integration Tests failed on Linux: http://pastebin.com/Ut51aRV3 Here was the command I used: mvn clean -Phive -Phive-thriftserver -Pyarn -Phadoop-2.6 -Psparkr -Dhadoop.version=2.7.0 package Has anyone seen similar error ? Thanks On Thu, Jun 16, 2016 at 9:49 PM, Reynold Xin wrote: > P

Re: Kryo registration for Tuples?

2016-06-08 Thread Ted Yu
I think the second group (3 classOf's) should be used. Cheers On Wed, Jun 8, 2016 at 4:53 PM, Alexander Pivovarov wrote: > if my RDD is RDD[(String, (Long, MyClass))] > > Do I need to register > > classOf[MyClass] > classOf[(Any, Any)] > > or > > classOf[MyClass] > classOf[(Long, MyClass)] > cl

Re: Can't use UDFs with Dataframes in spark-2.0-preview scala-2.10

2016-06-07 Thread Ted Yu
Please go ahead. On Tue, Jun 7, 2016 at 4:45 PM, franklyn wrote: > Thanks for reproducing it Ted, should i make a Jira Issue?. > > > > -- > View this message in context: > http://apache-spark-developers-list.1001551.n3.nabble.com/Can-t-use-UDFs-with-Dataframes-in-spark-2-0-preview-scala-2-10-tp1

Re: Can't use UDFs with Dataframes in spark-2.0-preview scala-2.10

2016-06-07 Thread Ted Yu
I built with Scala 2.10 >>> df.select(add_one(df.a).alias('incremented')).collect() The above just hung. On Tue, Jun 7, 2016 at 3:31 PM, franklyn wrote: > Thanks Ted !. > > I'm using > > https://github.com/apache/spark/commit/8f5a04b6299e3a47aca13cbb40e72344c0114860 > and building with scala-2

Re: Dataset API agg question

2016-06-07 Thread Ted Yu
Have you tried the following ? Seq(1->2, 1->5, 3->6).toDS("a", "b") then you can refer to columns by name. FYI On Tue, Jun 7, 2016 at 3:58 PM, Alexander Pivovarov wrote: > I'm trying to switch from RDD API to Dataset API > My question is about reduceByKey method > > e.g. in the following exa

Re: Can't use UDFs with Dataframes in spark-2.0-preview scala-2.10

2016-06-07 Thread Ted Yu
With commit 200f01c8fb15680b5630fbd122d44f9b1d096e02 using Scala 2.11: Using Python version 2.7.9 (default, Apr 29 2016 10:48:06) SparkSession available as 'spark'. >>> from pyspark.sql import SparkSession >>> from pyspark.sql.types import IntegerType, StructField, StructType >>> from pyspark.sql.

Re: Can't compile 2.0-preview with scala 2.10

2016-06-06 Thread Ted Yu
See the following from https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Compile/job/SPARK-master-COMPILE-sbt-SCALA-2.10/1642/consoleFull : + SBT_FLAGS+=('-Dscala-2.10') + ./dev/change-scala-version.sh 2.10 FYI On Mon, Jun 6, 2016 at 10:35 AM, Franklyn D'souza < franklyn.dso...@shopify.

Re: Welcoming Yanbo Liang as a committer

2016-06-03 Thread Ted Yu
Congratulations, Yanbo. On Fri, Jun 3, 2016 at 7:48 PM, Matei Zaharia wrote: > Hi all, > > The PMC recently voted to add Yanbo Liang as a committer. Yanbo has been a > super active contributor in many areas of MLlib. Please join me in > welcoming Yanbo! > > Matei > --

Re: ClassCastException: SomeCaseClass cannot be cast to org.apache.spark.sql.Row

2016-05-24 Thread Ted Yu
Please log a JIRA. Thanks On Tue, May 24, 2016 at 8:33 AM, Koert Kuipers wrote: > hello, > as we continue to test spark 2.0 SNAPSHOT in-house we ran into the > following trying to port an existing application from spark 1.6.1 to spark > 2.0.0-SNAPSHOT. > > given this code: > > case class Test(a

Re: Using Travis for JDK7/8 compilation and lint-java.

2016-05-23 Thread Ted Yu
>>> And, don't worry, Ted. >>> >>> Travis launches new VMs for every PR. >>> >>> Apache Spark repository uses the following setting. >>> >>> VM: Google Compute Engine >>> OS: Ubuntu 14.04.3 LTS Server Edition 64bit >>

Re: Running TPCDSQueryBenchmark results in java.lang.OutOfMemoryError

2016-05-23 Thread Ted Yu
Can you tell us the commit hash using which the test was run ? For #2, if you can give full stack trace, that would be nice. Thanks On Mon, May 23, 2016 at 8:58 AM, Ovidiu-Cristian MARCU < ovidiu-cristian.ma...@inria.fr> wrote: > Hi > > 1) Using latest spark 2.0 I've managed to run TPCDSQueryBe

Re: Using Travis for JDK7/8 compilation and lint-java.

2016-05-23 Thread Ted Yu
and amend your commit title or messages, see the Travis CI. > Or, you can monitor Travis CI result on status menu bar. > If it shows green icon, you have nothing to do. > >https://docs.travis-ci.com/user/apps/ > > To sum up, I think we don't need to wait f

Re: Using Travis for JDK7/8 compilation and lint-java.

2016-05-22 Thread Ted Yu
- For Oracle JDK8, mvn -DskipTests install and run `dev/lint-java`. > > Thank you, Ted. > > Dongjoon. > > On Sun, May 22, 2016 at 1:29 PM, Ted Yu wrote: > >> The following line was repeated twice: >> >> - For Oracle JDK7, mvn -DskipTests install and run `dev/lint-java`

Re: Using Travis for JDK7/8 compilation and lint-java.

2016-05-22 Thread Ted Yu
The following line was repeated twice: - For Oracle JDK7, mvn -DskipTests install and run `dev/lint-java`. Did you intend to cover JDK 8 ? Cheers On Sun, May 22, 2016 at 1:25 PM, Dongjoon Hyun wrote: > Hi, All. > > I want to propose the followings. > > - Turn on Travis CI for Apache Spark PR

Re: Quick question on spark performance

2016-05-20 Thread Ted Yu
Yash: Can you share the JVM parameters you used ? How many partitions are there in your data set ? Thanks On Fri, May 20, 2016 at 5:59 PM, Reynold Xin wrote: > It's probably due to GC. > > On Fri, May 20, 2016 at 5:54 PM, Yash Sharma wrote: > >> Hi All, >> I am here to get some expert advice

Re: Query parsing error for the join query between different database

2016-05-18 Thread Ted Yu
Which release of Spark / Hive are you using ? Cheers > On May 18, 2016, at 6:12 AM, JaeSung Jun wrote: > > Hi, > > I'm working on custom data source provider, and i'm using fully qualified > table name in FROM clause like following : > > SELECT user. uid, dept.name > FROM userdb.user user, d

Re: dataframe udf functioin will be executed twice when filter on new column created by withColumn

2016-05-11 Thread Ted Yu
In master branch, behavior is the same. Suggest opening a JIRA if you haven't done so. On Wed, May 11, 2016 at 6:55 AM, Tony Jin wrote: > Hi guys, > > I have a problem about spark DataFrame. My spark version is 1.6.1. > Basically, i used udf and df.withColumn to create a "new" column, and then

Re: Structured Streaming with Kafka source/sink

2016-05-11 Thread Ted Yu
Please see this thread: http://search-hadoop.com/m/q3RTt9XAz651PiG/Adhoc+queries+spark+streaming&subj=Re+Adhoc+queries+on+Spark+2+0+with+Structured+Streaming > On May 11, 2016, at 1:47 AM, Ofir Manor wrote: > > Hi, > I'm trying out Structured Streaming from current 2.0 branch. > Does the branch

Re: Cache Shuffle Based Operation Before Sort

2016-05-08 Thread Ted Yu
I assume there were supposed to be images following this line (which I don't see in the email thread): bq. Let’s look at details of execution for 10 and 100 scale factor input Consider using 3rd party image site. On Sun, May 8, 2016 at 5:17 PM, Ali Tootoonchian wrote: > Thanks for your comment

Re: Proposal of closing some PRs and maybe some PRs abandoned by its author

2016-05-06 Thread Ted Yu
PR #10572 was listed twice. In the future, is it possible to include the contributor's handle beside the PR number so that people can easily recognize their own PR ? Thanks On Fri, May 6, 2016 at 8:45 AM, Hyukjin Kwon wrote: > Hi all, > > > This was similar with the proposal of closing PRs bef

Re: SQLContext and "stable identifier required"

2016-05-03 Thread Ted Yu
Have you tried the following ? scala> import spark.implicits._ import spark.implicits._ scala> spark res0: org.apache.spark.sql.SparkSession = org.apache.spark.sql.SparkSession@323d1fa2 Cheers On Tue, May 3, 2016 at 9:16 AM, Koert Kuipers wrote: > with the introduction of SparkSession SQLCont

Re: spark 2 segfault

2016-05-02 Thread Ted Yu
> Created issue: >> https://issues.apache.org/jira/browse/SPARK-15062 >> >> On Mon, May 2, 2016 at 6:48 AM, Ted Yu wrote: >> >>> I tried the same statement using Spark 1.6.1 >>> There was no error with default memory setting. >>> >>> Suggest logging a

Re: spark 2 segfault

2016-05-02 Thread Ted Yu
On May 2, 2016 12:09 AM, "Ted Yu" wrote: >> Using commit hash 90787de864b58a1079c23e6581381ca8ffe7685f and Java 1.7.0_67, I got: >> >> scala> val dfComplicated = sc.parallelize(List((Map("1" -> "a"), List

Re: Number of partitions for binaryFiles

2016-04-26 Thread Ted Yu
> Hi Ted, > > > > I have 36 files of size ~600KB and the rest 74 are about 400KB. > > > > Is there a workaround rather than changing Sparks code? > > > > Best regards, Alexander > > > > *From:* Ted Yu [mailto:yuzhih...@gmail.com] > *Sent:* Tu

Re: Number of partitions for binaryFiles

2016-04-26 Thread Ted Yu
Here is the body of StreamFileInputFormat#setMinPartitions : def setMinPartitions(context: JobContext, minPartitions: Int) { val totalLen = listStatus(context).asScala.filterNot(_.isDirectory).map(_.getLen).sum val maxSplitSize = math.ceil(totalLen / math.max(minPartitions, 1.0)).toLong
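The split-size arithmetic quoted above can be checked outside Spark. The file sizes below come from the thread (36 files of ~600 KB plus 74 of ~400 KB), and `minPartitions = 2` is an assumption for illustration:

```python
import math

def max_split_size(total_len: int, min_partitions: int) -> int:
    # Mirrors the quoted setMinPartitions body: divide the total input size
    # by the requested minimum number of partitions (at least 1), rounding up.
    return int(math.ceil(total_len / max(min_partitions, 1.0)))

# 36 files of ~600 KB plus 74 files of ~400 KB, as described in the thread:
total_len = 36 * 600_000 + 74 * 400_000   # 51,200,000 bytes
split = max_split_size(total_len, 2)
print(split)                               # 25600000
# Each file is far smaller than the split size, so whole files get packed
# together and far fewer partitions than the file count result:
print(total_len // split)                  # 2
```

This is why small-file inputs end up with so few partitions here: the split size is derived from total bytes, not from the number of files.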

Re: Cache Shuffle Based Operation Before Sort

2016-04-25 Thread Ted Yu
Interesting. bq. details of execution for 10 and 100 scale factor input Looks like some chart (or image) didn't go through. FYI On Mon, Apr 25, 2016 at 12:50 PM, Ali Tootoonchian wrote: > Caching shuffle RDD before the sort process improves system performance. > SQL > planner can be intellige

Re: RFC: Remote "HBaseTest" from examples?

2016-04-21 Thread Ted Yu
Zhan: I have mentioned the JIRA numbers in the thread starting with (note the typo in subject of this thread): RFC: Remove ... On Thu, Apr 21, 2016 at 1:28 PM, Zhan Zhang wrote: > FYI: There are several pending patches for DataFrame support on top of > HBase. > > Thanks. > > Zhan Zhang > > On A

Re: [Spark-SQL] Reduce Shuffle Data by pushing filter toward storage

2016-04-21 Thread Ted Yu
Interesting analysis. Can you log a JIRA ? > On Apr 21, 2016, at 11:07 AM, atootoonchian wrote: > > SQL query planner can have intelligence to push down filter commands towards > the storage layer. If we optimize the query planner such that the IO to the > storage is reduced at the cost of run

Re: Improving system design logging in spark

2016-04-20 Thread Ted Yu
Interesting. For #3: bq. reading data from, I guess you meant reading from disk. On Wed, Apr 20, 2016 at 10:45 AM, atootoonchian wrote: > Current spark logging mechanism can be improved by adding the following > parameters. It will help in understanding system bottlenecks and provide > useful

Re: RFC: Remove "HBaseTest" from examples?

2016-04-19 Thread Ted Yu
: > On Tue, Apr 19, 2016 at 11:07 AM, Ted Yu wrote: > >> The same question can be asked w.r.t. examples for other projects, such >> as flume and kafka. >> > > The main difference being that flume and kafka integration are part of > Spark itself. HBase integratio

Re: RFC: Remove "HBaseTest" from examples?

2016-04-19 Thread Ted Yu
> On Tue, Apr 19, 2016 at 1:59 PM, Ted Yu wrote: > >> bq. HBase's current support, even if there are bugs or things that still >> need to be done, is much better than the Spark example >> >> In my opinion, a simple example that works is better than a buggy

Re: RFC: Remove "HBaseTest" from examples?

2016-04-19 Thread Ted Yu
too > many dependencies for something that is not really useful, is why I'm > suggesting removing it. > > > On Tue, Apr 19, 2016 at 10:47 AM, Ted Yu wrote: > > There is an Open JIRA for fixing the documentation: HBASE-15473 > > > > I would say the refguide li

Re: RFC: Remove "HBaseTest" from examples?

2016-04-19 Thread Ted Yu
> > While you're at it, here's some much better documentation, from the > HBase project themselves, than what the Spark example provides: > http://hbase.apache.org/book.html#spark > > On Tue, Apr 19, 2016 at 10:41 AM, Ted Yu wrote: > > bq. it's actually in use

Re: RFC: Remove "HBaseTest" from examples?

2016-04-19 Thread Ted Yu
'bq.' is used in JIRA to quote what other people have said. On Tue, Apr 19, 2016 at 10:42 AM, Reynold Xin wrote: > Ted - what's the "bq" thing? Are you using some 3rd party (e.g. Atlassian) > syntax? They are not being rendered in email. > > > On Tue,

Re: RFC: Remove "HBaseTest" from examples?

2016-04-19 Thread Ted Yu
se's input > formats), which makes it not very useful as a blueprint for developing > HBase apps with Spark. > > On Tue, Apr 19, 2016 at 10:28 AM, Ted Yu wrote: > > bq. I wouldn't call it "incomplete". > > > > I would call it incomplete. > > &g

Re: RFC: Remove "HBaseTest" from examples?

2016-04-19 Thread Ted Yu
bq. create a separate tarball for them Probably another thread can be started for the above. I am fine with it. On Tue, Apr 19, 2016 at 10:34 AM, Marcelo Vanzin wrote: > On Tue, Apr 19, 2016 at 10:28 AM, Reynold Xin wrote: > > Yea in general I feel examples that bring in a large amount of > de

Re: RFC: Remove "HBaseTest" from examples?

2016-04-19 Thread Ted Yu
On Tue, Apr 19, 2016 at 10:23 AM, Marcelo Vanzin wrote: > On Tue, Apr 19, 2016 at 10:20 AM, Ted Yu wrote: > > I want to note that the hbase-spark module in HBase is incomplete. Zhan > has > > several patches pending review. > > I wouldn't call it "incomplete

Re: RFC: Remove "HBaseTest" from examples?

2016-04-19 Thread Ted Yu
Corrected typo in subject. I want to note that the hbase-spark module in HBase is incomplete. Zhan has several patches pending review. hbase-spark module is currently only in master branch which would be released as 2.0 However the release date for 2.0 is unclear - probably half a year from now.

Re: auto closing pull requests that have been inactive > 30 days?

2016-04-18 Thread Ted Yu
s; a higher barrier to contributing; a combination >> thereof; etc... >> >> Also relevant: http://danluu.com/discourage-oss/ >> >> By the way, some people noted that closing PRs may discourage >> contributors. I think our open PR count alone is very discouraging. Und

Re: auto closing pull requests that have been inactive > 30 days?

2016-04-18 Thread Ted Yu
ing people look at the pull requests that have been inactive >>> for a >>> > long time. That seems equally likely (or unlikely) as committers >>> looking at >>> > the recently closed pull requests. >>> > >>> > In either case, most

Re: auto closing pull requests that have been inactive > 30 days?

2016-04-18 Thread Ted Yu
en but the cost to reopen is approximately zero (i.e. > click a button on the pull request). > > > On Mon, Apr 18, 2016 at 12:41 PM, Ted Yu wrote: > >> bq. close the ones where they don't respond for a week >> >> Does this imply that the script understands re

Re: auto closing pull requests that have been inactive > 30 days?

2016-04-18 Thread Ted Yu
filtered for non-mergeable PRs or instead left a comment > asking the author to respond if they are still available to move the PR > forward - and close the ones where they don't respond for a week? > > Just a suggestion. > On Monday, April 18, 2016, Ted Yu wrote: > >>

Re: auto closing pull requests that have been inactive > 30 days?

2016-04-18 Thread Ted Yu
I had one PR which got merged after 3 months. If the inactivity was due to contributor, I think it can be closed after 30 days. But if the inactivity was due to lack of review, the PR should be kept open. On Mon, Apr 18, 2016 at 12:17 PM, Cody Koeninger wrote: > For what it's worth, I have defi

Re: BytesToBytes and unaligned memory

2016-04-18 Thread Ted Yu
unaligned memory access on a platform where unaligned memory access is > definitely not supported for shorts/ints/longs. > > if these tests continue to pass then I think the Spark tests don't > exercise unaligned memory access, cheers > > > > > > > > From:

Re: BytesToBytes and unaligned memory

2016-04-15 Thread Ted Yu
added > > Cheers, > > > > > From:Ted Yu > To:Adam Roberts/UK/IBM@IBMGB > Cc:"dev@spark.apache.org" > Date:15/04/2016 16:43 > Subject:Re: BytesToBytes and unaligned memory > -- > &g

Re: BytesToBytes and unaligned memory

2016-04-15 Thread Ted Yu
unaligned);* > } > } > > Output is, as you'd expect, "used reflection and _unaligned is false, > setting to true anyway for experimenting", and the tests pass. > > No other problems on the platform (pending a different pull request). > > Cheers, > >

Re: BytesToBytes and unaligned memory

2016-04-15 Thread Ted Yu
I assume you tested 2.0 with SPARK-12181 . Related code from Platform.java if java.nio.Bits#unaligned() throws exception: // We at least know x86 and x64 support unaligned access. String arch = System.getProperty("os.arch", ""); //noinspection DynamicRegexReplaceableByCompiledPa
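The architecture fallback in that snippet can be sketched as a standalone check. The regex below follows the intent of the quoted Platform.java fragment ("we at least know x86 and x64 support unaligned access") and is illustrative, not necessarily Spark's exact current list:

```python
import re

# Architectures assumed to tolerate unaligned memory access when
# java.nio.Bits#unaligned() cannot be queried reflectively.
KNOWN_UNALIGNED = re.compile(r"^(i[3-6]86|x86(_64)?|x64|amd64)$")

def supports_unaligned(os_arch: str) -> bool:
    # Equivalent of the String.matches(...) fallback keyed on the
    # "os.arch" system property.
    return KNOWN_UNALIGNED.fullmatch(os_arch) is not None

print(supports_unaligned("amd64"))    # True
print(supports_unaligned("x86_64"))   # True
print(supports_unaligned("s390x"))    # False
```

A whitelist like this is deliberately conservative: on platforms outside the list (the IBM hardware discussed in this thread, for instance), unaligned support has to be proven rather than assumed.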

Re: Spark 1.6.1 Hadoop 2.6 package on S3 corrupt?

2016-04-11 Thread Ted Yu
Gentle ping: spark-1.6.1-bin-hadoop2.4.tgz from S3 is still corrupt. On Wed, Apr 6, 2016 at 12:55 PM, Josh Rosen wrote: > Sure, I'll take a look. Planning to do full verification in a bit. > > On Wed, Apr 6, 2016 at 12:54 PM Ted Yu wrote: > >> Josh: >> Can you ch

Re: spark graphx storage RDD memory leak

2016-04-10 Thread Ted Yu
I see the following code toward the end of the method: // Unpersist the RDDs hidden by newly-materialized RDDs oldMessages.unpersist(blocking = false) prevG.unpersistVertices(blocking = false) prevG.edges.unpersist(blocking = false) Wouldn't the above achieve same effect ?

Re: [BUILD FAILURE] Spark Project ML Local Library - me or it's real?

2016-04-09 Thread Ted Yu
Sent PR: https://github.com/apache/spark/pull/12276 I was able to get build going past mllib-local module. FYI On Sat, Apr 9, 2016 at 12:40 PM, Ted Yu wrote: > The broken build was caused by the following: > > [SPARK-14462][ML][MLLIB] add the mllib-local build to maven pom > &

Re: [BUILD FAILURE] Spark Project ML Local Library - me or it's real?

2016-04-09 Thread Ted Yu
The broken build was caused by the following: [SPARK-14462][ML][MLLIB] add the mllib-local build to maven pom See https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-maven-hadoop-2.7/607/ FYI On Sat, Apr 9, 2016 at 12:01 PM, Jacek Laskowski wrote: > Hi, > > Is this me or the build is

Re: Spark 1.6.1 Hadoop 2.6 package on S3 corrupt?

2016-04-06 Thread Ted Yu
Front >>> (i.e. >>> >> the “direct download” option on spark.apache.org) are also corrupt. >>> >> >>> >> Btw what’s the correct way to verify the SHA of a Spark package? I’ve >>> tried >>> >> a few commands on working pack

Re: BROKEN BUILD? Is this only me or not?

2016-04-05 Thread Ted Yu
i > > https://medium.com/@jaceklaskowski/ > Mastering Apache Spark http://bit.ly/mastering-apache-spark > Follow me at https://twitter.com/jaceklaskowski > > > On Tue, Apr 5, 2016 at 8:41 PM, Ted Yu wrote: > > Looking at recent > > > https://amplab.cs.berkeley.e

Re: BROKEN BUILD? Is this only me or not?

2016-04-05 Thread Ted Yu
Looking at recent https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-maven-hadoop-2.7 builds, there was no such error. I don't see anything wrong with the code: usage = "_FUNC_(str) - " + "Returns str, with the first letter of each word in uppercase, all other letters in " + Mind

Re: Updating Spark PR builder and 2.x test jobs to use Java 8 JDK

2016-04-05 Thread Ted Yu
Josh: You may have noticed the following error ( https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-maven-hadoop-2.7/566/console ): [error] javac: invalid source release: 1.8 [error] Usage: javac [error] use -help for a list of possible options On Tue, Apr 5, 2016 at 2:14 PM, Josh Ro

Re: [STREAMING] DStreamClosureSuite.scala with { return; ssc.sparkContext.emptyRDD[Int] } Why?!

2016-04-05 Thread Ted Yu
The next line should give some clue: expectCorrectException { ssc.transform(Seq(ds), transformF) } A closure shouldn't include a return statement. On Tue, Apr 5, 2016 at 3:40 PM, Jacek Laskowski wrote: > Hi, > > In > https://github.com/apache/spark/blob/master/streaming/src/test/scala/org/apache/spark/st

Re: Build with Thrift Server & Scala 2.11

2016-04-05 Thread Ted Yu
Raymond: Did "namenode" appear in any of the Spark config files ? BTW Scala 2.11 is used by the default build. On Tue, Apr 5, 2016 at 6:22 AM, Raymond Honderdors < raymond.honderd...@sizmek.com> wrote: > I can see that the build is successful > > (-Pyarn -Phadoop-2.6 -Dhadoop.version=2.6.0 -Phi

Re: error: reference to sql is ambiguous after import org.apache.spark._ in shell?

2016-04-04 Thread Ted Yu
Looks like the import comes from repl/scala-2.11/src/main/scala/org/apache/spark/repl/SparkILoop.scala : processLine("import sqlContext.sql") On Mon, Apr 4, 2016 at 5:16 PM, Jacek Laskowski wrote: > Hi Spark devs, > > I'm unsure if what I'm seeing is correct. I'd appreciate any input > to

Re: explain codegen

2016-04-04 Thread Ted Yu
on't you wipe everything out and try again? > > On Monday, April 4, 2016, Ted Yu wrote: > >> The commit you mentioned was made Friday. >> I refreshed workspace Sunday - so it was included. >> >> Maybe this was related: >> >> $ bin/spark-shell >&g

Re: Spark 1.6.1 Hadoop 2.6 package on S3 corrupt?

2016-04-04 Thread Ted Yu
; > >>> > On Fri, Mar 18, 2016 at 6:11 PM Jakob Odersky >>> wrote: >>> >> >>> >> I just experienced the issue, however retrying the download a second >>> >> time worked. Could it be that there is some load balancer/c

Re: RDD Partitions not distributed evenly to executors

2016-04-04 Thread Ted Yu
bq. the modifications do not touch the scheduler If the changes can be ported over to 1.6.1, do you mind reproducing the issue there ? I ask because master branch changes very fast. It would be good to narrow the scope where the behavior you observed started showing. On Mon, Apr 4, 2016 at 6:12
