Re: [Spark SQL] Making InferSchema and JacksonParser public

2017-01-18 Thread Brian Hong
Yes, that is the option I took while implementing this under Spark 1.4. But every time there is a major update in Spark, I needed to re-copy the needed parts, which is very time consuming. The reason is that InferSchema and JacksonParser use many more Spark-internal methods, which makes this very

Re: GraphX-related "open" issues

2017-01-18 Thread Dongjin Lee
Hi all, I am currently working on SPARK-15880[^1] and also have some interest in SPARK-7244[^2] and SPARK-7257[^3]. In fact, SPARK-7244 and SPARK-7257 are of some importance in the graph analysis field. Could you make them an exception? Since I am working on graph analysis, I hope to take them. If need

Re: [SQL][SPARK-14160] Maximum interval for o.a.s.sql.functions.window

2017-01-18 Thread Michael Armbrust
+1, we should just fix the error to explain why months aren't allowed and suggest that you manually specify some number of days. On Wed, Jan 18, 2017 at 9:52 AM, Maciej Szymkiewicz wrote: > Thanks for the response Burak, > > As any sane person I try to steer away from the objects which have both

Re: Possible bug - Java iterator/iterable inconsistency

2017-01-18 Thread Sean Owen
Hm. Unless I am also totally missing or forgetting something, I think you're right. The equivalent in PairRDDFunctions.scala operates on a function from T to TraversableOnce[U], and a TraversableOnce is most like java.util.Iterator. You can work around it by wrapping it in a faked IteratorIterabl
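The wrapping workaround Sean mentions could look roughly like the sketch below. The class name and shape are hypothetical (this is not Spark's own adapter); it is a one-shot bridge from `java.util.Iterator` to the `Iterable` that the current `flatMapValues` signature expects:

```java
import java.util.Arrays;
import java.util.Iterator;

// Hypothetical adapter: exposes a plain Iterator through the Iterable
// interface. Note: iterator() should only be called once, because the
// wrapped iterator is consumed as it is traversed.
public class IteratorIterable<T> implements Iterable<T> {
    private final Iterator<T> it;

    public IteratorIterable(Iterator<T> it) {
        this.it = it;
    }

    @Override
    public Iterator<T> iterator() {
        return it;
    }

    public static void main(String[] args) {
        Iterator<Integer> source = Arrays.asList(1, 2, 3).iterator();
        int sum = 0;
        // The enhanced for loop requires an Iterable, not an Iterator.
        for (int x : new IteratorIterable<>(source)) {
            sum += x;
        }
        System.out.println(sum); // prints 6
    }
}
```

The single-use caveat matters: a second call to `iterator()` would return an already-exhausted iterator, which is why this is a workaround rather than a general-purpose adapter.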

Possible bug - Java iterator/iterable inconsistency

2017-01-18 Thread Asher Krim
In Spark 2 + Java + RDD API, the use of iterables was replaced with iterators. I just encountered an inconsistency in `flatMapValues` that may be a bug: `flatMapValues` (https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/api/java/JavaPairRDD.scala#L677) takes a `Flat

Re: clientMode in RpcEnv.create in Spark on YARN vs general case (driver vs executors)?

2017-01-18 Thread Marcelo Vanzin
On Wed, Jan 18, 2017 at 1:29 AM, Jacek Laskowski wrote: > I'm trying to get the gist of clientMode input parameter for > RpcEnv.create [1]. It is disabled (i.e. false) by default. "clientMode" means whether the RpcEnv only opens external connections (client) or also accepts incoming connections.
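Marcelo's client/server distinction can be pictured with a plain-Java socket analogy. This is purely illustrative and not Spark's actual RpcEnv code: with `clientMode = true` the endpoint never binds a listening port and can only open outbound connections, while with `clientMode = false` it also accepts incoming ones (as the driver does, with executors connecting back as clients):

```java
import java.io.IOException;
import java.net.ServerSocket;

// Illustrative analogy only, NOT Spark's RpcEnv implementation:
// a "client mode" endpoint skips binding a listening socket.
public class RpcEndpointSketch {
    final ServerSocket listener; // null when running in client mode

    RpcEndpointSketch(boolean clientMode) throws IOException {
        // Port 0 lets the OS pick any free ephemeral port.
        listener = clientMode ? null : new ServerSocket(0);
    }

    boolean acceptsConnections() {
        return listener != null;
    }

    public static void main(String[] args) throws IOException {
        RpcEndpointSketch driver = new RpcEndpointSketch(false);
        RpcEndpointSketch executor = new RpcEndpointSketch(true);
        System.out.println(driver.acceptsConnections());   // true
        System.out.println(executor.acceptsConnections()); // false
        driver.listener.close();
    }
}
```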

Re: can someone review my PR?

2017-01-18 Thread Marcelo Vanzin
On Wed, Jan 18, 2017 at 6:16 AM, Steve Loughran wrote: > it's failing on the dependency check as the dependencies have changed. > that's what it's meant to do. should I explicitly be changing the values so > that the build doesn't notice the change? Yes. There's no automated way to do that, inten

Re: [SQL][SPARK-14160] Maximum interval for o.a.s.sql.functions.window

2017-01-18 Thread Maciej Szymkiewicz
Thanks for the response Burak, Like any sane person, I try to steer away from objects which have both "calendar" and "unsafe" in their fully qualified names, but if there is no bigger picture I missed here, I would go with 1 as well. And of course fix the error message. I understand this has been intro

Re: [Spark SQL] Making InferSchema and JacksonParser public

2017-01-18 Thread Reynold Xin
That is internal, but the amount of code is not a lot. Can you just copy the relevant classes over to your project? On Wed, Jan 18, 2017 at 5:52 AM Brian Hong wrote: > I work for a mobile game company. I'm solving a simple question: "Can we > efficiently/cheaply query for the log of a particular

Re: [Spark SQL] Making InferSchema and JacksonParser public

2017-01-18 Thread Michael Allman
Personally, I'd love to see some kind of pluggability or configurability in the JSON schema parsing, maybe as an option in the DataFrameReader. Perhaps you can propose an API? > On Jan 18, 2017, at 5:51 AM, Brian Hong wrote: > > I work for a mobile game company. I'm solving a simple question: "Ca

Re: Spark sql query plan contains all the partitions from hive table even though filtering of partitions is provided

2017-01-18 Thread Michael Allman
Based on what you've described, I think you should be able to use Spark's parquet reader plus partition pruning in 2.1. > On Jan 17, 2017, at 10:44 PM, Raju Bairishetti wrote: > > Thanks for the detailed explanation. Is it completely fixed in spark-2.1.0? > > We are giving very high memory t

Re: [SQL][SPARK-14160] Maximum interval for o.a.s.sql.functions.window

2017-01-18 Thread Burak Yavuz
Hi Maciej, I believe it would be useful to either fix the documentation or fix the implementation. I'll leave it to the community to comment on. The code right now disallows intervals provided in months and years, because they are not a "consistently" fixed amount of time. A month can be 28, 29, 3
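The fixed-versus-variable-unit distinction Burak describes can be sketched outside Spark. This is a toy illustration, not `TimeWindow.getIntervalInMicroseconds` itself: every supported unit maps to one constant number of microseconds, but a month has no such constant, so it is rejected with an explanatory message:

```java
import java.util.Map;

// Toy sketch (not Spark's TimeWindow code): units up to a week have a
// fixed length in microseconds; months and years do not, so they are
// rejected rather than silently approximated.
public class IntervalMicros {
    static final Map<String, Long> MICROS_PER_UNIT = Map.of(
            "second", 1_000_000L,
            "minute", 60L * 1_000_000L,
            "hour",   3_600L * 1_000_000L,
            "day",    24L * 3_600L * 1_000_000L,
            "week",   7L * 24L * 3_600L * 1_000_000L);

    static long toMicros(long amount, String unit) {
        Long micros = MICROS_PER_UNIT.get(unit);
        if (micros == null) {
            // A month can be 28, 29, 30 or 31 days, so there is no
            // single constant to multiply by.
            throw new IllegalArgumentException(
                "Intervals in " + unit + "s vary in length; "
                + "specify the window in days or smaller units instead.");
        }
        return amount * micros;
    }

    public static void main(String[] args) {
        System.out.println(toMicros(1, "day")); // 86400000000
        try {
            toMicros(1, "month");
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```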

ApacheCon CFP closing soon (11 February)

2017-01-18 Thread Rich Bowen
Hello, fellow Apache enthusiast. Thanks for your participation in, and interest in, the projects of the Apache Software Foundation. I wanted to remind you that the Call For Papers (CFP) for ApacheCon North America, and Apache: Big Data North America, closes in less than a month. If you've been puttin

Re: GC limit exceed

2017-01-18 Thread Daniel van der Ende
Hi Marco, What kind of scheduler are you using on your cluster? Yarn? Also, are you running in client mode or cluster mode on the cluster? Daniel On Wed, Jan 18, 2017 at 3:22 PM, marco rocchi < rocchi.1407...@studenti.uniroma1.it> wrote: > I have a spark code that works well over a sample of d

GC limit exceed

2017-01-18 Thread marco rocchi
I have Spark code that works well over a sample of data in local mode, but when I run the same code on a cluster with the entire dataset I receive a "GC overhead limit exceeded" error. Is it possible to submit the code here and get some hints to solve my problem? Thanks a lot for the attent

Re: can someone review my PR?

2017-01-18 Thread Steve Loughran
On 18 Jan 2017, at 11:18, Sean Owen <so...@cloudera.com> wrote: It still doesn't pass tests -- I'd usually not look until that point. it's failing on the dependency check as the dependencies have changed. that's what it's meant to do. should I explicitly be changing the values so that t

[Spark SQL] Making InferSchema and JacksonParser public

2017-01-18 Thread Brian Hong
I work for a mobile game company. I'm solving a simple question: "Can we efficiently/cheaply query for the log of a particular user within a given date period?" I've created a special JSON text-based file format that has these traits:
- Snappy compressed, saved in AWS S3
- Partitioned by date, i.e.
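The date-partitioning trait described above is what makes the query cheap: only the partitions in the requested range need to be listed and read. A minimal sketch of that idea, where the `dt=YYYY-MM-DD` layout and the bucket name are assumptions for illustration and not taken from the thread:

```java
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: build one path prefix per day in the requested
// range, so a reader opens only the matching partitions instead of
// scanning the whole bucket. The dt=YYYY-MM-DD layout is assumed.
public class DatePartitionPaths {
    static List<String> pathsForRange(String base, LocalDate from, LocalDate to) {
        List<String> paths = new ArrayList<>();
        for (LocalDate d = from; !d.isAfter(to); d = d.plusDays(1)) {
            paths.add(base + "/dt=" + d); // LocalDate.toString() is YYYY-MM-DD
        }
        return paths;
    }

    public static void main(String[] args) {
        List<String> paths = pathsForRange("s3://logs",
                LocalDate.of(2017, 1, 16), LocalDate.of(2017, 1, 18));
        paths.forEach(System.out::println);
        // s3://logs/dt=2017-01-16
        // s3://logs/dt=2017-01-17
        // s3://logs/dt=2017-01-18
    }
}
```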

Re: can someone review my PR?

2017-01-18 Thread Sean Owen
It still doesn't pass tests -- I'd usually not look until that point. On Wed, Jan 18, 2017 at 11:10 AM Steve Loughran wrote: > I've had a PR outstanding on spark/object store integration, works for > both maven and sbt builds > > https://issues.apache.org/jira/browse/SPARK-7481 > https://github.

can someone review my PR?

2017-01-18 Thread Steve Loughran
I've had a PR outstanding on spark/object store integration; it works for both maven and sbt builds. https://issues.apache.org/jira/browse/SPARK-7481 https://github.com/apache/spark/pull/12004 Can I get someone to review this, as it appears to have been overlooked amongst all the PRs? thanks -steve

clientMode in RpcEnv.create in Spark on YARN vs general case (driver vs executors)?

2017-01-18 Thread Jacek Laskowski
Hi, I'm trying to get the gist of clientMode input parameter for RpcEnv.create [1]. It is disabled (i.e. false) by default. I've managed to find out that, in the "general" case, it's enabled for executors and disabled for the driver. (it's also used for Spark Standalone's master and workers but

Re: 答复: Limit Query Performance Suggestion

2017-01-18 Thread Liang-Chi Hsieh
Hi zhenhua, Thanks for the idea. Actually, I think we can completely avoid shuffling the data in a limit operation, no matter LocalLimit or GlobalLimit. wangzhenhua (G) wrote > How about this: > 1. we can make LocalLimit shuffle to multiple partitions, i.e. create a new > partitioner to unifor

答复: Limit Query Performance Suggestion

2017-01-18 Thread wangzhenhua (G)
How about this: 1. we can make LocalLimit shuffle to multiple partitions, i.e. create a new partitioner to uniformly dispatch the data

class LimitUniformPartitioner(partitions: Int) extends Partitioner {
  def numPartitions: Int = partitions
  var num = 0
  def getPartition(key: Any): Int = {
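The snippet above is cut off by the archive, but the "uniformly dispatch" intent suggests a round-robin assignment that ignores the key. A plain-Java sketch of that idea; the `getPartition` body is my guess at the truncated code, not the author's actual implementation:

```java
// Sketch of the LimitUniformPartitioner idea from the mail: ignore the
// key entirely and deal rows out round-robin, so every partition
// receives roughly numRows / numPartitions records. The getPartition
// body is an assumed completion of the truncated original.
public class LimitUniformPartitioner {
    private final int partitions;
    private int num = 0;

    public LimitUniformPartitioner(int partitions) {
        this.partitions = partitions;
    }

    public int numPartitions() {
        return partitions;
    }

    public int getPartition(Object key) {
        int p = num;
        num = (num + 1) % partitions; // cycle 0, 1, ..., partitions - 1
        return p;
    }

    public static void main(String[] args) {
        LimitUniformPartitioner part = new LimitUniformPartitioner(3);
        for (int i = 0; i < 6; i++) {
            System.out.print(part.getPartition("row-" + i)); // prints 012012
        }
        System.out.println();
    }
}
```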

[SQL][SPARK-14160] Maximum interval for o.a.s.sql.functions.window

2017-01-18 Thread Maciej Szymkiewicz
Hi, Can I ask for some clarifications regarding the intended behavior of window / TimeWindow? PySpark documentation states that "Windows in the order of months are not supported". This is further confirmed by the checks in TimeWindow.getIntervalInMicroseconds (https://git.io/vMP5l). Surprisingly eno

Re: RpcEnv(Factory) is no longer pluggable? spark.rpc is gone, isn't it?

2017-01-18 Thread Jacek Laskowski
On Wed, Jan 18, 2017 at 8:57 AM, Jacek Laskowski wrote: > p.s. How to know when the deprecation was introduced? The last change > is for executor blacklisting so git blame does not show what I want :( > Any ideas? Figured that out myself! $ git log --topo-order --graph -u -L 641,641:core/src/ma

RpcEnv(Factory) is no longer pluggable? spark.rpc is gone, isn't it?

2017-01-18 Thread Jacek Laskowski
Hi, Given [1]: > DeprecatedConfig("spark.rpc", "2.0", "Not used any more.") I believe the comment in [2]: > A RpcEnv implementation must have a [[RpcEnvFactory]] implementation with an > empty constructor so that it can be created via Reflection. Correct? Deserves a pull request to get rid of