Hi Ryan,
On Mon, Feb 4, 2019 at 12:17 PM Ryan Blue wrote:
>
> To partition by a condition, you would need to create a column with the
> result of that condition. Then you would partition by that column. The sort
> option would also work here.
We actually do something similar to filter based on
Just wondering if this is what you are implying, Ryan (example only):
val data = (dataset to be partitioned)
val splitCondition =
  s"""
     CASE
       WHEN …. THEN ….
       WHEN …. THEN ….
     END partition_condition
   """
val partitionedData = data.withColumn("partitionColumn", expr(splitCondition))
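For completeness, here is a self-contained sketch of the same idea that actually runs; the data, condition, and output path below are invented for illustration and are not from the thread. It derives the partition column with expr() and then uses DataFrameWriter.partitionBy so each condition value gets its own output directory:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.expr

val spark = SparkSession.builder().appName("partition-by-condition").getOrCreate()
import spark.implicits._

// Placeholder data; substitute the real dataset to be partitioned.
val data = Seq((1, 10), (2, 200), (3, 3000)).toDF("id", "amount")

// Derive a column from the condition...
val splitCondition =
  """CASE
    |  WHEN amount < 100  THEN 'small'
    |  WHEN amount < 1000 THEN 'medium'
    |  ELSE 'large'
    |END""".stripMargin

val partitionedData = data.withColumn("partitionColumn", expr(splitCondition))

// ...then partition the output by that column; each distinct value of
// partitionColumn becomes its own directory under the (made-up) output path.
partitionedData.write
  .partitionBy("partitionColumn")
  .parquet("/tmp/partitioned-output")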
Likely need a shim (which we should have anyway) because of namespace/import
changes.
I’m a huge +1 on this.
From: Hyukjin Kwon
Sent: Monday, February 4, 2019 12:27 PM
To: Xiao Li
Cc: Sean Owen; Felix Cheung; Ryan Blue; Marcelo Vanzin; Yuming Wang; dev
Subject: R
I should check the details and feasibility myself, but to me it sounds
fine if it doesn't need big extra effort.
On Tue, 5 Feb 2019, 4:15 am Xiao Li wrote:
> Yes. When our support/integration with Hive 2.x becomes stable, we can do
> it in Hadoop 2.x profile too, if needed. The whole proposal is to min
Yes. When our support/integration with Hive 2.x becomes stable, we can do
it in Hadoop 2.x profile too, if needed. The whole proposal is to minimize
the risk and ensure the release stability and quality.
Hyukjin Kwon wrote on Mon, Feb 4, 2019 at 12:01 PM:
> Xiao, to check if I understood correctly, do you mean
Xiao, to check if I understood correctly, do you mean the below?
1. Use our fork with the Hadoop 2.x profile for now, and use Hive 2.x with the
Hadoop 3.x profile.
2. Make another, newer version of the Thrift server based on Hive 2.x(?) on the Spark side.
3. Target the complete transition to Hive 2.x slowly, later.
To partition by a condition, you would need to create a column with the
result of that condition. Then you would partition by that column. The sort
option would also work here.
I don't think that there is much of a use case for this. You have a set of
conditions on which to partition your data, an
Thx Xiao!
On Mon, Feb 4, 2019 at 9:04 AM Xiao Li wrote:
> Thank you, Imran!
>
> Also, I attached the slides of "Deep Dive: Scheduler of Apache Spark".
>
> Cheers,
>
> Xiao
>
>
>
> John Zhuge wrote on Mon, Feb 4, 2019 at 8:59 AM:
>
>> Thanks Imran!
>>
>> On Mon, Feb 4, 2019 at 8:42 AM Imran Rashid
>> wrote:
>
Thanks Li and Imran for providing us an overview of one of the complex
modules in Spark 👍 Excellent sharing.
Regards
Sujith.
On Mon, 4 Feb 2019 at 10:54 PM, Xiao Li wrote:
> Thank you, Imran!
>
> Also, I attached the slides of "Deep Dive: Scheduler of Apache Spark".
>
> Cheers,
>
> Xiao
>
>
Thank you Imran, this is quite helpful.
Regards,
Parth Kamlesh Gandhi
On Mon, Feb 4, 2019 at 11:01 AM Rubén Berenguel
wrote:
> Thanks Imran, will definitely give it a look (even if just out of sheer
> interest in how the sausage is made)
>
> R
>
>
> On 4 February 2019 at 17:59:33, John Zhuge (
Hello Ryan,
On Mon, Feb 4, 2019 at 10:52 AM Ryan Blue wrote:
>
> Andrew, can you give us more information about why partitioning the output
> data doesn't work for your use case?
>
> It sounds like all you need to do is to create a table partitioned by A and
> B, then you would automatically ge
To reduce the impact and risk of upgrading Hive execution JARs, we can just
upgrade the built-in Hive to 2.x when using the profile of Hadoop 3.x. The
support of Hadoop 3 will still be experimental in our next release. That
means, the impact and risk are very minimal for most users who are still
us
Thanks Imran, will definitely give it a look (even if just out of sheer
interest in how the sausage is made)
R
On 4 February 2019 at 17:59:33, John Zhuge (jzh...@apache.org) wrote:
Thanks Imran!
On Mon, Feb 4, 2019 at 8:42 AM Imran Rashid
wrote:
> The scheduler has been pretty error-prone an
Thanks Imran!
On Mon, Feb 4, 2019 at 8:42 AM Imran Rashid
wrote:
> The scheduler has been pretty error-prone and hard to work on, and I feel
> like there may be a dwindling core of active experts. I'm sure it's very
> discouraging to folks trying to make what seem like simple changes, and
> then
Andrew, can you give us more information about why partitioning the output
data doesn't work for your use case?
It sounds like all you need to do is to create a table partitioned by A and
B, then you would automatically get the divisions you want. If what you're
looking for is a way to scale the n
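For reference, a minimal sketch of the table partitioning Ryan describes, assuming Spark's DataFrameWriter.partitionBy; the column names A and B, the sample data, and the output path are placeholders, not from the thread:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("partition-by-a-and-b").getOrCreate()
import spark.implicits._

// Placeholder data containing the two partitioning columns.
val df = Seq(("x", "p", 1), ("x", "q", 2), ("y", "p", 3)).toDF("A", "B", "value")

// Writing partitioned by A and B lays files out as A=<value>/B=<value>/ directories,
// so queries that filter on A and B read only the matching directories.
df.write
  .partitionBy("A", "B")
  .parquet("/tmp/table_partitioned_by_A_and_B")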
The scheduler has been pretty error-prone and hard to work on, and I feel
like there may be a dwindling core of active experts. I'm sure it's very
discouraging to folks trying to make what seem like simple changes, and
then find they are in a rat's nest of complex issues they weren't
expecting. But
Hello
On Sat, Feb 2, 2019 at 12:19 AM Moein Hosseini wrote:
>
> I've seen many applications that need to split a dataset into multiple datasets based
> on some conditions. As there is no method to do it in one place, developers
> use the filter method multiple times. I think it can be useful to have a method t
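To illustrate the pattern being described (the dataset, column names, and conditions are invented for illustration), today each split is typically a separate filter over the same Dataset, so the source may be evaluated once per condition unless it is cached:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("split-by-filter").getOrCreate()
import spark.implicits._

// Placeholder input.
val events = Seq(("click", 1), ("view", 2), ("click", 3)).toDF("kind", "id")

// One filter call per condition; the proposal is about doing this in one place.
val clicks = events.filter($"kind" === "click")
val views  = events.filter($"kind" === "view")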
I was unclear from this thread what the objection to these PRs is:
https://github.com/apache/spark/pull/23552
https://github.com/apache/spark/pull/23553
Would we like to specifically discuss whether to merge these or not? I
hear support for it, concerns about continuing to support Hive too,
but I