Hi Philip,

Here are a few additions to what Max said:
- ORDER BY: As Max said, Flink's sortPartition() only sorts within a
partition and does not produce a total order. You can either set the
parallelism to 1 as Max suggested or use a custom partitioner to range
partition the data.
- SORT BY: From your description, the semantics are not 100% clear. If SORT
BY refers to the order of tuples WITHIN a reduce function call, it should
be groupBy().sortGroup() in Flink instead of sortPartition().
- DISTRIBUTE BY: This should be partitionByHash() instead of groupBy().
groupBy() will also sort the data, which is not required for DISTRIBUTE BY.
- CLUSTER BY: This should be partitionByHash().sortPartition().
- Reduce vs. GroupReduce: A ReduceFunction is always combinable. This is
optional for GroupReduceFunctions.
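
To make the difference between a per-partition sort and a total order
concrete, here is a minimal sketch in plain Java. It only models the
semantics and does not call Flink: the class PartitionSortSketch and its
methods (hashPartition, sortEachPartition, concat, isSorted) are
hypothetical names for illustration, and the modulo-based hashing is an
assumption, not Flink's actual partitioner.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Plain-Java model of the semantics only; this is not Flink code.
class PartitionSortSketch {

    // Models partitionByHash(): equal keys always land in the same partition.
    // Assumption: a simple hash-modulo scheme stands in for the real partitioner.
    static List<List<Integer>> hashPartition(List<Integer> data, int parallelism) {
        List<List<Integer>> parts = new ArrayList<>();
        for (int i = 0; i < parallelism; i++) {
            parts.add(new ArrayList<>());
        }
        for (int value : data) {
            int target = Math.floorMod(Integer.hashCode(value), parallelism);
            parts.get(target).add(value);
        }
        return parts;
    }

    // Models sortPartition(): every partition is sorted independently.
    static List<List<Integer>> sortEachPartition(List<List<Integer>> parts) {
        List<List<Integer>> sorted = new ArrayList<>();
        for (List<Integer> part : parts) {
            List<Integer> copy = new ArrayList<>(part);
            Collections.sort(copy);
            sorted.add(copy);
        }
        return sorted;
    }

    // Reads the partitions one after another, as a consumer of the output would.
    static List<Integer> concat(List<List<Integer>> parts) {
        List<Integer> out = new ArrayList<>();
        for (List<Integer> part : parts) {
            out.addAll(part);
        }
        return out;
    }

    // True if the list is in non-decreasing order.
    static boolean isSorted(List<Integer> xs) {
        for (int i = 1; i < xs.size(); i++) {
            if (xs.get(i - 1) > xs.get(i)) {
                return false;
            }
        }
        return true;
    }
}
```

With the input 5, 2, 8, 1, 4, 7 and a parallelism of 2, the sketch yields
the sorted partitions [2, 4, 8] and [1, 5, 7]. Each partition is ordered
(the CLUSTER BY case), but reading them back to back is not a total order.
With a parallelism of 1 there is a single partition, so the result is
totally ordered (the ORDER BY case).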

Cheers, Fabian



2015-10-19 13:01 GMT+02:00 Maximilian Michels <m...@apache.org>:

> Hi Philip,
>
> Thank you for your questions. I think you have mapped the HIVE
> functions to the Flink ones correctly. Just a remark on the ORDER BY.
> You wrote that it produces a total order of all the records. In this
> case, you'd have to do a sortPartition operation with parallelism set to
> 1. This is necessary because we need to have all records in one place
> to perform a sort on them.
>
> Considering your reduce question: There is no fundamental
> advantage/disadvantage of using GroupReduce over Reduce. It depends on
> your use case which one is more convenient or efficient. For the
> regular reduce, you just get two elements and produce one. You can't
> easily keep state between the reduces other than in the value itself.
> The GroupReduce, on the other hand, may produce none, one, or multiple
> elements per grouping and keep state in between emitting values. Thus,
> GroupReduce is a more powerful operator and can be seen as a superset
> of the Reduce operator. I would advise you to use the one you find
> easiest to use.
>
> Best regards,
> Max
>
> On Sun, Oct 18, 2015 at 9:16 PM, Philip Lee <philjj...@gmail.com> wrote:
> > Hi Flink people, I have a question about translating Hive queries to
> > Flink functions using the Table API. In short, I am working on a
> > benchmark for Flink.
> >
> > I am Philip Lee, a Master's student in Computer Science at TUB. I work
> > on translating the Hive queries of a benchmark to Flink code.
> >
> > While studying this, I came up with a few questions.
> >
> > First of all, for people who do not know the Hive functions, let me
> > briefly explain.
> >
> > ORDER BY: guarantees a total order in the output.
> > SORT BY: only guarantees the ordering of rows within a reducer.
> > GROUP BY: this is just the GROUP BY clause in SQL.
> > DISTRIBUTE BY: all rows with the same DISTRIBUTE BY columns go to the
> > same reducer.
> > CLUSTER BY: equivalent to DISTRIBUTE BY plus SORT BY on the same
> > columns.
> >
> > I just want to check that the Flink functions I use are equivalent to
> > the Hive ones.
> > < Hive SQL Query = Flink functions >
> >
> > ORDER BY = sortPartition(,)
> > SORT BY= groupBy(`col).sortPartition(,)
> > GROUP BY: this is just groupBy function.
> > DISTRIBUTE BY = groupBy(`col)
> > CLUSTER BY = groupBy(`col).sortPartition(,)
> >
> > I do not see much difference between GROUP BY and DISTRIBUTE BY when I
> > map them to Flink functions.
> > In the Hadoop version, we could say that the mapper does the DISTRIBUTE
> > BY. However, I am not sure what DISTRIBUTE BY corresponds to in Flink.
> > My guess is that groupBy in Flink is the function that distributes the
> > rows by the specified key.
> >
> > Please feel free to correct what I suggested.
> >
> >
> > Secondly, I just want to make sure I understand the difference between
> > the reduce function and reduceGroup. I guess there must be a trade-off
> > between the two functions. I know reduceGroup is invoked with an
> > Iterator, but in which cases is it more appropriate and beneficial to
> > use reduceGroup rather than reduce?
> >
> > Best Regards,
> > Philip
> >
> > --
> >
> > ==========================================================
> >
> > Hae Joon Lee
> >
> >
> > Now, in Germany,
> >
> > M.S. Candidate, Interested in Distributed System, Iterative Processing
> >
> > Dept. of Computer Science, Informatik in German, TUB
> >
> > Technical University of Berlin
> >
> >
> > In Korea,
> >
> > M.S. Candidate, Computer Architecture Laboratory
> >
> > Dept. of Computer Science, KAIST
> >
> >
> > Rm# 4414 CS Dept. KAIST
> >
> > 373-1 Guseong-dong, Yuseong-gu, Daejon, South Korea (305-701)
> >
> >
> > Mobile) 49) 015-251-448-278 in Germany, no cellular in Korea
> >
> > ==========================================================
>
