Hi Philip,

here are a few additions to what Max said:

- ORDER BY: As Max said, Flink's sortPartition() only sorts within a partition and does not produce a total order. You can either set the parallelism to 1, as Max suggested, or use a custom partitioner to range-partition the data and then sort each partition.
- SORT BY: From your description, the semantics are not 100% clear. If SORT BY refers to the order of tuples WITHIN a reduce function call, it should be groupBy().sortGroup() in Flink instead of sortPartition().
- DISTRIBUTE BY: This should be partitionByHash() instead of groupBy(). groupBy() will also sort the data, which is not required for DISTRIBUTE BY.
- CLUSTER BY: This should be partitionByHash().sortPartition().
- Reduce vs. GroupReduce: A ReduceFunction is always combinable. This is optional for GroupReduceFunctions.
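To make the ORDER BY point concrete, here is a rough sketch in plain Java. It does not use Flink at all: the class PartitionSortDemo and its methods are hypothetical names that only model hash partitioning, range partitioning, and a per-partition sort on in-memory lists. It assumes integer records and range boundaries given in ascending order.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Plain-Java model of partitioned sorting. No Flink classes are used;
// every name here is hypothetical and only mimics the behaviour described above.
final class PartitionSortDemo {

    private static List<List<Integer>> emptyPartitions(int n) {
        List<List<Integer>> parts = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            parts.add(new ArrayList<>());
        }
        return parts;
    }

    // Models hash partitioning: equal values share a partition, but the
    // partitions are not ordered relative to each other.
    static List<List<Integer>> partitionByHash(List<Integer> data, int parallelism) {
        List<List<Integer>> parts = emptyPartitions(parallelism);
        for (Integer v : data) {
            parts.get(Math.floorMod(v.hashCode(), parallelism)).add(v);
        }
        return parts;
    }

    // Models range partitioning: a value goes to the first partition whose
    // boundary is greater than it; the last partition takes the rest.
    // Assumes boundaries are in ascending order.
    static List<List<Integer>> partitionByRange(List<Integer> data, int[] boundaries) {
        List<List<Integer>> parts = emptyPartitions(boundaries.length + 1);
        for (Integer v : data) {
            int p = 0;
            while (p < boundaries.length && v >= boundaries[p]) {
                p++;
            }
            parts.get(p).add(v);
        }
        return parts;
    }

    // Models a per-partition sort: each partition is sorted independently.
    static List<List<Integer>> sortEachPartition(List<List<Integer>> parts) {
        List<List<Integer>> sorted = new ArrayList<>();
        for (List<Integer> part : parts) {
            List<Integer> copy = new ArrayList<>(part);
            Collections.sort(copy);
            sorted.add(copy);
        }
        return sorted;
    }

    // Reads the partitions back one after another, in partition order.
    static List<Integer> concat(List<List<Integer>> parts) {
        List<Integer> all = new ArrayList<>();
        for (List<Integer> part : parts) {
            all.addAll(part);
        }
        return all;
    }

    static boolean isSorted(List<Integer> values) {
        for (int i = 1; i < values.size(); i++) {
            if (values.get(i - 1) > values.get(i)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        List<Integer> data = Arrays.asList(5, 2, 9, 1, 8, 4, 7, 3);
        // Hash partitioning + per-partition sort: sorted runs, no total order.
        System.out.println(concat(sortEachPartition(partitionByHash(data, 2))));
        // Range partitioning + per-partition sort: total order across partitions.
        System.out.println(concat(sortEachPartition(partitionByRange(data, new int[] {5}))));
        // A single partition (parallelism 1): total order as well.
        System.out.println(concat(sortEachPartition(partitionByHash(data, 1))));
    }
}
```

With two hash partitions the output consists of two sorted runs, which is the CLUSTER BY behaviour. With a range partitioner or a single partition, reading the partitions in order yields a totally ordered result, which is what ORDER BY requires.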

Cheers, Fabian

2015-10-19 13:01 GMT+02:00 Maximilian Michels <m...@apache.org>:

> Hi Philip,
>
> Thank you for your questions. I think you have mapped the Hive
> functions to the Flink ones correctly. Just a remark on the ORDER BY.
> You wrote that it produces a total order of all the records. In this
> case, you'd have to do a sortPartition operation with parallelism set
> to 1. This is necessary because we need to have all records in one
> place to perform a sort on them.
>
> Considering your reduce question: There is no fundamental
> advantage/disadvantage of using GroupReduce over Reduce. It depends on
> your use case which one is more convenient or efficient. For the
> regular reduce, you just get two elements and produce one. You can't
> easily keep state between the reduces other than in the value itself.
> The GroupReduce, on the other hand, may produce none, one, or multiple
> elements per grouping and keep state in between emitting values. Thus,
> GroupReduce is a more powerful operator and can be seen as a superset
> of the Reduce operator. I would advise you to use the one you find
> easiest to use.
>
> Best regards,
> Max
>
> On Sun, Oct 18, 2015 at 9:16 PM, Philip Lee <philjj...@gmail.com> wrote:
> > Hi Flink people, I have a question about translating Hive queries
> > to Flink functions using the Table API. In short, I am working on a
> > benchmark for Flink.
> >
> > I am Philip Lee, a master's student in Computer Science at TUB. I am
> > working on translating the Hive queries of a benchmark to Flink code.
> >
> > While studying this, I came up with a few questions.
> >
> > First of all, for those who do not know the Hive functions, let me
> > briefly explain them.
> >
> > ORDER BY: guarantees a total order in the output.
> > SORT BY: only guarantees the ordering of the rows within a reducer.
> > GROUP BY: this is just the GROUP BY function in SQL.
> > DISTRIBUTE BY: all rows with the same DISTRIBUTE BY columns will go
> > to the same reducer.
> > CLUSTER BY: this is DISTRIBUTE BY on a column plus SORT BY on the
> > same column.
> >
> > I just want to check that the Flink functions I use are equivalent
> > to the Hive ones.
> >
> > < Hive SQL Query = Flink functions >
> >
> > ORDER BY = sortPartition(,)
> > SORT BY = groupBy(`col).sortPartition(,)
> > GROUP BY: this is just the groupBy function.
> > DISTRIBUTE BY = groupBy(`col)
> > CLUSTER BY = groupBy(`col).sortPartition(,)
> >
> > I do not see much difference between GROUP BY and DISTRIBUTE BY when
> > I map them to Flink functions. If this were the Hadoop version, we
> > could say the mapper does the DISTRIBUTE BY on Hadoop. However, I am
> > not sure what the equivalent of DISTRIBUTE BY is in Flink. My guess
> > is that groupBy in Flink is the function that distributes the rows
> > by the specified key.
> >
> > Please feel free to correct what I suggested.
> >
> > Secondly, I want to make sure I understand the difference between
> > the reduce function and reduceGroup. I guess there must be a
> > trade-off between the two functions. I know reduceGroup is invoked
> > with an Iterator, but in which cases is it more appropriate and
> > beneficial to use reduceGroup rather than reduce?
> >
> > Best Regards,
> > Philip