Fabian,
I'm not sure we are on the same page. If I run something like the code below,
it partitions the data by hash on field 0, and each task writes a separate
part file in parallel.
    val sink = data1.join(data2)
      .where(1).equalTo(0) { (l, r) => (l._3, r._3) }
      .partitionByHash(0)
      .writeAsCsv(pathBase + "output/test", rowDelimiter = "\n",
        fieldDelimiter = "\t", WriteMode.OVERWRITE)
This creates a folder with one part file per task: ./output/test/<1,2,3,4...>
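
To illustrate why those part-file numbers are task indices rather than key values, here is a minimal plain-Scala sketch. It is a simplified model only: `taskIndex` and `assign` are hypothetical helpers, and the modulo-of-hashCode rule is an assumption for illustration, not Flink's internal hash function.

```scala
// Simplified model of hash partitioning; NOT Flink's internal implementation.
object HashPartitionSketch {
  // Hypothetical helper: map a key to one of `parallelism` task indices.
  def taskIndex(key: Any, parallelism: Int): Int =
    Math.floorMod(key.hashCode, parallelism)

  // Group records by the task that would receive them (keyed on field 0).
  def assign[T](records: Seq[(Int, T)], parallelism: Int): Map[Int, Seq[(Int, T)]] =
    records.groupBy(r => taskIndex(r._1, parallelism))

  def main(args: Array[String]): Unit = {
    val records = Seq((1, "a"), (2, "b"), (3, "c"), (4, "d"), (1, "e"))
    // With 2 tasks, keys 1 and 3 land in one part file, keys 2 and 4 in the other.
    assign(records, parallelism = 2).toSeq.sortBy(_._1).foreach {
      case (task, rs) => println(s"part file for task $task: ${rs.mkString(", ")}")
    }
  }
}
```

In this model, records with the same key always reach the same task, but one part file can still hold several distinct keys.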
What I was looking for, though, is a Hive-style partitionBy that writes the
output with this folder structure:
./output/field0=1/file
./output/field0=2/file
./output/field0=3/file
./output/field0=4/file
This assumes field0 is an Int with the distinct values 1, 2, 3 and 4.
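
To make the target layout concrete, here is a minimal sketch in plain Scala using only java.nio. It is not a Flink API: `partitionDir` and `write` are hypothetical helpers, the three-field record type is an assumption, and dropping the partition column from the rows simply mirrors Hive's convention.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Path}

// Plain-Scala illustration of the Hive-style layout; not a Flink API.
object HiveStyleLayoutSketch {
  // Hypothetical helper: directory for one partition value, e.g. <base>/field0=1
  def partitionDir(base: Path, column: String, value: Any): Path =
    base.resolve(s"$column=$value")

  // Write each group of records into its own partition directory as a
  // tab-separated file named "file". The partition column (field 0) is
  // encoded in the directory name and omitted from the rows.
  def write(base: Path, column: String, records: Seq[(Int, String, String)]): Unit =
    records.groupBy(_._1).foreach { case (key, rows) =>
      val dir = Files.createDirectories(partitionDir(base, column, key))
      val lines = rows.map { case (_, a, b) => s"$a\t$b" }
      val bytes = lines.mkString("", "\n", "\n").getBytes(StandardCharsets.UTF_8)
      Files.write(dir.resolve("file"), bytes)
    }

  def main(args: Array[String]): Unit = {
    val base = Files.createTempDirectory("output")
    write(base, "field0", Seq((1, "a", "x"), (2, "b", "y"), (1, "c", "z")))
    // Creates <base>/field0=1/file and <base>/field0=2/file
    println(s"wrote partitions under $base")
  }
}
```

The key difference from the hash-partitioned output is that the directory is derived from the key value itself, not from the index of the task that wrote it.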
Srikanth
On Mon, Feb 15, 2016 at 6:20 AM, Fabian Hueske <[email protected]> wrote:
> Hi Srikanth,
>
> DataSet.partitionBy() will partition the data on the declared partition
> fields.
> If you append a DataSink with the same parallelism as the partition
> operator, the data will be written out with the defined partitioning.
> It should be possible to achieve the behavior you described using
> DataSet.partitionByHash() or partitionByRange().
>
> Best, Fabian
>
>
> 2016-02-12 20:53 GMT+01:00 Srikanth <[email protected]>:
>
>> Hello,
>>
>>
>>
>> Is there a Hive (or Spark DataFrame) partitionBy equivalent in Flink?
>>
>> I'm looking to save output as CSV files partitioned by two columns (date
>> and hour).
>>
>> The DataSet partitionBy API is meant more for partitioning the data by a
>> column for further processing.
>>
>>
>>
>> I'm thinking there is no direct API to do this. But what will be the best
>> way of achieving this?
>>
>>
>>
>> Srikanth
>>
>>
>>
>
>