+1 to RowCoder being an implementation detail (and it doesn't make much sense to parameterize its string encodings; logically it just has string fields).
It does make sense, however, to augment CsvIO to be able to name an encoding that is used to decode the bytes of the file (producing "standard" Rows, which would presumably have in-memory representations as Java Strings, so the encoding chosen for them would not be an issue).

On Wed, Nov 27, 2024 at 4:56 PM Reuven Lax via dev <dev@beam.apache.org> wrote:
>
> The RowCoder encoding is not really intended to be an external encoding -
> i.e. it's not intended to be a stable encoding for writing into files. While
> it's fine to take in PCollection<Row> in your write operation, I would not
> recommend just using RowCoder to generate the bytes written to the
> file.
>
> On Wed, Nov 27, 2024 at 1:46 PM Facundo Tomatis <facundotoma...@gmail.com>
> wrote:
>>
>> Hello everyone!
>>
>> I've been developing a CSV connector that wraps CsvIO. The read
>> operation outputs PCollection<Row> and the write operation takes
>> PCollection<Row>. I am having trouble setting the encoding of the
>> input and output files; for example, I would like to write a CSV
>> with ISO-8859-1 or windows-1250 encoding, and read from those
>> encodings as well.
>>
>> Reading the source code, I found that Row's String fields (generated
>> with RowCoder.of(schema)) have a StringUtf8Coder associated. Is there
>> a way to change this coder to a custom coder while maintaining
>> PCollection<Row>?
>>
>> Thanks for your time.
>>
>> Facu.
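To make the suggestion concrete: the proposed CsvIO option would only need to do the charset decode at the file boundary, and everything downstream sees ordinary Java Strings. A minimal plain-JDK sketch of that decode step (no Beam involved; `decodeLine` is a hypothetical helper name, not an existing CsvIO method):

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class CharsetDecodeSketch {

    // Decode raw CSV bytes using a caller-named charset. Once decoded,
    // the result is a standard Java String, so the file's original
    // encoding no longer matters to anything downstream (RowCoder
    // included, since it only sees Strings).
    static String decodeLine(byte[] rawBytes, String charsetName) {
        return new String(rawBytes, Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        // "café" encoded as ISO-8859-1: the single byte 0xE9 is 'é'
        // in that charset (the same bytes are not valid UTF-8).
        byte[] latin1 = {0x63, 0x61, 0x66, (byte) 0xE9};
        System.out.println(decodeLine(latin1, "ISO-8859-1")); // café

        // ASCII round-trips identically under UTF-8.
        byte[] ascii = "abc".getBytes(StandardCharsets.US_ASCII);
        System.out.println(decodeLine(ascii, "UTF-8")); // abc
    }
}
```

On the write side the symmetric step would be `someString.getBytes(Charset.forName(charsetName))` just before the bytes hit the file, again leaving the PCollection<Row> untouched.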