Thank you very much for coming back and reporting the good news! :-) If you think that there is something we can do to improve Flink's ORC input format, for example logging a warning when a declared field matches no column in the file, please open a Jira. A small sketch of the matching difference follows below.
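To make the issue concrete, here is a minimal, hypothetical sketch (not Flink's actual OrcInputFormat code) contrasting a strict, case-sensitive column lookup with the equalsIgnoreCase-based lookup you observed in ORC's includeColumns; the upper-case field names are assumptions for illustration:

import java.util.Arrays;
import java.util.List;

public class ColumnMatching {

    // Strict, case-sensitive lookup: "tagIndustry" does not match
    // "TAGINDUSTRY", so the field would silently read as null.
    static int findColumnStrict(List<String> fileSchema, String name) {
        return fileSchema.indexOf(name);
    }

    // Case-insensitive lookup, in the spirit of ORC's includeColumns,
    // which compares column names with equalsIgnoreCase.
    static int findColumnLenient(List<String> fileSchema, String name) {
        for (int i = 0; i < fileSchema.size(); i++) {
            if (fileSchema.get(i).equalsIgnoreCase(name)) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        List<String> fileSchema = Arrays.asList("CONTENT", "TAGINDUSTRY");
        System.out.println(findColumnStrict(fileSchema, "tagIndustry"));  // -1, field lost
        System.out.println(findColumnLenient(fileSchema, "tagIndustry")); // 1, field found
    }
}

A warning could be logged whenever the strict lookup fails but the lenient one succeeds, since that almost always indicates a case mismatch rather than a genuinely missing column.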
Thank you,
Fabian

On Wed, Sep 25, 2019 at 05:14, 163 <shuyun...@163.com> wrote:

> Hi Fabian,
>
> After debugging in local mode, I found that the Flink ORC connector is not
> the problem: some fields in our schema are written in upper case, so those
> fields could not be matched.
> But a program that reads the ORC file directly uses the includeColumns
> method, which matches columns with equalsIgnoreCase, so it can read these
> fields.
>
> Thanks for your help!
>
> Qi Shu
>
>
> On Sep 24, 2019, at 4:36 PM, Fabian Hueske <fhue...@gmail.com> wrote:
>
> Hi QiShu,
>
> It might be that Flink's OrcInputFormat has a bug.
> Can you open a Jira issue to report the problem?
> In order to be able to fix this, we need as much information as possible.
> It would be great if you could create a minimal example of an ORC file and
> a program that reproduces the issue.
> If that's not possible, we need the schema of an ORC file that cannot be
> correctly read.
>
> Thanks,
> Fabian
>
> On Mon, Sep 23, 2019 at 11:40, ShuQi <shuyun...@163.com> wrote:
>
>> Hi Guys,
>>
>> The Flink version is 1.9.0. I use OrcTableSource to read an ORC file in
>> HDFS, and the job executes successfully with no exception or error. But
>> some fields (such as tagIndustry) are always null, although these fields
>> are not null in the file. I can read them when I read the file directly.
>> Below is my code:
>>
>> // main
>> final ParameterTool params = ParameterTool.fromArgs(args);
>>
>> final ExecutionEnvironment env =
>>     ExecutionEnvironment.getExecutionEnvironment();
>>
>> env.getConfig().setGlobalJobParameters(params);
>>
>> Configuration config = new Configuration();
>>
>> OrcTableSource orcTableSource = OrcTableSource
>>     .builder()
>>     .path(params.get("input"))
>>     .forOrcSchema(TypeDescription.fromString(TYPE_INFO))
>>     .withConfiguration(config)
>>     .build();
>>
>> DataSet<Row> dataSet = orcTableSource.getDataSet(env);
>>
>> DataSet<Tuple2<String, Integer>> counts =
>>     dataSet.flatMap(new Tokenizer()).groupBy(0).sum(1);
>>
>> // read fields in Tokenizer.flatMap
>> public void flatMap(Row row, Collector<Tuple2<String, Integer>> out) {
>>
>>     String content = (String) row.getField(6);
>>     String tagIndustry = (String) row.getField(35);
>>
>>     LOGGER.info("arity: " + row.getArity());
>>     LOGGER.info("content: " + content);
>>     LOGGER.info("tagIndustry: " + tagIndustry);
>>     LOGGER.info("===========================");
>>
>>     if (Strings.isNullOrEmpty(content) ||
>>             Strings.isNullOrEmpty(tagIndustry) ||
>>             !tagIndustry.contains("FD001")) {
>>         return;
>>     }
>>
>>     // normalize and split the line
>>     String[] tokens = content.toLowerCase().split("\\W+");
>>
>>     // emit the pairs
>>     for (String token : tokens) {
>>         if (token.length() > 0) {
>>             out.collect(new Tuple2<>(token, 1));
>>         }
>>     }
>> }
>>
>> Thanks for your help!
>>
>> QiShu
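For anyone who finds this thread later: below is a minimal sketch of the workaround, assuming the file was written with upper-case field names. The schema string, path, and class name are hypothetical; the point is that the names passed to forOrcSchema must match the case stored in the file exactly, otherwise those columns are read back as null.

import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.orc.OrcTableSource;
import org.apache.flink.types.Row;
import org.apache.hadoop.conf.Configuration;
import org.apache.orc.TypeDescription;

public class OrcExactCaseExample {

    public static void main(String[] args) throws Exception {
        final ExecutionEnvironment env =
            ExecutionEnvironment.getExecutionEnvironment();

        // Field names copied verbatim from the file's schema; writing
        // "content"/"tagIndustry" here would silently yield nulls instead.
        String schema = "struct<CONTENT:string,TAGINDUSTRY:string>";

        OrcTableSource orcTableSource = OrcTableSource
            .builder()
            .path("hdfs:///path/to/data.orc") // hypothetical path
            .forOrcSchema(TypeDescription.fromString(schema))
            .withConfiguration(new Configuration())
            .build();

        DataSet<Row> rows = orcTableSource.getDataSet(env);
        rows.first(5).print(); // print a few rows to verify the fields are non-null
    }
}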