You can change the CSV parser library.
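Alternatively, a pre-filter on the raw lines keeps the parser away from the oversized rows without swapping the library at all. A rough, untested sketch (the input path and the comma delimiter are placeholders, and a plain split ignores quoted delimiters, so a real pre-filter would need a quote-aware tokenizer):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("csv-prefilter").getOrCreate()
import spark.implicits._

// Read the raw lines and drop anything wider than the limit the thread mentions.
val maxCols = 20480
val kept = spark.read.textFile("/path/to/input.csv")       // placeholder path
  .filter(line => line.split(",", -1).length <= maxCols)   // naive: ignores quoted commas

// Spark 2.2+ can parse a Dataset[String] directly; on 2.1.1 you would have to
// write `kept` back out to storage and point the CSV reader at that.
val df = spark.read.option("header", "true").csv(kept)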
> On 8. Jun 2017, at 08:35, Chanh Le <giaosu...@gmail.com> wrote:
>
> I did add mode -> DROPMALFORMED, but it still couldn't ignore the bad rows
> because the error is raised from the CSV library that Spark is using.
>
>> On Thu, Jun 8, 2017 at 12:11 PM Jörn Franke <jornfra...@gmail.com> wrote:
>>
>> The CSV data source allows you to skip invalid lines - this should also
>> include lines that have more than maxColumns. Choose mode "DROPMALFORMED".
>>
>>> On 8. Jun 2017, at 03:04, Chanh Le <giaosu...@gmail.com> wrote:
>>>
>>> Hi Takeshi, Jörn Franke,
>>>
>>> The problem is that even if I increase maxColumns, some lines still have
>>> more columns than the limit I set, and a large limit costs a lot of memory.
>>> So I just want to skip any line that has more columns than the maxColumns
>>> I set.
>>>
>>> Regards,
>>> Chanh
>>>
>>>> On Thu, Jun 8, 2017 at 12:48 AM Takeshi Yamamuro <linguin....@gmail.com> wrote:
>>>>
>>>> Is it not enough to set `maxColumns` in CSV options?
>>>>
>>>> https://github.com/apache/spark/blob/branch-2.1/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVOptions.scala#L116
>>>>
>>>> // maropu
>>>>
>>>>> On Wed, Jun 7, 2017 at 9:45 AM, Jörn Franke <jornfra...@gmail.com> wrote:
>>>>>
>>>>> The Spark CSV data source should be able to handle this.
>>>>>
>>>>>> On 7. Jun 2017, at 17:50, Chanh Le <giaosu...@gmail.com> wrote:
>>>>>>
>>>>>> Hi everyone,
>>>>>> I am using Spark 2.1.1 to read CSV files and convert them to Avro files.
>>>>>> One problem I am facing is that if one row of a CSV file has more columns
>>>>>> than maxColumns (default is 20480), the parsing process stops:
>>>>>>
>>>>>> Internal state when error was thrown: line=1, column=3, record=0, charIndex=12
>>>>>> com.univocity.parsers.common.TextParsingException: java.lang.ArrayIndexOutOfBoundsException - 2
>>>>>> Hint: Number of columns processed may have exceeded limit of 2 columns.
>>>>>> Use settings.setMaxColumns(int) to define the maximum number of columns your input can have
>>>>>> Ensure your configuration is correct, with delimiters, quotes and escape
>>>>>> sequences that match the input format you are trying to parse
>>>>>> Parser Configuration: CsvParserSettings:
>>>>>>
>>>>>> I did some investigation in the univocity library, but the way it handles
>>>>>> this is to throw an error, which is why Spark stops the process.
>>>>>>
>>>>>> How can I skip the invalid rows and just continue parsing the next valid
>>>>>> ones? Are there any libraries that could replace univocity for this job?
>>>>>>
>>>>>> Thanks & regards,
>>>>>> Chanh
>>>>>> --
>>>>>> Regards,
>>>>>> Chanh
>>>>
>>>> --
>>>> ---
>>>> Takeshi Yamamuro
>>>
>>> --
>>> Regards,
>>> Chanh
>
> --
> Regards,
> Chanh
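For reference, the two options discussed in this thread combine roughly as below. This is a sketch with a placeholder path, not a tested fix - and as Chanh reports above, on 2.1.1 DROPMALFORMED does not catch the univocity maxColumns error, so the limit still has to be raised high enough for the widest row:

val df = spark.read
  .option("mode", "DROPMALFORMED")   // skip rows Spark can flag as malformed
  .option("maxColumns", "20480")     // forwarded to univocity (see CSVOptions.scala above)
  .csv("/path/to/input.csv")         // placeholder path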