Nice!

On Thu, Jun 13, 2024 at 12:25 PM Ian Maxon <ima...@apache.org> wrote:

> There's a patch up on Gerrit now that should allow files like this to be
> parsed. If you give the option ("escape"="\\") it should pass this down all
> the way to the parser, which has been extended to allow characters other
> than " to be an escape for ". I tried it on the IMDb dataset from that
> benchmark and it appeared to parse all the lines, as long as there
> are no multi-line quoted strings. I just compared SELECT COUNT (*) ... vs 'wc
> -l'.
>
> On Jun 11, 2024 at 17:03:19, Ian Maxon <ima...@apache.org> wrote:
>
> > I believe so. I think FieldCursorForDelimitedDataParser just needs to be
> > refactored to allow some character other than quote to begin an escape,
> > and it should be able to parse this fine.
> > I'd be curious about others' thoughts as well, though. I am surprised we
> > haven't hit this yet from other sources.
> >
> > On Jun 11, 2024 at 16:58:08, Mike Carey <dtab...@gmail.com> wrote:
> >
> > So suppose we make the long-term rule that we can't change any of the
> > lines in the file (:-)) - as that's the customer's data - and want to be
> > able to import all of it - what're the specific moves in the CSV game that
> > are needed in terms of being able to swallow the IMDb data whole?  (To
> > allow configurable escape, that is?)
> >
> > (As a workaround to be unblocked for testing/benchmarking I guess Mehnaz
> > can break the no-changing lines rule in the very short term - but -
> > that's not ideal because we want to talk to the owners of the benchmark
> > she's using and say that we're using exactly their data.)
> >
> > On 6/11/24 4:13 PM, Ian Maxon wrote:
> >
> >   The problem is sort of multifaceted. DelimitedDataParser doesn't allow
> > configuration of the escape character. QuotedLineRecordReader does, but it
> > isn't parsing the fields. You also only get that if you specify
> > "format"="csv", and not "delimited-text".
> > The CSV isn't compliant with what's stated in RFC 4180. There, the escape
> > character is "". This is what DelimitedDataParser follows. If the line is
> > changed to use that ("" instead of \"), it works fine.
> > I think we should consider supporting configurable escape during parse,
> > since it can't really be expected that CSV should follow that RFC strictly;
> > it is somewhat of an ad-hoc format.
> >
> > On Jun 11, 2024 at 08:30:48, Mike Carey <dtab...@gmail.com> wrote:
> >
> > > I’m told the relevant code is in QuotedLineRecordReader, that's where
> > > CSV/TSV parsing takes place, so you can have a look at what is happening
> > > there.  There’s also an undocumented escape flag there (which we need to
> > > test and document).  Others will probably have more details…. 🙂
> > >
> > > On Mon, Jun 10, 2024 at 4:18 PM Mehnaz Tabassum Mahin <
> > > mehnaztabassum.ma...@email.ucr.edu> wrote:
> > >
> > > Hello everyone,
> > >
> > > I am trying to load the IMDb dataset in AsterixDB. It seems that some of
> > > the rows end up with broken escaping and are ultimately not inserted at
> > > all. For example, I used the syntax as follows:
> > >
> > > LOAD DATASET movie_companies using localfs (
> > > ("path"=asterix_nc1://imdb-data/movie-companies.csv),
> > > ("format"="delimited-text"),("delimiter"=","), ("null"="")
> > > );
> > >
> > > The schema is movie_companies (id: int, movie_id: int, company_id: int,
> > > company_type_id: int, note: string) and the CSV file contains the
> > > following row:
> > >
> > > 13893, 53192, 1376, 1, "(1986) (USA) (VHS) (included in \"The Best Of
> > > Alfred Hitchcock, Vol. One\")"
> > >
> > > This row ends up not loading at all. The rest of the rows, which have no
> > > such string input, can be loaded successfully.
> > >
> > > Any suggestions?
> > >
> > > Thanks,
> > > Mehnaz
> > >
> >
>
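
For anyone following the thread who wants to see the two escaping conventions side by side, here is a minimal sketch. It uses Python's standard csv module purely as a stand-in, not AsterixDB's DelimitedDataParser or the Gerrit patch, so it only illustrates the behaviour being discussed. The names parse_row, BACKSLASH_ROW and RFC4180_ROW are hypothetical, and the sample row from the original report has been joined onto one line on the assumption that the mail client wrapped it.

```python
import csv
import io

# The row from the original report, using backslash-escaped quotes (\").
# Assumption: it is a single line in the file; the email wrapped it.
BACKSLASH_ROW = (
    '13893, 53192, 1376, 1, '
    '"(1986) (USA) (VHS) (included in \\"The Best Of Alfred Hitchcock, Vol. One\\")"'
)

# The same row rewritten the RFC 4180 way, with embedded quotes doubled ("").
RFC4180_ROW = (
    '13893, 53192, 1376, 1, '
    '"(1986) (USA) (VHS) (included in ""The Best Of Alfred Hitchcock, Vol. One"")"'
)


def parse_row(line, escape=None):
    """Parse one CSV line into a list of fields.

    `escape` is a hypothetical stand-in for a configurable escape
    character; None means RFC 4180 style doubled quotes.
    """
    if escape is None:
        # Default dialect: a quote inside a quoted field is written as "".
        reader = csv.reader(io.StringIO(line), skipinitialspace=True)
    else:
        # Configurable escape: the character after `escape` is taken literally,
        # and doubled quotes are no longer treated as an escaped quote.
        reader = csv.reader(
            io.StringIO(line),
            skipinitialspace=True,
            escapechar=escape,
            doublequote=False,
        )
    return next(reader)


if __name__ == "__main__":
    print(parse_row(BACKSLASH_ROW, escape="\\"))  # five fields, note intact
    print(parse_row(RFC4180_ROW))                 # same five fields
    print(parse_row(BACKSLASH_ROW))               # mis-split: escape not configured
```

With the escape character set to a backslash, the row parses into the five fields of the schema. Parsed with RFC 4180 rules only, the first \" closes the quoted field early and the comma inside the note is then treated as a delimiter, which gives a wrong field count for this row.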
