Re: Question about loading IMDb dataset from CSV files

Mehnaz Tabassum Mahin Thu, 13 Jun 2024 15:12:02 -0700

Sounds great! Thank you so much.

-- Mehnaz


On Thu, Jun 13, 2024 at 12:52 PM Mike Carey <[email protected]> wrote:

> Nice!
>
> On Thu, Jun 13, 2024 at 12:25 PM Ian Maxon <[email protected]> wrote:
>
> > There's a patch up on Gerrit now that should allow files like this to be
> > parsed. If you give the option ("escape"="\\"), it should be passed all
> > the way down to the parser, which has been extended to allow characters
> > other than " to act as the escape for ". I tried it on the IMDb dataset
> > from that benchmark and it appeared to parse all the lines, as long as
> > there are no multi-line quoted strings. I just compared SELECT COUNT(*)
> > ... against 'wc -l'.
> >
> > On Jun 11, 2024 at 17:03:19, Ian Maxon <[email protected]> wrote:
> >
> > > I believe so. I think FieldCursorForDelimitedDataParser just needs to
> > > be refactored to allow some character other than quote to begin an
> > > escape, and it should be able to parse this fine.
> > > I'd be curious about others' thoughts as well, though. I am surprised
> > > we haven't hit this yet from other sources.
> > >
> > > On Jun 11, 2024 at 16:58:08, Mike Carey <[email protected]> wrote:
> > >
> > > > So suppose we make the long-term rule that we can't change any of
> > > > the lines in the file (:-)) - as that's the customer's data - and
> > > > want to import all of it - what're the specific moves in the CSV
> > > > game that are needed in terms of being able to swallow the IMDb data
> > > > whole?  (To allow configurable escape, that is?)
> > > >
> > > > (As a workaround to be unblocked for testing/benchmarking I guess
> > > > Mehnaz can break the no-changing-lines rule in the very short term -
> > > > but - that's not ideal because we want to talk to the owners of the
> > > > benchmark she's using and say that we're using exactly their data.)
> > > >
> > > On 6/11/24 4:13 PM, Ian Maxon wrote:
> > >
> > >
> > >   The problem is sort of multifaceted. DelimitedDataParser doesn't
> allow
> > >
> > >
> > > configuration of the escape character. QuotedLineRecordReader does, but
> > it
> > >
> > >
> > > isn't parsing the fields. You also only get that if you specify
> > >
> > >
> > > "format"="csv", and not "delimited-text".
> > >
> > >
> > > The csv isn't compliant with what's stated in RFC4180. There, the
> escape
> > >
> > >
> > > character is "". This is what DelimitedDataParser follows. If the line
> is
> > >
> > >
> > > changed to use that ("" insteade of \"), it works fine.
> > >
> > >
> > > I think we should consider supporting configurable escape during parse,
> > >
> > >
> > > since it can't really be expected that CSV should follow that RFC
> > strictly;
> > >
> > >
> > > it is somewhat of an ad-hoc format.
> > >
> > >
> > >
> > > On Jun 11, 2024 at 08:30:48, Mike Carey<[email protected]>  wrote:
> > >
> > >
> > >
> > > > I’m told the relevant code is in QuotedLineRecordReader, that's where
> > >
> > >
> > > > CSV/TSV parsing takes place, so you can have a look at what is
> > happening
> > >
> > >
> > > > there.  There’s also an undocumented escape flag there (which we need
> > to
> > >
> > >
> > > > test and document).  Others will probably have more details…. 🙂
> > >
> > >
> > > >
> > >
> > >
> > > > > On Mon, Jun 10, 2024 at 4:18 PM Mehnaz Tabassum Mahin
> > > > > <[email protected]> wrote:
> > > > >
> > > > > Hello everyone,
> > > > >
> > > > > I am trying to load the IMDb dataset in AsterixDB. It seems that
> > > > > some of the rows end up with broken escaping and eventually are
> > > > > not inserted at all. For example, I used the following syntax:
> > > > >
> > > > > LOAD DATASET movie_companies using localfs (
> > > > > ("path"="asterix_nc1://imdb-data/movie-companies.csv"),
> > > > > ("format"="delimited-text"),("delimiter"=","), ("null"="")
> > > > > );
> > > > >
> > > > > The schema is movie_companies (id: int, movie_id: int,
> > > > > company_id: int, company_type_id: int, note: string) and the CSV
> > > > > file contains the following row:
> > > > >
> > > > > 13893, 53192, 1376, 1, "(1986) (USA) (VHS) (included in \"The Best Of Alfred Hitchcock, Vol. One\")"
> > > > >
> > > > > This row ends up not loading at all. The rest of the rows, which
> > > > > have no such string input, can be loaded successfully.
> > > > >
> > > > > Any suggestions?
> > > > >
> > > > > Thanks,
> > > > > Mehnaz
> > > > >
> >
>
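
For anyone following along, here is a small sketch of the two escaping
conventions discussed above, run on the row from the original question. It
uses Python's standard csv module, not AsterixDB's DelimitedDataParser or the
Gerrit patch, so it only illustrates the idea. The parse_row helper and its
"escape" argument are hypothetical names made up for this example. Python's
reader is also more lenient than the AsterixDB loader: instead of rejecting
the backslash-escaped row, it silently splits it in the wrong place.

```python
import csv

# Row from the thread: the embedded quotes are escaped with a backslash.
BACKSLASH_ROW = (
    r'13893, 53192, 1376, 1, "(1986) (USA) (VHS) '
    r'(included in \"The Best Of Alfred Hitchcock, Vol. One\")"'
)

# The same row rewritten in RFC 4180 style: embedded quotes are doubled.
RFC4180_ROW = BACKSLASH_ROW.replace('\\"', '""')


def parse_row(line, escape=None):
    """Parse one CSV line.

    `escape` is a hypothetical stand-in for a configurable escape option;
    it is not the AsterixDB API.
    """
    if escape is None:
        # RFC 4180 rules: a quote inside a quoted field is written as "".
        reader = csv.reader([line], skipinitialspace=True)
    else:
        # Configurable escape: `escape` followed by a quote is a literal quote.
        reader = csv.reader(
            [line], skipinitialspace=True, doublequote=False, escapechar=escape
        )
    return next(reader)


rfc_fields = parse_row(RFC4180_ROW)                      # doubled quotes, RFC rules
escaped_fields = parse_row(BACKSLASH_ROW, escape="\\")   # backslash as escape
naive_fields = parse_row(BACKSLASH_ROW)                  # backslash data, RFC rules

print(len(rfc_fields), len(escaped_fields), len(naive_fields))
```

With a matching escape setting, both forms of the row yield the same five
fields. Parsing the backslash form under RFC 4180 rules breaks the note field
at the comma inside the title, which is the kind of mismatch the thread
describes.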
