> Can this split and combine be done automatically by cassandra when inserting/fetching the file without the application being bothered about it?
There are client libraries which offer recipes for this, but in general,
no. You're trying to do something with Cassandra that it's not designed to
do. You can get there from here, but you're not going to have a good time.

If you need a document store, you should use a NoSQL solution designed
with that in mind (Cassandra is a columnar store). If you need a
distributed filesystem, you should use one of those.

If you do want to continue forward and do this with Cassandra, then you
should definitely not do it on the same cluster that handles normal
clients: the kind of workload you'd be subjecting the cluster to will
cause all sorts of trouble for normal clients, particularly GC pressure,
compaction and streaming problems, and many other consequences of vastly
exceeding recommended limits.

On Fri, Jan 2, 2015 at 9:53 AM, Srinivasa T N <seen...@gmail.com> wrote:

> On Fri, Jan 2, 2015 at 5:54 PM, mck <m...@apache.org> wrote:
>
>> You could manually chunk them down to 64MB pieces.
>
> Can this split and combine be done automatically by cassandra when
> inserting/fetching the file without the application being bothered about
> it?
>
>> > 2) Can I replace HDFS with Cassandra so that I don't have to
>> > sync/fetch the file from cassandra to HDFS when I want to process it
>> > in hadoop cluster?
>>
>> We keep HDFS as a volatile filesystem simply for hadoop internals. No
>> need for backups of it, no need to upgrade data, and we're free to wipe
>> it whenever hadoop has been stopped.
>> ~mck
>
> Since the hadoop MR streaming job requires the file to be processed to
> be present in HDFS, I was wondering whether it can be fetched directly
> from cassandra instead of me manually fetching it and placing it in a
> directory before submitting the hadoop job?
>
>> There was a datastax project before in being able to replace HDFS with
>> Cassandra, but i don't think it's alive anymore.
>
> I think you are referring to the Brisk project (
> http://blog.octo.com/en/introduction-to-datastax-brisk-an-hadoop-and-cassandra-distribution/)
> but I don't know its current status.
>
> Can I use http://gerrymcnicol.azurewebsites.net/ for my task at hand?
>
> Regards,
> Seenu.
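
A minimal sketch of the manual-chunking recipe mentioned above, using the
DataStax Python driver. The keyspace, table layout, contact point, and 1MB
chunk size are illustrative assumptions, not something from this thread:

from uuid import uuid4
from cassandra.cluster import Cluster

CHUNK_SIZE = 1024 * 1024  # keep chunks small; large blobs stress the heap

cluster = Cluster(['127.0.0.1'])  # assumed contact point
session = cluster.connect()

# One row per chunk; chunks cluster under the file id in upload order.
session.execute("""
    CREATE KEYSPACE IF NOT EXISTS files
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}
""")
session.execute("""
    CREATE TABLE IF NOT EXISTS files.chunks (
        file_id  uuid,
        chunk_no int,
        data     blob,
        PRIMARY KEY (file_id, chunk_no)
    )
""")

insert = session.prepare(
    "INSERT INTO files.chunks (file_id, chunk_no, data) VALUES (?, ?, ?)")
select = session.prepare(
    "SELECT data FROM files.chunks WHERE file_id = ?")

def put_file(path):
    """Split a local file into fixed-size chunks, one INSERT per chunk."""
    file_id = uuid4()
    with open(path, 'rb') as f:
        chunk_no = 0
        while True:
            data = f.read(CHUNK_SIZE)
            if not data:
                break
            session.execute(insert, (file_id, chunk_no, data))
            chunk_no += 1
    return file_id

def get_file(file_id, path):
    """Reassemble the chunks, which come back ordered by chunk_no."""
    with open(path, 'wb') as f:
        for row in session.execute(select, [file_id]):
            f.write(row.data)

Note that the split and combine still live in the application, which is
the point being made above: Cassandra itself won't do this for you.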