Re: Storing large files for later processing through hadoop

2015-01-02 Thread Jacob Rhoden
If it's for auditing, I'd recommend pushing the files out somewhere reasonably external; Amazon S3 works well for this type of thing, and you don't have to worry too much about backups and the like. __ Sent from iPhone > On 3 Jan 2015, at 5:07 pm, Srinivasa T N wrote

Re: Storing large files for later processing through hadoop

2015-01-02 Thread Srinivasa T N
Hi Wilm, The reason is that, for auditing purposes, I want to store the original files as well. Regards, Seenu. On Fri, Jan 2, 2015 at 11:09 PM, Wilm Schumacher wrote: > Hi, > > perhaps I totally misunderstood your problem, but why "bother" with > cassandra for storing in the first place? >

Re: STCS limitation with JBOD?

2015-01-02 Thread Robert Coli
On Fri, Jan 2, 2015 at 11:28 AM, Colin wrote: > Forcing a major compaction is usually a bad idea. What is your reason for > doing that? > I'd say "often" rather than "usually". Lots of people have schemas where they create way too much garbage, and major compaction can be a good response. The docs'

Re: is primary key( foo, bar) the same as primary key ( foo ) with a 'set' of bars?

2015-01-02 Thread Robert Coli
On Fri, Jan 2, 2015 at 11:35 AM, Tyler Hobbs wrote: > > This is not true (with one minor exception). All operations on sets and > maps require no reads. The same is true for appends and prepends on lists, > but delete and set operations on lists with (non-zero) indexes require the > list to be read before the write.
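A minimal sketch of that distinction through the Python driver; the users table, its columns, and the contact point are all hypothetical:

    import uuid
    from cassandra.cluster import Cluster

    session = Cluster(["127.0.0.1"]).connect("demo")  # hypothetical cluster/keyspace
    uid = uuid.uuid4()

    # Set add: a blind append, no read-before-write on the server.
    session.execute(
        "UPDATE users SET emails = emails + {'a@example.com'} WHERE id = %s",
        [uid])

    # List set-by-index: the server must first read the list to locate
    # the element at that position (the "minor exception" above).
    session.execute(
        "UPDATE users SET phones[0] = %s WHERE id = %s",
        ["555-0100", uid])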

Re: Tombstones without DELETE

2015-01-02 Thread Tyler Hobbs
No worries! They're a data type that was introduced in 1.2: http://www.datastax.com/dev/blog/cql3_collections On Fri, Jan 2, 2015 at 12:07 PM, Nikolay Mihaylov wrote: > Hi Tyler, > > sorry for very stupid question - what is a collection ? > > Nick > > On Wed, Dec 31, 2014 at 6:27 PM, Tyler Hobbs
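For readers with the same question, a hypothetical table showing the three collection types that 1.2 introduced:

    from cassandra.cluster import Cluster

    session = Cluster(["127.0.0.1"]).connect("demo")  # hypothetical cluster/keyspace
    session.execute("""
        CREATE TABLE IF NOT EXISTS users (
            id     uuid PRIMARY KEY,
            emails set<text>,        -- unordered, unique values
            phones list<text>,       -- ordered, duplicates allowed
            attrs  map<text, text>   -- key/value pairs
        )""")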

Re: is primary key( foo, bar) the same as primary key ( foo ) with a 'set' of bars?

2015-01-02 Thread Tyler Hobbs
On Fri, Jan 2, 2015 at 1:13 PM, Eric Stevens wrote: > > And also stored entirely for each UPDATE. Change one element, > re-serialize the whole thing to disk. > > Is this true? I thought updates (adds, removes, but not overwrites) > affected just the indicated columns. Isn't it just the reads that involve > reading the entire collection?

Re: STCS limitation with JBOD?

2015-01-02 Thread Colin
Forcing a major compaction is usually a bad idea. What is your reason for doing that? -- Colin Clark +1-320-221-9531 > On Jan 2, 2015, at 1:17 PM, Dan Kinder wrote: > > Hi, > > Forcing a major compaction (using nodetool compact) with STCS will result in > a single sstable (ignoring repair data).

STCS limitation with JBOD?

2015-01-02 Thread Dan Kinder
Hi, Forcing a major compaction (using nodetool compact) with STCS will result in a single sstable (ignoring repair data). However, this seems like it could be a problem for large JBOD setups. For example if I have 1

Re: is primary key( foo, bar) the same as primary key ( foo ) with a 'set' of bars?

2015-01-02 Thread Eric Stevens
> And also stored entirely for each UPDATE. Change one element, re-serialize the whole thing to disk. Is this true? I thought updates (adds, removes, but not overwrites) affected just the indicated columns. Isn't it just the reads that involve reading the entire collection? DS docs talk about r

Re: Best Time Series insert strategy

2015-01-02 Thread Robert Coli
On Tue, Dec 16, 2014 at 1:16 PM, Arne Claassen wrote: > 3) Go to consistency ANY. > Consistency level ANY should probably be renamed to NEVER and removed from the software. It is almost never the correct solution to any problem. =Rob
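To make write durability explicit rather than relying on ANY, a sketch with the Python driver; the events table and its columns are hypothetical:

    import uuid
    from datetime import datetime
    from cassandra import ConsistencyLevel
    from cassandra.cluster import Cluster
    from cassandra.query import SimpleStatement

    session = Cluster(["127.0.0.1"]).connect("demo")  # hypothetical

    # With ANY, a write can land only as a hint and stay unreadable until
    # replayed; ONE (or QUORUM) guarantees at least one replica has it.
    insert = SimpleStatement(
        "INSERT INTO events (id, ts, payload) VALUES (%s, %s, %s)",
        consistency_level=ConsistencyLevel.ONE)
    session.execute(insert, (uuid.uuid4(), datetime.utcnow(), "payload"))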

Re: Number of SSTables grows after repair

2015-01-02 Thread Robert Coli
On Mon, Dec 15, 2014 at 1:51 AM, Michał Łowicki wrote: > We've noticed that number of SSTables grows radically after running > *repair*. What we did today is to compact everything so for each node > number of SStables < 10. After repair it jumped to ~1600 on each node. What > is interesting is th

Re: is primary key( foo, bar) the same as primary key ( foo ) with a 'set' of bars?

2015-01-02 Thread Robert Coli
On Thu, Jan 1, 2015 at 11:04 AM, DuyHai Doan wrote: > 2) collections and maps are loaded entirely by Cassandra for each query, > whereas with clustering columns you can select a slice of columns > And also stored entirely for each UPDATE. Change one element, re-serialize the whole thing to disk.
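The overwrite-versus-append difference in CQL terms, via the Python driver; table and column names are hypothetical (per Tyler Hobbs elsewhere in this digest, overwriting an entire collection also inserts a tombstone):

    import uuid
    from cassandra.cluster import Cluster

    session = Cluster(["127.0.0.1"]).connect("demo")  # hypothetical
    uid = uuid.uuid4()

    # Full overwrite: the whole set is re-serialized and a tombstone is
    # written to shadow the previous contents.
    session.execute(
        "UPDATE users SET emails = {'only@example.com'} WHERE id = %s", [uid])

    # Append: only the new element is written, no tombstone.
    session.execute(
        "UPDATE users SET emails = emails + {'extra@example.com'} WHERE id = %s",
        [uid])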

sstable structure

2015-01-02 Thread Nikolay Mihaylov
Hi, for some time I have been trying to find the structure of an sstable. Is it documented somewhere, or can anyone explain it to me? I am speaking about the "hex dump" of bytes stored on the disk. Nick.

Re: Tombstones without DELETE

2015-01-02 Thread Nikolay Mihaylov
Hi Tyler, sorry for very stupid question - what is a collection ? Nick On Wed, Dec 31, 2014 at 6:27 PM, Tyler Hobbs wrote: > Overwriting an entire collection also results in a tombstone being > inserted. > > On Wed, Dec 24, 2014 at 7:09 AM, Ryan Svihla wrote: > >> You should probably ask on t

Re: Storing large files for later processing through hadoop

2015-01-02 Thread Wilm Schumacher
Hi, perhaps I totally misunderstood your problem, but why "bother" with cassandra for storing in the first place? If your MR job for hadoop is only run once per file (as you wrote above), why not copy the data directly to hdfs, run your MR job, and use cassandra as a sink? As hdfs and yarn are mor
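A minimal sketch of that direct copy, assuming the hadoop client is on the PATH; all paths are hypothetical:

    import subprocess

    local_xml = "/data/incoming/report.xml"  # hypothetical local file
    subprocess.check_call(["hdfs", "dfs", "-mkdir", "-p", "/raw/xml"])
    subprocess.check_call(["hdfs", "dfs", "-put", local_xml, "/raw/xml/"])
    # The MR job then reads /raw/xml and writes its output to cassandra.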

Re: Storing large files for later processing through hadoop

2015-01-02 Thread mck
> Since the hadoop MR streaming job requires the file to be processed to be > present in HDFS, > I was thinking whether it can get it directly from mongodb instead of me > manually fetching it > and placing it in a directory before submitting the hadoop job? Hadoop M/R can get data directly from

Re: Storing large files for later processing through hadoop

2015-01-02 Thread Srinivasa T N
I agree that cassandra is a columnar store. Storing the raw xml file, parsing the file using hadoop, and then storing the extracted values happens only once. The extracted data, on which further operations will be done, suits well the timeseries storage provided by cassandra and th

Re: Storing large files for later processing through hadoop

2015-01-02 Thread Eric Stevens
> Can this split and combine be done automatically by cassandra when inserting/fetching the file without application being bothered about it? There are client libraries which offer recipes for this, but in general, no. You're trying to do something with Cassandra that it's not designed to do. You
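A rough sketch of what such a chunking recipe looks like, with a hypothetical file_chunks table; real client-library recipes also handle retries, metadata, and consistency:

    import uuid
    from cassandra.cluster import Cluster

    CHUNK = 4 * 1024 * 1024  # 4 MB pieces, well under the practical blob limit

    session = Cluster(["127.0.0.1"]).connect("demo")  # hypothetical
    session.execute("""
        CREATE TABLE IF NOT EXISTS file_chunks (
            file_id uuid, seq int, data blob,
            PRIMARY KEY (file_id, seq))""")
    insert = session.prepare(
        "INSERT INTO file_chunks (file_id, seq, data) VALUES (?, ?, ?)")

    def store(path):
        fid = uuid.uuid4()
        with open(path, "rb") as f:
            for seq, chunk in enumerate(iter(lambda: f.read(CHUNK), b"")):
                session.execute(insert, (fid, seq, chunk))
        return fid

    def fetch(fid):
        # Clustering on seq returns the chunks in insertion order.
        rows = session.execute(
            "SELECT data FROM file_chunks WHERE file_id = %s", [fid])
        return b"".join(row.data for row in rows)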

Re: Storing large files for later processing through hadoop

2015-01-02 Thread Srinivasa T N
On Fri, Jan 2, 2015 at 5:54 PM, mck wrote: > > You could manually chunk them down to 64MB pieces. > > Can this split and combine be done automatically by cassandra when inserting/fetching the file, without the application being bothered about it? > > > 2) Can I replace HDFS with Cassandra so that I

Re: Storing large files for later processing through hadoop

2015-01-02 Thread mck
> 1) The FAQ … informs that I can have only files of around 64 MB … See http://wiki.apache.org/cassandra/CassandraLimitations "A single column value may not be larger than 2GB; in practice, "single digits of MB" is a more reasonable limit, since there is no streaming or random access of blob values."

Storing large files for later processing through hadoop

2015-01-02 Thread Srinivasa T N
Hi All, The problem I am trying to address is: store the raw files (the files are in xml format and around 700MB in size) in cassandra, later fetch and process them in a hadoop cluster, and populate the processed data back into cassandra. Regarding this, I wanted a few clarifications: 1) The FAQ ( ht