Not directly related, but you can try zstd as the compression codec - in my tests it offloaded faster, with a slightly worse compression ratio.
Marc Richter at "Wed, 22 Apr 2020 17:57:44 +0200" wrote:
MR> Seems as if sstable2json is deprecated; see [1] and [2].
MR> So, dsbulk [3] it is, I guess.
MR> I downloaded it and crafted the following command line from the docs [4] for my use case:
MR>
MR> $ ../dsbulk-1.5.0/bin/dsbulk unload -h '["MY_CASSANDRA_IP"]' \
MR>     --driver.advanced.auth-provider.class PlainTextAuthProvider \
MR>     -u cassandra -p MY_PASSWORD -k tagdata -t central -c json \
MR>     --connector.json.compression gzip -url /path/to/big/storage
MR>
MR> This seems to result in multiple JSON files compressed with GZIP; that seems to be exactly
MR> what I needed in this case!
MR> There's only one thing that I do not really understand:
MR> Besides the GZIP archives, it also creates two logfiles. One of them (unload-errors.log)
MR> contains some Java stack traces. I do not understand what those lines are supposed to say
MR> (added to pastebin to not render the mail unreadable):
MR> https://pastebin.com/WpYvqxAA
MR> What are those lines supposed to tell me?
MR>
MR> Marc Richter
MR>
MR> [1] https://docs.datastax.com/en/cassandra-oss/2.2/cassandra/tools/toolsSSTable2Json.html
MR> [2] https://issues.apache.org/jira/browse/CASSANDRA-9618
MR> [3] https://downloads.datastax.com/#bulk-loader
MR> [4] https://docs.datastax.com/en/dsbulk/doc/dsbulk/dsbulkRef.html
MR>
MR> On 22.04.20 16:15, Marc Richter wrote:
>> This sounds like a promising way; thank you for bringing this up!
>>
>> I will see if I can manage it with this approach.
>>
>> Best regards,
>> Marc Richter
>>
>> On 22.04.20 15:38, Durity, Sean R wrote:
>>> I thought this might be a single-time use case request. I think my first approach would
>>> be to use something like dsbulk to unload the data and then reload it into a table
>>> designed for the query you want to do (as long as you have adequate disk space). I
>>> think like a DBA/admin first.
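As a side note on consuming what dsbulk produced: with the json connector and gzip compression, the unload directory contains gzipped files of JSON documents, one per line. A minimal sketch of streaming them back in - the file name pattern and the synthetic sample row are illustrative, not taken from the thread:

```python
import glob
import gzip
import json
import os
import tempfile

def iter_unloaded_rows(pattern):
    """Stream rows from gzipped JSON files produced by a dsbulk unload.

    Assumes one JSON document per line; the glob pattern is an assumption
    about how the output files are named.
    """
    for path in sorted(glob.glob(pattern)):
        with gzip.open(path, "rt", encoding="utf-8") as fh:
            for line in fh:
                line = line.strip()
                if line:
                    yield json.loads(line)

# Demo with a small synthetic file standing in for real dsbulk output.
tmpdir = tempfile.mkdtemp()
sample = os.path.join(tmpdir, "output-000001.json.gz")
with gzip.open(sample, "wt", encoding="utf-8") as fh:
    fh.write(json.dumps(
        {"signalid": 4002, "monthyear": 201908, "insertdate": 1566000100}) + "\n")

rows = list(iter_unloaded_rows(os.path.join(tmpdir, "output-*.json.gz")))
print(rows[0]["signalid"])  # 4002
```

Streaming line by line keeps memory flat even when the unload is hundreds of gigabytes.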
>>> Dsbulk creates CSV files, so you could move that data to any kind of database, if you chose.
>>>
>>> An alternative approach would be to use a driver that supports paging (I think this
>>> would be most of them) and write a program to walk the data set and output what you
>>> need in whatever format you need.
>>>
>>> Or, since this is a single-node scenario, you could try sstable2json to export the
>>> sstables (files on disk) into JSON, if that is a more workable format for you.
>>>
>>> Sean Durity – Staff Systems Engineer, Cassandra
>>>
>>> -----Original Message-----
>>> From: Marc Richter <m...@marc-richter.info>
>>> Sent: Wednesday, April 22, 2020 6:22 AM
>>> To: user@cassandra.apache.org
>>> Subject: [EXTERNAL] Re: Issues, understanding how CQL works
>>>
>>> Hi Jeff,
>>>
>>> thank you for your exhaustive and verbose answer!
>>> Also, a very big "Thank you!" to all the other repliers; I hope you understand that I
>>> summarize all your feedback in this single answer.
>>>
>>> From what I understand from your answers, Cassandra seems to be optimized to store
>>> (and read) data in exactly the way the data structure has been designed for. That
>>> makes it very inflexible, but as a trade-off allows it to do that single job very
>>> effectively.
>>>
>>> I also understand, the more I dig into Cassandra, that the team I am supporting is
>>> using Cassandra somewhat wrong; for example, they have only one node and so use
>>> neither the load-balancing nor the redundancy capabilities Cassandra offers.
>>> Thus, a maybe relevant side note: all the data resides on just one single node; maybe
>>> that info is important, because we know on which node the data is (I know that
>>> Cassandra internally applies the same hashing voodoo as if there were 1k nodes, but
>>> maybe this is important anyway).
>>> Anyways: I do not really care if a query or the effort to find this information is
>>> sub-optimal or very "expensive" in terms of system load, since this isn't something
>>> that I need to extract on a regular basis, but only once. Due to that, it doesn't
>>> need to be optimal or efficient; I also do not care if it blocks the node for several
>>> hours, since Cassandra is only working on this single request. I really need this
>>> info (the most recent "insertdate") only once.
>>> Is there, considering this, a way to do that?
>>>
>>> > Because you didnt provide a signalid and monthyear, it doesn't know
>>> > which machine in your cluster to use to start the query.
>>>
>>> I know this already; thanks for confirming that I got this correct! But what do I do
>>> then if I do not know all "signalid"s? How do I learn them?
>>>
>>> Is it maybe possible to get a full list of all "signalid"s? Or is it possible to
>>> "re-arrange" the data in the cluster or something else that enables me to learn what
>>> the most recent "insertdate" is?
>>> I really do not care if I need to do some expensive copy-all-data move, but I do not
>>> know what is possible or how to do it.
>>>
>>> Best regards,
>>> Marc Richter
>>>
>>> On 21.04.20 19:20, Jeff Jirsa wrote:
>>>> On Tue, Apr 21, 2020 at 6:20 AM Marc Richter <m...@marc-richter.info
>>>> <mailto:m...@marc-richter.info>> wrote:
>>>>
>>>>     Hi everyone,
>>>>
>>>>     I'm very new to Cassandra. I have, however, some experience with SQL.
>>>>
>>>> The biggest thing to remember is that Cassandra is designed to scale out to massive
>>>> clusters - like thousands of instances. To do that, you can't assume it's ever ok to
>>>> read all of the data, because that doesn't scale. So cassandra takes shortcuts /
>>>> optimizations to make it possible to ADDRESS all of that data, but not SCAN it.
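Since Marc only needs the answer once and accepts a full scan, Sean's driver-paging suggestion fits: walk every row page by page and track the maximum insertdate. A sketch of that walk - the page source here is an in-memory stand-in, since a real driver cursor needs a live cluster:

```python
# Paged walk over a full table to find the newest insertdate.
# `table` is a tiny stand-in for a real paged result set; a driver
# (e.g. the Python cassandra-driver) would fetch pages lazily instead.
def pages(rows, page_size):
    """Yield the rows in fixed-size pages, like a paging driver would."""
    for i in range(0, len(rows), page_size):
        yield rows[i:i + page_size]

table = [
    {"signalid": 4002, "monthyear": 201908, "insertdate": 1566000300},
    {"signalid": 4711, "monthyear": 201907, "insertdate": 1565000200},
    {"signalid": 4002, "monthyear": 201909, "insertdate": 1567000000},
]

newest = None
for page in pages(table, page_size=2):
    for row in page:
        if newest is None or row["insertdate"] > newest["insertdate"]:
            newest = row

print(newest["insertdate"])  # 1567000000
```

Memory stays bounded by the page size, so the one-off scan is slow but safe even at 260 GB.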
>>>>     I need to extract some information from a Cassandra database that has the
>>>>     following table definition:
>>>>
>>>>     CREATE TABLE tagdata.central (
>>>>         signalid int,
>>>>         monthyear int,
>>>>         fromtime bigint,
>>>>         totime bigint,
>>>>         avg decimal,
>>>>         insertdate bigint,
>>>>         max decimal,
>>>>         min decimal,
>>>>         readings text,
>>>>         PRIMARY KEY (( signalid, monthyear ), fromtime, totime)
>>>>     )
>>>>
>>>> What your primary key REALLY MEANS is:
>>>>
>>>> The database on reads and writes will hash(signalid+monthyear) to find which hosts
>>>> have the data, then
>>>>
>>>> In each data file, the data for a given (signalid, monthyear) is stored sorted by
>>>> fromtime and totime.
>>>>
>>>>     The database is already roughly 260 GB in size.
>>>>     I now need to know what the most recent entry in it is; the correct column to
>>>>     learn this from would be "insertdate".
>>>>
>>>>     In SQL I would do something like this:
>>>>
>>>>     SELECT insertdate FROM tagdata.central
>>>>     ORDER BY insertdate DESC LIMIT 1;
>>>>
>>>>     In CQL, however, I just can't get it to work.
>>>>
>>>>     What I have tried already is this:
>>>>
>>>>     SELECT insertdate FROM "tagdata.central"
>>>>     ORDER BY insertdate DESC LIMIT 1;
>>>>
>>>> Because you didn't provide a signalid and monthyear, it doesn't know which machine
>>>> in your cluster to use to start the query.
>>>>
>>>>     But this gives me an error:
>>>>     ERROR: ORDER BY is only supported when the partition key is restricted by an EQ
>>>>     or an IN.
>>>>
>>>> Because it's designed for potentially petabytes of data per cluster, it doesn't
>>>> believe you really want to walk all the data and order ALL of it. Instead, it
>>>> assumes that when you need to use an ORDER BY, you're going to have some very small
>>>> piece of data - confined to a single signalid/monthyear pair. And even then, the
>>>> ORDER BY is going to assume that you're ordering by the clustering keys you've
>>>> defined - fromtime first, and then totime.
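Jeff's two points - the partition key is hashed to pick the owning hosts, and rows inside one partition are kept sorted by the clustering columns - can be sketched in a toy model. The md5-based node choice is only an illustration (Cassandra really assigns Murmur3 tokens on a ring); the row values are made up:

```python
import hashlib

NODES = ["node-a", "node-b", "node-c"]

def owner(signalid, monthyear):
    """Toy stand-in for token assignment: hash the partition key to pick
    a node. Illustrates why the partition key must be known before the
    query can even be routed."""
    digest = hashlib.md5(f"{signalid}:{monthyear}".encode()).hexdigest()
    return NODES[int(digest, 16) % len(NODES)]

rows = [
    # (signalid, monthyear, fromtime, totime, insertdate)
    (4002, 201908, 300, 360, 1566000300),
    (4002, 201908, 100, 160, 1566000100),
    (4711, 201907, 200, 260, 1565000200),
]

# Within one partition, rows are maintained in clustering-key order
# (fromtime, then totime) - so ORDER BY fromtime is essentially free.
partition = sorted(
    (r for r in rows if (r[0], r[1]) == (4002, 201908)),
    key=lambda r: (r[2], r[3]),
)
print([r[2] for r in partition])  # [100, 300]
# insertdate imposes no order across partitions, which is why a global
# "ORDER BY insertdate" cannot be answered without scanning everything.
```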
>>>> So you can do:
>>>>
>>>>     SELECT ... WHERE signalid=? AND monthyear=? ORDER BY fromtime ASC
>>>>
>>>> And you can do:
>>>>
>>>>     SELECT ... WHERE signalid=? AND monthyear=? ORDER BY fromtime DESC
>>>>
>>>> And you can do ranges:
>>>>
>>>>     SELECT ... WHERE signalid=? AND monthyear=? AND fromtime >= ? ORDER BY fromtime DESC
>>>>
>>>> But you have to work within the boundaries of how the data is stored. It's stored
>>>> grouped by signalid+monthyear, then sorted by fromtime, then sorted by totime.
>>>>
>>>>     So, after some trial and error and a lot of Googling, I learned that I must
>>>>     include all columns of the PRIMARY KEY from left to right in my query. Thus,
>>>>     this is the "best" I can get to work:
>>>>
>>>>     SELECT *
>>>>     FROM "tagdata.central"
>>>>     WHERE "signalid" = 4002
>>>>       AND "monthyear" = 201908
>>>>     ORDER BY "fromtime" DESC
>>>>     LIMIT 10;
>>>>
>>>>     The "monthyear" value I crafted like a fool by incrementing the date one month
>>>>     after another until no results could be found anymore.
>>>>     The "signalid" I grabbed from one of the unrestricted "SELECT * FROM" query
>>>>     results. But these can't be guessed as easily as the "monthyear" values can.
>>>>
>>>>     This is where I'm stuck!
>>>>
>>>>     1. This does not really feel like the ideal way to go. I think there is
>>>>     something more mature in modern IT systems. Can anyone tell me a better way to
>>>>     get this information?
>>>>
>>>> You can denormalize. Because cassandra allows you to have very large clusters, you
>>>> can make multiple tables sorted in different ways to enable the queries you need to
>>>> run. Normal data modeling is to build tables based on the SELECT statements you need
>>>> to do (unless you're very advanced, in which case you do it based on the transaction
>>>> semantics of the INSERT/UPDATE statements, but that's probably not you).
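Marc's manual month-by-month probing can at least be automated: generate the candidate monthyear values (YYYYMM with correct month rollover) and feed each one into a fully restricted SELECT like the one above. A small sketch, not Cassandra-specific:

```python
def monthyear_range(start, end):
    """Yield monthyear ints (YYYYMM) from start to end inclusive,
    rolling the month over correctly - automating the manual
    "increment until no results" probing described in the thread."""
    year, month = divmod(start, 100)
    while year * 100 + month <= end:
        yield year * 100 + month
        month += 1
        if month > 12:
            month = 1
            year += 1

candidates = list(monthyear_range(201911, 202002))
print(candidates)  # [201911, 201912, 202001, 202002]
```

Each candidate would then be tried per signalid; it is still brute force, but a correct and repeatable one.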
>>>> Or you can use a more flexible database.
>>>>
>>>>     2. I need a way to learn all values that are in the "monthyear" and "signalid"
>>>>     columns in order to be able to craft that query.
>>>>     How can I achieve that in a reasonable way? As I said: the DB is roughly 260 GB,
>>>>     which makes it next to impossible to just "have a look" at the output of
>>>>     "SELECT *"...
>>>>
>>>> You probably want to keep another table of monthyear + signalid pairs.
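Jeff's last suggestion - keep a small side table of the (signalid, monthyear) pairs that exist, maintained on every write - can be modeled like this. The dict/set storage is only a stand-in for a real CQL table; names mirror the thread:

```python
# Side "index" of existing partition keys, maintained on every write.
# In Cassandra this would be a small second table written alongside the
# main one; here a set stands in for it.
known_pairs = set()

def record_write(signalid, monthyear):
    """Called on every insert into tagdata.central (sketch)."""
    known_pairs.add((signalid, monthyear))

for sid, my in [(4002, 201908), (4002, 201909), (4711, 201907)]:
    record_write(sid, my)

# Each known pair can now drive one cheap, fully restricted query:
#   SELECT ... WHERE signalid = ? AND monthyear = ? ORDER BY fromtime DESC
queries = sorted(known_pairs)
print(len(queries), queries[0])  # 3 (4002, 201908)
```

Iterating the pairs table answers Marc's question 2 without ever scanning the 260 GB main table.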
--
With best wishes,
Alex Ott
Principal Architect, DataStax
http://datastax.com/

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
For additional commands, e-mail: user-h...@cassandra.apache.org