Re: cqlinputformat and retired cqlpaginginputformat create lots of connections to query the server

2015-01-28 Thread mck
Shenghua,
 
> The problem is the user might only want all the data via a "select *"
> like statement. It seems that 257 connections to query the rows are necessary.
> However, is there any way to prohibit 257 concurrent connections?


Your reasoning is correct.
The number of connections should be tunable via the
"cassandra.input.split.size" property. See
ConfigHelper.setInputSplitSize(..)
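A minimal sketch of how that is typically wired into the Hadoop job
configuration (the split-size value is only an example, and "job" stands for
the Hadoop Job being set up):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.cassandra.hadoop.ConfigHelper;

    Configuration conf = job.getConfiguration();
    // sets "cassandra.input.split.size": the target number of rows per input
    // split, which drives how many splits (and therefore connections) are made
    ConfigHelper.setInputSplitSize(conf, 100000);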

The problem is that vnodes completely trash this, since the splits
returned don't span vnodes.
There's an issue out for this –
https://issues.apache.org/jira/browse/CASSANDRA-6091
 but part of the problem is that the thrift stuff involved here is
 getting rewritten¹ to be pure CQL.

In the meantime you can override CqlInputFormat and manually re-merge
splits whose location sets match, so as to better honour inputSplitSize
and return to a more reasonable number of connections.
We do this, using code similar to this patch:
https://github.com/michaelsembwever/cassandra/pull/2/files

~mck

¹ https://issues.apache.org/jira/browse/CASSANDRA-8358


Re: cqlinputformat and retired cqlpaginginputformat create lots of connections to query the server

2015-01-28 Thread Shenghua(Daniel) Wan
I did another experiment to verify that indeed 3*257 mappers were created
(1 of the 257 ranges is effectively null).

Thanks mck for the information!

On Wed, Jan 28, 2015 at 12:17 AM, mck  wrote:

> Shenghua,
>
> > The problem is the user might only want all the data via a "select *"
> > like statement. It seems that 257 connections to query the rows are
> necessary.
> > However, is there any way to prohibit 257 concurrent connections?
>
>
> Your reasoning is correct.
> The number of connections should be tunable via the
> "cassandra.input.split.size" property. See
> ConfigHelper.setInputSplitSize(..)
>
> The problem is that vnodes completely trashes this, since splits
> returned don't span across vnodes.
> There's an issue out for this –
> https://issues.apache.org/jira/browse/CASSANDRA-6091
>  but part of the problem is that the thrift stuff involved here is
>  getting rewritten¹ to be pure cql.
>
> In the meantime you override the CqlInputFormat and manually re-merge
> splits together, where location sets match, so to better honour
> inputSplitSize and to return to a more reasonable number of connections.
> We do this, using code similar to this patch
> https://github.com/michaelsembwever/cassandra/pull/2/files
>
> ~mck
>
> ¹ https://issues.apache.org/jira/browse/CASSANDRA-8358
>



-- 

Regards,
Shenghua (Daniel) Wan


Change timezone

2015-01-28 Thread José Guilherme Vanz
Hi, everyone

I have a development machine running in UTC-02:00, but Cassandra on the same
machine does not seem to be running in the same timezone, because the log
timestamps are 2 hours ahead (UTC). For example, when my machine shows 08:49,
the Cassandra log entries show 10:49. I'm trying to change the timezone, but
I have not succeeded. In my last attempt I put this line in my
cassandra-env.sh, but it did not work:
JVM_OPTS="$JVM_OPTS -Duser.timezone=America/Sao_Paulo"
=/

What is the right way to change the Cassandra timezone?

All the best
Vanz


Re: full-table scan - extracting all data from C*

2015-01-28 Thread DuyHai Doan
Hint: using the Java driver, you can set the fetchSize to tell the driver
how many CQL rows to fetch for each page.

Depending on the size (in bytes) of each CQL row, it would be useful to tune
this fetchSize value to avoid loading too much data into memory for each page.
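A minimal sketch with the Java driver (2.x assumed; keyspace and table names
are placeholders):

    import com.datastax.driver.core.*;

    Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
    Session session = cluster.connect("myks");

    Statement stmt = new SimpleStatement("SELECT * FROM mytable");
    stmt.setFetchSize(1000);              // CQL rows fetched per page
    for (Row row : session.execute(stmt)) {
        // process each row; the driver transparently fetches the next page
        // as you iterate
    }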

On Wed, Jan 28, 2015 at 8:20 AM, Xu Zhongxing  wrote:

> This is hard to answer. Performance depends on the context.
> You could tune various parameters.
>
> At 2015-01-28 14:43:38, "Shenghua(Daniel) Wan" 
> wrote:
>
> Cool. What about performance? e.g. how many record for how long?
>
> On Tue, Jan 27, 2015 at 10:16 PM, Xu Zhongxing 
> wrote:
>
>> For Java driver, there is no special API actually, just
>>
>> ResultSet rs = session.execute("select * from ...");
>> for (Row r : rs) {
>>...
>> }
>>
>> For Spark, the code skeleton is:
>>
>> val rdd = sc.cassandraTable("ks", "table")
>>
>> then call the various standard Spark APIs to process the table in parallel.
>>
>> I have not used CqlInputFormat.
>>
>> At 2015-01-28 13:38:20, "Shenghua(Daniel) Wan" 
>> wrote:
>>
>> Hi, Zhongxing,
>> I am also interested in your table size. I am trying to dump 10s Million
>> record data from C* using map-reduce related API like CqlInputFormat.
>> You mentioned about Java driver. Could you suggest any API you used?
>> Thanks.
>>
>> On Tue, Jan 27, 2015 at 5:33 PM, Xu Zhongxing 
>> wrote:
>>
>>> Both Java driver "select * from table" and Spark sc.cassandraTable()
>>> work well.
>>> I use both of them frequently.
>>>
>>> At 2015-01-28 04:06:20, "Mohammed Guller" 
>>> wrote:
>>>
>>>  Hi –
>>>
>>>
>>>
>>> Over the last few weeks, I have seen several emails on this mailing list
>>> from people trying to extract all data from C*, so that they can import
>>> that data into other analytical tools that provide much richer analytics
>>> functionality than C*. Extracting all data from C* is a full-table scan,
>>> which is not the ideal use case for C*. However, people don’t have much
>>> choice if they want to do ad-hoc analytics on the data in C*.
>>> Unfortunately, I don’t think C* comes with any built-in tools that make
>>> this task easy for a large dataset. Please correct me if I am wrong. Cqlsh
>>> has a COPY TO command, but it doesn’t really work if you have a large
>>> amount of data in C*.
>>>
>>>
>>>
>>> I am aware of a couple of approaches for extracting all data from a table
>>> in C*:
>>>
>>> 1)  Iterate through all the C* partitions (physical rows) using the
>>> Java Driver and CQL.
>>>
>>> 2)  Extract the data directly from SSTables files.
>>>
>>>
>>>
>>> Either approach can be used with Hadoop or Spark to speed up the
>>> extraction process.
>>>
>>>
>>>
>>> I wanted to do a quick survey and find out how many people on this
>>> mailing list have successfully used approach #1 or #2 for extracting large
>>> datasets (terabytes) from C*. Also, if you have used some other techniques,
>>> it would be great if you could share your approach with the group.
>>>
>>>
>>>
>>> Mohammed
>>>
>>>
>>>
>>>
>>
>>
>> --
>>
>> Regards,
>> Shenghua (Daniel) Wan
>>
>>
>
>
> --
>
> Regards,
> Shenghua (Daniel) Wan
>
>


Re: Controlling the MAX SIZE of sstables after compaction

2015-01-28 Thread Daniel Chia
They have an experimental 2.0 that works (we're using it).

Thanks,
Daniel

On Tue, Jan 27, 2015 at 11:50 AM, Mikhail Strebkov 
wrote:

> It is open sourced but works only with C* 1.x as far as I know.
>
> Mikhail
>
>
> On Tuesday, January 27, 2015, Mohammed Guller 
> wrote:
>
>>  I believe Aegisthus is open sourced.
>>
>>
>>
>> Mohammed
>>
>>
>>
>> *From:* Jan [mailto:cne...@yahoo.com]
>> *Sent:* Monday, January 26, 2015 11:20 AM
>> *To:* user@cassandra.apache.org
>> *Subject:* Re: Controlling the MAX SIZE of sstables after compaction
>>
>>
>>
>> Parth  et al;
>>
>>
>>
>> the folks at Netflix seem to have built a solution for your problem.
>>
>> The Netflix Tech Blog: Aegisthus - A Bulk Data Pipeline out of Cassandra
>> (by Charles Smith and Jeff Magnusson, techblog.netflix.com)
>>
>> You may want to chase Jeff Magnusson and check whether the solution is
>> open sourced.
>>
>> Please report back to this forum if you get an answer to the problem.
>>
>>
>>
>> hope this helps.
>>
>> Jan
>>
>>
>>
>> C* Architect
>>
>>
>>
>> On Monday, January 26, 2015 11:25 AM, Robert Coli 
>> wrote:
>>
>>
>>
>> On Sun, Jan 25, 2015 at 10:40 PM, Parth Setya 
>> wrote:
>>
>> 1. Is there a way to configure the size of sstables created after
>> compaction?
>>
>>
>>
>> No, won't-fix: https://issues.apache.org/jira/browse/CASSANDRA-4897
>>
>>
>>
>> You could use the "sstablesplit" utility on your One Big SSTable to split
>> it into files of your preferred size.
>>
>>
>>
>>  2. Is there a better approach to generate the report?
>>
>>
>>
>> The major compaction isn't too bad, but something that understands
>> SSTables as an input format would be preferable to sstable2json.
>>
>>
>>
>>  3. What are the flaws with this approach?
>>
>>
>>
>> sstable2json is slow and transforms your data to JSON.
>>
>>
>>
>> =Rob
>>
>>
>>
>


Re: cqlinputformat and retired cqlpaginginputformat create lots of connections to query the server

2015-01-28 Thread Huiliang Zhang
If you are using replication factor 1 and 3 Cassandra nodes, the 256 virtual
nodes should be evenly distributed across the 3 nodes, so there are 256
virtual nodes in total. But in your experiment you saw 3*257 mappers. Is that
because of the setting cassandra.input.split.size=3? That setting has nothing
to do with the node count of 3. Otherwise, I am confused as to why there are
256 virtual nodes on every Cassandra node.

On Wed, Jan 28, 2015 at 12:29 AM, Shenghua(Daniel) Wan <
wansheng...@gmail.com> wrote:

> I did another experiment to verify indeed 3*257 (1 of 257 ranges is null
> effectively) mappers were created.
>
> Thanks mcm for the information !
>
> On Wed, Jan 28, 2015 at 12:17 AM, mck  wrote:
>
>> Shenghua,
>>
>> > The problem is the user might only want all the data via a "select *"
>> > like statement. It seems that 257 connections to query the rows are
>> necessary.
>> > However, is there any way to prohibit 257 concurrent connections?
>>
>>
>> Your reasoning is correct.
>> The number of connections should be tunable via the
>> "cassandra.input.split.size" property. See
>> ConfigHelper.setInputSplitSize(..)
>>
>> The problem is that vnodes completely trashes this, since splits
>> returned don't span across vnodes.
>> There's an issue out for this –
>> https://issues.apache.org/jira/browse/CASSANDRA-6091
>>  but part of the problem is that the thrift stuff involved here is
>>  getting rewritten¹ to be pure cql.
>>
>> In the meantime you override the CqlInputFormat and manually re-merge
>> splits together, where location sets match, so to better honour
>> inputSplitSize and to return to a more reasonable number of connections.
>> We do this, using code similar to this patch
>> https://github.com/michaelsembwever/cassandra/pull/2/files
>>
>> ~mck
>>
>> ¹ https://issues.apache.org/jira/browse/CASSANDRA-8358
>>
>
>
>
> --
>
> Regards,
> Shenghua (Daniel) Wan
>


Unable to create a keyspace

2015-01-28 Thread Saurabh Sethi
I have a 3-node Cassandra 2.1.0 cluster and I am using the DataStax 2.1.4
driver to create a keyspace, followed by a column family within that keyspace,
from my unit test.
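Roughly what the test does, as a simplified sketch (the table definition is a
placeholder, and checkSchemaAgreement() is just one way to observe whether the
schema change has propagated to all nodes):

    session.execute("CREATE KEYSPACE IF NOT EXISTS testmaxcolumnskeyspace "
        + "WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}");

    // the driver waits (up to a timeout) for schema agreement before returning;
    // this reports whether all nodes currently agree on the schema version
    boolean agreed = cluster.getMetadata().checkSchemaAgreement();

    session.execute("CREATE TABLE testmaxcolumnskeyspace.mytable "
        + "(id int PRIMARY KEY, value text)");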

But I do not see the keyspace getting created, and the code that creates the
column family fails because it cannot find the keyspace. I see the following
in the system.log file:

INFO  [SharedPool-Worker-1] 2015-01-28 17:59:08,472 MigrationManager.java:229 - 
Create new Keyspace: KSMetaData{name=testmaxcolumnskeyspace, 
strategyClass=SimpleStrategy, strategyOptions={replication_factor=1}, 
cfMetaData={}, durableWrites=true, 
userTypes=org.apache.cassandra.config.UTMetaData@370ad1d3}
INFO  [MigrationStage:1] 2015-01-28 17:59:08,476 ColumnFamilyStore.java:856 - 
Enqueuing flush of schema_keyspaces: 512 (0%) on-heap, 0 (0%) off-heap
INFO  [MemtableFlushWriter:22] 2015-01-28 17:59:08,477 Memtable.java:326 - 
Writing Memtable-schema_keyspaces@1664717092(138 serialized bytes, 3 ops, 0%/0% 
of on/off-heap limit)
INFO  [MemtableFlushWriter:22] 2015-01-28 17:59:08,486 Memtable.java:360 - 
Completed flushing 
/usr/share/apache-cassandra-2.1.0/bin/../data/data/system/schema_keyspaces-b0f2235744583cdb9631c43e59ce3676/system-schema_keyspaces-ka-118-Data.db
 (175 bytes) for commitlog position ReplayPosition(segmentId=1422485457803, 
position=10514)

This issue doesn't always happen. My test sometimes runs fine, but once it
gets into this state it remains there for a while, and I can reproduce it
consistently.

Also, when this issue happens for the first time, I see the following error
message in the system.log file:


ERROR [SharedPool-Worker-1] 2015-01-28 15:08:24,286 ErrorMessage.java:218 - 
Unexpected exception during request
java.io.IOException: Connection reset by peer
at sun.nio.ch.FileDispatcherImpl.read0(Native Method) ~[na:1.8.0_05]
at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39) 
~[na:1.8.0_05]
at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223) 
~[na:1.8.0_05]
at sun.nio.ch.IOUtil.read(IOUtil.java:192) ~[na:1.8.0_05]
at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:375) 
~[na:1.8.0_05]
at 
io.netty.buffer.PooledUnsafeDirectByteBuf.setBytes(PooledUnsafeDirectByteBuf.java:311)
 ~[netty-all-4.0.20.Final.jar:4.0.20.Final]
at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:878) 
~[netty-all-4.0.20.Final.jar:4.0.20.Final]
at 
io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:225)
 ~[netty-all-4.0.20.Final.jar:4.0.20.Final]
at 
io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:114)
 ~[netty-all-4.0.20.Final.jar:4.0.20.Final]
at 
io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:507) 
~[netty-all-4.0.20.Final.jar:4.0.20.Final]
at 
io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:464)
 ~[netty-all-4.0.20.Final.jar:4.0.20.Final]
at 
io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:378) 
~[netty-all-4.0.20.Final.jar:4.0.20.Final]
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:350) 
~[netty-all-4.0.20.Final.jar:4.0.20.Final]
at 
io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:116)
 ~[netty-all-4.0.20.Final.jar:4.0.20.Final]
at java.lang.Thread.run(Thread.java:745) [na:1.8.0_05]

Does anyone have any idea what might be going on here?

Thanks,
Saurabh


Re: cqlinputformat and retired cqlpaginginputformat create lots of connections to query the server

2015-01-28 Thread Shenghua(Daniel) Wan
That's the C* default setting. My version is 2.0.11. Check your cassandra.yaml.
On Jan 28, 2015 4:53 PM, "Huiliang Zhang"  wrote:

> If you are using replication factor 1 and 3 cassandra nodes, 256 virtual
> nodes should be evenly distributed on 3 nodes. So there are totally 256
> virtual nodes. But in your experiment, you saw 3*257 mapper. Is that
> because of the setting cassandra.input.split.size=3? It is nothing with
> node number=3. Otherwise, I am confused why there are 256 virtual nodes on
> every cassandra node.
>
> On Wed, Jan 28, 2015 at 12:29 AM, Shenghua(Daniel) Wan <
> wansheng...@gmail.com> wrote:
>
>> I did another experiment to verify indeed 3*257 (1 of 257 ranges is null
>> effectively) mappers were created.
>>
>> Thanks mcm for the information !
>>
>> On Wed, Jan 28, 2015 at 12:17 AM, mck  wrote:
>>
>>> Shenghua,
>>>
>>> > The problem is the user might only want all the data via a "select *"
>>> > like statement. It seems that 257 connections to query the rows are
>>> necessary.
>>> > However, is there any way to prohibit 257 concurrent connections?
>>>
>>>
>>> Your reasoning is correct.
>>> The number of connections should be tunable via the
>>> "cassandra.input.split.size" property. See
>>> ConfigHelper.setInputSplitSize(..)
>>>
>>> The problem is that vnodes completely trashes this, since splits
>>> returned don't span across vnodes.
>>> There's an issue out for this –
>>> https://issues.apache.org/jira/browse/CASSANDRA-6091
>>>  but part of the problem is that the thrift stuff involved here is
>>>  getting rewritten¹ to be pure cql.
>>>
>>> In the meantime you override the CqlInputFormat and manually re-merge
>>> splits together, where location sets match, so to better honour
>>> inputSplitSize and to return to a more reasonable number of connections.
>>> We do this, using code similar to this patch
>>> https://github.com/michaelsembwever/cassandra/pull/2/files
>>>
>>> ~mck
>>>
>>> ¹ https://issues.apache.org/jira/browse/CASSANDRA-8358
>>>
>>
>>
>>
>> --
>>
>> Regards,
>> Shenghua (Daniel) Wan
>>
>
>


error while bulk loading using copy command

2015-01-28 Thread Rahul Bhardwaj
Hi All,

We need to upload 18 lakh (1.8 million) rows into a table that contains
columns with data type "counter".

On uploading using the COPY command, we get the error below:

*Bad Request: INSERT statement are not allowed on counter tables, use
UPDATE instead*
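
(For context, the UPDATE form that the error message points at looks roughly
like this with the Java driver; keyspace, table and column names here are
placeholders:)

    // counters can only be incremented/decremented via UPDATE, never INSERTed
    PreparedStatement ps = session.prepare(
        "UPDATE myks.mytable SET mycounter = mycounter + ? WHERE id = ?");
    session.execute(ps.bind(42L, "some-key"));   // placeholder values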

We need the counter data type because, after loading this data, we want to
use the counter functionality.

Kindly help: is there any way to do this?


Regards:
Rahul Bhardwaj



incremental repairs - again

2015-01-28 Thread Roland Etzenhammer

Hi,

a short question about the new incremental repairs again. I am running
2.1.2 (for testing). Marcus pointed out to me that 2.1.2 should do
incremental repairs automatically, so I rolled back all the steps I had
taken. I expected that routine repair times would decrease when I do not
put much new data on the cluster.


But they don't - they are constant at about 1000 minutes per node, so I
extracted all "Repaired at" values with sstablemetadata and I can't see
any recent date. I put several GB of data into the cluster in 2015 and I
run "nodetool repair -pr" on every node regularly.


Am I still missing something? Or is this one of the issues with 2.1.2 
(CASSANDRA-8316)?


Thanks for hints,
Jan




Performance difference between Regular Statement Vs PreparedStatement

2015-01-28 Thread Ajay
Hi All,

I tried both insert and select queries (built with QueryBuilder) as a regular
Statement and as a PreparedStatement, in multithreaded code that runs the
query, say, 10k to 50k times. But I don't see any visible improvement using
the PreparedStatement. What could be the reason?

Note : I am using the same Session object in multiple threads.
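
A simplified sketch of the prepared path being measured (real table and column
names omitted): the statement is prepared once up front, then only bound and
executed per call from the worker threads.

    // prepared once, up front (this is the extra round trip to the cluster)
    PreparedStatement ps = session.prepare(
        "INSERT INTO myks.mytable (id, value) VALUES (?, ?)");

    // run from each worker thread; only binding and execution happen per call
    for (int i = 0; i < 50000; i++) {
        session.execute(ps.bind(i, "value-" + i));
    }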

Cassandra version : 2.0.11
Driver version : 2.1.4

Thanks
Ajay


Re: incremental repairs - again

2015-01-28 Thread Marcus Eriksson
Hi

Unsure what you mean by automatically, but you should use "-par -inc" when
you repair.

And you should wait until 2.1.3 (which will be out very soon) before doing
this; we have fixed many issues with incremental repairs.

/Marcus

On Thu, Jan 29, 2015 at 7:44 AM, Roland Etzenhammer <
r.etzenham...@t-online.de> wrote:

> Hi,
>
> a short question about the new incremental repairs again. I am running
> 2.1.2 (for testing). Marcus pointed me that 2.1.2 should do incremental
> repairs automatically, so I rolled back all steps taken. I expect that
> routine repair times will decrease when I do not put many new data on the
> cluster.
>
> But they dont - they are constant at about 1000 minutes  per node, so I
> extracted all "Repaired at" with sstablemetadata and I cant see any recent
> date. I put several GB of data into the cluster in 2015 and I run "nodetool
> repair -pr" on every node regularly.
>
> Am I still missing something? Or is this one of the issues with 2.1.2
> (CASSANDRA-8316)?
>
> Thanks for hints,
> Jan
>
>
>


Re: incremental repairs - again

2015-01-28 Thread Roland Etzenhammer

Hi,

the "automatically" meant this reply earlier:

If you are on 2.1.2+ (or using STCS) you don't those steps (should 
probably update the blog post).


Now we keep separate levelings for the repaired/unrepaired data and 
move the sstables over after the first incremental repair


My understanding was, that there are no steps necessary to migrate to 
incremental repairs (disableautocompaction and so on) for 2.1.2+.


I'm looking forward to the next release - and I'll give a test run.

Thanks so far :-)