> 2013-05-10 16:09:13,250 10441:69: adding UUID('d8a31630-3b90-46d0-9e34-1b7638044c62') with 1.500000 MB
> 2013-05-10 16:09:13,250 10441:69: adding UUID('ba14b6fc-f9ed-4d33-b64d-d705f405ad22') with 1.500000 MB
> 2013-05-10 16:09:13,250 10441:69: adding UUID('0b08d080-ec72-4f6e-b4a1-1227b8e28dac') with 1.500000 MB
> 2013-05-10 16:09:13,250 10441:69: adding UUID('d6a9c9fc-5775-4da7-92db-2f7cc89e6761') with 1.500000 MB
> 2013-05-10 16:09:13,250 10441:69: adding UUID('970b13be-b61d-4ff5-b13c-dd329f307337') with 1.500000 MB
> 2013-05-10 16:09:13,251 10441:69: adding UUID('ba608688-adc1-4b3c-b97d-64bc7c26b997') with 1.500000 MB
> 2013-05-10 16:09:13,251 10441:69: adding UUID('c72574d5-f562-4780-8a86-b509138bf3d0') with 1.500000 MB
> 2013-05-10 16:09:13,251 10441:69: adding UUID('6e795783-2eb3-4a1f-84a5-0554d568c461') with 1.500000 MB
> 2013-05-10 16:09:13,251 10441:69: adding UUID('8f79775e-ef9d-4879-8b0f-4b2dd42c3f2b') with 1.500000 MB
> 2013-05-10 16:09:13,251 10441:69: adding UUID('59f14d33-c4a6-49ce-80ce-48a3ee433cbf') with 1.500000 MB
You are sending 10 rows of 1.5MB of column data each: 10 x 1.5MB is already 
15MB of column data, and with the extra data for the row keys the request is 
over 15MB.
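
Thrift's framed transport length-prefixes each complete message, and the 
server reads one frame per request, so the whole batch mutation has to fit 
inside a single frame. A toy sketch of the wire format (illustrative only, 
not pycassa internals):

    import struct

    def write_frame(sock, payload):
        # TFramedTransport: a 4-byte big-endian length, then the entire
        # serialized message. The server checks the length against its
        # configured frame size and rejects oversized requests whole.
        sock.sendall(struct.pack('!i', len(payload)) + payload)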

It's the size of the request, not the size of individual rows. 
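
So the fix is to cap the size of each request, not each row. Here is a 
minimal sketch, reusing the rows, pool and family from your script, that 
flushes the mutator before the queued columns approach the frame limit (the 
frame size and headroom factor below are assumptions, adjust them to your 
config):

    FRAME_BYTES = 15 * 2**20   # thrift_framed_transport_size_in_mb
    HEADROOM = 0.5             # rough allowance for keys and Thrift overhead

    testcf = pycassa.ColumnFamily(pool, family)
    batch = testcf.batch()
    queued = 0
    for key, data_dict in rows:
        batch.insert(key, data_dict)
        queued += sum(len(v) for v in data_dict.values())
        if queued > FRAME_BYTES * HEADROOM:
            batch.send()       # one Thrift request per flush
            queued = 0
    batch.send()               # flush whatever is left

pycassa's batch(queue_size=N) flushes automatically every N mutations, but 
that caps the row count rather than the byte size, so with large columns 
counting bytes yourself is safer.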

Cheers

-----------------
Aaron Morton
Freelance Cassandra Consultant
New Zealand

@aaronmorton
http://www.thelastpickle.com

On 11/05/2013, at 4:39 AM, John R. Frank <j...@mit.edu> wrote:

> C* users,
> 
> The simple code below demonstrates pycassa failing to write values larger than 
> one-tenth of thrift_framed_transport_size_in_mb. It writes single-column rows 
> keyed by UUIDs.
> 
> For example, with the default of
> 
>       thrift_framed_transport_size_in_mb: 15
> 
> the code below shows pycassa failing on rows of 1.5MB of data:
> 
> 2013-05-10 16:09:13,251 10441:69: adding UUID('59f14d33-c4a6-49ce-80ce-48a3ee433cbf') with 1.500000 MB
> Connection 140020167928080 (ec2-23-22-224-180.compute-1.amazonaws.com:9160) was checked out from pool 140020167927056
> Connection 140020167928080 (ec2-23-22-224-180.compute-1.amazonaws.com:9160) in pool 140020167927056 failed: [Errno 104] Connection reset by peer
> 2013-05-10 16:09:13,370 10441:25: connection_failed: reset pool??
> 
> Sometimes it says "TSocket read 0 bytes" and other times "Connection reset by 
> peer".  I have tried changing the thrift_framed_transport_size_in_mb and the 
> 10% pattern holds.  Possibly related to:
> 
>       https://github.com/pycassa/pycassa/issues/168
> 
> I thought the whole point of framed transport was to split things up so they 
> can be larger than a single frame.  Is that wrong?  This is breaking at just 
> 10% of the frame size.
> 
> Maybe we broke something else?
> 
> Pycassa is using framed transport -- see assert in code below.
> 
> This AWS m1.xlarge was constructed just for this test using DataStax AMI:
> 
>  http://www.datastax.com/docs/1.2/install/install_ami
> 
> http://ec2-23-22-224-180.compute-1.amazonaws.com:8888/opscenter/index.html
> 
> 
> The code and log output are attached.  The log was generated by a run on 
> another m1.xlarge in the same Amazon data center.
> 
> Thanks for any ideas.
> -John
> 
> import uuid
> import getpass
> import logging
> logger = logging.getLogger('test')
> logger.setLevel(logging.INFO)
> 
> ch = logging.StreamHandler()
> ch.setLevel(logging.DEBUG)
> formatter = logging.Formatter('%(asctime)s %(process)d:%(lineno)d: %(message)s')
> ch.setFormatter(formatter)
> logger.addHandler(ch)
> 
> ## get the Cassandra client library
> import pycassa
> from pycassa.pool import ConnectionPool
> from pycassa.system_manager import SystemManager, SIMPLE_STRATEGY, \
>    LEXICAL_UUID_TYPE, ASCII_TYPE, BYTES_TYPE
> 
> log = pycassa.PycassaLogger()
> log.set_logger_name('pycassa_library')
> log.set_logger_level('debug')
> log.get_logger().addHandler(logging.StreamHandler())
> 
> class Listener(object):
>    def connection_failed(self, dic):
>        logger.critical('connection_failed: reset pool??')
> 
> ## this is an m1.xlarge doing nothing but supporting this test
> server = 'ec2-23-22-224-180.compute-1.amazonaws.com:9160'
> keyspace = 'testkeyspace_' + getpass.getuser().replace('-', '_')
> family = 'testcf'
> sm = SystemManager(server)
> try:
>    sm.drop_keyspace(keyspace)
> except pycassa.InvalidRequestException:
>    pass
> sm.create_keyspace(keyspace, SIMPLE_STRATEGY, {'replication_factor': '1'})
> sm.create_column_family(keyspace, family, super=False,
>                         key_validation_class=LEXICAL_UUID_TYPE,
>                         default_validation_class=LEXICAL_UUID_TYPE,
>                         column_name_class=ASCII_TYPE)
> sm.alter_column(keyspace, family, 'test', ASCII_TYPE)
> sm.close()
> 
> pool = ConnectionPool(keyspace, [server], max_retries=10, pool_timeout=0,
>                       pool_size=10, timeout=120)
> pool.fill()
> pool.add_listener(Listener())
> 
> ## assert that we are using framed transport
> from thrift.transport import TTransport
> conn = pool._q.get()
> assert isinstance(conn.transport, TTransport.TFramedTransport)
> pool._q.put(conn)
> 
> try:
>    for k in range(20):
>        ## write some data to cassandra using increasing data sizes
>        big_data = ' ' * 2**18 * k
>        num_rows = 10
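>        ## note: all num_rows inserts below go into one batch mutation,
>        ## so a single Thrift request carries num_rows * len(big_data) bytes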
>        keys = []
>        rows = []
>        for i in xrange(num_rows):
>            key = uuid.uuid4()
>            rows.append((key, dict(test=big_data)))
>            keys.append(key)
> 
>        testcf = pycassa.ColumnFamily(pool, family)
>        with testcf.batch() as batch:
>            for (key, data_dict) in rows:
>                data_size = len(data_dict.values()[0])
>                logger.critical('adding %r with %.6f MB' % (key, float(data_size) / 2**20))
>                batch.insert(key, data_dict)
> 
>        logger.critical('%d rows written' % num_rows)
> 
> finally:
>    sm = SystemManager(server)
>    try:
>        sm.drop_keyspace(keyspace)
>    except pycassa.InvalidRequestException:
>        pass
>    sm.close()
>    logger.critical('clearing test keyspace: %r' % keyspace)
