Hi, I'm working through some use cases to understand how the Cassandra/Hadoop integration works.
I have a very basic scenario: a column family that keeps a session id and some BSON data (which contains the username) in two separate columns. I want to go through all rows and dump a row to a file whenever its username matches a certain criterion. I don't need any Reducer or Combiner for now.

After writing the very simple Hadoop job below, I see from the logs that my map function is called once per row. Is that normal? If so, such a search over a big dataset would take hours, if not days. Besides that, I see many small output files being created on HDFS. I guess I need a better understanding of how the job is split into tasks.

@Override
public void map(ByteBuffer key, SortedMap<ByteBuffer, IColumn> columns, Context context)
        throws IOException, InterruptedException
{
    String rowkey = ByteBufferUtil.string(key);
    // the username to match against, passed in via the job configuration
    String ip = context.getConfiguration().get(IP);

    IColumn column = columns.get(sourceColumn);
    if (column == null)
        return;

    ByteBuffer byteBuffer = column.value();
    // duplicate the buffer before deserializing, since fromBson() consumes it
    ByteBuffer bb2 = byteBuffer.duplicate();
    DataConvertor convertor = fromBson(byteBuffer, DataConvertor.class);
    String username = convertor.getUsername();

    if (username != null && username.equals(ip))
    {
        byte[] arr = convertToByteArray(bb2);
        BytesWritable value = new BytesWritable(arr);
        context.write(new Text(rowkey), value);
    }
    else
    {
        log.info("username does not match [" + ip + "]");
    }
}

Thanks in advance.
Kind Regards

--
"Find a job you enjoy, and you'll never work a day in your life." Confucius
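P.S. In case the job setup matters for the splitting question, the driver looks roughly like the sketch below. This is a simplified version, assuming ColumnFamilyInputFormat with the standard ConfigHelper calls; SessionDumpJob, SessionMapper, the keyspace/column family names, the address, the output path, and the split size are placeholders rather than my exact values.

import java.util.Arrays;

import org.apache.cassandra.hadoop.ColumnFamilyInputFormat;
import org.apache.cassandra.hadoop.ConfigHelper;
import org.apache.cassandra.thrift.SlicePredicate;
import org.apache.cassandra.utils.ByteBufferUtil;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

public class SessionDumpJob // placeholder driver class
{
    public static void main(String[] args) throws Exception
    {
        Job job = new Job(new Configuration(), "session-dump");
        job.setJarByClass(SessionDumpJob.class);
        job.setMapperClass(SessionMapper.class); // the mapper shown above
        job.setNumReduceTasks(0);                // map-only job: each map task writes its own output file
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(BytesWritable.class);
        job.setInputFormatClass(ColumnFamilyInputFormat.class);
        job.setOutputFormatClass(SequenceFileOutputFormat.class);
        FileOutputFormat.setOutputPath(job, new Path("/tmp/session-dump"));

        Configuration conf = job.getConfiguration();
        ConfigHelper.setInputInitialAddress(conf, "127.0.0.1");
        ConfigHelper.setInputRpcPort(conf, "9160");
        ConfigHelper.setInputPartitioner(conf, "org.apache.cassandra.dht.RandomPartitioner");
        ConfigHelper.setInputColumnFamily(conf, "MyKeyspace", "Sessions");
        // one map task is created per input split; the split size is the
        // (approximate) number of rows handed to each task
        ConfigHelper.setInputSplitSize(conf, 65536);

        // only fetch the BSON column rather than every column in the row
        SlicePredicate predicate = new SlicePredicate()
                .setColumn_names(Arrays.asList(ByteBufferUtil.bytes("data")));
        ConfigHelper.setInputSlicePredicate(conf, predicate);

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}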