Friday, September 26, 2014

Collections in CQL3 - How they are stored

If you don’t already know about collections in Cassandra CQL, the following page provides excellent details –

However, if you have been using Cassandra since pre-CQL days, you would have worked with the low-level Thrift APIs, and you may be tempted to wonder what the data looks like in Cassandra’s internal storage structure [which is very well articulated (exposed?) by the Thrift APIs]!
I have a big hangover from my extensive work with the Thrift API, so I am always tempted to look at how my CQL data appears in the internal storage structure.

Following is a CQL column family containing the different types of collections – set/list/map – followed by the corresponding look of the data in the internal storage structure:


cqlsh:dummy> describe table users;

CREATE TABLE users (
  user_id text,
  emails set<text>,
  first_name text,
  last_name text,
  numbermap map<int, int>,
  numbers list<int>,
  todo map<timestamp, text>,
  top_places list<text>,
  PRIMARY KEY (user_id)
)

cqlsh:dummy> select * from users;

 user_id | emails                                 | first_name | last_name | numbers   | todo                                   | top_places             | numbermap
---------+----------------------------------------+------------+-----------+-----------+----------------------------------------+------------------------+----------------
   frodo | {'baggins@gmail.com', 'f@baggins.com'} |      Frodo |   Baggins | [1, 2, 3] | {'2012-09-24 00:00:00-0700': 'value1'} | ['rivendell', 'rohan'] | {1: 11, 2: 12}
Internal Storage Structure: (CLI result)
RowKey: frodo
=> (name=, value=, timestamp=1411701396643000)
=> (name=emails:62616767696e7340676d61696c2e636f6d, value=, timestamp=1411701396643000)
=> (name=emails:664062616767696e732e636f6d, value=, timestamp=1411701396643000)
=> (name=first_name, value=46726f646f, timestamp=1411701396643000)
=> (name=last_name, value=42616767696e73, timestamp=1411701396643000)
=> (name=numbermap:00000001, value=0000000b, timestamp=1411703133238000)
=> (name=numbermap:00000002, value=0000000c, timestamp=1411703133238000)
=> (name=numbers:534eaca0452c11e4932561c636c97db3, value=00000001, timestamp=1411701740650000)
=> (name=numbers:534eaca1452c11e4932561c636c97db3, value=00000002, timestamp=1411701740650000)
=> (name=numbers:534eaca2452c11e4932561c636c97db3, value=00000003, timestamp=1411701740650000)
=> (name=todo:00000139f7134980, value=76616c756531, timestamp=1411702558812000)
=> (name=top_places:a3300bb0452c11e4932561c636c97db3, value=726976656e64656c6c, timestamp=1411701874667000)
=> (name=top_places:a3300bb1452c11e4932561c636c97db3, value=726f68616e, timestamp=1411701874667000)

Some important points to note:
- The first cell (name=, value=) is the CQL3 row marker; it keeps the row alive even when all other columns are deleted.
- The ‘set’ field (emails) does not have any column values in the CLI result. Since a set is expected to store unique items, the serialized values themselves (hex-encoded above, not hashed) are stored as part of the column names!
- On the contrary, since a ‘list’ field (numbers/top_places) is allowed to contain duplicate values, the actual value of each list element is stored in the column value, and the column name carries a timeuuid – so duplicate elements do not overwrite each other and insertion order is preserved!
- For a ‘map’ field (numbermap/todo), the serialized key goes into the column name and the serialized value into the column value.
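The hex strings in the CLI output are just serialized bytes, so they can be decoded by hand. A quick sketch in plain Python (helper name is mine):

```python
import struct

def hex_to_text(h):
    """Decode a hex-encoded UTF-8 string from the CLI output."""
    return bytes.fromhex(h).decode("utf-8")

# 'set' elements live in the column NAME; the column value is empty.
print(hex_to_text("62616767696e7340676d61696c2e636f6d"))  # baggins@gmail.com

# 'list' values live in the column VALUE (a big-endian int32 here);
# the column name carries a timeuuid that preserves insertion order.
(first_number,) = struct.unpack(">i", bytes.fromhex("00000001"))
print(first_number)  # 1

# Plain text columns such as first_name are UTF-8 bytes.
print(hex_to_text("46726f646f"))  # Frodo
```

Decoding a few of the cells this way makes it easy to verify which side of the cell (name or value) each collection type uses.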

Saturday, September 13, 2014

How to Stop compaction for specific keyspace in cassandra

Problem statement
How to stop compaction for a specific keyspace in cassandra?

Reason behind the problem statement
We had 17TB of data in a keyspace, and a lot of heavy compaction was running or pending on it. UAT was running on another keyspace, and during the UAT we wanted compaction on the UAT keyspace to run smoothly. (The 17TB keyspace was not very important during UAT.) So we temporarily wanted to stop compaction on the 17TB keyspace to ensure it did not eat up cluster resources.

Set the compaction threshold’s min value high enough to ensure compaction does not get triggered. The min threshold means that compaction will not be triggered until the SSTable count reaches this minThreshold value. So if you set the value to, let’s say, 100,000, compaction will not be triggered until the SSTable count reaches 100,000 – which practically may mean never!
>> nodetool setcompactionthreshold <keyspace> <cfname> <minthreshold> <maxthreshold>
(I am yet to try this)

Apparently, setting the min/max threshold values to 0 stops compaction altogether. However, this is only possible from JMX, not from the CLI!
Another option is to disable automatic compaction with "nodetool disableautocompaction"

Yet to verify this.

Friday, September 12, 2014

Compaction Strategy in cassandra

Cassandra supports two basic compaction strategies:
  • Size-tiered compaction
  • Leveled compaction

Before talking about these two compaction strategies, let’s take a look at what compaction is.

In Cassandra, each write goes to a memtable and the commit log in real time, and when the memtable fills up (or when memtables are flushed manually) the data is written to persistent files called SSTables.
A memtable, as the name suggests, is an in-memory table, and commit logs are files that ensure data can be recovered if a node crashes before the data is written from the memtable to SSTables.

Even for update operations, the data is written to SSTables sequentially, unlike an RDBMS, where the existing rows are searched for in the data files and then updated in place with the new data. This ensures good write performance, because the heavy operation of seeking an existing row in the data files (here, SSTables) is eliminated.

However, the flip side of this approach is that the data for a single row may span multiple SSTables. This hurts read operations, because multiple SSTables may have to be accessed in order to read a single row.
To address this, Cassandra performs ‘compaction’ on each column family. During compaction, the data of each row is merged into a single SSTable. This ensures good read performance by placing the data for each row in a single file.
However, we must recognize at this point that compaction is an asynchronous process; in the worst case, a read will still be served from multiple SSTables until compaction has merged the SSTables containing the row in question.
This becomes critical factor for selection of compaction strategy for your column family.
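The write/read path described above can be sketched as a toy model (all names and structures here are mine, purely illustrative; real SSTables are on-disk, sorted, and immutable):

```python
# Toy model: writes go to an in-memory table, flushes produce
# immutable SSTables, and a read may have to merge fragments of a
# row from several SSTables until compaction runs.
memtable = {}
sstables = []  # list of immutable row-fragment dicts, oldest first

def write(row_key, column, value):
    memtable.setdefault(row_key, {})[column] = value

def flush():
    """Freeze the current memtable into a new SSTable."""
    global memtable
    sstables.append(memtable)
    memtable = {}

def read(row_key):
    # Merge fragments oldest-to-newest so newer cells win.
    row = {}
    for sstable in sstables + [memtable]:
        row.update(sstable.get(row_key, {}))
    return row

def compact():
    """Merge all SSTables so each row lives in a single file."""
    global sstables
    merged = {}
    for sstable in sstables:
        for key, cols in sstable.items():
            merged.setdefault(key, {}).update(cols)
    sstables = [merged]

write("frodo", "first_name", "Frodo"); flush()
write("frodo", "last_name", "Baggins"); flush()
print(len(sstables))  # 2 -> a read must touch two SSTables
compact()
print(len(sstables))  # 1 -> after compaction, one SSTable suffices
print(read("frodo"))  # {'first_name': 'Frodo', 'last_name': 'Baggins'}
```

The two `len(sstables)` prints show exactly the window the text describes: between the second flush and the compaction, reading the row costs two SSTable lookups.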

As mentioned above, there are two strategies for the compaction process.

Size Tiered Compaction

In this compaction process, SSTables of a fixed size are created initially, and once the number of SSTables reaches a threshold count, they are compacted into a bigger SSTable.
For example, let’s consider an initial SSTable size of 5MB and a threshold count of 5: compaction is triggered when 5 SSTables of 5MB each have been filled, and the compaction process creates a single bigger SSTable by compacting all 5 of them.
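Under the example’s assumptions (5MB SSTables, threshold of 5), the trigger can be sketched like this (a simplification of the real strategy, which buckets SSTables of *similar* size; sizes in MB):

```python
def size_tiered_compact(sstable_sizes, threshold=5):
    """Merge SSTables into one whenever `threshold` of them exist."""
    sizes = list(sstable_sizes)
    while len(sizes) >= threshold:
        bucket, sizes = sizes[:threshold], sizes[threshold:]
        sizes.append(sum(bucket))  # the merged, bigger SSTable
    return sizes

print(size_tiered_compact([5, 5, 5, 5, 5]))  # [25] -> one 25MB SSTable
print(size_tiered_compact([5, 5, 5, 5]))     # no trigger: [5, 5, 5, 5]
```

Four 5MB SSTables sit untouched; the fifth one tips the bucket over the threshold and a single 25MB SSTable comes out.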

Leveled Compaction

In this compaction process, sstables of a fixed, relatively small size (5MB by default in Cassandra’s implementation) are created and grouped into “levels”. Within each level, sstables are guaranteed to be non-overlapping. Each level is ten times as large as the previous.
For example,
L0: SSTables with size 5MB
L1: SSTables with size 50MB
L2: SSTables with size 500MB

New SSTables added at level L(n) are immediately compacted with the SSTables at level L(n+1). Once level L(n+1) fills up, extra SSTables are promoted to level L(n+2) and subsequent SSTables are compacted with the SSTables in level L(n+2).
This strategy ensures that 90% of all reads will be satisfied from a single SSTable.
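The reason reads stay cheap is that within a level the SSTables are non-overlapping, so at most one SSTable per level can contain a given key. A sketch with toy key ranges (my own illustration, not Cassandra code):

```python
import bisect

# One level of non-overlapping SSTables, each described by its
# (min_key, max_key) range. Toy integer keys, purely illustrative.
level1 = [(0, 9), (10, 19), (20, 29)]

def find_sstable(level, key):
    """Binary-search a level for the single SSTable that may hold key."""
    idx = bisect.bisect_right([lo for lo, _ in level], key) - 1
    if idx >= 0 and level[idx][0] <= key <= level[idx][1]:
        return level[idx]
    return None  # no SSTable in this level covers the key

print(find_sstable(level1, 15))  # (10, 19) -> exactly one candidate
print(find_sstable(level1, 35))  # None -> level skipped entirely
```

Because a read checks at most one SSTable per level, and the bottom level holds roughly 90% of the data, most reads end up touching just that one SSTable.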

Comparison at high level

  • Size-tiered compaction is better suited for write-heavy applications. As the SSTables are not immediately compacted into the next level, write performance is better.
  • However, for column families with very wide rows and frequent updates (for example, manual index rows), read performance is severely hit. I have observed a case where a single read accessed almost 800 SSTables for a column family whose wide row contained more than a million columns – a case of frequent readTimeouts! (This is also a lesson for data model design and explains why wide rows with millions of columns should be avoided!)
  • Leveled compaction is better suited for column families with heavy read operations and/or heavy update operations! (Important: note the difference between a write and an update!)

Thursday, September 4, 2014

JVM Command: JInfo

In Linux, ‘top’ is typically used to view the details of running processes. Often you may want to see the full process command instead of just the process name; using the option ‘c’ in top, you can see the full command almost always.
However, the OS typically limits the length of the full command to 4096 characters! So if your process has a longer command line – especially common for Java processes – you won’t see all of it, neither in ‘top’ nor in ‘ps –ef | grep <pid>’.
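(As an aside: top and ps both read /proc/&lt;pid&gt;/cmdline under the hood, and on older kernels that file itself is capped at one page, typically 4096 bytes, which is where the truncation comes from. A quick sketch of reading it directly, here for the current process:)

```python
# Read a process's full command line the same way top/ps do.
# /proc/<pid>/cmdline holds the arguments separated by NUL bytes;
# on kernels before 4.2 it is capped at one page (~4096 bytes),
# which is why very long Java command lines get cut off.
with open("/proc/self/cmdline", "rb") as f:
    args = f.read().rstrip(b"\0").split(b"\0")
print([a.decode() for a in args])
```

Substitute the target PID for `self` to inspect another process (requires the same permissions as ps).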

A good way to see the exact details of a Java process is to use ‘jinfo’. Following is an example for a process with such a long command line.

<JAVA_HOME>/bin/jinfo <process-id>

[root@home ~]$ /usr/java/jdk1.6.0_45/bin/jinfo 55761
Attaching to process ID 55761, please wait...
Debugger attached successfully.
Server compiler detected.
JVM version is 20.45-b01
Java System Properties:

java.runtime.name = Java(TM) SE Runtime Environment
sun.boot.library.path = /usr/java/jdk1.6.0_45/jre/lib/amd64
java.vm.version = 20.45-b01
java.vm.vendor = Sun Microsystems Inc.
java.vendor.url =
storm.home = /opt/storm-0.8.2
path.separator = :
java.vm.name = Java HotSpot(TM) 64-Bit Server VM
file.encoding.pkg =
sun.java.launcher = SUN_STANDARD
user.country = US
sun.os.patch.level = unknown
java.vm.specification.name = Java Virtual Machine Specification
user.dir = /home/gse
java.runtime.version = 1.6.0_45-b06
java.awt.graphicsenv = sun.awt.X11GraphicsEnvironment
java.endorsed.dirs = /usr/java/jdk1.6.0_45/jre/lib/endorsed
os.arch = amd64
java.io.tmpdir = /tmp
line.separator =

java.vm.specification.vendor = Sun Microsystems Inc.
os.name = Linux
log4j.configuration =
sun.jnu.encoding = UTF-8
java.library.path = /usr/local/lib:/opt/local/lib:/usr/lib
java.specification.name = Java Platform API Specification
java.class.version = 50.0
sun.management.compiler = HotSpot 64-Bit Tiered Compilers
os.version = 2.6.32-358.11.1.el6.x86_64
user.home = /root
user.timezone = GMT
java.awt.printerjob = sun.print.PSPrinterJob
file.encoding = UTF-8
java.specification.version = 1.6
java.class.path = /opt/storm-0.8.2/storm-0.8.2.jar:/opt/storm-0.8.2/lib/jgrapht-0.8.3.jar:/opt/storm-0.8.2/lib/log4j-1.2.16.jar:/opt/storm-0.8.2/lib/kryo-2.17.jar:/opt/storm-0.8.2/lib/guava-13.0.jar:/opt/storm-0.8.2/lib/disruptor-2.10.1.jar:/opt/storm-0.8.2/lib/objenesis-1.2.jar:/opt/storm-0.8.2/lib/junit-3.8.1.jar:/opt/storm-0.8.2/lib/joda-time-2.0.jar:/opt/storm-0.8.2/lib/commons-exec-1.1.jar:/opt/storm-0.8.2/lib/tools.logging-0.2.3.jar:/opt/storm-0.8.2/lib/curator-client-1.0.1.jar:/opt/storm-0.8.2/lib/zookeeper-3.3.3.jar:/opt/storm-0.8.2/lib/ring-core-1.1.5.jar:/opt/storm-0.8.2/lib/commons-fileupload-1.2.1.jar:/opt/storm-0.8.2/lib/tools.macro-0.1.0.jar:/opt/storm-0.8.2/lib/carbonite-1.5.0.jar:/opt/storm-0.8.2/lib/core.incubator-0.1.0.jar:/opt/storm-0.8.2/lib/commons-codec-1.4.jar:/opt/storm-0.8.2/lib/ring-jetty-adapter-0.3.11.jar:/opt/storm-0.8.2/lib/minlog-1.2.jar:/opt/storm-0.8.2/lib/clojure-1.4.0.jar:/opt/storm-0.8.2/lib/clj-time-0.4.1.jar:/opt/storm-0.8.2/lib/commons-logging-1.1.1.jar:/opt/storm-0.8.2/lib/libthrift7-0.7.0.jar:/opt/storm-0.8.2/lib/clout-1.0.1.jar:/opt/storm-0.8.2/lib/jline-0.9.94.jar:/opt/storm-0.8.2/lib/slf4j-log4j12-1.5.8.jar:/opt/storm-0.8.2/lib/jzmq-2.1.0.jar:/opt/storm-0.8.2/lib/servlet-api-2.5.jar:/opt/storm-0.8.2/lib/commons-io-1.4.jar:/opt/storm-0.8.2/lib/httpcore-4.1.jar:/opt/storm-0.8.2/lib/jetty-util-6.1.26.jar:/opt/storm-0.8.2/lib/httpclient-4.1.1.jar:/opt/storm-0.8.2/lib/commons-lang-2.5.jar:/opt/storm-0.8.2/lib/math.numeric-tower-0.0.1.jar:/opt/storm-0.8.2/lib/snakeyaml-1.9.jar:/opt/storm-0.8.2/lib/json-simple-1.1.jar:/opt/storm-0.8.2/lib/compojure-1.1.3.jar:/opt/storm-0.8.2/lib/servlet-api-2.5-20081211.jar:/opt/storm-0.8.2/lib/hiccup-0.3.6.jar:/opt/storm-0.8.2/lib/jetty-6.1.26.jar:/opt/storm-0.8.2/lib/curator-framework-1.0.1.jar:/opt/storm-0.8.2/lib/ring-servlet-0.3.11.jar:/opt/storm-0.8.2/lib/asm-4.0.jar:/opt/storm-0.8.2/lib/reflectasm-1.07-shaded.jar:/opt/storm-0.8.2/lib/tools.cli-0.2.2.jar:/opt/storm-0.8.2/lib/slf4j-api-1.5.8.jar:/opt/storm-0.8.2/log4j
user.name = root
java.vm.specification.version = 1.0
sun.java.command = backtype.storm.daemon.worker xyz56-36-1406612749 d39a4666-52a2-4099-8ca5-ee9aac670783 6702 d6db40f4-032b-443b-9c7e-b0ddc6b80b31
java.home = /usr/java/jdk1.6.0_45/jre
sun.arch.data.model = 64
user.language = en
java.specification.vendor = Sun Microsystems Inc.
java.vm.info = mixed mode
java.version = 1.6.0_45
java.ext.dirs = /usr/java/jdk1.6.0_45/jre/lib/ext:/usr/java/packages/lib/ext
logfile.name = worker-6702.log
sun.boot.class.path = /usr/java/jdk1.6.0_45/jre/lib/resources.jar:/usr/java/jdk1.6.0_45/jre/lib/rt.jar:/usr/java/jdk1.6.0_45/jre/lib/sunrsasign.jar:/usr/java/jdk1.6.0_45/jre/lib/jsse.jar:/usr/java/jdk1.6.0_45/jre/lib/jce.jar:/usr/java/jdk1.6.0_45/jre/lib/charsets.jar:/usr/java/jdk1.6.0_45/jre/lib/modules/jdk.boot.jar:/usr/java/jdk1.6.0_45/jre/classes
java.vendor = Sun Microsystems Inc.
file.separator = /
java.vendor.url.bug =
sun.io.unicode.encoding = UnicodeLittle
sun.cpu.endian = little
sun.cpu.isalist =

VM Flags:

-Xmx768m -Djava.library.path=/usr/local/lib:/opt/local/lib:/usr/lib -Dstorm.home=/opt/storm-0.8.2

For reference, the usage printed by ‘jinfo -h’:

    jinfo [option] <pid>
        (to connect to running process)
    jinfo [option] <executable <core>
        (to connect to a core file)
    jinfo [option] [server_id@]<remote server IP or hostname>
        (to connect to remote debug server)

where <option> is one of:
    -flag <name>         to print the value of the named VM flag
    -flag [+|-]<name>    to enable or disable the named VM flag
    -flag <name>=<value> to set the named VM flag to the given value
    -flags               to print VM flags
    -sysprops            to print Java system properties
    <no option>          to print both of the above
    -h | -help           to print this help message