Friday, September 26, 2014

Collections in CQL3 - How they are stored


If you don’t already know about collections in Cassandra CQL, the following page provides excellent details:

However, if you have been using Cassandra since the pre-CQL days, you have probably worked with the low-level Thrift APIs, and you may be tempted to ask what the data looks like in Cassandra’s internal storage structure (which the Thrift APIs expose quite directly).
After my extensive work with the Thrift API, I always find myself wondering how my CQL data looks in that internal storage structure.

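The following is a CQL column family containing the different collection types (set, list and map), followed by how the same data looks in the internal storage structure. Note that the numbermap column appears in the query and CLI results but not in the describe output; judging by its 4-byte keys and values and its later write timestamp, it is presumably a map<int, int> that was added after the describe output was captured.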

CQL:

cqlsh:dummy> describe table users;

CREATE TABLE users (
  user_id text,
  emails set<text>,
  first_name text,
  last_name text,
  numbers list<int>,
  todo map<timestamp, text>,
  top_places list<text>,
  PRIMARY KEY (user_id)
)

cqlsh:dummy> select * from users;

 user_id | emails                                 | first_name | last_name | numbers   | todo                                   | top_places             | numbermap
---------+----------------------------------------+------------+-----------+-----------+----------------------------------------+------------------------+----------------
   frodo | {'baggins@gmail.com', 'f@baggins.com'} |      Frodo |   Baggins | [1, 2, 3] | {'2012-09-24 00:00:00-0700': 'value1'} | ['rivendell', 'rohan'] | {1: 11, 2: 12}

Internal Storage Structure: (CLI result)
RowKey: frodo
=> (name=, value=, timestamp=1411701396643000)
=> (name=emails:62616767696e7340676d61696c2e636f6d, value=, timestamp=1411701396643000)
=> (name=emails:664062616767696e732e636f6d, value=, timestamp=1411701396643000)
=> (name=first_name, value=46726f646f, timestamp=1411701396643000)
=> (name=last_name, value=42616767696e73, timestamp=1411701396643000)
=> (name=numbermap:00000001, value=0000000b, timestamp=1411703133238000)
=> (name=numbermap:00000002, value=0000000c, timestamp=1411703133238000)
=> (name=numbers:534eaca0452c11e4932561c636c97db3, value=00000001, timestamp=1411701740650000)
=> (name=numbers:534eaca1452c11e4932561c636c97db3, value=00000002, timestamp=1411701740650000)
=> (name=numbers:534eaca2452c11e4932561c636c97db3, value=00000003, timestamp=1411701740650000)
=> (name=todo:00000139f7134980, value=76616c756531, timestamp=1411702558812000)
=> (name=top_places:a3300bb0452c11e4932561c636c97db3, value=726976656e64656c6c, timestamp=1411701874667000)
=> (name=top_places:a3300bb1452c11e4932561c636c97db3, value=726f68616e, timestamp=1411701874667000)
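
The hex strings in the CLI result are easier to read once decoded. The following is a minimal Python sketch, not part of the original experiment, that decodes a few of them. The helper names are made up for illustration, and the byte layouts are assumptions inferred from the output above: text as UTF-8, int as 4-byte big-endian, timestamp as 8-byte big-endian milliseconds since the Unix epoch, and list positions as version-1 (time-based) UUIDs.

```python
import struct
import uuid
from datetime import datetime, timezone

def decode_text(hex_str):
    """Decode a hex-printed text value (assumed UTF-8 bytes)."""
    return bytes.fromhex(hex_str).decode("utf-8")

def decode_int(hex_str):
    """Decode a hex-printed CQL int (assumed 4-byte big-endian, signed)."""
    return struct.unpack(">i", bytes.fromhex(hex_str))[0]

def decode_timestamp(hex_str):
    """Decode a hex-printed CQL timestamp (assumed 8-byte big-endian millis)."""
    millis = struct.unpack(">q", bytes.fromhex(hex_str))[0]
    return datetime.fromtimestamp(millis / 1000, tz=timezone.utc)

def timeuuid_micros(hex_str):
    """Return the Unix time in microseconds embedded in a version-1 UUID."""
    u = uuid.UUID(hex=hex_str)
    # UUID v1 counts 100 ns intervals since 1582-10-15; shift to the Unix epoch.
    return (u.time - 0x01B21DD213814000) // 10

print(decode_text("62616767696e7340676d61696c2e636f6d"))    # element of the emails set
print(decode_int("0000000b"))                                # value in numbermap
print(decode_timestamp("00000139f7134980"))                  # key in the todo map
print(timeuuid_micros("534eaca0452c11e4932561c636c97db3"))   # first position in numbers
```

This prints baggins@gmail.com, 11, 2012-09-24 07:00:00+00:00 (midnight at -0700) and 1411701740650000. The last number equals the write timestamp shown for the numbers columns, which suggests the list position UUID was generated at write time.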

Some important points to note:
-          The ‘set’ field (emails) has no column values in the CLI result. Because a set stores unique items, each element (its raw bytes, shown in hex, not a hash) is stored as part of the column name, and the column value is left empty!
-          In contrast, a ‘list’ field (numbers/top_places) may hold duplicate values, so each element is stored in the column value. The column name carries a unique time-based UUID instead, so duplicate elements do not overwrite each other!
-          ‘map’ field (numbermap/todo): the hex-encoded key is part of the column name, and the hex-encoded value is stored in the column value.
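
The three layouts above can be imitated with a toy model. This is only a sketch of what the CLI displays, assuming text set elements and int list/map entries as in the example row; it is not Cassandra's actual serialization code, and all function names are hypothetical.

```python
import struct
import uuid

def text_hex(s):
    """Hex of the UTF-8 bytes of a text value."""
    return s.encode("utf-8").hex()

def int_hex(n):
    """Hex of a 4-byte big-endian signed int."""
    return struct.pack(">i", n).hex()

def set_cells(column, elements):
    # Set: the element lives in the cell name and the value stays empty,
    # so writing the same element twice lands on the same cell.
    return [(f"{column}:{text_hex(e)}", "") for e in sorted(set(elements))]

def list_cells(column, elements):
    # List: a fresh time-based UUID per element keeps the names unique,
    # so duplicate elements get separate cells.
    return [(f"{column}:{uuid.uuid1().hex}", int_hex(e)) for e in elements]

def map_cells(column, mapping):
    # Map: key in the cell name, value in the cell value.
    return [(f"{column}:{int_hex(k)}", int_hex(v)) for k, v in sorted(mapping.items())]

for name, value in set_cells("emails", ["baggins@gmail.com", "f@baggins.com"]):
    print(name, value)
for name, value in list_cells("numbers", [1, 1]):
    print(name, value)
for name, value in map_cells("numbermap", {1: 11, 2: 12}):
    print(name, value)
```

The set and map lines reproduce the emails and numbermap column names from the CLI result. The list lines get new UUIDs on every run, but the duplicate value 1 still occupies two distinct cells.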
