If you are not already familiar with collections in Cassandra
CQL, the following page provides excellent details about them:
However, if you have been using Cassandra since the pre-CQL days,
you will have worked with the low-level Thrift APIs, and you may be tempted
to think about what the data looks like in Cassandra’s internal storage structure [which
is very well articulated (exposed?) by the Thrift APIs]!
I have a big hangover from my extensive work with the Thrift API,
and hence I always get tempted to think about what my CQL data looks like in the internal
storage structure.
Following is a CQL column family
containing different types of collections (set, list and map), followed by
how the same data looks in the internal storage structure:
CQL:
cqlsh:dummy> describe table users;

CREATE TABLE users (
  user_id text,
  emails set<text>,
  first_name text,
  last_name text,
  numbers list<int>,
  todo map<timestamp, text>,
  top_places list<text>,
  PRIMARY KEY (user_id)
)

cqlsh:dummy> select * from users;

 user_id | emails                                 | first_name | last_name | numbers   | todo                                   | top_places             | numbermap
---------+----------------------------------------+------------+-----------+-----------+----------------------------------------+------------------------+----------------
 frodo   | {'baggins@gmail.com', 'f@baggins.com'} | Frodo      | Baggins   | [1, 2, 3] | {'2012-09-24 00:00:00-0700': 'value1'} | ['rivendell', 'rohan'] | {1: 11, 2: 12}
Internal Storage Structure: (CLI result)
RowKey: frodo
=> (name=, value=, timestamp=1411701396643000)
=> (name=emails:62616767696e7340676d61696c2e636f6d, value=, timestamp=1411701396643000)
=> (name=emails:664062616767696e732e636f6d, value=, timestamp=1411701396643000)
=> (name=first_name, value=46726f646f, timestamp=1411701396643000)
=> (name=last_name, value=42616767696e73, timestamp=1411701396643000)
=> (name=numbermap:00000001, value=0000000b, timestamp=1411703133238000)
=> (name=numbermap:00000002, value=0000000c, timestamp=1411703133238000)
=> (name=numbers:534eaca0452c11e4932561c636c97db3, value=00000001, timestamp=1411701740650000)
=> (name=numbers:534eaca1452c11e4932561c636c97db3, value=00000002, timestamp=1411701740650000)
=> (name=numbers:534eaca2452c11e4932561c636c97db3, value=00000003, timestamp=1411701740650000)
=> (name=todo:00000139f7134980, value=76616c756531, timestamp=1411702558812000)
=> (name=top_places:a3300bb0452c11e4932561c636c97db3, value=726976656e64656c6c, timestamp=1411701874667000)
=> (name=top_places:a3300bb1452c11e4932561c636c97db3, value=726f68616e, timestamp=1411701874667000)
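The hex strings in the CLI result are simply the serialized bytes of each name and value, so they can be decoded by hand. Below is a minimal Python sketch, assuming the usual encodings (UTF-8 for text, a 4-byte big-endian integer for int, and an 8-byte big-endian count of milliseconds since the Unix epoch for timestamp). The helper names are my own and are not part of any Cassandra API.

```python
import struct
import uuid
from datetime import datetime, timezone

# Offset between the UUID epoch (1582-10-15) and the Unix epoch, in 100 ns units.
UUID_EPOCH_OFFSET = 0x01B21DD213814000


def decode_text(hex_str):
    """Decode a hex-encoded UTF-8 string (CQL text)."""
    return bytes.fromhex(hex_str).decode("utf-8")


def decode_int(hex_str):
    """Decode a 4-byte big-endian signed integer (CQL int)."""
    return struct.unpack(">i", bytes.fromhex(hex_str))[0]


def decode_timestamp(hex_str):
    """Decode 8 big-endian bytes holding milliseconds since the Unix epoch."""
    millis = struct.unpack(">q", bytes.fromhex(hex_str))[0]
    return datetime.fromtimestamp(millis / 1000, tz=timezone.utc)


def timeuuid_micros(hex_str):
    """Return the time embedded in a version 1 UUID as Unix microseconds."""
    u = uuid.UUID(hex=hex_str)
    return (u.time - UUID_EPOCH_OFFSET) // 10


print(decode_text("46726f646f"))                              # Frodo
print(decode_text("62616767696e7340676d61696c2e636f6d"))      # baggins@gmail.com
print(decode_int("0000000b"))                                 # 11
print(decode_timestamp("00000139f7134980"))                   # 2012-09-24 07:00:00+00:00
print(timeuuid_micros("534eaca0452c11e4932561c636c97db3"))    # 1411701740650000
```

The decoded todo key is the same instant as 2012-09-24 00:00:00-0700 in the CQL result, and the time embedded in the first numbers column name equals the timestamp shown on that column in the CLI result.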
Some important points to note:
- The ‘set’ field (emails) does not have any column values in the CLI result. As a set is expected to store unique items, the elements themselves (hex-encoded, not hashed) are stored as part of the column names only!
- In contrast, as a ‘list’ field (numbers/top_places) may contain duplicate values, the actual value of each list element is stored in the column value and not in the column name, to avoid overwriting duplicate elements! The column name instead carries a time-based UUID generated for each element.
- ‘map’ field (numbermap/todo): the hex-encoded key is used in the column name and the hex-encoded value is used in the column value.
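The three layouts above can be modelled in a few lines of Python. This is only a simplified sketch of the idea, not Cassandra's actual code: the function names are hypothetical, the real composite column names are binary (the CLI merely renders them with a ':' separator), and sorting the set elements here is just to make the output reproducible.

```python
import struct
import uuid


def text(s):
    """Hex of the UTF-8 bytes, as the CLI shows text values."""
    return s.encode("utf-8").hex()


def int32(n):
    """Hex of a 4-byte big-endian integer, as the CLI shows int values."""
    return struct.pack(">i", n).hex()


def set_cells(column, elements, enc):
    # The element itself is the column name; the column value stays empty.
    # Adding the same element twice would produce the same name, so no duplicates.
    return [(f"{column}:{enc(e)}", "") for e in sorted(elements, key=enc)]


def list_cells(column, elements, enc):
    # Each element gets its own time-based UUID in the column name,
    # so equal elements still end up in distinct columns.
    return [(f"{column}:{uuid.uuid1().hex}", enc(e)) for e in elements]


def map_cells(column, mapping, enc_key, enc_val):
    # The map key goes into the column name, the map value into the column value.
    return [(f"{column}:{enc_key(k)}", enc_val(v)) for k, v in mapping.items()]


for name, value in (
    set_cells("emails", {"baggins@gmail.com", "f@baggins.com"}, text)
    + map_cells("numbermap", {1: 11, 2: 12}, int32, int32)
    + list_cells("numbers", [1, 1, 2], int32)
):
    print(f"=> (name={name}, value={value})")
```

Running this reproduces the emails and numbermap columns from the CLI result exactly; the numbers columns differ only in their UUIDs, which are generated fresh on every run.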