Enumeration
Enumeration is a way to encode symbols so that they take less storage space.
For example, given this list of symbols
`AAPL`GOOG`AAPL`MSFT`GOOG`AAPL`GOOG`AAPL`MSFT`GOOG
The actual domain is just three symbols
`AAPL`GOOG`MSFT
Without enumeration, each symbol takes 8 bytes, and the list above will take 80 bytes.
With enumeration, we can encode the list as a list of numeric values. Note that 1 byte can represent 256
different values, which is far more than what we need to encode 3 symbols, so we can represent everything in our
domain with only 1 byte. The list will only take 10 bytes with enumeration.
There are two ways to create enumeration in KDB:
- Enumerate
$
With this method, all values to be enumerated must be in the domain. If there is an unseen new value,
the enumeration will fail with a cast error. Therefore, it is best to use this method when the domain is
fixed.
This method does not preserve attributes on the list.
- Enumerate extend
?
The difference is that with this method, if we encounter a new value, it will append to the domain
automatically.
This method preserves attributes on the list.
To illustrate the two methods:
q) stocks:`AAPL`GOOG`AAPL`MSFT`GOOG`AAPL`GOOG`AAPL`MSFT`GOOG
q) domainStocks: distinct stocks
q) show enumStocks:domainStocks$stocks
`s$`AAPL`GOOG`AAPL`MSFT`GOOG`AAPL`GOOG`AAPL`MSFT`GOOG
// alternatively, use ? and provide an empty domain, then let it update itself
// the backtick before the domain ` is essential
q) `createDomain?`stocks
`s$`AAPL`GOOG`AAPL`MSFT`GOOG`AAPL`GOOG`AAPL`MSFT`GOOG
q) createDomain
`AAPL`GOOG`MSFT
Several things to note here:
- The syntax is always
domain$list or domain?list for both methods.
domainStocks
This is the domain of the list. Usually we don't want the domain to have duplicates.
If you update the domain, your enumerated list will also be updated.
enumStocks
The `s$ in the output means the list is enumerated.
Essentially, an enumerated list is also a list of symbols, so anything you can apply to a regular list
of symbols also applies here.
Two ways to unenumerate a list:
get or
value
q) enumStocks
`s$`AAPL`GOOG`AAPL`MSFT`GOOG`AAPL`GOOG`AAPL`MSFT`GOOG
q) unenumeratedStocks: get enumStocks
q) unenumeratedStocks2: value enumStocks
Advanced enumeration:
The sym file in a database is essentially all symbols enumerated (not just from sym column, but all symbol
columns). There are helper functions in the
.Q namespace to deal with this file, such as
.Q.en.
Database
// taken from https://code.kx.com/q/database/segment/ with some minor edits
DISK 0 DISK 1 DISK 2
db db db
├── sym ├── sym ├── sym
└── par.txt ├── 2020.10.03 ├── 2020.10.04
│ ├── quotes │ ├── quotes
│ │ ├── .d │ │ ├── .d
│ │ ├── sym │ │ ├── sym
│ │ ├── price │ │ ├── price
│ │ └── time │ │ └── time
│ └── trades │ └── trades
│ ├── .d │ ├── .d
│ ├── sym │ ├── sym
│ ├── price │ ├── price
│ ├── time │ ├── time
│ └── vol │ └── vol
├── 2020.10.05 ├── 2020.10.06
│ ├── quotes │ ├── quotes
Overview
The image above demonstrates KDB architecture.
If you look at the entire structure, it is a segmented database, suitable for scalability and parallel access.
If you only look at Disk 1, it is a partitioned dtabase, suitable for more than 100 million rows, or steadily
growing data. But if the database grows large enough, since all partitions share I/O, it can become a
bottleneck, and is better to use a segmented database.
If you only look at one table such as
quotes, it is a splayed table, stores up to 100 million rows.
The smallest unit is a flat table, which stores up to a few million rows.
Some important files you need to pay attention to:
par.txt - used for segmented database, contains location of each partition
If you read this txt file, you will see something like this:
/disk1/db1
/disk2/db2
sym - this file appears at two levels, both in the form of a flat file, but serve different
purposes.
- at the database level, it contains the enumeration domain
of all the symbols (not just those in sym columns, but all columns with type symbol) in the
database.
- at the table level, it is the same as any other columns in this table, stored as a flat file. It
contains the actual symbols in the sym column.
.d - contains a list of symbols that records the column order in each table.
Flat tables
For flat tables, see
here.
Splayed tables
Saving splayed tables
Saving a splayed table without symbols is just very slightly different from saving a flat table.
// just make sure the path ends with a slash
q) `:db/tbl/ set tbl
But if the table contains symbols, we need to enumerate, otherwise there will be a type error.
Two ways to do it:
q) `:db/tbl/ set .Q.en[`:db/; tbl]
q) `:db/tbl/ set .Q.ens[`:db/; tbl; symFile]
The only difference is that
.Q.ens let you use a custom symbol file instead of the default one
located at the root level of the database.
Loading splayed tables
Two ways to load splayed tables: either use
get or just
\l the table.
What's happening under the hood isn't really "loading" the data into memory, but "mapping" the data on disk into
memory, meaning we only loaded in the "structure" of the table and where to find the data on disk.
q) tbl: get `:db/tbl
q) \l db/tbl
Querying splayed tables
If you only want to select or sort the table, you don't even have to load it. You can directly operate on the
splayed table as if it is in memory. But you won't be able to exec or update it.
q) select from `:db/tbl where col1=1 // this is okay
q) delete col2 from `:db/tbl // this will error
Updating splayed tables
To update a splayed table, you need to consider three things:
- the actual update similar to what you'd do for in-memory tables, like
upsert or
delete
- enumeration if you touch any symbols
- fixing the .d file if you touch any columns
Here are some examples:
// only update rows but with symbols
q) `:db/tbl/ upsert .Q.en[`:db/tbl;]([]sym:`AAPL`GOOG;col1:100 200)
// add a new column
q) `:db/tbl/newCol set 100 200 300 // after this step, you won't be able to see the column yet
q) `:db/tbl/.d set get[`:db/tbl/.d],`newCol // now you can see the column
// delete a column
q) system "rm db/tbl/newCol" // again, after this step, it's not done yet
q) `:db/tbl/.d set get[`:db/tbl/.d] except `newCol // now it's done
Partitioned tables
A partitioned database is essentially a collection of splayed tables.
They can be partitioned by dates, or other stuff like months. But one database can only be partitioned by one
thing.
Once you decide how to partition it, KDB will add a virtual column corresponding to the data type you choose.
For example, if you choose to partition by date, it will add a virtual date column.
Saving partitioned tables
Similar to what we do with splayed tables, but requires one extra step to find the actual path based on the
date.
q) enumTable: .Q.en[`:db/; tbl]
q) dt: first exec distinct date from tbl // e.g, 2020.01.01
q) path: ` sv `:db/,(`$string dt),`tbl` // get the path `:db/2020.01.01/tbl/
q) path set select from enumTable where date=dt
Above only saves one date. To partition the entire table, you can do something like this:
saveTableParted:{[dir;t;tableName]
dts: exec distinct date from t;
enumTable: .Q.en[dir; t];
{[dir;tab;tableName;dt]
toSave: select from enumTable where date=dt;
path: ` sv dir, (`$string dt), tableName,`;
path set toSave;
}[dir;enumTable;tableName] each dts
}
Loading, querying and updating partitioned tables
Loading a partitioned table is the same as loading a splayed table.
Similar to splayed table, you can't
exec or update a partitioned table.
To update a partitioned table is similar to updating a splayed table. You need to consider the same three
things.
Sorting large partitions
For large partitioned table, it might not be a good idea to sort the entire table.
Instead we sort one column and then use the sorted index to sort other columns.
// function to perform sorting
indexSort:{[par;col;ordering]
@[hsym par; col; @[;ordering]];
// how to understand this function: say ordering is [2 0 3 1], and col is `quantity (20 40 10 30)
// @[;ordering] is a partial function waiting for a list input
// so @[;ordering] on `quantity is the same as (20 40 10 30)[2 0 3 1], which will return 10 20 30 40
// then we use @[hsym par; col; ...] to write this new value back to the column on disk
}
// using the function
q) sortedIndex: iasc get `:/db/2020.01.01/trades/time // get the sorted index
q) indexSort[`:db/2020.01.01/trades; sortedIndex] each colsOnDisk // sort each column using the sorted index
q) @[partition; `time; `s#] // apply sorted attribute to the time column
Useful utility functions
There are many useful utility functions in the
.Q namespace for working with partitioned tables.
Another very handy thing is
dbmaint.
This script is VERY, VERY useful.
Segmented database
Similar to what we did for partitioned tables to find correct path of a partition, we need to do one extra step
for segmented database to find the correct path of the disk mount.
Saving to segmented database
The idea is that, data should be distributed as evenly across all mounts as possible to ensure best performance.
One vanilla way to do this is to use the mod operator to determine which mount to use.
Using an example to illustrate: if we want to find out which disk does 2016.01.01 go to, and we have 4 mounts,
we do
2016.01.01 mod 4.
This gives us 0 so this partition should go to the first mount.
In general, to save to a segmented database, we do something like this:
saveSegDB:{[dt; tabName]
mounts: read0 `:par.txt; // read all mounts
numMounts: count mounts;
allocateTo: mounts dt mod numMounts; // determine which mount to use, e.g., `:/disk1/db1
path: ` sv hsym[`$allocateTo], (`$string dt), tabName; // construct the path, e.g., `:/disk1/db1/2016.01.01/trades/
path set tabName;
}
If you update a segmented database with a new mount, you need to update
par.txt manually.
Loading from segmented database
For segmented database, you almost never load the entire database into memory.
All you need is actually just the
par.txt file which contains the location of all partitions.
This can be done by simply loading the root of the database, e.g.,
\l segDBRoot
And you will be able to query the database as if it is in memory.
Useful utility functions