KX class

Enumeration

Enumeration is a way to encode symbols so that they take less storage space.
For example, given this list of symbols `AAPL`GOOG`AAPL`MSFT`GOOG`AAPL`GOOG`AAPL`MSFT`GOOG
The actual domain is just three symbols `AAPL`GOOG`MSFT
Without enumeration, each symbol takes 8 bytes, and the list above will take 80 bytes.
With enumeration, we can encode the list as a list of numeric values. Note that 1 byte can represent 256 different values, which is far more than what we need to encode 3 symbols, so we can represent everything in our domain with only 1 byte. The list will only take 10 bytes with enumeration.

There are two ways to create enumeration in KDB:

Enumerate $
With this method, all values to be enumerated must be in the domain. If there is an unseen new value, the enumeration will fail with a cast error. Therefore, it is best to use this method when the domain is fixed.
This method does not preserve attributes on the list.
Enumerate extend ?
The difference is that with this method, if we encounter a new value, it will append to the domain automatically.
This method preserves attributes on the list.

To illustrate the two methods:


            q) stocks:`AAPL`GOOG`AAPL`MSFT`GOOG`AAPL`GOOG`AAPL`MSFT`GOOG
            q) domainStocks: distinct stocks
            q) show enumStocks:domainStocks$stocks
            `s$`AAPL`GOOG`AAPL`MSFT`GOOG`AAPL`GOOG`AAPL`MSFT`GOOG

            // alternatively, use ? and provide an empty domain, then let it update itself
            // the backtick before the domain ` is essential
            q) `createDomain?`stocks
            `s$`AAPL`GOOG`AAPL`MSFT`GOOG`AAPL`GOOG`AAPL`MSFT`GOOG
            q) createDomain
            `AAPL`GOOG`MSFT

Several things to note here:

The syntax is always domain$list or domain?list for both methods.
domainStocks
This is the domain of the list. Usually we don't want the domain to have duplicates.
If you update the domain, your enumerated list will also be updated.
enumStocks
The `s$ in the output means the list is enumerated.
Essentially, an enumerated list is also a list of symbols, so anything you can apply to a regular list of symbols also applies here.

Two ways to unenumerate a list: get or value


            q) enumStocks
            `s$`AAPL`GOOG`AAPL`MSFT`GOOG`AAPL`GOOG`AAPL`MSFT`GOOG
            q) unenumeratedStocks: get enumStocks
            q) unenumeratedStocks2: value enumStocks

Advanced enumeration:
The sym file in a database is essentially all symbols enumerated (not just from sym column, but all symbol columns). There are helper functions in the .Q namespace to deal with this file, such as .Q.en.

Database


            // taken from https://code.kx.com/q/database/segment/ with some minor edits
            DISK 0             DISK 1                    DISK 2  
            db                 db                        db             
            ├── sym            ├── sym                   ├── sym
            └── par.txt        ├── 2020.10.03            ├── 2020.10.04                         
                               │   ├── quotes            │   ├── quotes                         
                               │   │   ├── .d            │   │   ├── .d
                               │   │   ├── sym           │   │   ├── sym                          
                               │   │   ├── price         │   │   ├── price                            
                               │   │   └── time          │   │   └── time                           
                               │   └── trades            │   └── trades                         
                               │       ├── .d            │       ├── .d
                               │       ├── sym           │       ├── sym                          
                               │       ├── price         │       ├── price                            
                               │       ├── time          │       ├── time                           
                               │       └── vol           │       └── vol                          
                               ├── 2020.10.05            ├── 2020.10.06                         
                               │   ├── quotes            │   ├── quotes

Overview

The image above demonstrates KDB architecture.
If you look at the entire structure, it is a segmented database, suitable for scalability and parallel access.
If you only look at Disk 1, it is a partitioned dtabase, suitable for more than 100 million rows, or steadily growing data. But if the database grows large enough, since all partitions share I/O, it can become a bottleneck, and is better to use a segmented database.
If you only look at one table such as quotes, it is a splayed table, stores up to 100 million rows.
The smallest unit is a flat table, which stores up to a few million rows.

Some important files you need to pay attention to:

par.txt - used for segmented database, contains location of each partition
If you read this txt file, you will see something like this:
/disk1/db1
/disk2/db2
sym - this file appears at two levels, both in the form of a flat file, but serve different purposes.
- at the database level, it contains the enumeration domain of all the symbols (not just those in sym columns, but all columns with type symbol) in the database.
- at the table level, it is the same as any other columns in this table, stored as a flat file. It contains the actual symbols in the sym column.
.d - contains a list of symbols that records the column order in each table.

Flat tables

For flat tables, see here.

Splayed tables

Saving splayed tables

Saving a splayed table without symbols is just very slightly different from saving a flat table.


            // just make sure the path ends with a slash
            q) `:db/tbl/ set tbl

But if the table contains symbols, we need to enumerate, otherwise there will be a type error.
Two ways to do it:


            q) `:db/tbl/ set .Q.en[`:db/; tbl]
            q) `:db/tbl/ set .Q.ens[`:db/; tbl; symFile]

The only difference is that .Q.ens let you use a custom symbol file instead of the default one located at the root level of the database.

Loading splayed tables

Two ways to load splayed tables: either use get or just \l the table.
What's happening under the hood isn't really "loading" the data into memory, but "mapping" the data on disk into memory, meaning we only loaded in the "structure" of the table and where to find the data on disk.


            q) tbl: get `:db/tbl
            q) \l db/tbl

Querying splayed tables

If you only want to select or sort the table, you don't even have to load it. You can directly operate on the splayed table as if it is in memory. But you won't be able to exec or update it.


            q) select from `:db/tbl where col1=1        // this is okay
            q) delete col2 from `:db/tbl                // this will error

Updating splayed tables

To update a splayed table, you need to consider three things:

the actual update similar to what you'd do for in-memory tables, like upsert or delete
enumeration if you touch any symbols
fixing the .d file if you touch any columns

Here are some examples:


            // only update rows but with symbols
            q) `:db/tbl/ upsert .Q.en[`:db/tbl;]([]sym:`AAPL`GOOG;col1:100 200)
            
            // add a new column
            q) `:db/tbl/newCol set 100 200 300                  // after this step, you won't be able to see the column yet
            q) `:db/tbl/.d set get[`:db/tbl/.d],`newCol         // now you can see the column
            // delete a column
            q) system "rm db/tbl/newCol"                        // again, after this step, it's not done yet
            q) `:db/tbl/.d set get[`:db/tbl/.d] except `newCol  // now it's done

Partitioned tables

A partitioned database is essentially a collection of splayed tables.
They can be partitioned by dates, or other stuff like months. But one database can only be partitioned by one thing.
Once you decide how to partition it, KDB will add a virtual column corresponding to the data type you choose. For example, if you choose to partition by date, it will add a virtual date column.

Saving partitioned tables

Similar to what we do with splayed tables, but requires one extra step to find the actual path based on the date.


            q) enumTable: .Q.en[`:db/; tbl]
            q) dt: first exec distinct date from tbl            // e.g, 2020.01.01
            q) path: ` sv `:db/,(`$string dt),`tbl`             // get the path `:db/2020.01.01/tbl/
            q) path set select from enumTable where date=dt

Above only saves one date. To partition the entire table, you can do something like this:


            saveTableParted:{[dir;t;tableName]
                dts: exec distinct date from t;
                enumTable: .Q.en[dir; t];
                {[dir;tab;tableName;dt]
                    toSave: select from enumTable where date=dt;
                    path: ` sv dir, (`$string dt), tableName,`;
                    path set toSave;
                }[dir;enumTable;tableName] each dts
            }

Loading, querying and updating partitioned tables

Loading a partitioned table is the same as loading a splayed table.
Similar to splayed table, you can't exec or update a partitioned table.
To update a partitioned table is similar to updating a splayed table. You need to consider the same three things.

Sorting large partitions

For large partitioned table, it might not be a good idea to sort the entire table. Instead we sort one column and then use the sorted index to sort other columns.


            // function to perform sorting
            indexSort:{[par;col;ordering]
                @[hsym par; col; @[;ordering]];
                // how to understand this function: say ordering is [2 0 3 1], and col is `quantity (20 40 10 30)
                // @[;ordering] is a partial function waiting for a list input
                // so @[;ordering] on `quantity is the same as (20 40 10 30)[2 0 3 1], which will return 10 20 30 40
                // then we use @[hsym par; col; ...] to write this new value back to the column on disk
            }

            // using the function
            q) sortedIndex: iasc get `:/db/2020.01.01/trades/time                   // get the sorted index
            q) indexSort[`:db/2020.01.01/trades; sortedIndex] each colsOnDisk       // sort each column using the sorted index
            q) @[partition; `time; `s#]                                             // apply sorted attribute to the time column

Useful utility functions

There are many useful utility functions in the .Q namespace for working with partitioned tables.

.Q.chk - fill missing tables
.Q.dpft - partition a table
This is basically what saveTableParted does above.
The usage is .Q.dpft[dir;date;keyCol;tableName].
e.g. .Q.dpft[`:db/;2020.01.01;`sym;`trades]

.Q.view - limit how much data the user can access without specifying a where clause

q) .Q.view 5 sublist date                   // only allow access to the first 5 dates
q) select count i by date from trades       // this will only count the first 5 dates even without a where clause
q) .Q.view[]                                // remove the setting

.Q.pt - returns a list of partitioned tables, e.g., trades, quotes
.Q.pv - returns a lsit of partition values, e.g., 2020.01.01, 2020.01.02
.Q.pf - returns the partitioned field, e.g., date
.Q.pd - returns the path of all partitions, e.g., `:db/2020.01.01, `:db/2020.01.02
.Q.cn - returns the count of a partitioned table for each date
.Q.pn - similar to .Q.cn, but returns the count of all partitioned tables

Another very handy thing is dbmaint. This script is VERY, VERY useful.

Segmented database

Similar to what we did for partitioned tables to find correct path of a partition, we need to do one extra step for segmented database to find the correct path of the disk mount.

Saving to segmented database

The idea is that, data should be distributed as evenly across all mounts as possible to ensure best performance.
One vanilla way to do this is to use the mod operator to determine which mount to use.
Using an example to illustrate: if we want to find out which disk does 2016.01.01 go to, and we have 4 mounts, we do 2016.01.01 mod 4.
This gives us 0 so this partition should go to the first mount.
In general, to save to a segmented database, we do something like this:


            saveSegDB:{[dt; tabName]
                mounts: read0 `:par.txt;                                    // read all mounts
                numMounts: count mounts;
                allocateTo: mounts dt mod numMounts;                        // determine which mount to use, e.g., `:/disk1/db1
                path: ` sv hsym[`$allocateTo], (`$string dt), tabName;      // construct the path, e.g., `:/disk1/db1/2016.01.01/trades/
                path set tabName;
            }

If you update a segmented database with a new mount, you need to update par.txt manually.

Loading from segmented database

For segmented database, you almost never load the entire database into memory.
All you need is actually just the par.txt file which contains the location of all partitions.
This can be done by simply loading the root of the database, e.g., \l segDBRoot
And you will be able to query the database as if it is in memory.

Useful utility functions

.Q.P and .Q.PD - returns a list of segments, e.g., `:/disk1/db1, `:/disk2/db2.
This is equivalent to reading par.txt
.Q.D and .Q.PV - returns a list of partitions in each segment, e.g., 2016.01.01, 2016.01.02
So if you want to have a dictionary of all partitions in each segment, you can do .Q.P!.Q.D and you'll get something like:
```
`:/disk1/db1 | 2016.01.01 2016.01.02
`:/disk2/db2 | 2016.01.03 2016.01.04
```
You can also use them to get full paths of all partitions:
` sv 'raze .Q.P (,/:)'`$string .Q.D
And you'll get something like :/disk1/db1/2016.01.01, :/disk1/db1/2016.01.02, :/disk2/db2/2016.01.03, :/disk2/db2/2016.01.04
.Q.par - returns the path of the partition
The usage is .Q.par[dir;date;table].
e.g. .Q.par[`:.;2020.01.01;`trade] will return `:/disk1/db1/2020.01.01

Database
Management

Enumeration

Foreign Keys

Compression

Database

Overview

Flat tables

Splayed tables

Saving splayed tables

Loading splayed tables

Querying splayed tables

Updating splayed tables

Partitioned tables

Saving partitioned tables

Loading, querying and updating partitioned tables

Sorting large partitions

Useful utility functions

Segmented database

Saving to segmented database

Loading from segmented database

Useful utility functions